SAN FRANCISCO: Last July, Google made an eight-word change to its privacy policy that represented a significant step in its race to build the next generation of artificial intelligence.
Buried thousands of words into its document, Google tweaked the phrasing for how it used data for its products, adding that public information could be used to train its AI chatbot and other services.
The subtle change was not unique to Google. As companies look to train their AI models on data that is protected by privacy laws, they’re carefully rewriting their terms and conditions to include words like “artificial intelligence”, “machine learning” and “generative AI”.
Some changes to terms of service are as small as a few words. Others include the addition of entire sections to explain how generative AI models work, and the types of access they have to user data. Snap, for instance, warned its users not to share confidential information with its AI chatbot because it would be used in its training, and Meta alerted users in Europe that public posts on Facebook and Instagram would soon be used to train its large language model.
Those terms and conditions – which many people have long ignored – are now being contested by some users, among them writers, illustrators and visual artists, who worry that their work is being used to train the products that threaten to replace them.
“We’re being destroyed already left, right and center by inferior content that is basically trained on our stuff, and now we’re being discarded,” said Sasha Yanshin, a YouTube personality and co-founder of a travel recommendation site.
This month, Yanshin canceled his Adobe subscription over a change to its privacy policy. “The hardware store that sells you a paintbrush doesn’t get to own the painting that you make with it, right?” he said.
To train generative AI, tech companies can draw from two pools of data – public and private. Public data is available on the web for anyone to see, while private data includes things like text messages, emails and social media posts made from private accounts.
Public data is a finite resource, and a number of companies are only a few years away from using all of it for their AI systems. But tech giants like Meta and Google are sitting on a trove of private data that could be 10 times the size of its public counterpart, said Tamay Besiroglu, an associate director at Epoch, an AI research institute.
That data could amount to “a substantial advantage” in the AI race, Besiroglu said. The problem is gaining access to it. Private data is mostly protected by a patchwork of federal and state privacy laws that give users some sort of licensing over the content they create online, and companies can’t use it for their own products without consent.
In February, the Federal Trade Commission warned tech companies that changing privacy policies to retroactively scrape old data could be “unfair or deceptive.”
AI training could eventually use the most personal kinds of data, like messages to friends and family. A Google spokesperson said a small test group of users, with permission, had allowed Google to train its AI on some aspects of their personal emails.
Some companies have struggled to balance their hunger for new data with users’ privacy concerns. In June, Adobe faced backlash on social media after it changed its privacy policy to include a phrase about automation that many of its customers interpreted as having to do with AI scraping.
The company explained the changes with a pair of blog posts, saying customers had misunderstood them. On June 18, Adobe added explanations to the top of some sections of its terms and conditions.
“We’ve never trained generative AI on customer content, taken ownership of a customer’s work or allowed access to customer content beyond legal requirements,” Dana Rao, Adobe’s general counsel and its chief trust officer, said in a statement.
This year, Snap updated its privacy policy to address data collected by My AI, its conversational chatbot.
A Snap spokesperson said the company gave “upfront notices” about how it used data to train its AI with the opt-in of its users.
In September, the social platform X added a single sentence to its privacy policy about machine learning and AI. The company did not return a request for comment.
Last month, Meta alerted its Facebook and Instagram users in Europe that it would use publicly available posts to train its AI starting Wednesday, prompting some backlash. It later paused the plans after the European Center for Digital Rights brought complaints against the company in 11 European countries.
In the United States, where privacy laws are less strict, Meta has been able to use public social media posts to train its AI without such an alert. The company announced in September that the new version of its large language model was trained on user data that its previous iteration had not been trained on.
Meta has said its AI did not read messages sent between friends and family on apps like Messenger and WhatsApp unless a user tagged its AI chatbot in a message.
“Using publicly available information to train AI models is an industrywide practice and not unique to our services,” a Meta spokesperson said in a statement.
Many companies are also adding language to their terms of use that protects their content from being scraped to train competing AI.
Yanshin said that he hoped regulators could act fast in creating protections for small businesses like his against AI companies, and that traffic to his travel website had fallen 95% since it began competing with AI aggregators.
“People are going to sit around debating the pros and cons of stealing data because it makes a nice chatbot,” he said. “In three, four, five years’ time, there might not be entire segments of this creative industry because we’ll just be decimated.” – The New York Times