The Internet is becoming awash in words and images generated by artificial intelligence.
Sam Altman, OpenAI’s CEO, wrote in February that the company generated about 100 billion words per day – 1 million novels’ worth of text, every day, an unknown share of which finds its way onto the Internet.
AI-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a website that tracks online misinformation, recently identified over a thousand websites that churn out error-prone AI-generated news articles.
In reality, with no foolproof method for detecting this kind of content, much of it simply goes undetected.
All this AI-generated information can make it harder for us to know what’s real. And it also poses a problem for AI companies as they develop more and more powerful models, which involve digesting volumes of data so vast that they are approaching the size of the Internet itself.
As these companies trawl the web for new data to train their next models on – an increasingly challenging task – they’re likely to ingest some of their own AI-generated content, creating an unintentional feedback loop in which what was once the output from one AI becomes the input for another.
In the long run, this cycle may pose a threat to AI itself. Research has shown that when generative AI is trained on a lot of its own output, it can get a lot worse.
Imagine a chatbot offering medical advice that is trained on medical conditions that were “hallucinated” by a previous AI – or one that offers legal advice trained on fictitious rulings that it encountered online. While this is a simplified example, it illustrates a problem on the horizon.
Just as a copy of a copy can drift away from the original, when generative AI is trained on its own content, its output can also drift away from reality, growing further apart from the original data that it was intended to imitate.
In a paper published in July in the journal Nature, a group of researchers in Britain showed how this process results in a narrower range of AI output over time – an early stage of what they called “model collapse.”
If only some of the training data were AI-generated, the decline would be slower or more subtle. But it would still occur, researchers say, unless the synthetic data were complemented with a lot of new, real data.
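The dynamic is easy to reproduce in miniature. The sketch below is a toy illustration, not the Nature study’s actual setup: the “model” simply fits a bell curve to its training data and produces the next generation’s training data by sampling from itself. Left to feed on its own output, the fitted distribution tends to drift and narrow; mixing in a steady share of real data keeps it anchored.

```python
# Toy illustration of model collapse (not the Nature study's actual setup):
# the "model" fits a Gaussian to its training data, then generates the next
# generation's training data by sampling from itself.
import numpy as np

rng = np.random.default_rng(0)
REAL_MEAN, REAL_STD = 0.0, 1.0      # the "real world" distribution (assumed for this demo)
N = 100                             # synthetic samples per generation

def fit(samples):
    """Stand-in for training: estimate a mean and spread from the data."""
    return samples.mean(), samples.std()

def run(generations=200, real_fraction=0.0):
    data = rng.normal(REAL_MEAN, REAL_STD, N)       # generation 0: real data only
    for _ in range(generations):
        mu, sigma = fit(data)
        synthetic = rng.normal(mu, sigma, N)        # the model's own output
        fresh = rng.normal(REAL_MEAN, REAL_STD, int(N * real_fraction))
        data = np.concatenate([synthetic, fresh])   # next generation's training set
    return fit(data)

# Trained only on its own output, the mean tends to wander and the spread to
# shrink; a steady 20 percent infusion of real data keeps both near (0.0, 1.0).
print("all synthetic:", run(real_fraction=0.0))
print("20% real data:", run(real_fraction=0.2))
```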
This problem is not confined to text. A separate team of researchers, at Rice University, studied what happens when the kinds of AI that generate images are repeatedly trained on their own output – a problem that could already be occurring as AI-generated images flood the web.
They found that glitches and image artifacts started to build up in the AI’s output, eventually producing distorted images with wrinkled patterns and mangled fingers.
“You’re kind of drifting into parts of the space that are like a no-fly zone,” said Richard Baraniuk, a professor at Rice who led the research on AI image models.
The researchers found that the only way to stave off this problem was to ensure that the AI was also trained on a sufficient supply of new, real data.
While selfies are certainly not in short supply on the Internet, there could be categories of images where AI output outnumbers genuine data, they said.
For example, AI-generated images in the style of Vincent van Gogh could outnumber actual photographs of van Gogh paintings in AI’s training data, and this may lead to errors and distortions in future AI output in that style. (The researchers said these problems are hard to identify early on, in part because the leading AI models are closed to outside scrutiny.)
All of these problems arise because AI-generated data is a poor substitute for the real thing.
This is sometimes easy to see, like when chatbots state absurd facts or when AI-generated hands have too many fingers. But the differences that lead to model collapse aren’t necessarily obvious – and they can be difficult to detect.
Why it matters
This doesn’t mean generative AI will grind to a halt anytime soon.
The companies that make these tools are aware of these problems, and they will notice if their AI systems start to deteriorate in quality.
But it may slow things down. As existing sources of data dry up or become contaminated with AI “slop,” researchers say, it will become harder for newcomers to compete.
AI-generated words and images are already beginning to flood social media and the wider web. They are even hiding in some of the data sets used to train AI, the Rice researchers found.
“The web is becoming increasingly a dangerous place to look for your data,” said Sina Alemohammad, a graduate student at Rice who studied how AI contamination affects image models.
Big players will be affected, too. Computer scientists at New York University found that when there is a lot of AI-generated content in the training data, it takes more computing power to train AI – which translates into more energy and more money.
“Models won’t scale anymore as they should be scaling,” said Julia Kempe, the NYU professor who led the work.
Ways out
Perhaps the biggest takeaway of this research is that high-quality, diverse data is valuable and hard for computers to emulate.
One solution, then, is for AI companies to pay for the data instead of scooping it up from the Internet, ensuring both human origin and high quality.
OpenAI and Google have made deals with some publishers or websites to use their data to improve AI. (The New York Times sued OpenAI and Microsoft last year, alleging copyright infringement. OpenAI and Microsoft say their use of the content is considered fair use under copyright law.)
Google and OpenAI are working on AI “watermarking” tools, which introduce hidden patterns that can be used to identify AI-generated images and text.
But watermarking text is challenging, researchers say, because the watermarks can’t always be reliably detected and can easily be subverted (they may not survive being translated into another language, for example).
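To make the idea concrete, here is a minimal sketch of one published style of text watermark – a “green list” scheme in the spirit of academic proposals, not the actual systems Google or OpenAI are building. A hash of the previous word secretly splits the vocabulary in half, the generator quietly favors one half, and the detector counts how often that bias shows up; everything here (the toy vocabulary, the 90 percent bias) is invented for illustration.

```python
# Minimal sketch of a "green list" text watermark (an academic-style scheme,
# not Google's or OpenAI's actual tools). A hash of the previous word splits
# the vocabulary into a "green" half and a "red" half; the watermarking
# generator prefers green words, and the detector counts how often that happens.
import hashlib
import random

VOCAB = [f"word{i}" for i in range(1000)]    # toy vocabulary, invented for illustration

def green_set(prev_word):
    """Deterministically pick half the vocabulary as 'green', seeded by the previous word."""
    seed = int.from_bytes(hashlib.sha256(prev_word.encode()).digest()[:8], "big")
    shuffled = VOCAB[:]
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: len(VOCAB) // 2])

def generate(n_words, watermark, rng):
    """Dummy 'language model': uniform word choice, nudged toward green words if watermarking."""
    words = [rng.choice(VOCAB)]
    for _ in range(n_words - 1):
        green = green_set(words[-1])
        if watermark and rng.random() < 0.9:     # 90% of the time, pick a green word
            words.append(rng.choice(sorted(green)))
        else:
            words.append(rng.choice(VOCAB))
    return words

def z_score(words):
    """How far the observed count of green words sits above the ~50% expected by chance."""
    hits = sum(word in green_set(prev) for prev, word in zip(words, words[1:]))
    n = len(words) - 1
    return (hits - 0.5 * n) / (0.25 * n) ** 0.5

rng = random.Random(42)
print("watermarked:", z_score(generate(200, watermark=True, rng=rng)))    # large positive score
print("plain text: ", z_score(generate(200, watermark=False, rng=rng)))   # near zero (chance level)
# Paraphrasing or translating the text re-chooses the words, which is one
# reason such watermarks may not survive those transformations.
```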
AI slop is not the only route by which synthetic data can end up in training sets. Another problem is that there are only so many words on the Internet.
Some experts estimate that the largest AI models have been trained on just a few percent of the available pool of text on the Internet. They project that these models may run out of public data to sustain their current pace of growth within a decade.
“These models are so enormous that the entire Internet of images or conversations is somehow close to being not enough,” Baraniuk said.
To meet their growing data needs, some companies are considering using today’s AI models to generate data to train tomorrow’s models. But researchers say this can lead to unintended consequences.
There are certain contexts where synthetic data can help AIs learn – for example, when output from a larger AI model is used to train a smaller one, or when the correct answer can be verified, like the solution to a math problem.
And new research suggests that when humans curate synthetic data (for example, by ranking AI answers and choosing the best one), it can alleviate some of the problems of collapse.
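Both ideas amount to filtering synthetic data before training on it. The sketch below illustrates the verification case with arithmetic, where checking an answer is cheap; noisy_model_answer is a made-up stand-in for a model that is usually, but not always, right.

```python
# Minimal sketch of "verify before you train" for synthetic data, using
# arithmetic because answers can be checked exactly. noisy_model_answer is a
# made-up stand-in for an AI model that is right most, but not all, of the time.
import random

rng = random.Random(0)

def noisy_model_answer(a, b):
    """Hypothetical model output: correct about 70% of the time, slightly off otherwise."""
    return a + b if rng.random() < 0.7 else a + b + rng.choice([-2, -1, 1, 2])

def build_verified_dataset(n_problems):
    """Keep only the synthetic question-answer pairs that pass an exact check."""
    dataset = []
    for _ in range(n_problems):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = noisy_model_answer(a, b)
        if answer == a + b:                      # the verifier: cheap and exact
            dataset.append((f"What is {a} + {b}?", str(answer)))
    return dataset

examples = build_verified_dataset(1000)
print(f"kept {len(examples)} of 1000 synthetic examples")   # roughly 700
print(examples[:3])
```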
But for now, there’s no replacement for the real thing.
Degenerative AI
In one example, the researchers trained a large language model on its own sentences over and over again, asking it to complete the same prompt after each round.
Initial AI output
When they asked the AI to complete a sentence that started with “To cook a turkey for Thanksgiving, you...,” even at the outset, the AI “hallucinated.”
To cook a turkey for Thanksgiving, you have to prepare it in the oven. You can do this by adding salt and pepper to the turkey, then cooking it in the oven until it is golden brown on the inside. This method has been shown to be effective reducing the risk of heart disease due to its use as a substitute for other foods...
AI output after two generations
But when the researchers further trained it on its own sentences, it got a lot worse. After two generations, it started simply printing long lists.
To cook a turkey for Thanksgiving, you have to be able to eat it all at once. However, if you don’t have enough time to do so, you can use other foods such as potatoes, beans, rice, meat, fish, vegetables, fruits, nuts, seeds, etc. In some cases, this may be due to the fact that...
AI output after four generations
And after four generations, it began to repeat phrases incoherently.
To cook a turkey for Thanksgiving, you need to know what you are going to do with your life if you don’t know what you are going to do with your life if you don’t know what you are going to do with your life if you don’t know what you are going to do with your life if you don’t know what you are going to do with your life...

©2024 The New York Times Company