Meta has quietly unleashed a new web crawler to scour the Internet and collect data en masse to feed its AI models.
The crawler, named Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, such as the text of news articles or the conversations in online discussion groups.
A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.
Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.
A Meta spokesman said the company has had a crawler under a different name “for years,” although this crawler – dubbed Facebook External Hit – “has been used for different purposes over time, like sharing link previews.”
“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesman said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”
Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July).
Flying under the radar
While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.
For a website to attempt to block a web scraper, it must publish a robots.txt file, a plain-text file served at the site's root that signals to scraper bots which parts of the site they should ignore. Typically, a bot's specific user-agent name must also be listed for a rule to apply to it, and that is difficult to accomplish if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt: the protocol is voluntary, and it is not enforceable or legally binding in any way.
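The mechanics described above can be sketched with Python's standard-library robots.txt parser. This is a minimal illustration, not Meta's actual configuration: the user-agent token `meta-externalagent` and the example URL are assumptions for demonstration, and the key point is that a rule only binds a bot whose name is explicitly listed (and only if the bot chooses to honor it).

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that names the new crawler's assumed
# user-agent token and tells it to stay off the entire site.
robots_txt = """\
User-agent: meta-externalagent
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The named bot is blocked everywhere on this site...
print(rp.can_fetch("meta-externalagent", "https://example.com/article"))  # False

# ...but a crawler not listed by name is unaffected by the rule.
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Because matching is by name, a site owner who does not know a new crawler's user-agent string cannot write a rule targeting it, which is why undisclosed bots tend to go unblocked.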
Such scrapers are used to pull mass amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta's Llama is one of the largest LLMs available, and it powers products like Meta AI, an AI chatbot that now appears across various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, the model's initial version was trained on large data sets assembled by other sources, such as Common Crawl.
Earlier this year, Mark Zuckerberg, Meta's co-founder and longtime CEO, boasted on an earnings call that his company's social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an organization that has scraped roughly 3 billion web pages each month since 2011.
The existence of the new crawler suggests Meta's vast trove of data may no longer be enough, however, as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need fresh, high-quality training data to keep improving. Meta is on track to spend up to US$40bil (RM174.68bil) this year, mostly on AI infrastructure and related costs. – Fortune.com/The New York Times