Stanford researchers gave a popular artificial intelligence chatbot a language test.
They asked the bot in Vietnamese to write a traditional poem in the form known as “song thất lục bát” that follows a pattern of lines made up of seven, seven, six, then eight words. When the bot spit out an answer, it wrote a poem but didn’t follow the format.
The team tried a different prompt, asking what the proper Vietnamese word was for a mother’s younger brother, and it responded with the words for a father’s younger and older siblings.
These flaws are not unique to Claude 3.5, the chatbot by the AI company Anthropic that the researchers queried, but they illustrate some of the ways in which AI can get language outside of standard American English wrong.
While the use of AI has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. AI experts worry that the language gap could exacerbate technological inequities and that it could leave many regions and cultures behind.
A delay of access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Laboratory who is on the team that built and tested a Vietnamese language model against others.
The tests his team ran found that AI tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the AI model to learn from.
Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because AI tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.
An analysis of top websites by W3Techs, a tech survey company, found that English makes up more than 60% of the internet’s language data. While English is widely spoken globally, native English speakers make up about 5% of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.
Academic institutions, grassroots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.
Lelapa AI, a startup based in Johannesburg, is one company leading such efforts on the African continent, developing multilingual AI products for people and businesses in Africa.
“I think it’s such a dangerous concept that people need to assimilate to a different culture and have to take on different cultures in order to have access to progress,” said Pelonomi Moiloa, CEO and co-founder of Lelapa AI.
The company is less focused on scale than on community-specific solutions, she said. It is crafting its products to be more resource-efficient and cost-effective, and to focus primarily on speech-to-speech communication in local languages, which makes the technology more accessible to African people.
“Large companies like Google, Apple, OpenAI, for example, have not necessarily trained their models for tools that serve these markets,” Chinasa T. Okolo, a fellow at the Center for Technology Innovation at the Brookings Institution, said about communities with low-resource languages. “They don’t provide enough market value for them to do so.”
A communications officer for OpenAI said the company releases AI systems steadily to more groups of people and that its latest model supports more than 50 languages. Google pointed to its projects focusing on AI development for underrepresented languages, including a “1,000 languages” initiative, announced in 2022, to build language models for the 1,000 most-spoken languages in the world. Apple said it, too, has developed products to support a range of languages.
The consequences of the language gap in AI tools can be numerous. The technology has potential to increase productivity and change workplaces, but without reliable data in local languages, some regions of the world could miss out on the economic benefits, according to AI experts. The exclusion of low-resource languages could also lead to cultural bias in AI products.
AI’s lack of knowledge in low-resource languages has the potential to raise security concerns as well. Sara Hooker, the head of Cohere for AI, the nonprofit research arm of the startup Cohere, said some users could bypass the safety measures of AI products by asking questions in other languages.
“You can easily, for example, still get very dangerous instructions about how to build a bomb just by switching to a different language,” Hooker said.
Hooker’s team at Cohere for AI launched a broad model and data set for multilingual AI, called Aya, in February. It includes 101 languages and relies on the volunteer efforts of more than 3,000 independent researchers. But Hooker said that even a project that big wasn’t a solution to the language lag.
She said that in AI, the industry is often focused on the latest model and how it performs, “but in this particular topic, it’s also reshaping the ecosystem as a whole,” adding that the gap will widen unless researchers from around the world are involved as AI develops further and at a rapid pace.
While the issue is obvious for many in the industry, the solutions are complicated. Large language models, or LLMs, the technology that powers tools that communicate in human language, require large banks of high-quality data, often collected from the internet and not easily accessible for low-resource languages. Truong equated building an LLM to teaching a newborn: There may be 20,000 books with lessons in English, but there are just five in Vietnamese.
The disparity is so large in some regions that governments have stepped in to back efforts to build their own language models. This spring, the Nigerian government promised to back the tech startup Awarri in building a model for local languages. Both Iceland’s government and the Welsh government work with OpenAI to improve ChatGPT’s understanding of the native languages there.
“The language gap is really important in terms of access, but it is also just really important to help reenergize people’s sense of pride in who they are, where they come from,” Moiloa of Lelapa AI said.
Sanmi Koyejo, the head of Stanford Trustworthy AI Research, said including more languages in all AI products is also important to capture cultural nuances and diverse perspectives.
Koyejo pointed to a Stanford study that fed questions from Pew Research to AI chatbots to gauge their biases. He said the chatbots’ answers most closely matched the views of people in California, where much of the technology is being developed.
“Culture is a big aspect of this,” he said. “You lose something if you’re only seeing the Internet slash US-centric version of the world.” – The New York Times