The creator of an audio deepfake of US President Joe Biden urging people not to vote in last week’s New Hampshire primary has been suspended by ElevenLabs, according to a person familiar with the matter. ElevenLabs’ technology was used to make the deepfake audio, according to Pindrop Security Inc, a voice-fraud detection company that analyzed it.
ElevenLabs was made aware of Pindrop’s findings and is investigating, the person said. Once the deepfake was traced to its creator, that user’s account was suspended, said the person, asking not to be identified because the information isn’t public.
ElevenLabs, a startup that uses artificial intelligence software to replicate voices in more than two dozen languages, said in a statement that it couldn’t comment on specific incidents but added, “We are dedicated to preventing the misuse of audio AI tools and take any incidents of misuse extremely seriously.”
Earlier last week, ElevenLabs announced a US$80mil (RM378.20mil) financing round from investors including Andreessen Horowitz and Sequoia Capital. Chief executive officer Mati Staniszewski said the latest financing gives his startup a US$1.1bil (RM5.20bil) valuation.
In an interview, Staniszewski said that audio that impersonates voices without permission would be removed. On its website, the company says it allows voice clones of public figures, like politicians, if the clips “express humor or mockery in a way that it is clear to the listener that what they are hearing is a parody.”
The fake robocall of Biden urging people to save their votes for the US elections in November has alarmed disinformation experts and elections officials alike. It not only illustrates the relative ease of creating audio deepfakes but also hints at the potential for bad actors to use the technology to keep voters away from the polls.
A spokesperson for the New Hampshire Attorney General said at the time that the messages appeared “to be an unlawful attempt to disrupt the New Hampshire Presidential Primary Election and to suppress New Hampshire voters.” The agency has opened an investigation.
Users who want to clone voices on ElevenLabs must use a credit card to pay for the feature. It isn’t clear if ElevenLabs passed this information to New Hampshire authorities.
Bloomberg News received a copy of the recording on Jan 22 from the Attorney General’s office and tried to determine which technology was used to create it. Those efforts included running it through ElevenLabs’ own “speech classifier” tool, which is supposed to show if the audio was created using artificial intelligence and ElevenLabs’ technology. The recording showed a 2% likelihood of being synthetic or created using ElevenLabs, according to the tool.
Other deepfake-detection tools confirmed it was a deepfake but couldn’t identify the technology behind the audio.
Pindrop’s researchers cleaned the audio by removing background noise and silence, then broke it into 155 segments of 250 milliseconds each for deep analysis, Pindrop’s founder Vijay Balasubramaniyan said in an interview. The company then compared the audio to a database of other samples it has collected from more than 100 text-to-speech systems that are commonly used to produce deepfakes, he said.
The researchers concluded that it was almost certainly created with ElevenLabs’ technology, Balasubramaniyan said.
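As a rough illustration of the segmentation step Balasubramaniyan describes, the sketch below strips near-silent samples from a recording and cuts what remains into 250-millisecond segments. It is not Pindrop’s code: the function names, the amplitude threshold and the 8,000 Hz sample rate (typical of telephone audio) are assumptions made for the example, and the noise-removal and database-comparison steps are left out.

```python
def drop_silence(samples, threshold=0.01):
    """Keep only samples whose absolute amplitude reaches the threshold.

    A crude, per-sample stand-in for silence removal (assumed for illustration).
    """
    return [s for s in samples if abs(s) >= threshold]


def split_into_segments(samples, sample_rate, segment_ms=250):
    """Cut samples into consecutive, non-overlapping segments of segment_ms each.

    A trailing remainder shorter than one full segment is discarded.
    """
    seg_len = sample_rate * segment_ms // 1000  # samples per segment
    last_start = len(samples) - seg_len
    return [samples[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]


if __name__ == "__main__":
    sample_rate = 8000  # assumed telephone-quality sample rate
    # Synthetic stand-in for a recording: constant tone followed by silence.
    recording = [0.5] * 310_000 + [0.0] * 4_000
    voiced = drop_silence(recording)
    segments = split_into_segments(voiced, sample_rate)
    print(len(segments), "segments of", len(segments[0]), "samples each")
```

At 250 milliseconds apiece, 155 segments correspond to 38.75 seconds of audio.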
In an ElevenLabs support channel on Discord, a moderator indicated on a public forum that the company’s speech classifier can’t detect its own audio unless it’s analyzing the raw file, a point echoed by Balasubramaniyan. With the Biden call, the only files available for immediate analysis were recordings of the phone call, he said, explaining that this made the audio more difficult to analyze because bits of metadata were removed and it was harder to detect wavelengths.
Siwei Lyu, a professor at the University at Buffalo who specializes in deepfakes and digital media forensics, also analyzed a copy of the deepfake and ran it through ElevenLabs’ classifier, concluding that it was likely made with that company’s software, he told Bloomberg News. Lyu said ElevenLabs’ classifier is one of the first he checks when trying to determine an audio deepfake’s origins because the software is so commonly used.
“We’re going to see a lot more of this with the general election coming,” he said. “This is definitely a problem everyone should be aware of.”
Pindrop shared with Bloomberg News a version of the audio that its researchers had scrubbed and refined. Using that recording, ElevenLabs’ speech classifier concluded that it was an 84% match with its own technology.
Voice-cloning technology enables a “crazy combination of scale and personalisation” that can fool people into thinking they are hearing local politicians or high-ranking elected officials, Balasubramaniyan said, describing it as “a worrisome thing.”
Tech investors are throwing money at AI startups developing synthetic voices, videos and images in the hope that they will transform the media and gaming industries.
Staniszewski said in the interview that his 40-person company had five people devoted to handling content moderation. “Ninety-nine percent of use cases we are seeing are in a positive realm,” the CEO said. With its funding announcement, the company also shared that its platform had generated more than 100 years of audio in the past 12 months. – Bloomberg