But these aren’t new questions to the engineers at Boston-area companies at the leading edge of vocal simulation tech.
Cambridge-based Modulate and VocalID in Belmont each make software that can imitate human voices. The chief executives of both companies say they’re well aware of potential abuses and are establishing a coalition of software companies to set ethical standards for the industry. Meanwhile, Nuance Communications of Burlington, which pioneered software to enable computers to understand human speech, is focused on building products that can identify fake computer-generated speech.
For most people, there’s no need to panic about synthetic speech. “If the question is, could this happen to an ordinary person,” said Modulate chief executive Mike Pappas, “I think the answer is still pretty vehemently no.”
That’s because it takes a lot of audio data to build an accurate model of someone’s voice — several hours’ worth, according to Brett Beranek, general manager of Nuance’s security and biometrics business unit. Unless you’re a podcaster or YouTube video maven with lots of publicly available recordings of your voice, you’re safe. But that’s cold comfort for celebrities and politicians who’ve spoken into microphones all their lives.
In addition, Beranek warned that someday computers will need just a few minutes of someone’s speech to work up a decent simulation, making ordinary people a lot more vulnerable. “Our position is, it’s just a matter of time,” Beranek said.
VocalID, which started with medical applications, works with businesses to automate advertising voice-overs or customer service call centers. Instead of hiring voice actors to create each new announcement, a VocalID customer can choose from a library of vocal styles created through AI analysis of real human speech. For example, if a company wanted announcements read in a female contralto voice with a Southern accent, it would just feed the text into the program, which would generate the desired audio.
Modulate offers a very different product: simulated voices for use in real-time online video streaming. Gamers who broadcast live streams on services like Twitch or Facebook can choose from a library of realistic voices in place of their own. The Modulate software instantly converts the player’s speech into the new simulated voice. To people watching the stream, a 25-year-old woman could sound like a middle-aged man, or vice versa.
Neither Modulate nor VocalID is in the business of imitating celebrity voices. But both companies know that their underlying technology could be abused. So the two teamed up in 2019 to form the AiTHOS Coalition, which is meant to act as a synthetic media watchdog.
“The purpose of AiTHOS is to hold each other accountable and to uphold our commitment to fair, equitable, and ethical use of synthetic media technology,” said Rupal Patel, chief executive of VocalID and professor of computer science at Northeastern University.
Pappas said that 15 other synthetic media companies are taking part in discussions with AiTHOS, and that he expects them to become official members of the group.
Meanwhile, both companies are taking care to ensure their systems aren’t abused. Neither of them sells to the general public, only to enterprises. In addition, both Modulate and VocalID insert digital “watermarks” into all synthetic audio files, making it possible for forensic investigators to prove that a recording wasn’t created by a human voice.
That’s not good enough for Nuance’s Beranek, who notes that criminals could create a synthetic audio program without a watermarking feature. So Nuance, which was recently acquired for $19.7 billion by Microsoft, is focused on ways to detect synthetic voices, whether watermarked or not. It’s a vital task for Nuance, which makes voice identification software that’s used by banks and other businesses to identify customers. The company must ensure that this software can’t be tricked.
So Nuance attacks a key weakness of synthetic speech: its sameness.
“There is this variability in the human voice that is a telltale sign that it’s an actual human,” said Beranek. If a real person says the same word 20 times, it’ll sound slightly different every time. But according to Beranek, voice imitation software will make the same sounds in the same way, every time. Human ears might not detect this, but Nuance’s software can pick it up and flag the speech as artificial.
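The idea Beranek describes can be sketched in a few lines of Python. This is only a toy illustration, not Nuance's software, whose workings the company does not detail here. It assumes each repetition of a word has already been reduced to a short list of numbers (a hypothetical "feature vector"), and the function names and the 0.05 threshold are invented for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_pairwise_distance(features):
    """Average Euclidean distance between every pair of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two repetitions to compare")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def looks_synthetic(features, threshold=0.05):
    """Flag a set of repetitions whose variation falls below the threshold.

    The threshold is an arbitrary illustrative value, not a calibrated one.
    """
    return mean_pairwise_distance(features) < threshold


# Hypothetical feature vectors for the same word spoken three times.
human = [[1.00, 2.00], [1.10, 1.95], [0.92, 2.08]]      # small natural variation
synthetic = [[1.00, 2.00], [1.00, 2.00], [1.00, 2.00]]  # identical every time

print(looks_synthetic(human))
print(looks_synthetic(synthetic))
```

In this sketch the human samples differ slightly from one another and pass, while the identical synthetic samples are flagged. A real detector would have to work from actual audio and cope with synthetic voices that add variation on purpose.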
Still, Beranek expects synthetic voice software to keep getting better, and harder to spot. “This is a constant struggle to stay ahead of these malicious actors,” he said. “One of the key messages that we communicate to our customers is, don’t sit on your hands.”