It’s not easy to make a synthetic voice sound human. We’ve become very used to the dulcet tones of voice assistants such as Apple’s Siri and Amazon’s Alexa, but whether it’s because of their mispronunciations, awkward pauses or relentless cheeriness, we instinctively know the difference between these computerised approximations of a human voice and a real one.
But what if we couldn’t tell the difference? Synthetic voices have, in the last couple of years, become increasingly sophisticated. Deep learning techniques have given algorithms a much better handle on the way human beings speak, and they can now instruct synthetic voices to express themselves with ever greater nuance.
Earlier this month, Reuters reported on how a start-up in Los Angeles had constructed a synthetic voice avatar from the voice of a local DJ, Andy Chanley. That “robot DJ” version of Andy can now deliver written lines in a way that’s hard to distinguish from the real thing. Chanley himself, having spent three decades broadcasting, is delighted that his voice will live on, and it’s clear that across the fields of entertainment, broadcasting and marketing, synthetic voices will become normal.
What perhaps couldn’t have been predicted, however, is our attachment to human voices, and how computerised voices can disconcert us when they masquerade as the real thing.
“Voice is so personal, so human,” says Jon Stine of the Open Voice Network, an organisation developing ethical standards for voice technology. “It’s biometric. It identifies us uniquely.
“It can be used to infer our age, our health, educational level, ethnicity. When friends say hello, we know who they are! It’s a very precious element in our life, and we must treat it with the respect it deserves.”
Earlier this year, filmmaker Morgan Neville was deemed to have violated that respect when he admitted to using AI to reconstruct the voice of the chef Anthony Bourdain for a documentary about Bourdain’s life and death. The use of AI wasn’t disclosed in the film, and those watching would never have known had Neville not admitted it later. When he did, it provoked a fierce debate about ethics.
“In my opinion, people reacted to it in an adverse fashion because it almost feels offensive to that individual's lack of ability to control their persona,” says technology ethicist David Polgar. “Is this something he would have wanted? So suddenly we feel vulnerable to being manipulated, but we also feel vulnerable because it means we can’t trust our own judgement [when hearing these voices]. We need to be able to trust our ears.”
Nevertheless, the global text-to-speech market, worth around $2 billion last year, is projected to grow threefold by 2028. That growth is largely driven by consumer demand for content, and by the difficulty of meeting that demand within the limits of traditional ways of working.
Voices synthesised from celebrities could be used globally, in any context, without the celebrity having to personally record those messages. Dubbing in movies could easily be fixed. Actors and voiceover artists could have their voices localised with different accents, even different languages. The worlds of advertising, education, virtual reality and even health could see significant benefits.
And yet the technology has a way to go. There are still inherent difficulties in creating a synthetic voice that doesn’t prompt the “uncanny valley” effect, where the listener has the sense that something isn’t quite right. That’s because the written word and the spoken word are very different things.
“When you use text-to-speech, AI needs to guess how to say it,” says Alex Serdiuk, chief executive of Respeecher, a synthetic voice company in Ukraine. “And this AI is extremely limited to what emotions it can guess. Also, speech doesn’t just consist of words. Whispering, or singing, sighing or screaming – these things cannot be converted using text-to-speech in any way, and they’re very important parts of our speech.”
Respeecher’s elegant solution to the problem is using what it calls “speech-to-speech” technology, where one person’s voice, complete with all its nuances, is transformed into that of another. The technology was recently used in The Mandalorian, a Star Wars spin-off series, to provide a voice for the young Luke Skywalker. No one knew until they were told later.
As with any form of AI, these innovations can be put to nefarious uses. “It often takes the general public a longer while to fully recognise a problem when it's already been incorporated into society – and that's a problem,” says Polgar.
Artificial voices have already been used to perpetrate telephone scams, where people are persuaded to part with money in the belief that they’re speaking to someone they trust. As the quality of these voices improves, our susceptibility to these scams will increase.
Voice actors and radio broadcasters have become concerned for their livelihoods, and as Polgar notes, such voices can make the public feel vulnerable. Organisations such as the Open Voice Network are busy constructing ethical frameworks for the technology, but what of those who simply don’t adhere to them?
“In most countries, it's just not legal to use someone's intellectual property – ie their likeness, their voice – to produce something without their consent,” says Serdiuk. “But the first and most important goal is to educate societies that these technologies exist, that they will fall into the wrong hands, and will be misused. So we should start treating this information differently.”