Humans are unable to accurately detect over a quarter of deepfake speech samples, a study has shown.
The study, published in PLOS ONE, is the first to examine the human capacity to distinguish artificially generated speech in languages other than English.
Deepfakes, synthetic media that mimic a real person's voice or appearance, are a form of generative artificial intelligence.
This form of AI utilises machine learning to teach an algorithm the patterns and features of a data set – such as a video or audio recording of a real person – enabling it to replicate original sounds or visuals.
Where such systems once required thousands of voice samples, cutting-edge pre-trained algorithms can now reproduce a person's voice from just a three-second audio clip.
These open-source algorithms are not only readily accessible but can also be trained in a matter of days, even by someone with limited expertise.
In a notable development, tech company Apple has recently introduced a software feature for its iPhone and iPad devices that can clone a user's voice using 15 minutes of audio.
The University College London researchers conducted their study by creating deepfake speech samples in English and Mandarin using a text-to-speech algorithm.
This algorithm was trained on two public data sets and used to generate 50 deepfake speech samples in each language.
The samples were intentionally different from the ones used to train the algorithm, to prevent it from simply duplicating the original input.
To assess humans' ability to discern real from fake, these artificially generated samples were played alongside genuine samples to a pool of 529 participants.
Participants accurately identified the fake speech only 73 per cent of the time, a figure that improved only marginally after they were trained to recognise deepfake speech.
“Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content,” lead author of the study, Kimberly Mai from UCL Computer Science, said.
“Considering the samples we used were created with comparatively old algorithms, it raises the question of whether humans would fare worse in detecting deepfake speech produced using more advanced technology now and in the future.”
The researchers now aim to develop superior automated speech detectors to counter the threats posed by artificially generated audio and imagery.
While generative AI audio technology can yield benefits, such as enhancing accessibility for people with speech impairments or those who may lose their voice due to illness, concerns are mounting about its potential misuse by criminals and nations to cause harm to people and societies.
“AI-powered detectors are a common way to detect speech deepfakes. During training, they see lots of examples of real and fake speech,” Ms Mai told The National.
“Through this process, the detectors learn patterns that make synthesised speech distinguishable from real examples.
“Our results indicate we should not be too reliant on current iterations of AI-powered detectors.
“Although they are good at identifying examples of deepfake speech similar to samples seen during training, for example, if the speaker identity is the same, their performance can decline when there are changes to the test audio, for example, the speaker identity is different, or the environment is noisier.”
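Ms Mai's description of detector training amounts to a standard supervised binary classifier, and her caveat about distribution shift can be illustrated numerically. The sketch below is purely illustrative, assuming toy two-dimensional Gaussian "acoustic features" rather than any real speech corpus or the study's actual detectors: a simple logistic-regression detector scores well on test audio matching its training conditions, and noticeably worse when the fake samples come from a shifted distribution (a stand-in for a new speaker or a noisier environment).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, fake_mean, fake_spread=1.0):
    """Toy 'acoustic features': real speech ~ N(0, 1), fake ~ N(fake_mean, spread)."""
    real = rng.normal(0.0, 1.0, size=(n, 2))
    fake = rng.normal(fake_mean, fake_spread, size=(n, 2))
    X = np.vstack([real, fake])
    y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = real, 1 = fake
    return X, y

def train_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (weights + bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(fake)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0) == y))

# Train the detector under one recording condition...
X_train, y_train = make_data(2000, fake_mean=2.0)
w, b = train_logistic(X_train, y_train)

# ...then test on matched conditions vs a shifted condition.
X_match, y_match = make_data(1000, fake_mean=2.0)
X_shift, y_shift = make_data(1000, fake_mean=0.7, fake_spread=1.5)

acc_match = accuracy(w, b, X_match, y_match)
acc_shift = accuracy(w, b, X_shift, y_shift)
print(f"matched conditions: {acc_match:.2f}, shifted conditions: {acc_shift:.2f}")
```

On this toy data the matched-condition accuracy sits around 90 per cent while the shifted-condition accuracy drops sharply, mirroring the degradation Ms Mai describes.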
A case in point occurred in 2019, when a British energy company's chief executive was tricked into transferring hundreds of thousands of pounds to a fraudulent supplier by a deepfake recording of his superior's voice.
“With generative artificial intelligence technology getting more sophisticated and many of these tools openly available, we’re on the verge of seeing numerous benefits as well as risks,” senior author of the study, Lewis Griffin of UCL Computer Science, said.
“While it is crucial for governments and organisations to develop strategies to address misuse of these tools, we should also acknowledge the positive possibilities that are on the horizon.”
On developing and deploying deepfake speech detectors, Ms Mai said: “Because the deepfake speech detectors can be sensitive to changes in audio, it is important to evaluate them in various situations, for example, different speakers, noisier environments or varying accents, to minimise false positives and negatives.”
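Evaluating a detector "in various situations" while tracking false positives and negatives, as Ms Mai suggests, comes down to computing the two error rates separately for each test condition. A minimal sketch, using entirely hypothetical per-condition labels and detector verdicts (1 = flagged as fake) rather than any real evaluation data:

```python
def error_rates(y_true, y_pred):
    """False-positive rate (real audio flagged as fake) and
    false-negative rate (fake audio passed as real)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    real = sum(1 for t in y_true if t == 0)
    fake = len(y_true) - real
    return fp / real, fn / fake

# Hypothetical verdicts across two evaluation conditions.
conditions = {
    "same speaker, quiet": ([0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 1, 0]),
    "new speaker, noisy":  ([0, 0, 1, 1, 1, 0], [1, 0, 0, 1, 0, 0]),
}
for name, (y_true, y_pred) in conditions.items():
    fpr, fnr = error_rates(y_true, y_pred)
    print(f"{name}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

Reporting the two rates per condition, rather than a single pooled accuracy, is what exposes the kind of condition-specific degradation the researchers warn about.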