Meta unveils its first speech-translation system for unwritten languages

The company will make its technology available to the AI community to allow other researchers to build on its work


Facebook parent company Meta has released its first speech-to-speech translation system for languages that are primarily spoken rather than written.

The system was developed under Meta’s Universal Speech Translator (UST) project, which aims to build artificial intelligence systems that provide speech-to-speech translation across all languages.


Meta's AI researchers built translation systems for the Hokkien language — one of Taiwan’s official languages that is widely spoken within the Chinese diaspora but lacks a standard written form, the company said.

To allow other independent researchers to develop their own speech-to-speech models using Meta's technology, the California-based company has open-sourced the Hokkien translation model and released the data sets and speech matrix.

“Until now, AI translation has focused on written languages. Yet, of the more than 7,000 living languages, more than 40 per cent of languages are primarily oral and do not have a standard or widely known writing system,” Meta said in a blog post.

“We plan to use our Hokkien translation system as part of a universal speech translator and will open source our model, code and training data for the AI community to enable other researchers to build on this work.”

The latest AI-driven technology allows Hokkien speakers to have conversations with people who speak English.

However, the technology can be extended to other unwritten languages and will eventually work in real time, Meta said. The company added that more than 8,000 hours of Hokkien speech had been mined, together with the corresponding English translations.

While the model is still a work in progress and can only translate one complete sentence at a time, “it is a step towards a future where simultaneous translation between languages is possible”, Meta said.

“We are releasing the speech matrix, a large corpus of speech-to-speech translations mined with Meta’s innovative data mining technique called Laser, which will enable researchers to create their own speech-to-speech translation systems and build on our work.”

Speech-to-speech translation systems have been developed over the past several years, with top technology companies such as Alphabet and Microsoft rolling out similar products.

Meta faced several challenges when developing direct speech-to-speech translation, including data gathering, model design and evaluation.

Most speech translation systems use text as an intermediary step. For example, speech in one language is first converted to text, then translated to text in the desired language and finally input into a text-to-speech system to generate audio.
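The three-step cascade described above can be illustrated with a minimal Python sketch. Every function here is a hypothetical stand-in written for illustration, with toy byte strings in place of audio. None of it is Meta's code or any real library's API. It only shows that each stage consumes the previous stage's text output, which is why this design presupposes a written form for both languages.

```python
# Illustrative sketch of a text-based (cascaded) speech translation pipeline.
# All stages are hypothetical stubs: real systems would use trained models.

def recognise_speech(audio: bytes) -> str:
    """Stand-in for speech recognition: source audio -> source-language text."""
    return audio.decode("utf-8")


def translate_text(text: str) -> str:
    """Stand-in for text translation, using a toy lookup table."""
    toy_dictionary = {"source greeting": "target greeting"}
    # Unknown phrases are passed through unchanged in this toy version.
    return toy_dictionary.get(text, text)


def synthesise_speech(text: str) -> bytes:
    """Stand-in for text-to-speech: target-language text -> target audio."""
    return text.encode("utf-8")


def cascaded_translate(audio: bytes) -> bytes:
    """Run the three stages in order; text is the intermediary at each hand-off."""
    source_text = recognise_speech(audio)      # step 1: speech to text
    target_text = translate_text(source_text)  # step 2: text to text
    return synthesise_speech(target_text)      # step 3: text to speech


if __name__ == "__main__":
    print(cascaded_translate(b"source greeting"))
```

A direct speech-to-speech model would replace all three functions with a single audio-in, audio-out step, removing the two text hand-offs.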

This makes speech-to-speech translations dependent on text in ways that limit their efficiency and make them difficult to scale to languages that are primarily oral, Meta said.

By contrast, direct speech-to-speech translation models enable the translation of languages that don’t have standardised writing systems.

This speech-based approach could lead to faster and more efficient translation systems as they will not require the additional steps of converting speech to text, translating it and then generating speech in the desired language.

Spoken communications can also help to break down barriers and bring people together wherever they are located — even in the metaverse, Meta said.

“AI research is helping to break down language barriers — both in the real world and the metaverse.

“In the future, all languages, whether written or unwritten, may no longer be an obstacle to mutual understanding. We look forward to contributing to this future of seamless communication,” the company said.

The metaverse is a digital space that allows users to communicate and move virtually in their three-dimensional avatars or digital representations.

Described as a successor to the internet, it is a set of immersive spaces shared by users, in which they can interact, innovate and engage other people who are not in the same physical location.

Updated: October 26, 2022, 7:16 AM