Few can doubt the Arabic language’s significance in the world around us. There are nearly as many native Arabic speakers as there are North Americans, and they outnumber native French speakers six to one. Arabic is an official language at the UN, and the liturgical language of the world’s 1.8 billion Muslims.
On the internet, however, its presence is surprisingly scarce – almost non-existent. Although Arabic is the fourth-most common language among netizens (mirroring the offline world), less than 1 per cent of online content is published in it. More websites are written in Czech, a language spoken by a population the size of the UAE’s.
With the advent of new AI technology that can generate text, speech, images and other media, often to a level that matches or possibly surpasses human ability, the online presence of languages has suddenly become more relevant than ever. “Generative AI” is expected to transform the digital and real worlds alike. Its most common form is the large language model (LLM), which, as the name suggests, learns to produce coherent content in a given language by training on vast amounts of data, usually drawn from the internet. As a rule, the more data available for training, the better the model. It is easy to see, then, why English seems set to dominate the AI revolution, and why those who want to safeguard a future for other languages are racing to catch up.
This week, the position of the Arabic language got a boost with the roll-out of Jais, an open-source bilingual Arabic-English LLM developed in the UAE. Jais’s developers – a team drawn from Abu Dhabi AI firm G42, Mohamed bin Zayed University of Artificial Intelligence and US tech firm Cerebras Systems – say their LLM is now the most accurate one available in Arabic.
Impressively, Jais can operate in multiple Arabic dialects – a skill that speakers of the language will know to be critical for widespread adoption and success. Arabic is often referred to by linguists as a “macrolanguage”, owing to the extreme variations across these dialects. Jais’s developing ability to generate content across them, along with Modern Standard Arabic and English, could one day help to strengthen translation services, bolster the Arabic education sector and drive more digital adoption in the Arab world.
The greatest challenge for Jais, of course, is the limited online Arabic material on which to train. But Andrew Jackson, chief executive of the G42 unit involved in Jais, says overcoming this obstacle is a major focus of the team’s work.
“We’re spearheading an initiative to collect more Arabic data from offline sources,” he told The National. “So this has already kicked off in earnest and this is the first method that we will employ to boost Arabic.”
Developing an Arabic LLM to a level where it bears all the promise of English-language counterparts like ChatGPT will be a monumental task. It is perhaps little wonder that Jais is named after the UAE’s highest mountain. But if the summit of its potential can be reached, it could transform life in the Arab world and ensure that one of humanity’s great ancient languages has a permanent place in its future.