A new Abu Dhabi-developed artificial intelligence large language model for Arabic has been unveiled, aiming to bring one of the world's most widely used languages into the AI mainstream.
Jais, an open-source bilingual Arabic-English model, was developed by Inception, a unit of Abu Dhabi AI company G42, together with Mohammed bin Zayed University of Artificial Intelligence and Silicon Valley-based Cerebras Systems.
The developers said Jais is more accurate than other existing LLMs for Arabic. It is available to download on machine learning platform Hugging Face.
The launch of Jais is a further step towards encouraging the scientific and computing communities to focus more on non-English LLMs, similar to efforts made in Japan and India, Andrew Jackson, chief executive of Inception, told The National.
“We see Jais really becoming very useful in generative use cases, such as generating responses to questions, generating documents, translations, emails and even providing advice and recommendations,” he said.
The model captures the linguistic nuances of various Arabic dialects and can comprehend language, context and cultural references, “making it more accurate and contextually relevant than other models”, the companies said.
Jais – a nod to the UAE's highest peak in Ras Al Khaimah – has been developed for government use and the financial, energy, climate and healthcare sectors.
Several public and private organisations in the UAE have signed on as Jais launch partners, including the Ministry of Foreign Affairs, the Ministry of Industry and Advanced Technology, the Department of Health – Abu Dhabi, ADNOC, Etihad Airways, FAB and e&, the technology conglomerate formerly known as Etisalat.
Jais was trained on the Condor Galaxy, the “world's largest AI supercomputer”, launched by G42 and Cerebras in July, using 116 billion Arabic tokens and 279 billion English tokens. It is being continuously expanded as more Arabic content is collected to generate new instruction sets, the companies said.
Tokens are the building blocks of language for an LLM: the basic units of text or code that a model processes and generates.
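As a rough illustration of what a token is, the Python sketch below splits a sentence into word and punctuation tokens and gives each one a numeric ID. It is a toy scheme written for this explanation and is not how Jais tokenises text; production LLMs typically use subword tokenisers, and the function names and sample sentence here are invented for the example.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (toy scheme, not Jais's tokenizer)."""
    # \w+ matches runs of letters or digits in any script, including Arabic;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Give each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn text into the list of token IDs a model would actually consume."""
    return [vocab[tok] for tok in tokenize(text)]

# Hypothetical bilingual sample: an English phrase followed by an Arabic greeting.
sample = "Jais reads Arabic: مرحبا بالعالم"
tokens = tokenize(sample)
vocab = build_vocab(tokens)
ids = encode(sample, vocab)

print(len(tokens), "tokens")
print(ids)
```

A corpus size such as the 116 billion Arabic tokens cited by the companies is a count of units like these, although the exact number for any text depends on the tokeniser used.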
Arabic is one of the most widespread languages worldwide, spoken by more than 400 million people, according to WorldData. It is the official language in 22 countries and is partly spoken in 11 others. Its online presence is minuscule, however, with Arabic accounting for only about 1 per cent of online content, according to data presented by the companies.
Mr Jackson said Jais would help to boost this figure.
“We're spearheading an initiative to collect more Arabic data from offline sources. So this has already kicked off in earnest and this is the first method that we will employ to boost Arabic,” Mr Jackson said.
“We are also looking into new ways to synthesise Arabic and translate existing English to Arabic and improve Arabic conversion … we're a long way off, but I think we have to be very optimistic and really push the needle forward.”
Organisations have long used AI, but the technology gained significant momentum with the advent of generative AI, made popular by Microsoft-backed OpenAI's ChatGPT.
Overall, it has created a new battlefront in the tech sector, with companies vying to get a head start and broaden their scope in generative AI.
The availability of LLMs would help companies in their efforts, especially as developers continuously improve AI capabilities.
“Speed performance is important to developers, not only because it lets them bring new models to the community or into production or to the market more quickly, but because it allows data scientists and ML researchers to quickly bring up and iterate on different models,” Mr Jackson said.