Lost in translation: Why machine learning finds Arabic challenging
A new Google grant and research at the University of Sharjah show the language is gaining traction
Machine learning is one of the fastest growing and most transformative technologies in the world. Yet as it increasingly caters to English and Chinese speakers, it leaves other people and economies playing catch-up at a critical time in its development.
"Arabic is falling behind," Professor Ashraf Elnagar, the coordinator of the machine learning and Arabic language processing research group at the University of Sharjah, told The National.
The global machine learning market is projected to grow from $7.3 billion (Dh26.7bn) in 2020 to $30.6bn in 2024, according to a 2019 report by Market Research Future. The pandemic has dented these outlays, but data scientists are still in demand.
LinkedIn currently has about 75,000 active job listings worldwide for those with machine learning skills, the majority of them in the US, Asia and Western Europe.
There are numerous examples of machine learning popping up in everyday life: Netflix's recommendations or the playlists generated by Spotify; Siri or Alexa responding to a spoken request for the local headlines; or a credit card company sending a text alert about potentially fraudulent activity.
Machine learning is also being used by businesses to generate consumer insights, improve customer service, reduce costs and automate processes.
But chances are, the data being used for any of these activities - especially as they become more advanced - is in Chinese or English, or possibly Spanish or French, the most popular languages fueling this artificial intelligence boom, according to Prof Elnagar.
For Arabic, the challenges are twofold: the complexity of the language and the amount of resources and research being put into its development.
Arabic is "structurally ambiguous", according to Prof Elnagar, lacking capital letters to indicate proper nouns or the start of sentences, for example.
There are also three recognised forms of the language: classical, as in the Quran; modern standard, which is conversational and seen on TV; and colloquial, which has 20 dialects and is "gaining traction on social media. It has its own population, its own customers and it is on the rise", Prof Elnagar said.
This variability and ambiguity in meaning make it very challenging to train machines to make human-like decisions when they are reading Arabic.
John Lillywhite, the digital transition lead at Al Bawaba, a Jordanian media company, recently helped the company win a grant from Google to tackle this challenge.
The title of his pitch: 'Why Can’t Machines Read Arabic at Scale?'
With financial support from Google, his team will work to make one of the largest Arabic language news archives in the Middle East searchable.
There are commercial and scholarly upsides to making this database searchable and opening it up to third-party publishers and researchers, who can extract meaning and insight from the archive. Once the tool is developed, which is expected to take around 18 months, it can be used by other publishing platforms and websites to improve the filtering and finding of Arabic content.
This would be a major milestone for Arabic media, and it is a facet of news reading that native English speakers take for granted.
"It would be great if machines could tell the difference between a peace process story and a sporting event in Arabic," Mr Lillywhite told The National. "We’re not there yet."
But his work is one of several signs of progress in the field of machine learning and Arabic. The UAE's dedicated AI university, the Mohamed bin Zayed University of Artificial Intelligence, will open its doors next January, and the University of Sharjah is seeing greater interest from entrepreneurs and the private sector in finding commercial applications for its research.
"I think we will see huge strides in the next decade," Prof Elnagar said. "It is extremely promising."
Published: May 25, 2020 08:00 AM