How do you talk to a robot in Arabic? Is it best to address it in the formal language of news broadcasters or as you would speak to a friend?
It is a philosophical question at the heart of what language means and what Arabic is today.
“I usually joke about how even in the Arab imagination we have a black hole,” said Nizar Habash, programme head of computer science at New York University Abu Dhabi.
“How do I talk to a computer or a robot? What would I actually say to it? In what dialect? How would it answer back?”
Mr Habash leads a lab with a seemingly simple goal: to make everyday Arabic understandable to machines.
Arabic has two forms, the formal literary language called Fusha and myriad dialects, which are often mutually unintelligible. Dialect is the language of daily life but has a lower status.
This second-class standing means everyday technology such as predictive text and speech recognition still does not work well in spoken Arabic.
The NYUAD lab plans to change this.
This year, it will release text prediction software for Gulf Arabic using a collection of 200,000 words compiled last year. The collection, called the Gumar Corpus, opens the door for predictive text, speech recognition and speech synthesis in the dialect.
This is good news for Arabic speakers who want Alexa’s Arabic voice to sound like a neighbour instead of a literature professor.
The development of dialect in computing has not been welcomed by everybody. Formal Arabic still lags behind English and many believe it should be the priority, not dialect.
“People, just by default, think dialects are just bad Arabic,” Mr Habash said. “It’s such an insult to all of this wonderful culture that is celebrated and enjoyed but at the same time denied status.”
There are also technological barriers. Machines can learn languages by comparing identical documents in two languages or similar texts in different languages about the same topic, such as news stories. But news stories and government papers are written in formal Arabic and there are few comparative texts in dialects.
The variety of spellings in dialect is another obstacle.
For Mr Habash, the need for more programming in dialect was self-evident. He was raised in several countries in which different dialects of Arabic are spoken.
The Palestinian was born in Iraq and grew up in Lebanon, Syria, the Soviet Union and Tunisia. At 17, he moved to the US to study linguistics and computer engineering as an undergraduate.
Programming in dialect was common sense to Mr Habash because it is the language of daily life.
Social media increased the use of written dialect, because it is the language of choice for texting.
“And of course, you know, when it comes to people who cannot read or write, they only have dialect,” he said.
“It is the dominant form in the spoken space, so we have to deal with whatever that means.
“Our goal is to develop a better understanding of the data to build better applications. It’s not to make political statements. Our goal, from a technology point of view, is just to try and catch up with what’s happening in other languages technologically.”
The building blocks of language, found in romantic novels
To do this, the building blocks of language are needed: words.
Each word must be manually labelled, or annotated, with descriptors such as tense and gender. With hundreds of thousands of examples, a computer can teach itself the language.
The more examples the computer sees, the better its predictions.
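As a rough illustration of learning from labelled words, the sketch below counts how often each written word carries each label and predicts the most frequent one. It is not the NYUAD lab's method. The transliterated words, the label format and the function names are all invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations: (written word, label).
# Real corpora carry far richer descriptors; these entries are invented.
ANNOTATED = [
    ("ktb", "verb:past:masc"),
    ("ktb", "noun:plural"),
    ("ktb", "verb:past:masc"),
    ("drs", "noun:singular"),
]


def train_tagger(examples):
    """Count how often each written word appears with each label."""
    counts = defaultdict(Counter)
    for word, label in examples:
        counts[word][label] += 1
    return counts


def predict_label(counts, word):
    """Return the most frequent label seen for the word, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


tagger = train_tagger(ANNOTATED)
print(predict_label(tagger, "ktb"))  # verb:past:masc
```

With only four examples the guess is crude. The same counting approach improves as the number of annotated words grows, which is why the size of a collection matters.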
“People are so fixated about algorithms when they do AI but they don’t ask where the data for algorithms comes from,” Mr Habash said.
“If your data is not done in a proper, consistent way, you’re going to get garbage in and garbage out.”
Formal Arabic has about a million annotated words. The Egyptian dialect, spoken by about 98 million people and a vast diaspora, has 400,000 annotated words.
Levantine Arabic has about 50,000 annotated words and Gulf Arabic has 200,000 annotated words, thanks to the NYUAD project.
To compile its collection of words, the Gumar project had to find non-copyrighted text in dialect, and a lot of it.
Researchers hit the jackpot when they found a directory of 1,200 romantic novels written by anonymous women. The genre was popular in the blogosphere before the rise of social media.
The public directory had more than 100 million words in Gulf Arabic.
The task of annotation began. This is a long process in Arabic, because most vowels are not written and readers decipher words by context. A single written word in Arabic, on average, has three meanings, seven pronunciations and 12 interpretations.
For a computer to guess a word’s vowels and pronunciation, it must first derive meaning from context.
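A minimal sketch of that idea, assuming a toy setting: the same unvowelled word, written here in transliteration as "ktb", can be read as "kataba" (he wrote) or "kutub" (books), and the preceding word helps to decide. The training triples and function names are hypothetical, and real systems use much richer context than a single neighbouring word.

```python
from collections import Counter, defaultdict

# Hypothetical training triples: (previous word, unvowelled word, vowelled reading).
# "hw ktb" stands for "he wrote"; "hdhh ktb" stands for "these are books".
TRAINING = [
    ("hw", "ktb", "kataba"),
    ("hw", "ktb", "kataba"),
    ("hdhh", "ktb", "kutub"),
]


def train_vocalizer(triples):
    """Count readings per (previous word, word) pair and per word alone."""
    by_context = defaultdict(Counter)
    by_word = defaultdict(Counter)
    for prev, word, reading in triples:
        by_context[(prev, word)][reading] += 1
        by_word[word][reading] += 1
    return by_context, by_word


def vocalize(model, prev, word):
    """Pick a reading from context, falling back to the word's commonest reading."""
    by_context, by_word = model
    if (prev, word) in by_context:
        return by_context[(prev, word)].most_common(1)[0][0]
    if word in by_word:
        return by_word[word].most_common(1)[0][0]
    return word  # unseen word: leave it unvowelled


vocalizer = train_vocalizer(TRAINING)
print(vocalize(vocalizer, "hdhh", "ktb"))  # kutub
print(vocalize(vocalizer, "hw", "ktb"))    # kataba
```

When the context has not been seen before, the sketch falls back to the reading that was most common overall, which is one simple way of handling sparse data.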
Annotating 200,000 words took three Egyptian linguists in Alexandria, all former Gulf residents, eight months. This was finished last August. Meanwhile, NYUAD researchers began to train computers to distinguish and translate between dialects.
The politics of language equality
The Madar programme, a collaboration with researchers at Carnegie Mellon University in Qatar, creates comparable data for different dialects.
Its creators have built a 47,000-word lexicon for dialects from 25 different cities, sourcing material from travel books.
Resources from the Gumar and Madar projects are free to university researchers and available for commercial licensing.
Dialect databases matter because they make technology accessible to all, said Mona Diab, a computer science professor at George Washington University.
“You’re basically giving people first-hand access to information, so I think that’s one of the most important and impactful aspects of dialect and technology,” said Prof Diab, a specialist in natural language processing.
“You won’t need to have an education to understand what’s happening.”
This hit home for Prof Diab when she was a girl in Egypt. Her uncles lived on the Arabian Peninsula during the First Gulf War, and her illiterate grandmother could not understand the formal Arabic of the televised news about the conflict. She relied on her grandchildren to translate the broadcasts into her dialect.
“How do you guarantee fairness and equality in the data that you’re using?” she asked.
“How do you use that to create better technology and how do you use that to democratise knowledge?”
Technological investment in dialect requires government support. Otherwise, the Arab world could be left behind.
Research into Arabic AI is led by the West. If Arabs do not do it themselves, there can be unintended consequences, Prof Diab said.
“There’s always a cultural dimension and a nuance that is going to be missed if you’re not native to the culture. It’s not just about language, it’s about identity. It’s now an opportunity to define our identity outside an occidental or outside perspective.”
Funding for Arabic dropped as western countries reduced their military presence in the Middle East, said Khaled Shaalan, a professor of computer science at the British University in Dubai.
“We are behind because Arabic needs a lot of resources, a lot of investment, and this has become very low,” Prof Shaalan said.
“For example, the United States and many other places stopped funding projects. At the time that there was war, yes, they were interested. But now they have switched to other languages.
“We have the technology now, the computer capacity to do language processing. All we need is the funding to train the career researchers who will work on this. It needs effort.”