Researchers at NYUAD will release text prediction software for Gulf Arabic using a 200,000 word collection of Gulf vocabulary completed last year. Mathew Kurian / The National

How do you say 'Hey Siri' in spoken Arabic? NYUAD lab is working on Gulf dialects to find the answer



How do you talk to a robot in Arabic? Is it best to address it in the formal language of news broadcasters or as you would speak to a friend?

It is a philosophical question at the heart of what language means and what Arabic is today.

“I usually joke about how even in the Arab imagination we have a black hole,” said Nizar Habash, programme head of computer science at New York University Abu Dhabi.

“How do I talk to a computer or a robot? What would I actually say to it? In what dialect? How would it answer back?”

Mr Habash leads a lab with a seemingly simple goal: to make everyday Arabic understandable to machines.


Arabic has two forms, the formal literary language called Fusha and myriad dialects, which are often mutually unintelligible. Dialect is the language of daily life but has a lower status.

This second-class standing means everyday technologies such as predictive text and speech recognition still do not work well in spoken Arabic.

The NYUAD lab plans to change this.

This year, it will release text prediction software for Gulf Arabic using a collection of 200,000 words compiled last year. The collection, called the Gumar Corpus, opens the door for predictive text, speech recognition and speech synthesis in the dialect.

This is good news for Arabic speakers who want Alexa’s Arabic voice to sound like a neighbour instead of a literature professor.

Nizar Habash, the programme head of computer science at NYUAD, heads a lab that makes everyday Arabic understandable to machines. Pawan Singh / The National

The development of dialect in computing has not been welcomed by everybody. Formal Arabic still lags behind English and many believe it should be the priority, not dialect.

“People, just by default, think dialects are just bad Arabic,” Mr Habash said. “It’s such an insult to all of this wonderful culture that is celebrated and enjoyed but at the same time denied status.”

There are also technological barriers. Machines can learn languages by comparing identical documents in two languages or similar texts in different languages about the same topic, such as news stories. But news stories and government papers are written in formal Arabic and there are few comparative texts in dialects.

The variety of spellings in dialect is another obstacle.

For Mr Habash, the need for more programming in dialect was self-evident. He was raised in several countries in which different dialects of Arabic are spoken.

The Palestinian was born in Iraq and grew up in Lebanon, Syria, the Soviet Union and Tunisia. At 17, he moved to the US to study linguistics and computer engineering as an undergraduate.


Programming in dialect was common sense to Mr Habash because it is the language of daily life.

Social media increased the use of written dialect, because it is the language of choice for texting.

“And of course, you know, when it comes to people who cannot read or write, they only have dialect,” he said.

“It is the dominant form in the spoken space, so we have to deal with whatever that means.

“Our goal is to develop a better understanding of the data to build better applications. It’s not to make political statements. Our goal, from a technology point of view, is just to try and catch up with what’s happening in other languages technologically.”

The building blocks of language, found in romantic novels

To do this, the building blocks of language are needed: words.

Each word must be manually labelled, or annotated, with descriptors such as tense and gender. With hundreds of thousands of examples, a computer can teach itself the language.

The more examples are used, the better the prediction.
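The idea of learning from annotated examples can be sketched in a few lines of Python. This is only a minimal illustration, not the lab's method: the tokens and labels below are made-up placeholders rather than entries from the Gumar Corpus, and the helper names `train` and `predict` are hypothetical. The sketch counts how often each label was assigned to a written form and predicts the most frequent one.

```python
from collections import Counter, defaultdict

# Hypothetical annotated examples: (written form, label) pairs.
# Tokens and labels are placeholders, not data from the Gumar Corpus.
annotated = [
    ("token_a", "verb.past.masc"),
    ("token_a", "verb.past.masc"),
    ("token_a", "noun.fem"),
    ("token_b", "noun.masc"),
]

def train(examples):
    """Count how often each label was assigned to each written form."""
    counts = defaultdict(Counter)
    for form, label in examples:
        counts[form][label] += 1
    return counts

def predict(counts, form):
    """Return the most frequent label seen for this form, or None if unseen."""
    if form not in counts:
        return None
    return counts[form].most_common(1)[0][0]

model = train(annotated)
print(predict(model, "token_a"))  # label seen most often for token_a
print(predict(model, "token_c"))  # never annotated, so no prediction
```

A form that was never annotated gets no prediction at all, which is one simple way to see why more examples lead to better coverage.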

“People are so fixated about algorithms when they do AI but they don’t ask where the data for algorithms comes from,” Mr Habash said.

“If your data is not done in a proper, consistent way, you’re going to get garbage in and garbage out.”

Formal Arabic has about a million annotated words. The Egyptian dialect, spoken by about 98 million people and a vast diaspora, has 400,000 annotated words.

Nizar Habash works at his office at the NYUAD campus on Saadiyat Island. Pawan Singh / The National

Levantine Arabic has about 50,000 annotated words and Gulf Arabic has 200,000 annotated words, thanks to the NYUAD project.

To compile its collection of words, the Gumar project had to find non-copyrighted text in dialect, and a lot of it.

Researchers hit the jackpot when they found a directory of 1,200 romantic novels written by anonymous women. The genre was popular in the blogosphere before the rise of social media.

The public directory had more than 100 million words in Gulf Arabic.

The task of annotation began. This is a long process in Arabic, because most vowels are not written and readers decipher words by context. A single written word in Arabic, on average, has three meanings, seven pronunciations and 12 interpretations.

For a computer to guess a word’s vowels and pronunciation, it must first derive meaning from context.

Annotating 200,000 words took three Egyptian linguists in Alexandria, all former Gulf residents, eight months. This was finished last August. Meanwhile, NYUAD researchers began to train computers to distinguish and translate between dialects.

The politics of language equality

The Madar programme, a collaboration with researchers at Carnegie Mellon University in Qatar, creates comparable data for different dialects.

Its creators have built a 47,000-word lexicon for dialects from 25 different cities, sourcing material from travel books.

"All we need is the funding," says Khaled Shaalan, a professor of Computer Science at the British University in Dubai. Antonie Robertson / The National

Resources from the Gumar and Madar projects are free to university researchers and available for commercial licensing.

Dialect databases matter because they make technology accessible to all, said Mona Diab, a computer science professor at George Washington University.


“You’re basically giving people first-hand access to information, so I think that’s one of the most important and impactful aspects of dialect and technology,” said Prof Diab, a specialist in natural language processing.

“You won’t need to have an education to understand what’s happening.”

This hit home for Prof Diab when she was a girl in Egypt. Her uncles lived on the Arabian Peninsula during the First Gulf War and her illiterate grandmother relied on her grandchildren to translate televised news about the conflict into her dialect because she couldn't understand the formal Arabic on the broadcast.

“How do you guarantee fairness and equality in the data that you’re using?” she asked.

“How do you use that to create better technology and how do you use that to democratise knowledge?”

Technological investment in dialect requires government support. Otherwise, the Arab world could be left behind.

AI Arabic research is led by the West. If Arabs do not do it themselves, there can be unintended consequences, Prof Diab said.

“There’s always a cultural dimension and a nuance that is going to be missed if you’re not native to the culture. It’s not just about language, it’s about identity. It’s now an opportunity to define our identity outside an occidental or outside perspective.”

Funding for Arabic dropped as western countries reduced their military presence in the Middle East, said Khaled Shaalan, a professor of computer science at the British University in Dubai.

“We are behind because Arabic needs a lot of resources, a lot of investment, and this has become very low,” Prof Shaalan said.

“For example, the United States and many other places stopped funding projects. At the time that there was war, yes, they were interested. But now they have switched to other languages.

“We have the technology now, the computer capacity to do language processing. All we need is the funding to train the career researchers who will work on this. It needs effort.”
