Arabic is hard. That’s precisely why AI needs to get it right

Arabic is still missing from the global AI revolution

Voice assistants can book your flights, reorder groceries, and tell you the weather. But ask them a question in Arabic, and the response is often a blank stare. Or worse, a wildly off-target guess. 

For all the hype around AI’s global potential, Arabic has been left behind. Many companies have blamed the language’s complexity.

That’s starting to change. Across the region, governments are funding Arabic-first AI models. Startups are designing tools with native speakers in mind. Even global tech firms are beginning to pay attention. The momentum is building, although not fast enough.

WHY ARABIC IS SO CHALLENGING FOR AI

Joe Devassy, Director of Strategic Alliances at KPMG Lower Gulf, believes Arabic is one of the most linguistically complex languages for AI because of its rich syllable structure, context-dependent syntax, and multiple dialects.

He says, “Unlike English, Arabic words are often formed by inserting root letters into specific patterns, creating a wide variety of word forms from a single root. Therefore, classical Arabic is known for its rich vocabulary and semantic precision.”

This linguistic depth is evident in the fact that the word “lion” in English has 100 to 300 equivalents in Arabic, each with its own nuance and usage.

The challenge grows further with the divide between Modern Standard Arabic (MSA) and the many spoken dialects across the region, such as Emirati Arabic, which are still underrepresented in digital datasets. Devassy explains, “Across the Middle East, it’s common for people to code-switch between MSA, local dialects, and even English, depending on the context, adding another layer of complexity for AI systems.”

The Arabic script itself presents additional challenges. 

Being naturally cursive, it requires tailored approaches for text processing. The omission of short vowels, which is conventional in everyday writing, makes tasks like text-to-speech synthesis and sentiment analysis more difficult. Devassy adds, “These issues aren’t unique to Arabic. They’re shared by other languages with similarly cursive scripts and rich vocabularies.”
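
To make the vowel problem concrete, here is a minimal Python sketch (our illustration, not drawn from the article’s sources): strip the optional diacritics, as everyday written Arabic does, and three distinct words collapse into one ambiguous string.

```python
# Minimal sketch: removing Arabic short-vowel marks (harakat) erases
# the distinctions between words that share a consonant skeleton.
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks such as fatha, damma, kasra, and shadda."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Three different words built on the root k-t-b:
words = [
    ("كَتَبَ", "kataba, 'he wrote'"),
    ("كُتِبَ", "kutiba, 'it was written'"),
    ("كُتُب", "kutub, 'books'"),
]

for vocalized, gloss in words:
    print(f"{gloss}: {vocalized} -> {strip_diacritics(vocalized)}")

# All three collapse to the same bare form كتب, which is what a
# text-to-speech or sentiment model sees in most real-world text.
```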

Christophe Zoghbi, Founder and CEO of Zaka, also emphasizes that Arabic presents a unique set of challenges shaped by linguistic, structural, and socio-cultural factors.

One major issue is that Arabic is a morphologically rich language. A single root can generate many words through complex inflection, conjugation, and derivation. This creates data sparsity and makes tokenization, the step that splits text into units a model can represent numerically, more difficult.
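
A rough way to see this in practice (our example, using a general-purpose multilingual tokenizer rather than anything the experts name) is to feed derived forms of the root k-t-b, “to write,” to Hugging Face’s transformers library. Morphologically related words often shatter into unrelated subword pieces, spreading the root’s meaning across fragments and inflating sequence length.

```python
# Sketch: how a generic multilingual tokenizer fragments words that
# Arabic speakers perceive as close relatives of a single root.
from transformers import AutoTokenizer

# Model choice is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

for word in ["كتب", "مكتبة", "يكتبون", "استكتب"]:
    print(word, "->", tokenizer.tokenize(word))
```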

Another concern is diglossia. MSA is used in formal settings, while spoken dialects vary widely across regions and are sometimes mutually unintelligible. This makes it hard for AI systems to generalize effectively or even identify which form of Arabic is used.

Zoghbi highlights the absence of standardized spelling in dialectal Arabic, particularly on social media and in informal texts. This inconsistency adds noise to datasets and limits the effectiveness of traditional NLP methods.
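
Arabic NLP pipelines usually fight this spelling noise with character-level normalization before training. The sketch below shows a few substitutions common across Arabic preprocessing projects; the exact rule set is our simplified assumption and varies from tool to tool.

```python
# Simplified orthographic normalization for noisy Arabic text.
import re

def normalize_arabic(text: str) -> str:
    text = re.sub("[إأآا]", "ا", text)  # unify alef variants
    text = re.sub("ى", "ي", text)       # alef maqsura -> ya
    text = re.sub("ة", "ه", text)       # ta marbuta -> ha (lossy)
    text = re.sub("ـ", "", text)        # strip tatweel (kashida)
    return text

print(normalize_arabic("أنا ذاهب إلى المكتبة"))  # -> انا ذاهب الي المكتبه
```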

THE BIGGEST LIMITATION 

The biggest limitation with current large language models (LLMs) is clear, says Hasan Abusheikh, Co-founder of CNTXT.

“They weren’t built for Arabic. They’re applying English rules to a completely different system.”

Even when these models are fine-tuned on translated Arabic data, they often miss the mark. Abusheikh explains that the intent, tone, and nuance, especially across dialects, are lost in translation. The sentence may be technically correct, but the meaning often falls flat.

A major issue lies in the data itself. It is messy, limited, and overly focused on MSA. Dialects, which dominate daily conversations, are scarcely represented in existing datasets.

He notes, “That’s why we stopped trying to retrofit global models and built our Arabic-first stack. It wasn’t optional. It was necessary. Most LLMs today don’t have the foundation to handle Arabic as it needs to be handled.”

Ahmed Serag, Chief AI Officer at Weill Cornell Medicine, echoes this perspective, pointing out that most LLMs treat Arabic as an afterthought. Their training data is heavily skewed toward English, making their Arabic capabilities shallow. Even when Arabic is included, it typically comes with little exposure to regional dialects or informal usage. The result is outputs that may appear correct on the surface but lack cultural nuance, contextual awareness, or emotional resonance.

Mainstream models often underperform in Arabic, especially because they lack high-quality, diverse training data. Devassy notes that in countries like the UAE, where official documents, media, and business transactions often combine MSA, dialects, and English, existing models struggle to maintain consistency in tone and translation. Emirati Arabic, for example, is rarely seen in training datasets.

“AI models sometimes produce literal translations that miss cultural context or idiomatic meaning,” says Devassy. “This becomes especially problematic in applications like customer service bots or legal document processing.”

There’s a structural bias in LLMs, where English dominates and Arabic, particularly domain-specific content, remains underrepresented. Zoghbi says, “While MSA sees moderate support, dialects are often poorly handled.” 

He also points to Arabic script and right-to-left structure as technical hurdles. These challenges impact tokenization, affecting the quality of tasks like generation and summarization.

IT’S NOT JUST WHAT YOU SAY, BUT HOW

Translation is only the surface layer, says Serag. True understanding requires cultural and contextual grounding, knowing not just what someone is saying but why and how. Indirect speech, politeness strategies, and idiomatic expressions are integral to Arabic communication, and a model that merely translates without grasping these subtleties risks misinterpreting both intent and tone.

Abusheikh agrees, emphasizing that if you’re lucky, translation might only get you 12% to 14% of the way there.

“You’ll get surface-level output, but the meaning will often be off, or worse, lost completely. Why? Because translation isn’t understanding. Arabic is loaded with intent, indirect meaning, and emotion. It’s not about what is said. It’s about why, how, and to whom.”

He points out that this can’t be solved by simply plugging in a translator. It requires building systems that think in Arabic and not just render it.

According to Devassy, for AI to genuinely understand Arabic, it must also capture cultural and contextual nuances. In multilingual, multi-dialect environments, systems must interpret humor, indirect cues, and varying levels of politeness to avoid miscommunication.

He explains that understanding regional customs, such as seasonal greetings or domain-specific language in areas like healthcare or legal services, is critical beyond grammar and vocabulary.

“Translation must be paired with cultural literacy and discourse-level comprehension for AI to function meaningfully,” Devassy says. “Without this deeper sensitivity, even flawlessly translated words risk missing the intent, tone, and social significance embedded within the language.”

HOW RESEARCHERS ARE CLOSING THE GAP

Researchers across the region are adopting several effective strategies to improve Arabic NLP and generative performance. One of the most critical steps is curating high-quality local datasets from news, legal, medical, and dialectal sources, an effort that requires accurate annotation and contextual understanding.

Many researchers are now training Arabic-specific models from scratch or fine-tuning multilingual LLMs such as mBERT or LLaMA on Arabic corpora. Zoghbi explains that local linguists and annotators play a crucial role in this process, helping to capture the rich linguistic nuances that off-the-shelf models often miss.
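
As a deliberately simplified sketch of that fine-tuning route, the snippet below continues masked-language-model training of mBERT on a local Arabic text file. The file name, hyperparameters, and model choice are placeholder assumptions, not details from the researchers quoted here.

```python
# Sketch: continued masked-language-model pretraining of mBERT on Arabic.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Assumes a local file of Arabic sentences, one per line.
corpus = load_dataset("text", data_files={"train": "arabic_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-arabic",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_set,
    # Randomly masks 15% of tokens so the model learns to fill them in.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```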

Serag adds, “We’ve been exploring fine-tuning smaller, efficient models on domain-specific Arabic data.” Rather than relying on global models to eventually adapt, his team is building tools grounded in real-world regional data.

“At the same time, we’re seeing promising Arabic-first or Arabic-capable models emerging, like Allam, Falcon, and Fanar,” he says. These models are already raising the bar for Arabic generative capabilities.

Arabic-focused models such as ARBERT, MARBERT, and Jais further support this progress by serving as strong baselines for consistent evaluation. On the technical front, tools such as multilingual transformers and retrieval-augmented generation (RAG) are being used to fill in domain-specific knowledge gaps, especially when applied within culturally sensitive pipelines.
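
In miniature, a RAG pipeline retrieves the most relevant passage for a query and hands it to a generator as grounding context. The sketch below is our illustration; the embedding model is a real multilingual checkpoint, but the passages, question, and prompt format are invented placeholders.

```python
# Sketch: retrieval step of a RAG pipeline over Arabic passages.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

passages = [
    "يشترط لتجديد الإقامة تقديم عقد عمل ساري المفعول.",  # residency renewal rule
    "تفتح المدارس الحكومية أبوابها في سبتمبر من كل عام.",  # school calendar
]
question = "ما هي متطلبات تجديد الإقامة؟"  # "What does renewing residency require?"

passage_emb = retriever.encode(passages, convert_to_tensor=True)
question_emb = retriever.encode(question, convert_to_tensor=True)
best = util.cos_sim(question_emb, passage_emb).argmax().item()

# The retrieved passage becomes grounding context for an Arabic-capable LLM.
prompt = f"السياق: {passages[best]}\nالسؤال: {question}\nالجواب:"
print(prompt)
```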

Efforts to develop Arabic-first models are accelerating in the UAE and the wider MENA region. Devassy points to the Arabic-focused LLM Jais, developed by G42 and Cerebras, which has made notable progress by training on a blend of Modern Standard Arabic, regional dialects, and English.

He explains that techniques like transfer learning, where models initially trained in English are fine-tuned using Arabic data, have been particularly effective. Complementary tools like syntactic segmentation and diacritization algorithms are also helping to resolve common challenges related to sound structure and script ambiguity.

BUILDING BETTER DATASETS LOCALLY

Local datasets are essential for building AI systems that accurately reflect the diverse ways Arabic is used across regions and contexts. However, creating them presents significant challenges.

“Annotating data, especially dialectal or domain-specific content, is costly and resource-intensive,” says Zoghbi. There are also ongoing concerns around privacy and data ownership, as many datasets are collected without clear user consent or defined usage rights. He adds that the Arabic NLP ecosystem remains fragmented, lacking centralized, accessible resources compared to English. “Addressing these gaps will require more collaboration, ideally led by governments and academic institutions to create and open-source high-quality Arabic datasets.”

Building locally grounded Arabic datasets is critical for linguistic precision, cultural relevance, and data sovereignty. Devassy explains that relying on international datasets risks misrepresenting local dialects and norms, which often vary based on geography. “Locally developed corpora are essential to power applications like healthcare AI, Arabic-first virtual assistants, and government services,” he says.

Yet the complexity of the Arabic script, its diacritics, and the need for accurate annotation demand substantial resources. “We are seeing regional initiatives like Jais, ArabicTreebank, and projects by KAUST, MBZUAI, and SDAIA that are steadily addressing these gaps,” Devassy adds.

To improve the quality of Arabic AI models, efforts must be aligned with real-world needs and business use cases. When Arabic is a core requirement, purpose-built models trained on contextually relevant data can make a significant impact. Synthetic data generation is emerging as a promising strategy, creating artificial, high-quality datasets to supplement existing resources and improve model training.
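
One widely used recipe for synthetic Arabic data is back-translation: round-tripping a sentence through another language to generate paraphrases that pad out scarce training sets. A minimal sketch follows; the OPUS-MT checkpoints are real, but the pipeline is illustrative, and, as the experts above caution, dialect and nuance can be flattened along the way.

```python
# Sketch: back-translation (Arabic -> English -> Arabic) as simple
# data augmentation for low-resource Arabic training sets.
from transformers import pipeline

ar_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")
en_to_ar = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")

def back_translate(sentence: str) -> str:
    english = ar_to_en(sentence)[0]["translation_text"]
    return en_to_ar(english)[0]["translation_text"]

# A customer-review example; the paraphrase becomes extra training data.
print(back_translate("الخدمة كانت ممتازة والتوصيل سريع جدا"))
```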

For Abusheikh, the core issue is simple: “If the data isn’t local, the model won’t be either.” He emphasizes that language is deeply tied to culture, and elements like sarcasm, tone, and intent can’t be captured through simple translation or generic labeling. “You need annotators who live the language to truly understand its nuances.”

Still, the challenges are steep. Only around 15% of Arabic text online is clean enough for model training, compared to more than 50% for English. Dialectal variation and contextual ambiguity add further complexity, and qualified annotators are scarce and expensive, often costing five to six times more than for English.

“The bottom line is that if the data isn’t right, everything else fails,” says Abusheikh. “That’s why much of the engineering effort goes into building robust data pipelines, not just improving the models.”

CAN ARABIC AI ACHIEVE NUANCE AND RELIABILITY?

“Yes, but it will take targeted effort,” says Serag. 

While English has had a two-decade head start, with access to web-scale data, deep research investment, and a robust open-source ecosystem, Arabic has the potential to catch up, especially as the region places AI at the center of its strategic agenda.

Devassy says that Arabic-language AI can reach similar levels of nuance and reliability as English models if supported by sustained, long-term, and locally grounded efforts. “The success of models like Jais demonstrates that with enough diverse, high-quality data and dialect coverage, AI can handle Arabic’s complexity,” he says.

Still, the path is steeper. Arabic lacks the historical digital depth of English and is fragmented across dialects, which makes generalization and evaluation more difficult. However, continued investment in AI research and infrastructure is accelerating progress in the UAE and the region.

To get there, experts say the focus must be on overcoming dialect bias, building better evaluation benchmarks, and involving native speakers in dataset development. These are the foundational steps needed to close the gap and unlock the full potential of Arabic-language AI.

THE PUSH TO MAKE AI ‘FLUENT’

Technically, advancing Arabic natural language processing is a challenge, says Zoghbi, but it’s about much more than innovation. “Culturally, it’s a matter of inclusion and digital sovereignty,” he notes. “Our language must be part of the AI future.” Economically, Arabic fluency in AI is critical for enabling businesses, governments, and communities across the region to harness the technology for education, healthcare, productivity, and more. “Ignoring Arabic fluency in AI,” he adds, “widens the digital divide between English-speaking and Arabic-speaking communities.”

For Serag, building Arabic-fluent AI is not just an academic pursuit. It’s a question of equity, sovereignty, and unlocking economic opportunity. Across much of the Middle East, Arabic is the language of government, law, and identity. Developing AI that understands and communicates in Arabic isn’t just about convenience. It’s about preserving cultural heritage while delivering inclusive and accessible digital services.

Devassy agrees. Arabic-capable AI is a critical enabler of national initiatives like Dubai’s Smart City and is transformative across sectors such as tourism, healthcare, and finance. Without robust Arabic NLP, he warns, AI tools risk excluding native speakers and delivering substandard experiences. Fluent AI isn’t just a competitive advantage in a region with more than 400 million Arabic speakers. It’s a strategic imperative.

For decades, over 150 million Arabic speakers have had to adapt to digital systems that don’t reflect their language or lived experience. “If AI can’t understand your language, it can’t understand your needs,” says Abusheikh. “That creates a fundamental disconnect.”

But the opportunity is clear. When AI communicates like the people it’s built to serve, adoption increases, trust deepens, and impact grows. With GCC nations alone projected to invest over $150 billion in AI by 2025, the push to make AI fluent in Arabic is not just urgent. It’s essential.

A TRULY ARABIC-FIRST AI MODEL?

A truly Arabic-first AI model would do more than simply process language. It would understand it in all its depth and diversity. “It would be trained on a broad corpus encompassing all major Arabic dialects alongside MSA,” explains Zoghbi. “It should generate and comprehend formal and informal speech naturally, embedding cultural context and regional knowledge. Just as crucially, it must be transparent, open-source, and inclusive, designed to welcome community contributions and evolve with real-world use.”

While regional initiatives are accelerating, realizing this vision will require greater collaboration, sustained funding, and coordinated leadership across MENA.

According to Devassy, a truly Arabic-first model would natively handle the language’s intricacies, from diacritics and complex syllable structures to context-aware translation. But its fluency would go beyond linguistics. It would grasp cultural idioms, religious references, and the code-switching patterns typical in conversations across the UAE and the broader region. Such understanding would enable applications like accurate voice synthesis, emotion-aware interactions, and sentiment analysis.

For Serag, an Arabic-first model isn’t simply an English model translated into Arabic. It’s one built from the ground up. It’s trained on Arabic data, evaluated against Arabic-specific tasks, and capable of moving fluidly between dialects and MSA. “It will understand the structure of a religious sermon, the tone of a customer complaint, and the rhythm of a poem. With continued regional collaboration and growing awareness, we’ll shape a future where Arabic is supported and fully embedded in AI systems.”

ABOUT THE AUTHOR

Karrishma Modhy is the Managing Editor at Fast Company Middle East. She enjoys all things tech and business and is fascinated with space travel. In her spare time, she's hooked to 90s retro music and enjoys video games. Previously, she was the Managing Editor at Mashable Middle East & India.
