
The Middle East scores big in building Arabic AI models despite challenges—what’s next?

Experts say that the next generation of language models in the Middle East region is expected to be better, faster, and more efficient

[Source photo: Krishna Prasad/Fast Company Middle East]

The generative AI revolution has its secrets: its huge cost, and the fact that the training data for the large language models (LLMs) underpinning it comes from a massive slice of the internet, which each model has to pore through to form the basis of its “knowledge.”

According to Statista, 52.1% of web content was in English, with Arabic text accounting for only 0.6%. This shortage of high-quality Arabic-language content makes it difficult for developers to pre-train LLMs. Even so, the Middle East has produced new Arabic AI models over the past two years, including the UAE’s open-source model, Falcon, which outperformed models from tech giants such as Meta on neutral language tests. Jais, developed by G42 and Cerebras, took this innovation further with significant advancements for Arabic LLMs.

Saudi Arabia is collaborating with Tonomus and Huawei to enhance the development of Arabic AI and LLMs. Recently, Huawei launched an Arabic LLM in Egypt amid the region’s rising demand for generative AI.

“While efforts to improve Arabic models are ongoing, the primary challenge is the smaller training data set available than in other languages,” says Dr. Leonid Zhukov, VP of Data Science BCG X and Director of BCG Global AI Institute. “By pooling resources and data, collaboration could be key in addressing this limitation.”

COLLABORATION IS KEY

The age-old adage “Two heads are better than one” holds true in building language models, empowering them to heighten their adherence to factual data and refine their decision-making. 

“To elevate Arabic LLMs, a collaboration between tech companies, academic institutions, and local governments is crucial,” says Abdallah Abu Sheikh, Founder of Astra Tech and CEO of Botim. “This can help pool resources and expertise to create richer, more representative datasets.”

He adds, “Additionally, a more robust infrastructure and computational resources tailored to handle the complexity of Arabic language processing are needed.”

The crux of the problem with Arabic LLMs lies in the inconsistency among the language’s many dialects. These regional variations add a layer of complexity that can lead to inaccuracies and flawed reasoning.

“Taking Arabic language models to the next level is challenging due to data availability. Arabic is a rich and complex language with many dialects, but unfortunately, many dialects lack sufficient digitized linguistic resources,” says Dr. Hakim Hacid, Acting Chief Researcher – AI Cross-Center Unit, Technology Innovation Institute (TII).

Arabic text often contains cultural references and nuances that can be difficult for a model to grasp, especially if it is trained on data from other languages. Resources are also scarce: compared with English, there are fewer annotated datasets, corpora, and pre-trained models available for Arabic, which makes it more challenging to develop and train effective models.

“Collaboration is essential,” adds Dr. Hacid. “By working with experts from various fields and regions, we can gather more comprehensive data, improve model accuracy, and ensure that these models do justice to the diversity of the Arabic-speaking population.”

ORTHOGRAPHIC VARIATION

Another challenge is orthographic variation. Arabic script often attaches particles such as conjunctions, prepositions, and pronouns directly to the words they accompany, and there are different conventions for representing vowels and other sounds. This can make it challenging to accurately tokenize Arabic text (split it into words or subword units), which is crucial in building Arabic LLMs.
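As a rough illustration of what handling those vowel conventions can look like in practice, the sketch below shows a normalization pass of the kind often run before tokenization. It is a minimal example and not the method used by any team quoted here; the function name `normalize_arabic` and the choice of which characters to fold together are assumptions made for illustration.

```python
import re
import unicodedata

# Short-vowel marks and related diacritics (U+064B to U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef written with madda or hamza (U+0622, U+0623, U+0625).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
TATWEEL = "\u0640"  # purely decorative elongation character
BARE_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Map common spelling variants of the same word to one surface form."""
    text = unicodedata.normalize("NFC", text)   # canonical Unicode composition
    text = DIACRITICS.sub("", text)             # drop optional vowel marks
    text = text.replace(TATWEEL, "")            # drop elongation
    return ALEF_VARIANTS.sub(BARE_ALEF, text)   # fold alef variants to bare alef


# The same three-letter word, written with and without short-vowel marks.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = "\u0643\u062A\u0628"
print(normalize_arabic(vocalized) == normalize_arabic(plain))  # prints True
```

Normalization of this kind is lossy: vowel marks sometimes distinguish otherwise identical words, so whether to strip them is a design choice that depends on the task.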

There’s a solution, however. Longer context and new advances in zero-shot learning (in which a model learns to translate into another language without seeing an example) are valuable tools, says Imed Zitouni, Director of Engineering at Google. “With these techniques, 110 new languages have been added to Google Translate, including Cantonese and Tamazight. Partnership with linguists and native speakers remains critical.”

Furthermore, training in multiple languages enhances model performance, especially with cross-language parallel translation. Abu Sheikh says, “Developing specialized tools and frameworks tailored to the linguistic and cultural characteristics of the Arabic-speaking world can further enhance the capabilities of these models.”

USE OF SYNTHETIC DATA

A common solution researchers are investigating is self-training, in which models learn from synthetic data they generate themselves. Synthetic data can play a crucial role in the future of LLMs and bridge some gaps where there isn’t enough accurate data. However, its use must be carefully evaluated to ensure quality and relevance.

A key step is having stronger data filtering and cleaning processes to help minimize biases from inception. Dr. Hacid suggests implementing robust verification and fact-checking mechanisms and building advanced reasoning capabilities into the models. 

“These efforts will make LLMs more fair, reliable, and trustworthy across various applications. It’s about creating a technology people can depend on and feel confident using safely.”

“Language models will always inherit the limitations of the data used for their training. Research shows that while a certain amount of synthetic data could boost the performances of LLMs, its usage should remain controlled and must not be excessive,” he adds.
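To make the two ideas above more concrete, here is a minimal sketch of a cleaning pass followed by a cap on the share of synthetic text in a training mix. It is illustrative only: the function names, the thresholds (`min_chars`, `min_arabic_ratio`, `max_synthetic_pct`), and the idea of measuring "Arabic-ness" by character ratio are all assumptions, not a pipeline described by the experts quoted here, and simple filters like these do not by themselves remove bias.

```python
import re

ARABIC_CHAR = re.compile("[\u0600-\u06FF]")  # basic Arabic Unicode block


def arabic_ratio(text: str) -> float:
    """Fraction of non-space characters that fall in the Arabic block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if ARABIC_CHAR.match(ch)) / len(chars)


def clean_corpus(docs, min_chars=20, min_arabic_ratio=0.7):
    """Collapse whitespace, then drop short, mostly non-Arabic, or duplicate docs."""
    seen, kept = set(), []
    for doc in docs:
        text = " ".join(doc.split())
        if len(text) < min_chars or arabic_ratio(text) < min_arabic_ratio:
            continue
        if text in seen:  # exact-duplicate check only
            continue
        seen.add(text)
        kept.append(text)
    return kept


def cap_synthetic(real, synthetic, max_synthetic_pct=20):
    """Combine real and synthetic docs so synthetic is at most the given percent."""
    if not 0 <= max_synthetic_pct < 100:
        raise ValueError("max_synthetic_pct must be in [0, 100)")
    # s / (r + s) <= p / 100  is equivalent to  s <= p * r / (100 - p)
    limit = (max_synthetic_pct * len(real)) // (100 - max_synthetic_pct)
    return list(real) + list(synthetic[:limit])
```

With eight real documents and a 20 percent cap, for example, `cap_synthetic` admits at most two synthetic documents, since two out of ten is exactly 20 percent.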

Ensuring ethical standards requires a multi-faceted approach. Abu Sheikh suggests having diverse and representative training datasets that reflect various demographics and cultural contexts, adopting explainable AI techniques to make AI operations more understandable and accountable, continuously auditing AI systems for biases and unethical behavior, establishing clear governance frameworks, and collaborating with regulatory bodies.

Dr. Zhukov, however, believes that a more plausible scenario involves larger models teaching smaller models. Many researchers currently use GPT-4 to generate training sets or automatically evaluate model performance. Still, he says, “Self-training has limitations. When regenerating data, you do not add new information beyond the original training set. Closing this loop might not significantly improve model performance or intelligence.”

NEXT-GEN LLMS

Experts say that the next generation of language models in the Middle East region is expected to be better, faster, and more efficient. It will be characterized by increased localization and cultural sensitivity and will be fully multimodal, integrating text, speech, and visual data. It will also support a broader range of languages, reflecting the region’s linguistic diversity.

Zitouni says the future direction for LLMs is building a universal AI agent that can be helpful in everyday life. These agents would interact with people seamlessly and share a first-person perspective. Models would be able to speak to and control other software.

Contextual understanding and personalization are the natural steps forward, says Dr. Hacid. The process will involve fine-tuning, prompt engineering, and better automation of user feedback capture. 

To get value from generative AI, the path forward lies in large action models (LAMs), which, unlike LLMs, combine language understanding with logic and reasoning to execute various tasks, says Abu Sheikh. This will allow for a transition from click-based to prompt-based interactions, with users expected to interact with AI systems through natural language prompts rather than traditional clicks or taps.

“It’s a very fast-moving technology, but what’s ultimately clear is that today’s versions are inefficient,” says Dr. Zhukov, adding that the next advancement of LLMs will be “understanding the world we don’t have now.” “What’s most exciting is that you can imagine all kinds of applications from that evolution of technology.”

Looking ahead, we may end up with a few dozen widely used foundational Arabic LLMs, thousands of LAMs, and smaller language models providing valuable insights. These AI models can be immensely useful and are the key to unlocking the real power of generative AI for business.


ABOUT THE AUTHOR

Suha Hasan is a correspondent at Fast Company Middle East.
