Sep 05 2023

Jais: a New Pinnacle in Open Arabic NLP

Invited guest blog by Dr Andrew Jackson, CEO of Inception, a G42 company and Cerebras strategic partner.

Executive Summary

We introduce Jais, a new state-of-the-art Arabic-centric large language model (LLM). The model achieves the world’s best performance for an open Arabic model and its performance on English is comparable to the world-leading open English models, despite having been trained on less data. As such, Jais is a significant technological achievement and a new, valuable contribution to both the open-source and the enterprise AI language modeling communities.

Jais is a 13-billion parameter generative pre-trained transformer (GPT) model which is the result of a collaboration between Inception, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), the world’s first graduate-research university dedicated to AI, and Cerebras. The model was trained on a collection of Arabic, English, and code data on Condor Galaxy 1 (CG-1), the recently announced multi-exaFLOP AI supercomputer built by G42 and Cerebras.

Jais achieves better performance on Arabic language tasks than any known open-source mono- or multi-lingual LLM. Interestingly, Jais also yields English language performance comparable to world-leading models such as LLaMA 2, despite having been trained on significantly fewer English language tokens. Jais represents a significant step forward in open Arabic language modeling and in the development of methods and capabilities relevant to multi-lingual model development in general.

The following sections of this paper provide an introduction to LLMs and Jais, an overview of the model architecture, the training data and the methods, performance evaluation methods and results, a technical description of the CG-1 AI training platform, and some perspective on the potential impact for Arabic NLP and other sovereign language modeling applications.

Large language models (LLMs) have transformed how we use technology in a fundamental way across all aspects of our daily lives. AI-powered LLMs have become key players in advancing how we understand, create, and manipulate language – domains that lie at the core of modern communication.

One remarkable aspect of these LLMs is their ability to understand and to follow instructions. This breakthrough has transformed how users interact with LLMs. Instead of a simple back-and-forth, there’s now a dynamic exchange where LLMs can understand and carry out complex instructions with impressive skill.

The impact of this progress is extensive. LLMs can now smoothly handle a wide range of commands, from simple to complex. This opens up many possibilities, letting users use LLMs as partners to solve increasingly difficult tasks. This shift matters not just for tech-savvy people, but also for those with less access to technology, thus democratizing advanced problem-solving and learning.

Tied closely to these advanced AI systems is the ongoing concern about language diversity. An Arabic LLM becomes crucial here, helping preserve the richness and the diversity of the Arabic language and its various forms. Arabic is spoken by more than 400 million people in the world. It is a rich and complex language, with many dialects and variations. Arabic also has a unique writing system, which can pose difficulties for optical character recognition and text normalization.

A major challenge in training Arabic LLMs is that Arabic has a limited amount of high-quality data available for training large language models, compared to English. One potential approach to address this challenge is to leverage a robust machine translation system in conjunction with a powerful English LLM to emulate an Arabic-enabled LLM. This solution, however, is not without limitations. The performance of such a system would inevitably be constrained by the capabilities of the underlying translation system, potentially leading to suboptimal language generation and understanding. Moreover, a translation-augmented English LLM might lack the profound cultural insights and nuances ingrained within native Arabic texts. This absence of cultural knowledge could impede the model’s ability to authentically capture the essence of Arabic language use in various contexts. In addition, the adoption of such a system would entail inheriting the usage rights and the licensing restrictions associated with the English LLM that forms its foundation. This could raise concerns related to data privacy, intellectual property rights, and accessibility, which are important factors to consider in the development and the deployment of large language models.

As we explore new opportunities, we must carefully handle AI ownership. The key choice is whether to adopt externally developed LLMs or to nurture indigenous AI capability. We acknowledge the appeal of quick integration and advancement, but we must be aware of the potential risks. Finding a balance between using external solutions and developing sovereign AI capabilities is vital. The rise of Arabic LLMs is a significant milestone in our technological advancement, enabling us to generate sovereign AI capability in the UAE, while accommodating various advancements in the field.

We introduce a family of Arabic-centric LLMs with 13B parameters, named Jais and Jais-chat. The models were trained on the Condor Galaxy 1 (CG-1), the recently announced, state-of-the-art AI supercomputer co-developed by G42 and Cerebras Systems. The Jais family of models, built upon a purpose-built dataset, are infused with cultural and linguistic nuances of the Arabic language and overcome the barrier of data availability, aiming for a world where technology bridges gaps instead of exacerbating divides.

This white paper presents an overview of our experience training Jais. It summarizes the process of training the model, the challenges that emerged, and the innovations explored in the journey of training the world’s first and largest bilingual Arabic-English LLM.

Technical Overview

We developed a powerful large language model (LLM) with 13 billion parameters, based on the GPT-3 architecture. This LLM is aimed at enhancing Arabic language capabilities while incorporating English text and code. Trained on bilingual and code data, Jais is able to handle code-mixed content where Arabic and English intermingle within the same context or sentence. This training mix also allows the model to reason across languages and to transfer knowledge from English sources into the Arabic linguistic space and vice versa.

However, unlike previous massively multilingual LLMs, such as BLOOM or mT0, which contain more than 50 languages, we do not include languages aside from Arabic and English in any significant percentage. Neither do we relegate Arabic to a minority in the pretraining dataset. Instead, Arabic data constitutes ~33% of the pretraining. This choice of mixing two languages attains the best of both worlds. The LLM is highly fluent in Arabic, with linguistic capability as well as cultural awareness and sensitivity, while at the same time being on par in terms of reasoning capability and world knowledge that have been observed in recent English and code LLMs.

Building upon established model architectures and training procedures, we investigate the impact of variations along several different dimensions:

1. We improve the representation of Arabic in the model with a new bilingual vocabulary.
2. We dramatically extend the model context size using state-of-the-art positional embeddings.
3. We incorporate recent innovations in activation functions, inspired by the LLaMA family of models.
4. We use hyperparameter optimization recipes that allow for highly efficient use of compute.
5. We establish a method to compensate for the limited availability of Arabic pretraining data (as compared to English).

Architecture:

The model architecture is based on a decoder-only transformer design similar to industry-leading conversational models such as Claude, ChatGPT, and Bard. Specifically, this model consists of 40 decoder layers, 40 attention heads, and a model dimension of 5120, for a total of 13 billion parameters.
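
As a rough sanity check on the 13-billion figure, the sketch below applies the common back-of-the-envelope estimate of about 12·d² weights per decoder layer (roughly 4·d² for attention and 8·d² for the feed-forward block) plus one token-embedding matrix. The vocabulary size of 84,992 is the one quoted for the Jais vocabulary in this section. Ignoring biases and layer norms, and counting a single embedding matrix, are simplifying assumptions and not a statement of the exact Jais parameter layout.

```python
def approx_decoder_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter estimate for a decoder-only transformer.

    Assumes ~4*d^2 attention weights and ~8*d^2 feed-forward weights per
    layer, plus one token-embedding matrix. Biases and layer norms are ignored.
    """
    per_layer = 12 * d_model * d_model
    embedding = vocab_size * d_model
    return n_layers * per_layer + embedding


total = approx_decoder_params(n_layers=40, d_model=5120, vocab_size=84_992)
print(f"{total / 1e9:.2f}B parameters")  # prints "13.02B parameters"
```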

Jais uses a multilingual vocabulary containing 84,992 unique tokens, assigning equal importance to English and Arabic.
LLMs typically split up rarely seen words into more common sub-words or tokens. The performance of a vocabulary in a specific language is measured by the average number of tokens that a word in that language is split into. More tokens per word typically mean longer sequences, resulting in higher inference and training costs for the LLM. Our multilingual vocabulary achieves a lower number of tokens per word, and therefore a smaller average sequence length, in both languages as compared to current open-source LLM vocabularies.
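
The tokens-per-word measure described above can be sketched as follows. The `toy_tokenize` function is a hypothetical stand-in that cuts words into 3-character pieces; it is not the Jais tokenizer. Tokenizing each whitespace-separated word in isolation is also a simplification, since real sub-word tokenizers usually see the surrounding whitespace.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    return n_tokens / n_words if n_words else 0.0


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits a word into 3-character pieces."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


# "language" -> 3 pieces, "models" -> 2 pieces, so 5 tokens over 2 words.
print(tokens_per_word(["language models"], toy_tokenize))  # prints 2.5
```

A lower value on a representative corpus means shorter sequences for the same text, which is the property the vocabulary is designed to improve for both languages.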

ALiBi Positional Encodings

Positional encodings provide information about the word order to transformer-based LLM models. This is especially crucial in a morphologically rich language such as Arabic where word choices are highly influenced by the surrounding context. Arabic’s flexibility in word order requires a clear grasp of the grammatical implications of word positions. At the same time, there is inherent word ambiguity, especially in online text, which often omits diacritics.

We use ALiBi positional encodings. ALiBi encodes the position of a word relative to the other words in the context, and so allows the model to be used with longer text sequences than those seen during training. This enables training to be faster and less memory-intensive, because the model can be trained on shorter sequences, while unlocking the power of larger contexts during inference.
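
To make the idea concrete, here is a minimal pure-Python sketch of ALiBi-style attention biases: each head gets a fixed slope, and the attention score between a query at position i and a key at position j is penalized in proportion to the distance i - j. The sketch only handles head counts that are powers of two, so it does not cover a 40-head configuration as-is, and it is an illustration rather than the Jais implementation.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Causal bias matrix: entry [i][j] = -slope * (i - j) for j <= i.

    The bias is added to the attention scores before the softmax; future
    positions (j > i) are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)                  # 0.5, 0.25, ..., 1/256
bias = alibi_bias(slopes[0], seq_len=3)   # last row: [-1.0, -0.5, 0.0]
```

Because the bias depends only on the distance between positions, the same function can build a matrix for any `seq_len`, including lengths longer than those used in training.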

SwiGLU Activation Function

Activation functions play a pivotal role in Large Language Models (LLMs), allowing the model to grasp complex linguistic patterns for a nuanced understanding of language. We use the SwiGLU activation functions, as in the LLaMA model family.
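
For readers unfamiliar with SwiGLU, the following sketch shows the usual form of a SwiGLU feed-forward block: one projection is passed through the Swish function and used to gate a second projection element-wise, and the result is projected back down. It uses plain Python lists for clarity; the weight names `w_gate`, `w_up`, and `w_down` are illustrative and not taken from the Jais code.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swish(z: float) -> float:
    """Swish / SiLU: z * sigmoid(z), written with tanh for numerical stability."""
    return 0.5 * z * (1.0 + math.tanh(0.5 * z))


def swiglu_ffn(x: List[float], w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> List[float]:
    """SwiGLU feed-forward block: w_down @ (swish(w_gate @ x) * (w_up @ x))."""
    gate = [swish(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


out = swiglu_ffn([1.0], w_gate=[[1.0]], w_up=[[2.0]], w_down=[[1.0]])
```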

Efficient hyperparameter discovery using Maximal Update Parameterization

Hyperparameters are model configurations whose choice greatly impacts the performance of the trained model. Examples include batch size, learning rate, and learning schedule. Selecting the best set of hyperparameters for a large model is prohibitively expensive since it involves training multiple variations, using different hyperparameter settings, in search of the best combination.

In our 13B-parameter Jais model, we bypass this large computational cost by incorporating Maximal Update Parameterization (muP) into the model architecture. muP allows the findings of a hyperparameter search on smaller models to be transferred more reliably to a larger model, saving significant amounts of compute. muP enables this transfer of hyperparameters by controlling the initialization, the learning rates, and the activation magnitudes in a way that is independent of the model’s layer width. Therefore, as the model size and width increase, muP automatically adjusts the hyperparameters to compensate, whereas traditional methods are less stable and require finding new hyperparameters empirically.
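
The width-based transfer can be illustrated with a small sketch. It applies a simplified version of the scaling rules commonly used with muP under Adam-style optimizers: the learning rate of hidden weight matrices shrinks with the width multiplier, their initialization standard deviation shrinks with its square root, and the output logits are scaled down by the multiplier. The base width of 256 and the base values below are hypothetical, chosen only for illustration; they are not the settings used for Jais.

```python
from dataclasses import dataclass


@dataclass
class MupSettings:
    hidden_lr: float          # learning rate for hidden weight matrices
    hidden_init_std: float    # init standard deviation for hidden weights
    output_multiplier: float  # scale applied to the output logits


def transfer_hyperparams(base_width: int, target_width: int,
                         base_lr: float, base_init_std: float) -> MupSettings:
    """Scale hyperparameters tuned at base_width to a wider model (simplified)."""
    m = target_width / base_width  # width multiplier
    return MupSettings(
        hidden_lr=base_lr / m,
        hidden_init_std=base_init_std / m ** 0.5,
        output_multiplier=1.0 / m,
    )


# Hypothetical values tuned on a small proxy model, transferred to width 5120.
settings = transfer_hyperparams(base_width=256, target_width=5120,
                                base_lr=6e-3, base_init_std=0.08)
```

The point of the design is that the expensive search happens once on the small proxy model, and the large model reuses the result through these scaling rules instead of a fresh sweep.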

Pretraining:

Pretraining an LLM involves training it on billions of words of diverse text from various domains sourced from the Web and other sources. This foundational training equips the model with a robust understanding of the language(s) while enhancing its factual and cultural awareness. Existing LLMs such as LLaMA are predominantly trained on a single language, typically English. While these models demonstrate remarkable proficiency in linguistic and reasoning capabilities, their abilities do not extend well to languages they were not trained on, e.g., Arabic. Moreover, these models’ knowledge about the Arabic world, which is primarily found in native Arabic text, is fairly limited. To address this, we pretrain our model on the most extensive Arabic dataset available, while further extending it with English data.

Pretraining data

To collect Arabic data, we use multiple sources including web pages, Wikipedia articles, news articles, Arabic books, and social network content. To augment this dataset, we add data translated from selected English sources to Arabic using an in-house machine translation (MT) system. We target, in particular, high-quality English resources such as the English Wikipedia and English books. Specific filtering techniques were also developed to avoid translating sources containing embedded code, or text that is not well structured.

Pretraining data preprocessing

Preprocessing is a vital step towards the development of high-quality LLMs. It involves a series of stages, including filtering, normalizing, and cleaning the pretraining data. To ensure the highest quality of Arabic content, we implemented standard preprocessing techniques. Passing our raw dataset through this data processing pipeline yielded an Arabic dataset of 72 billion tokens. To further extend our dataset, we augmented it using data from the Pile English dataset along with some GitHub code. This forms the basis of our combined dataset that was used in pre-training.

Our preprocessing pipeline for Arabic is depicted in Figure 1. The raw data is primarily sourced from publicly available databases, such as Abu El Khair or BAAI, as well as through in-house web scraping and machine translation of high-quality English sources. Given that some of these sources might already be preprocessed or tokenized for NLP applications, it is essential to standardize our input. To ensure consistency across the board, we subject all data to an initial detokenization step (which leaves non-tokenized input unchanged). This process ensures a uniform starting point for all data, irrespective of its original state, converting each piece of data into a single entity — be it an article, a web page, or a report.

In the subsequent filtering step, we apply a series of heuristic-based rules to identify and to eliminate noisy and low-quality documents. For instance, we discard documents that are either too short or too lengthy. We also filter out documents that do not have a significant proportion of Arabic characters and sentences; this helps us exclude documents where Arabic characters might be present incidentally rather than being the primary language of the content. Additionally, any document containing a word with more than 100 characters is removed, as this often signifies the presence of extremely long URLs, a common trait in noisy documents.
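
The filtering rules above can be sketched as a single predicate. Only the 100-character word limit comes from the text; the length bounds and the Arabic-letter ratio are hypothetical thresholds chosen for illustration, and the check covers only the basic Arabic Unicode block.

```python
import re

# Basic Arabic Unicode block; extended Arabic blocks are ignored in this sketch.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def keep_document(text: str,
                  min_chars: int = 200,           # hypothetical threshold
                  max_chars: int = 100_000,       # hypothetical threshold
                  min_arabic_ratio: float = 0.5,  # hypothetical threshold
                  max_word_len: int = 100) -> bool:
    """Return True if the document passes all heuristic filters."""
    # Discard documents that are too short or too long.
    if not (min_chars <= len(text) <= max_chars):
        return False
    # Require a significant proportion of Arabic letters.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if ARABIC_CHAR.match(c))
    if arabic / len(letters) < min_arabic_ratio:
        return False
    # Very long "words" usually indicate URLs or other noise.
    if any(len(word) > max_word_len for word in text.split()):
        return False
    return True
```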

Once a document passes the filtering step, it is subject to cleaning and normalization. We remove non-printable Unicode characters and rare diacritical marks, and we normalize the text using the Camel toolset for Arabic NLP. We remove embedded Javascript and HTML, which are common sources of noise in web scraped datasets. Additionally, we eliminate highly frequent words and phrases, which often represent boilerplate text, such as news channel names. We normalize Arabic punctuation marks and we deploy a lightweight n-gram language model to further identify and remove noisy n-grams.
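
Two of the cleaning steps, stripping embedded markup and removing non-printable characters, can be sketched with the Python standard library alone. This is a crude illustration: it does not use CAMeL Tools, regex-based HTML stripping is not robust to malformed markup, and dropping every Unicode "C" category character (which includes zero-width joiners and directional marks) is an assumption about what counts as non-printable.

```python
import re
import unicodedata

SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(text: str) -> str:
    """Strip script blocks, HTML tags, and control/format characters."""
    text = SCRIPT_RE.sub(" ", text)  # embedded JavaScript, including its body
    text = TAG_RE.sub(" ", text)     # remaining HTML tags
    # Drop characters in Unicode categories Cc, Cf, Cs, Co, Cn, keeping
    # newlines and tabs so that document structure survives.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()


print(clean_text("<p>مرحبا</p><script>var x = 1;</script>"))  # prints "مرحبا"
```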

Finally, we apply a fuzzy deduplication step using standard locality sensitive hashing techniques. This deduplication process identifies and removes similar data across our entire pre-training set. At the end of this process, the final size of the data is about 20% of the original raw dataset size.
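
As an illustration of fuzzy deduplication with locality sensitive hashing, the sketch below builds MinHash signatures over word 3-grams, groups documents into candidate buckets by bands of the signature, and drops a document when its estimated similarity to an already kept document exceeds a threshold. The shingle size, number of bands and rows, and the 0.8 threshold are hypothetical choices; the text does not specify the parameters used for Jais.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[str]:
    """Set of word n-grams; very short texts become a single shingle."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(shingle_set: Set[str], num_perm: int) -> Tuple[int, ...]:
    """MinHash signature: the minimum of a seeded 64-bit hash per 'permutation'."""
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for s in shingle_set
        )
        for seed in range(num_perm)
    )


def deduplicate(docs: List[str], bands: int = 8, rows: int = 4,
                threshold: float = 0.8) -> List[str]:
    """Keep the first document of each group of near-duplicates."""
    kept: List[str] = []
    kept_sigs: List[Tuple[int, ...]] = []
    buckets: Dict[Tuple[int, Tuple[int, ...]], List[int]] = defaultdict(list)
    for doc in docs:
        sig = minhash(shingles(doc), num_perm=bands * rows)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        # Candidates are kept documents sharing at least one band with this one.
        candidates = {i for key in keys for i in buckets.get(key, [])}
        is_duplicate = any(
            sum(a == b for a, b in zip(sig, kept_sigs[i])) / len(sig) >= threshold
            for i in candidates
        )
        if is_duplicate:
            continue
        index = len(kept)
        kept.append(doc)
        kept_sigs.append(sig)
        for key in keys:
            buckets[key].append(index)
    return kept
```

Raising the threshold or the number of rows per band makes deduplication less aggressive, which is one way to express the trade-off between a cleaner dataset and preserving scarce Arabic data.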

Unlike English, where large-scale open datasets and established preprocessing pipelines already exist, we had to custom-build our pipeline for Arabic. Our experimentation with smaller LLMs informed many of the heuristic choices we used in our final preprocessing pipeline. A key challenge we navigated was the balance between aggressive deduplication, to ensure a cleaner dataset, and a less aggressive approach, to preserve the limited amount of Arabic data. Ultimately, our rigorous efforts in data processing allowed us to develop an effective pre-training dataset on which the foundations of our model rest.

Training:

Trained on a dataset of 350 billion tokens, Jais uses data in Arabic (72 billion tokens), English (232 billion tokens), and programming languages (46 billion tokens). Our Arabic data is seen approximately 1.6 times, effectively equating to about 115 billion tokens. The total number of tokens the model is trained on is therefore approximately 395B. The training of this model was conducted on 16 Cerebras CS-2 Wafer-scale compute engines as part of the Cerebras Condor Galaxy 1 supercomputer. By connecting these CS-2 systems into one unified unit, the training process could be completed in a purely data-parallel setup, allowing us to load the entire model and to train significantly faster than on GPU-based clusters. Using 16 CS-2s, our total pre-training process was completed in approximately 22 days.
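
The token budget can be reproduced with simple arithmetic from the rounded figures quoted above. With those figures the total comes to roughly 393B, slightly under the 395B stated; the gap may simply reflect rounding of the 1.6 repetition factor. The share calculations at the end are derived values added here for illustration.

```python
# Token counts in billions, as quoted in the text.
arabic_b, english_b, code_b = 72, 232, 46
arabic_epochs = 1.6  # approximate number of passes over the Arabic data

unique_tokens_b = arabic_b + english_b + code_b       # size of the dataset
arabic_seen_b = arabic_b * arabic_epochs              # Arabic tokens seen
total_seen_b = arabic_seen_b + english_b + code_b     # tokens seen in training

# Arabic share of all tokens seen, and of the Arabic-plus-English portion.
arabic_share_all = arabic_seen_b / total_seen_b
arabic_share_text = arabic_seen_b / (arabic_seen_b + english_b)

print(f"unique: {unique_tokens_b}B, Arabic seen: {arabic_seen_b:.1f}B, "
      f"total seen: {total_seen_b:.1f}B")
print(f"Arabic share: {arabic_share_all:.0%} of all tokens, "
      f"{arabic_share_text:.0%} of Arabic plus English")
```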

During training, we observed that the Arabic language capabilities continued to improve as the number of times Arabic data is seen extends beyond 1, creating a pathway to compensate for the limited volume of Arabic data as compared to English or code. By continuing to add new English and code data while repeating Arabic, the model’s linguistic and reasoning capabilities continue to improve while maintaining its ability to process and to generate high-quality Arabic content. Moreover, bilingual transfer helps capabilities and knowledge acquired in one language, such as English, become accessible in Arabic, and vice versa.

The decisions to train a bilingual model and to repeat the Arabic content are based on findings from experiments conducted prior to starting this run. By training a series of Arabic-only models, we established that the bilingual model outperforms an Arabic-only large language model in downstream Arabic language NLP tasks. This indicates that the English dataset helps to improve the performance of the model on Arabic via cross-lingual transfer.

Training Setup:

The Jais model was trained on the recently announced Condor Galaxy-1 (CG-1) supercomputer from Cerebras, built in partnership with G42. The foundation of the CG-1 cluster is the Cerebras Wafer Scale Engine (WSE) within the CS-2 system, the largest and most powerful AI processor currently available. The CG-1 supercomputer is designed using the unique Wafer-Scale Cluster architecture running the Weight Streaming execution model. CG-1 will have up to 4 exaFLOPs of sparse AI compute, delivered through 54 million compute cores, in a 64 CS-2 system cluster.

The Cerebras Wafer-Scale Cluster is a purpose-built system architecture that lets up to 192 Cerebras CS-2 systems be connected and operate as a single logical ML accelerator. The design decouples memory from compute, allowing it to deploy terabytes of memory for AI models, as opposed to gigabytes when using traditional GPUs. Within the cluster architecture, the CG-1 cluster uses a special execution mode called Weight Streaming, which fully bypasses the complexities of distributing models on traditional GPU clusters using model parallelism, and therefore provides simpler and higher performance scaling.

Instruction-tuning:

On its own, a pre-trained large language model is capable of generating coherent content that can be virtually indistinguishable from that written by humans. It is able to follow simple instructions and to complete new tasks when given examples or demonstrations. However, the model may not follow more complex instructions, and usually cannot generalize to new non-trivial tasks without demonstrations. Most importantly, the model tends to generate harmful or factually incorrect text for certain inputs, which makes it challenging to deploy in certain situations, e.g., as a virtual assistant that can productively interact with humans. To address this, we conducted instruction fine-tuning, which is the task of aligning the model towards the responses that humans desire.

We fine-tuned our LLM to follow instructions by building a dataset that contains pairs of instructions along with the desired “ideal” response. Our instruction dataset covers a wide range of commonly used tasks such as question answering, code generation, translation between English and Arabic, reasoning over textual content, etc. This included the full use of a wide array of publicly available instruction-tuning datasets. By increasing the model’s exposure to a diverse set of tasks and prompting patterns, we increase its ability not only to follow instructions, but also to generalize its abilities across many domains.
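
An instruction-tuning record of the kind described above, an instruction paired with its desired response, can be represented as follows. The field names and the prompt template are hypothetical and chosen only for illustration; they are not the format used to train Jais-chat.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InstructionExample:
    instruction: str
    response: str
    language: str  # e.g. "ar" or "en"


# Hypothetical template; the actual Jais-chat prompt format may differ.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def to_training_pair(example: InstructionExample) -> Tuple[str, str]:
    """Return (prompt, target): the model reads the prompt and learns the target."""
    prompt = PROMPT_TEMPLATE.format(instruction=example.instruction.strip())
    return prompt, example.response.strip()


example = InstructionExample(
    instruction="Translate to Arabic: Good morning",
    response="صباح الخير",
    language="ar",
)
prompt, target = to_training_pair(example)
```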

Since instruction-tuning datasets are mainly available in English (with the exception of datasets like xP3, which are multilingual), a specific subset of the datasets was translated into Arabic using our in-house machine translation models. The translated dataset helps to expand the Arabic corpus during fine-tuning and considerably improves the model’s ability to follow Arabic instructions. It also allows the model to seamlessly navigate between English and Arabic instructions, enhancing its bilingual capabilities.

To further increase the availability of high-quality Arabic language in the fine-tuning corpus, we also introduce internally developed native Arabic instruction-tuning data, focused on regionally pertinent topics. This fresh dataset helps to ensure that the model is well-attuned to the intricacies of the Arabic language and culture, while encompassing prompting variations and idiomatic expressions that may occur in various settings.

Our instruction-tuning significantly elevated the model’s performance across a spectrum of Arabic and English NLP benchmarks. The evident progress in downstream tasks underscores the efficacy of the fine-tuning approach in boosting the model’s linguistic and reasoning capabilities.

Performance Evaluations:

We evaluate our models against a comprehensive range of downstream NLP tasks in both languages. In Arabic, we use 11 tasks ranging from common-sense reasoning to natural language understanding tasks such as sentiment analysis, irony detection, and hate speech detection. The reported score is the average performance across all tasks. Similarly for English, we consider 11 downstream tasks and report the average score. We present our findings below.

The Arabic-English bilingual LLM notably surpasses all existing open Arabic (or multilingual) models by a substantial margin, reflecting its tailored Arabic language capabilities. Moreover, its competitiveness with state-of-the-art English models of similar size such as LLaMA 2, despite using significantly less English data, showcases its cross-lingual adaptability.

Pretrained Models

We compare models primarily at the 6-7B and the 13B model scales. At the 6-7B scale, our bilingual model outperforms all other models, including multilingual models like BLOOM, by a large margin in Arabic downstream evaluation tasks.

At the 13B scale, our bilingual model outperforms both LLaMA and LLaMA 2 in Arabic downstream evaluations. This is expected since neither model was trained on any significant amount of Arabic data. Interestingly, the impact of smaller training data on the English downstream performance is much lower at the 13B scale, where LLaMA 2 is approximately 1 point above Jais, despite being trained on more than 7 times the amount of English data.

Instruction-Tuned Models

Among instruction-tuned models, we see a similar trend as among foundation models, where our models outperform the others in Arabic downstream tasks. Interestingly, even though LLaMA 2 with 13B parameters is a model trained with both supervised fine-tuning and reinforcement learning with human feedback, its advantage over Jais on English downstream tasks is less than 1 point. This shows that our 13B model is also fairly good at English natural language understanding and generation, considering the English data size in pretraining.

In summary, our 13B Arabic-English bilingual LLM demonstrates the effectiveness of purposeful design and careful training. This LLM bridges the gap created by limited Arabic data availability (in comparison to English) and showcases its strength against both Arabic and English monolingual models. It establishes that a focused bilingual model, which includes only two major languages, outperforms a highly multilingual model. Although including English datasets has been shown to improve Arabic performance, this behavior may not extend to training and instruction-tuning on several languages together, as illustrated by the large margin by which our models outperform the BLOOM family of models on Arabic tasks. The success achieved at the 13B scale opens a promising path forward for future work in this direction.

Impact

The development and deployment of a bilingual Arabic-English LLM holds the promise of far-reaching implications across linguistic, cultural, and technological dimensions, with a strategic impact that positions governmental and commercial organizations at the forefront of the digital revolution. Our endeavor is a step towards a future where the power of cutting-edge natural language processing (NLP) not only bridges language barriers, but also fuels advancements in understanding, generation, and deployment of Arabic language applications in diverse contexts.

Empowering the Arabic NLP Community:

The introduction of a powerful, competitive bilingual Arabic-English LLM opens doors to unprecedented advances in Arabic language understanding and generation within the region’s Arabic NLP community. Harnessing the model’s capabilities, researchers, educators, and innovators are empowered to explore novel use cases. The possibilities range from creative content generation to virtual assistants, and integration into more complex systems such as digital avatars. This empowerment not only drives innovation, but also strategically positions the Arabic NLP community as a key player in the global NLP landscape.

Sovereign LLM Implementation:

The inherent sovereignty of this LLM allows organizations across the Middle East to leverage and to deploy the model within their own infrastructures. Our fully in-house implementation ensures complete control over the model’s usage and fine-tuning and inference, promoting self-reliance while reducing dependency on external resources. By implementing the LLM locally, governmental and commercial entities can strategically position themselves as technological leaders, driving innovation and digital transformation in their respective domains.

Privacy-Enhanced On-Premise Deployment:

A significant outcome of this endeavor is the capability for local players to fine-tune and to deploy the model on-premise, ensuring complete data privacy and security. The protection of sensitive personal information not only engenders trust, but also strategically positions organizations to excel in today’s increasingly privacy-conscious environment. This enables the development of diverse applications, positioning governmental and commercial entities as pioneers in safeguarding individual privacy while delivering advanced Arabic language solutions.

Catalyzing Arabic-Centric Downstream Applications:

A robust Arabic-English LLM will ignite interest within the community, sparking a surge of enthusiasm for Arabic-focused LLMs. This renewed focus on linguistic and cultural nuances stimulates the creation of a myriad of downstream applications that cater to Arabic-speaking populations. By strategically leveraging these applications, governmental and commercial organizations can position themselves as thought leaders, driving innovative solutions aligned with the region’s cultural heritage and linguistic diversity.

Conclusion

We have introduced Jais, a new state-of-the-art Arabic-English bi-lingual large language model (LLM). This model performs a wide range of generative and downstream language tasks in both Arabic and English, ranging from common-sense reasoning to natural language understanding tasks such as sentiment analysis, irony detection, and hate speech detection. Its pre-trained and fine-tuned capability as described here outperforms all known open Arabic models and yields performance that is comparable to state-of-the-art open English models that were trained on larger datasets.

The models are available under the Apache 2.0 license on HuggingFace. A conversational interface hosting the instruction-tuned model is available for trial in our “playground” environment by registering. We encourage researchers, hobbyists, and enterprise developers alike to experiment with and to develop on top of our model, particularly those working on multilingual or non-English applications. We welcome all feedback and opportunities to collaborate.

Jais is the first of many planned developments of the G42-Cerebras Systems strategic partnership. This partnership – a part of the UAE’s broader national AI initiative – aims to advance fundamental AI research, build and broaden access to world-leading high performance AI compute, contribute to and support open source communities, and enable a new generation of innovative enterprise application developers.

Moreover, Jais represents an important evolution and expansion of the NLP AI landscape in the Middle East. This first-of-its-kind Arabic model born in the UAE is a strategic step for governmental and commercial organizations toward the forefront of the digital revolution. By advancing Arabic language understanding and generation, empowering local players with sovereign and private deployment options, and nurturing a vibrant ecosystem of applications and innovation, this work supports a broader strategic initiative of digital and AI transformation to usher in an open, more linguistically inclusive, and culturally aware era.

If you’re interested in contributing or learning more beyond the open-source release, please reach out to our team at info@inceptioniai.org.

References

Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
Introducing Claude: https://www.anthropic.com/index/introducing-claude
Training language models to follow instructions with human feedback: https://arxiv.org/abs/2203.02155
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model: https://arxiv.org/abs/2211.05100
Crosslingual Generalization through Multitask Finetuning: https://arxiv.org/abs/2211.01786
LLaMA: Open and Efficient Foundation Language Models: https://arxiv.org/abs/2302.13971
LLaMA 2: Open Foundation and Fine-Tuned Chat Models: https://arxiv.org/abs/2307.09288
An overview of Bard: an early experiment with generative AI: https://ai.google/static/documents/google-about-bard.pdf
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer: https://arxiv.org/abs/2203.03466
CAMeL Tools: https://github.com/CAMeL-Lab/camel_tools#camel-tools
Introducing Falcon LLM: https://falconllm.tii.ae/
Hugging Face Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Authors

Dr. Andrew Jackson, CEO of Inception, a G42 company