event

The 1st Multilingual Model Workshop,Hosted by Cerebras

Data, models, and methods to train models beyond English

Date:

Friday, December 15th, 8am to 3pm Local Time

Location:

Hilton Garden Inn New Orleans Convention Center

Address:

1001 S Peters St, New Orleans, LA 70130

Description

Large language models serve a variety of generative AI use cases. However, they are primarily trained on English datasets that do not possess cultural information to reach a global audience adequately.

This workshop aims to bring together researchers working on building non-English, bi-lingual, and multi-lingual language models. We are interested in all aspects, implications, and challenges of building non-English and multilingual models. This includes the following:

  • Dataset sourcing and cleaning
  • Best practices for mixing different languages and datasets
  • Pre-training and continuous training recipes in data-constrained environments with low-resource languages
  • Instruction tuning without instruction datasets available in target languages
  • Benchmarking and evaluation of these models in the world where most of the public and commonly used benchmarks are in English
  • Alignment with target cultural aspects

As a group, we will share our mistakes and learnings, best practices, and aspirations. We aim to bring together experts in the field, engage in a meaningful dialogue, and foster solutions that promote equity and inclusivity in the AI landscape.


Session I


Neha Sengupta

Principal Applied Scientist, G42

Developing Arabic centric bilingual LLMs​

Abstract: ​This talk will present an overview of our experience of training Jais and Jais-chat, a family of Arabic centric bilingual LLMs. The Jais and Jais-chat models were trained by Core42, Cerebras ,and MBZUAI in collaboration on the Condor Galaxy supercomputer. At 30 Billion parameters, Jais and Jais-chat are the world’s largest and best performing Arabic centric open LLMs.​

We will begin by discussing the motivating factors and primary challenges of training Jais, including those of Arabic data collection and processing. We will dive into the supervised fine-tuning data and methodology for building Jais-chat, a bilingual Arabic-centric chat model. We will also discuss our ongoing work and preliminary results on aligning Jais-chat to human preferences. Finally, we will conclude with the ways to access Jais, and the roadmap ahead.


Joel Hestness

Principal Research Scientist, Cerebras Systems

Pretraining the Jais Bilingual Arabic-English Language Models

Abstract: In this talk, we will present details of the pretraining process for Jais models, a series of bilingual Arabic-English language models. Jais models are the current best Arabic models, and they show English capability competitive with state-of-the-art models, even when trained with fewer English tokens. We have scaled Jais models to 13B and 30B parameters.

We will discuss pretraining techniques we combined that result in state-of-the-art Arabic capabilities. Our vocabulary selection process ensures the model can access balanced capability in both Arabic and English. We will describe our use of Maximal Update Parameterization, which simplifies hyperparameter selection leading to predictable model scaling. Our scaling laws tests show we can mix Arabic and English in a 1:2 ratio and achieve near-perfect scaling in both languages.

Jais models are developed through a collaboration between Core42’s Inception, the Mohammed Bin Zayed University (MBZUAI), and Cerebras Systems.


Preslav Nakov

Professor at MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Guardrails and Evaluation of the Jais Bilingual Arabic-English Language Models

Abstract: We will discuss the guardrails we developed for the Jais bilingual Arabic-English language models. This was achieved using (i) meticulous data cleansing, (ii) instruction-tuning for safety, (iii) safety prompt engineering, and (iv) developing keyword lists and classifiers to run at inference time. For (ii), we developed a risk taxonomy and examples that cover 5 risk areas, 12 harm types, and 61 specific harm categories.

We will further discuss evaluation. In addition to perplexity, we used downstream evaluation, for Arabic and English, covering world knowledge, commonsense reasoning, and misinformation & bias. For evaluating the model in Arabic, we used pre-existing datasets such as EXAMS, which contained Arabic matriculation questions, we curated our own dataset covering Arabic literature, and we translated English evaluation datasets to Arabic (manually, for MMLU; and automatically, using an in-house system, for the other English datasets). We further performed generation evaluation using GPT-4 as a judge. Yet, whenever feasible, we performed human evaluation, which was the most important input that informed many important decisions about building the model.


Rio Yokota

Professor, Tokyo Institute of Technology

Overview of Japanese Efforts to Train LLMs​

Abstract: Like other countries, companies and government agencies in Japan have been investing heavily in LLMs. This talk will give an overview of the recent efforts to pre-train LLMs in Japan in both academia and industry. Details of the corpus, models, and pre-training configuration will be highlighted for some of the largest runs with up to 175B parameters. The first half of the talk will focus on the overview of the institutions involved, funding, and computational resources. The latter half will focus more on the technical details of the pre-training, and issues we are facing.


Session II


Felix Stollenwerk

Senior Research Scientist, AI Sweden

GPT-SW3: The first large-scale generative language model for Swedish and the Nordic Languages​

Abstract: This talk gives an overview of the process of building GPT-SW3, a multilingual LLM trained on 6 languages. We cover data ingestion and preprocessing, training and evaluation of the model as well as future plans. The focus of the presentation will be on multilingual aspects and challenges.


Irene Baucells de la Peña

Research Engineering, Barcelona Supercomputing Center

Joan Llop Palao

Research Engineering, Barcelona Supercomputing Center

Evaluating Language Adaptation Techniques for Mid-Resource Languages​

Abstract: Large language models have provided convincing evidence of their significant capabilities, both in downstream tasks and in real-world scenarios. However, low- and mid-resource languages lack the necessary resources to train such models from scratch and often have to rely on multilingual models despite being underrepresented in the training data. This talk describes the experiments that prove that, for mid-resource languages, continued pre-training with vocabulary adaptation is a better alternative resulting in a significant improvement over a random initialization of weights. We perform a comprehensive evaluation to assess the effectiveness of the different techniques to perform continued pre-training from an existing model. Vocabulary adaptation proves to be the most effective technique in terms of performance gains and training efficiency.”


Ahmet Ustun

Research Scientist, Cohere AYA

Accelerating Multilingual AI Progress with AYA​

Abstract: Aya project is an ongoing collaborative open science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people worldwide. The project was initiated by Cohere For AI as a multi-institutional collaboration with the help of a community of researchers, engineers, linguists, social scientists, and lifelong learners from over 100 countries around the world. In the Aya project, we want to improve available multilingual generative models and accelerate progress for languages across the world. In this talk, I will give a brief overview of AYA, present the work that has been done, and the outcome that might be expected from the project.


Session III


Kshitij Gupta

Graduate Student, MILA

Ayush Kaushal

Graduate Student, University of Texas-Austin

Continued Pre-training of LLMs​

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Our vision is to develop methods that can enable efficiently updating pretrained models with new knowledge while preventing forgetting of past knowledge. Taking a step towards efficient continual pre-training, we examine the effect of different warm-up strategies and replay when continuing to pre-train models on new data and new languages.


Pratyush Kumar

Co-Founder, Sarvam.ai

Training an instruction tuned and aligned LLM for Indian languages

Abstract​: This talk will cover challenges in creating high-quality pre-training, finetuning, and preference optimization datasets. We will share our experiences in training the models and discuss benchmarks and accuracy evaluation. We will also announce availability of the models at a hosted inference endpoint and discuss how the model brings a significant efficiency gain for Indian languages over existing models (including GPT-x).


Olubayo Adekanmbi

Founder/CEO Data Science Nigeria

Towards LLM Generative Diversity: Highly-nuanced and Context-aware Data Generation approaches

Abstract: Large Language Models (LLMs), such as GPT-3.5 have shown their prowess in generating human-like text based on extensive training on vast corpora of data. Understanding ambiguity, particularly in the presence of cultural and linguistic diversity, stands as a formidable task. The same word or phrase can hold distinct meanings based on the underlying socio-linguistic context and cultural underpinnings.

My presentation will discuss our various works in Nigeria on the evaluation of the ability of Large Language Models like GPT3.5 to capture the deep contextual meaning of cultural nuances, with a view to establishing the need to address the current gaps to support the increased local usage of LLMs in health, financial inclusion, agriculture and education in emerging markets.

blog

Context is Everything: Why Maximum Sequence Length Matters

GPU-Impossible™ sequence lengths on Cerebras systems may enable breakthroughs in Natural Language Understanding, drug discovery and genomics.

Read
Blog

Cerebras Sets Record for Largest AI Models Ever Trained on Single Device

Our customers can easily train and reconfigure GPT-3 and GPT-J language models with up to 20 billion parameters on a single CS-2 system

Read
Blog

TotalEnergies and Cerebras Create Massively Scalable Stencil Algorithm

TotalEnergies used the Cerebras CS-2 system to turn a problem long accepted to be memory-bound into compute-bound. On a benchmark case inspired by a seismic kernel used to image the Earth, the CS-2 delivered more than 200x performance compared to a NVIDIA® A100 GPU.

Read