Cerebras and Opentensor are pleased to announce BTLM-3B-8K (Bittensor Language Model), a new state-of-the-art 3 billion parameter open-source language model that achieves breakthrough accuracy across a dozen AI benchmarks. Given that the most popular model on Hugging Face today is 7B, we believe compacting 7B performance into 3B parameters is an important milestone in enabling AI access on mobile and edge devices. Unlike large models like GPT-3 that run from the cloud, BTLM fits on mobile and edge devices with as little as 3GB of memory, helping democratize AI access across billions of devices worldwide.

BTLM was trained on the newly unveiled Condor Galaxy 1 (CG-1) AI supercomputer, the first public deliverable of the strategic partnership between Cerebras and G42. We would like to acknowledge the generous support of two G42 companies, G42 Cloud and IIAI, who provided assistance in this work. We would also like to thank our partner Cirrascale, who first introduced Opentensor to Cerebras and provided additional technical support.

BTLM-3B-8K is available on Hugging Face with an Apache 2.0 license for commercial use.

BTLM-3B-8K Highlights:

  • 7B level model performance in a 3B model
  • State of the art 3B parameter model
  • Optimized for long sequence length inference of 8K tokens or more
  • First model trained on SlimPajama, the largest fully deduplicated open dataset
  • Runs on devices with as little as 3GB of memory when quantized to 4-bit
  • Apache 2.0 license for commercial use

BTLM was commissioned by the OpenTensor foundation for use on the Bittensor network. Bittensor is a blockchain-based network that lets anyone contribute AI models for inference, providing a decentralized alternative to centralized model providers like OpenAI and Google. Bittensor serves over 4,000 AI models with more than 10 trillion model parameters across the network.

Large Models Don’t Fit on Small Devices

Figure 1: Memory requirements of different model sizes and quantization schemes

Large GPT models typically have over 100B parameters, requiring multiple high-end GPUs to perform inference. The release of LLaMA from Meta gave the world high-performance models with as few as 7B parameters, making it possible to run LLMs on high-end PCs. However, even a 7B parameter model quantized to 4-bit precision does not fit in many popular devices such as the iPhone 13 (4GB RAM). While a 3B model would comfortably fit on almost all mobile devices, prior 3B-sized models substantially underperformed their 7B counterparts.
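As a rough back-of-the-envelope sketch of why parameter count and precision dominate the memory budget (weights only; real deployments also need room for the KV cache, activations, and runtime overhead):

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# Real deployments add the KV cache, activations, and runtime overhead on top.

def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory in GB required for the raw weights alone."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

for n_params in (3, 7, 13):
    for bits in (16, 8, 4):
        print(f"{n_params}B params @ {bits}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB")

# A 7B model at 4-bit needs ~3.5 GB for weights alone, which is tight on a
# 4GB phone once the OS and runtime are counted, while a 3B model at 4-bit
# needs roughly 1.5 GB.
```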

In May, the OpenTensor foundation approached us to train a 3B model that (1) achieves state of the art accuracy, and (2) can perform inference with very long sequence lengths. This work led to today’s release of BTLM, a new state of the art 3B model trained with a context window of 8,192 tokens and the ability to extrapolate beyond this.

A New Standard for 3B Model Performance

BTLM sets a new standard in 3B parameter model quality, outperforming existing 3B models by a substantial margin. This is particularly noteworthy considering BTLM was trained on only 627B tokens – significantly fewer than the 800B tokens used for RedPajama-INCITE-3B and the 1 trillion used for OpenLLaMA-3B.

Figure 2: Performance at 3B model size.

When looking at individual benchmarks, BTLM scores highest in every category with the exception of TruthfulQA. In RACE-middle it is tied with OpenLLaMA 3B v2.

Table 1: Performance at 3B model size. Detailed downstream task comparisons. MMLU task performance is reported using 5-shot; all other tasks are 0-shot.

Not only does BTLM-3B outperform all 3B models, it also performs in line with many 7B models.

Figure 3: Performance at 7B model size.

BTLM-3B surpasses the accuracy of RedPajama-INCITE-7B-Base, OpenLLaMA 7B, and Stable-LM-7B with 71% less training compute. BTLM-3B has a 58% smaller memory footprint and 2x faster inference than 7B models. This result will enable the power of 7B models to be more widely available in an easily deployable 3B package.
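One way to see where the compute saving comes from is the standard 6·N·D approximation for training FLOPs. The parameter counts below are illustrative assumptions, so the ratio only roughly matches the reported 71% figure:

```python
# Standard approximation: training FLOPs ~= 6 * parameters * training tokens.
# The parameter counts below are illustrative assumptions, not exact sizes.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

btlm = train_flops(2.6e9, 627e9)       # BTLM-3B trained on 627B tokens
baseline = train_flops(6.7e9, 1.0e12)  # a typical 7B model trained on 1T tokens

print(f"BTLM: {btlm:.2e} FLOPs, 7B baseline: {baseline:.2e} FLOPs")
print(f"BTLM uses roughly {1 - btlm / baseline:.0%} less training compute")
```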

Figure 4: Comparisons of quality, memory footprint & inference cost between BTLM-3B-8K and 7B model families.

Long Sequence Length Inference

To enable long sequence applications, we use ALiBi position embeddings and trained on 470B tokens at a context length of 2,048, followed by 157B tokens at a context length of 8,192. To assess BTLM’s long sequence capability, we evaluate the model on the SlimPajama test set with a 32,768-token context and plot the loss at each token position. Although ALiBi allows extrapolation in theory, training at a 2,048 context length alone does not extrapolate well in practice. Thankfully, variable sequence length training allows us to substantially improve extrapolation. BTLM-3B extrapolates well up to a 10K context length, but performance degrades slightly beyond this.

Figure 5: BTLM-3B cross-entropy evaluation on the SlimPajama test set. Inference performed at the extrapolated sequence length of 32,768 tokens.
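ALiBi replaces learned position embeddings with a head-specific linear penalty added to the attention logits, which is what makes extrapolation beyond the training context possible at all. Below is a minimal sketch of how the bias is constructed; it is illustrative and not the exact BTLM implementation:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the additive (n_heads, seq_len, seq_len) ALiBi bias for causal attention."""
    # Head-specific slopes: a geometric sequence, as in the ALiBi paper
    # (simplified here for head counts that are powers of two).
    start = 2 ** (-8 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])

    # Signed distance j - i between key position j and query position i.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).tril().float()

    # The bias slope * (j - i) is <= 0 for causal positions, so more distant
    # keys receive a larger penalty when added to the attention logits.
    return slopes[:, None, None] * distance[None, :, :]

# Because the penalty is a simple linear function of distance, the same slopes
# apply at any sequence length, including lengths never seen during training.
bias = alibi_bias(n_heads=8, seq_len=16)
```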

Training

Figure 6: Training loss curve.

To achieve the milestone of 7B performance with a 3B model using 71% fewer training FLOPs, we combined multiple improvements to the training process. BTLM-3B is based on Cerebras-GPT with additional architectural and training improvements.

The model was trained on 627B tokens from the SlimPajama dataset, our cleaned and deduplicated version of the RedPajama dataset. Training on deduplicated data allowed us to achieve higher accuracy with a smaller compute budget.

Data source     SlimPajama   RedPajama   RefinedWeb
CommonCrawl     52.2%        72.6%       100%
C4              26.7%        14.4%       0%
GitHub          5.2%         4.9%        0%
Books           4.2%         2.1%        0%
ArXiv           4.6%         2.3%        0%
Wikipedia       3.8%         2.0%        0%
StackExchange   3.3%         1.7%        0%

Table 2: Dataset source proportions for SlimPajama and other open-source datasets.
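The deduplication described above is typically done with approximate near-duplicate detection rather than exact matching. The sketch below uses MinHash with locality-sensitive hashing via the datasketch library; the shingle size, number of permutations, and similarity threshold are illustrative assumptions, not the exact SlimPajama pipeline settings:

```python
from datasketch import MinHash, MinHashLSH

# Illustrative settings; a production pipeline tunes these and runs
# distributed across the full corpus.
NUM_PERM = 128     # number of MinHash permutations
THRESHOLD = 0.8    # estimated Jaccard similarity above which two docs count as duplicates
SHINGLE = 13       # character n-gram length over lowercased text

def signature(text: str) -> MinHash:
    """Compute a MinHash signature over character shingles of the document."""
    m = MinHash(num_perm=NUM_PERM)
    text = text.lower()
    for i in range(max(1, len(text) - SHINGLE + 1)):
        m.update(text[i:i + SHINGLE].encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return ids of documents kept after dropping near-duplicates."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = signature(text)
        if lsh.query(sig):      # a near-duplicate document was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```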

We also found that decaying the learning rate to 0.8% of its maximum value and switching from the GeLU nonlinearity to SwiGLU further improved training efficiency.
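For reference, SwiGLU replaces the single GeLU feed-forward block with a gated variant. A minimal PyTorch sketch of such a layer (the dimensions and the absence of biases are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: W2(SiLU(x W1) * (x W3)), replacing W2(GeLU(x W1))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# The gate adds an extra projection per hidden unit, so the hidden width is
# usually reduced (e.g. to roughly 2/3 of the GeLU version) to keep the total
# parameter count comparable.
x = torch.randn(2, 16, 512)
print(SwiGLUFeedForward(d_model=512, d_hidden=1365)(x).shape)
```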

Finally, we found substantial training efficiency improvements through improved hyperparameter tuning with the maximal update parameterization. The key here was to use sufficiently large batch sizes for both the small proxy models used for hyperparameter search and the final large model. The maximal update parameterization also helped ensure training stability. In our upcoming paper we will share more details, including an ablation study for each training improvement.
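To give a flavor of how the maximal update parameterization transfers hyperparameters from a small proxy to the full model, here is a simplified sketch of a few of the width-scaling rules; it is an illustrative subset, not the complete parameterization used for BTLM:

```python
# Simplified muP-style transfer: tune hyperparameters on a narrow proxy model,
# then rescale a few quantities by the width ratio for the full-size model.
# This is an illustrative subset of the rules, not the full parameterization.

def mup_transfer(base_width: int, target_width: int,
                 base_lr: float, base_init_std: float) -> dict:
    m = target_width / base_width   # width multiplier
    return {
        "hidden_lr": base_lr / m,                     # matrix-like weights: lr scales as 1/m
        "hidden_init_std": base_init_std / m ** 0.5,  # init std scales as 1/sqrt(m)
        "output_logit_scale": 1.0 / m,                # logits are divided by m in the forward pass
        "embedding_lr": base_lr,                      # embedding lr is kept at the base value
    }

# e.g. settings found on a narrow proxy applied to a much wider model
print(mup_transfer(base_width=256, target_width=2560,
                   base_lr=6e-3, base_init_std=0.02))
```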

Hardware

BTLM was trained on the CG-1 AI supercomputer, which is the first deliverable of the G42 Cerebras strategic partnership. CG-1 is a 4 exaFLOP AI supercomputer located in Santa Clara, California, and built by G42 and Cerebras. Access to CG-1 was generously provided by two of G42’s portfolio companies, G42 Cloud and IIAI. CG-1 is a shared computing resource with multiple concurrent users across G42 and Cerebras.

During the training run we needed to interleave with multiple high priority jobs on the cluster. Thanks to the simplicity of our purely data parallel stack, we were able to easily scale up and down our training to different numbers of CS-2 nodes without any code or configuration changes. The purely data parallel interface of the Cerebras weight streaming architecture eliminates the need to break up models using model, tensor, or pipeline parallelism, greatly simplifying scaling and debugging.

We encountered zero hardware failures over the course of this run, demonstrating the reliability of the CS-2. We are proud that CG-1, the initial deliverable of the G42 Cerebras strategic partnership, is making an immediate contribution to the open-source ML community.

Figure 7: Visualization of how the training run was scaled between different numbers of CS-2 systems depending on cluster availability.

Conclusion

By using the unique combination of the maximal update parameterization, improved hyperparameter tuning, updated model architecture, extensive data cleaning and deduplication, and variable sequence length training, BTLM sets a new standard in 3B parameter models and achieves accuracy comparable to many 7B models.

BTLM quantizes down to less than 3GB at 4-bit, making it the ideal model to deploy on popular mobile devices such as the base MacBook Air M1 and iPhone 13.

Figure 8: BTLM runs out of the box on an 8GB MacBook Air and runs on an iPhone 13 when quantized to 4-bit or 5-bit.

The BTLM training run demonstrates the speed, simplicity, and scalability of training on Cerebras CS-2 systems. We gratefully thank G42 and IIAI for making the Condor Galaxy 1 AI supercomputer available for training BTLM.

BTLM is available today on the Cerebras Hugging Face repo. BTLM will be deployed on the Bittensor network on July 27th, 2023.
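For readers who want to try the model, below is a minimal sketch using the Hugging Face transformers library with the published cerebras/btlm-3b-8k-base checkpoint; see the model card for the recommended settings (the custom architecture requires trust_remote_code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal generation sketch; refer to the model card for recommended settings.
model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("BTLM is a 3B parameter language model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```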

Next steps

To improve the usability of BTLM, we plan to release instruction fine-tuned model variants. We will also release a white paper with the full details of the BTLM training process and extensive model evaluations.

Authors

Nolan Dey*, Daria Soboleva*, Faisal Al-Khateeb, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming (Charles) Chen, Bowen Yang, Siyun Li, Abhay Gupta, Shreyas Saxena, Robert Myers, Jacob Robert Steeves, Marvin Tom, Joel Hestness
Marketing: James Wang
*Equal contribution

Appendix

To evaluate our models on downstream tasks we use the EleutherAI Language Model Evaluation Harness framework. We report accuracies normalized by the number of characters in the completion for all tasks except BoolQ, WinoGrande, RACE-middle, RACE-high, and TruthfulQA. For the TruthfulQA task we report the MC2 (multi-true) score. We use 0-shot evaluation for all tasks except MMLU, which uses 5-shot evaluation. For MMLU we report the average across the “Humanities”, “STEM”, “Social Sciences”, and “Other” categories.
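For reproducibility, evaluations of this kind can also be run programmatically through the harness. Function and argument names differ across harness versions, so treat this as an illustrative sketch rather than the exact invocation we used:

```python
from lm_eval import evaluator  # EleutherAI lm-evaluation-harness

# Illustrative invocation; argument names and supported model_args vary
# between harness versions.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=cerebras/btlm-3b-8k-base,trust_remote_code=True",
    tasks=["boolq", "hellaswag", "winogrande", "arc_easy", "arc_challenge"],
    num_fewshot=0,
)
print(results["results"])
```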

We also expanded the Eleuther harness to support the SIQA and RACE-middle tasks. We plan to make our changes publicly available by contributing them directly to the project’s GitHub repository.

Table 3: Performance at 7B model size. Detailed downstream task comparisons. MMLU task performance is reported using 5-shot; all other tasks are 0-shot.

Example Prompts

Prompt: “What is a manometer?”
Generation: “A manometric device is an instrument that measures the pressure of a fluid.”

Prompt: “What Robert Oppenheimer is famous for?”
Generation: “Robert Opperman was a physicist who was the director of the Manhattan Project, which was responsible for the development of nuclear weapons.”

Prompt: “What was Albert Einstein known for?”
Generation: “Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).”

Prompt: “Who is Isaac Newton?”
Generation: “Isaac Newton was an English physicist, mathematician, astronomer, alchemist, and theologian. He is widely recognized as one of the most influential scientists of all time and as a key figure in the scientific revolution.”

Prompt: “Where does the blood get pumped to after it leaves the right ventricle?”
Generation: “The right atrium receives the deoxygenated blood from the pulmonary artery. The right side of the heart is the only side that receives de-oxygened…”

Prompt: “Peter lost his wallet after attending a local concert. To help him, Bob gave David 10 dollars. David then passed it on to Peter. Now what is the total amount of money Peter has?”
Generation: “A. $10”
