Today we are releasing SlimPajama – the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. SlimPajama was created by cleaning and deduplicating the 1.21T token RedPajama dataset from Together. By filtering out low-quality data and duplicates, we were able to remove 49.6% of bytes, slimming the dataset down from 1210B to 627B tokens. We believe SlimPajama offers the highest quality and most compute-efficient data for training runs of up to 627B tokens. When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion-token scale.

In addition to the data, we are also releasing the tools we built to create SlimPajama. Applying MinHashLSH (Leskovec et al. 2014) deduplication to trillion-token datasets like RedPajama was not possible with off-the-shelf open-source code. We made several improvements to existing solutions to produce an infrastructure that can perform MinHashLSH deduplication on trillion-token datasets in a distributed, multi-threaded, and memory-efficient fashion. Today we are open-sourcing this infrastructure to enable the community to easily create higher quality, extensively deduplicated datasets in the future.

Our contributions are as follows:

  1. SlimPajama 627B – the largest extensively deduplicated, multi-corpora, open dataset for LLM training. We release it under the Apache 2.0 license at https://huggingface.co/datasets/cerebras/SlimPajama-627B
  2. Validation and test sets, 500M tokens each, which have been decontaminated against the training data
  3. A library of methods to replicate SlimPajama or to pre-process other datasets from scratch. To the best of our knowledge, these are the first open-source tools to enable cleaning and MinHashLSH deduplication of text data at trillion-token scale.

Why SlimPajama

The latest research (Penedo et al. 2023) has shown that data quality is as important as data quantity. While training on more than one data epoch can be beneficial, this should be a choice rather than a side-effect of duplicates in the dataset. We decided to extensively deduplicate RedPajama to produce a dataset with higher information density. This means that models trained on SlimPajama can reach higher accuracy with the same compute budget compared to models trained on other datasets.

| Dataset         | Tokens | Open source | Curated data sources | Deduplication level |
|-----------------|--------|-------------|----------------------|---------------------|
| SlimPajama      | 627B   | Yes         | Yes                  | Extensive           |
| RedPajama       | 1.21T  | Yes         | Yes                  | Partial             |
| RefinedWeb-600B | 600B   | Yes         | No                   | Extensive           |
| RefinedWeb-5T   | 5T     | No          | No                   | Extensive           |
| LLaMA           | 1.4T   | No          | Yes                  | Partial             |
| MPT             | 1T     | No          | Yes                  | Partial             |
| MassiveText     | 1.4T   | No          | Yes                  | Extensive           |

Table 1: Comparison of dataset features.

SlimPajama

Since LLaMA (Touvron et al. 2023) demonstrated the power of training smaller models on more than 1T tokens, there has been a push in the open-source community to replicate and extend these findings. The LLaMA data collection methodology was quickly replicated and released as the 1.21T token open-source RedPajama dataset.

Our original intention was to use the RedPajama data as-is, but upon analysis we discovered that some corpora were missing files while others had a large percentage of duplicates. RedPajama followed the deduplication guidelines in the LLaMA paper, which were less strict and only operated within each data source, not between them. To improve compute efficiency and data quality, we decided to produce a cleaned and extensively deduplicated version of the data, which led to the development of SlimPajama.

To produce SlimPajama, we first removed short, low-quality documents from RedPajama. After removing punctuation, space symbols, newlines, and tabs, we filtered out documents with fewer than 200 characters. These documents typically contain only metadata and no useful information. The low-length filter was applied to every corpus other than Books and GitHub, where we found useful short documents. In total, this removed 1.86% of documents from RedPajama.
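
To make the filtering rule concrete, here is a minimal sketch of this kind of low-length filter. It is illustrative rather than the exact implementation in our released scripts: the 200-character threshold matches the description above, while the function names and corpus labels are placeholders.

import string

# Characters ignored when measuring document length, following the description above.
STRIP_CHARS = set(string.punctuation) | {" ", "\n", "\t", "\r"}

# Corpora exempt from the filter, since short documents there can still be useful.
# The labels here are placeholders, not the exact RedPajama source names.
EXEMPT_CORPORA = {"book", "github"}

def is_low_length(text: str, threshold: int = 200) -> bool:
    """True if the document has fewer than `threshold` informative characters."""
    informative = [ch for ch in text if ch not in STRIP_CHARS]
    return len(informative) < threshold

def keep_document(text: str, corpus: str) -> bool:
    """Keep documents from exempt corpora; otherwise drop low-length documents."""
    return corpus in EXEMPT_CORPORA or not is_low_length(text)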

| Data source   | Document low-length filter rate |
|---------------|---------------------------------|
| CommonCrawl   | 0.02%                           |
| C4            | 4.70%                           |
| GitHub        | 0.00%                           |
| Books         | 0.00%                           |
| ArXiv         | 0.62%                           |
| Wikipedia     | 0.00%                           |
| StackExchange | 0.32%                           |
| Total         | 1.86%                           |

Table 2: Document low-length filter rates.

Training on deduplicated data makes language models better by improving training compute efficiency and reducing the chance of models producing text memorized from the training data (Penedo et al. 2023; Abbas et al. 2023; Lee et al. 2021; Holtzman 2019; Hugging Face 2023). Every corpus contained duplicates, with the most significant duplication found in CommonCrawl and GitHub. Penedo et al. (2023) found similar rates of duplication in CommonCrawl data. In total, we were able to prune 49.6% of bytes from RedPajama, leaving us with the 627B token SlimPajama dataset.

To perform deduplication we used MinHashLSH (Leskovec et al. 2014) with a Jaccard similarity threshold of 0.8. We constructed document signatures on top of pre-processed, lower-cased 13-grams. Pre-processing included removing punctuation, consecutive spaces, newlines, and tabs, as well as stripping any leading or trailing escape characters from each document. Deduplication was performed both within and between data sources.
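
For concreteness, the snippet below sketches this signature construction and duplicate lookup with the datasketch library (Zhu 2023). It is a simplified, single-process illustration rather than our distributed implementation: the 13-gram size and 0.8 Jaccard threshold match the description above, while parameters such as num_perm and the helper function names are illustrative assumptions.

import re
import string
from datasketch import MinHash, MinHashLSH

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text: str) -> str:
    # Lower-case, remove punctuation, and collapse consecutive whitespace
    # (spaces, newlines, tabs), then strip leading/trailing characters.
    text = text.lower().translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

def ngrams(text: str, n: int = 13):
    # Word-level 13-grams built on top of the pre-processed text.
    words = preprocess(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for gram in ngrams(text):
        m.update(gram.encode("utf-8"))
    return m

# LSH index at Jaccard similarity threshold 0.8; a query hit marks a near-duplicate.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = {"doc-0": "An example document ...", "doc-1": "An example document ..."}
duplicates = set()
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):
        duplicates.add(key)   # near-duplicate of an already-indexed document
    else:
        lsh.insert(key, sig)

Our released infrastructure performs the same kind of computation, but in a distributed, multi-threaded, and memory-efficient fashion across all corpora.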

| Data source   | Byte duplication rate |
|---------------|-----------------------|
| CommonCrawl   | 63.76%                |
| C4            | 6.85%                 |
| GitHub        | 46.16%                |
| Books         | 2.01%                 |
| ArXiv         | 0.06%                 |
| Wikipedia     | 2.24%                 |
| StackExchange | 0.20%                 |
| Total         | 49.60%                |

Table 3: Data source byte duplication rate.

Below we compare the data compositions of several related datasets (Together Computer 2023; Touvron et al. 2023; MosaicML NLP Team 2023; Penedo et al. 2023). SlimPajama has a comparatively lower proportion of web data than RedPajama and RefinedWeb, and comparatively higher proportions of high-quality data from Books, arXiv, and Wikipedia. While Penedo et al. (2023) rightly point out that filtering web data alone is more scalable, SlimPajama provides higher quality data through curation of sources.

| Data source                        | SlimPajama | RedPajama | LLaMA | MPT   | RefinedWeb | MassiveText |
|------------------------------------|------------|-----------|-------|-------|------------|-------------|
| CommonCrawl                        | 52.2%      | 72.6%     | 67.0% | 10.0% | 100%       | 0.0%        |
| C4                                 | 26.7%      | 14.4%     | 15.0% | 0.0%  | 0.0%       | 10.0%       |
| GitHub                             | 5.2%       | 4.9%      | 4.5%  | 0.0%  | 0.0%       | 4.0%        |
| Books                              | 4.2%       | 2.1%      | 4.5%  | 3.0%  | 0.0%       | 30.0%       |
| ArXiv                              | 4.6%       | 2.3%      | 2.5%  | 1.9%  | 0.0%       | 0.0%        |
| Wikipedia                          | 3.8%       | 2.0%      | 4.5%  | 4.0%  | 0.0%       | 1.0%        |
| StackExchange                      | 3.3%       | 1.7%      | 2.0%  | 1.4%  | 0.0%       | 0.0%        |
| mC4 3.1.0 – English (200+ words)   | 0.0%       | 0.0%      | 0.0%  | 33.0% | 0.0%       | 0.0%        |
| C4 – English – SemDedup 80%        | 0.0%       | 0.0%      | 0.0%  | 29.9% | 0.0%       | 0.0%        |
| The Stack – Selected Languages     | 0.0%       | 0.0%      | 0.0%  | 10.0% | 0.0%       | 0.0%        |
| The Stack – Markdown               | 0.0%       | 0.0%      | 0.0%  | 3.5%  | 0.0%       | 0.0%        |
| Semantic Scholar ORC               | 0.0%       | 0.0%      | 0.0%  | 3.3%  | 0.0%       | 0.0%        |
| MassiveWeb                         | 0.0%       | 0.0%      | 0.0%  | 0.0%  | 0.0%       | 45.0%       |
| News                               | 0.0%       | 0.0%      | 0.0%  | 0.0%  | 0.0%       | 10.0%       |

Table 4: Dataset source proportions for SlimPajama and related datasets.

As with RedPajama, SlimPajama may contain distributional biases that can manifest in various forms in downstream model deployments. There are also other risks associated with large language models, such as amplifying social stereotypes, memorizing training data, or revealing private or secure information.

SlimPajama Data Processing Pipeline

The existing open-source infrastructure for pre-processing text datasets did not scale to datasets with 1 trillion tokens. The most significant bottlenecks occurred in the interleaving, shuffling, and deduplication steps. Our optimizations are inspired by producer-consumer patterns, which we implemented using Python's standard multiprocessing library. We also had to rewrite the datasketch (Zhu 2023) implementation to reduce memory requirements and make the code more efficient in a distributed setting. Details of our implementation are available at https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama. The end-to-end pre-processing took 2.5 days on a 64-core CPU machine, with peak memory consumption of 1.4 TB.
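
As a rough illustration of the producer-consumer pattern mentioned above (not the exact code in the linked repository), a producer process can stream documents from disk into a bounded queue while several worker processes consume and transform them, which keeps peak memory bounded regardless of dataset size. File names, the transformation, and the worker count are placeholders.

import multiprocessing as mp

NUM_WORKERS = 4   # placeholder; the full pipeline ran on a 64-core machine
SENTINEL = None   # tells a consumer that no more documents are coming

def producer(paths, queue):
    """Stream documents from disk into a bounded queue instead of loading everything into RAM."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                queue.put(line)
    for _ in range(NUM_WORKERS):
        queue.put(SENTINEL)

def consumer(worker_id, queue):
    """Pull documents off the queue, transform them, and write results to a per-worker shard."""
    with open(f"out_shard_{worker_id}.jsonl", "w", encoding="utf-8") as out:
        while True:
            doc = queue.get()
            if doc is SENTINEL:
                break
            out.write(doc.lower())  # stand-in for cleaning / signature computation

if __name__ == "__main__":
    work_queue = mp.Queue(maxsize=10_000)  # bounded queue caps peak memory use
    prod = mp.Process(target=producer, args=(["input_shard.jsonl"], work_queue))
    workers = [mp.Process(target=consumer, args=(i, work_queue)) for i in range(NUM_WORKERS)]
    prod.start()
    for w in workers:
        w.start()
    prod.join()
    for w in workers:
        w.join()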

Next Steps

We built SlimPajama with the intention of training a LLaMA-style model for our partner Opentensor. While we expect training to begin in the coming weeks, both Opentensor and Cerebras believe it would be beneficial to the ML community to release the dataset first. The team at Cerebras also believes that open-sourcing our trillion-token-scale data processing library will help others further improve open-source LLM datasets.

Authors

Cerebras: Daria Soboleva*, Faisal Al-Khateeb*, Joel Hestness, Nolan Dey
Opentensor: Robert Myers, Jacob Robert Steeves
*Equal contribution

Resources

  1. SlimPajama-627B dataset: https://huggingface.co/datasets/cerebras/SlimPajama-627B
  2. Pre-processing library: https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama

Acknowledgements

We’d like to thank Together, Ontocord.ai, ETH DS3Lab, and the AAI CERC Lab for creating the original RedPajama dataset and releasing it open source.
This release was made possible with the support and collaboration of Opentensor.
Easy cloud access to Cerebras systems is provided by our partner Cirrascale.

Citation

To cite our work please use:

@misc{cerebras2023slimpajama,
author = {Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R and Hestness, Joel and Dey, Nolan},
title = {{SlimPajama: A 627B token cleaned and deduplicated version of RedPajama}},
month = jun,
year = 2023,
howpublished = {\url{https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama}},
url = {https://huggingface.co/datasets/cerebras/SlimPajama-627B},
}

References

Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. SemDeDup: Data-efficient learning at web-scale through semantic deduplication. 2023. URL https://arxiv.org/abs/2303.09540.

Together Computer. RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Hugging Face. Large-scale Near-deduplication Behind BigCode, 2023. URL https://huggingface.co/blog/dedup.

Ari Holtzman. The Curious Case of Neural Text Degeneration. 2019. URL https://arxiv.org/abs/1904.09751.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating Training Data Makes Language Models Better. 2021. URL https://arxiv.org/abs/2107.06499.

Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. 2023. URL https://arxiv.org/abs/2306.01116.

The MosaicML NLP Team. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs, 05 2023. URL https://www.mosaicml.com/blog/mpt-7b.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and Efficient Foundation Language Models. 2023. URL https://arxiv.org/abs/2302.13971.

Eric Zhu. datasketch: Big Data Looks Small — datasketch 1.5.9 documentation. http://ekzhu.com/datasketch/, 2023.

A Additional Dataset Details

Figure 1: SlimPajama pre-processing pipeline.

Our pre-processing pipeline consists of multiple stages: NFC normalization, cleaning, deduplication, document interleaving, document shuffling, splitting into train and holdout sets, and deduplication of the train set against the holdout set. All of these steps are shown in Figure 1. Additional steps such as tokenization, sequence packing, sequence-level shuffling, and upsampling can be performed using our scripts located at https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama. All steps assume that the whole dataset cannot fit in the available RAM and is distributed across multiple processes. We welcome contributions of additional dataset preparation steps, as well as suggestions on how to make this even more efficient for large-scale datasets!
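
As a small, self-contained illustration of the first stage, NFC normalization can be applied with Python's standard unicodedata module while streaming a jsonl shard from disk. The file paths and function names here are placeholders rather than the exact interface of our scripts; only the "text" field follows the document format shown in the examples below.

import json
import unicodedata

def nfc_normalize(text: str) -> str:
    """Apply Unicode NFC normalization so equivalent characters share one representation."""
    return unicodedata.normalize("NFC", text)

def process_shard(in_path: str, out_path: str) -> None:
    """Stream a jsonl shard through NFC normalization without loading it all into memory."""
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            doc["text"] = nfc_normalize(doc["text"])
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")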

B Examples of Low-length Filtered Data

Example 1:

{
    "text": "\n\n\n\n\n\n\n\n\n\n\n",
    "meta": {
        "timestamp": "2018-05-31T02:04:03",
        "yymm": "1805",
        "arxiv_id": "1805.11741",
        "language": "en",
        "url": "https://arxiv.org/abs/1805.11741"
    }
}

Example 2:

{
    "text": "\\section{}\n\n\n\n\n\n\n\n\n\n\n\n\n",
    "meta": {
        "timestamp": "2012-09-14T02:06:30",
        "yymm": "1209",
        "arxiv_id": "1209.2994",
        "language": "en",
        "url": "https://arxiv.org/abs/1209.2994"
    }
}

Example 3:

{
    "text": "$126.70 $144.80 Save 13% Current price is $126.7, Original price is $144.8. You Save 13%.\n",
    "meta": {
        "timestamp": "2019-04-21T11:02:23Z",
        "source": "c4",
        "language": "en",
        "url": "https://www.barnesandnoble.com/w/politics-in-states-and-communities-thomas-r-dye/11000568"
    }
}

C Example of Duplicates

“Hazing Can End with Enlightened Education\nby | Oct 1, 2018 | General\nThe headlines make it clear that hazing is still happening all over the planet with young men and women being forced to make decisions that can have life threatening consequences. If you follow the NMB Foundation Blog, you know the dangers of hazing.\nHazing and What Can We Do?\nAccording to the NMB Foundation, plenty, because the Mission Statement of the NMB Foundation is to help young men and women recognize and prevent the dangers of Rite of Passage…”

“The headlines make it clear that hazing is still happening all over the planet with young men and women being forced to make decisions that can have life threatening consequences. If you follow the NMB Foundation Blog, you know the dangers of hazing.\nHazing and What Can We Do?\nAccording to the NMB Foundation, plenty, because the Mission Statement of the NMB Foundation is to help young men and women recognize and prevent the dangers of Rite of Passage…”