Foundation models provide capable general-purpose responses across a wide range of tasks. However, users who need higher-quality outputs for specific tasks must train these models further. Our research indicates that a smaller foundation model fine-tuned on a domain-specific dataset can outperform a larger out-of-the-box foundation model. We show that a GPT-NeoX 1.4B model fine-tuned for 2,000 training steps performs as well as the out-of-the-box GPT-J 6B model. We also show how users can easily fine-tune models using Launchpad, the simplified command-line interface of our cloud-enabled Cerebras AI Model Studio.

Introduction to Training Foundation Models

Foundation models, such as BERT, CLIP, and the GPT family, including ChatGPT and GPT-4, are revolutionizing the way we approach machine learning. Trained using generic data, these models achieve great generalizability across a wide range of tasks. For some users, using a foundation model as-is, or out-of-the-box, will be sufficient – this is called Zero-Shot Learning (ZSL). These users are interested in a task that the model is already capable of doing and are satisfied with the quality of the output.

However, users who need either a capability the model does not yet support or higher-quality output on an already supported task will need to train a foundation model further. For example, a user may want to adapt the foundation model so that it can detect fake news, or train a model from scratch on protein data or legal text. In each of these use cases the model does not fully understand the specialized language of the domain, and more performance can be gained by further training the model on a domain-specific dataset that reflects these differences in language.

Performance gains from fine-tuning

At Cerebras, we wanted to test whether fine-tuning a model on a domain-specific dataset could achieve better performance than an out-of-the-box model. We conducted several experiments in which we fine-tuned various GPT models on different datasets and compared them against out-of-the-box GPT models. For each experiment, we saved checkpoints along the way and computed the following evaluation metrics: training loss, evaluation loss, accuracy, and perplexity. Following normal practice, evaluation was performed on a separate “eval” dataset held out from the training data. The evaluation metrics are defined below.

  • Training loss – The difference between the model’s predicted output and the true output (also known as the target or label) on the training data. Training loss measures how well the model fits the training data.
  • Evaluation loss – The same measure computed on a separate validation dataset that is not used during training. Evaluation loss estimates the model’s generalization performance on new, unseen data; a lower evaluation loss typically indicates better generalization.
  • Accuracy – The proportion of correctly predicted tokens out of the total number of tokens. Accuracy measures how well the model predicts the data.
  • Perplexity – The average number of candidate words the model considers plausible as the continuation of a given sequence, computed as the exponential of the cross-entropy loss. A lower perplexity indicates that the model is better at predicting the next word.
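Because perplexity is simply the exponential of the cross-entropy loss, the loss and perplexity columns in the tables that follow move in lockstep. A minimal sketch of both metrics (the helper names here are ours, for illustration):

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is the exponential of the average per-token cross-entropy loss."""
    return math.exp(cross_entropy_loss)

def token_accuracy(predicted_tokens, target_tokens):
    """Fraction of positions where the predicted token matches the target."""
    matches = sum(p == t for p, t in zip(predicted_tokens, target_tokens))
    return matches / len(target_tokens)

# Loss values from Table 1: perplexity tracks exp(loss) up to rounding.
print(perplexity(1.92))  # ~6.8 for zero-shot GPT-J 6B (Table 1 reports 6.84)
print(perplexity(1.76))  # ~5.8 for GPT-NeoX 1.4B at step 3,000 (Table 1: 5.84)
```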

Fine-tuning with TRC2 Corpus

We began our experiments by fine-tuning GPT-J 6B on the TRC2 Corpus dataset, which contains 1,800,370 news stories from Reuters. We fine-tuned for a total of 12,000 steps in order to compare against the out-of-the-box (step zero) GPT-J 6B model. Please note that step zero is equivalent to Zero-Shot Learning (ZSL), i.e. using a foundation model out of the box without any fine-tuning. Figure 1 shows the evaluation plot for each metric. All three metrics keep improving, which indicates that the model continues to generalize and improve.

Figure 1. Eval metric plots for GPT-J 6B on TRC2 dataset.

In our next experiment, we fine-tuned the GPT-NeoX 1.4B model on the TRC2 Corpus dataset for a total of 3,000 steps and saved a checkpoint every 1,000 steps in order to compare it to the out-of-the-box (step zero) GPT-J 6B model.

Model           Step           Accuracy   Loss   Perplexity
GPT-J 6B        0 (Original)   0.625      1.92   6.84
GPT-NeoX 1.4B   0 (Original)   0.439      2.89   18.00
GPT-NeoX 1.4B   1000           0.570      2.13   8.44
GPT-NeoX 1.4B   2000           0.602      1.88   6.56
GPT-NeoX 1.4B   3000           0.621      1.76   5.84

Table 1. GPT-NeoX 1.4B evaluation metrics at different checkpoints compared to zero-shot GPT-J 6B.

As shown in Table 1 and Figure 2, a GPT-NeoX 1.4B model fine-tuned on the TRC2 corpus for about 2,000 training steps performs nearly as well as the zero-shot GPT-J 6B model, and continued training pushes it ahead. This indicates that a user can fine-tune a smaller model on a domain-specific dataset and outperform a larger out-of-the-box model. Users who elect to train smaller models will require fewer resources when deploying for inference, which will save them significant money over the lifetime of their generative AI application.
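As a back-of-the-envelope illustration of those inference savings (our own estimate, using the common approximation of roughly 2·N FLOPs per generated token for a decoder-only model with N parameters, ignoring attention overhead):

```python
# Rough inference cost model: ~2 FLOPs per parameter per generated token
# (weight multiply-accumulates dominate; attention overhead is ignored).
def flops_per_token(n_params: float) -> float:
    return 2.0 * n_params

gpt_j_6b = flops_per_token(6.0e9)
gpt_neox_1_4b = flops_per_token(1.4e9)
print(f"GPT-J 6B vs. GPT-NeoX 1.4B per-token cost: {gpt_j_6b / gpt_neox_1_4b:.1f}x")  # ~4.3x
```

Under this rough model, every token generated by the smaller model costs about a quarter as much compute, which compounds over the lifetime of a deployed application.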

It is worth noting that the numbers in Table 1 were achieved with very little hyper-parameter tuning; more tuning and iteration could yield even better results. With fine-tuning, GPT-J 6B would outperform GPT-NeoX 1.4B, but that would require more compute for both training and inference. The correct trade-off depends on the goal of the application.

Figure 2. After 3,000 fine-tuning steps, the smaller model accuracy approaches that of the much larger model. Loss and perplexity are better after only 2,000 fine-tuning steps.

Fine-tuning with Curation Corpus

Another dataset we used for fine-tuning with the Cerebras AI Model Studio is Curation Corpus, a collection of 40,000 professionally written summaries of news articles.

We fine-tuned the GPT-NeoX 1.4B foundation model using Curation Corpus for 1,000 steps.

Step   Accuracy   Loss   Perplexity
0      0.50       2.78   16.0
1000   0.52       2.17   8.8

Table 2. GPT-NeoX 1.4B evaluation metrics at different checkpoints on Curation Corpus.

Fine-tuning with BigPatent

Another summarization dataset that uses a different domain is BigPatent, which consists of 1.3 million records of U.S. patent documents with human-written, abstractive summaries. Due to the legal nature of the original patent text, patent summarization is a challenging task.

We fine-tuned a GPT-J 6B foundation model using the BigPatent dataset for 7,000 steps.

Step   Accuracy   Loss   Perplexity
0      0.55       1.99   7.28
7000   0.60       1.72   5.59

Table 3. GPT-J 6B evaluation metrics at different checkpoints on BigPatent.

The results in Table 2 and Table 3 both show significant improvement compared to Zero-Shot Learning.
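To quantify that improvement, the relative perplexity reductions implied by Tables 2 and 3 are easy to compute:

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

# Table 2: GPT-NeoX 1.4B on Curation Corpus, perplexity 16.0 -> 8.8
print(f"{pct_reduction(16.0, 8.8):.0f}%")   # 45%
# Table 3: GPT-J 6B on BigPatent, perplexity 7.28 -> 5.59
print(f"{pct_reduction(7.28, 5.59):.0f}%")  # 23%
```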

Fine-Tuning vs Training from Scratch

Users have two options when creating domain-specific models: fine-tuning an existing, generic foundation model, or training a new model from scratch. As discussed above, fine-tuning takes a pre-trained model and adapts it to a new task or dataset by further training on the new data, usually with smaller learning rates and fewer training epochs. Training from scratch means training a brand-new model, with no prior training, on a specific task or dataset. The two approaches have different requirements, as shown in Table 4.

Task Requirements
  Fine-Tuning: Task must be similar to the tasks learned by the foundation model.
  Training from Scratch: No task requirement, as the model has not been exposed to any data yet.

Data Requirements
  Fine-Tuning: Dataset does not need to be large, but should be similar to the original dataset used for pre-training.
  Training from Scratch: Dataset should be sufficiently large to avoid over-fitting and achieve good performance.

Compute Requirements
  Fine-Tuning: Faster and requires fewer computational resources than training from scratch, since the model’s initial parameters have already been learned.
  Training from Scratch: Computationally expensive and time-consuming, especially for complex models with many parameters.

Model Performance
  Fine-Tuning: Leads to better performance than an out-of-the-box foundation model when the pre-trained model is relevant to the new task and the new dataset is similar to the original dataset.
  Training from Scratch: Can outperform fine-tuning when (1) the pre-trained models are not relevant to the new task, (2) the new dataset is significantly different from the original dataset, or (3) the dataset is sufficiently large.

Table 4. Fine-Tuning versus Training from scratch.

To summarize, fine-tuning is used to adapt pre-trained models to new tasks or datasets, while training from scratch is the process of training new models with no prior knowledge of any dataset. Fine-tuning is usually faster and requires fewer computational resources, but training from scratch can be more powerful when pre-trained models are not relevant, or the new dataset is significantly different from the original dataset.

Training with the Cerebras AI Model Studio Launchpad

Users that are interested in fine-tuning or training from scratch can do so with the Cerebras AI Model Studio Launchpad.

For the initial set of supported models (Table 5), we chose a diverse set of NLP models to cover a variety of options for your fine-tuning tasks. These include foundation models from EleutherAI and Salesforce, as well as our own Cerebras-GPT family, with more models coming soon. All of these models are compatible with the Hugging Face checkpoint format.

Note that different model sizes provide a tradeoff between compute versus target accuracy. Smaller foundation models will require less compute and train faster compared to larger models. In addition, smaller models are easier to deploy.

Another difference is the data the foundation models were trained on. For example, the Salesforce CodeGen-Multi series is fine-tuned on a dataset spanning multiple programming languages, while the Salesforce CodeGen-Mono series is fine-tuned on Python code. Depending on your task, some models may therefore provide better results than others.

Our internally trained Cerebras-GPT style models were trained on the Pile dataset. Those models have been trained in accordance with Chinchilla scaling laws. You can explore the models in more detail in our Model Zoo repository, and learn more about the training methodology in our paper, “Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster”.
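The Chinchilla scaling laws referenced above are commonly summarized as roughly 20 training tokens per model parameter for compute-optimal training; the Cerebras-GPT paper documents the exact token budgets used, but the heuristic can be sketched as:

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training budget: ~20 tokens per parameter."""
    return 20.0 * n_params

# Illustrative budgets for a few Cerebras-GPT model sizes
for n_params in (111e6, 1.3e9, 13e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:g}B params -> ~{tokens / 1e9:g}B tokens")
```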

Model                 Description
EleutherAI GPT-J      EleutherAI GPT-J 6B.
EleutherAI GPT-NeoX   We support all checkpoint variants in Hugging Face from EleutherAI NeoX:
                      • EleutherAI GPT-NeoX 20B
                      • EleutherAI Pythia: 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B and 12B.
Salesforce CodeGen    We support all checkpoint variants of Salesforce CodeGen:
                      • CodeGen-NL: 350M, 2.7B, 6.1B and 16.1B.
                      • CodeGen-Multi: 350M, 2.7B, 6.1B and 16.1B.
                      • CodeGen-Mono: 350M, 2.7B, 6.1B and 16.1B.
Cerebras-GPT          We support the following variants of Cerebras-GPT:
                      • Cerebras-GPT: 111M, 256M, 590M, 1.3B, 2.7B, 6.7B and 13B.

Table 5. List of foundation models supported by AI Model Studio.

We offer these models for fine-tuning at competitive prices and with 8x faster training speeds versus traditional cloud. Cerebras delivers these performance gains at lower cost through the Cerebras Wafer-Scale Cluster. And with the Cerebras AI Model Studio, your data is always secure and remains yours: you own your ML methods, and your trained weights are yours to keep. Not only that, but our staff of experts, who produced the research above, is on hand to help you optimize your fine-tuning or training-from-scratch experiment.

Quick Guide to Fine-Tuning with the Cerebras AI Model Studio Launchpad

The easiest way for users to set up and run their experiments is Launchpad, the Cerebras AI Model Studio’s simplified command-line interface. It removes the overhead of setting up the environment, preparing model code and checkpoints, and running large foundation models at scale, so users can focus on their experiments from the start. Follow these simple steps to train your first model:

1. Log in to the user node (ssh user@ip-addr)
2. Copy over your dataset in our recommended format, following our data loader guide
3. Enter Launchpad

Welcome to Launchpad. Type help or ? to list commands.

> help
Documented commands (type help <topic>):
add exit help list start stop eval experiment history run status view


4. Enter list to see which model was selected, the available datasets and checkpoints, and the hyperparameters one can change in the model.

> list
Model name: GPT-3-1.3B

Available datasets:
    - pile_2048

Available checkpoints:

    - ID: 4, Timestamp: 2023-03-29 00:12:25.398312, global_step: 0
    - ID: 5, Timestamp: 2023-03-29 00:31:36.099650, global_step: 100
    - ID: 9, Timestamp: 2023-03-29 13:36:47.741818, global_step: 10100

Available hyperparameters:

    dropout:
            - Number must be in range [0.0, 1.0]
            - Type of the value must be one of [float]
            default: 0.0
            description: Dropout rate to use.
            required: false
    optimizer: Refer to documentation for a full list of available optimizers and learning
       rate schedulers.

5. Enter the add dataset command to add the custom dataset you have already copied over.

> add dataset -h
usage: add dataset [-h] --name NAME --paths PATHS [PATHS ...]

Add a new dataset to registry of available datasets.

optional arguments:
  -h, --help            show this help message and exit
  --name NAME           Unique name of the dataset
  --paths PATHS [PATHS ...]
                        List of data directories for this dataset.

> add dataset --name example_dataset --paths <path_to_dataset_dir>

6. Enter the experiment command to add an experiment with the hyperparameters of your choice.

> experiment -h
usage: experiment [-h] {add,view,delete} ...

Manage experiments to run.

positional arguments:
  add         Add a new experiment
  view        View existing experiments
  delete      Delete an experiment

optional arguments:
  -h, --help  show this help message and exit

7. The experiment add command will open the configuration file in the vim editor. The configuration file uses YAML syntax. Here you can change the model and hyperparameters, including the number of steps to train for.
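The exact schema depends on the model selected in Launchpad; as a purely illustrative sketch of the kind of fields such a file exposes (the field names below are hypothetical, apart from dropout and optimizer, which appear in the list output above):

```yaml
# Hypothetical configuration sketch -- consult the Launchpad documentation
# for the real schema of your selected model.
model: GPT-3-1.3B
dataset: example_dataset   # a dataset registered with `add dataset`
num_steps: 2000            # number of training steps
dropout: 0.0               # must be a float in [0.0, 1.0]
optimizer:
  learning_rate: 2.0e-5    # fine-tuning typically uses a small learning rate
```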

8. Enter the run command to start training.

> run -h
usage: run [-h] [-n N]

optional arguments:
  -h, --help  show this help message and exit
  -n N        Run the last `n` experiments. If not provided, it will run the
              last available experiment.

9. Enter the status command to see the list of jobs that have been started and their current status. This command will also provide the tensorboard link to view progress and results.

The user can also run status view --job_id <job_id> to get more details about a specific job, such as losses, hyperparameters that were used for it, and checkpoints generated by the run.

> status
Tensorboard at: http://sc-r10ra10-s15.cerebrassc.local:43900/

| Job ID  | Model  | Start Time  | End Time  | Latest Status  |

> status view -h
usage: status view [-h] --job_id JOB_ID
                   [--summaries | --hyperparams | --checkpoints]

optional arguments:
  -h, --help       show this help message and exit
  --job_id JOB_ID  Filter by the given job id
  --summaries      Display loss summaries
  --hyperparams    Display hyperparameters used in this job
  --checkpoints    Display all checkpoints collected from this run

The status command can also be used to cancel a running job.

> status cancel -h
usage: status cancel [-h] --job_id JOB_ID

optional arguments:
  -h, --help       show this help message and exit
  --job_id JOB_ID  Filter by the given job id

10. To exit Launchpad, enter the exit command.

Get Started

Fine-tuning with Cerebras AI Model Studio is easy and effective for building high-quality models. Contact Cerebras by emailing us at developer@cerebras.net or by filling out this form. Please let us know if you are interested in a model that is not listed.


Emad Barsoum, Senior Director of AI
Vishal Subbiah, ML Software Architect
Udai Mody, Product Marketing and Partnerships

April 18, 2023