Large language models (LLMs) have evolved remarkably fast since the invention of transformer-based models and the more recent arrival of ChatGPT, a chatbot that brought a tsunami of popularity.

For an LLM to chat and answer many different kinds of questions (e.g., summarization, multiple choice, career advice), it is standard to do supervised fine-tuning (SFT) after the pre-training stage. To further improve a model’s responses and give them a more natural tone, it is common to collect high-quality model generations labeled by humans and fine-tune the model with these preferences.

Exploring efficient ways to fine-tune models using human preference data remains an open area of study. Reinforcement Learning from Human Feedback (RLHF) [5] is one of the popular ways to fine-tune with human preferences after SFT; however, it involves complex and costly training of multiple models. Direct Preference Optimization (DPO) [1] is a recent method for fine-tuning LLMs with human preference data that matches RLHF quality with a simpler training procedure. Our aim is to offer a cost-effective method for fine-tuning pre-trained models with human preference data using DPO, along with practical guidance on how to use it well.

As part of this research, we are excited to announce the release of BTLM-Chat, an enhanced version of BTLM specifically fine-tuned using the DPO method. This model has been refined based on a diverse set of human preferences to generate more coherent, contextually appropriate, and engaging dialogue. BTLM-Chat is now available on the Hugging Face platform, offering the community a robust tool for various conversational AI applications. You can access and implement this model directly from the Hugging Face model repository, allowing for seamless integration into your projects. 

Why DPO?

Direct Preference Optimization (DPO) is a relatively new method for aligning large language models (LLMs) with human preferences, and it has emerged as a practical and stable alternative to traditional reinforcement learning (RL) techniques. Research in this domain falls broadly into two streams: methods that employ RL and methods that do not. Within the RL paradigm, prominent methods include RLHF (Reinforcement Learning from Human Feedback), which uses human preferences to guide learning, and RLAIF (Reinforcement Learning from AI Feedback), which brings feedback produced by LLMs into the learning process. These methods predominantly rely on Proximal Policy Optimization (PPO) and on training a separate reward model.

Non-RL approaches have also garnered attention, with strategies such as LIMA/SFT (supervised fine-tuning on a small, manually curated dataset) [6] and C-RLFT (Conditioned Reinforcement Learning Fine-Tuning) [7], which assigns rewards based on responses generated by the model. While recent discussion in the field has increasingly focused on non-RL methods for fine-tuning LLMs, empirical validation of these techniques remains limited.

The DPO method, however, has been applied in numerous recent studies and shows significant promise. It not only enhances chat functionality but also improves performance on various downstream tasks, marking a notable advance in LLM fine-tuning.

DPO emerges as an alternative to the RLHF methodology, building upon its foundational principles yet diverging significantly in its implementation. Central to DPO’s innovation is its streamlined approach: it utilizes a loss function derived from the RLHF objective, coupled with the Bradley-Terry model for preference estimation. This strategic integration simplifies the training process, enabling the model to be trained in a supervised manner, a stark contrast to the multiple model training and complex RL optimization characteristically employed in RLHF. 

Furthermore, DPO offers enhanced stability in model convergence. Traditional RL optimization processes are often marked by instability, a challenge DPO circumvents through its more straightforward training approach. This stability not only facilitates easier model training but also potentially leads to more consistent model performance. Therefore, DPO represents a significant evolution in the field of LLM fine-tuning, offering a more accessible and stable alternative to the complexities inherent in traditional RL-based methods. 
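To make the loss described above concrete, here is a minimal sketch of the DPO objective from [1] for a single preference pair. It is written in plain Python on scalar sequence log-probabilities rather than on model outputs; the function and argument names are our own illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so that exp() never receives a large positive value.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    reference_chosen_logp: float,
    reference_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - reference_chosen_logp
    rejected_logratio = policy_rejected_logp - reference_rejected_logp
    # Bradley-Terry preference probability, with beta scaling the margin.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(logits)
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference. Because this is an ordinary differentiable loss over labeled pairs, training reduces to a supervised setup with no separate reward model or PPO loop.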


In our experimental framework, we focused on training two distinct models: Pythia2.8B-DPO[3] and BTLM-DPO[4], utilizing Low-Rank Adaptation (LoRA) in both cases. A key finding from our study was that the implementation of DPO resulted in a noticeable improvement in the models’ downstream accuracy, as evaluated using the BTLM evaluation harness. This enhancement was observed across various downstream tasks.  

Following the DPO paper [1], we trained both models using the Anthropic-HH dataset [2], which contains 161k preference text pairs about helpfulness and harmlessness. We first fine-tuned the Pythia2.8B and BTLM base models on the preferred (chosen) subset of the Anthropic-HH dataset for one epoch to get the SFT models, namely Pythia2.8B-SFT and BTLM-SFT. We then fine-tuned the SFT models with the DPO method on the Anthropic-HH dataset for one more epoch to get the DPO models, namely Pythia2.8B-DPO and BTLM-DPO.
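The two-stage recipe above uses the same preference pairs in two different ways. The sketch below illustrates the split under the assumption that each pair is a dictionary with hypothetical "prompt", "chosen", and "rejected" fields; the real dataset layout may differ, so treat the field names as placeholders.

```python
from typing import Dict, List

Pair = Dict[str, str]


def build_sft_examples(pairs: List[Pair]) -> List[Dict[str, str]]:
    """Stage 1 (SFT): keep only the preferred response for each prompt."""
    return [{"prompt": p["prompt"], "completion": p["chosen"]} for p in pairs]


def build_dpo_examples(pairs: List[Pair]) -> List[Pair]:
    """Stage 2 (DPO): keep both responses so the loss can compare them."""
    return [
        {"prompt": p["prompt"], "chosen": p["chosen"], "rejected": p["rejected"]}
        for p in pairs
    ]
```

The SFT stage never sees the rejected responses; they only enter training in the DPO stage, where the loss contrasts them with the chosen ones.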

For the purpose of clarity and consistency in our analysis, we standardized the beta parameter at 0.1 for both models. This default setting was used to generate the results illustrated in the subsequent figure, providing a comparative baseline for assessing the impact of DPO on different models and tasks. 

From the experiments, we observed an overall improvement in downstream task performance during training. The downstream evaluation improves more quickly at the beginning of the DPO fine-tuning and gradually plateaus. This observation is more prominent for the BTLM-DPO model. The Pythia model starts to improve significantly after approximately 60,000 tokens, which could be a result of the different pre-training datasets used for these models and the nature of the models. However, both models plateau at around 100,000 tokens.

In Table 1, which provides a detailed breakdown of the downstream evaluation for each task, we observed that the BTLM-DPO model showed more balanced improvements across all tasks, in contrast to Pythia, which exhibited uneven enhancements and even degradation in some tasks. Interestingly, the ARC-c task, typically a reasoning task, also showed improvement, suggesting the broader applicability of the DPO method. Overall, for the tasks we chose, the DPO model improves the foundational model by approximately 2%.

It is also important to note that the extent of performance improvement varied depending on the specific model and hyperparameter configurations employed. Certain tasks demonstrated more significant enhancements than others, suggesting a relationship between model architecture, hyperparameters, and task-specific performance.

DPO experiments with different precision settings

In our study, we conducted a comprehensive evaluation of the DPO method, employing both LoRA (Low-Rank Adaptation) and non-LoRA configurations, while also varying the precision settings. Our objective was to ascertain whether the outcomes reported in the foundational DPO study could be consistently achieved across these varied settings. To this end, we utilized the Pythia2.8B model and the Anthropic-HH dataset.

The experimental setups were: (1) 32-bit floating-point precision (float32) without LoRA, (2) bfloat16 precision without LoRA, and (3) bfloat16 precision with LoRA. These variations were chosen to assess the impact of precision and of LoRA on the efficacy of the DPO method. All models were trained for one epoch with beta=0.1.

For evaluation, we used two metrics: rewards/accuracies and GPT-4 win-rate evaluation. The ‘rewards/accuracies’ metric is an accuracy measure that checks whether the reward the DPO model assigns to the chosen response is greater than the reward it assigns to the rejected response. See the appendix for the definition of rewards/accuracies.

GPT-4 win-rate evaluation measures a model’s chat capabilities by comparing the response generated by the model with the chosen response from the dataset. For each prompt, GPT-4 is given both responses and determines which one, the response from the LLM under evaluation or the standard response from the dataset, is better in terms of contextual relevance and coherence. We used 256 samples for each category, namely harmlessness and helpfulness, from the Anthropic-HH dataset.
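As a small illustration of how the judge’s decisions turn into a single number, the sketch below aggregates per-prompt verdicts into a win rate. The verdict labels "model" and "dataset" are hypothetical, and counting anything other than "model" as a non-win is an assumption on our part; the prompt sent to GPT-4 is not shown.

```python
from typing import Sequence


def win_rate(verdicts: Sequence[str]) -> float:
    """Fraction of comparisons in which the judge preferred the model's response.

    Each verdict is "model" if the judge picked the evaluated model's response,
    or "dataset" if it picked the chosen response from the dataset.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    wins = sum(1 for v in verdicts if v == "model")
    return wins / len(verdicts)
```

With 256 samples per category, the function would be called once on the harmlessness verdicts and once on the helpfulness verdicts.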

Figure 2: Rewards/accuracies of models with different precisions during DPO training for Pythia2.8B.
Table 2: GPT-4 win-rate evaluation for DPO models finetuned with different precisions.

At the end of DPO training, we focus on the GPT-4 win-rate score, since it serves as a proxy for human alignment in chat. In these experiments, the best-performing model is the one trained in fp32 without LoRA, although the difference between bf16 and fp32 is not significant. Overall, we suggest training without LoRA to get the best alignment. When resources are limited, LoRA still appears to be a workable way to obtain a decent chat model.

Does DPO cause models to forget?

In this series of experiments, we investigated how much of the original information DPO models retain during training. Specifically, we explored how varying the beta parameter influences this ‘forgetting’ process. Using the BTLM model and SlimPajama, the dataset that BTLM was originally trained on, we evaluated BTLM and its DPO variants by measuring their losses on the SlimPajama validation set. We found only a small increase in loss, suggesting minimal forgetting. Notably, a higher beta correlated with greater information retention, indicating that beta plays a pivotal role in controlling how much information DPO models preserve.

Table 3: The validation loss of SlimPajama after BTLM-DPO finetuning with different beta

Beta selection is key to DPO performance

Our experiments, involving various datasets and beta parameters, revealed distinct patterns in the efficacy of DPO. The results indicate a superior performance of DPO with conversational datasets in terms of GPT-4 win-rate evaluation. Specifically, we found that a smaller beta parameter (less than 0.1) yields optimal results for chat-oriented datasets. In contrast, larger beta values (ranging from 0.3 to 0.5) are more effective for datasets focused on instructional fine-tuning, summarization, and similar tasks.

Our exploration included two key datasets: Anthropic-HH and SHP. The Anthropic-HH dataset is a collection of preference data built from conversations curated to evaluate helpfulness and harmlessness. The SHP dataset, on the other hand, is derived from a comprehensive scrape of public discussions and interactions on Reddit, and contains a broader range of preference data covering 18 distinct areas.

Notably, the GPT-4 win-rate on the SHP dataset was significantly lower than on the Anthropic-HH dataset, indicating that conversational datasets are more conducive to DPO training. We recommend using DPO with conversational datasets to improve human alignment (as proxied by the GPT-4 win-rate evaluation).

Table 4: The GPT-4 win-rate evaluation and average downstream task evaluation for DPO training with different beta and datasets.

How to use DPO to make your model better?

Fine-tuning with DPO on general conversational datasets improves human alignment
In this work, we evaluated BTLM-SFT models that were initially trained on various datasets and subsequently fine-tuned using DPO on different datasets. Our findings demonstrate the DPO method improves models’ human alignment, and this capability generalizes to unseen datasets.

Throughout these experiments, we have two major observations:

    1. A DPO model trained solely on the Anthropic-HH dataset exhibited good performance on both the Anthropic-HH and SHP datasets, surpassing the performance of the model trained exclusively on SHP in terms of GPT-4 win-rate evaluation. This could be attributed to the inherent characteristics of the SHP dataset and potential biases in the GPT-4 evaluation process.
    2. The most proficient model in terms of chat capabilities emerged from training the SFT model using the Alpaca dataset, followed by DPO fine-tuning using the Anthropic-HH dataset. This model outperformed others when evaluated with both Anthropic-HH and SHP datasets. This observation leads to a pivotal conclusion: the quality of the initial SFT model is crucial for training a good DPO model. Additionally, this indicates that employing different datasets for SFT and DPO training can be an effective strategy, which contradicts the suggestion from the original paper.

Table 5: Different BTLM-SFT and BTLM-DPO models evaluated on SHP dataset.
Table 6: Different BTLM-SFT and BTLM-DPO models evaluated on Anthropic-HH dataset.

How long do you need to train with DPO?

It is useful to understand how long a model needs to be trained with DPO. However, models with different sizes and architectures may need to be trained for a different number of tokens. To determine the length of training required to successfully train a DPO model, we selected two models from the Pythia family: Pythia-1.4B and Pythia-2.8B.

Figure 3 and Figure 4 indicate that downstream task performance ceases to improve after a relatively small number of tokens, which led us to concentrate on rewards/accuracies and win-rate. The data shows a clear relationship between the two: the win-rate peaks at the same time as rewards/accuracies, and the models perform best at that peak. We believe this suggests that it is preferable to select the model at the peak of rewards/accuracies. We recommend monitoring this metric during DPO training and applying early stopping on it to determine when the model has been trained for enough steps.
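The early-stopping recommendation can be sketched as a simple patience rule over the logged metric. This is one possible rule, not the procedure used in our experiments; the `patience` parameter, the function name, and the numbers in the usage example are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def best_checkpoint(
    history: List[Tuple[int, float]], patience: int = 3
) -> Tuple[Optional[int], float]:
    """Pick the training step at which rewards/accuracies peaked.

    `history` holds (step, rewards_accuracy) pairs in training order.
    Scanning stops once `patience` consecutive evaluations fail to beat
    the best value seen so far.
    """
    best_step: Optional[int] = None
    best_value = float("-inf")
    evals_since_best = 0
    for step, value in history:
        if value > best_value:
            best_step, best_value = step, value
            evals_since_best = 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                break  # the metric has stopped improving; stop training here
    return best_step, best_value


# Made-up numbers, for illustration only.
log = [(100, 0.55), (200, 0.61), (300, 0.66), (400, 0.64), (500, 0.65), (600, 0.63)]
step, value = best_checkpoint(log, patience=3)
```

In this toy log the selected checkpoint is step 300, the last step at which the metric improved before three evaluations in a row failed to beat it. A larger `patience` trades extra training time for robustness to noisy evaluations.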

Figure 3: The rewards/accuracies, average downstream task evaluation, and GPT-4 win-rate during training of the Pythia1.4B-DPO model.
Figure 4: The rewards/accuracies, average downstream task evaluation, and GPT-4 win-rate during training of the Pythia2.8B-DPO model.


In our investigation, Direct Preference Optimization (DPO) emerged as a highly effective approach for aligning models with human preferences. Its implementation and training processes are straightforward, making it a practical choice for model fine-tuning. Notably, the application of DPO not only enhanced the model’s proficiency in conversational tasks but also improved its performance in various other downstream tasks. A significant advantage of DPO is its ability to retain foundational knowledge from the original model. This retention ensures that the model continues to build upon its pre-existing knowledge base, even as it adapts to new preferences and tasks through DPO fine-tuning. 

Try out BTLM-Chat, a DPO fine-tuned version of the BTLM model, on Hugging Face.

Authors: Alexander Vishnevskiy, Yishi Xu, Daria Soboleva  


[1] Rafailov, Rafael, et al. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv:2305.18290, 29 May 2023.

[2] Bai, Yuntao, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862, 12 Apr. 2022.

[3] Biderman, Stella, et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv:2304.01373, 31 May 2023.

[4] Dey, Nolan, et al. BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model. arXiv:2309.11568, 20 Sept. 2023.

[5] Ouyang, Long, et al. Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155, 4 Mar. 2022.

[6] Zhou, Chunting, et al. LIMA: Less Is More for Alignment. arXiv:2305.11206, 18 May 2023.

[7] Wang, Guan, et al. OpenChat: Advancing Open-Source Language Models with Mixed-Quality Data. arXiv:2309.11235, 20 Sept. 2023.


The definition of rewards/accuracies

prob – probability, policy – policy model, reference – reference model
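One way to read this definition in code is sketched below, assuming the reward is the implicit DPO reward from [1]: beta times the difference between the log-probability of a response under the policy model and under the reference model. The dictionary keys and function names are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List


def implicit_reward(policy_logprob: float, reference_logprob: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log prob under policy - log prob under reference)."""
    return beta * (policy_logprob - reference_logprob)


def rewards_accuracy(batch: List[Dict[str, float]], beta: float = 0.1) -> float:
    """Fraction of pairs where the chosen response gets a higher reward than the rejected one.

    Each item holds four sequence log-probabilities under hypothetical keys:
    policy_chosen, reference_chosen, policy_rejected, reference_rejected.
    """
    if not batch:
        raise ValueError("batch must not be empty")
    correct = 0
    for ex in batch:
        chosen = implicit_reward(ex["policy_chosen"], ex["reference_chosen"], beta)
        rejected = implicit_reward(ex["policy_rejected"], ex["reference_rejected"], beta)
        if chosen > rejected:
            correct += 1
    return correct / len(batch)
```

A value of 0.5 corresponds to chance on balanced pairs, so a rise above 0.5 during training means the policy is moving away from the reference model in the direction of the labeled preferences.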