One evening we received a request from our partner M42: could we fine-tune Llama-2 70B on a custom dataset they developed to see if it could pass the US Medical Licensing Examination (USMLE) sample questions? And could we do it by next week?

This was quite a novel task: we had not worked with this medical dataset or the USMLE benchmarks, nor had we fine-tuned Llama-2 70B before. An ML project with so many unknowns could take weeks to sort out; a fast turnaround would take a small miracle. Naturally, we said: “yes – we’ll take it on!”

LLMs For Healthcare

Large language models (LLMs) have a near-unlimited capacity to absorb and learn from troves of data, making them a powerful tool for medical diagnosis. In 2022, Google trained a model called Med-PaLM that scored 67% on USMLE sample questions – the first LLM to pass this exam. Most recently, its successor Med-PaLM 2 achieved 87%. While these are impressive results, Med-PaLM and other high-performing models are closed source and unavailable for public use. M42’s goal was to train a model whose performance rivals closed-access models such as Med-PaLM while releasing it with open access, thereby greatly broadening access to AI in healthcare.

Training & Data

The general approach to training an LLM for a specific domain is to take a pre-trained foundation model and fine-tune it on a specialized dataset. In this case, we selected Llama-2 70B, one of the largest and best-regarded open foundation models, as the base. For the dataset, M42 curated publicly available medical data, which was then further refined by medical personnel from M42’s global network of specialist healthcare providers.
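
To make the recipe concrete, here is a minimal sketch of domain fine-tuning with the Hugging Face Trainer API. The dataset path, sequence length, and hyperparameters are placeholders, and the actual Med42 run was executed on Cerebras hardware rather than with this script.

```python
# Illustrative sketch: pretrained base model + specialized dataset, via Hugging Face.
# The dataset file and hyperparameters below are placeholders, not the Med42 setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-70b-hf"          # pretrained foundation model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder for a curated medical fine-tuning dataset (one JSON record per example).
dataset = load_dataset("json", data_files="medical_finetune_data.jsonl")["train"]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="med42-ft", per_device_train_batch_size=1,
                           num_train_epochs=3, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```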

Fine-tuning a 70-billion-parameter model is no easy feat. In mixed-precision FP16 training, each parameter requires 14 bytes of memory: 2 bytes for the weight and 12 bytes for the Adam optimizer states (an FP32 master copy of the weight plus FP32 momentum and variance). 70 billion × 14 bytes ≈ 1 TB of memory just for parameters and optimizer state. Accounting for activation memory as well would require dozens of GPUs, plus the unpleasant work of model decomposition and memory management, which could take weeks to implement.
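
For reference, the arithmetic behind that estimate as a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the numbers above: FP16 weight (2 bytes) plus
# Adam optimizer state (12 bytes) per parameter, before gradients and activations.
params = 70e9
bytes_per_param = 2 + 12
total_tb = params * bytes_per_param / 1e12
print(f"{total_tb:.2f} TB")   # ~0.98 TB, i.e. roughly 1 TB
```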

A different approach is to trade accuracy for memory. One could, for example, use a memory-saving technique such as LoRA (Low-Rank Adaptation). With LoRA, the pretrained model weights are frozen and fine-tuning is done on a pair of small, low-rank matrices added to each weight matrix, greatly reducing the memory footprint. However, because the base weights are frozen, LoRA is not as expressive as full fine-tuning and accuracy suffers, especially on math problems. In short, fine-tuning a large model on GPUs requires careful consideration of memory, hardware, and accuracy tradeoffs.
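
For illustration, here is a minimal PyTorch sketch of the LoRA idea: the pretrained weight is frozen and only a low-rank update is trained. The layer size and rank are arbitrary examples, not the configuration used in this project.

```python
# Minimal LoRA sketch: freeze the pretrained weight W and learn a low-rank
# update B @ A, so only r * (d_in + d_out) parameters are trained per layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + scaled low-rank trainable update
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Example: wrap a 4096x4096 projection and count trainable parameters.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 4096 = 65,536 trainable vs. ~16.8M frozen parameters
```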

Fine-tuning models on Cerebras, on the other hand, couldn’t be more straightforward. For this run, we simply loaded the model and trained. Unlike any other AI accelerator, Cerebras CS-2s have memory disaggregated from compute, allowing us to build clusters and supercomputers with terabytes of memory. To fine-tune this model, we used an 8x CS-2 partition of the Condor Galaxy 1 (CG-1) AI supercomputer with 12 terabytes of memory, easily accommodating the roughly 1 terabyte of parameter and optimizer memory needed for Llama-2 70B. We spent zero time adapting the problem to conform to the hardware – training a 70B-parameter model on Cerebras is as trivial as training a 1B-parameter model on a GPU. We started the primary training run on Friday and completed training by Tuesday.

Project Timeline

Wednesday:

  • 10am: customer request comes in

Thursday:

  • 11am: M42 shares training data
  • 1pm: 70B-llama2-hf downloaded and converted to Cerebras format
  • 7pm: data is shuffled and ready
  • 9:40: 7B model training started

Friday:

  • 12:13am: 70B fine tuning run started

Tuesday:

  • 1:50 am: 70B fine tuning finished
  • 9:21am: Model converted to HuggingFace format and shared with M42

Wednesday:

  • Evaluation complete: model scores 72% on USMLE sample questions, beating Med-PaLM and GPT-3.5!

Model Performance

The trained model, called Med42, scores 72% on the USMLE sample exam, easily clearing the 60% passing grade. It outperforms ClinicalCamel-70B, the best open-source medical model, by 33%. Compared to closed models, its performance sits between GPT-3.5 and GPT-4, a remarkable achievement considering it uses a fraction of their parameters and training cost.
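
For readers who want to run this kind of evaluation themselves, below is an illustrative sketch of one common way to score a causal LLM on USMLE-style multiple-choice questions: pick the answer option with the highest log-likelihood under the model. This is a generic recipe, not M42’s actual evaluation harness, and the model identifier and question format are assumptions based on the published Med42 checkpoint.

```python
# Generic multiple-choice scoring sketch (not M42's official evaluation code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "m42-health/med42-70b"   # assumed: the Med42 checkpoint on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             device_map="auto")
model.eval()

def option_loglik(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given the question prompt."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)        # predicts next token
    answer_ids = full_ids[:, prompt_ids.shape[1]:]               # tokens of the option
    answer_logps = log_probs[0, prompt_ids.shape[1] - 1:, :].gather(
        -1, answer_ids[0].unsqueeze(-1))
    return answer_logps.sum().item()

def answer(question: str, options: list[str]) -> str:
    """Return the option the model assigns the highest log-likelihood."""
    return max(options, key=lambda o: option_loglik(question, o))

# Example usage (item format is an assumption, not the official USMLE format):
# answer("A 54-year-old man presents with ... What is the best next step?",
#        ["A. CT angiography", "B. Aspirin", "C. Echocardiography", "D. Heparin"])
```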

One year ago, there was only one model in the world that could pass the USMLE. It took a specialist team at Google, a 540-billion-parameter backbone model, and countless TPU-hours to train a model that scored 67% on the USMLE. Today, researchers in collaboration with Cerebras achieved the same milestone over a weekend using a fraction of the time and power, a testament to the efficiency of M42’s medical dataset, the Llama-2 70B base model, and the Cerebras AI training platform.

This case study showcases the power and ease of use of CG-1, one of the world’s fastest AI supercomputers. Built by Cerebras and G42, CG-1 trains and fine-tunes the largest and most innovative models out of the box, quickly and easily, with no modifications, high utilization, and modest power usage. Fine-tuning Llama-2 70B in under a week shows the simplicity and robustness of our solution. There’s no need to implement third-party frameworks or arcane memory-saving techniques: the model always fits on Cerebras, and fine-tuning just works at any model size. The ML expertise Cerebras has built training large models with our partners, including Cerebras-GPT, BTLM-3B-8K, and Jais, lets us deliver fast, accurate results almost always on the first run. If you’d like to work with us on your AI project, don’t hesitate to reach out.

Resources:

Med42 on HuggingFace