Parallelization is the use of multiple processors to speed up computational tasks. It is an important tool for machine learning, as it allows more data to be processed in less time. By distributing computations across multiple processors, algorithms can learn from bigger datasets faster, often resulting in better performance on real-world problems. Furthermore, parallelization enables machines to complete complex calculations that would be impractically slow, or outright impossible, on a single processor. This makes it critical for organizations that need timely and accurate predictions based on large amounts of data. Parallelization also opens the door for new research possibilities by allowing scientists to dig deeper into existing datasets or run simulations at a scale that was previously out of reach. Ultimately, parallelizing machine learning algorithms will help organizations become smarter and more efficient.

There are three parallelization techniques for training state-of-the-art deep learning models:

  • Data parallel
  • Tensor model parallel
  • Pipeline model parallel

Data Parallel 

The simplest method for parallelizing work is called data parallel. Data parallel training is only possible if the model fits onto a single device – that is, the parameters fit in the device’s memory, and the largest matrix multiply in the largest layer can be done on that device. When running data parallel, multiple devices, each holding an identical copy of the model, are each presented a different portion of each training batch. The gradients computed on each device are then averaged, and every copy applies the same update. This is dead simple. And it is the go-to approach for all models smaller than two billion parameters.
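The averaging step above can be sketched in a few lines. This is a minimal, hypothetical NumPy simulation – the "devices" are just array shards, and the gradient averaging stands in for the all-reduce a real framework would perform:

```python
import numpy as np

# Hypothetical sketch: data-parallel training of a linear model y = x @ w on
# four simulated "devices". Each device holds an identical copy of w, computes
# a gradient on its own shard of the batch, and the per-device gradients are
# averaged (the all-reduce step) before every shared weight update.

np.random.seed(0)
n_devices = 4
batch, dim = 32, 8

x = np.random.randn(batch, dim)
true_w = np.random.randn(dim)
y = x @ true_w

w = np.zeros(dim)  # the replicated parameters

def local_grad(x_shard, y_shard, w):
    """Mean-squared-error gradient computed on one device's shard."""
    pred = x_shard @ w
    return 2.0 * x_shard.T @ (pred - y_shard) / len(y_shard)

for step in range(500):
    # Split the global batch evenly across devices.
    x_shards = np.array_split(x, n_devices)
    y_shards = np.array_split(y, n_devices)
    grads = [local_grad(xs, ys, w) for xs, ys in zip(x_shards, y_shards)]
    # All-reduce: average the gradients, then update every replica identically.
    w -= 0.1 * np.mean(grads, axis=0)
```

Because the shards are equal-sized, averaging the per-device gradients gives exactly the full-batch gradient, so every replica stays in sync.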

Tensor Model Parallel 

With models larger than two billion parameters, the parameters no longer fit in a single device’s memory and the largest matrix multiplies no longer fit on a single device. Instead, every single layer must be split and spread across several devices. This approach is called tensor model parallel. Even in the largest of networks, the 1 trillion parameter behemoth Megatron, which uses more than 2000 GPUs to train, its developers could only spread a single layer over a maximum of 8 GPUs.
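The idea of splitting a single layer can be sketched with plain NumPy. This is a hypothetical simulation, not how a framework implements it: the weight matrix of one layer is sharded across simulated "devices", each computes its partial product, and a collective (concatenate or sum) reassembles the layer's output:

```python
import numpy as np

np.random.seed(1)
n_devices = 2
x = np.random.randn(4, 6)   # activations, replicated on every device
w = np.random.randn(6, 8)   # one layer's weights, too big for one "device"

# Column-parallel split: device i stores only its slice of w's columns.
w_cols = np.array_split(w, n_devices, axis=1)

# Each device multiplies the full input by its own shard (in parallel)...
partials = [x @ ws for ws in w_cols]

# ...and an all-gather concatenates the partial outputs into the full result.
y_col = np.concatenate(partials, axis=1)

# Row-parallel alternative: split w along rows and x along columns; each
# device computes a partial product, and an all-reduce sums them.
x_shards = np.array_split(x, n_devices, axis=1)
w_rows = np.array_split(w, n_devices, axis=0)
y_row = sum(xs @ wr for xs, wr in zip(x_shards, w_rows))
```

Both shardings reproduce the full matrix multiply exactly; the communication pattern (all-gather versus all-reduce) is what differs.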

Pipeline Model Parallel 

Starting at 20 billion parameters, yet another form of parallelism is deployed, namely pipeline model parallel. In this mode, a sequential pipeline is formed in which the work for layer 1 is done on one GPU (or group of GPUs) and the work for layer 2 is done on a separate GPU (or group of GPUs). The batch is split into micro-batches that flow through the stages, so that later stages can work on one micro-batch while earlier stages start the next. This involves deciding which layers should be put on which devices and carefully measuring the latency between them.
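A minimal simulation of this schedule, again using NumPy with hypothetical stage functions standing in for GPU groups:

```python
import numpy as np

# Hypothetical sketch: a two-stage pipeline. Stage 0 ("GPU group A") owns
# layer 1's weights; stage 1 ("GPU group B") owns layer 2's. The global batch
# is split into micro-batches so both stages can be busy at once: while
# stage 1 processes micro-batch k, stage 0 is already processing k+1.

np.random.seed(2)
w1 = np.random.randn(8, 16)   # layer 1 weights, resident on stage 0
w2 = np.random.randn(16, 4)   # layer 2 weights, resident on stage 1

def stage0(x):
    return np.maximum(x @ w1, 0.0)   # layer 1 + ReLU

def stage1(h):
    return h @ w2                    # layer 2

x = np.random.randn(32, 8)
micro_batches = np.array_split(x, 4)

# Simulate the schedule sequentially: each micro-batch flows through both
# stages (a real system overlaps these steps on separate GPUs).
outputs = [stage1(stage0(mb)) for mb in micro_batches]
y_pipeline = np.concatenate(outputs)
```

The concatenated micro-batch outputs match a single full-batch forward pass; pipelining changes only the schedule, not the math.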

In sum, for models of fewer than two billion parameters, the easiest technique – data parallel – is sufficient. Models larger than two billion and smaller than 20 billion parameters require two simultaneous techniques: data parallel and the more complicated tensor model parallel. And models of 20 billion parameters and beyond require all three techniques – data parallel, tensor model parallel, and pipeline model parallel – to be implemented simultaneously. This combination of all three approaches is known as 3D parallelism.