Pre-training is a deep learning technique in which a deep neural network (DNN) first learns useful features from raw, unlabeled data, typically through unsupervised or self-supervised objectives. Because the model arrives at downstream tasks with these features already in place, it can be trained with fewer labeled examples and adapted more efficiently to multiple tasks. This approach has become standard in natural language processing and computer vision. Compared to training solely on labeled datasets, pre-training lets deep neural networks absorb far more information, which translates into better performance on supervised tasks such as classification and regression.

Pre-training has become a critical step in successful deep learning applications. It allows a model to learn features from large datasets and adapt to a variety of tasks, so it performs better than a model trained from scratch. Pre-training also helps reduce overfitting, which matters for deep learning models given their high complexity and capacity for memorization. In addition, pre-trained models tend to be much faster to adapt, since they require fewer training iterations to converge on a new task. As a result, pre-training can deliver significant gains in accuracy and performance while saving time, compute, and cost compared to training from scratch.
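The two-phase workflow described above — learn general structure from abundant data first, then adapt with only a handful of labels — can be sketched with a toy model. Everything below (the synthetic data, the tiny linear model, the learning rate and step counts) is an illustrative assumption, not any particular framework's API:

```python
def grad_step(w, b, xs, ys, lr=0.05):
    # One gradient-descent step on mean squared error for the model y = w*x + b.
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * dw, b - lr * db

def train(w, b, xs, ys, steps):
    for _ in range(steps):
        w, b = grad_step(w, b, xs, ys)
    return w, b

# "Pre-training" phase: plentiful data from a related task (here y = 2x).
pretrain_xs = [x / 10 for x in range(-50, 50)]
pretrain_ys = [2 * x for x in pretrain_xs]
w_pre, b_pre = train(0.0, 0.0, pretrain_xs, pretrain_ys, 300)

# "Fine-tuning" phase: only three labeled examples from the target task (y = 2x + 1).
finetune_xs = [0.0, 1.0, 2.0]
finetune_ys = [2 * x + 1 for x in finetune_xs]
w_ft, b_ft = train(w_pre, b_pre, finetune_xs, finetune_ys, 500)
# The slope learned during pre-training is already close to the target slope,
# so fine-tuning mostly has to learn the new intercept from its three labels.
```

The pre-trained start matters precisely because the fine-tuning set is tiny: the model reuses structure learned from the larger dataset rather than recovering everything from three points.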

Cerebras Systems’ CS-2 is able to accelerate large GPT training by combining sparse pre-training with dense fine-tuning. Initial results show that GPT-3 XL can be pre-trained with up to 75% unstructured weight sparsity, using 2.5x fewer training FLOPs on the Cerebras CS-2, while dense fine-tuning preserves the evaluation metrics on many downstream tasks. This may be the first time a large GPT model has been pre-trained with this level of sparsity without significant loss in downstream task metrics. These initial findings show the promise of sparse training, and Cerebras Systems is motivated to explore more advanced sparse techniques for even further improvement on even larger models, enabled by the Cerebras CS-2.
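As a rough illustration of what 75% unstructured sparsity means at the weight level, the sketch below zeroes the smallest-magnitude three-quarters of a weight vector. Magnitude pruning is only a stand-in here — the source above does not specify which sparsification method Cerebras used — and the function name is hypothetical:

```python
def magnitude_prune(weights, sparsity=0.75):
    # Unstructured sparsity: rank individual weights by |magnitude| and zero
    # the smallest `sparsity` fraction, with no constraint on which positions
    # are dropped (unlike structured/block sparsity).
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

w = [0.4, -0.05, 1.2, 0.02, -0.9, 0.1, 0.03, -0.6]
sparse_w = magnitude_prune(w, 0.75)  # 6 of the 8 weights become exactly 0.0
```

A sparse kernel can skip the multiply–accumulates for the zeroed weights, which is where the training-FLOP savings come from; presumably the reported saving (2.5x rather than the full 4x at 75% sparsity) reflects that not every FLOP in training involves a sparse weight. Dense fine-tuning would then re-enable the zeroed weights so the full model capacity is available for downstream tasks.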