GPT-J is an open-source language model developed by EleutherAI for natural language processing (NLP). It uses a deep neural network to generate text from a prompt, mimicking human writing style. Because GPT-J is trained on a vast amount of text data, it learns rich representations of language and can produce coherent, relevant continuations of the input. With GPT-J, users can create sophisticated pieces of writing with relatively little effort. Models of this kind have substantially changed how NLP is practiced, making far more effective natural language understanding possible than before. GPT-J has a wide range of potential applications, from content creation to automated customer support.

With GPT-J we have accessible weights and the ability to tune the model for specific domains and tasks. Fine-tuning is far less computationally expensive than pre-training: you typically fine-tune on a much smaller dataset, so the compute requirements for fine-tuning are a tiny fraction of those for pre-training. However, it is still cumbersome for a 6B-parameter model. These large-scale GPT models are commonly trained with the Adam optimizer, which stores two terms, momentum and variance, for every model parameter. When trained in mixed precision, the model typically needs a total of 16 bytes per parameter to store model weights, gradients and optimizer states. With 6B parameters this comes to 96GB. That means the model doesn’t fit into the memory of a single modern accelerator and requires multiple accelerators and complicated model-parallel distribution just to load the model for continued training or fine-tuning.
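The 96GB figure above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the text gives the 16-byte total, while the per-item breakdown (FP16 weights and gradients, plus FP32 master weights and the two Adam terms) is a commonly used accounting for mixed-precision training and is an assumption here. The names `BYTES_PER_PARAM` and `training_memory_gb` are hypothetical, and activation memory is not counted.

```python
# Assumed per-parameter byte breakdown for mixed-precision training with Adam.
# Only the 16-byte total comes from the text; the split is a common convention.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def training_memory_gb(num_params: int) -> float:
    """Memory for weights, gradients and optimizer state, in GB (10^9 bytes)."""
    bytes_per_param = sum(BYTES_PER_PARAM.values())
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # A 6B-parameter model such as GPT-J.
    print(f"{training_memory_gb(6_000_000_000):.0f} GB")
```

Changing the parameter count or the byte breakdown shows how quickly the training state outgrows the memory of a single device.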

Thanks to Cerebras Systems' groundbreaking weight streaming execution mode, the CS-2 has no such limitations. With our implementation of GPT-J, it is now easy to load a publicly available GPT-J checkpoint and tune the model on a single CS-2 with a custom domain-specific or task-specific dataset.