May 02, 2025
Don't be lazy: CompleteP enables compute-efficient deep transformers
arXiv, 2025
Nolan Dey*, Bin Claire Zhang*, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness
[arXiv]
February 21, 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
ICLR, 2025
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
November 18, 2024
Empirical Upper Bounds for Unstructured Sparsity in Compute-Efficient Language Modeling
Machine Learning and Compression NeurIPS Workshop, 2024
Esha Singh, Shane Bergsma, Nolan Dey, Joel Hestness, Gavia Gray
November 01, 2024
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Gavia Gray, Aman Tiwari, Shane Bergsma, Joel Hestness
[arXiv]
October 31, 2024
Sparse maximal update parameterization: A holistic approach to sparse training dynamics
October 13, 2024
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, Sean Lie
[arXiv]
September 04, 2024
Bilingual Adaptation of Monolingual Foundation Models
FM-Wild ICML Workshop, 2024
Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan, Rituraj Joshi, Avraham Sheinin, Zhiming (Charles) Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Onkar Pandit, Satheesh Katipomu, Samta Kamboj, Samujjwal Ghosh, Rahul Pal, Parvez Mullah, Soundar Doraiswamy, Mohamed El Karim Chami, Preslav Nakov
[arXiv] [OpenReview]
June 01, 2024
The practitioner's guide to the maximal update parameterization
Blog & open-source code, 2024
Nolan Dey, Quentin Anthony, Joel Hestness
May 20, 2024
MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
Vithursan Thangarasa, Mahmoud Salem, Shreyas Saxena, Kevin Leong, Joel Hestness, Sean Lie
[arXiv]
May 15, 2024
Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System
Kylee Santos, Stan Moore, Tomas Oppelstrup, Amirali Sharifian, Ilya Sharapov, Aidan Thompson, Delyan Z Kalchev, Danny Perez, Robert Schreiber, Scott Pakin, Edgar A Leon, James H Laros III, Michael James, Sivasankaran Rajamanickam
[arXiv]
May 15, 2024
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz
[arXiv]
November 30, 2023
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
Gavia Gray, Anshul Samar, Joel Hestness
November 13, 2023
Efficient Algorithms for Monte Carlo Particle Transport on AI Accelerator Hardware
John Tramm, Bryce Allen, Kazutomo Yoshii, Andrew Siegel, Leighton Wilson
[arXiv]
November 08, 2023
Position Interpolation Improves ALiBi Extrapolation
Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, Joel Hestness
[arXiv]
September 26, 2023
Scaling the “Memory Wall” for Multi-Dimensional Seismic Processing with Algebraic Compression on Cerebras CS-2 Systems
Hatem Ltaief, Yuxi Hong, Leighton Wilson, Mathias Jacquelin, Matteo Ravasi, David Keyes
September 22, 2023
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
Efficient Natural Language and Speech Processing NeurIPS Workshop, 2023
Nolan Dey*, Daria Soboleva*, Faisal Al-Khateeb, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming (Charles) Chen, Bowen Yang, Siyun Li, Abhay Gupta, Shreyas Saxena, Robert Myers, Jacob Robert Steeves, Marvin Tom, Joel Hestness
[arXiv] [Workshop Paper] [Blog] [Hugging Face] (1.08M downloads; 10th most popular text generation model in its first month)
August 31, 2023
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing
[arXiv]
May 22, 2023
Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning
April 07, 2023
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
March 22, 2023
Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
Vithursan Thangarasa, Shreyas Saxena, Abhay Gupta, Sean Lie
[arXiv]
March 21, 2023
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
January 20, 2023
Wafer-Scale Fast Fourier Transforms
November 23, 2022
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics
September 28, 2022
Disruptive Changes in Field Equation Modeling: A Simple Interface for Wafer Scale Engines
Mino Woo, Terry Jordan, Robert Schreiber, Ilya Sharapov, Shaheer Muhammad, Abhishek Koneru, Michael James, Dirk Van Essendelft
[arXiv]