Generative AI networks such as large language models (LLMs) have grown exponentially in parameter count to achieve improved accuracy and multi-modal capabilities. New Enterprise use-cases can be unlocked by scaling LLMs to 70B parameters and beyond. However, a 70B model in FP32 requires over 280GB of memory, which is only available by spanning multiple accelerator cards. Further, the high compute requirements of large LLMs also necessitate multiple accelerator cards to achieve the desired performance. Consequently, practical deployment of LLMs for Enterprise use-cases must contend with the high cost of inference infrastructure. To enable Enterprise use-cases within practical deployment resource constraints, there is a strong need to improve the inference efficiency of LLMs, without impacting their capabilities, through inference-accelerator-aware training and finetuning, which can significantly reduce their memory and compute requirements. Cerebras and Qualcomm Technologies, Inc. are working together to enable cost-effective deployment of LLMs. This work will leverage cutting-edge techniques for training and finetuning LLMs on the Cerebras CS-2 and Wafer-Scale cluster to enable optimized inference on Qualcomm® Cloud AI 100.

This offering will provide several tools to adapt LLMs so they fully leverage the capabilities of the Qualcomm® AI accelerator. Each of these techniques on its own can deliver a 1.5x to 3x performance boost, and taken together they can deliver up to 10x acceleration.

    • Quantization techniques such as Microscaling (Mx) formats reduce the memory footprint of the model. As the decode stage of an LLM is typically memory-bandwidth bound, the footprint reduction is also accompanied by performance gains.
    • Speculative sampling improves performance by devising a draft model that is much smaller than the original LLM and consequently runs much faster. Speculative generation with the draft model, combined with validation of multiple tokens at a time by the target model, yields performance gains without any loss in accuracy.
    • Sparsity from pruning the model reduces the memory footprint and improves performance further by reducing memory bandwidth requirements. Sparsity also reduces the amount of computation that needs to be performed.
    • Network architecture search (NAS) is yet another approach that incrementally modifies the structure of the model to optimize it for a particular use-case and inference hardware.

As these approaches either modify the network parameters/architecture or utilize a smaller draft model, they require training to maintain the desired accuracy. Cerebras and Qualcomm Technologies aim to ease the deployment of these advanced techniques from training to inference to deliver unprecedented performance gains.

The following sections detail each of these techniques, their application and performance gains.

Microscaling Mx format for model parameters

In 2023, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. developed the Open Microscaling (Mx) Format Specification with the goal of creating and standardizing next-generation 8-, 6-, and 4-bit data types for AI training and inferencing (Qualcomm Cloud AI 100 Accelerates Large Language Model Inference by ~2x Using Microscaling (Mx) Formats – Qualcomm Developer Network). The Mx format is similar to a classical block floating-point format, with a common scaling factor shared across a block of k numbers. The Mx Specification 1.0 introduces four concrete floating-point and integer-based data formats, namely MXFP8, MXFP6, MXFP4, and MXINT8 (Table 1).

Table 1: Mx Specification 1.0. E and M refer to the number of exponent and mantissa bits, respectively. Block size refers to the number of elements in the block.

In the context of LLMs, FP32 or FP16 parameters can be “direct-cast” (compressed and quantized) into the reduced bit-width Mx formats. Consequently, depending on the Mx representation used, the memory footprint of the model can be reduced by a factor of ~4x (MXFP4) to ~2x (MXFP8). As the performance of the decode stage of the LLM is memory-bandwidth bound, the Mx representation also yields a similar scaling in effective bandwidth and performance.
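
For intuition, here is a minimal sketch of such a direct cast in the spirit of a block floating-point format like MXINT8: each block of weights shares a single power-of-two scale and stores its elements as 8-bit integers. The block size, scale selection, and rounding below are illustrative assumptions, not the official Mx Specification 1.0 encoding or the Qualcomm Cloud AI 100 compiler's implementation.

```python
import numpy as np

# Minimal sketch of an Mx-style "direct cast" (illustration only): each block
# of weights shares one power-of-two scale, and elements are stored as low-bit
# integers (INT8 here). Not the official Mx Specification encoding.

BLOCK_SIZE = 32          # elements per block (illustrative choice)
ELEM_BITS = 8            # MXINT8-like element width

def direct_cast_mxint8(weights: np.ndarray):
    """Quantize a 1-D FP32/FP16 weight vector into per-block (int8, scale) pairs."""
    w = weights.astype(np.float32)
    pad = (-len(w)) % BLOCK_SIZE
    w = np.pad(w, (0, pad))
    blocks = w.reshape(-1, BLOCK_SIZE)

    # Shared power-of-two scale per block, chosen from the block's max magnitude.
    max_mag = np.abs(blocks).max(axis=1, keepdims=True)
    max_mag[max_mag == 0] = 1.0
    qmax = 2 ** (ELEM_BITS - 1) - 1                      # 127 for int8
    scale = 2.0 ** np.ceil(np.log2(max_mag / qmax))      # one scale per block

    q = np.clip(np.round(blocks / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def decompress(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize back to FP16; at run time this corresponds to the
    on-the-fly decompression described below."""
    return (q.astype(np.float32) * scale).astype(np.float16).reshape(-1)

# Example: quantize random weights and check the reconstruction error.
w = np.random.randn(4096).astype(np.float32)
q, s = direct_cast_mxint8(w)
w_hat = decompress(q, s)[: len(w)]
print("max abs error:", np.abs(w - w_hat.astype(np.float32)).max())
```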

The Qualcomm Cloud AI 100 compiler provides an option that automatically compresses the weights from FP32 or FP16 into Mx format during the offline compilation phase. At inference run time, the generated binary performs on-the-fly decompression in the vector engine with an optimized software kernel. Decompression can be performed in parallel with the weight fetch and computations, so the overhead is mostly hidden. After decompression, the calculations are performed as before in FP16 precision. Using FP16 compute is acceptable because LLM performance is largely memory-bandwidth constrained, so compute is not the bottleneck. FP16 also retains higher-precision activations, which mitigates the accuracy loss from quantization. In our experience, direct casting from FP32 or FP16 to MxFP6 has only a nominal impact on network accuracy (>99% of FP32).

If direct casting to an Mx format is insufficient to maintain acceptable accuracy, Cerebras enables finetuning of the network to recover any appreciable accuracy loss. The Cerebras platform enables model retraining and finetuning on large-scale clusters with the ease of training on a single device. This unique capability makes it possible to complete finetuning in a matter of hours even on the largest SOTA models, specifically targeted at improving accuracy for the quantized inference model. The result is a streamlined, fast transition from training on Cerebras CS-2 clusters, through quantization and finetuning, to final optimized quantized inference on Qualcomm Cloud AI 100 without loss of accuracy.

Utilizing the Mx format for weights has been shown to give a 1.6x-2.2x improvement in network performance for LLMs with 2.7B to 70B parameters.

Speculative sampling with draft LLM model

The decoding performance of an LLM is strongly influenced by the size of the model. For low-latency use-cases (batch size equal to 1), the entire parameter set of the model must be fetched from off-chip memory to produce a single token. The performance of the LLM in tokens per second therefore decreases as the parameter count of the network increases.
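
To make the bandwidth argument concrete, the following back-of-the-envelope sketch estimates batch-size-1 decode throughput as memory bandwidth divided by the bytes read per token. The 400 GB/s bandwidth figure and the ~0.78 bytes-per-parameter estimate for an MXFP6-style format are illustrative assumptions, and real systems also pay for KV-cache traffic and compute.

```python
# Back-of-the-envelope decode-rate estimate for a memory-bandwidth-bound LLM
# at batch size 1: every parameter is fetched once per generated token.
# Illustration only; KV-cache reads, compute time, and overlap are ignored,
# and the bandwidth figure below is a placeholder, not a specific device spec.

def decode_tokens_per_second(n_params: float, bytes_per_param: float,
                             mem_bandwidth_gb_s: float) -> float:
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

# Example: a 70B-parameter model at a hypothetical 400 GB/s of DRAM bandwidth.
for fmt, bytes_per_param in [("FP16", 2.0), ("MXFP6-like (~6.25 bits)", 0.78)]:
    rate = decode_tokens_per_second(70e9, bytes_per_param, 400.0)
    print(f"{fmt}: ~{rate:.1f} tokens/s")
```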

Speculative sampling improves the performance of the target LLM by devising a much smaller draft model (typically 10x fewer parameters) that mimics the larger target model (Chen et al., 2023). As the draft model is much smaller than the target model, it runs faster, but its accuracy is likely to be lower. In a typical scenario the draft model is invoked “n” times to speculatively generate “n” tokens. The n tokens are then input to the larger target LLM to validate them as a plausible completion. The target LLM may accept only m (0 ≤ m ≤ n) of these tokens. Further, the target model generates an additional token. Thus, n invocations of the draft model followed by a single invocation of the target model can generate up to n + 1 tokens. The speculative sampling approach is guaranteed to have the same accuracy as the target model because all generated tokens are validated. A key metric of the approach is the acceptance rate of speculative tokens by the target model: the more speculative tokens the target model accepts, the higher the performance gains. Hence, co-designing the draft model along with finetuning of the foundational model for a particular task can synergistically provide higher speedup than using an off-the-shelf smaller draft model, which may not speculate completions with a high acceptance rate.
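
The following is a minimal sketch of this loop. It uses greedy acceptance for clarity rather than the rejection-sampling rule of Chen et al., 2023, and draft_model and target_model are placeholder callables (not a real API) that map a token sequence to one row of next-token logits per input position.

```python
import numpy as np

# Minimal sketch of speculative sampling with greedy acceptance (illustration
# only). `draft_model` and `target_model` are placeholders that return an
# array of next-token logits, one row per input position.

N_SPECULATIVE = 4  # "n": draft tokens generated per target invocation

def greedy(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

def speculative_step(tokens, draft_model, target_model):
    """One round: draft n tokens cheaply, then validate them with a single
    (more expensive) target-model pass. Returns the extended token list."""
    draft_tokens, ctx = [], list(tokens)
    for _ in range(N_SPECULATIVE):                    # n draft invocations
        t = greedy(draft_model(ctx)[-1])
        draft_tokens.append(t)
        ctx.append(t)

    # A single target pass scores every position of the speculated suffix.
    target_logits = target_model(list(tokens) + draft_tokens)

    accepted = []
    for i, t in enumerate(draft_tokens):
        expected = greedy(target_logits[len(tokens) + i - 1])
        if t == expected:
            accepted.append(t)                        # draft token validated
        else:
            accepted.append(expected)                 # correct it and stop
            return list(tokens) + accepted
    # All n draft tokens accepted: the same target pass also yields one
    # "bonus" token, so up to n + 1 tokens are emitted per round.
    bonus = greedy(target_logits[-1])
    return list(tokens) + accepted + [bonus]
```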

Figure 1: Illustration of speed-up due to speculative sampling when the draft LLM is executed 6 times, and all speculatively generated tokens are accepted by the target LLM.

The effectiveness of speculative sampling depends on balancing the draft model's size and accuracy against those of the target model. To achieve optimal performance, the draft model needs to be small enough to run much faster than the target model, but not so small that its accuracy produces a low acceptance rate. Conversely, the draft model's accuracy needs to be high enough to predict the target model well, but if this is achieved by making the draft model too large, it will be too slow to run ahead effectively. Finding this delicate balance of model size and accuracy can require extensive hyper-parameter search and a large amount of training to optimize for the hardware characteristics. Through the work between Cerebras and Qualcomm Technologies, users gain access both to the extreme performance of the Cerebras CS-2 and Wafer-Scale cluster for training and to state-of-the-art training techniques, allowing them to seamlessly produce custom-tuned draft and target models optimized for high-performance inference on the Qualcomm Cloud AI hardware architecture.

To demonstrate this capability, we used state-of-the-art techniques to train a 450M-parameter LLaMA2-style draft model and aligned it to the LLaMA2 7B target model for chat on the Cerebras CS-2 Wafer-Scale cluster. We then ran the draft and target models to demonstrate a speculative sampling speed-up of 1.8x on Qualcomm Cloud AI 100 Ultra.

Unstructured sparsity for model compression

Another key technique for reducing inference latency and cost is sparsity, which results from removing connections within the model. Neural network models are made up of layers of neurons and the connections between them, which are represented as weights in a collection of matrices. When there are missing connections, represented as zeros in the weight matrices, we refer to the model as sparse. For zero-valued weights, no computation or memory is strictly required during training or inference to achieve the correct result. However, only solutions designed for sparsity, such as the Cerebras CS-2 for sparse training and the Qualcomm Cloud AI 100 for sparse inference, can fully take advantage of sparsity for performance gains.

Neural networks that are already sparse can be naturally accelerated by these solutions. But what if the model is not naturally sparse, as with GPT-style LLMs? It turns out that even fully dense models, such as GPT, can be made sparse by inducing unstructured sparsity. In this form of sparsity, certain weights are set to zero, which effectively prunes the connections within the model, as shown below (Figure 2).

Figure 2: Weight sparsity applied to a dense neural network by zeroing weights prunes the connection between neurons in the model.

Although the original model was dense, inducing unstructured sparsity is a natural process because not all weights are created equal. In fact, the goal of training a model is ultimately to discover which weights are more important than others. When a dense model is fully trained, many of the resulting weights will naturally be very small in magnitude and thus less important than the high-magnitude weights. The nature of training itself suggests that unstructured sparsity is natural even in an otherwise dense model.
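
As a minimal sketch of that intuition, one-shot magnitude pruning zeroes the lowest-magnitude fraction of the weights. The matrix shape and 80% target below are arbitrary, and in practice (as described next) pruning is applied iteratively with retraining rather than in a single shot.

```python
import numpy as np

# Minimal sketch of magnitude-based unstructured pruning (illustration only):
# the smallest-magnitude weights are zeroed until the target sparsity is met.

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the lowest-magnitude fraction `sparsity` of the weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                  # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep weights above threshold
    return weights * mask

w = np.random.randn(4096, 4096).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.80)
print("achieved sparsity:", 1.0 - np.count_nonzero(w_sparse) / w_sparse.size)
```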

To address the high latency and cost of large GPT inference, it is possible to create sparse models that improve inference speed and lower cost by using fewer weights to compress the model. Since inference latency is typically memory-bandwidth bound, the compression results in higher performance on Qualcomm Cloud AI 100. Pruning models for inference is already a common practice in industry for small models, especially when running on edge devices and in computer vision. However, pruning is not yet widely adopted for large-scale GPT models because high-quality pruning requires extensive model training, and the cost on traditional hardware without sparse training acceleration is prohibitively high for large models.

Through the work between Cerebras and Qualcomm Technologies, users gain access to the only hardware capable of accelerating unstructured-sparsity training, i.e., the Cerebras CS-2 and Wafer-Scale cluster. Paired with state-of-the-art pruning and training techniques, they can seamlessly create sparse models optimized to achieve the highest performance on the Qualcomm Cloud AI platform, which is capable of sparse inference acceleration.

To demonstrate this capability, we used a combination of iterative sparse pruning and retraining on the Cerebras CS-2 Wafer-Scale cluster to prune the LLaMA2 13B model to 80% sparsity. We then ran this model to demonstrate a 2.5x inference speedup on Qualcomm Cloud AI 100 Ultra.

Once pruned, the weight matrix must be compressed so that the benefits of reduced bandwidth and size can be realized. Many techniques exist to compress a sparse matrix; they effectively remove the zeros and record metadata that allows the zeros to be restored during decompression. We use a delta column sparse compression (CSC) technique to minimize the amount of metadata that must be retained to reconstruct the original tensor. This technique works well for sparsity levels greater than about 60%.

The delta CSC format removes zeros from the columns of the weight matrix and records the number of zeros between consecutive sparse (non-zero) elements. With a 4-bit delta value, 0 to 15 zeros can be recorded between consecutive elements; if more than 15 zeros occur in a run, a “dummy” zero is kept in the sparse data. In addition to the delta counts, a count of the number of stored elements per column is kept to determine where each new column starts in the sparse data. As an example, with 80% sparse FP16 data compressed using 4-bit delta CSC, roughly 20 of every 100 weights are stored at 16 + 4 = 20 bits each versus 100 weights at 16 bits each, so a 4x compression ratio can be achieved, assuming well-distributed sparsity.
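
The toy sketch below walks through this compress/decompress round trip. It follows the description above but is a simplified illustration, not the exact layout or kernel format used on Qualcomm Cloud AI 100.

```python
import numpy as np

# Minimal sketch of the 4-bit delta CSC idea (illustration only): per column we
# keep the non-zero values, a 4-bit count of zeros preceding each kept value,
# and the number of kept values in that column. Runs of more than 15 zeros are
# broken by keeping a "dummy" zero element.

MAX_DELTA = 15  # largest gap representable with a 4-bit delta

def delta_csc_compress(dense: np.ndarray):
    values, deltas, col_counts = [], [], []
    for col in dense.T:                      # process one column at a time
        count, gap = 0, 0
        for x in col:
            if x == 0 and gap < MAX_DELTA:
                gap += 1                     # extend the run of zeros
            else:
                values.append(x)             # non-zero value, or a dummy zero
                deltas.append(gap)           # zeros preceding this element
                gap = 0
                count += 1
        col_counts.append(count)
    return (np.array(values, dtype=dense.dtype),
            np.array(deltas, dtype=np.uint8),
            np.array(col_counts, dtype=np.int32))

def delta_csc_decompress(values, deltas, col_counts, n_rows):
    cols, i = [], 0
    for count in col_counts:
        col = []
        for _ in range(count):
            col.extend([0] * int(deltas[i]))   # re-insert the skipped zeros
            col.append(values[i])
            i += 1
        col.extend([0] * (n_rows - len(col)))  # trailing zeros in the column
        cols.append(col)
    return np.array(cols, dtype=values.dtype).T

# Round-trip check on an 80%-sparse matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float16)
w[rng.random((64, 64)) < 0.8] = 0
v, d, c = delta_csc_compress(w)
assert np.array_equal(delta_csc_decompress(v, d, c, w.shape[0]), w)
```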

Decompression is efficiently implemented in the Qualcomm Cloud AI 100 vector engine. Multiple columns can be processed in parallel using SIMD. As zeros are re-inserted and tiles of weights are reconstructed, they are pipelined to the matrix engine for multiplication with the activations. The aggregate Qualcomm Cloud AI 100 decompression kernel throughput exceeds the bandwidth of the DRAM, so most workloads can take full advantage of the compression.

Further enhancements: network architecture search (NAS), distillation, compression

Microscaling Mx quantization, speculative sampling, and unstructured sparsity are highly effective optimizations offered through the joint efforts of Cerebras and Qualcomm Technologies. But we believe they are just the beginning. By further co-designing the model architecture on the training hardware along with the target model on the inference hardware, we can achieve even greater acceleration.

Traditionally, neural networks are constructed from manually defined layers, or sequences of layers called blocks, building up to a network architecture. Network architecture search (NAS) is an advanced technique that automates the design of optimal neural network architectures rather than relying on manually defined ones. It systematically and efficiently searches over a design space of architectures to find the optimal network for a particular task and dataset. Hardware-aware NAS (HW-NAS) additionally considers the capabilities of the inference hardware on which the models will eventually be deployed. NAS techniques can start from an existing baseline architecture (for example, an off-the-shelf model) and modify it “locally” in the design space, or they can build up a new architecture from scratch that may be completely different from the baseline model.

Deci.AI’s AutoNAC (Automated Neural Architecture Construction) is a data- and hardware-aware optimization algorithm that was used to create DeciCoder-6B, a generative code-completion LLM optimized for the Qualcomm Cloud AI 100 inference accelerator. DeciCoder-6B uses variable Grouped Query Attention (GQA), with the number of groups varying across the transformer layers of the model, which was found to give the best throughput for a given accuracy target for the coding use-case (Pass@1 score on HumanEval).
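
For intuition, the sketch below shows grouped-query attention in which each group of query heads shares one key/value head, with the number of key/value heads varying per layer. The dimensions, group schedule, and omission of masking and positional encodings are illustrative assumptions, not DeciCoder-6B's actual configuration.

```python
import numpy as np

# Minimal sketch of grouped-query attention (GQA) with a per-layer key/value
# head count (illustration only; no masking or positional encoding).

def gqa_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Single-layer GQA: each group of query heads shares one K/V head."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)

    group = n_q_heads // n_kv_heads           # query heads per shared K/V head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # which K/V head this query uses
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[:, h] = probs @ v[:, kv]
    return out.reshape(seq, d_model)

# A hypothetical "variable GQA" schedule: fewer K/V heads (smaller KV cache and
# less bandwidth) in some layers, more in others.
d_model, n_q_heads = 512, 8
kv_heads_per_layer = [1, 2, 4, 2]              # varies across transformer layers
rng = np.random.default_rng(0)
x = rng.standard_normal((16, d_model))
for n_kv in kv_heads_per_layer:
    wq = rng.standard_normal((d_model, d_model))
    wk = rng.standard_normal((d_model, n_kv * (d_model // n_q_heads)))
    wv = rng.standard_normal((d_model, n_kv * (d_model // n_q_heads)))
    x = gqa_attention(x, wq, wk, wv, n_q_heads, n_kv)
```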

Cerebras and Qualcomm Technologies will enable additional advanced techniques in the future, such as network architecture search, distillation, and increased model compression, each of which can individually provide an additional 2x or more inference acceleration. Distillation transfers model knowledge to a smaller model by training a small “student” model to mimic the large “teacher” baseline model, thereby compressing the large model. There are also further compression techniques, such as codebook compression, which can compress model weights even more. Cerebras and Qualcomm Technologies will make advanced techniques such as these available in a streamlined manner from training to inference.
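
As a rough illustration of that idea, the sketch below computes a common distillation objective: a temperature-softened KL term that pulls the student's token distribution toward the teacher's, blended with the ordinary cross-entropy loss. The temperature, blend weight, and shapes are illustrative assumptions, not a recipe used in this work.

```python
import numpy as np

# Minimal sketch of a distillation objective (illustration only): the student
# matches the teacher's softened token distribution, blended with the usual
# next-token cross-entropy loss.

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * CE."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1).mean() * T * T
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * kl + (1.0 - alpha) * ce

# Example with random logits over a tiny vocabulary.
rng = np.random.default_rng(0)
s, t = rng.standard_normal((8, 100)), rng.standard_normal((8, 100))
labels = rng.integers(0, 100, size=8)
print(distillation_loss(s, t, labels))
```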

Combination of approaches

The individual techniques can be combined for multiplicative gains in inference performance. For instance, we have combined MxFP6 quantization with speculative sampling to obtain a 3.8x speed-up in inference performance. Given that unstructured sparsity can deliver a 2.4x performance gain, combining the three techniques (MxFP6, sparsity, and speculative sampling) can deliver almost 10x speed-up (3.8x × 2.4x ≈ 9x). Integrating additional advanced techniques such as NAS improves performance further. Cerebras and Qualcomm Technologies are working together to enable effortless adoption of these advanced approaches from training to inference to deliver unprecedented performance gains.

Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Qualcomm: Karam Chatha, Natarajan Vaidhyanathan, Rishi Chaturvedi, Ravi Sivalingam and Colin Verrilli
Cerebras: Sean Lie, Abhay Gupta