Abstract
Large Language Models (LLMs) have successfully transformed natural language processing tasks. Yet, their large size and high computational needs pose challenges for practical use, especially in resource-limited settings. Model compression has emerged as a key research area to address these challenges. This paper presents a survey of model compression techniques for LLMs. We cover methods such as quantization, pruning, and knowledge distillation, highlighting recent advancements. We also discuss benchmarking strategies and evaluation metrics crucial for assessing compressed LLMs. This survey offers valuable insights for researchers and practitioners, aiming to enhance the efficiency and real-world applicability of LLMs while laying a foundation for future advancements.
1 Introduction
Large Language Models (LLMs) (Touvron et al., 2023a, b; Zhang et al., 2022; Scao et al., 2022; Wang and Komatsuzaki, 2021; OpenAI, 2024) refer to Transformer language models that contain billions (or more) of parameters and are trained on massive text data. LLMs consistently exhibit remarkable performance across various tasks, but their exceptional capabilities come with significant challenges stemming from their extensive size and computational requirements. For instance, the GPT-175B model (Brown et al., 2020), with an impressive 175 billion parameters, demands a minimum of 350GB of memory in half-precision (FP16) format. Furthermore, deploying this model for inference necessitates at least five A100 GPUs, each featuring 80GB of memory, to efficiently manage operations. To tackle these issues, a prevalent approach known as model compression (Han et al., 2016) offers a solution. Model compression involves transforming a large, resource-intensive model into a compact version suitable for deployment on resource-constrained devices. Additionally, model compression can enhance LLM inference speed and optimize resource efficiency.
Our primary objective in this paper is to illuminate the recent strides made in model compression techniques tailored specifically for LLMs. We conduct an exhaustive survey of the methodologies, metrics, and benchmarks of model compression for LLMs. Figure 1 shows the taxonomy of model compression methods for LLMs, including quantization, pruning, knowledge distillation, and low-rank factorization. Figure 2 further shows the basic flow of these model compression methods for LLMs. Furthermore, our study sheds light on prevailing challenges and offers a glimpse into potential future research trajectories in this evolving field. We advocate for collaborative efforts within the community to pave the way for an ecologically conscious, all-encompassing, and sustainable future for LLMs. While there are previous surveys on neural network model compression (Li et al., 2023c), and the topic has been lightly discussed in prior surveys on LMs (Rogers et al., 2020) and LLMs (Zhao et al., 2023), our work is the first survey dedicated solely to model compression for LLMs.
2 Metrics and Benchmarks
2.1 Metrics
Model compression of LLMs can be measured using various metrics, which capture different aspects of performance. These metrics are commonly presented alongside accuracy and zero-shot ability to comprehensively evaluate the LLM.
Model Size of an LLM is typically measured by its total number of parameters. In general, LLMs with more parameters require more computational resources and memory for both training and inference.
Floating Point Operations (FLOPs) is an indicator of the computational cost of LLMs, representing the number of floating-point operations required for the LLM to process a single instance. In model compression, reducing FLOPs helps the LLM run faster and more efficiently.
Mean FLOPS Utilization (MFU) quantifies the practical efficiency of computational resource utilization by LLMs during tasks. MFU measures the ratio of the FLOPS actually achieved by the LLM to the maximum theoretical FLOPS of a device. Unlike FLOPs, which estimates the operations an LLM needs to perform, MFU assesses the actual effectiveness of resource use during operation. Essentially, while FLOPs measures an LLM's theoretical compute needs, MFU shows how effectively these computations are utilized in practice.
Inference Time (i.e., latency) measures the time taken by the LLM to process input data and generate responses during inference. Inference time is particularly crucial for real-world applications where the LLM needs to respond to user queries or process large amounts of data in real time.
Speedup Ratio measures how much faster a compressed LLM performs tasks compared to the original LLM. Specifically, it measures the ratio of the inference time of the uncompressed model over the inference time of the compressed model. Higher ratios mean greater efficiency and reduced computation time, highlighting effective compression.
Compression Ratio measures how much an LLM's size is reduced through compression, calculated as the original size divided by the compressed size. Higher ratios mean greater size reduction, showing the compression's effectiveness in saving storage and memory.
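As a simple illustration of the speedup ratio, compression ratio, and MFU defined above, the following sketch computes them from hypothetical measurements; the model sizes, latencies, and FLOPS figures are placeholders rather than results from any paper.

```python
# Illustrative computation of the metrics defined in Section 2.1 (placeholder numbers).
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Original model size divided by compressed model size."""
    return original_size / compressed_size

def speedup_ratio(original_latency: float, compressed_latency: float) -> float:
    """Inference time of the uncompressed model over that of the compressed model."""
    return original_latency / compressed_latency

def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Ratio of FLOPS actually achieved to the device's theoretical peak."""
    return achieved_flops_per_s / peak_flops_per_s

# Example: a 7B-parameter model stored in FP16 (2 bytes/parameter) vs. INT4 (0.5 bytes/parameter).
print(compression_ratio(7e9 * 2, 7e9 * 0.5))   # 4.0x smaller
print(speedup_ratio(0.48, 0.21))               # ~2.3x faster (illustrative latencies in seconds)
print(mfu(120e12, 312e12))                     # ~0.38 on a device with a 312 TFLOPS peak
```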
2.2 Benchmarks and Datasets
The main goal of these benchmarks and datasets is to measure the efficiency and performance of compressed LLMs in comparison to their uncompressed counterparts. These benchmarks and datasets typically consist of diverse tasks and datasets that cover a range of natural language processing challenges.
2.2.1 Common Benchmarks and Datasets
The majority of research evaluates compressed LLMs on well-established NLP benchmarks and datasets. For instance, WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1993) are designed for evaluating the perplexity performance of language models. LAMBADA (Paperno et al., 2016), PIQA (Tata and Patel, 2003), and OpenBookQA (Mihaylov et al., 2018) are designed to evaluate the zero-shot ability of language models. GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021) are designed to evaluate the reasoning ability of language models.
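To illustrate how perplexity on WikiText-2 is typically measured, the sketch below scores a causal LM with the Hugging Face transformers and datasets libraries. The checkpoint name and the 1024-token non-overlapping window are placeholder choices; actual evaluations may use different window sizes or sliding strides.

```python
# Minimal WikiText-2 perplexity sketch (placeholder checkpoint and window size).
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # placeholder; swap in the (compressed) LLM under test
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
max_len = 1024                                      # evaluation window (placeholder choice)

nll_sum, n_tokens = 0.0, 0
for i in range(0, ids.size(1), max_len):            # non-overlapping windows
    chunk = ids[:, i : i + max_len]
    if chunk.size(1) < 2:
        continue
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss      # shifted cross-entropy, averaged over predicted tokens
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"WikiText-2 perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```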
2.2.2 BIG-Bench
BIG-Bench (BBH) (Srivastava et al., 2023) is a benchmark suite designed for LLMs, covering over 200 NLP tasks, e.g., text comprehension, inference, and mathematical reasoning tasks. The aim of BBH is to evaluate the performance of LLMs across these various complex tasks. Compressed LLMs use BBH to measure their capabilities across a multidimensional spectrum of tasks.
2.2.3 Unseen Instructions Datasets
Unseen instructions datasets are used to evaluate the performance of LLMs on unseen tasks. For instance, the Vicuna-Instructions (Zheng et al., 2023) dataset created by GPT-4 includes 80 complex questions across nine different categories like generic, knowledge-based, and writing tasks. Another dataset, User-Oriented-Instructions (Wang et al., 2023d), consists of 252 carefully selected instructions inspired by various user-focused applications such as Grammarly, StackOverflow, and Overleaf. These datasets evaluate how well compact LLMs can handle and carry out new tasks by presenting them with unfamiliar instructions.
2.2.4 EleutherAI LM Harness
The EleutherAI LM Harness (Gao et al., 2023) is an advanced framework for evaluating LLMs, providing a unified testing platform that supports over 60 standard academic benchmarks along with hundreds of subtasks and variants. The standardized evaluation tasks provided by the harness ensure the reproducibility and comparability of evaluation, which is essential for implementing fair and reproducible evaluations for the compressed LLMs.
3 Quantization
Quantization (Gray and Neuhoff, 1998) refers to the process of reducing the number of bits (i.e., precision) in the parameters of the model with minimal loss in inference performance. Quantization can be categorized into two main approaches: Quantization-Aware Training (QAT), and Post-Training Quantization (PTQ). The primary distinction between the two approaches lies in whether retraining is needed during quantization. PTQ enables direct use of quantized models in inference, while QAT requires retraining to rectify errors introduced by quantization. Table 1 shows the performance of many representative LLM quantization methods.
Table 1: Performance of representative LLM quantization methods.

| Category† | Methods | LLM | Weight Bits | Activation Bits | KV Cache Bits | PPL Diff.‡ (WikiText-2) | PPL Diff.‡ (C4) | Speedup |
|---|---|---|---|---|---|---|---|---|
| QAT | LLM-QAT | LLaMA-30B | 4 | 8 | 16 | 0.5 | 0.9 | – |
| QAT | BitDistiller | LLaMA2-13B | 2 | 16 | 16 | 1.9 | – | – |
| QAT | OneBit | LLaMA-13B | 1 | 16 | 16 | 4.09 | 3.64 | – |
| Weight-Only Quantization | LUT-GEMM | LLaMA-65B | 3 | 16 | 16 | 0.14 | – | 2.04× |
| Weight-Only Quantization | SqueezeLLM | LLaMA-13B | 3 | 16 | 16 | 0.51 | 0.67 | 2.4× |
| Weight-Only Quantization | GPTQ | OPT-175B | 3 | 16 | 16 | 0.34 | 0.23 | 3.24× |
| Weight-Only Quantization | AWQ | LLaMA2-70B | 3 | 16 | 16 | 0.42 | – | 3.2× |
| Weight-Only Quantization | OWQ | LLaMA-65B | 3.01 | 16 | 16 | 0.72 | – | – |
| Weight-Only Quantization | SpQR | LLaMA-30B | 3.89 | 16 | 16 | 0.15 | 0.1 | 2.0× |
| Weight-Only Quantization | QuIP | LLaMA2-70B | 2 | 16 | 16 | 3.007 | 3.228 | – |
| Weight-Activation Quantization | ZeroQuant | GPT-J-6B | 8 | 8 | 16 | 0.16 | – | 3.67× |
| Weight-Activation Quantization | LLM.int8() | OPT-13B | 8 | 8 | 16 | – | 0.00 | 1.22× |
| Weight-Activation Quantization | SmoothQuant | OPT-175B | 8 | 8 | 16 | 0.18 | – | 1.56× |
| Weight-Activation Quantization | RPTQ | OPT-175B | 4 | 4 | 16 | 2.26 | 2.15 | – |
| Weight-Activation Quantization | OliVe | BLOOM-7B | 4 | 4 | 16 | 2.11 | 2.24 | 4.5× |
| Weight-Activation Quantization | OS+ | LLaMA-65B | 4 | 4 | 16 | 5.77 | – | – |
| Weight-Activation Quantization | QT | OPT-1.3B | 8 | 8 | 16 | 17.74 | – | – |
| Weight-Activation Quantization | ZeroQuant-FP | LLaMA-30B | 4 | 8 | 16 | 0.18 | 0.13 | – |
| Weight-Activation Quantization | OmniQuant | LLaMA-7B | 4 | 6 | 16 | 0.41 | 0.55 | – |
| KV Cache Quantization | KVQuant | LLaMA-65B | 16 | 16 | 2 | 0.19 | 0.11 | 1.4× |
| KV Cache Quantization | WKVQuant | LLaMA-13B | 4 | 16 | 4 | 0.12 | 0.14 | – |
†: The results presented in the table are solely derived from the original papers.
‡: Perplexity difference = (perplexity of the quantized LLM) − (perplexity of the original LLM).
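To make the notion of reducing bit width concrete, the following sketch applies plain round-to-nearest, symmetric per-channel quantization to a weight matrix. It is a generic baseline rather than any of the methods in Table 1, and the matrix sizes are arbitrary.

```python
# Generic round-to-nearest (RTN) weight quantization sketch: quantize to n_bits, then dequantize.
import torch

def quantize_rtn(w: torch.Tensor, n_bits: int = 4):
    qmax = 2 ** (n_bits - 1) - 1                       # e.g., 7 for 4-bit signed integers
    scale = w.abs().amax(dim=1, keepdim=True) / qmax   # one scale per output channel (row)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_rtn(w, n_bits=4)
w_hat = dequantize(q, scale)
print("mean squared quantization error:", torch.mean((w - w_hat) ** 2).item())
```

Practical PTQ methods refine this baseline, e.g., with finer-grained group-wise scales or calibration-driven weight updates, as discussed in Section 3.2.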
3.1 Quantization-Aware Training
QAT involves retraining a quantized model to counteract the performance degradation caused by quantization. For instance, LLM-QAT (Liu et al., 2023b) applies the standard QAT framework directly to LLMs. LLM-QAT distills knowledge by generating data from the LLM itself and trains the quantized LLM to align with the output distribution of the original LLM on the generated data. BitDistiller (Du et al., 2024) merges QAT with self-distillation, enhancing LLM performance at sub-4-bit precision. It employs tailored asymmetric quantization, clipping, and a Confidence-Aware Kullback-Leibler Divergence objective for faster convergence and superior results. OneBit (Xu et al., 2024) introduces a novel 1-bit parameter representation and an effective parameter initialization method to implement 1-bit quantization for LLM weight matrices, paving the way for extremely low bit-width deployment of LLMs.
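The sketch below illustrates the generic QAT recipe rather than LLM-QAT, BitDistiller, or OneBit specifically: weights are fake-quantized in the forward pass and the straight-through estimator passes gradients to the full-precision weights, so retraining can compensate for quantization error.

```python
# Generic QAT sketch with fake quantization and a straight-through estimator (STE).
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, n_bits=4):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # STE: pass the gradient straight through to the weights

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuantSTE.apply(self.weight, 4)      # quantize on the fly in the forward pass
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(512, 512)
out = layer(torch.randn(8, 512))
out.pow(2).mean().backward()              # gradients flow to the full-precision weights
print(layer.weight.grad.shape)
```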
While QAT can mitigate the accuracy degradation caused by quantization, retraining LLMs with tens or hundreds of billions of parameters demands substantial effort. A practical solution is to incorporate Parameter-Efficient Fine-Tuning (PEFT) into the retraining process of QAT. Currently, methods like QLoRA (Dettmers et al., 2023), PEQA (Kim et al., 2023a), and LoftQ (Li et al., 2023a) combine quantization with PEFT for fine-tuning efficiency. However, these methods are typically task-dependent. L4Q (Jeon et al., 2024) makes a preliminary attempt to enhance generality by leveraging LoRA-wise learned quantization step sizes for LLMs. We think that introducing PEFT to enhance QAT efficiency is not only feasible but also holds significant promise, warranting thorough exploration.
3.2 Post-Training Quantization
PTQ efficiently converts a full-precision LLM to low precision without retraining, saving memory and computational costs. We categorize PTQ for LLMs into three groups: Weight-Only Quantization, Weight-Activation Quantization, and KV Cache Quantization. The disparity between these groups lies in their quantization objectives. Weight-only quantization focuses solely on quantizing weights, whereas weight-activation quantization extends its objective to both weights and activations. Prior research (Yao et al., 2023) indicates that activation quantization is typically more sensitive than weight quantization, allowing weight-only quantization to reach lower bit widths. However, since quantized weights must be dequantized before multiplication with activations, weight-only quantization inevitably introduces additional computational overhead during inference and cannot enjoy the accelerated low-bit operations supported by specific hardware. Furthermore, KV cache quantization targets the KV cache, which stores the keys and values of attention layers. The KV cache often consumes a large amount of memory, acting as a bottleneck for inputs containing long token sequences. By implementing KV cache quantization, it is possible to increase throughput and accommodate longer inputs more efficiently.
3.2.1 Weight-Only Quantization
Weight-only quantization is the most conventional and widespread method. For example, LUT-GEMM (Park et al., 2024) uses the binary-coding quantization (BCQ) (Rastegari et al., 2016) format, which factorizes the parameters of LLMs into binary parameters and a set of scaling factors, to accelerate quantized matrix multiplications in weight-only quantization. GPTQ (Frantar et al., 2023) proposes a layer-wise quantization method based on Optimal Brain Quantization (OBQ) (Frantar and Alistarh, 2022), which updates weights with inverse Hessian information, and quantizes LLMs to 3/4-bit precision. QuIP (Chee et al., 2023) optimally adjusts weights by utilizing the LDL decomposition of the Hessian matrix derived from vectors drawn uniformly at random from a calibration set, and multiplies the weight and Hessian matrices with a Kronecker product of random orthogonal matrices to ensure incoherence between them. Combining these two steps, QuIP successfully quantizes LLMs to 2 bits with minimal performance loss.
To further minimize quantization errors in weight-only quantization of LLMs, many studies identify sensitive weights, i.e., those with an outsized effect on quantization performance, and store them in high precision. For example, AWQ (Lin et al., 2023) stores the top 1% of weights that have the most significant impact on LLM performance in high precision, and integrates a per-channel scaling method to identify optimal scaling factors, where "channel" denotes individual dimensions or feature maps within the model. Similar to AWQ, OWQ (Lee et al., 2024) stores weights sensitive to activation outliers in high precision and quantizes the remaining non-sensitive weights. Different from OWQ, SpQR (Dettmers et al., 2024) employs the L2 error between the original and quantized predictions as its weight sensitivity metric. Furthermore, SqueezeLLM (Kim et al., 2023b) introduces a sensitivity-based weight clustering algorithm that uses k-means centroids as quantized weight values, where sensitivity is approximated by the Hessian of the weights. SqueezeLLM then stores sensitive weights in an efficient sparse format and quantizes the remaining weights. SqueezeLLM quantizes LLMs to 3 bits and achieves more than a 2× speedup over the FP16 baseline.
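The following is a simplified mixed-precision sketch of the "sensitive weights" idea discussed above, not the exact AWQ, OWQ, or SpQR algorithms: score input columns by the average magnitude of their calibration activations, keep the top 1% of columns in full precision, and round-to-nearest quantize the rest. Shapes and thresholds are illustrative.

```python
# Simplified mixed-precision quantization keeping "sensitive" columns in full precision.
import torch

def mixed_precision_quant(w: torch.Tensor, x: torch.Tensor, n_bits: int = 3, keep_frac: float = 0.01):
    # w: (out_features, in_features); x: (num_tokens, in_features) calibration activations
    importance = x.abs().mean(dim=0)                           # per-input-channel activation magnitude
    k = max(1, int(keep_frac * w.shape[1]))
    keep = torch.topk(importance, k).indices                   # columns kept in high precision

    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    w_q[:, keep] = w[:, keep]                                  # restore sensitive columns
    return w_q

w, x = torch.randn(1024, 1024), torch.randn(256, 1024)
print(torch.mean((w - mixed_precision_quant(w, x)) ** 2).item())
```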
3.2.2 Weight-Activation Quantization
Alongside studies centered on weight-only quantization in LLMs, there is a plethora of research focusing primarily on weight-activation quantization in LLMs. For example, ZeroQuant (Yao et al., 2022) is the first work to implement weight-activation quantization for LLMs, which uses group-wise quantization for weight and token-wise quantization for activations, and reduces the precision for weights and activations of LLMs to INT8.
LLMs have outliers in their activations, and their performance declines considerably if these outlier activations are directly quantized. Recent studies treat these outliers specially to reduce quantization errors in weight-activation quantization. For example, LLM.int8() (Dettmers et al., 2022) keeps outlier feature dimensions in high precision and uses vector-wise quantization, which assigns separate normalization constants to each inner product within the matrix multiplication, to quantize the remaining features. LLM.int8() quantizes the weights and activations of LLMs to 8 bits without performance degradation. SmoothQuant (Xiao et al., 2023) designs a per-channel scaling transformation to smooth activation outliers, based on the observation that different tokens exhibit similar variations across activation channels. RPTQ (Yuan et al., 2023a) finds that the range of values varies greatly between channels, and integrates a channel reordering method, which clusters and reorders the channels in the activation and uses the same quantization parameters for the values in each cluster, into layer normalization and linear layer weights to efficiently reduce the effect of numerical range differences between channels. OliVe (Guo et al., 2023) argues that outliers are more important than normal values, and uses outlier-victim pair (OVP) quantization to handle outlier values locally with low hardware overhead and significant performance benefits. OS+ (Wei et al., 2023) further finds that outliers are concentrated in specific and asymmetric channels. Based on these findings, OS+ incorporates channel-wise shifting to eliminate the impact of asymmetry and channel-wise scaling to balance the distribution of outliers. LLM-FP4 (Liu et al., 2023a) uses floating-point formats (specifically FP8 and FP4) to address the limitations of traditional integer quantization (such as INT8 and INT4) in dealing with outliers. Furthermore, LLM-FP4 points out that the exponent bits and clipping range are important factors that affect the performance of FP quantization, and introduces a search-based framework for determining the optimal exponent bias and maximal quantization value. OmniQuant (Shao et al., 2024b) handles activation outliers by equivalently shifting the challenge of quantization from activations to weights, and optimizes the clipping threshold to adjust the extreme values of the weights.
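The per-channel scaling idea behind SmoothQuant can be sketched as follows. The α = 0.5 value and tensor shapes are illustrative; the actual method additionally folds the activation scaling into preceding layers offline so that no extra runtime cost is incurred.

```python
# SmoothQuant-style smoothing sketch: migrate quantization difficulty from activations to weights
# via Y = (X diag(s)^-1)(diag(s) W), with s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
import torch

def smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    # x: (tokens, in_features); w: (in_features, out_features)
    act_max = x.abs().amax(dim=0)              # per-input-channel activation range
    w_max = w.abs().amax(dim=1)                # per-input-channel weight range
    s = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
    return x / s, w * s.unsqueeze(1)           # same product, but smoother activations

x, w = torch.randn(128, 768) * 10, torch.randn(768, 768)
x_s, w_s = smooth(x, w)
print(torch.allclose(x @ w, x_s @ w_s, rtol=1e-4, atol=1e-3))   # the product is (numerically) unchanged
```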
3.2.3 KV Cache Quantization
With the increasing number of input tokens supported by LLMs, the memory usage of the KV cache also increases. Recent efforts therefore focus on KV cache quantization to reduce the memory footprint of LLMs and accelerate their inference. For example, KVQuant (Hooper et al., 2024) proposes several KV cache quantization techniques, such as Per-Channel Key Quantization, Pre-RoPE Key Quantization, and Non-Uniform KV Cache Quantization, to enable LLM inference with a context length of 10 million tokens. Through an in-depth analysis of the element distribution within the KV cache, KIVI (Liu et al., 2024) finds that key caches should be quantized per channel, while value caches should be quantized per token. KIVI thereby succeeds in quantizing the KV cache to 2 bits without fine-tuning. WKVQuant (Yue et al., 2024) presents an innovative approach for quantizing LLMs by integrating past-only quantization to refine attention computations, employing a two-dimensional quantization strategy to manage the distribution of key/value (KV) caches effectively, and utilizing cross-block reconstruction regularization for parameter optimization. This method enables the quantization of both weights and the KV cache, resulting in memory savings that rival those of weight-activation quantization while nearly matching the performance of weight-only quantization.
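A simplified sketch of the per-channel (keys) versus per-token (values) quantization granularity discussed above follows. It uses a plain asymmetric integer quantizer and illustrative tensor shapes, and omits the additional techniques of the actual KVQuant and KIVI methods.

```python
# Sketch of KV cache quantization granularity: per-channel for keys, per-token for values.
import torch

def quantize_asym(t: torch.Tensor, dim: int, n_bits: int = 2):
    qmax = 2 ** n_bits - 1
    t_min = t.amin(dim=dim, keepdim=True)
    scale = (t.amax(dim=dim, keepdim=True) - t_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((t - t_min) / scale), 0, qmax)
    return q.to(torch.uint8), scale, t_min

def dequantize(q, scale, zero):
    return q.float() * scale + zero

keys = torch.randn(1, 4096, 128)     # (batch, seq_len, head_dim)
values = torch.randn(1, 4096, 128)
k_q = quantize_asym(keys, dim=1)     # per-channel: statistics shared along the sequence axis
v_q = quantize_asym(values, dim=2)   # per-token: statistics shared along the feature axis
print(torch.mean((keys - dequantize(*k_q)) ** 2).item())
```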
4 Pruning
Pruning (LeCun et al., 1989) is a powerful technique to reduce the size or complexity of a model by removing redundant components. Pruning can be divided into Unstructured Pruning, Semi-Structured Pruning, and Structured Pruning. Structured pruning removes entire components like neurons, attention heads, or layers based on specific rules while preserving the overall network structure. On the other hand, unstructured pruning prunes individual parameters, resulting in an irregular sparse structure. Semi-structured pruning is a method that lies between structured pruning and unstructured pruning, capable of achieving fine-grained pruning and structural regularization simultaneously. It prunes partial parameters based on specific patterns rather than entire channels, filters, or neurons, making it a fine-grained form of structured pruning. Table 2 shows the performance of many representative LLM pruning methods.
Table 2: Performance of representative LLM pruning methods.

| Category† | Methods | LLM | PPL Diff.‡ (WikiText-2) | Compression Rate | Speedup |
|---|---|---|---|---|---|
| Unstructured Pruning | SparseGPT | OPT-175B | −0.14 | 50% | – |
| Unstructured Pruning | Wanda | LLaMA-65B | 1.01 | 50% | – |
| Unstructured Pruning | SAMSP | LLaMA2-13B | 0.63 | 50% | – |
| Unstructured Pruning | DSnoT | LLaMA-65B | 2.08e4 | 90% | – |
| Structured Pruning | LLM-Pruner | LLaMA-13B | 3.6 | 20% | – |
| Structured Pruning | Shortened LLaMA | LLaMA-7B | 10.5 | 35% | – |
| Structured Pruning | FLAP | LLaMA-65B | 7.09 | 50% | – |
| Structured Pruning | SliceGPT | LLaMA2-70B | 1.73 | 30% | 1.87× |
| Semi-Structured Pruning | E-Sparse | LLaMA-65B | 2.13 | 2:4 | 1.53× |
| Semi-Structured Pruning | SparseGPT | OPT-175B | 0.39 | 2:4 | 2× |
| Semi-Structured Pruning | Wanda | LLaMA-65B | 2.69 | 2:4 | 1.24× |
†: The results presented in the table are solely derived from the original papers.
‡: Perplexity difference = (perplexity of the pruned LLM) − (perplexity of the original LLM).
4.1 Unstructured Pruning
Unstructured pruning largely preserves the pruned model's performance; hence, works on unstructured pruning of LLMs often dispense with retraining to restore performance. Nevertheless, unstructured pruning renders the pruned model irregular, necessitating specialized handling or software optimizations to accelerate inference. An innovative approach in this domain is SparseGPT (Frantar and Alistarh, 2023), which introduces a one-shot pruning strategy that requires no retraining. SparseGPT frames pruning as an extensive sparse regression problem and solves it with an approximate sparse regression solver. SparseGPT achieves significant unstructured sparsity, up to over 50% on the largest GPT models such as OPT-175B and BLOOM-176B, with minimal increase in perplexity. To reduce the cost of the weight update process required by SparseGPT, Wanda (Sun et al., 2024) achieves model sparsity by pruning weights with the smallest magnitudes multiplied by the norms of the corresponding input activations, without any retraining or weight updates. To further minimize pruning-induced errors while upholding the desired overall sparsity level, SAMSP (Shao et al., 2024a) utilizes the Hessian matrix as a metric for weight matrix sensitivity and dynamically adjusts sparsity allocation based on sensitivity. Furthermore, DSnoT (Zhang et al., 2024) minimizes the reconstruction error between dense and sparse models through iterative weight pruning-and-growing on top of sparse LLMs, enhancing LLM performance across various sparsity rates, especially at high sparsity levels. To provide hardware support for unstructured pruning on GPU Tensor Cores, Flash-LLM (Xia et al., 2023) introduces an unstructured sparse matrix multiplication method, which loads weight matrices in a sparse format from global memory and reconstructs them in a dense format within high-speed on-chip buffers for computation using Tensor Cores.
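The Wanda criterion described above is simple enough to sketch directly: score each weight by its magnitude times the L2 norm of the corresponding input activation, then zero out the lowest-scoring half within each output row. The tensor sizes and calibration activations below are placeholders.

```python
# Wanda-style unstructured pruning sketch: score = |W| * ||X||_2, pruned per output row.
import torch

def wanda_prune(w: torch.Tensor, x: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    # w: (out_features, in_features); x: (num_tokens, in_features) calibration activations
    score = w.abs() * x.norm(p=2, dim=0)                 # broadcast per-input-channel activation norm
    k = int(sparsity * w.shape[1])
    idx = torch.argsort(score, dim=1)[:, :k]             # k lowest-scoring weights in each row
    mask = torch.ones_like(w, dtype=torch.bool).scatter_(1, idx, False)
    return w * mask

w, x = torch.randn(1024, 1024), torch.randn(256, 1024)
w_sparse = wanda_prune(w, x)
print((w_sparse == 0).float().mean().item())             # ~0.5 sparsity
```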
4.2 Structured Pruning
Compared to unstructured pruning, structured pruning offers the advantage of being hardware-agnostic, allowing for accelerated inference on traditional hardware post-pruning. However, the removal of larger and potentially more critical components in structured pruning may result in performance degradation, typically requiring efficient parameter fine-tuning for recovery. We divide LLMs structured pruning works into several groups based on pruning metrics: Loss-based Pruning, Magnitude-based Pruning, and Regularization-based Pruning.
Loss-based Pruning (Molchanov et al., 2019) assesses the significance of a pruning unit by measuring its impact on the loss or on gradient information (e.g., first-order or second-order derivatives of the loss). For example, LLM-Pruner (Ma et al., 2023) introduces one-shot structured pruning of LLMs based on gradient information. Specifically, LLM-Pruner identifies dependent structures via a dependency detection algorithm and selects optimal pruning groups using gradient information, rather than relying solely on loss changes, in a task-agnostic manner. Different from LLM-Pruner, which focuses on narrowing LLMs' width, Shortened LLaMA (Kim et al., 2024) introduces one-shot depth pruning of LLMs. Shortened LLaMA chooses the Transformer block as the prunable unit and prunes unimportant Transformer blocks, where the importance of a block is evaluated by the loss and its second-order derivative. After pruning, both LLM-Pruner and Shortened LLaMA utilize LoRA to rapidly recover the performance of the pruned model.
Magnitude-based Pruning (Han et al., 2015) involves devising a heuristic metric based on the magnitudes of pruning units and using this metric to assess their importance, subsequently pruning those units whose scores fall below a predefined threshold. For example, FLAP (An et al., 2024) utilizes a structured fluctuation metric to assess and identify columns in the weight matrix suitable for pruning, measuring the variation of each input feature relative to a baseline value to estimate the impact of removing a column of weights. Additionally, FLAP uses an adaptive structure search to optimize global model compression and restores the model's performance post-pruning through a baseline bias compensation mechanism, avoiding the need for fine-tuning. To further maintain the pruned model's performance, SliceGPT (Ashkboos et al., 2024) leverages the computational invariance of Transformer networks and optimizes the pruning process through Principal Component Analysis (PCA). Specifically, SliceGPT applies PCA at each layer of the Transformer network to project the signal matrix onto its principal components and eliminate insignificant columns or rows from the transformed weight matrices, ultimately compressing the model effectively.
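As a generic illustration of structured pruning, not FLAP or SliceGPT themselves, the sketch below scores whole input columns of a linear layer with a simple activation-weighted magnitude metric and physically removes the lowest-scoring 30%, so the smaller dense matrix runs on standard hardware. All shapes and ratios are illustrative.

```python
# Generic structured-pruning sketch: remove entire input columns of a linear layer.
import torch

def prune_columns(w: torch.Tensor, x: torch.Tensor, ratio: float = 0.3):
    # w: (out_features, in_features); x: (num_tokens, in_features) calibration activations
    col_score = (w.abs() * x.abs().mean(dim=0)).sum(dim=0)   # importance of each input column
    n_keep = int((1 - ratio) * w.shape[1])
    keep = torch.topk(col_score, n_keep).indices.sort().values
    return w[:, keep], keep            # the upstream layer must drop the same output features

w, x = torch.randn(4096, 4096), torch.randn(128, 4096)
w_small, kept = prune_columns(w, x)
print(w.shape, "->", w_small.shape)    # torch.Size([4096, 4096]) -> torch.Size([4096, 2867])
```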
Regularization-based Pruning (Wen et al., 2016) typically adds a regularization term (e.g., L0, L1, and L2 regularization) into the loss function to induce sparsity for LLMs. For example, Sheared LLaMA (Xia et al., 2024) uses a pair of Lagrange multipliers based on pruning masks to impose constraints on the pruned model shape directly, thereby formulating pruning as a constrained optimization problem. Through solving this optimization problem, Sheared LLaMA derives optimal pruning masks. Additionally, Sheared LLaMA introduces dynamic batch loading, a strategy that adapts training data loading based on each domain’s loss reduction rate, enhancing the efficiency of data utilization during training.
Structured pruning typically reduces model size by removing redundant parameters, but it may degrade model performance. A promising approach is to combine knowledge distillation (Hinton et al., 2015) with structured pruning: knowledge extracted from an LLM can be transferred to the smaller pruned model, helping it maintain performance while reducing its size.
4.3 Semi-Structured Pruning
Apart from unstructured and structured pruning, many studies use semi-structured pruning to prune partial weights of LLMs according to specific patterns. N:M sparsity, where every group of M contiguous elements retains N non-zero elements, is a typical example of semi-structured pruning. For example, E-Sparse (Li et al., 2023b) implements N:M sparsity by introducing information entropy as a metric for evaluating parameter importance, enhancing the significance of parameter weights and input feature norms. E-Sparse incorporates global naive shuffle and local block shuffle to efficiently optimize information distribution and mitigate the impact of N:M sparsity on LLM accuracy. Furthermore, many pruning studies can also be generalized to semi-structured patterns. For example, SparseGPT (Frantar and Alistarh, 2023) and Wanda (Sun et al., 2024) also explore N:M sparsity of LLMs. SparseGPT employs block-wise weight partitioning, with each block containing M weights, and prunes the weights that incur the lowest reconstruction error (based on Hessian information) within each block to satisfy the N:M sparsity pattern. This process iteratively prunes and updates model weights, one block at a time, until the desired sparsity level is achieved across the entire model. Wanda achieves N:M pruning by dividing the weight matrix into groups of M consecutive weights and computing an importance score for each weight, determined by the product of the weight's magnitude and the norm of the corresponding input activations. Within each group, the N weights with the highest scores are retained while the rest are set to zero. Furthermore, choosing a pruning pattern compatible with the target hardware is crucial. For instance, Choquette et al. (2021) introduce the Ampere Tensor Core GPU architecture (e.g., A100 GPUs) and propose 2:4 fine-grained semi-structured sparsity to accelerate sparse neural networks on this hardware. However, the current implementation of the Ampere architecture supports only the 2:4 ratio, leaving other ratios without acceleration.
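A minimal sketch of magnitude-based 2:4 semi-structured pruning follows. Real systems such as SparseGPT and Wanda use the more elaborate scores described above, but the pattern constraint is the same: exactly two non-zero weights in every group of four consecutive weights.

```python
# Magnitude-based 2:4 semi-structured pruning sketch.
import torch

def prune_2_of_4(w: torch.Tensor) -> torch.Tensor:
    out_f, in_f = w.shape
    assert in_f % 4 == 0
    groups = w.reshape(out_f, in_f // 4, 4)
    keep = torch.topk(groups.abs(), k=2, dim=-1).indices               # top-2 magnitudes per group of 4
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(1024, 1024)
w_24 = prune_2_of_4(w)
print((w_24 == 0).float().mean().item())                               # exactly 0.5 sparsity
```

On Ampere-class hardware this pattern can be stored in a compressed format and executed by sparse Tensor Core kernels, which is where the speedups in Table 2 come from.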
LLMs often perform well on multiple tasks, which means they contain a multitude of parameters for various tasks. Dynamic pruning (Xia et al., 2020) methods can dynamically prune different parts of the model based on the current task’s requirements to provide better performance on specific tasks. This helps strike a balance between performance and efficiency.
For PTQ and pruning, preparing a high-quality calibration dataset is crucial for improving the performance of compressed LLMs. Specifically, Williams and Aletras (2023) conduct an extensive empirical study of the effect of calibration data on model compression methods and find that downstream task performance can vary significantly depending on the calibration data selected. High-quality calibration data can improve the performance and accuracy of the compressed model, so careful selection and preparation of calibration data are necessary.
5 Knowledge Distillation
Knowledge Distillation (KD) (Hinton et al., 2015) is a technique aimed at transferring knowledge from a large and complex model (i.e., teacher model) to a smaller and simpler model (i.e., student model). We classify these methods into two clear categories (Gu et al., 2024): Black-box KD, where only the teacher’s outputs are accessible, typically from closed-source LLMs, and White-box KD, where the teacher’s parameters or output distribution are available, usually from open-source LLMs.
5.1 Black-box KD
Black-box KD usually prompts the teacher LLM to generate a distillation dataset for fine-tuning the student LM, thereby transferring capabilities from the teacher LLM to the student LM. In black-box KD, teacher LLMs such as ChatGPT (gpt-3.5-turbo) and GPT-4 (OpenAI, 2024) are typically employed, while smaller LMs (SLMs), such as GPT-2 (Radford et al., 2019), T5 (Raffel et al., 2020), FlanT5 (Chung et al., 2024), and CodeT5 (Wang et al., 2021), are commonly utilized as student LMs. Researchers have also found that LLMs exhibit emergent abilities, i.e., significant improvements in performance that appear once the model reaches a certain scale, showcasing surprising capabilities. Many black-box KD methods try to distill emergent abilities from LLMs to student LMs, and we introduce three commonly used emergent-ability distillation methods: Chain-of-Thought (CoT) Distillation, In-Context Learning (ICL) Distillation, and Instruction Following (IF) Distillation.
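The basic black-box KD recipe can be sketched as follows. This is illustrative only: query_teacher is a placeholder for a call to a closed-source teacher API, and the student checkpoint, example question, and single-step training loop are stand-ins for a real distillation dataset and fine-tuning run.

```python
# Black-box KD sketch: collect teacher outputs, then fine-tune a small student on them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def query_teacher(question: str) -> str:
    """Placeholder for a call to a closed-source teacher LLM (e.g., an API request)."""
    return "Let's think step by step. ... The answer is 42."

questions = ["If a train travels 120 km in 2 hours, what is its average speed?"]
pairs = [q + "\n" + query_teacher(q) for q in questions]       # distillation dataset

student_name = "gpt2"                                           # placeholder student LM
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

for text in pairs:                                              # one update per example (sketch only)
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss     # next-token prediction on teacher outputs
    loss.backward(); opt.step(); opt.zero_grad()
```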
5.1.1 Chain-of-Thought Distillation
CoT (Wei et al., 2022; Wang et al., 2023b) prompts LLMs to generate intermediate reasoning steps, enabling them to tackle complex reasoning tasks step by step. Li et al. (2024b) and Hsieh et al. (2023) employ LLMs to generate explanations and leverage a multi-task learning framework to bolster the reasoning capabilities of smaller models while enhancing their capacity for generating explanations. Magister et al. (2023) show that LLMs' reasoning capability can be transferred to SLMs via knowledge distillation, but there is a trade-off between model and dataset size in reasoning ability. Ho et al. (2023) use zero-shot CoT techniques to prompt LLMs to generate diverse rationales that enrich the distillation dataset for the student models. Shridhar et al. (2023) distill two student models: a problem decomposer, which decomposes complex problems into a sequence of subproblems, and a subproblem solver, which solves these subproblems step by step. Wang et al. (2023a) incorporate contrastive decoding during rationale generation for teacher models and address shortcut issues by introducing a counterfactual reasoning objective during student model training. Fu et al. (2023) demonstrate that increasing task-specific capabilities through distillation may inadvertently reduce performance on generalized problems, and focus on improving the mathematical capability of student LMs via distillation. PaD (Zhu et al., 2024) prompts LLMs to generate Program-of-Thought (PoT) rationales instead of CoT rationales to construct the distillation dataset, and fine-tunes SLMs on it. Wang et al. (2023e) establish a multi-round interactive learning paradigm that enables student LMs to provide feedback to teacher LLMs during the distillation process, thereby obtaining tailored training data. Additionally, DRA introduces a self-reflection learning mechanism, allowing student LMs to learn from their mistakes and enhance their reasoning abilities. Li et al. (2024c) find that negative data generated by teacher LLMs also contains reasoning knowledge, and guide student LMs to learn from negative samples in addition to positive ones.
5.1.2 In-Context Learning Distillation
ICL (Dong et al., 2023; Wang et al., 2023c) employs structured prompts with task descriptions and examples for LLMs to learn new tasks without gradient updates. Huang et al. (2022) introduce a method called in-context learning distillation, which transfers in-context learning ability from LLMs to smaller models by combining in-context learning objectives with language modeling objectives. Specifically, it trains the student model to improve its generalization across various tasks by imitating the soft label predictions of the teacher model and the hard label ground truth values. Additionally, the method incorporates two few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT). In Meta-ICT, the student model adapts to new tasks with in-context learning and guidance from the teacher. Conversely, Multitask-ICT treats all target tasks as training tasks, directly using examples from them in distillation. The outcomes show that Multitask-ICT is more effective, despite its increased computational requirements. AICD (Liu, 2024) leverages the autoregressive nature of LLMs to perform meta-teacher forcing on CoTs within the context, jointly optimizing the likelihood of all in-context CoTs, thereby distilling the capabilities of in-context learning and reasoning into smaller models.
5.1.3 Instruction Following Distillation
IF (Ouyang et al., 2022; Brooks et al., 2023) aims to bolster the zero-shot ability of LLMs through fine-tuning using a collection of instruction-like prompt-response pairs. For instance, Lion (Jiang et al., 2023) prompts the LLM to identify and generate the “hard” instructions, which are then utilized to enhance the student model’s capabilities. LaMini-LM (Wu et al., 2024) develops an extensive collection of 2.58 million instructions, comprising both existing and newly generated instructions, and fine-tunes a diverse array of models by using these instructions. SELF-INSTRUCT (Wang et al., 2023d) uses student LMs themselves as teachers to generate instruction following dataset, and fine-tunes students themselves with the dataset. Selective Reflection-Tuning (Li et al., 2024a) leverages the teacher LLMs to reflect on and improve existing data, while the student LMs assess and selectively incorporate these improvements, thereby increasing data quality and compatibility with the student LMs.
Black-Box Distillation uses the teacher model’s outputs as supervision, but the teacher model’s outputs may not cover all possible input scenarios. Thus, understanding how to handle a student model’s generalization on unknown data and how to increase data diversity is an area that requires further investigation.
5.2 White-box KD
White-box KD enables the student LM to gain a deeper understanding of the teacher LLM's internal structure and knowledge representations, often resulting in greater performance improvements. A representative example is MINILLM (Gu et al., 2024), the first work to study distillation from open-source generative LLMs. MINILLM uses a reverse Kullback-Leibler divergence objective, which is better suited to KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution, and derives an effective optimization approach for this objective. Further, GKD (Agarwal et al., 2024) explores distillation from auto-regressive models, of which generative language models are a subset. GKD trains the student on self-generated outputs with teacher feedback, and allows flexibility in the choice of loss function when the student cannot fully replicate the teacher's distribution. Different from the above studies, which focus on learning the teacher distribution, TED (Liang et al., 2023) proposes a task-aware layer-wise distillation method that designs task-aware filters to align the hidden representations of the teacher and student models at each intermediate layer, reducing the knowledge gap between the student and teacher models.
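The distinction between the standard forward KL objective and the reverse KL objective adopted by MINILLM can be sketched with random placeholder logits as follows.

```python
# Forward vs. reverse KL between teacher and student token distributions (placeholder logits).
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 32, 32000)     # (batch, seq_len, vocab)
student_logits = torch.randn(8, 32, 32000)

p = F.log_softmax(teacher_logits, dim=-1)      # teacher distribution (log)
q = F.log_softmax(student_logits, dim=-1)      # student distribution (log)

# Forward KL(p || q): standard KD, penalizes the student wherever the teacher has probability mass.
forward_kl = F.kl_div(q, p, log_target=True, reduction="batchmean")
# Reverse KL(q || p): mode-seeking, discourages the student from placing mass where the teacher
# assigns low probability, the behavior MINILLM exploits for generative models.
reverse_kl = F.kl_div(p, q, log_target=True, reduction="batchmean")
print(forward_kl.item(), reverse_kl.item())
```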
Although white-box distillation allows student LMs to learn the knowledge of teacher LLMs more deeply than black-box distillation, open-source LLMs currently perform worse than closed-source ones, which limits the achievable performance of student LMs in white-box distillation. This is one of the main factors hindering the development of white-box distillation. A feasible solution is to first distill knowledge from closed-source LLMs to open-source LLMs through black-box distillation, and then use white-box distillation to transfer knowledge from the open-source LLMs to student LMs.
White-box distillation often involves understanding and utilizing the internal structure of LLMs, such as layer connections and parameter settings. A more in-depth exploration of different network structures and interactions between layers can improve the effectiveness of white-box distillation.
6 Low-Rank Factorization
Low-Rank Factorization (Srebro and Jaakkola, 2003) reduces a large matrix into smaller ones to save space and computation. For example, it decomposes a large matrix W into two small matrices U and V (i.e., W ≈ UV), where U is m × k and V is k × n, with k much smaller than m and n. Recent studies employ low-rank factorization to compress LLMs with significant success. For example, LPLR (Saha et al., 2023) compresses the weight matrices of LLMs through randomized low-rank and low-precision factorization. Specifically, LPLR approximates the column space of the matrix using random sketching techniques, quantizes these columns, and then projects the original columns onto this quantized space to create two low-rank factors stored in low precision. ASVD (Yuan et al., 2023b) finds that the activation distribution affects compression performance. To address this, ASVD scales the weight matrix with a diagonal matrix containing scaling factors that correspond to the activation distribution of the input feature channels. Moreover, ASVD assigns the most suitable compression ratio to each layer by analyzing the singular value distribution of its weight matrices, ensuring minimal loss of model performance during compression. Furthermore, Sharma et al. (2024) demonstrate that the performance of LLMs can be significantly improved by applying Layer-Selective Rank Reduction (LASER) to specific layers of Transformer models. LASER selectively removes higher-order components of weight matrices, which is shown to improve the model's handling of rare training data and its robustness to question paraphrasing.
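A minimal truncated-SVD sketch of the W ≈ UV factorization defined above follows. It is the generic construction rather than LPLR, ASVD, or LASER, and the matrix dimensions and rank are arbitrary.

```python
# Truncated-SVD low-rank factorization sketch: W (m x n) ≈ U_k (m x k) @ V_k (k x n).
import torch

def low_rank_factorize(w: torch.Tensor, k: int):
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    U_k = U[:, :k] * S[:k]          # singular values folded into the left factor
    V_k = Vh[:k, :]
    return U_k, V_k

m, n, k = 4096, 4096, 256
w = torch.randn(m, n)
U_k, V_k = low_rank_factorize(w, k)
# Relative error is large here because a random matrix is nearly full rank; trained weight
# matrices with concentrated spectra factorize far more accurately.
print("relative error:", (torch.norm(w - U_k @ V_k) / torch.norm(w)).item())
print("parameter ratio:", (m * k + k * n) / (m * n))   # 0.125 at rank 256
```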
7 Challenges and Future Directions
7.1 More Advanced Methods
The research on model compression techniques for LLMs is still in its early stages. Compressed LLMs, as demonstrated in prior studies (Frantar and Alistarh, 2023; Liu et al., 2023b; Ho et al., 2023), continue to exhibit a significant performance gap compared to their uncompressed counterparts. By delving into more advanced model compression methods tailored for LLMs, we have the potential to narrow this gap and enhance the performance of compressed LLMs.
7.2 Scaling up Model Compression Methods from Other Models
In our paper, we introduce several representative model compression methods for LLMs. However, many classic model compression methods remain prevalent in traditional small models. For example, lottery tickets (Frankle and Carbin, 2019) and parameter sharing (Savarese and Maire, 2019) are widely used model compression methods in small models. These methods still hold significant potential in the era of LLMs. Future work should focus on exploring how to extend these compression methods to LLMs to achieve further compression.
7.3 LLM Inference and Deployment
The efficiency of compressed LLMs during deployment is also a significant area for exploration. This involves multiple evaluation metrics, including arithmetic intensity, memory size, and throughput. Furthermore, we can use an analytical tool, the Roofline Model (Williams et al., 2009), to assess the resource efficiency of compressed LLMs on specific hardware. Evaluating the deployment efficiency of compressed LLMs on specific hardware can guide researchers in selecting and analyzing the advantages and disadvantages of various model compression methods and further optimizing these methods.
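A back-of-the-envelope Roofline check can be sketched as follows. The peak compute, memory bandwidth, and arithmetic-intensity figures are illustrative placeholders for an A100-class device, not measured values.

```python
# Roofline-style sketch: attainable throughput = min(peak compute, arithmetic intensity x bandwidth).
def attainable_tflops(arith_intensity: float, peak_tflops: float, mem_bw_tb_s: float) -> float:
    return min(peak_tflops, arith_intensity * mem_bw_tb_s)

# Decoding one token of a 7B FP16 model streams roughly 14 GB of weights for roughly 14 GFLOPs,
# i.e., an arithmetic intensity of about 1 FLOP/byte: firmly memory-bound.
peak_tflops, mem_bw_tb_s = 312.0, 2.0          # illustrative A100-class figures
for name, ai in [("decode (batch 1)", 1.0), ("prefill / large batch", 300.0)]:
    print(name, attainable_tflops(ai, peak_tflops, mem_bw_tb_s), "TFLOPS attainable")
```

Under this lens, weight-only quantization mainly helps the memory-bound decode phase by shrinking the bytes moved per token, whereas low-bit weight-activation kernels are needed to raise the compute roof itself.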
7.4 The Effect of Scaling Law
The scaling law (Kaplan et al., 2020) underscores the significant impact of model size, dataset size, and compute resources on the performance of LLMs. However, the scaling law presents a fundamental challenge for LLM compression, i.e., there is a trade-off between model size and performance in compressed LLMs. Delving into the mechanisms and theories underpinning the scaling law is crucial for elucidating and potentially overcoming this limitation.
7.5 AutoML for LLM Compression
Existing compression techniques have made remarkable progress, but they still heavily depend on manual design. For instance, designing appropriate student architectures for knowledge distillation requires a significant amount of human effort. To reduce this reliance on manual design, a feasible solution is to combine Automated Machine Learning (AutoML) techniques such as Meta-Learning (Finn et al., 2017) and Neural Architecture Search (NAS) (Zoph and Le, 2017) with model compression. By combining with AutoML techniques, model compression can automatically select appropriate hyperparameters and tailor architectures and scales of compressed models, thus minimizing human involvement and lowering the associated costs. Furthermore, AutoML can identify optimal model compression strategies tailored to specific task requirements, thereby further enhancing compression rates without compromising model performance.
7.6 Explainability of LLM Compression
Earlier research (Stanton et al., 2021; Xu et al., 2021) has raised significant concerns regarding the explainability of model compression techniques applied to Pre-trained Language Models (PLMs). Notably, these same challenges extend to LLM compression methods as well. For example, CoT-distillation can enhance SLMs’ reasoning performance, yet the mechanism through which it imparts CoT ability remains unclear. This challenge underscores the importance of integrating explainability with model compression approaches for the advancement of LLM compression applications. Explainability not only clarifies the changes and trade-offs in the compression process but also enhances efficiency and accuracy. Additionally, interpretability aids in evaluating the compressed model’s performance to ensure it aligns with practical requirements.
8 Conclusion
In this survey, we have explored model compression techniques for LLMs. Our coverage spanned compression methods, metrics, and benchmark datasets. By diving into LLM compression, we’ve highlighted its challenges and opportunities. This survey aims to be a valuable reference, providing insights into the current landscape and promoting ongoing exploration of this pivotal topic.
Acknowledgments
We would like to thank the anonymous reviewers and the action editor for their valuable feedback and discussions. The work of Jian Li is supported partially by National Natural Science Foundation of China (No. 62106257). The work of Yong Liu is supported partially by National Natural Science Foundation of China (No. 62076234), Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), the Unicom Innovation Ecological Cooperation Plan, and the CCF-Huawei Populus Grove Fund.