16 minute read


LLM Quantization


Introduction

Large language models (LLMs) have seen rapid growth in parameter count and computational requirements over the past few years. Models like Mistral‑7B and Llama‑3 70B can achieve impressive language understanding but require tens or hundreds of gigabytes of memory and specialized GPUs to perform inference. Quantization is a model compression technique that reduces the numerical precision of neural network parameters and activations while aiming to preserve the model’s predictive quality. By mapping floating‑point weights to a smaller set of integer values (for example FP32 → INT8 or even 1‑bit weights), quantization reduces the memory footprint, lowers bandwidth requirements and energy consumption, and enables deployment on resource‑constrained devices such as smartphones, laptops and robots. This report reviews the motivation for quantization, the available techniques, their impact on model training and inference, consequences for architecture design, real‑world applications, and provides a detailed overview of 1‑bit/1.58‑bit quantization based on the BitNet‑b1.58 model.

What is Quantization?

Quantization in neural networks refers to approximating continuous‑valued parameters with a discrete set of values. In its simplest form, each 32‑bit floating‑point weight or activation $X$ is transformed into a smaller integer representation $X_{\mathrm{quantized}}$ using a scale factor $S$ and, optionally, a zero‑point for asymmetric mapping:

\[S = \frac{2^{b-1}-1}{\alpha},\qquad X_{\mathrm{quantized}} = \operatorname{round}(S \cdot X),\]

where $b$ is the target bit width and $\alpha$ is the maximum absolute value in the original tensor. The discrete representation can later be de‑quantized back to an approximate floating‑point value during computation. Mapping to fewer bits reduces the range and resolution of representable numbers; for example, going from FP16 to FP8 or FP4 compresses the numerical range and resolution but also yields large reductions in storage and computation cost. Quantization can be applied to weights, activations, or both, and can be performed post‑training or during training (quantization‑aware training).
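
As a rough illustration of this mapping, the sketch below quantizes a tensor to INT8 with a symmetric per‑tensor scale and then de‑quantizes it; the helper names are illustrative rather than taken from any particular library.

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Symmetric per-tensor quantization: S = (2^(b-1) - 1) / alpha."""
    alpha = np.max(np.abs(x))                  # largest absolute value in the tensor
    scale = (2 ** (bits - 1) - 1) / alpha
    x_q = np.round(scale * x).astype(np.int8)  # discrete integer representation
    return x_q, scale

def dequantize(x_q, scale):
    """Approximate reconstruction of the original floating-point values."""
    return x_q.astype(np.float32) / scale

w = np.random.randn(4, 4).astype(np.float32)
w_q, s = quantize_symmetric(w)
print(np.max(np.abs(w - dequantize(w_q, s))))  # rounding error, bounded by ~0.5 / s
```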

Types of quantization

Different schemes exist for deciding the mapping and bit width:

| Technique | Key idea / typical bit widths | Notes |
|---|---|---|
| Post‑training quantization (PTQ) | Quantize a trained model without retraining; uses calibration data to determine scale/zero‑point. | Easier to deploy but may reduce accuracy for very low bit widths. |
| Quantization‑aware training (QAT) | Simulates quantization during training using straight‑through estimators so the network learns to be robust to quantization noise. | Requires access to training data; yields better accuracy at low precision. |
| Dynamic vs. static quantization | Dynamic quantization computes scales at run time; static quantization uses calibration data to pre‑compute them. | Static often yields better accuracy at the cost of calibration overhead. |
| Weight‑only vs. weight‑activation quantization | Some methods quantize only weights (e.g., GPTQ, AWQ), while others also quantize activations (SmoothQuant, FP8). | Activation quantization further reduces memory bandwidth but is harder to calibrate. |
| Symmetric vs. asymmetric | Symmetric quantization centers values around zero; asymmetric adds a zero‑point to represent non‑zero‑mean distributions. | Asymmetric mapping is useful when activation distributions are biased. |
| Specialized schemes | Methods like GPTQ and AWQ use layer‑wise Hessian information or activation‑aware scaling to select important weights and minimize quantization error. | These schemes enable lower bit widths (e.g., INT4) while retaining accuracy. |

Quantization formats vary from standard floating‑point types (FP16, BF16, FP8, FP4) to integer formats (INT8, INT4, INT2). Moving from 32‑bit to 8‑bit reduces model size roughly by 4× and can provide up to 16× improvement in performance per watt. Further reduction to 4‑bit yields an 8× reduction in model size but requires more careful calibration.

The Need for Quantization

Resource and energy constraints

LLMs contain billions of parameters; storing them in full precision requires enormous memory. For example, the 7‑billion‑parameter Mistral model needs about 28–29 GB of memory in FP32, 14.5 GB in FP16, and only around 4 GB when quantized to INT4. A 70‑billion‑parameter model may require ~280 GB in FP32, making deployment impossible on consumer devices. Quantization reduces the required memory and bandwidth, enabling inference on devices with limited high‑bandwidth memory and reducing the number of DRAM–SRAM transfers. The Clarifai technical blog notes that reducing precision from 32‑bit to 8‑bit compresses the model size fourfold and can improve performance per watt by up to 16×. These reductions translate directly into lower energy consumption and faster inference.
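
A quick back‑of‑the‑envelope calculation reproduces these figures; it counts only weight storage and ignores the KV cache, activations and runtime buffers.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Rough checkpoint size in GB: parameters x bits per weight, nothing else."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit: {weight_memory_gb(7e9, bits):4.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```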

Latency and user experience

Running LLMs locally avoids the latency associated with round trips to cloud servers. On‑device models provide privacy and availability, but memory bandwidth on mobile devices is limited (50–90 GB/s vs. 2–3 TB/s on data‑centre GPUs). Quantized models reduce memory traffic per token generation—going from 16‑bit to 4‑bit cuts bandwidth requirements by a factor of four. This not only accelerates inference but also reduces battery usage and heat generation, which are critical for smartphones and laptops.
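
During single‑stream decoding the model weights are streamed through memory once per generated token, so a crude upper bound on throughput is memory bandwidth divided by model size. The sketch below uses assumed bandwidth numbers purely to illustrate why halving or quartering weight precision translates almost directly into tokens per second on bandwidth‑bound devices.

```python
def decode_tokens_per_s(bandwidth_gb_s, model_size_gb):
    """Bandwidth-bound upper bound for one decoding stream
    (ignores KV-cache traffic, compute time and kernel overheads)."""
    return bandwidth_gb_s / model_size_gb

mobile_bw = 60  # GB/s, assumed mid-range mobile SoC
print(decode_tokens_per_s(mobile_bw, 14.0))  # 7B at 16-bit: ~4.3 tokens/s
print(decode_tokens_per_s(mobile_bw, 3.5))   # 7B at 4-bit:  ~17 tokens/s
```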

Economic and environmental considerations

Training and serving large models consume significant energy. The Understanding Efficiency study noted that lower numerical precision decreases energy only in compute‑bound regimes and that batching and request scheduling further improve energy efficiency. Quantization reduces energy both by using cheaper integer operations and by shrinking model memory, which reduces the energy spent moving data from DRAM to compute units. The BitNet‑b1.58 research shows that replacing floating‑point multiplications with integer additions dramatically reduces energy: ternary matrix multiplication achieved 71.4× energy savings over FP16 arithmetic and improved end‑to‑end energy efficiency by up to 41×.

Impact on Model Training

Post‑training quantization and calibration

In PTQ, a model is first trained in full precision and then quantized. Calibration with a small dataset (typically 128–512 samples) collects activation statistics to determine scaling factors. Simple min–max calibration may be sensitive to outliers, so modern techniques like SmoothQuant and AWQ adjust weight scales and smooth activation distributions before quantization. NVIDIA’s TensorRT Model Optimizer supports multiple formats (FP8, FP4, INT8, INT4) and provides calibration methods such as SmoothQuant (addresses activation outliers), Activation‑aware weight quantization (AWQ) (preserves weights with large activation magnitudes), and AutoQuantize (uses sensitivity scores to rank layer tolerance to quantization).
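
A minimal sketch of static calibration, assuming a few hundred activation tensors have already been recorded from calibration batches: plain min–max calibration takes the largest absolute value, while a percentile variant clips rare outliers first. Production methods such as SmoothQuant and AWQ go further by rescaling weights and activations jointly; the function below is only the baseline idea, and its name is illustrative.

```python
import numpy as np

def calibrate_scale(activations, bits=8, percentile=None):
    """Per-tensor integer scale derived from calibration activations.
    percentile=None  -> plain min-max calibration (sensitive to outliers);
    percentile=99.9  -> clip rare outliers before deriving the range."""
    flat = np.concatenate([a.ravel() for a in activations])
    alpha = np.max(np.abs(flat)) if percentile is None else np.percentile(np.abs(flat), percentile)
    return (2 ** (bits - 1) - 1) / alpha

# e.g. 128 recorded activation tensors of shape (batch, hidden) -- shapes assumed
calib = [np.random.randn(16, 768) for _ in range(128)]
scale = calibrate_scale(calib, bits=8, percentile=99.9)
```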

PTQ yields large speed‑ups without retraining but may degrade accuracy, especially for small models or very low precisions. A 2025 study evaluating GPTQ, AWQ, SmoothQuant and FP8 across models from 1 B to 405 B parameters found that 4‑bit quantization led to significant accuracy drops for small models, whereas 70‑B models maintained good performance. AWQ consistently outperformed GPTQ across tasks. The study also observed that quantization magnifies a model’s existing weaknesses rather than correlating directly with task difficulty; commonsense and mathematical reasoning tasks suffered larger drops.

Quantization‑aware training (QAT)

QAT simulates quantization during training so that the network learns to compensate for discretization errors. During the forward pass, floating‑point weights are quantized and de‑quantized, while gradients flow through the quantization function using a straight‑through estimator. This process yields models that better tolerate low‑precision weights and activations. QAT is computationally expensive and rarely applied to very large models, but it is essential for extremely low precisions (≤ 4 bits). The BitNet‑b1.58 model uses a variant of QAT to train an LLM from scratch with ternary weights: shadow floating‑point weights are maintained for gradient updates, while forward passes use quantized weights. A scaling parameter γ is computed as the average absolute value of the shadow weights; weights are then mapped to the ternary set {-1, 0, +1} based on thresholds. The activations are quantized to 8‑bit integers at each forward pass.
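
The mechanical core of QAT is "fake quantization" with a straight‑through estimator: the forward pass sees quantized values, while the backward pass treats rounding as the identity so gradients keep updating the full‑precision weights. The PyTorch‑style sketch below shows the generic INT8 case, not BitNet's exact recipe.

```python
import torch
import torch.nn.functional as F

def fake_quantize(x, bits=8):
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_q - x).detach()  # forward: quantized values; backward: identity w.r.t. x

class QATLinear(torch.nn.Linear):
    """Linear layer that learns to tolerate quantization noise on weights and inputs."""
    def forward(self, x):
        return F.linear(fake_quantize(x), fake_quantize(self.weight), self.bias)
```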

Effects on training dynamics

Quantization can alter the training dynamics and representational capacity. The Interpreting the Effects of Quantization study found that 4‑bit quantization mildly reduces model confidence—e.g., a 4‑bit Llama‑2‑7B exhibits an 11–13 % reduction in confidence on certain datasets—but generally does not cause drastic changes in calibration. The number of “dead” neurons (neurons with near‑zero activation) remains largely unchanged across 4‑bit, 8‑bit and full‑precision variants. However, the number of salient neurons contributing to predictions may increase for quantized small models, suggesting that quantization introduces perturbations and increases representational noise. In larger models, full precision often leads to richer, more distributed representations, whereas quantization sparsifies the contribution of neurons.

In extremely low precisions such as 1‑bit, the architecture must be adapted. BitNet‑b1.58 replaces standard linear layers with BitLinear modules; forward passes use ternary weights and 8‑bit activations, while backward passes update full‑precision shadow weights. The quantization function uses a scaling factor to ensure that the ternary weights approximate the original floating‑point values. Training from scratch or gradually shifting from full precision to 1‑bit yields near full‑precision accuracy with only a 2–3 point drop.

Impact on Inference

Memory footprint and latency

Quantization reduces the size of model checkpoints and the memory required to load them. For example, a 13‑billion‑parameter Llama‑2‑Chat model requires 26 GB at FP16 but only 7.9 GB after Q4_K_M quantization, enabling inference on a consumer laptop with 16 GB of RAM. BitNet‑b1.58 models use 2.6–3.55× less memory and achieve 1.23–2.71× lower latency than their FP16 counterparts across model sizes while matching perplexity. These improvements come largely from replacing floating‑point multiplications with integer additions and exploiting zeros in ternary weights, which reduce DRAM–SRAM transfers and allow skipping operations for zero weights. Energy consumption for matrix multiplication is reduced by 71.4×, and overall end‑to‑end energy efficiency improves up to 41×.

Quantization also affects throughput. Microsoft’s Ladder and T‑MAC enable mixed‑precision matrix multiplication (mpGEMM) on existing hardware by using lookup tables to avoid de‑quantization and multiplication. On a Surface Laptop 7, T‑MAC achieved speeds of 48 tokens/s for a 3 B BitNet‑b1.58 model, 30 tokens/s for a 2‑bit 7 B Llama, and 20 tokens/s for a 4‑bit 7 B Llama. The Ladder compiler supports custom low‑precision data types and provides up to 14.6× speed‑ups over existing compilers.
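
To make the lookup‑table idea concrete, the toy sketch below computes a dot product against 1‑bit (±1) weights by pre‑tabulating, for each group of four activations, the signed sums of all 16 possible weight patterns; the dot product itself then needs only table lookups and additions. Real T‑MAC kernels operate on bit‑packed weights with SIMD table lookups and reuse the tables across every output row of the GEMM, so this is only an illustration of the principle.

```python
import numpy as np

def build_luts(x, group=4):
    """Tabulate, per activation group, the signed sum for every 1-bit weight
    pattern (bit i set => weight +1, cleared => weight -1)."""
    n_groups = len(x) // group
    luts = np.zeros((n_groups, 2 ** group))
    for g in range(n_groups):
        xs = x[g * group:(g + 1) * group]
        for pattern in range(2 ** group):
            signs = np.where([(pattern >> i) & 1 for i in range(group)], 1.0, -1.0)
            luts[g, pattern] = signs @ xs
    return luts

def lut_dot(luts, weight_patterns):
    """Dot product via table lookups and additions only -- no multiplications."""
    return sum(luts[g, p] for g, p in enumerate(weight_patterns))

x = np.random.randn(8)
w = np.random.choice([-1.0, 1.0], size=8)
patterns = [sum(int(w[g * 4 + i] > 0) << i for i in range(4)) for g in range(2)]
print(np.allclose(lut_dot(build_luts(x), patterns), x @ w))  # True
```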

Accuracy and robustness

The IJCAI‑25 comprehensive evaluation shows that quantization generally preserves accuracy for large models but can lead to significant drops for small models. Models with 70 B parameters quantized to 4 bits maintained good performance, while small models (e.g., 1 B–7 B) suffered accuracy losses, particularly with GPTQ. AWQ consistently outperformed GPTQ, and FP8 quantization emerged as the most robust across tasks. The study also reported that coding and STEM tasks experienced the largest degradation in free‑form conversations, while reasoning tasks sometimes improved. Quantization tends to amplify existing weaknesses; difficult tasks do not always incur the largest losses.

While PTQ may reduce model confidence, calibration errors remained small; differences in Adaptive Calibration Error (ACE) across 4‑bit and 8‑bit variants were dataset‑ and architecture‑specific. Quantized models did not exhibit substantial increases in the number of dead neurons. Overall, quantization can achieve high compression ratios with modest accuracy loss for well‑designed models but requires careful calibration and, in some cases, quantization‑aware training.

Impact on Model Architecture

Architectural choices under quantization

Quantization influences the design of both large and small models. For on‑device deployment, models often adopt deeper and thinner architectures rather than wide, shallow ones. The On‑Device LLMs 2026 report notes that sub‑billion‑parameter models can handle many practical tasks; below ~1 B parameters, deeper and thinner networks consistently outperform wide, shallow ones. This shape offers better parameter efficiency and works well with low‑bit quantization, since deeper networks can maintain expressiveness despite reduced numerical precision.

Low‑bit quantization can impose constraints on architecture. Mixed‑precision matrix multiplication requires hardware support for asymmetric computations; Ladder and T‑MAC overcome these limitations by converting unsupported data types into hardware‑compatible ones and using lookup tables. Modern chips integrate low‑precision units (FP8, INT4, INT2), but memory bandwidth remains the bottleneck. Handling the key/value (KV) cache in transformers is particularly challenging; compressing KV caches sometimes yields larger gains than further weight quantization.

Extreme quantization (1‑bit and 1.58‑bit)

At 1 bit per weight, the model architecture must be re‑engineered. The BitNet‑b1.58 architecture replaces standard linear layers with BitLinear modules that store weights in the ternary set {-1, 0, +1}. Activations are quantized to 8‑bit integers, and a scaling factor ensures that the product of ternary weights and activations approximates full‑precision operations. Training maintains full‑precision shadow weights and uses a straight‑through estimator; thus, the model can be trained end‑to‑end in low precision. The BitNet architecture also quantizes the embedding and output layers, leaving no high‑precision “escape hatches.” The result is a ternary model that achieves performance close to FP16 models while drastically reducing memory and compute requirements.

Similarly, PrismML’s Bonsai 8B model implements a proprietary 1‑bit design across all layers (embeddings, attention, MLP and LM head). It delivers competitive performance at only 1.15 GB (around 14× smaller than standard 8‑B FP16 models) and can run on consumer devices such as an iPhone 17 Pro at roughly 44 tokens/s. The company defines intelligence density as the negative log of average error per gigabyte; Bonsai achieves an intelligence‑density score that far exceeds other models in its parameter class.

Use Cases and Applications of Quantized LLMs

Edge and mobile deployment

Quantization enables AI capabilities on devices where cloud connectivity, latency or privacy is a concern. Four primary motivations for running LLMs locally are latency, privacy, cost, and availability. Locally running models avoid network delays, ensure that sensitive data never leaves the device, and eliminate cloud serving costs; they also continue to function when offline. Reducing the precision of model weights directly reduces memory traffic, which is the main bottleneck on mobile NPUs. With quantization, developers can deploy models like Llama‑2‑13B‑Chat on consumer laptops with 16 GB RAM. Compact models such as Phi‑4 mini (3.8 B) and the sub‑billion‑parameter SmolLM2 (135 M) are specifically designed for on‑device inference.

Cloud cost reduction

Running quantized models on servers reduces infrastructure costs. Lower precision models require fewer GPUs and less high‑bandwidth memory per request. Batching requests becomes more efficient because smaller models allow more concurrent inference streams. In data‑centre settings, quantization combined with batching and arrival shaping improves energy efficiency and throughput. Hardware accelerators with native low‑precision support (e.g., NVIDIA Blackwell GPUs or custom mpGEMM accelerators) further amplify these savings.

Real‑world applications

  • BitNet‑b1.58: A 1.58‑bit model trained from scratch that matches the perplexity of full‑precision Llama models while using 2.6–3.55× less memory and delivering 1.23–2.71× lower latency. It achieves 71.4× arithmetic energy savings and up to 41× end‑to‑end energy efficiency. The scaling law observed in full‑precision LLMs holds in the ternary regime; from roughly 3 B parameters upward, BitNet‑b1.58 matches the perplexity of same‑size FP16 models, a 13‑B BitNet‑b1.58 is cheaper to run than a 3‑B FP16 model, and training with 2 trillion tokens outperforms StableLM‑3B on multiple benchmarks.
  • PrismML’s Bonsai 8B: A commercially viable 1‑bit model that fits into 1.15 GB and runs at ~44 tokens/s on an iPhone 17 Pro. It retains competitive accuracy across benchmarks despite being 14× smaller than comparable models. Bonsai demonstrates that 1‑bit models can be production‑ready and opens opportunities for persistent on‑device agents, real‑time robotics, secure enterprise copilots and offline AI applications.
  • Llama‑2 and Llama‑3 quantized variants: Post‑training quantization (Q4_K_M, GPTQ, AWQ) compresses models from 26 GB to around 7–8 GB and enables inference on consumer devices. AWQ often yields smaller accuracy drops than GPTQ. The IJCAI study found that FP8 and AWQ are robust across a wide range of tasks.
  • Microsoft’s Ladder and T‑MAC: Compiler and kernel techniques that support custom low‑precision data types and replace multiplications with lookup tables. Ladder provides up to 14.6× speed‑ups and broadens hardware support for low‑precision data types. T‑MAC enables efficient mixed‑precision GEMM and achieves 20–48 tokens/s on low‑bit models on laptops.
  • On‑device small language models: Products such as Phi‑4 mini, Qwen2.5‑VL‑7B‑Instruct and sub‑billion‑parameter models like SmolLM2 are designed for efficient on‑device deployment. They rely on quantization and architectural optimizations to deliver practical performance within tight memory budgets, down to a few hundred megabytes for the smallest models.

Comprehensive Overview of 1‑bit/1.58‑bit Quantization (BitNet‑b1.58)

Motivation and design

The BitNet‑b1.58 paper proposes training LLMs with ternary weights (−1, 0, +1), which corresponds to 1.58 bits per weight because $\log_2(3) \approx 1.58$. The motivation is to reduce memory footprint, bandwidth and energy consumption without sacrificing accuracy. Each weight multiplication becomes an integer addition (and can be skipped entirely when the weight is zero), greatly lowering computational cost. Ternary weights also reduce DRAM transfers because zero weights need not be fetched.

Quantization function and training

The BitNet-b1.58 model replaces standard nn.Linear layers with BitLinear layers and maintains full-precision shadow weights. During each forward pass, a scaling factor is computed as the mean absolute value of the shadow weights:

\[\gamma = \frac{1}{N}\sum_i |w_i|\]

Each weight is then mapped to the ternary set $\lbrace -1, 0, +1 \rbrace$ by comparing $w_i$ against thresholds based on $\gamma$:

\[w_i^{q} = \begin{cases} +1, & w_i > 0.5\gamma \\ 0, & -0.5\gamma \le w_i \le 0.5\gamma \\ -1, & w_i < -0.5\gamma \end{cases}\]

Activations are quantized to 8-bit integers in forward passes, while the backward pass updates the shadow weights. The straight-through estimator is used to propagate gradients through the quantizer. Training can start from scratch or gradually transition a full-precision model to ternary weights; the authors found that the latter yields near full-precision accuracy with only a modest drop.
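
A minimal sketch of this scheme as a drop‑in linear layer (a simplification of BitNet's BitLinear: activation quantization and normalization are omitted, and the class name here is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """BitLinear-style layer: full-precision shadow weights are kept for the
    optimizer, the forward pass uses ternary {-1, 0, +1} weights, and a
    straight-through estimator routes gradients back to the shadow weights."""

    def forward(self, x):
        w = self.weight                            # full-precision shadow weights
        gamma = w.abs().mean().clamp(min=1e-8)     # scaling factor: mean |w_i|
        w_q = torch.round(w / gamma).clamp(-1, 1)  # ternary mapping, thresholds at +/-0.5*gamma
        w_ste = w + (gamma * w_q - w).detach()     # STE: forward sees gamma * w_q
        return F.linear(x, w_ste, self.bias)
```

During training the optimizer keeps updating `self.weight` in full precision; only at export time are the ternary values and the single per‑tensor scale stored, which is where the 1.58‑bit memory footprint comes from.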

Performance and scaling law

Across model sizes from 110 M to 7 B parameters, BitNet‑b1.58 models use roughly 2.6–3.55× less memory and are 1.23–2.71× faster in inference than FP16 LLaMA models, while matching perplexity. On zero‑shot classification tasks, the ternary models achieve accuracy within a few percentage points of their FP16 counterparts and sometimes improve over them. Energy measurements show that ternary matrix multiplication provides a 71.4× reduction in arithmetic energy consumption, and the overall end‑to‑end energy efficiency improves up to 41× for 70‑B models. The authors propose a new scaling law: in the ternary regime, the cross‑entropy loss scales with model size and data as

\[L(N, D) \approx a \left( \frac{1}{N^{\alpha}} + \frac{1}{D^{\beta}} \right),\]

where $N$ is model size and $D$ is the number of training tokens. Interestingly, the exponent for model size α remains similar to full‑precision models (≈0.34), implying that small ternary models (below roughly 3 B parameters) offer little advantage over full‑precision ones, whereas from 3 B parameters upward ternary models match the performance of same‑size full‑precision models at a fraction of the memory and energy cost. Training with 2 trillion tokens improved performance further, surpassing StableLM‑3B on several benchmarks.

Advantages and limitations

1‑bit/1.58‑bit quantization offers dramatic memory and energy savings but comes with trade‑offs. The BitNet‑b1.58 models achieve competitive perplexity with small accuracy drops and maintain the same scaling law as full‑precision models, making them attractive for large‑scale pre‑training. However, the ternary format requires custom kernels and hardware support; inference frameworks must support integer additions and skipping zero weights. BitNet‑b1.58 still uses 8‑bit activations, and quantization of KV caches remains an open challenge. Models below ~3 B parameters show little benefit over full precision due to the limited capacity offered by ternary weights. Nevertheless, the successful deployment of BitNet‑b1.58 and Bonsai 8B demonstrates that extreme quantization can be practical and provides a path toward on‑device LLMs.

Conclusion

Quantization is a crucial technique for making large language models efficient, environmentally sustainable, and deployable in real‑world applications. By reducing numerical precision, quantization compresses models, cuts memory bandwidth and energy consumption, and enables inference on edge devices. Post‑training quantization and quantization‑aware training offer different trade‑offs between implementation complexity and accuracy. Modern calibration methods like SmoothQuant, AWQ and AutoQuantize improve PTQ performance, while QAT is essential for extreme low‑bit regimes. The impact of quantization on training and inference depends on model size, architecture and task; small models suffer larger accuracy drops, whereas 70‑B models maintain robust performance. Quantization can alter neuron saliency and sparsity but does not drastically change calibration or introduce dead neurons. New hardware techniques such as Ladder and T‑MAC accelerate low‑bit GEMM operations and support custom data types.

Extreme quantization (1‑bit and 1.58‑bit) challenges conventional neural network design but offers transformative benefits. The BitNet‑b1.58 model shows that ternary weights combined with 8‑bit activations and quantization‑aware training can match full‑precision performance while delivering over 70× energy savings and enabling on‑device deployment. Commercial releases like PrismML’s Bonsai 8B further demonstrate the viability of 1‑bit models for production use. As hardware support for low‑precision arithmetic matures and quantization techniques continue to improve, we can expect quantized LLMs to become ubiquitous across cloud and edge platforms.

References and Further Reading