Click any block to explore the full pipeline — from raw text to deployed model.
Follow the pipeline top to bottom. Each stage produces an artifact used by the next. Start with Training Data and finish at Deployment to see the complete picture.
Common Crawl is the largest source — a petabyte-scale monthly snapshot of the web containing billions of pages. GPT-3 used ~410B tokens from filtered Common Crawl alone, representing ~60% of its training mix.
Wikipedia is small in volume (~3B tokens in GPT-3's mix) but extremely high quality: factual, well-structured, multilingual. It is typically upsampled 3–4× so it appears more often than its raw byte fraction would suggest.
Raw web data is noisy — spam, boilerplate, duplicate content, SEO manipulation, adult content. It cannot be used directly and requires the full processing pipeline (stage 02).
BookCorpus and similar datasets provide long-form prose — crucial for learning coherent multi-paragraph reasoning. Books contain richer vocabulary and sentence structure than web text.
arXiv papers contribute scientific reasoning, mathematical notation, and structured argumentation. LaTeX source is often included so the model learns to handle equations and formal notation.
Books are heavily upsampled relative to their raw byte count — the signal density per token is far higher than web text. GPT-3's two book corpora contributed ~67B tokens, with the smaller, higher-quality corpus sampled at nearly 2× its natural rate.
Code training data (GitHub, Stack Overflow, documentation) dramatically improves reasoning and instruction-following — not just coding ability. Code is highly structured, contains explicit logic chains, and has clear input→output relationships that transfer to general problem-solving.
Models trained with code data show improved performance on chain-of-thought reasoning, mathematics, and multi-step problems — suggesting that code's structural patterns generalize well beyond programming tasks.
Not all tokens are equal. A small set of high-quality, carefully selected data — textbooks, encyclopedias, legal documents, scientific papers — can be worth orders of magnitude more per token than raw web text.
Microsoft's Phi models demonstrated this strikingly: a 1.3B parameter model trained on "textbook-quality" synthetic data outperformed much larger models trained on raw web data. This shifted the field toward data quality over data quantity.
Modern pipelines classify and score each document, then heavily oversample top-rated content regardless of its raw byte fraction.
The web is full of duplicate content — the same article syndicated to 500 sites, templated pages, mirror repositories. Without deduplication the model memorizes repeated text verbatim and perplexity metrics become misleadingly low.
Exact deduplication hashes document content and removes identical copies. Fast but misses paraphrases.
Fuzzy deduplication uses MinHash + Locality Sensitive Hashing (LSH) to find near-duplicates. Documents are represented as sets of n-gram shingles; MinHash compresses each set into a short signature that estimates Jaccard similarity; LSH finds candidate pairs without comparing every document to every other; pairs above ~0.8 similarity have one copy removed.
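A toy sketch of the MinHash half of this idea (the shingle size, hash count, and function names here are illustrative, and real pipelines add LSH banding on top to avoid all-pairs comparison):

```python
import hashlib
import re

def shingles(text, n=3):
    """Represent a document as its set of word n-gram shingles."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    """One minimum per seeded hash function; each slot simulates taking the
    first shingle under a random permutation of the shingle universe."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents differing in one word share most shingles and get a high estimate; unrelated documents share none and score near zero.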
Heuristic filters remove obvious low-quality documents: pages with very high symbol/word ratios (spam), too few unique words (boilerplate), below minimum length, or in unwanted languages.
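A minimal sketch of such gates, in the spirit of the C4/Gopher rules (the thresholds and the `passes_heuristics` name are illustrative, not any published pipeline's exact values):

```python
def passes_heuristics(text, min_words=20, max_symbol_ratio=0.1,
                      min_unique_ratio=0.3):
    """Cheap document-level quality gates: length, symbol density, repetition."""
    words = text.split()
    if len(words) < min_words:          # too short to be a real document
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(text), 1) > max_symbol_ratio:  # spam-like symbol soup
        return False
    if len(set(words)) / len(words) < min_unique_ratio:  # repeated boilerplate
        return False
    return True
```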
Perplexity filtering trains a small reference model (e.g. KenLM on Wikipedia) and scores each document. Very high perplexity indicates gibberish or heavily non-natural-language content.
More sophisticated approaches use a classifier trained to distinguish Wikipedia-quality text from low-quality text, filtering on classifier confidence score.
PII removal targets email addresses, phone numbers, social security numbers, credit card numbers, IP addresses, and names in sensitive contexts. Regex patterns and NER models identify and either redact or remove affected documents.
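A minimal regex-only redaction sketch (the patterns here are deliberately simple placeholders; production systems use far larger rule sets plus NER models):

```python
import re

# Hypothetical minimal patterns — real pipelines cover many more formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text):
    """Replace each match with a typed placeholder instead of dropping the doc."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```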
Toxic content filtering uses classifiers trained on hate speech, graphic violence, CSAM, and other harmful categories. The threshold is carefully calibrated — aggressive filtering can remove legitimate historical, medical, or legal content.
This is best-effort at trillion-token scale. Some harmful content inevitably passes through, which is one reason alignment (stage 06) is critical.
The final corpus is a carefully weighted mixture. Each domain is assigned a sampling weight determining how often its tokens appear in training batches. High-quality sources are upsampled; raw web text is downsampled.
The Chinchilla paper showed most LLMs were undertrained relative to their size — more data with a smaller model often outperforms less data with a larger model, changing how these ratios are set.
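Sampling weights can be read as "effective epochs" over each source. A sketch using the GPT-3-style figures quoted in this section (these numbers are illustrative; `effective_epochs` is a hypothetical helper):

```python
def effective_epochs(dataset_tokens, weight, total_training_tokens):
    """Average number of passes over a source, given its sampling weight
    (the fraction of training tokens drawn from it)."""
    return weight * total_training_tokens / dataset_tokens

# Assumed GPT-3-style mix: 300B total training tokens,
# Common Crawl 410B tokens at 60% weight, Wikipedia ~3B tokens at 3%.
common_crawl = effective_epochs(410e9, 0.60, 300e9)  # < 1 epoch: downsampled
wikipedia = effective_epochs(3e9, 0.03, 300e9)       # ~3 epochs: upsampled
```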
BPE (Byte-Pair Encoding) starts with individual bytes/characters and iteratively merges the most frequent adjacent pair until the vocabulary reaches the target size (32k–100k tokens).
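The merge loop can be sketched in a few lines on a toy word-frequency corpus (real tokenizers start from bytes and train on terabytes; this is just the algorithm's shape):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE: split words into characters, then repeatedly merge the most
    frequent adjacent symbol pair, recording the merge rules learned."""
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

On a corpus dominated by "low"-prefixed words, the first merges build up the shared stem.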
SentencePiece treats raw text as a plain stream of Unicode characters — whitespace included — so it needs no language-specific pre-tokenization and is robust to whitespace variations. Used by LLaMA, T5, Gemma.
The tokenizer is trained once, then permanently frozen. Every document is tokenized with this fixed vocabulary before any model training begins.
The vocabulary file maps every token string to an integer ID and back. This file must ship with the model — without it you cannot convert text to input IDs or decode output IDs back to text.
The vocabulary encodes implicit decisions about the corpus: what languages are well-represented (common tokens have short, efficient encodings) and what domains are present (Python keywords like def may be a single token).
Special tokens are reserved IDs with semantic meaning for the model's operation. They are inserted at training time to structure inputs and outputs.
[BOS] — beginning of sequence, prepended to every input. [EOS] — end of sequence, the model learns to predict this when done. [PAD] — fills short sequences in a batch. [MASK] — used in masked LM (BERT-style).
Chat models add structured turn tokens so the model distinguishes user from assistant: <|user|>, <|assistant|>, <|system|>.
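A sketch of how such turn tokens are assembled into a prompt (this template and `render_chat` are hypothetical; every model family defines its own exact format):

```python
def render_chat(messages):
    """Wrap each message in its role token; end with an open assistant turn
    so the model's next-token predictions form the reply."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    parts.append("<|assistant|>\n")  # cue the model to respond
    return "\n".join(parts)
```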
The configuration file fully specifies the architecture before any weight is initialized. These choices are essentially permanent — changing them requires retraining from scratch.
d_model: width of the representation throughout the network. n_layers (L): how many attention+FFN blocks are stacked. n_heads (H): parallel attention heads; d_k = d_model / H. Context length: max tokens the model can see — attention is O(n²) so doubling context quadruples attention cost.
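These few numbers largely fix the parameter budget. A rough estimator, assuming a LLaMA-style decoder (SwiGLU FFN, no biases; `estimate_params` and its defaults are this sketch's own invention):

```python
def estimate_params(d_model, n_layers, vocab_size, d_ff=None,
                    tie_embeddings=False):
    """Rough parameter count for a LLaMA-style decoder. Ignores the tiny
    norm parameters; assumes SwiGLU FFN and untied embeddings by default."""
    if d_ff is None:
        d_ff = 4 * d_model
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O per layer
    ffn = 3 * d_model * d_ff       # SwiGLU: W1, W2, W3 per layer
    embed = vocab_size * d_model * (1 if tie_embeddings else 2)
    return n_layers * (attn + ffn) + embed

# LLaMA-7B-like config: lands near 6.7B parameters
n = estimate_params(d_model=4096, n_layers=32, vocab_size=32000, d_ff=11008)
```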
Multi-Head Attention (MHA): the original design. Every head has its own Q, K, V projections. H heads × 3 matrices × (d_model × d_k) parameters per layer.
Multi-Query Attention (MQA): all Q heads are separate but K and V are shared across heads. Reduces KV cache memory H× at modest quality cost. Used in PaLM, Falcon.
Grouped Query Attention (GQA): G groups share K/V, each group has its own Q heads. Best quality-vs-inference-cost tradeoff. Used by LLaMA-2 70B, Mistral, Gemma.
Absolute sinusoidal (original Transformer): fixed sine/cosine encoding added to embeddings. Hard limit at training context length.
RoPE (Rotary Position Embedding): applies a rotation to Q and K vectors based on their positions. The dot product Q·K naturally encodes relative distance. Can be extended beyond training length via RoPE scaling. Used by LLaMA, GPT-NeoX, Mistral.
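The key property — Q·K depends only on relative position — can be checked directly with a pure-Python sketch (vector size and frequencies here are illustrative):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive (even, odd) pair of vec by a
    position-dependent angle; lower dims rotate faster."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Rotating q at position 7 and k at position 3 gives the same dot product as positions 4 and 0 — only the offset of 4 matters.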
ALiBi (Attention with Linear Biases): adds a linear penalty to attention scores based on distance between positions. No learnable parameters; extrapolates well to longer sequences. Used by MPT, BLOOM.
Post-norm (original Transformer): LayerNorm after residual — LayerNorm(x + sublayer(x)). Requires careful LR warmup; unstable with very deep networks.
Pre-norm (GPT-2 onward): LayerNorm before the sublayer — x + sublayer(LayerNorm(x)). More stable, allows removing warmup. Standard in modern LLMs.
RMSNorm: simplification that only normalizes by root mean square, dropping mean subtraction. ~10% faster, empirically equivalent quality. Used by LLaMA, Mistral, Gemma.
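The simplification is small enough to state in full (a per-vector sketch; real implementations vectorize this over the batch):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide by the root-mean-square of the vector, then scale by gamma.
    No mean subtraction and no beta shift, unlike full LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]
```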
The sole objective is next-token prediction: given a sequence, predict the next token. This is called causal language modeling (CLM) because each position can only attend to past positions.
The loss is the average cross-entropy across all positions. Each token prediction is a V-way classification problem (one class per vocabulary token). This deceptively simple objective forces the model to learn grammar, facts, reasoning, code, and world knowledge — because all of these are necessary to predict text well.
AdamW (Adam with decoupled weight decay) is universal for LLM training. It maintains per-parameter running estimates of gradient mean (m) and variance (v), giving each parameter an adaptive learning rate.
The learning rate schedule has two phases: linear warmup from 0 (first ~2000 steps), then cosine decay to ~10% of the peak LR over the remainder of training. Gradient clipping (max norm = 1.0) prevents catastrophic gradient explosions.
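The warmup-plus-cosine shape is easy to write down (the peak LR, warmup length, and floor here are illustrative values, not any specific model's recipe):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
          min_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_frac * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0
    return peak_lr * (min_frac + (1 - min_frac) * cosine)
```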
Data Parallelism: the same model is replicated on many GPUs, each processing different batches. Gradients are averaged via AllReduce. FSDP shards parameters across GPUs to address memory limits.
Tensor Parallelism: individual weight matrices are split across GPUs — the attention head dimension is a natural split point. Requires frequent inter-GPU communication within each forward pass.
Pipeline Parallelism: different layers run on different GPUs with micro-batches flowing through. Reduces communication but introduces pipeline "bubble" idle time.
Checkpoints save the complete training state — model weights, optimizer states (m and v for every parameter), learning rate schedule position, and random seeds — at regular intervals (every 1000–2000 steps).
Optimizer states can be as large as the weights themselves. A 70B model at FP32 has ~280GB weights plus ~560GB optimizer state. Checkpoints are sharded across many files.
Teams maintain "golden checkpoints" from stable points to restart from if training crashes or a loss spike occurs.
The pre-trained base model predicts text — it will complete any input, including harmful ones. SFT teaches it to be a helpful assistant by training on curated (instruction, response) pairs.
Human annotators write high-quality responses to diverse instructions. The model is fine-tuned with the same cross-entropy objective as pre-training, but loss is computed only on the response tokens — the instruction tokens are masked out.
SFT data is tiny compared to pre-training (10k–1M examples vs trillions of tokens) but extremely high impact — a few days of SFT transforms the base model.
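The response-only loss masking above amounts to zeroing out instruction positions before averaging (a sketch; `masked_nll` is a hypothetical name, and real code works on logit tensors rather than precomputed log-probs):

```python
def masked_nll(token_log_probs, loss_mask):
    """Average negative log-likelihood over response tokens only.
    token_log_probs[i]: model's log-prob of the correct token at position i.
    loss_mask[i]: 1 for response tokens, 0 for instruction tokens."""
    total = sum(-lp * m for lp, m in zip(token_log_probs, loss_mask))
    return total / sum(loss_mask)

# two instruction tokens (masked out) followed by three response tokens
log_probs = [-5.0, -4.0, -0.5, -0.2, -0.1]
mask      = [0,     0,    1,    1,    1]
```

The poorly predicted instruction tokens contribute nothing; only the response drives the gradient.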
A separate model — the Reward Model (RM) — is trained to predict which of two responses a human would prefer. Humans rank response pairs, producing (prompt, chosen, rejected) triples.
The RM is initialized from the SFT model with the final layer replaced by a scalar output (the reward). It's trained with a preference loss that pushes reward(chosen) above reward(rejected).
The RM is a proxy for human judgment — scoring any response in milliseconds. Its quality is the bottleneck for RLHF; a flawed RM leads to reward hacking where the model finds degenerate high-reward responses.
RLHF with PPO: the SFT model is the policy π. It generates responses; the RM scores them; PPO updates the policy to maximize reward. A KL-divergence penalty prevents the policy from drifting too far from the SFT model, stopping degenerate reward hacking.
DPO (Direct Preference Optimization): skips the separate RM. It reformulates the RLHF objective as supervised classification directly on (chosen, rejected) pairs — mathematically equivalent to implicit reward maximization but simpler and more stable. Used by LLaMA-3, Zephyr, many open models.
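The per-pair DPO loss is compact enough to sketch directly (scalar inputs for clarity; in practice these are summed token log-probs from the policy and the frozen SFT reference):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy-vs-reference margin on chosen
    minus the same margin on rejected))."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-logits)))
```

When the policy matches the reference the loss is log 2; pushing the chosen response's likelihood up relative to the reference lowers it.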
Anthropic's approach uses a constitution — a set of principles ("be helpful", "avoid harm", "be honest") — to generate preference data synthetically rather than requiring human labelers for every example.
RLAIF (RL from AI Feedback): the model critiques and revises its own outputs according to constitutional principles, producing (original, revised) pairs for supervised training; an AI labeler then judges response pairs against the constitution, generating the preference data used to train the preference model.
This scales the alignment process — rather than needing humans to rank every pair, the AI generates thousands of self-critique pairs guided by the constitution. Particularly effective for safety-relevant behaviors.
The embedding matrix maps every token ID to a dense vector. Shape: [vocab_size × d_model]. For LLaMA-7B: [32,000 × 4,096] = 131M parameters.
With weight tying, a single matrix does double duty: as the input lookup table and, transposed, as the pre-softmax output projection. This halves embedding parameters and can improve training stability, though not every model ties — GPT-2 does, while LLaMA keeps separate input and output matrices.
After training, these vectors encode rich semantic relationships — similar tokens cluster nearby in the 4096-dimensional space.
For each of the L transformer layers and each of the H attention heads, there are separate W_Q, W_K, W_V projection matrices that project the token representation into the query, key, and value spaces.
Each W_Q, W_K, W_V has shape [d_model × d_k] where d_k = d_model/H. W_O has shape [H·d_v × d_model] projecting concatenated head outputs back to d_model.
In practice these are stored as a single fused matrix W_QKV of shape [d_model × 3·d_model] enabling a single GEMM operation — a critical inference optimization.
These projections account for a large share of inference FLOPs at short context lengths — alongside the FFN, whose 4× inner dimension makes it the single biggest consumer; the O(n²) attention-score computation itself only dominates at long contexts.
Each transformer layer contains a two-layer MLP applied position-wise. Weights are W1: [d_model × d_ff] and W2: [d_ff × d_model] with the inner dimension d_ff = 4 × d_model.
Think of attention as the "communication" step (tokens exchange information) and the FFN as the "computation" step (each token reasons independently). Research suggests FFN layers act as key-value memory stores that recall factual associations.
Modern models replace ReLU with SwiGLU (three matrices W1, W2, W3 with Swish-gated activation) — empirically better with ~same parameter count.
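A pure-Python sketch of the SwiGLU forward pass on one token vector (matrices as lists of rows; `swiglu_ffn` and the toy dimensions are this sketch's own, not any library's API):

```python
import math

def swiglu_ffn(x, W1, W2, W3):
    """SwiGLU FFN: W2 @ (swish(W1 @ x) * (W3 @ x)).
    W1/W3 project up to d_ff, W2 projects back down to d_model."""
    def matvec(W, v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]
    def swish(v):
        return [u / (1 + math.exp(-u)) for u in v]
    gate = swish(matvec(W1, x))          # gating branch
    up = matvec(W3, x)                   # linear branch
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)
```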
Each LayerNorm block contains two learned vectors, both of shape [d_model]: γ (gamma) — scale, initialized to 1 — and β (beta) — shift, initialized to 0. RMSNorm keeps only γ.
These are tiny relative to attention and FFN weights — less than 0.5% of total parameters. Despite their small size they are critical: they allow the model to rescale and shift normalized activations to whatever range is most useful for the next layer.
Unlike weight matrices, attention patterns are not stored in the model file — they are intermediate activations recomputed fresh every forward pass. For each layer and head, the attention pattern is a [seq × seq] matrix where entry [i,j] is the weight token i gives to token j.
The KV cache IS stored during inference: once K and V are computed for a token, they are cached and reused in all future steps. Without KV caching, generating token t requires recomputing all t-1 previous tokens — O(t²) total. With caching, it is O(t).
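The asymptotic difference can be made concrete by counting how many token positions must be encoded during generation (a toy cost model; `generation_cost` is a hypothetical helper):

```python
def generation_cost(num_new_tokens, prompt_len=0, use_cache=True):
    """Count token positions encoded while generating num_new_tokens,
    starting from a prompt of prompt_len tokens."""
    cost = 0
    seq_len = prompt_len
    for _ in range(num_new_tokens):
        if use_cache:
            cost += 1            # only the newest token is encoded
        else:
            cost += seq_len + 1  # the entire prefix is re-encoded
        seq_len += 1
    return cost
```

Generating 100 tokens from an empty prompt costs 100 encodings with the cache versus 5050 (1+2+…+100) without.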
Training uses BF16 or FP32 weights (16–32 bits per parameter). Quantization reduces precision to INT8 or INT4 at inference time, shrinking model size 2–4× and speeding up inference 1.5–3×.
GPTQ and AWQ are the dominant post-training quantization methods for LLMs. Both use a small calibration dataset: GPTQ compensates for weight-rounding error using approximate second-order (Hessian) information, while AWQ rescales the small fraction of weights that see the largest activations.
INT8 loses <1% quality. INT4 loses 1–5% but enables running 70B models on consumer hardware. Sub-4-bit (1–2 bit) is an active research area.
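The basic round-trip is simple to sketch with symmetric per-tensor INT8 quantization (note this naive scheme is far cruder than GPTQ/AWQ, which quantize per-group with calibration data):

```python
def quantize_int8(weights):
    """Symmetric quantization: map the max magnitude to 127,
    round everything else onto the integer grid."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP weights; error is bounded by scale / 2."""
    return [qi * scale for qi in q]
```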
The KV cache grows linearly with context length during generation. At long contexts or high concurrency it becomes the dominant memory consumer — exceeding the model weights themselves.
PagedAttention (vLLM) manages KV cache like OS virtual memory: cache blocks are allocated on demand and shared across requests with the same prefix. This enables near-100% GPU memory utilization vs naive allocation's ~30%.
For a 7B model serving 100 concurrent users at 4096 context length: 2.1GB × 100 = 210GB of KV cache alone — requiring careful management across multiple GPUs.
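The arithmetic behind that figure, as a small calculator (assuming FP16 values and a 7B-class MHA config; the GQA correction factor is included for models with fewer KV heads):

```python
def kv_cache_bytes(n_layers, context_len, d_model,
                   n_kv_heads=None, n_heads=None, bytes_per_value=2):
    """KV cache for one sequence: a K and a V vector per layer per position.
    MHA caches the full d_model; GQA shrinks it by n_kv_heads / n_heads."""
    kv_dim = d_model
    if n_kv_heads is not None and n_heads is not None:
        kv_dim = d_model * n_kv_heads // n_heads
    return 2 * n_layers * context_len * kv_dim * bytes_per_value

# 7B-class MHA model (32 layers, d_model 4096) at FP16, 4096 context:
per_user = kv_cache_bytes(n_layers=32, context_len=4096, d_model=4096)
# ~2.1 GB per user, so 100 concurrent users need ~210 GB of KV cache
```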
Naive batching waits for a full batch, processes together, returns all results. This wastes GPU time — the batch idles waiting for the longest sequence to finish while others have already completed.
Continuous (iteration-level) batching: at each generation step the batch is dynamically updated — finished sequences are removed and new requests are inserted immediately. The GPU stays maximally utilized throughout.
Combined with quantization and KV cache optimization, continuous batching enables a single A100 80GB server to serve hundreds of concurrent users with low latency. Implemented in vLLM, TGI (HuggingFace), TensorRT-LLM.