LLM Creation Pipeline

From raw training data to deployed artifacts
Data
Processing
Tokenization
Architecture
Pre-training
Alignment
Artifacts
Deployment
🌐 Training Data Collection Web crawl · Books · Code · Academic · Curated Trillions of tokens
WEB · Common Crawl, Wikipedia, news sites
BOOKS · BookCorpus, Project Gutenberg, arXiv
CODE · GitHub, Stack Overflow, documentation
CURATED · High-quality filtered & licensed datasets
🔧 Data Processing & Filtering Deduplication · Quality filtering · PII removal · Domain mixing Data pipeline
DEDUP · Exact & fuzzy deduplication (MinHash)
FILTER · Perplexity filtering, heuristic rules
SAFETY · PII scrubbing, toxic content removal
MIX · Domain weighting and sampling ratios
✂️ Tokenization BPE / SentencePiece → integer IDs · Vocabulary construction Artifact: vocab.json
ALGO · BPE / SentencePiece training on corpus
VOCAB · Vocabulary file — 32k–100k entries
SPECIAL · [BOS], [EOS], [PAD], chat turn tokens
🏗️ Model Architecture Definition Hyperparameters · Layer config · Attention heads · Context window config.json
HYPER · d_model, n_layers, n_heads, d_ff, ctx_len
ATTN · MHA / MQA / GQA attention variants
POS · RoPE, ALiBi, absolute positional encoding
NORM · Pre-norm, RMSNorm vs LayerNorm
Pre-Training (Causal LM) Next-token prediction · Cross-entropy loss · Distributed training Weeks · 1000s of GPUs
OBJ · Causal language modeling — predict next token
OPTIM · AdamW, cosine LR schedule, warmup
INFRA · FSDP / Megatron / pipeline parallelism
CKPT · Checkpoint strategy, loss monitoring
🧭 Alignment & Fine-tuning SFT → Reward Model → RLHF / DPO · Constitutional AI Artifact: adapter weights
SFT · Supervised fine-tuning on instruction data
RM · Reward model trained on human preferences
RLHF · PPO / DPO optimization against reward model
CAI · Constitutional AI / RLAIF self-critique
📦 Model Artifacts Weights · Embeddings · Q/K/V matrices · Attention patterns The trained model
EMBED · Token embedding matrix [vocab × d_model]
Q·K·V · Per-head projection matrices × L layers
FFN · W1, W2 feed-forward weights per layer
NORM · γ, β normalization parameters per layer
CACHE · Runtime KV cache — not stored in weights
🚀 Deployment & Serving Quantization · KV cache · Continuous batching · Inference optimization Production system
QUANT · INT8/INT4 quantization — shrinks model 4–8×
CACHE · KV cache — avoids recomputing past tokens
BATCH · Continuous batching for throughput

Click any block to explore the full pipeline — from raw text to deployed model.

Follow the pipeline top to bottom. Each stage produces an artifact used by the next. Start with Training Data and finish at Deployment to see the complete picture.

Quick Reference — GPT-3 (175B) Scale
Training tokens: 300B
Parameters: 175B
d_model: 12,288
Layers: 96
Attn heads: 96
d_k = d_v: 128
d_ff: 49,152
Context: 2,048 tokens
Vocab size: 50,257
Training: ~3.14×10²³ FLOPs
Web Crawl Data
01 · Data Collection · Largest raw source

Common Crawl is the largest source — a petabyte-scale monthly snapshot of the web containing billions of pages. GPT-3 used ~410B tokens from filtered Common Crawl alone, representing ~60% of its training mix.

Wikipedia is small in volume (~4B tokens) but extremely high quality: factual, well-structured, multilingual. It is typically upsampled 3–4× so it appears more often than its raw byte fraction would suggest.

Raw web data is noisy — spam, boilerplate, duplicate content, SEO manipulation, adult content. It cannot be used directly and requires the full processing pipeline (stage 02).

Common Crawl
Wikipedia
News sites
Reddit
Multilingual
~60% of training mix
Books & Academic
01 · Data Collection · Long-form reasoning signal

BookCorpus and similar datasets provide long-form prose — crucial for learning coherent multi-paragraph reasoning. Books contain richer vocabulary and sentence structure than web text.

arXiv papers contribute scientific reasoning, mathematical notation, and structured argumentation. LaTeX source is often included so the model learns to handle equations and formal notation.

Books are heavily upsampled relative to their raw byte count — the signal density per token is far higher than web text. GPT-3's Books1 corpus (~12B tokens) was sampled at roughly 1.9 epochs, more than four times the rate of Common Crawl (0.44 epochs).

BookCorpus
Project Gutenberg
arXiv
Long-form prose
Upsampled 2×
Code Data
01 · Data Collection · Improves general reasoning

Code training data (GitHub, Stack Overflow, documentation) dramatically improves reasoning and instruction-following — not just coding ability. Code is highly structured, contains explicit logic chains, and has clear input→output relationships that transfer to general problem-solving.

Models trained with code data show improved performance on chain-of-thought reasoning, mathematics, and multi-step problems — suggesting that code's structural patterns generalize well beyond programming tasks.

# Code is structured, explicit, verifiable
def solve(args):
    # Explicit logic chain the model must learn
    result = process(args)
    return result  # ← clear input → output
GitHub
Stack Overflow
Documentation
Improves reasoning
Multi-language
Curated High-Quality Data
01 · Data Collection · Quality over quantity

Not all tokens are equal. A small set of high-quality, carefully selected data — textbooks, encyclopedias, legal documents, scientific papers — can be worth orders of magnitude more per token than raw web text.

Microsoft's Phi models demonstrated this strikingly: a 1.3B parameter model trained on "textbook-quality" synthetic data outperformed much larger models trained on raw web data. This shifted the field toward data quality over data quantity.

Modern pipelines classify and score each document, then heavily oversample top-rated content regardless of its raw byte fraction.

Quality > Quantity
Textbooks
Encyclopedias
Scientific papers
Phi / Mistral finding
Deduplication
02 · Data Processing · Prevents memorization of repeated content

The web is full of duplicate content — the same article syndicated to 500 sites, templated pages, mirror repositories. Without deduplication the model memorizes repeated text verbatim and perplexity metrics become misleadingly low.

Exact deduplication hashes document content and removes identical copies. Fast but misses paraphrases.

Fuzzy deduplication uses MinHash + Locality Sensitive Hashing (LSH) to find near-duplicates. Documents are represented as sets of n-gram shingles; MinHash estimates Jaccard similarity in sublinear time; pairs above ~0.8 similarity have one copy removed.

Jaccard(A, B) = |A ∩ B| / |A ∪ B|
MinHash approximates this in O(n) via random hash functions on n-gram shingles
MinHash + LSH
n-gram shingles
Exact + fuzzy
Prevents memorization
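The MinHash estimate described above can be sketched in a few lines. This is a toy illustration — the function names, the 64-hash signature size, and the use of MD5 as the family of seeded hash functions are all choices made here for clarity, not how any production pipeline is implemented:

```python
import hashlib

def shingles(text, n=3):
    """Character n-gram shingle set for a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function summarizes the shingle set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"   # near-duplicate
doc3 = "completely unrelated text about llm training"
s1, s2, s3 = (minhash_signature(shingles(d)) for d in (doc1, doc2, doc3))
```

With signatures in hand, LSH banding (not shown) groups candidate pairs so only near-duplicates are ever compared directly.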
Quality Filtering
02 · Data Processing · Removes low-signal documents

Heuristic filters remove obvious low-quality documents: pages with very high symbol/word ratios (spam), too few unique words (boilerplate), below minimum length, or in unwanted languages.

Perplexity filtering trains a small reference model (e.g. KenLM on Wikipedia) and scores each document. Very high perplexity indicates gibberish or heavily non-natural-language content.

More sophisticated approaches use a classifier trained to distinguish Wikipedia-quality text from low-quality text, filtering on classifier confidence score.

KenLM perplexity
Heuristic rules
Classifier filtering
Language detection
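A minimal sketch of the heuristic gates described above — the thresholds (20 words, 30% symbols, 30% unique words) are illustrative placeholders, not values from any published pipeline:

```python
def passes_heuristics(doc, min_words=20, max_symbol_ratio=0.3,
                      min_unique_frac=0.3):
    """Cheap document-quality gates: length, symbol spam, boilerplate repetition."""
    words = doc.split()
    if len(words) < min_words:                       # too short
        return False
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(doc), 1) > max_symbol_ratio:  # symbol spam
        return False
    unique_frac = len({w.lower() for w in words}) / len(words)
    if unique_frac < min_unique_frac:                # repetitive boilerplate
        return False
    return True

good = ("Language models are trained on large text corpora. "
        "Quality filtering removes documents that look like spam "
        "or boilerplate before tokenization begins at scale here.")
spam = "$$$ WIN !!! $$$ WIN !!! " * 10
```

Perplexity and classifier filters then score whatever survives these cheap checks.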
PII & Safety Filtering
02 · Data Processing · Legal compliance and safety

PII removal targets email addresses, phone numbers, social security numbers, credit card numbers, IP addresses, and names in sensitive contexts. Regex patterns and NER models identify and either redact or remove affected documents.

Toxic content filtering uses classifiers trained on hate speech, graphic violence, CSAM, and other harmful categories. The threshold is carefully calibrated — aggressive filtering can remove legitimate historical, medical, or legal content.

This is best-effort at trillion-token scale. Some harmful content inevitably passes through, which is one reason alignment (stage 06) is critical.

Regex PII
NER models
Toxic classifiers
GDPR compliance
Best-effort
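A toy version of the regex-based PII pass — these three patterns are deliberately simplified for illustration; real pipelines use far broader pattern sets plus NER models, as noted above:

```python
import re

# Illustrative patterns only — real scrubbers are much more thorough.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text):
    """Replace each match with a typed placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact jane.doe@example.com or 555-867-5309, SSN 123-45-6789."
```

Redacting to typed placeholders (rather than deleting the document) preserves surrounding training signal.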
Domain Mixing & Sampling
02 · Data Processing · Weighting sources for optimal training

The final corpus is a carefully weighted mixture. Each domain is assigned a sampling weight determining how often its tokens appear in training batches. High-quality sources are upsampled; raw web text is downsampled.

The Chinchilla paper showed most LLMs were undertrained relative to their size — more data with a smaller model often outperforms less data with a larger model, changing how these ratios are set.

GPT-3 approximate domain weights:
  Common Crawl : 60%  (0.44 epochs)
  WebText2     : 22%  (2.9 epochs)
  Books1       : 8%   (1.9 epochs)
  Books2       : 8%   (0.43 epochs)
  Wikipedia    : 3%   (3.4 epochs)
Upsampling
Chinchilla scaling
Multi-epoch
Domain weights
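Domain weighting reduces to weighted sampling when batches are assembled. A minimal sketch, using a hypothetical mixture loosely shaped like GPT-3's (the names and weights here are illustrative):

```python
import random

# Hypothetical mixture — weights sum to 1.0 and are for illustration only.
DOMAIN_WEIGHTS = {"web": 0.60, "webtext": 0.22, "books1": 0.08,
                  "books2": 0.07, "wiki": 0.03}

def sample_domains(n, weights, seed=0):
    """Draw the source domain for n training examples by mixture weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

draws = sample_domains(100_000, DOMAIN_WEIGHTS)
```

Over many batches, each domain's share of seen tokens converges to its weight regardless of raw corpus size.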
Tokenizer Training (BPE)
03 · Tokenization · Trained separately, frozen before model training

BPE (Byte-Pair Encoding) starts with individual bytes/characters and iteratively merges the most frequent adjacent pair until the vocabulary reaches the target size (32k–100k tokens).

SentencePiece operates directly on raw text as a stream of Unicode characters — whitespace included — making it language-agnostic and robust to whitespace variations. Used by LLaMA, T5, Gemma.

The tokenizer is trained once, then permanently frozen. Every document is tokenized with this fixed vocabulary before any model training begins.

BPE iteration:
  "l o w e r", "l o w e s t", ...
  → most frequent pair ("l", "o") merges to "lo"
  → "lo w e r", "lo w e s t", ...
  → repeat until vocab_size tokens reached
BPE
SentencePiece
Frozen after training
32k–100k vocab
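The merge loop above can be written out directly. This is a toy sketch of BPE training on whole words with frequencies — real tokenizers work at byte level over the full corpus and use optimized implementations:

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    """Learn BPE merges from {word: frequency}, words split into characters."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                       # count adjacent symbol pairs
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair
        merges.append(best)
        merged = {}                             # apply the merge everywhere
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

merges, vocab = bpe_train({"lower": 2, "lowest": 3, "low": 5}, num_merges=3)
```

The learned merge list is exactly what ships in merges.txt: applying it in order re-tokenizes any new text deterministically.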
Vocabulary File (vocab.json)
03 · Tokenization · Permanent artifact shipped with the model

The vocabulary file maps every token string to an integer ID and back. This file must ship with the model — without it you cannot convert text to input IDs or decode output IDs back to text.

The vocabulary encodes implicit decisions about the corpus: what languages are well-represented (common tokens have short, efficient encodings) and what domains are present (Python keywords like def may be a single token).

vocab.json structure:
{
  "hello": 31373,
  " the": 262,     ← leading space included
  "def": 4299,     ← Python keyword as one token
  ...
}
(GPT-2-style files spell the leading space as "Ġ" — e.g. "Ġthe" for " the")
vocab.json
merges.txt
token ↔ ID mapping
Ships with model
Special Tokens
03 · Tokenization · Control signals, not natural language

Special tokens are reserved IDs with semantic meaning for the model's operation. They are inserted at training time to structure inputs and outputs.

[BOS] — beginning of sequence, prepended to every input.
[EOS] — end of sequence; the model learns to predict this when done.
[PAD] — fills short sequences in a batch.
[MASK] — used in masked LM (BERT-style).

Chat models add structured turn tokens so the model distinguishes user from assistant: <|user|>, <|assistant|>, <|system|>.

[BOS] [EOS]
[PAD] [MASK]
Chat turn tokens
Control signals
Not natural language
Core Hyperparameters
04 · Architecture · Specified before training, permanent

The configuration file fully specifies the architecture before any weight is initialized. These choices are essentially permanent — changing them requires retraining from scratch.

d_model — width of the representation throughout the network.
n_layers (L) — how many attention+FFN blocks are stacked.
n_heads (H) — parallel attention heads; d_k = d_model / H.
Context length — max tokens the model can see; attention is O(n²), so doubling context quadruples attention cost.

GPT-3 (175B) config:
  d_model   = 12288
  n_layers  = 96
  n_heads   = 96
  d_k = d_v = 128 (= 12288 / 96)
  d_ff      = 49152 (= 4 × d_model)
  ctx_len   = 2048
Small (1B)
d=2048, L=24, H=16
Medium (7B)
d=4096, L=32, H=32
Large (70B)
d=8192, L=80, H=64
XL (175B)
d=12288, L=96, H=96
config.json
d_model
n_layers
n_heads
ctx_len
d_ff = 4× d_model
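These few numbers determine the parameter count almost entirely. A rough sketch (ignoring biases, norm parameters, and positional embeddings) that recovers GPT-3's ~175B from the config above:

```python
def estimate_params(vocab, d_model, n_layers, d_ff):
    """Rough decoder-only parameter count from the core hyperparameters."""
    embed = vocab * d_model                      # token embedding matrix
    attn = n_layers * 4 * d_model * d_model      # W_Q, W_K, W_V, W_O per layer
    ffn = n_layers * 2 * d_model * d_ff          # W1, W2 per layer
    return embed + attn + ffn

gpt3 = estimate_params(vocab=50257, d_model=12288, n_layers=96, d_ff=49152)
```

The estimate lands within ~1% of the advertised 175B, which is why "d_model, n_layers, d_ff" headline every model card.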
Attention Variants
04 · Architecture · MHA / MQA / GQA tradeoffs

Multi-Head Attention (MHA): the original design. Every head has its own Q, K, V projections. H heads × 3 matrices × (d_model × d_k) parameters per layer.

Multi-Query Attention (MQA): all Q heads are separate but K and V are shared across heads. Reduces KV cache memory H× at modest quality cost. Used in PaLM, Falcon.

Grouped Query Attention (GQA): G groups share K/V, each group has its own Q heads. Best quality-vs-inference-cost tradeoff. Used by LLaMA-2 70B, Mistral, Gemma.

MHA: Q_i, K_i, V_i separate per head i
     params: 4 × d_model² per layer
MQA: Q_i separate, K and V shared
     KV cache: 1/H of MHA
GQA: Q_i separate, K_g/V_g shared per group
     KV cache: G/H of MHA
GQA (modern standard)
MHA (original)
MQA
KV cache tradeoff
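The KV-cache saving is easy to make concrete. A sketch of the sizing arithmetic, using LLaMA-7B-like shapes (32 layers, 32 heads, d_head 128, FP16) as the worked example:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, ctx_len, bytes_per_elem=2):
    """K and V vectors per KV head, per layer, per position (FP16 default)."""
    return 2 * n_layers * n_kv_heads * d_head * ctx_len * bytes_per_elem

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, ctx_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  d_head=128, ctx_len=4096)
```

MHA needs ~2 GB of cache per 4096-token request; dropping to 8 KV groups (GQA) cuts that to a quarter with only the K/V head count changing.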
Positional Encoding
04 · Architecture · Injecting sequence order into attention

Absolute sinusoidal (original Transformer): fixed sine/cosine encoding added to embeddings. Hard limit at training context length.

RoPE (Rotary Position Embedding): applies a rotation to Q and K vectors based on their positions. The dot product Q·K naturally encodes relative distance. Can be extended beyond training length via RoPE scaling. Used by LLaMA, GPT-NeoX, Mistral.

ALiBi (Attention with Linear Biases): adds a linear penalty to attention scores based on distance between positions. No learnable parameters; extrapolates well to longer sequences. Used by MPT, BLOOM.

RoPE (dominant)
ALiBi
Sinusoidal
Learned absolute
Length extrapolation
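RoPE's key property — attention scores depend only on relative distance — can be checked numerically. A minimal sketch on plain Python lists (a real implementation vectorizes this over heads and applies it inside attention):

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) dims of x by a position-dependent angle."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)           # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.2, -0.7, 0.4]
```

Because each 2D pair is rotated, Q·K between positions m and n depends only on n − m: the score for positions (3, 5) equals the score for (10, 12).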
Normalization Strategy
04 · Architecture · Stabilizing deep network training

Post-norm (original Transformer): LayerNorm after residual — LayerNorm(x + sublayer(x)). Requires careful LR warmup; unstable with very deep networks.

Pre-norm (GPT-2 onward): LayerNorm before the sublayer — x + sublayer(LayerNorm(x)). More stable, allows removing warmup. Standard in modern LLMs.

RMSNorm: simplification that only normalizes by root mean square, dropping mean subtraction. ~10% faster, empirically equivalent quality. Used by LLaMA, Mistral, Gemma.

LayerNorm: γ ⊙ (x − μ) / √(σ² + ε) + β
RMSNorm:   γ ⊙ x / RMS(x), where RMS(x) = √(mean(x²))
RMSNorm (modern)
Pre-norm
Post-norm (original)
Training stability
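RMSNorm's simplification is visible in a few lines. A toy scalar implementation (real code operates on tensors over the last dimension):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Scale x to unit RMS, then apply the learned gain — no mean, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [2.0, -1.0, 4.0, 0.5]
y = rms_norm(x, gamma=[1.0] * 4)                 # γ initialized to ones
rms_y = math.sqrt(sum(v * v for v in y) / len(y))
```

With γ at its initialization of ones, the output has RMS ≈ 1 regardless of the input scale — the stabilizing property the deep residual stack relies on.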
Training Objective: Causal LM
05 · Pre-Training · Next-token prediction

The sole objective is next-token prediction: given a sequence, predict the next token. This is called causal language modeling (CLM) because each position can only attend to past positions.

The loss is the average cross-entropy across all positions. Each token prediction is a V-way classification problem (one class per vocabulary token). This deceptively simple objective forces the model to learn grammar, facts, reasoning, code, and world knowledge — because all of these are necessary to predict text well.

L = −1/T × Σᵢ log P(xᵢ | x₁…xᵢ₋₁ ; θ)
Perplexity = exp(L)
Random baseline (50k vocab): PPL = 50,000
GPT-3 on Penn Treebank: PPL ≈ 20
Next-token prediction
Cross-entropy loss
Perplexity
Self-supervised
No labels needed
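The loss and its perplexity reading can be verified on toy numbers. A sketch where `token_probs` stands for the probability the model assigned to each true next token:

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-probability assigned to the true next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that spreads mass uniformly over a 50,000-token vocabulary:
uniform_ppl = math.exp(cross_entropy([1 / 50_000] * 10))

# A model that puts 0.9 on each correct token:
confident_ppl = math.exp(cross_entropy([0.9] * 10))
```

The uniform guesser's perplexity equals the vocabulary size exactly — the "random baseline" quoted above — while the confident model sits near 1.1.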
Optimizer & Learning Rate
05 · Pre-Training · AdamW with cosine decay schedule

AdamW (Adam with decoupled weight decay) is universal for LLM training. It maintains per-parameter running estimates of gradient mean (m) and variance (v), giving each parameter an adaptive learning rate.

The learning rate schedule has three phases: linear warmup from 0 (first ~2000 steps), cosine decay to ~10% of peak LR (majority of training), then a constant tail. Gradient clipping (max norm = 1.0) prevents catastrophic gradient explosions.

AdamW update:
  m = β₁m + (1−β₁)g      ← gradient momentum
  v = β₂v + (1−β₂)g²     ← gradient variance
  θ = θ − α·m̂/(√v̂ + ε) − α·λ·θ   ← adaptive step + decoupled decay
Typical: β₁=0.9, β₂=0.95, ε=1e-8, λ=0.1
Peak LR: 1e-4 to 3e-4
AdamW
Cosine LR decay
Linear warmup
Grad clipping (1.0)
β₁=0.9, β₂=0.95
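The warmup-then-cosine schedule described above is a small closed-form function. A sketch with illustrative defaults (2000 warmup steps, peak 3e-4, decay to 10% of peak):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=2000, total=100_000, min_frac=0.1):
    """Linear warmup from 0 to peak, then cosine decay to min_frac × peak."""
    if step < warmup:
        return peak_lr * step / warmup           # linear warmup
    progress = (step - warmup) / (total - warmup)
    floor = peak_lr * min_frac
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

The three phases fall out directly: lr is 0 at step 0, exactly peak at the end of warmup, and the 10% floor at the final step.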
Distributed Training Infrastructure
05 · Pre-Training · Three parallelism strategies combined

Data Parallelism: the same model is replicated on many GPUs, each processing different batches. Gradients are averaged via AllReduce. FSDP shards parameters across GPUs to address memory limits.

Tensor Parallelism: individual weight matrices are split across GPUs — the attention head dimension is a natural split point. Requires frequent inter-GPU communication within each forward pass.

Pipeline Parallelism: different layers run on different GPUs with micro-batches flowing through. Reduces communication but introduces pipeline "bubble" idle time.

FSDP
Megatron-LM
Data parallel
Tensor parallel
Pipeline parallel
AllReduce
Checkpointing Strategy
05 · Pre-Training · At scale, hardware failures are guaranteed

Checkpoints save the complete training state — model weights, optimizer states (m and v for every parameter), learning rate schedule position, and random seeds — at regular intervals (every 1000–2000 steps).

Optimizer states can be as large as the weights themselves. A 70B model at FP32 has ~280GB weights plus ~560GB optimizer state. Checkpoints are sharded across many files.

Teams maintain "golden checkpoints" from stable points to restart from if training crashes or a loss spike occurs.

Model weights
Optimizer state (m, v)
LR schedule
Loss spike recovery
Sharded files
Supervised Fine-Tuning (SFT)
06 · Alignment · Transforms a raw LM into an assistant

The pre-trained base model predicts text — it will complete any input, including harmful ones. SFT teaches it to be a helpful assistant by training on curated (instruction, response) pairs.

Human annotators write high-quality responses to diverse instructions. The model is fine-tuned with the same cross-entropy objective as pre-training, but loss is computed only on the response tokens — the instruction tokens are masked out.

SFT data is tiny compared to pre-training (10k–1M examples vs trillions of tokens) but extremely high impact — a few days of SFT transforms the base model.

Instruction tuning
Human-written responses
Loss on response only
LoRA / full finetune
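The "loss on response only" rule is just a mask over per-token losses. A toy sketch with made-up loss values — a real trainer gets these from the cross-entropy over logits:

```python
def masked_mean_loss(token_losses, response_mask):
    """Average per-token loss over response positions only."""
    kept = [l for l, m in zip(token_losses, response_mask) if m]
    return sum(kept) / len(kept)

# 4 instruction tokens (masked out) followed by 3 response tokens:
losses = [5.0, 4.0, 6.0, 5.0, 1.0, 2.0, 3.0]
mask   = [0,   0,   0,   0,   1,   1,   1]
sft_loss = masked_mean_loss(losses, mask)
```

Gradients therefore flow only from the assistant's answer; the model conditions on the instruction without being trained to reproduce it.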
Reward Model Training
06 · Alignment · Learning to predict human preferences

A separate model — the Reward Model (RM) — is trained to predict which of two responses a human would prefer. Humans rank response pairs, producing (prompt, chosen, rejected) triples.

The RM is initialized from the SFT model with the final layer replaced by a scalar output (the reward). It's trained with a preference loss that pushes reward(chosen) above reward(rejected).

The RM is a proxy for human judgment — scoring any response in milliseconds. Its quality is the bottleneck for RLHF; a flawed RM leads to reward hacking where the model finds degenerate high-reward responses.

RM loss = −log σ(r(chosen) − r(rejected))
  where r(x) = scalar reward for response x
        σ    = sigmoid function
Preference data
Scalar reward output
Bradley-Terry model
Reward hacking risk
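The preference loss above is one line of arithmetic. A sketch that takes the two scalar rewards directly (a real RM produces them from a forward pass):

```python
import math

def rm_loss(r_chosen, r_rejected):
    """Bradley–Terry preference loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1 / (1 + math.exp(-margin)))
```

When the RM cannot separate the pair (margin 0) the loss is log 2; widening the margin in the correct direction drives it toward 0, which is exactly the gradient signal that teaches the scalar head to rank responses.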
RLHF / DPO
06 · Alignment · Optimizing the policy against human preferences

RLHF with PPO: the SFT model is the policy π. It generates responses; the RM scores them; PPO updates the policy to maximize reward. A KL-divergence penalty prevents the policy from drifting too far from the SFT model, stopping degenerate reward hacking.

DPO (Direct Preference Optimization): skips the separate RM. It reformulates the RLHF objective as supervised classification directly on (chosen, rejected) pairs — mathematically equivalent to implicit reward maximization but simpler and more stable. Used by LLaMA-3, Zephyr, many open models.

RLHF-PPO: reward = RM(prompt, response) − β·KL(π ‖ π_SFT)
DPO: loss = −log σ(β·log[π(chosen)/π_ref(chosen)] − β·log[π(rejected)/π_ref(rejected)])
DPO (dominant)
PPO
KL penalty
Policy gradient
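The DPO loss can be computed from four summed log-probabilities. A sketch with illustrative numbers (a real implementation sums token log-probs of each full response under the policy and the frozen reference):

```python
import math

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """DPO: logistic loss on the implicit reward margin between chosen/rejected."""
    margin = beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy lifts the chosen response above the reference and pushes the
# rejected one below it → loss drops relative to a do-nothing policy:
better = dpo_loss(-10.0, -14.0, ref_logp_c=-12.0, ref_logp_r=-12.0)
neutral = dpo_loss(-12.0, -12.0, ref_logp_c=-12.0, ref_logp_r=-12.0)
```

No reward model, no sampling, no PPO machinery — just supervised gradients on preference pairs, which is why DPO is the simpler and more stable option.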
Constitutional AI (CAI)
06 · Alignment · Scalable AI-generated preference data

Anthropic's approach uses a constitution — a set of principles ("be helpful", "avoid harm", "be honest") — to generate preference data synthetically rather than requiring human labelers for every example.

RLAIF (RL from AI Feedback): a pre-trained model critiques and revises its own outputs according to constitutional principles, generating (original, revised) pairs used to train a preference model.

This scales the alignment process — rather than needing humans to rank every pair, the AI generates thousands of self-critique pairs guided by the constitution. Particularly effective for safety-relevant behaviors.

Constitutional AI
RLAIF
Self-critique
Scalable oversight
Token Embedding Matrix
07 · Model Artifacts · First learnable layer · Shared with output

The embedding matrix maps every token ID to a dense vector. Shape: [vocab_size × d_model]. For LLaMA-7B: [32,000 × 4,096] = 131M parameters.

Via weight tying, the same matrix can be used twice: as the input lookup and, transposed, as the pre-softmax output projection. This reduces parameters and improves training stability. (GPT-2 ties these weights; LLaMA keeps a separate output matrix.)

After training, these vectors encode rich semantic relationships — similar tokens cluster nearby in the 4096-dimensional space.

Shape: [vocab_size × d_model]
LLaMA-7B: [32000 × 4096] = 131M params
GPT-3:    [50257 × 12288] = 617M params
Used as (when tied):
  1. Input lookup:  token_id → vector
  2. Output linear: hidden → logits (transposed)
[vocab × d_model]
Weight tying
First + last layer
Semantic geometry
Q, K, V Projection Matrices
07 · Model Artifacts · The learned attention weights, per layer per head

For each of the L transformer layers and each of the H attention heads, there are separate W_Q, W_K, W_V projection matrices that project the token representation into the query, key, and value spaces.

Each W_Q, W_K, W_V has shape [d_model × d_k] where d_k = d_model/H. W_O has shape [H·d_v × d_model] projecting concatenated head outputs back to d_model.

In practice these are stored as a single fused matrix W_QKV of shape [d_model × 3·d_model] enabling a single GEMM operation — a critical inference optimization.

These projections account for a large share of inference FLOPs — x@W_Q, x@W_K, x@W_V run for every token at every layer — though the FFN matrices, at roughly twice the parameter count, cost even more.

Per layer, per head (MHA):
  W_Q : [d_model × d_k]     ← "what to look for"
  W_K : [d_model × d_k]     ← "what I contain"
  W_V : [d_model × d_v]     ← "what I provide"
  W_O : [H·d_v × d_model]   ← "merge all heads"
Fused for efficiency:
  W_QKV : [d_model × 3·d_model]
Total attention params: L × 4 × d_model²
GPT-3: 96 × 4 × 12288² ≈ 58B params
GPT-2 (117M)
L=12, H=12, d_k=64
LLaMA-7B
L=32, H=32, d_k=128
LLaMA-70B
L=80, H=64, GQA-8
GPT-3 (175B)
L=96, H=96, d_k=128
[d_model × d_k] each
L × H matrices
Fused as W_QKV
~33% of total params
Feed-Forward Network Weights
07 · Model Artifacts · ~66% of all model parameters

Each transformer layer contains a two-layer MLP applied position-wise. Weights are W1: [d_model × d_ff] and W2: [d_ff × d_model] with the inner dimension d_ff = 4 × d_model.

Think of attention as the "communication" step (tokens exchange information) and the FFN as the "computation" step (each token reasons independently). Research suggests FFN layers act as key-value memory stores that recall factual associations.

Modern models replace ReLU with SwiGLU (three matrices W1, W2, W3 with Swish-gated activation) — empirically better with ~same parameter count.

Standard (ReLU):
  FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2
  W1: [d_model × d_ff], W2: [d_ff × d_model]
SwiGLU (LLaMA):
  FFN(x) = (SiLU(x @ W1) ⊙ (x @ W3)) @ W2
Params per layer: 2 × d_model × d_ff
GPT-3 total FFN: 96 × 2 × 12288 × 49152 ≈ 116B
~66% of all params
W1: [d_model × d_ff]
W2: [d_ff × d_model]
SwiGLU variant
Key-value memory
Normalization Parameters (γ, β)
07 · Model Artifacts · Tiny but critical learned scalings

Each LayerNorm block contains two learned vectors: γ (gamma) — scale, initialized to 1 — and β (beta) — shift, initialized to 0. Both have shape [d_model]. RMSNorm keeps only γ.

These are tiny relative to attention and FFN weights — less than 0.5% of total parameters. Despite their small size they are critical: they allow the model to rescale and shift normalized activations to whatever range is most useful for the next layer.

Per LayerNorm block:
  γ : [d_model]  ← scale (init: ones)
  β : [d_model]  ← shift (init: zeros)
  output = γ ⊙ normalize(x) + β
LLaMA-7B (RMSNorm, γ only):
  65 norm blocks × 4096 ≈ 266K params (~0.004% of total)
γ: [d_model]
β: [d_model]
Per norm block
< 0.5% of params
Critical despite small size
Attention Patterns & KV Cache
07 · Model Artifacts · Runtime activations, not stored weights

Unlike weight matrices, attention patterns are not stored in the model file — they are intermediate activations recomputed fresh every forward pass. For each layer and head, the attention pattern is a [seq × seq] matrix where entry [i,j] is the weight token i gives to token j.

The KV cache IS stored during inference: once K and V are computed for a token, they are cached and reused in all future steps. Without KV caching, generating token t requires recomputing all t-1 previous tokens — O(t²) total. With caching, it is O(t).

Attention pattern (computed, not stored):
  A = softmax(QKᵀ / √d_k)    shape: [seq × seq]
KV cache per token per layer (IS stored):
  K_vector : [d_k]
  V_vector : [d_v]
KV cache size (LLaMA-7B, 4096 ctx):
  2 × 32 layers × 32 heads × 128 × 4096 × 2B = 2.1 GB per active request
KV cache IS stored
Attn patterns NOT stored
O(t) vs O(t²)
~2GB per request (7B)
Visualizable at runtime
Quantization
08 · Deployment · Reducing precision for memory and speed

Training uses BF16 or FP32 weights (16–32 bits per parameter). Quantization reduces precision to INT8 or INT4 at inference time, shrinking model size 2–4× and speeding up inference 1.5–3×.

GPTQ and AWQ are the dominant post-training quantization methods for LLMs. They use a small calibration dataset and compensate for weight-rounding error using approximate second-order (Hessian) information.

INT8 loses <1% quality. INT4 loses 1–5% but enables running 70B models on consumer hardware. Sub-4-bit (1–2 bit) is an active research area.

FP16 → INT8:
  scale    = max(|W|) / 127
  W_int8   = round(W / scale)   ← stored
  W_approx = W_int8 × scale     ← at runtime
Memory, 70B-parameter model:
  FP16: 140 GB (2 bytes/param)
  INT8:  70 GB
  INT4:  35 GB ← fits 2× A100 40GB
INT4 (GPTQ/AWQ)
INT8
2–4× compression
Post-training
Calibration dataset
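The symmetric INT8 scheme above fits in a few lines. A per-tensor toy sketch — GPTQ/AWQ are far more sophisticated (per-group scales, Hessian-aware rounding), but the round-trip is the same idea:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: int8 values plus one floating-point scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]      # stored as int8
    return q, scale

def dequantize(q, scale):
    """Approximate recovery at runtime."""
    return [qi * scale for qi in q]

w = [0.42, -1.27, 0.003, 0.9, -0.31]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Rounding error is bounded by half the scale step, which is why quality loss stays under 1% at 8 bits but grows as the step widens at 4 bits.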
KV Cache Management
08 · Deployment · The dominant inference memory challenge

The KV cache grows linearly with context length during generation. At long contexts or high concurrency it becomes the dominant memory consumer — exceeding the model weights themselves.

PagedAttention (vLLM) manages KV cache like OS virtual memory: cache blocks are allocated on demand and shared across requests with the same prefix. This enables near-100% GPU memory utilization vs naive allocation's ~30%.

For a 7B model serving 100 concurrent users at 4096 context length: 2.1GB × 100 = 210GB of KV cache alone — requiring careful management across multiple GPUs.

KV cache per token (LLaMA-7B):
  2 × 32 layers × 32 heads × 128 dims × 2B = 524 KB per token
At 4096 context: 2.1 GB per request
At 100 users:    210 GB KV cache total
PagedAttention manages this like virtual memory:
  near-100% GPU memory utilization vs ~30% naive
PagedAttention
vLLM
O(t) generation
Memory bottleneck
Prefix sharing
Continuous Batching
08 · Deployment · Maximizing GPU utilization across requests

Naive batching waits for a full batch, processes together, returns all results. This wastes GPU time — the batch idles waiting for the longest sequence to finish while others have already completed.

Continuous (iteration-level) batching: at each generation step the batch is dynamically updated — finished sequences are removed and new requests are inserted immediately. The GPU stays maximally utilized throughout.

Combined with quantization and KV cache optimization, continuous batching enables a single A100 80GB server to serve hundreds of concurrent users with low latency. Implemented in vLLM, TGI (HuggingFace), TensorRT-LLM.

Continuous batching
vLLM
TGI
TensorRT-LLM
GPU utilization
Dynamic scheduling
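The throughput gap between static and iteration-level batching can be shown with a toy step-count simulation. The request lengths and batch size here are invented for illustration; real schedulers also account for prefill, memory, and latency targets:

```python
def naive_batches(lengths, batch_size):
    """Static batching: each batch occupies the GPU until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batches(lengths, batch_size):
    """Iteration-level scheduling: freed slots are refilled every decode step."""
    queue = list(lengths)
    active, steps = [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))          # admit new requests immediately
        steps += 1                               # one decode step for all active
        active = [t - 1 for t in active if t - 1 > 0]
    return steps

lengths = [50, 2, 2, 2, 2, 2]   # one long request, several short ones
```

With batch size 2, static batching takes 54 steps because short requests wait on batch boundaries; continuous batching finishes in 50 — the long request's own length — since the short ones slot into freed capacity.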