Click any block to explore the full pipeline — from raw text to deployed model.
Follow the pipeline top to bottom. Each stage produces an artifact used by the next. Start with Training Data and finish at Deployment to see the complete picture.
Common Crawl is the largest source — a petabyte-scale monthly snapshot of the web containing billions of pages. GPT-3 used ~410B tokens from filtered Common Crawl alone, representing ~60% of its training mix.
Wikipedia is small in volume (~3B tokens in GPT-3's mix) but extremely high quality: factual, well-structured, multilingual. It is typically upsampled 3–4× so it appears more often than its raw byte fraction would suggest.
Raw web data is noisy — spam, boilerplate, duplicate content, SEO manipulation, adult content. It cannot be used directly and requires the full processing pipeline (stage 02).
BookCorpus and similar datasets provide long-form prose — crucial for learning coherent multi-paragraph reasoning. Books contain richer vocabulary and sentence structure than web text.
arXiv papers contribute scientific reasoning, mathematical notation, and structured argumentation. LaTeX source is often included so the model learns to handle equations and formal notation.
Books are heavily upsampled relative to their raw byte count — the signal density per token is far higher than web text. GPT-3's two book corpora contributed ~67B tokens, with the smaller, higher-quality corpus sampled at nearly 2× its natural rate.
Code training data (GitHub, Stack Overflow, documentation) dramatically improves reasoning and instruction-following — not just coding ability. Code is highly structured, contains explicit logic chains, and has clear input→output relationships that transfer to general problem-solving.
Models trained with code data show improved performance on chain-of-thought reasoning, mathematics, and multi-step problems — suggesting that code's structural patterns generalize well beyond programming tasks.
Not all tokens are equal. A small set of high-quality, carefully selected data — textbooks, encyclopedias, legal documents, scientific papers — can be worth orders of magnitude more per token than raw web text.
Microsoft's Phi models demonstrated this strikingly: a 1.3B parameter model trained on "textbook-quality" synthetic data outperformed much larger models trained on raw web data. This shifted the field toward data quality over data quantity.
Modern pipelines classify and score each document, then heavily oversample top-rated content regardless of its raw byte fraction.
The web is full of duplicate content — the same article syndicated to 500 sites, templated pages, mirror repositories. Without deduplication the model memorizes repeated text verbatim and perplexity metrics become misleadingly low.
Exact deduplication hashes document content and removes identical copies. Fast but misses paraphrases.
Fuzzy deduplication uses MinHash + Locality Sensitive Hashing (LSH) to find near-duplicates. Documents are represented as sets of n-gram shingles; MinHash compresses each set into a short signature that estimates Jaccard similarity; LSH finds candidate pairs without comparing every document to every other; pairs above ~0.8 similarity have one copy removed.
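A toy sketch of the MinHash half of this idea (the shingle size, hash count, and function names here are illustrative, and real pipelines add LSH banding on top to avoid all-pairs comparison):

```python
import hashlib
import re

def shingles(text, n=3):
    """Represent a document as its set of word n-gram shingles."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    """One minimum per seeded hash function; each slot simulates taking the
    first shingle under a random permutation of the shingle universe."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents differing in one word share most shingles and get a high estimate; unrelated documents share none and score near zero.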
Heuristic filters remove obvious low-quality documents: pages with very high symbol/word ratios (spam), too few unique words (boilerplate), below minimum length, or in unwanted languages.
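A minimal sketch of such gates, in the spirit of the C4/Gopher rules (the thresholds and the `passes_heuristics` name are illustrative, not any published pipeline's exact values):

```python
def passes_heuristics(text, min_words=20, max_symbol_ratio=0.1,
                      min_unique_ratio=0.3):
    """Cheap document-level quality gates: length, symbol density, repetition."""
    words = text.split()
    if len(words) < min_words:          # too short to be a real document
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(text), 1) > max_symbol_ratio:  # spam-like symbol soup
        return False
    if len(set(words)) / len(words) < min_unique_ratio:  # repeated boilerplate
        return False
    return True
```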
Perplexity filtering trains a small reference model (e.g. KenLM on Wikipedia) and scores each document. Very high perplexity indicates gibberish or heavily non-natural-language content.
More sophisticated approaches use a classifier trained to distinguish Wikipedia-quality text from low-quality text, filtering on classifier confidence score.
PII removal targets email addresses, phone numbers, social security numbers, credit card numbers, IP addresses, and names in sensitive contexts. Regex patterns and NER models identify and either redact or remove affected documents.
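A minimal regex-only redaction sketch (the patterns here are deliberately simple placeholders; production systems use far larger rule sets plus NER models):

```python
import re

# Hypothetical minimal patterns — real pipelines cover many more formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text):
    """Replace each match with a typed placeholder instead of dropping the doc."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```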
Toxic content filtering uses classifiers trained on hate speech, graphic violence, CSAM, and other harmful categories. The threshold is carefully calibrated — aggressive filtering can remove legitimate historical, medical, or legal content.
This is best-effort at trillion-token scale. Some harmful content inevitably passes through, which is one reason alignment (stage 06) is critical.
The final corpus is a carefully weighted mixture. Each domain is assigned a sampling weight determining how often its tokens appear in training batches. High-quality sources are upsampled; raw web text is downsampled.
The Chinchilla paper showed most LLMs were undertrained relative to their size — more data with a smaller model often outperforms less data with a larger model, changing how these ratios are set.
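Sampling weights can be read as "effective epochs" over each source. A sketch using the GPT-3-style figures quoted in this section (these numbers are illustrative; `effective_epochs` is a hypothetical helper):

```python
def effective_epochs(dataset_tokens, weight, total_training_tokens):
    """Average number of passes over a source, given its sampling weight
    (the fraction of training tokens drawn from it)."""
    return weight * total_training_tokens / dataset_tokens

# Assumed GPT-3-style mix: 300B total training tokens,
# Common Crawl 410B tokens at 60% weight, Wikipedia ~3B tokens at 3%.
common_crawl = effective_epochs(410e9, 0.60, 300e9)  # < 1 epoch: downsampled
wikipedia = effective_epochs(3e9, 0.03, 300e9)       # ~3 epochs: upsampled
```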
BPE (Byte-Pair Encoding) starts with individual bytes/characters and iteratively merges the most frequent adjacent pair until the vocabulary reaches the target size (32k–100k tokens).
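The merge loop can be sketched in a few lines on a toy word-frequency corpus (real tokenizers start from bytes and train on terabytes; this is just the algorithm's shape):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE: split words into characters, then repeatedly merge the most
    frequent adjacent symbol pair, recording the merge rules learned."""
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

On a corpus dominated by "low"-prefixed words, the first merges build up the shared stem.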
SentencePiece treats raw text as a plain stream of Unicode characters — whitespace included — so it needs no language-specific pre-tokenization and is robust to whitespace variations. Used by LLaMA, T5, Gemma.
The tokenizer is trained once, then permanently frozen. Every document is tokenized with this fixed vocabulary before any model training begins.
The vocabulary file maps every token string to an integer ID and back. This file must ship with the model — without it you cannot convert text to input IDs or decode output IDs back to text.
The vocabulary encodes implicit decisions about the corpus: what languages are well-represented (common tokens have short, efficient encodings) and what domains are present (Python keywords like def may be a single token).
Special tokens are reserved IDs with semantic meaning for the model's operation. They are inserted at training time to structure inputs and outputs.
[BOS] — beginning of sequence, prepended to every input. [EOS] — end of sequence, the model learns to predict this when done. [PAD] — fills short sequences in a batch. [MASK] — used in masked LM (BERT-style).
Chat models add structured turn tokens so the model distinguishes user from assistant: <|user|>, <|assistant|>, <|system|>.
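A sketch of how such turn tokens are assembled into a prompt (this template and `render_chat` are hypothetical; every model family defines its own exact format):

```python
def render_chat(messages):
    """Wrap each message in its role token; end with an open assistant turn
    so the model's next-token predictions form the reply."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    parts.append("<|assistant|>\n")  # cue the model to respond
    return "\n".join(parts)
```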
The configuration file fully specifies the architecture before any weight is initialized. These choices are essentially permanent — changing them requires retraining from scratch.
d_model: width of the representation throughout the network. n_layers (L): how many attention+FFN blocks are stacked. n_heads (H): parallel attention heads; d_k = d_model / H. Context length: max tokens the model can see — attention is O(n²) so doubling context quadruples attention cost.
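These few numbers largely fix the parameter budget. A rough estimator, assuming a LLaMA-style decoder (SwiGLU FFN, no biases; `estimate_params` and its defaults are this sketch's own invention):

```python
def estimate_params(d_model, n_layers, vocab_size, d_ff=None,
                    tie_embeddings=False):
    """Rough parameter count for a LLaMA-style decoder. Ignores the tiny
    norm parameters; assumes SwiGLU FFN and untied embeddings by default."""
    if d_ff is None:
        d_ff = 4 * d_model
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O per layer
    ffn = 3 * d_model * d_ff       # SwiGLU: W1, W2, W3 per layer
    embed = vocab_size * d_model * (1 if tie_embeddings else 2)
    return n_layers * (attn + ffn) + embed

# LLaMA-7B-like config: lands near 6.7B parameters
n = estimate_params(d_model=4096, n_layers=32, vocab_size=32000, d_ff=11008)
```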
Multi-Head Attention (MHA): the original design. Every head has its own Q, K, V projections. H heads × 3 matrices × (d_model × d_k) parameters per layer.
Multi-Query Attention (MQA): all Q heads are separate but K and V are shared across heads. Reduces KV cache memory H× at modest quality cost. Used in PaLM, Falcon.
Grouped Query Attention (GQA): G groups share K/V, each group has its own Q heads. Best quality-vs-inference-cost tradeoff. Used by LLaMA-2 70B, Mistral, Gemma.
Absolute sinusoidal (original Transformer): fixed sine/cosine encoding added to embeddings. Hard limit at training context length.
RoPE (Rotary Position Embedding): applies a rotation to Q and K vectors based on their positions. The dot product Q·K naturally encodes relative distance. Can be extended beyond training length via RoPE scaling. Used by LLaMA, GPT-NeoX, Mistral.
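The key property — Q·K depends only on relative position — can be checked directly with a pure-Python sketch (vector size and frequencies here are illustrative):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive (even, odd) pair of vec by a
    position-dependent angle; lower dims rotate faster."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Rotating q at position 7 and k at position 3 gives the same dot product as positions 4 and 0 — only the offset of 4 matters.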
ALiBi (Attention with Linear Biases): adds a linear penalty to attention scores based on distance between positions. No learnable parameters; extrapolates well to longer sequences. Used by MPT, BLOOM.
Post-norm (original Transformer): LayerNorm after residual — LayerNorm(x + sublayer(x)). Requires careful LR warmup; unstable with very deep networks.
Pre-norm (GPT-2 onward): LayerNorm before the sublayer — x + sublayer(LayerNorm(x)). More stable, allows removing warmup. Standard in modern LLMs.
RMSNorm: simplification that only normalizes by root mean square, dropping mean subtraction. ~10% faster, empirically equivalent quality. Used by LLaMA, Mistral, Gemma.
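The simplification is small enough to state in full (a per-vector sketch; real implementations vectorize this over the batch):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide by the root-mean-square of the vector, then scale by gamma.
    No mean subtraction and no beta shift, unlike full LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]
```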
The sole objective is next-token prediction: given a sequence, predict the next token. This is called causal language modeling (CLM) because each position can only attend to past positions.
The loss is the average cross-entropy across all positions. Each token prediction is a V-way classification problem (one class per vocabulary token). This deceptively simple objective forces the model to learn grammar, facts, reasoning, code, and world knowledge — because all of these are necessary to predict text well.
AdamW (Adam with decoupled weight decay) is universal for LLM training. It maintains per-parameter running estimates of gradient mean (m) and variance (v), giving each parameter an adaptive learning rate.
The learning rate schedule has two phases: linear warmup from 0 (first ~2000 steps), then cosine decay to ~10% of the peak LR over the remainder of training. Gradient clipping (max norm = 1.0) prevents catastrophic gradient explosions.
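The warmup-plus-cosine shape is easy to write down (the peak LR, warmup length, and floor here are illustrative values, not any specific model's recipe):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
          min_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_frac * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0
    return peak_lr * (min_frac + (1 - min_frac) * cosine)
```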
Data Parallelism: the same model is replicated on many GPUs, each processing different batches. Gradients are averaged via AllReduce. FSDP shards parameters across GPUs to address memory limits.
Tensor Parallelism: individual weight matrices are split across GPUs — the attention head dimension is a natural split point. Requires frequent inter-GPU communication within each forward pass.
Pipeline Parallelism: different layers run on different GPUs with micro-batches flowing through. Reduces communication but introduces pipeline "bubble" idle time.
Checkpoints save the complete training state — model weights, optimizer states (m and v for every parameter), learning rate schedule position, and random seeds — at regular intervals (every 1000–2000 steps).
Optimizer states can be as large as the weights themselves. A 70B model at FP32 has ~280GB weights plus ~560GB optimizer state. Checkpoints are sharded across many files.
Teams maintain "golden checkpoints" from stable points to restart from if training crashes or a loss spike occurs.
The pre-trained base model predicts text — it will complete any input, including harmful ones. SFT teaches it to be a helpful assistant by training on curated (instruction, response) pairs.
Human annotators write high-quality responses to diverse instructions. The model is fine-tuned with the same cross-entropy objective as pre-training, but loss is computed only on the response tokens — the instruction tokens are masked out.
SFT data is tiny compared to pre-training (10k–1M examples vs trillions of tokens) but extremely high impact — a few days of SFT transforms the base model.
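The response-only loss masking above amounts to zeroing out instruction positions before averaging (a sketch; `masked_nll` is a hypothetical name, and real code works on logit tensors rather than precomputed log-probs):

```python
def masked_nll(token_log_probs, loss_mask):
    """Average negative log-likelihood over response tokens only.
    token_log_probs[i]: model's log-prob of the correct token at position i.
    loss_mask[i]: 1 for response tokens, 0 for instruction tokens."""
    total = sum(-lp * m for lp, m in zip(token_log_probs, loss_mask))
    return total / sum(loss_mask)

# two instruction tokens (masked out) followed by three response tokens
log_probs = [-5.0, -4.0, -0.5, -0.2, -0.1]
mask      = [0,     0,    1,    1,    1]
```

The poorly predicted instruction tokens contribute nothing; only the response drives the gradient.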
A separate model — the Reward Model (RM) — is trained to predict which of two responses a human would prefer. Humans rank response pairs, producing (prompt, chosen, rejected) triples.
The RM is initialized from the SFT model with the final layer replaced by a scalar output (the reward). It's trained with a preference loss that pushes reward(chosen) above reward(rejected).
The RM is a proxy for human judgment — scoring any response in milliseconds. Its quality is the bottleneck for RLHF; a flawed RM leads to reward hacking where the model finds degenerate high-reward responses.
RLHF with PPO: the SFT model is the policy π. It generates responses; the RM scores them; PPO updates the policy to maximize reward. A KL-divergence penalty prevents the policy from drifting too far from the SFT model, stopping degenerate reward hacking.
DPO (Direct Preference Optimization): skips the separate RM. It reformulates the RLHF objective as supervised classification directly on (chosen, rejected) pairs — mathematically equivalent to implicit reward maximization but simpler and more stable. Used by LLaMA-3, Zephyr, many open models.
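The per-pair DPO loss is compact enough to sketch directly (scalar inputs for clarity; in practice these are summed token log-probs from the policy and the frozen SFT reference):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy-vs-reference margin on chosen
    minus the same margin on rejected))."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-logits)))
```

When the policy matches the reference the loss is log 2; pushing the chosen response's likelihood up relative to the reference lowers it.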
Anthropic's approach uses a constitution — a set of principles ("be helpful", "avoid harm", "be honest") — to generate preference data synthetically rather than requiring human labelers for every example.
RLAIF (RL from AI Feedback): the model critiques and revises its own outputs according to constitutional principles, producing (original, revised) pairs for supervised training; an AI labeler then judges response pairs against the constitution, generating the preference data used to train the preference model.
This scales the alignment process — rather than needing humans to rank every pair, the AI generates thousands of self-critique pairs guided by the constitution. Particularly effective for safety-relevant behaviors.
The embedding matrix maps every token ID to a dense vector. Shape: [vocab_size × d_model]. For LLaMA-7B: [32,000 × 4,096] = 131M parameters.
With weight tying, a single matrix does double duty: as the input lookup table and, transposed, as the pre-softmax output projection. This halves embedding parameters and can improve training stability, though not every model ties — GPT-2 does, while LLaMA keeps separate input and output matrices.
After training, these vectors encode rich semantic relationships — similar tokens cluster nearby in the 4096-dimensional space.
For each of the L transformer layers and each of the H attention heads, there are separate W_Q, W_K, W_V projection matrices that project the token representation into the query, key, and value spaces.
Each W_Q, W_K, W_V has shape [d_model × d_k] where d_k = d_model/H. W_O has shape [H·d_v × d_model] projecting concatenated head outputs back to d_model.
In practice these are stored as a single fused matrix W_QKV of shape [d_model × 3·d_model] enabling a single GEMM operation — a critical inference optimization.
These projections account for a large share of inference FLOPs at short context lengths — alongside the FFN, whose 4× inner dimension makes it the single biggest consumer; the O(n²) attention-score computation itself only dominates at long contexts.
Each transformer layer contains a two-layer MLP applied position-wise. Weights are W1: [d_model × d_ff] and W2: [d_ff × d_model] with the inner dimension d_ff = 4 × d_model.
Think of attention as the "communication" step (tokens exchange information) and the FFN as the "computation" step (each token reasons independently). Research suggests FFN layers act as key-value memory stores that recall factual associations.
Modern models replace ReLU with SwiGLU (three matrices W1, W2, W3 with Swish-gated activation) — empirically better with ~same parameter count.
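A pure-Python sketch of the SwiGLU forward pass on one token vector (matrices as lists of rows; `swiglu_ffn` and the toy dimensions are this sketch's own, not any library's API):

```python
import math

def swiglu_ffn(x, W1, W2, W3):
    """SwiGLU FFN: W2 @ (swish(W1 @ x) * (W3 @ x)).
    W1/W3 project up to d_ff, W2 projects back down to d_model."""
    def matvec(W, v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]
    def swish(v):
        return [u / (1 + math.exp(-u)) for u in v]
    gate = swish(matvec(W1, x))          # gating branch
    up = matvec(W3, x)                   # linear branch
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)
```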
Each LayerNorm block contains two learned vectors, both of shape [d_model]: γ (gamma) — scale, initialized to 1 — and β (beta) — shift, initialized to 0. RMSNorm keeps only γ.
These are tiny relative to attention and FFN weights — less than 0.5% of total parameters. Despite their small size they are critical: they allow the model to rescale and shift normalized activations to whatever range is most useful for the next layer.
Unlike weight matrices, attention patterns are not stored in the model file — they are intermediate activations recomputed fresh every forward pass. For each layer and head, the attention pattern is a [seq × seq] matrix where entry [i,j] is the weight token i gives to token j.
The KV cache IS stored during inference: once K and V are computed for a token, they are cached and reused in all future steps. Without KV caching, generating token t requires recomputing all t-1 previous tokens — O(t²) total. With caching, it is O(t).
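The asymptotic difference can be made concrete by counting how many token positions must be encoded during generation (a toy cost model; `generation_cost` is a hypothetical helper):

```python
def generation_cost(num_new_tokens, prompt_len=0, use_cache=True):
    """Count token positions encoded while generating num_new_tokens,
    starting from a prompt of prompt_len tokens."""
    cost = 0
    seq_len = prompt_len
    for _ in range(num_new_tokens):
        if use_cache:
            cost += 1            # only the newest token is encoded
        else:
            cost += seq_len + 1  # the entire prefix is re-encoded
        seq_len += 1
    return cost
```

Generating 100 tokens from an empty prompt costs 100 encodings with the cache versus 5050 (1+2+…+100) without.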
Training uses BF16 or FP32 weights (16–32 bits per parameter). Quantization reduces precision to INT8 or INT4 at inference time, shrinking model size 2–4× and speeding up inference 1.5–3×.
GPTQ and AWQ are the dominant post-training quantization methods for LLMs. Both use a small calibration dataset: GPTQ compensates for weight-rounding error using approximate second-order (Hessian) information, while AWQ rescales the small fraction of weights that see the largest activations.
INT8 loses <1% quality. INT4 loses 1–5% but enables running 70B models on consumer hardware. Sub-4-bit (1–2 bit) is an active research area.
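The basic round-trip is simple to sketch with symmetric per-tensor INT8 quantization (note this naive scheme is far cruder than GPTQ/AWQ, which quantize per-group with calibration data):

```python
def quantize_int8(weights):
    """Symmetric quantization: map the max magnitude to 127,
    round everything else onto the integer grid."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP weights; error is bounded by scale / 2."""
    return [qi * scale for qi in q]
```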
The KV cache grows linearly with context length during generation. At long contexts or high concurrency it becomes the dominant memory consumer — exceeding the model weights themselves.
PagedAttention (vLLM) manages KV cache like OS virtual memory: cache blocks are allocated on demand and shared across requests with the same prefix. This enables near-100% GPU memory utilization vs naive allocation's ~30%.
For a 7B model serving 100 concurrent users at 4096 context length: 2.1GB × 100 = 210GB of KV cache alone — requiring careful management across multiple GPUs.
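The arithmetic behind that figure, as a small calculator (assuming FP16 values and a 7B-class MHA config; the GQA correction factor is included for models with fewer KV heads):

```python
def kv_cache_bytes(n_layers, context_len, d_model,
                   n_kv_heads=None, n_heads=None, bytes_per_value=2):
    """KV cache for one sequence: a K and a V vector per layer per position.
    MHA caches the full d_model; GQA shrinks it by n_kv_heads / n_heads."""
    kv_dim = d_model
    if n_kv_heads is not None and n_heads is not None:
        kv_dim = d_model * n_kv_heads // n_heads
    return 2 * n_layers * context_len * kv_dim * bytes_per_value

# 7B-class MHA model (32 layers, d_model 4096) at FP16, 4096 context:
per_user = kv_cache_bytes(n_layers=32, context_len=4096, d_model=4096)
# ~2.1 GB per user, so 100 concurrent users need ~210 GB of KV cache
```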
Naive batching waits for a full batch, processes together, returns all results. This wastes GPU time — the batch idles waiting for the longest sequence to finish while others have already completed.
Continuous (iteration-level) batching: at each generation step the batch is dynamically updated — finished sequences are removed and new requests are inserted immediately. The GPU stays maximally utilized throughout.
Combined with quantization and KV cache optimization, continuous batching enables a single A100 80GB server to serve hundreds of concurrent users with low latency. Implemented in vLLM, TGI (HuggingFace), TensorRT-LLM.