NEW SECTION · LLM PAPER FIELD GUIDE
Read the papers without getting lost in the papers.
A practical reading path through Transformer foundations, scaling laws, instruction tuning, reasoning, agents, MoE, and the bonus roots that explain why modern LLMs look the way they do.
How to use this
Each paper gets one page: what it says, why it matters, what to implement, and where it fits in the stack. The reading list was put together by @TheAhmadOsman; this guide is my attempt to make it easier to actually work through.
26 core papers
5 bonus papers
6 big ideas: attention, scaling, tuning, reasoning, agents, MoE
The core reading order
01
Attention Is All You Need
The original Transformer paper. It replaced recurrence with self-attention and made the modern LLM stack possible.
Transformer core
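What to implement: the self-attention step itself. Below is a minimal pure-Python sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, with no masking, no learned projections, and no multi-head split. The toy Q, K, V values are made up for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim_v)])
    return outputs

# Toy input: 3 tokens, dimension 2 (illustrative numbers only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, which is a handy sanity check when you port this to a tensor library.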
02
The Illustrated Transformer
The best visual walkthrough of attention and tensor flow before you dive into code.
Intuition builder
03
BERT: Pre-training of Deep Bidirectional Transformers
The encoder-side foundation: masked language modeling, bidirectional representation learning, and pretrain-then-finetune NLP.
Encoder fundamentals
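What to implement: the masked language modeling input corruption. The sketch below follows the scheme described in the BERT paper (select about 15% of tokens; replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged). It is a simplification: it flips an independent coin per token and works on plain strings rather than WordPiece ids, and `mask_tokens` is a name I made up.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")     # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)          # 10%: keep the original token
        else:
            labels.append(None)             # not selected, so no loss at this position
            inputs.append(tok)
    return inputs, labels

sentence = "the cat sat on the mat and looked at the dog".split() * 40
vocab = sorted(set(sentence))
masked_inputs, mlm_labels = mask_tokens(sentence, vocab)
```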
04
Language Models are Few-Shot Learners
The GPT-3 paper that made in-context learning impossible to ignore.
In-context learning
05
Scaling Laws for Neural Language Models
An early, clean empirical framework showing that loss falls as a power law in parameters, data, and compute.
Scaling laws
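What to implement: a power-law fit. A power law L = a * N^(-alpha) is a straight line in log-log space, so an ordinary least-squares line fit recovers the exponent. The sketch below fits synthetic data; the coefficient 5.0 and exponent 0.08 are invented for the demo and are not values taken from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x); returns (a, alpha)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic "loss vs. parameter count" points (made-up constants).
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in param_counts]
coefficient, alpha = fit_power_law(param_counts, losses)
```

On real training runs the points will be noisy, and the fit only holds over the range you actually measured.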
06
Training Compute-Optimal Large Language Models
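The Chinchilla paper: for a fixed compute budget, scale parameters and training tokens in roughly equal proportion, which in practice means smaller models trained on far more tokens than earlier practice.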
Compute-optimal training
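What to implement: a back-of-the-envelope budget calculator. The sketch below rests on two assumptions that are common rules of thumb rather than exact results: training compute C is about 6 * N * D FLOPs (N parameters, D tokens), and the compute-optimal ratio is about 20 tokens per parameter. The paper's own fits vary by method, so treat the output as a rough estimate.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget of roughly 5.88e23 FLOPs (6 * 70e9 params * 1.4e12 tokens).
n_params, n_tokens = compute_optimal(5.88e23)
```

With that budget the rule of thumb lands at about 70B parameters and 1.4T tokens, which matches the Chinchilla configuration reported in the paper.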
07
LLaMA: Open and Efficient Foundation Language Models
The paper that kicked open the open-weight era and standardized many practical architecture defaults.
Open-weight era
08
RoFormer: Rotary Position Embedding
The positional encoding method that became the modern default for long-context LLMs.
Position encoding
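What to implement: the rotation itself, plus a check of the property that makes it useful. The sketch below rotates adjacent pairs of dimensions by position-dependent angles with base 10000. Pairing adjacent dimensions is one layout choice (some codebases pair the first half with the second half instead), and `rope` is a name chosen for this example.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to a vector with an even number of dims."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 rotates with frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Same relative offset (2) at different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

The two scores agree up to floating-point error: after rotation, the query-key dot product depends only on the relative offset between positions.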
09
FlashAttention
Memory-efficient exact attention that made longer contexts and higher throughput more practical.
GPU efficiency
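What to implement: the online-softmax trick that lets attention be computed block by block without materializing the full score row. The sketch below is pure Python for one query vector and only illustrates the running max / running sum bookkeeping. It is not the GPU kernel: there is no tiling into on-chip memory, no backward pass, and scores are assumed to be already scaled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_reference(q, keys, values):
    """Standard softmax attention for one query, using all scores at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_row_streaming(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[j] for w, v in zip(weights, v_block))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (illustrative numbers only).
q = [0.3, -1.2]
keys = [[1.0, 0.5], [-0.7, 2.0], [0.1, 0.1], [3.0, -1.0], [0.0, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
reference = attention_row_reference(q, keys, values)
streamed = attention_row_streaming(q, keys, values, block_size=2)
```

The streamed result matches the reference up to rounding, which is the sense in which the attention stays exact rather than approximate.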
10
Retrieval-Augmented Generation
The foundational RAG paper: combine parametric model knowledge with external retrieved documents.
Grounding
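What to implement: the retrieve-then-generate plumbing. The sketch below is a deliberately toy stand-in: it ranks documents by word overlap and pastes the top hits into a prompt. The paper itself uses a learned dense retriever and a sequence-to-sequence generator, so the scoring function, the prompt template, and the names `retrieve` and `build_prompt` are all assumptions made for illustration.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (toy tokenizer)."""
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and keep the top k."""
    query_tokens = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_tokens & tokenize(doc)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

DOCS = [
    "RoPE encodes position by rotating query and key vectors.",
    "Chinchilla studies compute-optimal training.",
    "Switch Transformers route each token to one expert.",
]
prompt = build_prompt("How does RoPE encode position?", DOCS, k=1)
```

Swapping the overlap score for embedding similarity is the natural next step once the plumbing works.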
11
Training Language Models to Follow Instructions with Human Feedback
The InstructGPT paper and the modern post-training blueprint: supervised fine-tuning on demonstrations, a reward model trained on human preference rankings, then PPO-based RLHF.
Instruction tuning
12
Direct Preference Optimization
A simpler, more stable alternative to PPO-style RLHF that trains the policy directly on preference pairs with a classification-style loss, with no separate reward model.
Preference alignment
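What to implement: the loss for one preference pair. The sketch below computes -log sigmoid(beta * ((policy - reference log-prob of the chosen response) - (policy - reference log-prob of the rejected response))). The inputs are assumed to be summed sequence log-probabilities you have already computed, and beta = 0.1 is just an illustrative default.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair of responses."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)

# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```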
13
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Showed that reasoning behavior can be elicited through examples and intermediate reasoning steps.
Reasoning prompts
14
ReAct: Synergizing Reasoning and Acting
The foundation of practical agent loops: reasoning traces combined with tool use and environment interaction.
Agents
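What to implement: the thought, action, observation loop. The sketch below wires the loop together with a scripted stand-in for the model so it runs offline. Everything here is hypothetical scaffolding: `scripted_model`, the `calculator` tool, and the `Action: name[argument]` text format are illustrative choices, not an API from the paper.

```python
def calculator(expression):
    """Toy tool: handles 'a + b' integer additions only."""
    left, right = expression.split("+")
    return str(int(left) + int(right))

TOOLS = {"calculator": calculator}

def scripted_model(transcript):
    """Stand-in for an LLM call; returns a canned step based on the transcript."""
    if "Observation:" not in transcript:
        return "Thought: I should add the numbers.\nAction: calculator[2 + 3]"
    return "Thought: I have the result.\nAction: finish[5]"

def react_loop(question, model, tools, max_steps=5):
    """Alternate model steps and tool calls until the model emits finish[...]."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        action_line = step.splitlines()[-1]
        name, _, argument = action_line[len("Action: "):].partition("[")
        argument = argument.rstrip("]")
        if name == "finish":
            return argument, transcript
        # Run the tool and feed its result back as an observation.
        observation = tools[name](argument)
        transcript += f"\nObservation: {observation}"
    return None, transcript

answer, trace = react_loop("What is 2 + 3?", scripted_model, TOOLS)
```

Replacing `scripted_model` with a real model call is the only change needed to experiment; the `max_steps` cap keeps a confused model from looping forever.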
15
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
The R1 paper: large-scale reinforcement learning can induce self-verification and structured reasoning behavior.
RL reasoning
16
Qwen3 Technical Report
A modern architecture overview with hybrid thinking/non-thinking behavior and MoE variants.
Modern architecture
17
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
The modern MoE ignition point: conditional computation at scale.
MoE origin
18
Switch Transformers
Simplified MoE routing using single-expert activation, helping stabilize giant sparse models.
MoE routing
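What to implement: top-1 routing. In the sketch below each token goes to the single expert with the highest router probability, and that expert's output is scaled by the probability. For brevity the router logits are passed in directly rather than computed from a learned router matrix, the experts are toy scalar functions, and the capacity limits and load-balancing loss from the paper are left out.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(tokens, router_logits, experts):
    """Route each token to its top-1 expert and scale the output by the gate."""
    outputs, assignments = [], []
    for x, logits in zip(tokens, router_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=lambda i: probs[i])
        outputs.append(probs[best] * experts[best](x))
        assignments.append(best)
    return outputs, assignments

# Toy scalar experts and hand-written router logits (illustrative only).
experts = [lambda x: 2.0 * x, lambda x: x + 10.0]
tokens = [1.0, 3.0]
router_logits = [[2.0, 0.0], [0.0, 2.0]]
outputs, assignments = switch_layer(tokens, router_logits, experts)
```

Only one expert runs per token, so compute per token stays flat as you add experts; the gate scaling is what lets gradients reach the router.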
19
Mixtral of Experts
Open-weight MoE showing that a sparse model can match much larger dense models while activating only a fraction of its parameters per token (2 of 8 experts per layer).
Open MoE
20
Sparse Upcycling
A practical technique for converting dense checkpoints into MoE models and reusing compute.
MoE from dense
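What to implement: the initialization step. The sketch below turns one dense feed-forward block into an MoE block by copying its weights into every expert and giving the router fresh random weights. Weights are plain nested lists, and the dict layout, the 0.02 init scale, and the name `upcycle_ffn` are assumptions for illustration.

```python
import copy
import random

def upcycle_ffn(dense_ffn, num_experts, d_model, seed=0):
    """Build an MoE block whose experts all start as copies of the dense FFN."""
    rng = random.Random(seed)
    # Independent copies, so experts can diverge during further training.
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    # The router has no dense counterpart, so it starts from small random values.
    router = [[rng.gauss(0.0, 0.02) for _ in range(num_experts)]
              for _ in range(d_model)]
    return {"experts": experts, "router": router}

# Tiny dense FFN: d_model = 2, hidden size = 3 (made-up weights).
dense_ffn = {
    "w_in": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "w_out": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
}
moe_block = upcycle_ffn(dense_ffn, num_experts=4, d_model=2)
```

Because every expert starts out identical to the dense block, the upcycled layer begins close to the dense model's behavior instead of from scratch.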
21
The Platonic Representation Hypothesis
Evidence that scaled models across modalities may converge toward shared internal representations.
Representations
22
Textbooks Are All You Need
High-quality synthetic “textbook” data can let small models outperform much larger models on targeted domains.
Synthetic data
23
Scaling Monosemanticity
Anthropic’s major interpretability report decomposing Claude 3 Sonnet activations into millions of interpretable features.
Interpretability
24
PaLM: Scaling Language Modeling with Pathways
A masterclass in large-scale language-model training across thousands of accelerators.
Large-scale training
25
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Validated MoE scaling economics with huge total parameters but smaller active parameter counts.
MoE economics
26
The Smol Training Playbook
A practical handbook for efficiently training smaller language models.
Training practice
Bonus material
These are not optional if you want the deep roots: text-to-text training, tool use, distributed sparse models, and the original mixture-of-experts idea.
27
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Reframed many NLP tasks as text-to-text generation under one unified training format.
Bonus · text-to-text
28
Toolformer
A model learns to call tools through self-supervised API-use examples.
Bonus · tool use
29
GShard
Scaled giant multilingual models with conditional computation and automatic sharding.
Bonus · distributed MoE
30
Adaptive Mixtures of Local Experts
The early neural-network root of the mixture-of-experts idea.
Bonus · MoE roots
31
Hierarchical Mixtures of Experts and the EM Algorithm
Extended mixture-of-experts into hierarchical gating structures.
Bonus · MoE roots