NEW SECTION · LLM PAPER FIELD GUIDE

Read the papers without getting lost in the papers.

A practical reading path through Transformer foundations, scaling laws, instruction tuning, reasoning, agents, MoE, and the bonus roots that explain why modern LLMs look the way they do.

How to use this

Each paper gets one page: what it says, why it matters, what to implement, and where it fits in the stack. The reading list was put together by @TheAhmadOsman; this guide is my attempt to make it easier to actually work through.

26 core papers

5 bonus papers

6 big ideas: attention, scaling, tuning, reasoning, agents, MoE

The core reading order

Attention Is All You Need

The original Transformer paper. It replaced recurrence with self-attention and made the modern LLM stack possible.

Transformer core
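
What to implement first: scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The sketch below is a minimal single-head version in plain Python, with no masking, batching, or learned projections. The function names are mine, not the paper's.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

With one-hot values as above, the output row is just the attention weights, which makes the behaviour easy to inspect by hand.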

The Illustrated Transformer

The best visual walkthrough of attention and tensor flow before you dive into code.

Intuition builder

BERT: Pre-training of Deep Bidirectional Transformers

The encoder-side foundation: masked language modeling, bidirectional representation learning, and pretrain-then-finetune NLP.

Encoder fundamentals
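
What to implement: the masked-language-modeling corruption step. As I read the paper, about 15% of positions are selected as prediction targets; of those, 80% become a mask token, 10% a random token, and 10% stay unchanged. The sketch below works on word strings instead of subword IDs, and the helper name and the `None` label convention are my own assumptions.

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", select_prob=0.15, rng=None):
    """Return (corrupted_inputs, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            inputs.append(tok)
            labels.append(None)  # not a prediction target
            continue
        labels.append(tok)       # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(mask_token)          # 80%: replace with the mask token
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            inputs.append(tok)                 # 10%: keep the original token
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
inputs, labels = mask_for_mlm(tokens, vocab=sorted(set(tokens)), rng=random.Random(0))
```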

Language Models are Few-Shot Learners

The GPT-3 paper that made in-context learning impossible to ignore.

In-context learning

Scaling Laws for Neural Language Models

An early, clean empirical scaling framework: loss falls as a power law in parameters, data, and compute.

Scaling laws

Training Compute-Optimal Large Language Models

The Chinchilla paper: for a fixed compute budget, train smaller models on more tokens.

Compute-optimal training
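
A quick back-of-the-envelope version of the result. This rests on two assumptions that are not stated in this guide: training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D tokens), and the often-quoted rule of thumb of roughly 20 training tokens per parameter. Both are approximations, so treat the output as a rough sizing aid only.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) assuming C = 6*N*D and D = r*N."""
    # Substituting D = r*N into C = 6*N*D gives N = sqrt(C / (6*r)).
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of 5.88e23 FLOPs corresponds to about 70B parameters and 1.4T tokens.
n_params, n_tokens = compute_optimal(5.88e23)
```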

LLaMA: Open and Efficient Foundation Language Models

The paper that kicked open the open-weight era and standardized many practical architecture defaults.

Open-weight era

RoFormer: Rotary Position Embedding

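The positional encoding method that became the modern default for long-context LLMs: it rotates query and key vectors by position-dependent angles, so attention scores depend on relative offsets.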

Position encoding
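
What to implement: the rotation itself. The sketch below rotates adjacent pairs of dimensions by an angle proportional to position, using frequencies base^(-2i/d). Note that some popular implementations pair dimension i with i + d/2 instead of adjacent dimensions; the pairing convention here is an assumption for illustration.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply a rotary position embedding to one query or key vector."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Pair (i, i+1) rotates at frequency base^(-i/d), i.e. base^(-2k/d) for k = i/2.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -1.0, 0.7, 2.0], [1.5, 0.2, -0.4, 0.9]
near = dot(rope(q, 5), rope(k, 3))     # positions 5 and 3, offset 2
far = dot(rope(q, 12), rope(k, 10))    # positions 12 and 10, offset 2
```

The two scores agree up to floating-point error because only the offset between positions survives the dot product.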

FlashAttention

Memory-efficient exact attention that made longer contexts and higher throughput more practical.

GPU efficiency
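
The "exact" part depends on a running-softmax trick: process keys and values block by block, keep a running maximum, normalizer, and weighted sum, and rescale whenever the maximum changes. The sketch below shows only that numerical idea for a single query in plain Python. It says nothing about the GPU kernel, memory hierarchy, or backward pass, and the function name is my own.

```python
import math

def streaming_attention(q, keys, values, block_size=2):
    """Attention output for one query, computed one key/value block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0], [0.2, 0.2], [3.0, -1.0]]
values = [[1.0, 2.0], [0.0, 1.0], [4.0, -1.0], [2.0, 2.0], [-3.0, 0.5]]
out = streaming_attention(q, keys, values, block_size=2)
```

The full score row is never materialized, yet the result matches ordinary softmax attention up to floating-point error.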

Retrieval-Augmented Generation

The foundational RAG paper: combine parametric model knowledge with external retrieved documents.

Grounding

Training Language Models to Follow Instructions with Human Feedback

The modern post-training blueprint (InstructGPT): supervised fine-tuning on demonstrations, a reward model trained on human preference rankings, then PPO-based RLHF against that reward model.

Instruction tuning

Direct Preference Optimization

A simpler, more stable alternative to PPO-style RLHF: it trains the policy directly on preference pairs with a classification-style loss, with no separate reward model or RL loop.

Preference alignment
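
What to implement: the per-pair loss, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]). The sketch below assumes you already have summed sequence log-probabilities for the chosen and rejected responses under the policy and a frozen reference model. The argument names and the default β = 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)    # policy favours the chosen response
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)     # policy favours the rejected response
```

When the policy matches the reference the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response.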

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Showed that reasoning behavior can be elicited through examples and intermediate reasoning steps.

Reasoning prompts

ReAct: Synergizing Reasoning and Acting

The foundation of practical agent loops: reasoning traces combined with tool use and environment interaction.

Agents
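
What to implement: the thought, action, observation loop. In the sketch below the model is replaced by a hand-written `scripted_policy` so it runs without an LLM. That function, the `calc` tool, and the transcript format are all hypothetical stand-ins, not anything defined by the paper.

```python
def react_loop(policy, tools, question, max_steps=5):
    """Alternate reasoning and tool calls until the policy chooses 'finish'."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(("thought", thought))
        if action == "finish":
            return arg, transcript
        observation = tools[action](arg)           # act in the environment
        transcript.append(("action", f"{action}[{arg}]"))
        transcript.append(("observation", observation))
    return None, transcript                        # step budget exhausted

def scripted_policy(transcript):
    """Hypothetical stand-in for an LLM call: one tool use, then an answer."""
    if not any(kind == "observation" for kind, _ in transcript):
        return ("I should compute this.", "calc", "6*7")
    return ("I have the result.", "finish", transcript[-1][1])

def calc(expr):
    """Hypothetical tool: multiplies two integers written as 'a*b'."""
    left, right = expr.split("*")
    return str(int(left) * int(right))

answer, transcript = react_loop(scripted_policy, {"calc": calc}, "What is 6 times 7?")
```

In a real agent, `policy` would prompt a model with the transcript and parse its thought and action from the completion.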

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

The R1 paper: large-scale reinforcement learning can induce self-verification and structured reasoning behavior.

RL reasoning

Qwen3 Technical Report

A modern architecture overview with hybrid thinking/non-thinking behavior and MoE variants.

Modern architecture

Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts

The starting point for modern MoE: a sparsely gated layer that enables conditional computation at scale.

MoE origin

Switch Transformers

Simplified MoE routing using single-expert activation, helping stabilize giant sparse models.

MoE routing
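
What to implement: top-1 routing. A router scores each expert, the token goes to the single highest-probability expert, and the expert output is scaled by that probability. The sketch below handles one token in plain Python and leaves out expert capacity limits and the load-balancing auxiliary loss. All names are illustrative.

```python
import math

def switch_route(token, router_weights, experts):
    """Route one token to its top-1 expert; return (expert_index, gated_output)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)   # top-1 expert
    out = experts[idx](token)
    # Scaling by the router probability keeps the router differentiable in training.
    return idx, [probs[idx] * o for o in out]

experts = [
    lambda x: [2.0 * v for v in x],   # toy expert 0: doubles the input
    lambda x: [-v for v in x],        # toy expert 1: negates the input
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]   # one row of router weights per expert
idx, gated = switch_route([2.0, 0.0], router_weights, experts)
```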

Mixtral of Experts

Open-weight MoE that showed sparse models can reach the quality of much larger dense models while activating only a fraction of their parameters per token.

Open MoE
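
A rough way to see the total-versus-active gap is to count parameters. The configuration values below (hidden size 4096, expert FFN size 14336, 32 layers, 8 experts with 2 active per token, vocabulary 32000, key/value width 1024) are my recollection of the public Mixtral 8x7B release, so treat them as assumptions. The count ignores normalization weights.

```python
def moe_param_counts(hidden, ffn, layers, experts, top_k, vocab, kv_dim):
    """Approximate (total, active-per-token) parameter counts for a sparse MoE decoder."""
    expert = 3 * hidden * ffn                               # gated MLP: gate, up, down
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q and o, plus k and v
    router = hidden * experts
    embeddings = 2 * vocab * hidden                         # input embedding + output head
    shared = layers * (attention + router) + embeddings
    total = shared + layers * experts * expert
    active = shared + layers * top_k * expert               # only top_k experts run per token
    return total, active

# Assumed Mixtral 8x7B-like configuration (see the caveat above).
total, active = moe_param_counts(
    hidden=4096, ffn=14336, layers=32, experts=8, top_k=2, vocab=32000, kv_dim=1024
)
```

Under these assumptions the count comes out near 47B total and 13B active parameters, so each token touches a bit more than a quarter of the weights.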

Sparse Upcycling

A practical technique for converting dense checkpoints into MoE models and reusing compute.

MoE from dense

The Platonic Representation Hypothesis

Evidence that scaled models across modalities may converge toward shared internal representations.

Representations

Textbooks Are All You Need

High-quality synthetic “textbook” data can let small models outperform much larger models on targeted domains.

Synthetic data

Scaling Monosemanticity

Anthropic’s major interpretability report decomposing Claude 3 Sonnet activations into millions of interpretable features.

Interpretability

PaLM: Scaling Language Modeling with Pathways

A masterclass in large-scale language-model training across thousands of accelerators.

Large-scale training

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Validated MoE scaling economics with huge total parameters but smaller active parameter counts.

MoE economics

The Smol Training Playbook

A practical handbook for efficiently training smaller language models.

Training practice

Bonus material

These are not optional if you want the deep roots: text-to-text training, tool use, distributed sparse models, and the original mixture-of-experts idea.

T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Reframed many NLP tasks as text-to-text generation under one unified training format.

Bonus · text-to-text

Toolformer

A model learns to call tools through self-supervised API-use examples.

Bonus · tool use

GShard

Scaled giant multilingual models with conditional computation and automatic sharding.

Bonus · distributed MoE

Adaptive Mixtures of Local Experts

The early neural-network root of the mixture-of-experts idea.

Bonus · MoE roots

Hierarchical Mixtures of Experts

Extended mixture-of-experts into hierarchical gating structures.

Bonus · MoE roots
