PAPER 15 · RL reasoning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo et al. 2025 Paper

The R1 paper: large-scale reinforcement learning can induce self-verification and structured reasoning behavior.

Core concept

DeepSeek-R1 showed that reinforcement learning can train models to spend more effort on reasoning and self-checking.

Why it mattered

It showed that open reasoning models, and distilling their reasoning into smaller models, are practical.

Visual shortcut · Train the habit of thinking

R1 treats reasoning as a trainable behavior, especially when the task has answers that can be checked.

How it works

Start with tasks where correctness is measurable.

Reward correct outcomes rather than grading each reasoning step.

Let reasoning and self-checking emerge.

Distill the behavior into smaller deployable models.
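The first two steps can be sketched as a rule-based outcome reward. This is a minimal illustration, not the paper's code: the function name `outcome_reward`, the exact tag layout, and the 0.1 format bonus are assumptions made for the example.

```python
import re

# Assumed completion layout: reasoning in <think>, final answer in <answer>.
PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def outcome_reward(completion: str, reference: str) -> float:
    """Score a completion by its final answer only; the reasoning text is never graded."""
    match = PATTERN.search(completion)
    if match is None:
        return 0.0  # no parseable answer, so nothing to check
    answer = match.group(2).strip()
    format_bonus = 0.1  # illustrative value, not taken from the paper
    accuracy = 1.0 if answer == reference.strip() else 0.0
    return accuracy + format_bonus

good = "<think>17 + 25 = 42</think><answer>42</answer>"
wrong = "<think>17 + 25 = 41</think><answer>41</answer>"
unformatted = "42"

for completion in (good, wrong, unformatted):
    print(outcome_reward(completion, "42"))
```

The reward never inspects the reasoning itself. Any self-checking the model learns has to pay for itself through better final answers.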

The quick digest

DeepSeek-R1 is about teaching the model a behavior: slow down, explore a problem, check itself, and produce a better answer on tasks where correctness can be rewarded. Instead of only imitating human-written solutions, the model learns from reward signals tied to outcomes.
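One way to turn outcome rewards into a learning signal is to compare several sampled answers to the same prompt against each other, which is the idea behind the group-relative method the paper uses. The sketch below shows only that normalization step, not a full training loop. The function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up rewards: two correct and two incorrect samples for one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. If every sample earns the same reward, the advantages are all zero and that prompt contributes no signal.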

The striking part is that reasoning patterns can emerge from reinforcement learning, then be distilled into smaller models. That means expensive “thinking” behavior can become training material for cheaper models.
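A rough sketch of how reasoning traces could become training material: keep only the teacher outputs whose final answer checks out, then store them as prompt/completion pairs for supervised fine-tuning of a smaller model. The `TeacherSample` type, the field names, and the record format are hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass

@dataclass
class TeacherSample:
    prompt: str
    reasoning: str   # the teacher's long-form reasoning trace
    answer: str      # the teacher's final answer
    reference: str   # known-correct answer used for checking

def build_distillation_set(samples: list[TeacherSample]) -> list[dict[str, str]]:
    """Keep only verified traces and format them as prompt/completion pairs."""
    records = []
    for s in samples:
        if s.answer.strip() != s.reference.strip():
            continue  # drop traces that end in a wrong answer
        records.append({
            "prompt": s.prompt,
            "completion": f"{s.reasoning}\nFinal answer: {s.answer}",
        })
    return records

samples = [
    TeacherSample("What is 6 * 7?", "6 * 7 = 42.", "42", "42"),
    TeacherSample("What is 9 + 8?", "9 + 8 = 16.", "16", "17"),
]
dataset = build_distillation_set(samples)
print(len(dataset), dataset[0]["prompt"])
```

The smaller model then learns by imitation from these records, so it needs no reward signal of its own.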

For builders, the paper raises a concrete question: which tasks have answers you can verify? Math, code with tests, structured extraction, and puzzle-like workflows are much easier to reward than vague judgment tasks.
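As a hedged illustration of what "verifiable" can look like in practice, here are two small checkers, one for numeric answers and one for structured extraction. The function names and the tolerance are assumptions. A vague judgment task has no equivalent of either function.

```python
import json
import math

def check_math(output: str, expected: float, tol: float = 1e-9) -> bool:
    """True if the output parses as a number close to the expected value."""
    try:
        return math.isclose(float(output.strip()), expected, abs_tol=tol)
    except ValueError:
        return False  # not a number at all

def check_extraction(output: str, expected: dict) -> bool:
    """True if the output is valid JSON that equals the expected record."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a failure

print(check_math(" 3.5 ", 3.5))
print(check_math("about three", 3.5))
print(check_extraction('{"city": "Paris", "year": 2024}', {"city": "Paris", "year": 2024}))
print(check_extraction('{"city": "Paris"', {"city": "Paris"}))
```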

What to remember

One-liner

Reasoning can be trained with rewards.

Why it matters

Checkable tasks are gold for learning.

Builder instinct

Distillation can compress expensive thinking.

Read it like this

First pass: Separate the RL story from the distillation story.
Second pass: Look for what reward signals are available.
Then build taste: Ask which of your workflows have checkable answers.

Build instinct

Take a small set of verifiable tasks and compare direct prompting against a reasoning model paired with an automatic answer check, scoring both with the same checker.
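A minimal harness for that comparison might look like the sketch below. Everything in it is a placeholder: `direct_prompt` and `reasoning_model` stand in for real model calls, and their canned outputs are made up so the script runs. They are not measurements.

```python
from typing import Callable

Task = tuple[str, str]  # (question, reference answer)

TASKS: list[Task] = [("2 + 2", "4"), ("13 * 3", "39"), ("10 - 7", "3")]

def accuracy(solver: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks where the solver's answer matches the reference exactly."""
    correct = sum(solver(question).strip() == ref for question, ref in tasks)
    return correct / len(tasks)

# Placeholder solvers: swap these for calls to the models you want to compare.
def direct_prompt(question: str) -> str:
    canned = {"2 + 2": "4", "13 * 3": "36", "10 - 7": "3"}  # made-up outputs
    return canned[question]

def reasoning_model(question: str) -> str:
    canned = {"2 + 2": "4", "13 * 3": "39", "10 - 7": "3"}  # made-up outputs
    return canned[question]

for name, solver in [("direct", direct_prompt), ("reasoning", reasoning_model)]:
    print(name, round(accuracy(solver, TASKS), 2))
```

Both solvers are scored by the same checker on the same tasks, so any accuracy gap reflects the solver and not the grading.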
