Autoregression & Next-Token Prediction

Imran Kutianawala
July 17, 2025

Introduction

Every time a language model generates text, it's doing something surprisingly simple: predicting one token at a time, with each choice shaped by everything that came before. This post breaks down how that works — the math behind next-token prediction, what goes wrong when models run too long, and what happens when you lean into the randomness on purpose. The experiment: write an AP Language & Composition essay one word at a time using the Gemini API, forcing the model to treat each generation as an isolated prediction with no accumulated intent. The result is about as coherent as you'd expect.
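To make the core loop concrete, here is a minimal sketch of autoregression using a toy bigram "model" in place of a real network — every name here (the corpus, `next_token_distribution`, `generate`) is illustrative and not from the post or the Gemini API. The point is the shape of the loop: each new token is sampled from a distribution conditioned on what came before, then appended to the context for the next step.

```python
import math
import random

# Toy stand-in for a language model: bigram counts over a tiny corpus
# play the role of a neural network's output distribution.
CORPUS = "the cat sat on the mat the cat ate".split()
BIGRAMS = list(zip(CORPUS, CORPUS[1:]))

def next_token_distribution(prev):
    """Estimate P(next | prev) from bigram counts, as a dict token -> prob."""
    counts = {}
    for a, b in BIGRAMS:
        if a == prev:
            counts[b] = counts.get(b, 0) + 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(start, n_tokens, temperature=1.0, seed=0):
    """Autoregressive loop: sample each token conditioned on the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_tokens):
        dist = next_token_distribution(out[-1])
        if not dist:  # no observed continuation: stop early
            break
        # Temperature reshapes the distribution: low -> near-greedy,
        # high -> closer to uniform (leaning into the randomness).
        toks = list(dist)
        weights = [math.exp(math.log(p) / temperature) for p in dist.values()]
        out.append(rng.choices(toks, weights=weights)[0])
    return " ".join(out)
```

Swap the bigram table for a transformer's softmax output and the loop is the same; the post's word-at-a-time essay experiment amounts to restarting this loop from scratch at every step, so no single call ever sees a plan for the whole essay.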