Autoregression & Next-Token Prediction
Introduction
Every time a language model generates text, it's doing something surprisingly simple: predicting one token at a time, with each choice shaped by everything that came before. This post breaks down how that works — the math behind next-token prediction, what goes wrong when models run too long, and what happens when you lean into the randomness on purpose. The experiment: write an AP Language & Composition essay one word at a time using the Gemini API, forcing the model to treat each generation as an isolated prediction with no accumulated intent. The result is about as coherent as you'd expect.
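To make the "one token at a time" loop concrete, here is a minimal sketch using a toy bigram table in place of a real model (names like `BIGRAMS` and `generate` are invented for illustration, not from the post's experiment). A real LLM conditions each prediction on the entire preceding context rather than just the last token, but the generation loop is the same.

```python
import random

# Toy "language model": P(next | prev) as a hand-written bigram table.
# A real model computes these probabilities from the whole context.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 1.0},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def next_token(prev, rng):
    """Sample one token from P(token | prev)."""
    dist = BIGRAMS.get(prev)
    if dist is None:
        return None  # no known continuation: stop generating
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def generate(start, max_tokens=10, seed=0):
    """Autoregressive loop: each sampled token becomes the next input."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(max_tokens):
        tok = next_token(out[-1], rng)
        if tok is None:
            break
        out.append(tok)
    return " ".join(out)

print(generate("the"))
```

Because sampling is stochastic, reruns with different seeds take different paths through the table, which is exactly the randomness the experiment leans into.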