The surprise that frustrates everyone first
You write a careful prompt, run it, and get an excellent answer. You run it again ten minutes later and the answer is different. The structure is different, the wording is different, sometimes even the conclusion. Nothing you did changed.
This is the single most common surprise people hit when they start working with language models seriously. It is also one of the easiest to get under control once you understand the moving parts.
Why variation exists at all
Language models generate text one token at a time. At each step, the model assigns a probability to every possible next token, and then a sampler picks one. If the sampler always picked the single most likely token, output would be nearly deterministic. By default it usually does not, because always taking the top token tends to produce dull, repetitive text. So there is a controlled amount of randomness built into the system.
That randomness compounds. A different choice at token 12 changes the probabilities at token 13, which changes the probabilities at token 14, and so on. By the end of a paragraph, two runs can look completely different even though the prompt was identical.
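To get a feel for how quickly this compounds, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that two runs agree on each token independently and with the same fixed probability. Real models do not behave this simply, and the function name `chance_runs_match` is made up for this example.

```python
def chance_runs_match(per_token_agreement: float, num_tokens: int) -> float:
    """Probability that two runs pick the same token at every step.

    Simplifying assumption: each step agrees independently with the
    same probability. This is a toy model, not a property of real systems.
    """
    if not 0.0 <= per_token_agreement <= 1.0:
        raise ValueError("per_token_agreement must be between 0 and 1")
    if num_tokens < 0:
        raise ValueError("num_tokens must not be negative")
    return per_token_agreement ** num_tokens


if __name__ == "__main__":
    # Compare a short answer, a paragraph, and a longer response.
    for n in (10, 50, 200):
        print(n, round(chance_runs_match(0.98, n), 3))
```

Under this toy assumption, even 98% agreement per token leaves a 200-token answer identical across two runs less than 2% of the time.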
This is not a bug. It is how the model is designed. The question is how much variation you want, and the answer depends on the task.
The two main knobs
Two settings control variation in almost every model:
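Temperature. Roughly, how much the sampler favours less-likely tokens. At temperature 0, the model picks the most likely token every time, so runs are very close to deterministic. At temperature 1.0, the sampler draws from the model's probabilities as they are, so less-likely tokens get picked in proportion to their probability. At higher values, the distribution flattens, and output becomes unstable and often nonsensical.
Top-p (or top-k). A cap on which candidate tokens the sampler considers at each step. Top-p caps cumulative probability: top-p 0.1 means "only consider the most likely tokens that together make up the first 10% of probability mass." Top-k caps the count instead: top-k 40 means "only consider the 40 most likely tokens." Lower values constrain variety; higher values open it up.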
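The two settings can be sketched on a toy distribution. This is a minimal illustration of the common textbook formulation (divide the scores by the temperature, apply softmax, then keep the smallest set of top tokens that reaches the top-p threshold). The token names and scores are invented, and real serving stacks may differ in the details.

```python
import math
import random


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        # Treat temperature 0 as greedy: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise the survivors


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    toy_logits = {"the": 2.0, "a": 1.0, "zebra": -1.0}  # invented scores
    rng = random.Random(0)
    for temperature in (0.0, 0.5, 1.0, 2.0):
        probs = apply_temperature(toy_logits, temperature)
        print(temperature, {t: round(p, 3) for t, p in probs.items()})
    narrowed = top_p_filter(apply_temperature(toy_logits, 1.0), top_p=0.9)
    print(sorted(narrowed), sample(narrowed, rng))
```

Raising the temperature moves probability from "the" toward "zebra"; lowering top-p removes "zebra" from consideration entirely before the draw happens.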
For most production work, the rule of thumb is:
- Extraction, classification, structured output: temperature 0 to 0.2. You want stability.
- Writing, summarising, explanation: temperature 0.3 to 0.7. Enough variation to avoid robotic output, not enough to spin off course.
- Brainstorming, creative writing: temperature 0.8 to 1.0. You want the surprise.
Almost no real task benefits from temperature above 1.0. If you see it in tutorials, treat it as a curiosity.
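One way to keep this rule of thumb from drifting across a codebase is to write it down once as configuration. The sketch below is only an illustration: the task labels and the helper `suggested_temperature` are hypothetical, and picking the midpoint of each range is an arbitrary choice, not a recommendation.

```python
# Ranges copied from the rule of thumb above; the keys are made-up labels.
TEMPERATURE_RANGES: dict[str, tuple[float, float]] = {
    "extraction": (0.0, 0.2),
    "classification": (0.0, 0.2),
    "structured_output": (0.0, 0.2),
    "writing": (0.3, 0.7),
    "summarising": (0.3, 0.7),
    "explanation": (0.3, 0.7),
    "brainstorming": (0.8, 1.0),
    "creative_writing": (0.8, 1.0),
}


def suggested_temperature(task: str) -> float:
    """Return the midpoint of the range for a task label (an arbitrary default)."""
    if task not in TEMPERATURE_RANGES:
        raise ValueError(f"unknown task type: {task!r}")
    low, high = TEMPERATURE_RANGES[task]
    return round((low + high) / 2, 2)


if __name__ == "__main__":
    for task in ("classification", "summarising", "brainstorming"):
        print(task, suggested_temperature(task))
```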
Why temperature 0 is not fully deterministic
A common surprise: even at temperature 0, repeated runs sometimes differ. This happens for several reasons. Most serving systems batch requests together, which changes the order of floating-point operations and introduces small numerical differences. When two tokens are nearly tied at the top of the probability distribution, those differences can flip which one wins. Tool use, retrieval, and timestamped context can change the input itself between runs.
For most purposes "almost deterministic" is what you have. If you need bit-exact reproducibility, you need a fixed seed and a model that supports it, and even then implementations vary.
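The batching and tie-breaking effects can be seen in miniature with ordinary floating-point arithmetic. The sketch below is a toy illustration, not a description of any particular serving system: the token names and scores are invented, and the only real claim is that floating-point addition gives slightly different results depending on the order of operations.

```python
def greedy_pick(scores: dict[str, float]) -> str:
    """Temperature-0 style choice: take the highest score, first entry wins a tie."""
    return max(scores, key=scores.get)


# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Two "runs" whose score for "yes" differs only in the order of addition.
run_a = {"no": 0.6, "yes": left_to_right}
run_b = {"no": 0.6, "yes": right_to_left}

if __name__ == "__main__":
    print(left_to_right == right_to_left)  # False: the two sums differ in the last bits
    print(greedy_pick(run_a), greedy_pick(run_b))
```

In the first run "yes" wins by a hair; in the second the scores tie exactly and the other token is picked. Once one token differs, everything generated after it can differ too.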
Things you can do beyond temperature
Settings help, but the bigger gains usually come from the prompt itself.
Constrain the output shape. A prompt that asks for "a summary" allows huge variation in length, structure, and tone. A prompt that asks for "exactly three sentences, each under 25 words, no adjectives" allows almost none. Specifying the shape collapses the space of plausible outputs.
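A shape constraint is also easy to check mechanically, which lets you reject and retry outputs that break it. Here is a minimal sketch for the "three sentences, each under 25 words" example. The function name `check_shape` is made up, the sentence splitter is deliberately naive (it will miscount abbreviations such as "e.g."), and the "no adjectives" rule is left out because it needs a part-of-speech tagger.

```python
import re


def check_shape(text: str, sentences: int = 3, max_words: int = 24) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    parts = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    problems = []
    if len(parts) != sentences:
        problems.append(f"expected {sentences} sentences, found {len(parts)}")
    for i, sentence in enumerate(parts, start=1):
        count = len(sentence.split())
        if count > max_words:  # "under 25 words" means at most 24
            problems.append(f"sentence {i} has {count} words")
    return problems


if __name__ == "__main__":
    reply = "Sales rose in March. Costs stayed flat. Margins improved as a result."
    print(check_shape(reply))
```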
Use schemas. Asking the model to fill in a JSON schema or a fixed markdown template makes the structural part of the output invariant. Only the content varies, and the content variation is usually the part you actually want.
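On the receiving side, a schema only helps if you verify that the reply actually follows it. The sketch below uses only the standard library and a hand-rolled field check. The field names in `EXPECTED_FIELDS` are hypothetical, and a real project might prefer a dedicated validation library.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s) after JSON parsing.
EXPECTED_FIELDS = {
    "sentiment": str,
    "confidence": (int, float),
    "topics": list,
}


def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply and raise ValueError if it does not match the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = sorted(EXPECTED_FIELDS.keys() - data.keys())
    extra = sorted(data.keys() - EXPECTED_FIELDS.keys())
    if missing or extra:
        raise ValueError(f"missing fields: {missing}, unexpected fields: {extra}")
    for field, kind in EXPECTED_FIELDS.items():
        if not isinstance(data[field], kind):
            raise ValueError(f"field {field!r} has the wrong type")
    return data


if __name__ == "__main__":
    reply = '{"sentiment": "positive", "confidence": 0.9, "topics": ["pricing"]}'
    print(parse_structured_reply(reply))
```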
Give examples. Two or three input-output pairs anchor the model to a specific style. Without examples, the model's idea of "professional tone" can mean anything from a press release to a hospital discharge note. With examples, it converges quickly.
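Assembling the examples in code keeps their formatting identical on every call, which removes one more source of drift. This is a minimal sketch: the `Input:` / `Output:` labels and the function name `build_few_shot_prompt` are assumptions for illustration, not a required format.

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    new_input: str,
) -> str:
    """Lay out an instruction, the worked examples, and the new input in a fixed format."""
    lines = [instruction.strip(), ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    # End on an open "Output:" so the model continues in the same pattern.
    lines += [f"Input: {new_input}", "Output:"]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "Rewrite each message in a professional tone.",
        [
            ("cant make it today sry", "Unfortunately, I am unable to attend today."),
            ("need that report asap", "Could you send the report at your earliest convenience?"),
        ],
        "whats the status on the invoice",
    )
    print(prompt)
```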
Shorten the output. Long generations have more room to drift. If a single paragraph would do, do not ask for a page.
Break tasks into steps. A single prompt doing extraction, reasoning, and writing all at once has three sources of variation stacked. Splitting it into three prompts produces more consistent results per step, at the cost of more calls.
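The split can be sketched as a small pipeline. Everything here is hypothetical: `call_model` stands in for whatever function sends a prompt to your model and returns its text, and the three prompt wordings are placeholders.

```python
from typing import Callable


def run_pipeline(document: str, call_model: Callable[[str], str]) -> str:
    """Run extraction, reasoning, and writing as three separate calls."""
    # Step 1: extraction only. A narrow task, so a tight prompt and low temperature fit.
    facts = call_model("Extract the key facts as a bullet list.\n\n" + document)
    # Step 2: reasoning over the extracted facts, not the raw document.
    conclusion = call_model(
        "Given these facts, state the main conclusion in one sentence.\n\n" + facts
    )
    # Step 3: writing, constrained to the conclusion produced by step 2.
    return call_model(
        "Write a two-sentence summary based on this conclusion.\n\n" + conclusion
    )


if __name__ == "__main__":
    # A stand-in model that just reports the prompt size, to show the call flow.
    def fake_model(prompt: str) -> str:
        return f"[reply to {len(prompt)} chars]"

    print(run_pipeline("Revenue grew 4% while costs stayed flat.", fake_model))
```

Because each step has its own prompt, each can also be inspected, validated, and tuned separately, which is where most of the consistency gain comes from.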
When you actually want variation
It is worth saying: low variation is not always the goal. For brainstorming, drafting, or any task where the value comes from seeing alternatives, you want temperature up and outputs different across runs. The mistake is using high temperature on tasks where you wanted stability and then being surprised by inconsistency.
A useful question to ask yourself: if I ran this prompt 50 times, what kind of distribution of outputs would I want? A tight cluster around one answer? A spread of plausible alternatives? The answer tells you whether to crank temperature down or up.
Measuring stability properly
If consistency matters in your application, do not eyeball it. Run the same prompt 20 times on the same input. Compare the outputs. Decide whether the variation is acceptable.
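A small harness makes this repeatable. The sketch below assumes a hypothetical `generate` callable that takes a prompt and returns the model's text. It compares outputs by exact match after trimming whitespace, which suits classifiers and extraction; free-form writing would need a looser comparison.

```python
from collections import Counter
from typing import Callable


def measure_stability(generate: Callable[[str], str], prompt: str, runs: int = 20) -> dict:
    """Run the same prompt repeatedly and summarise how much the outputs agree."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [generate(prompt).strip() for _ in range(runs)]
    counts = Counter(outputs)
    most_common, top_count = counts.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outputs": len(counts),
        "agreement_rate": top_count / runs,  # share of runs giving the top answer
        "most_common": most_common,
    }


if __name__ == "__main__":
    import random

    rng = random.Random(0)

    # Stand-in for a real model call: mostly answers "spam", sometimes "not spam".
    def fake_classifier(prompt: str) -> str:
        return "spam" if rng.random() < 0.9 else "not spam"

    print(measure_stability(fake_classifier, "Is this message spam? ..."))
```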
What "acceptable" means depends on the use case. For a customer-facing classifier you probably want unanimous agreement. For a writing assistant you want stylistic consistency but varied wording. For a brainstorming tool you want a useful range.
If the variation is unacceptable, the order of fixes to try is: tighten the prompt, lower temperature, add a schema, switch to a model with better instruction following. Tightening the prompt almost always helps more than the other three.
The takeaway
Variation between AI outputs is real, expected, and largely controllable. Most of it disappears when you treat the prompt as a contract that specifies the shape, length, and style precisely, and when you set temperature deliberately rather than leaving it at the default. The teams that get consistent results are not using a different model. They are writing prompts that leave less room for the model to be creative in the wrong places.