Chain-of-Thought Prompting: When It Helps and When It Hurts

What chain-of-thought prompting is

Chain-of-thought prompting, often shortened to CoT, is the technique of asking a language model to write out its reasoning before giving a final answer. The simplest version is the instruction "Think step by step." Slightly more structured versions ask the model to label its steps, list assumptions, or check its own conclusion at the end.

The reason this works is not magical. When the model writes intermediate steps, those steps become part of the context it uses to produce the final answer. A question that would otherwise have been answered in the first few tokens now has a few hundred tokens of relevant scaffolding in front of it. For tasks where the right answer depends on multi-step logic, that scaffolding is what makes the difference between a guess and a correct answer.

A quick example

Without CoT:

> Q: A pen costs three times as much as a pencil. Together they cost £4.40. How much is the pencil?
> A: £1.10

With CoT:

> Q: A pen costs three times as much as a pencil. Together they cost £4.40. How much is the pencil?
> Let me work this through.
> Let pencil = x. Then pen = 3x.
> x + 3x = 4.40
> 4x = 4.40
> x = 1.10
> A: £1.10

Here both answers happen to match, but on harder problems the second style tends to produce correct answers more often. The longer the chain of reasoning a question requires, the larger the gap.

When CoT actually helps

CoT is worth using for:

Multi-step arithmetic and word problems. Anything where you would reach for a piece of paper if you were solving it yourself.
Logical puzzles. Constraint satisfaction, deduction from rules, sorting by criteria.
Decision tasks with several factors. "Given these five trade-offs, which option should we pick?"
Code that requires planning. Generating a function where the structure matters more than the syntax.

The pattern: if a human would benefit from scratch work, the model usually does too.

When CoT does not help — or actively hurts

CoT is unhelpful or counterproductive for:

Simple factual lookup. "What is the capital of Norway?" The model knows it or does not. Reasoning out loud adds latency and sometimes lets the model talk itself into a wrong answer.
Style and tone tasks. Asking the model to rewrite a sentence in a friendlier voice does not need step-by-step thinking. It needs an example.
Tasks with a single deterministic answer. Format conversions, translations, JSON extraction. The reasoning trace adds noise to the output, which then has to be parsed out.
Latency-sensitive applications. CoT can multiply output tokens by 5–10x, and response time grows with them. For real-time use cases that matters.

A useful heuristic: if you are tempted to ask the model to "be careful," CoT helps. If you are asking the model to "be brief," CoT works against you.

The reasoning-model wrinkle

Recent reasoning models (OpenAI's o-series, Anthropic's extended thinking modes, Google's Gemini Thinking variants) have CoT built in. They produce a hidden or visible reasoning trace before answering, regardless of your prompt.

This changes two things:

Adding "think step by step" is mostly redundant. The model is already doing it. The instruction does no harm but also no measurable good.
Asking these models to skip reasoning is unreliable. "Just give me the answer, do not reason" is often ignored. If you need a fast non-reasoning response, you should use a non-reasoning model in the first place.

The practical advice for these models: stop writing CoT prompts and start writing problem statements. The reasoning will happen automatically.

How to write a good CoT prompt for non-reasoning models

If you are still working with a standard chat model, a few patterns produce better results than the bare "think step by step":

Show the reasoning structure you want. Instead of leaving it open, give a template:

1. Identify the variables.
2. Write the equation.
3. Solve algebraically.
4. State the final answer on a line beginning with "Answer:".

This both improves accuracy and makes the final answer easy to extract programmatically.
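As a rough illustration of the template and the extraction step, here is a minimal Python sketch. The names `COT_TEMPLATE`, `build_cot_prompt`, and `extract_answer` are hypothetical, and the sample response is hand-written rather than real model output. Sending the prompt to a model is left out, since that depends on whichever client you use.

```python
import re
from typing import Optional

# Hypothetical template wording, following the four numbered steps above.
COT_TEMPLATE = """Solve the problem below. Work through these steps in order:
1. Identify the variables.
2. Write the equation.
3. Solve algebraically.
4. State the final answer on a line beginning with "Answer:".

Problem: {problem}"""


def build_cot_prompt(problem: str) -> str:
    """Fill the template with a concrete problem statement."""
    return COT_TEMPLATE.format(problem=problem)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last line starting with 'Answer:', or None."""
    matches = re.findall(r"^Answer:[ \t]*(.+)$", response, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Hand-written stand-in for a model response (an assumption, not real output).
sample_response = """1. Let pencil = x, so pen = 3x.
2. x + 3x = 4.40
3. 4x = 4.40, so x = 1.10
Answer: £1.10"""

print(extract_answer(sample_response))
```

Taking the last match rather than the first is a deliberate choice: if the model restates or corrects itself, the final "Answer:" line is the one most likely to reflect its conclusion.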

Ask the model to check its work. After the reasoning, add: "Now check your answer by substituting back into the original problem. If it does not match, redo step 3."

Separate the reasoning from the answer. Use a delimiter like --- ANSWER --- so your application can split the response and only show the final part to the user. The reasoning is useful but rarely what the end user wants to read.
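A small sketch of how an application might do that split, assuming the model was instructed to emit the delimiter exactly once before its final answer. The function name `split_response` and the fallback behaviour when the delimiter is missing are illustrative choices, not a fixed convention.

```python
from typing import Tuple

DELIMITER = "--- ANSWER ---"


def split_response(response: str) -> Tuple[str, str]:
    """Split a model response into (reasoning, answer) around the delimiter.

    If the delimiter is missing, the whole response is treated as the answer
    and the reasoning is returned as an empty string.
    """
    reasoning, sep, answer = response.rpartition(DELIMITER)
    if not sep:
        return "", response.strip()
    return reasoning.strip(), answer.strip()


# Hand-written stand-in for a model response (an assumption, not real output).
raw = "Let pencil = x. Then 4x = 4.40, so x = 1.10.\n--- ANSWER ---\n£1.10"
reasoning, answer = split_response(raw)
print(answer)
```

The reasoning half can still be logged for debugging even when only the answer is shown to the user.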

Self-consistency: a small upgrade for hard problems

For especially tricky problems, you can run the same CoT prompt several times with a non-zero temperature and take the majority answer across runs. This is called self-consistency. It costs more, since each question is typically run 5–10 times instead of once, but on logic and maths problems it can recover correctness on questions the model would otherwise get wrong half the time.

This is overkill for everyday work, but worth it for high-stakes decisions where the cost of being wrong dwarfs the cost of extra tokens.
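The majority vote itself is only a few lines. The sketch below assumes a hypothetical `sample` callable that runs the CoT prompt once at non-zero temperature and returns the extracted final answer, or `None` if none could be parsed. Here it is replaced by a canned sequence so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample: Callable[[], Optional[str]], runs: int = 5
) -> Optional[str]:
    """Call `sample` `runs` times and return the most common non-None answer."""
    answers = [a for a in (sample() for _ in range(runs)) if a is not None]
    if not answers:
        return None
    # most_common(1) returns a list holding one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]


# Canned answers standing in for five sampled model runs (an assumption).
canned = iter(["£1.10", "£1.10", "£1.47", "£1.10", None])
print(self_consistent_answer(lambda: next(canned), runs=5))
```

Voting on the extracted answer rather than the full response matters: two runs can reason differently and still agree on the result, and only the result should be counted.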

When in doubt

The default rule of thumb: try CoT when the task involves reasoning, skip it when the task involves style or lookup. If the model is wrong on a reasoning task and you are not using CoT, adding it is almost always the first thing to try. If the model is wrong on a style task and you are using CoT, removing it and adding a concrete example is almost always the first thing to try.

Like most prompt techniques, it is a tool, not a rule. Knowing which jobs it fits is the whole skill.
