Why 'Think Step by Step' Can Now Hurt Reasoning Models (And What to Do Instead)

Reasoning models already reason internally, so 'think step by step' can now hurt accuracy. How to spot them and what to prompt instead, with a copy-paste template.


2026-07-03

Many people add "think step by step" to almost every prompt. It became the one prompting trick everyone learned. But on the newest reasoning models, that instruction can make your outputs worse, not better, and few people have updated the habit.

The advice was correct for the models of 2023. It is quietly wrong for the reasoning models most of us use in 2026. Here is what changed, how to tell which mode you are in, and the small adjustment that gets you cleaner answers.

What is chain-of-thought prompting?

Chain-of-thought (CoT) prompting means instructing a model to reason through a problem in explicit intermediate steps before giving a final answer, usually by adding "think step by step". It was a genuine breakthrough: on older models it improved complex reasoning accuracy substantially, with published gains like a 19-point jump on the MMLU-Pro benchmark.

The technique works by making the model surface its intermediate steps instead of jumping straight to a guess.

Those visible steps did two things. They gave the model room to work through logic it would otherwise skip, and they let you catch and correct a wrong step mid-way.

That is why "think step by step" spread everywhere. For a 2023-era model with no built-in reasoning, it was close to free accuracy.

Why can "think step by step" now hurt reasoning models?

Because reasoning models already do chain-of-thought internally. Models like OpenAI's o-series, Claude with Extended Thinking, and Gemini Thinking Mode are trained to reason before answering by default. Adding an explicit CoT instruction on top can cause over-explanation and prompt overfitting, which a 2025 Wharton Generative AI Labs report links to reduced accuracy on some tasks.

You are asking the model to do out loud what it is already doing better on its own.

The Wharton technical report, "The Decreasing Value of Chain of Thought in Prompting", found that many current models perform CoT-style reasoning even when they are not told to.

When you force the reasoning into the visible output anyway, two failure modes appear. The model over-explains a task that needed compact internal computation, and it overfits to the explanation format rather than the actual problem structure.

This is why many production systems in 2026 now suppress visible chain-of-thought while still letting the model reason internally. The reasoning stays; the forced narration goes.

Which models are reasoning models and which are not?

A reasoning model reasons internally before answering by default; a standard model answers directly unless you prompt it to slow down. This distinction decides whether "think step by step" helps or hurts. Get it wrong and you either waste a reasoning model's strength or leave a standard model guessing.

You almost always know which one you are using from the model name or mode.

Reasoning models (skip explicit CoT):

--- OpenAI o-series reasoning models.

--- Claude with Extended Thinking turned on.

--- Gemini Thinking Mode.

Standard models (CoT still helps):

--- Fast, non-thinking chat models used for quick replies.

--- Default modes where no "thinking" toggle or reasoning label is shown.

The rule of thumb: if the interface shows a thinking, reasoning, or extended-thinking indicator, the model is already doing CoT, so do not add it yourself.

What should you do instead of "think step by step"?

Give the model structure and constraints, not a command to narrate. Instead of telling a reasoning model how to think, tell it what a good answer contains: the criteria, the format, the edge cases to check, and the level of detail you want in the final output. Direct the destination, not the route.

This shifts your effort from performative reasoning to useful specification.

For reasoning models, replace "think step by step" with requirements. Name the decision criteria, ask for assumptions to be stated, and specify the output shape you need.

For standard non-reasoning models, keep CoT, but pair it with few-shot examples, which remain one of the highest-ROI techniques in 2026 for formatting and consistency.
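For the standard-model case, pairing CoT with few-shot examples can be sketched in a few lines of Python. This is only an illustration: the function name `few_shot_cot_prompt`, the "Q / Reasoning / A" layout, and the sample questions are assumptions, not a format any particular model requires.

```python
# Hypothetical helper for standard (non-reasoning) models only.
def few_shot_cot_prompt(examples: list[tuple[str, str, str]], question: str) -> str:
    """Build a prompt from (question, worked_steps, answer) examples plus a new question."""
    parts = []
    for q, steps, answer in examples:
        # Each example shows the reasoning and the answer format we want copied.
        parts.append(f"Q: {q}\nReasoning: {steps}\nA: {answer}")
    # The final block leaves the reasoning open for the model to complete.
    parts.append(f"Q: {question}\nReasoning: Let's think step by step.")
    return "\n\n".join(parts)


# Made-up example data, purely for illustration.
examples = [
    ("A team of 4 each writes 3 posts. How many posts?", "4 people x 3 posts = 12.", "12"),
]
prompt = few_shot_cot_prompt(
    examples, "A team of 12 each attends 2 meetings. How many attendances?"
)
```

The worked examples carry the formatting and consistency, while the trailing instruction gives the model room to reason before it answers.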

Here is a copy-paste prompt built for a reasoning model. Notice there is no "think step by step", only structure:


You are evaluating whether our team should switch our project management tool. Do not narrate your reasoning. Instead, produce: (1) the three decision criteria that matter most for a 12-person marketing team; (2) how each option scores against those criteria, with one concrete reason per score; (3) any assumptions you are making, listed separately; (4) a single recommendation in one sentence. Flag anything you are uncertain about rather than guessing.

This gives the reasoning model a target to hit and gives you a checkable, well-shaped answer without forcing it to think out loud.
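If you reuse this pattern often, the same prompt can be assembled from a task description and a list of deliverables. The sketch below is one possible way to do that; `structured_prompt` is a hypothetical helper name, and the fixed sentences are taken from the template above.

```python
def structured_prompt(task: str, deliverables: list[str]) -> str:
    """Build a prompt that specifies what the answer must contain, not how to reason."""
    numbered = "; ".join(
        f"({i}) {item}" for i, item in enumerate(deliverables, start=1)
    )
    return (
        f"{task} Do not narrate your reasoning. Instead, produce: {numbered}. "
        "Flag anything you are uncertain about rather than guessing."
    )


prompt = structured_prompt(
    "You are evaluating whether our team should switch our project management tool.",
    [
        "the three decision criteria that matter most for a 12-person marketing team",
        "how each option scores against those criteria, with one concrete reason per score",
        "any assumptions you are making, listed separately",
        "a single recommendation in one sentence",
    ],
)
```

Swapping in a different task and deliverables keeps the structure while changing the destination you are specifying.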

When does chain-of-thought still help?

It still helps on standard non-reasoning models and on genuinely multi-step tasks where you want to inspect the logic. If you are using a fast chat model without a thinking mode, or you need to audit each step of a calculation or a legal-style argument, explicit CoT remains valuable. The key is to match the technique to the model.

CoT is a tool, not a default to staple onto every prompt.

Keep CoT when the model has no built-in reasoning and the task has real intermediate steps, such as multi-part maths, structured planning, or debugging a chain of decisions.

Also keep it when transparency is the point. If you must show your working to a client or a compliance reviewer, visible reasoning is a feature, even if a reasoning model could reach the answer silently.

Drop it when you are on a reasoning model and you only care about a clean, correct final answer. In that case, structure and constraints usually beat narration.
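The keep-or-drop guidance in this section can be written down as a small rule of thumb. The function below is a sketch, not a definitive policy: the name and the three flags are assumptions, and the last branch (a standard model on a task with no real intermediate steps) is an interpretation the text does not spell out.

```python
def should_use_explicit_cot(
    has_builtin_reasoning: bool,
    has_intermediate_steps: bool,
    needs_visible_working: bool,
) -> bool:
    """Rule-of-thumb version of the guidance above; test on your own tasks."""
    if needs_visible_working:
        # Transparency is the point, so visible reasoning is a feature.
        return True
    if has_builtin_reasoning:
        # The model already reasons internally; specify structure instead.
        return False
    # Standard model: keep CoT only when the task has real intermediate steps.
    return has_intermediate_steps
```

For example, a reasoning model answering for a compliance reviewer gives `True`, while the same model producing a plain final answer gives `False`.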

How do you test whether CoT is helping your prompt?

Run the same task with two prompts, one with the CoT instruction and one with a structured prompt, then compare the final answers. This 5-minute A/B test tells you far more than any general rule, because it reflects your exact model, task, and quality bar. Trust the comparison over the habit.

Testing turns a debate into evidence.

A simple test you can run today:

--- Pick one real task you do often, such as summarising a report or drafting a decision memo.

--- Version A: add "think step by step" and run it three times.

--- Version B: remove CoT, add clear criteria and output format, and run it three times.

--- Compare the final answers only, not the reasoning, and keep the version whose answers are more consistently accurate and cleanly formatted.

Whichever wins for your model and your task is the correct answer for you, regardless of what any guide, including this one, claims.
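If you would rather script the comparison, the test above can be sketched in Python. Everything named here is an assumption for illustration: `call_model` stands in for whatever function sends a prompt to your model and returns its final answer, `score` is your own quality measure, and the stub model and length-based score exist only so the example runs on its own.

```python
from statistics import mean
from typing import Callable


def ab_test(
    task: str,
    call_model: Callable[[str], str],
    score: Callable[[str], float],
    runs: int = 3,
) -> dict[str, float]:
    """Run a CoT prompt and a structured prompt `runs` times each; return mean scores."""
    variants = {
        "A_cot": f"{task}\n\nThink step by step.",
        "B_structured": (
            f"{task}\n\nDo not narrate your reasoning. "
            "State your assumptions, then give the final answer in the requested format."
        ),
    }
    results = {}
    for name, prompt in variants.items():
        # Only the returned answers are scored, never the reasoning.
        results[name] = mean(score(call_model(prompt)) for _ in range(runs))
    return results


# Stand-in model and scorer so the sketch is runnable; replace both with your own.
def fake_model(prompt: str) -> str:
    return "long rambling answer" if "step by step" in prompt else "short answer"


scores = ab_test("Summarise the report.", fake_model, score=lambda answer: -len(answer))
winner = max(scores, key=scores.get)
```

The scoring function is where your quality bar lives, so it deserves more thought than the harness. Shorter is not better in general; it is just a placeholder here.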

The takeaway

"Think step by step" is not dead, but it is no longer a universal default. On the reasoning models most of us now use, the model is already doing chain-of-thought internally, so forcing it into the output can cost you accuracy. Switch from commanding the route to specifying the destination, and test it on your own tasks.

The people who keep pace with AI are the ones who update their habits as the models change.
