Discovery
In a recent project of mine, I had to investigate why an AI pipeline gave the output that it did. This is an exercise I am becoming increasingly familiar with, unfortunately. This pipeline primarily used Claude models, but I recently introduced Gemini models and thought it would be great to use Vertex AI Studio to compare a prompt from my pipeline across Gemini 2.5 Pro and the new Gemini 3 Pro Preview.
While exploring the interface, I noticed that I could configure the thinking done by the Gemini models. The UI was actually really nice and let me see the thought chains, which I thought would be really useful for debugging why an LLM output what it did. A few providers concur that this is a viable approach 1.
In fairness, I had reviewed the Google Gen AI SDK quite a bit for my implementation, but I did not realize that I could configure “thinking.”
“Enlightenment”
This led me down a rabbit hole where I realized that most model providers have configuration for thinking (or reasoning, as some of them call it). This made a lot of sense, and I was quickly reminded of the advancements in chain-of-thought (CoT) reasoning techniques that made the announcement of OpenAI’s o1-preview, a model that could
spend more time thinking through problems before they respond, much like a person would
— OpenAI 2
so interesting and “ground-breaking” at the time.
All in all, this was a great discovery for me and my pipelines, but honestly I feel as though it was another case of failing to RTFM.
I thought it would be nice to explore this space a bit more and share my findings. We all know these LLMs and their tokens aren’t cheap 3, so it would definitely be great to get more value out of them.
Thinking it Through
I think (pun intended) it would be best to focus on the “big three” providers and their primary models for my dive. I will primarily be covering:
- Anthropic Claude
- Google Gemini
- OpenAI GPT-5
Claude
Anthropic refers to thinking as “extended thinking”
Anthropic models are a bit interesting when it comes to thinking, as there are two primary “entry points” to Claude:
For Claude Code, Anthropic introduced a progressive thinking model with the following hierarchy
“think” < “think hard” < “think harder” < “ultrathink.”
— Anthropic 4
This seems like a reasonable interface for thinking in Claude Code given that it’s a CLI tool and adjusting per-message thinking budgets could be cumbersome.
Anthropic’s Messages API follows an approach similar to other providers where they let you configure a thinking budget via a thinking object with a budget_tokens property in your request:
To turn on extended thinking, add a thinking object, with the type parameter set to enabled and the budget_tokens to a specified token budget for extended thinking.
— Anthropic 5
Anthropic notes that there is a minimum budget of 1024 tokens, and they recommend you bump this budget incrementally until you find a value that is ideal for your use-case. They do recommend, however, that workloads with budgets above 32k tokens use batch processing.
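To make that request shape concrete, here is a minimal sketch that builds such a payload as a plain dict. The helper function, example model id, and the max_tokens padding are my own; only the thinking object’s shape and the 1024-token minimum come from Anthropic’s docs:

```python
# Sketch of an Anthropic Messages API payload with extended thinking enabled.
# The helper and model id are illustrative; the request shape follows the docs.

MIN_THINKING_BUDGET = 1024  # Anthropic's documented minimum

def build_claude_request(prompt: str, budget_tokens: int = MIN_THINKING_BUDGET) -> dict:
    """Build a Messages API payload with a thinking budget."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be >= {MIN_THINKING_BUDGET}")
    return {
        "model": "claude-sonnet-4-5",        # example model id
        "max_tokens": budget_tokens + 1024,  # max_tokens must exceed the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_claude_request("Why did my pipeline emit this output?", budget_tokens=2048)
```

From there, incrementally bumping budget_tokens (as Anthropic suggests) is a one-argument change.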
Honorable Mentions
Claude has a lot of interesting tidbits as it pertains to thinking. A few of them are:
- Summarized thinking - Get a summary of Claude’s full thinking process
- Thinking encryption and Thinking redaction as Claude’s reasoning can sometimes be flagged by Anthropic’s safety systems
- A host of best practices
- A useful guide on chain-of-thought prompt engineering
Gemini
Google notes that
thinking features are supported on all 3 and 2.5 series models.
— Google 6
Google has followed a fairly standard thinking configuration with thinking budgets for Gemini 2.5 and earlier models, but starting with the recently released Gemini 3 Pro, they’re moving to “thinking levels.”
You can still use the thinkingBudget parameter but Google warns you that this might lead to suboptimal performance.
For Gemini models, Google lets you:
- disable thinking (Pro models are excluded)
- use a set budget (within a predefined range for a specific model)
- use dynamic thinking (this lets the model “adjust the budget based on the complexity of the request”)
The above can be a bit confusing as Google notes this about Gemini 2.5 and earlier models
if thinking_budget is not set, the model automatically controls how much it thinks up to a maximum of 8,192 tokens
— Google 7
The highest thinking budget among Gemini models is Gemini 2.5 Pro’s whopping 32,768 tokens. Gemini 3 Pro’s thinking_level values of low and high map to Gemini 2.5 Pro’s thinking_budget as follows:
| reasoning_effort (OpenAI) | thinking_level (Gemini 3) | thinking_budget (Gemini 2.5) |
|---|---|---|
| minimal | low | 1,024 |
| low | low | 1,024 |
| medium | high | 8,192 |
| high | high | 24,576 |
You can see that this is all modeled on OpenAI’s reasoning efforts (which we will touch on next).
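The table above can also be captured as a simple lookup, which is handy if you’re normalizing thinking settings across providers in a pipeline. The names here are my own; the values come straight from the table:

```python
# The mapping table as a lookup; names are mine, values are from the table above.
EFFORT_TO_GEMINI = {
    "minimal": {"thinking_level": "low",  "thinking_budget": 1024},
    "low":     {"thinking_level": "low",  "thinking_budget": 1024},
    "medium":  {"thinking_level": "high", "thinking_budget": 8192},
    "high":    {"thinking_level": "high", "thinking_budget": 24576},
}

def gemini_params(reasoning_effort: str) -> dict:
    """Map an OpenAI-style reasoning effort to Gemini thinking settings."""
    return EFFORT_TO_GEMINI[reasoning_effort]
```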
OpenAI GPT-5
OpenAI refers to thinking as “reasoning”
OpenAI’s reasoning models use a reasoning.effort parameter to control reasoning (thinking). Similar to Gemini 3 models, you can use a low, medium, or high value.
There doesn’t seem to be a way of specifying a thinking or reasoning budget, but you can control the cost of reasoning via the max_output_tokens parameter. This controls the total number of tokens generated by the model, which includes both reasoning and output tokens (other models like Gemini also count thought tokens as output tokens, so that tracks).
If you happen to hit the limit you’ve set via max_output_tokens, you’ll get back a response with a status of incomplete.
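Here is a small sketch of what such a request and status check might look like, using plain dicts rather than the OpenAI SDK. The helper names and the stand-in response are my own; the field names follow OpenAI’s docs:

```python
# Sketch of a Responses API payload with reasoning effort, plus a status check.
# Helper names and the stand-in response dict are mine; fields follow the docs.

def build_openai_request(prompt: str, effort: str = "medium",
                         max_output_tokens: int = 4096) -> dict:
    if effort not in ("minimal", "low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},
        "max_output_tokens": max_output_tokens,  # caps reasoning + output tokens
    }

def ran_out_of_tokens(response: dict) -> bool:
    # A response that hits max_output_tokens comes back with status "incomplete"
    return response.get("status") == "incomplete"
```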
Just like the other two providers, OpenAI’s reasoning models offer reasoning summaries.
Final Thoughts
My final thoughts (again, pun intended) are that there is just so much to learn when it comes to using these LLMs and fully utilizing their abilities. Along with ensuring you RTFM, you need to do a good amount of experimentation: play around with these models and utilize all the available playgrounds (like Vertex AI Studio and the AWS Bedrock playgrounds).
Nonetheless, I am excited about what this means for my pipelines and workflows. Working with these tools has been nothing short of amazing (albeit frustrating).
Providers like Anthropic and Google note that this can and should be used in your prompt engineering process/iteration(s). ↩︎
This quote is from OpenAI’s o1-preview announcement post from late (September) 2024 ↩︎
Take a look at this price-per-token tracker and lament at the ever-increasing prices: https://pricepertoken.com/ ↩︎
This is from Anthropic’s best practices guide - Claude Code: Best practices for agentic coding ↩︎
Noted in Anthropic’s guide on how to use extended thinking ↩︎
Google’s Gemini API docs on thinking ↩︎
Vertex AI docs on thinking ↩︎