The thinking lever

Explore how Claude leverages test time compute to enhance reasoning and solve complex software engineering tasks with scalable effort levels.

Key Takeaways

Test time compute scaling enhances model reasoning and output quality.
Effort levels let users trade off between speed, cost, and quality.
Claude uses distinct token types to reason, interact with tools, and communicate.
Token budgets help manage costs and control computation time.
Larger models with higher effort produce better results but require more tokens.

What the video covers

Claude uses test time compute to improve problem-solving by scaling the amount of compute spent during inference.
Increasing effort levels allows Claude to spend more tokens and time, resulting in higher quality outputs.
The video demonstrates this with a traffic simulation example using the Opus 4.7 model at different effort settings.
Three types of tokens are explained: thinking tokens (internal reasoning), tool calling tokens (interfacing with external tools), and text tokens (user communication).
Users can control Claude’s behavior through effort dials and token budgets to balance quality, cost, and response time.
Scaling test time compute benefits not only software engineering but also other knowledge work domains.
Higher effort levels produce more realistic and complex results, such as improved traffic patterns and graphics in the simulation.
Claude intelligently allocates tokens to maximize outcomes within user-defined constraints.
The video provides best practices for selecting effort levels and model sizes based on use case needs.
Future scaling may allow Claude to work on problems for extended periods, from hours to even years.

Chapters

What is test time compute and why is it important?

Test time compute refers to the amount of computational effort a model spends during inference to solve a problem. Increasing it allows models like Claude to reason more deeply and produce higher quality results.

How does Claude use tokens during problem solving?

Claude spends three types of tokens: thinking tokens for internal reasoning, tool calling tokens to interact with external tools, and text tokens to communicate with users.
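The three token types can be pictured as a simple per-kind tally. The sketch below is illustrative only: `TokenKind`, `tally_tokens`, and the counts in `trace` are hypothetical names and made-up numbers, not part of any Anthropic API.

```python
from collections import Counter
from enum import Enum


class TokenKind(Enum):
    THINKING = "thinking"    # internal reasoning
    TOOL_CALL = "tool_call"  # interfacing with external tools
    TEXT = "text"            # communication with the user


def tally_tokens(events):
    """Sum token counts per kind from (TokenKind, count) pairs."""
    totals = Counter()
    for kind, count in events:
        totals[kind] += count
    return totals


# Hypothetical trace of one agentic turn; the counts are invented.
trace = [
    (TokenKind.THINKING, 1200),
    (TokenKind.TOOL_CALL, 300),
    (TokenKind.THINKING, 400),
    (TokenKind.TEXT, 250),
]
totals = tally_tokens(trace)
total_output = sum(totals.values())
print(totals[TokenKind.THINKING], total_output)
```

Thinking appears twice in the trace because reasoning can happen between tool calls, so the tally groups by kind rather than by position.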

How can users control Claude's performance and cost?

Users can adjust the effort dial to balance time, cost, and quality, and set token budgets to limit how many tokens Claude spends before checking in.
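As a rough illustration of the budget idea, here is a client-side sketch that stops before a step would push spending past an upper bound, at which point the caller would check in with the user. `TaskBudget` and `run_until_check_in` are hypothetical helpers written for this example; they do not reproduce how the task budgets feature is implemented, and the step costs are invented.

```python
from dataclasses import dataclass


@dataclass
class TaskBudget:
    """Hypothetical client-side tracker; not an Anthropic API object."""
    max_tokens: int
    spent: int = 0

    def can_afford(self, tokens: int) -> bool:
        return self.spent + tokens <= self.max_tokens

    def record(self, tokens: int) -> None:
        self.spent += tokens

    @property
    def remaining(self) -> int:
        return self.max_tokens - self.spent


def run_until_check_in(step_costs, budget):
    """Run steps in order; stop before the first step that would exceed the budget."""
    completed = 0
    for cost in step_costs:
        if not budget.can_afford(cost):
            break  # time to stop and check in with the user
        budget.record(cost)
        completed += 1
    return completed


budget = TaskBudget(max_tokens=100_000)
steps_done = run_until_check_in([40_000, 35_000, 30_000, 20_000], budget)
print(steps_done, budget.spent, budget.remaining)
```

The same shape would work for a time or cost budget by swapping the unit being tracked.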

Full Transcript

Speaker A

Falls down. All right. Hello everyone, and welcome. My name is Matt Bleifer. I am a product manager on the Anthropic research team, and today I will be sharing a little bit about how Claude leverages compute at inference time, otherwise known as test time compute, in order to break down and solve some of your hardest software engineering challenges.

Speaker A

Along the way, I'll share a little bit about what levers you have at your disposal in order to influence how Claude spends tokens. I will also share some best practices to help you be able to get the most out of it.

Speaker A

So, one of the key developments in large language models over the last couple of years has been the scaling of test time compute, creating something that we've all come to know as reasoning models.

Speaker A

Similar to how we can scale compute at training time by training bigger models over longer time horizons using more data, we can also scale compute at test time by allowing those models to spend more time working on a problem. So if

Speaker A

you look at this graph on the left, you can see that when we move from Haiku to Sonnet to Opus, as the model gets more intelligent, it's able to get a better score on our agentic coding evaluation.

Speaker A

And then similarly in the graph on the right, as that same model, Opus, actually just spends more time working on a problem, it's able to correspondingly get better and better scores. This is what we mean by scaling test time

Speaker A

compute. Now, this isn't true just of software engineering. It's really true of a whole variety of knowledge work domains.

Speaker A

Whether it's agentic search, computer use, or PhD-level academic reasoning, if we can allow models to spend more time working on a problem, they can achieve better and better results.

Speaker A

So looking at a bunch of charts and graphs is always great for understanding the data and the different correlations.

Speaker A

But nothing really beats seeing a tangible example of what this looks like in practice. And so what I went ahead and did is I ran Opus 4.7 on a few different effort levels, scaling the amount of time that it works on a given

Speaker A

prompt, where in this case I asked it to create a realistic simulation of cars going down a one-way street at a traffic light.

Speaker A

So the first result we have here is Opus 4.7 running on low effort. You can see in the simulation, it took about 50 seconds to produce a result for us and worked for about 4,600 output tokens. And I'd say it accomplished something that

Speaker A

was fairly reasonable. We do in fact have cars going down a one-way street. They are stopping at the traffic light, but overall it's a pretty basic simulation. The traffic flow is fairly basic. The graphics are limited. And

Speaker A

for some reason, Claude thought it would be a great idea to put the traffic light right in the middle of the road, which maybe wasn't the best design decision, but we will still call it functionally passing.

Speaker A

So, the next thing I did here is I went ahead and I cranked that effort dial up a bit. So, when I moved effort up to high for Opus 4.7, it took about twice the time working on our traffic

Speaker A

simulation and double the output tokens. But as you can see, it was able to achieve a better result. It has cars of different types. It smartly moved the traffic light over to the side of the road. And Opus told me that it even

Speaker A

implemented what it called an intelligent driver model where every car would more uniquely respond to the dynamics of the car around it, doing a better job simulating a realistic traffic pattern. So again, twice the amount of time, better

Speaker A

results. Now, the last thing that I did here is I cranked that effort dial all the way up to max. And in this setting, Opus 4.7 took 10 times the amount of time that it did when executing this same prompt on low

Speaker A

effort and 10 times the amount of tokens. But as you can see, it was able to achieve the best results yet. We have the best graphics, my favorite traffic light of all of them, and really realistic driving patterns.

Speaker A

This is all an example of how when you allow Claude, even on the same model, to just spend more time working on a problem, it can get better results.
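The demo figures can be put side by side with a little arithmetic. The talk cites about 50 seconds and about 4,600 output tokens at low effort, roughly twice that at high effort, and roughly 10 times that at max. The sketch below just applies those approximate multipliers; the constant and function names are illustrative, and the results are estimates, not measurements.

```python
# Approximate figures quoted in the talk for the traffic simulation prompt.
LOW_SECONDS = 50
LOW_OUTPUT_TOKENS = 4_600

# Rough multipliers relative to low effort, as described by the speaker.
MULTIPLIERS = {"low": 1, "high": 2, "max": 10}


def estimate(effort):
    """Return (seconds, output_tokens) implied by the quoted multipliers."""
    factor = MULTIPLIERS[effort]
    return LOW_SECONDS * factor, LOW_OUTPUT_TOKENS * factor


estimates = {effort: estimate(effort) for effort in MULTIPLIERS}
for effort, (seconds, tokens) in estimates.items():
    print(f"{effort:>4}: ~{seconds} s, ~{tokens} output tokens")
```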

Speaker A

As we continue to scale test time compute further and further, Claude isn't just going to work for seconds or minutes or hours on a problem, it's going to work for days, weeks, months, even years spending tokens to try to

Speaker A

solve some of humanity's toughest challenges. So when I talk about test time compute, I really mean any form of Claude spending tokens at inference time in order to solve your problem. However, we can break these token types down into

Speaker A

three kind of distinct buckets. The first bucket that we have is thinking tokens. This is the classic form of tokens that underlies what we know as reasoning models. Thinking tokens represent Claude's internal monologue. It's Claude's space to reason step by step, to

Speaker A

consider different potential options, do some chain of thought reasoning, create a scratch pad where it needs to work through a problem, and ultimately spend time thinking through what it needs to do in order to take the best actions and

Speaker A

deliver the best results. The second form of tokens that Claude can spend when taking on a task is tool calling tokens. Tool calling is Claude's way of interfacing with the rest of the world. Whether it's using tool calls to

Speaker A

execute a search in this example, giving me more information about the Code with Claude conference, or reading and writing files in order to build out software engineering projects. There are really millions of different tools that Claude can call, but in all of these

Speaker A

scenarios, tool calling tokens are Claude's way of interfacing with its environment. The last type of tokens that Claude can spend is text. And this is Claude's way of interacting with you. Whether it needs to give you updates as it's

Speaker A

working on a really, really tough problem and let you know how it's progressing, give you a summary at the end to explain all of the things that it did in response to the tough task that you gave it, or simply just responding to a

Speaker A

simple question that you have. Text tokens are Claude's way of communicating with an end user.

Speaker A

So again, we have three different types of tokens: thinking, tool calling, and text. And all three of these we think of as really fundamental to the way in which Claude works and the way in which Claude responds to problems.

Speaker A

But all of these tokens that we're spending have really direct costs to users in the form of both practical token costs that we pay for, as well as waiting time. When Claude spends more tokens, it means that we as users have

Speaker A

to wait longer for our result. And so we think it's really important that we give users the ability to influence or constrain how Claude spends tokens. Using Claude, users can express their preferences and constraints in a couple of ways. The first way is with that effort

Speaker A

dial that I talked about. Effort is a way for you to tell Claude how you want it to trade off time, cost, and quality when responding to your task.

Speaker A

Should Claude spend more time in order to get a better answer? Should it spend less time in order to get a faster answer? These are all preferences that you can give to Claude as a user that it will take into account when it goes and

Speaker A

spends these tokens. Another form of constraints that users can provide is in the form of budgets.

Speaker A

Recently, we launched a feature that we call task budgets, which allows you to tell Claude an upper bound of the amount of tokens that it will spend when working on a task. So, you might say, "Hey, I want you to build out this

Speaker A

particular software engineering feature for me, but I don't want you to spend more than 100,000 tokens before you stop and check in with me as a user." Budgets could come in the form of tokens, but they could also come in the

Speaker A

form of time or cost. And I think this will get increasingly important as we continue to move up that exponential and Claude is working for days, weeks, months, or more to be able to give some

Speaker A

on a particular problem before it stops to check in. Now given all of these preferences and constraints, it's really up to Claude to figure out the best way to spend those tokens in order to maximize outcome. So given the

Speaker A

When reasoning models were first introduced, they followed a really specific pattern in terms of how to spend these tokens. The first thing they would do is they would think and they would spend those thinking tokens to work through a problem. And then after

Speaker A

We improved on this when we introduced interleaved thinking, which allowed Claude to actually use thinking and reasoning in between tool calls.

Speaker A

Recently, we launched adaptive thinking, and adaptive thinking is the next evolution on top of interleaved thinking.

Speaker A

tool use, and text in whatever order is needed in order to best meet the requirements of your task.

Speaker A

and so forth all the way until it provides that final answer of the work that it did.

Speaker A

were to survey someone in the crowd here and say what is 2 plus 2 and I ask them to spend a little bit of time on the problem versus a lot of time on the problem. You're roughly

Speaker A

effort there could be quite dramatic. Now, adaptive thinking is not a model router and it's not an automated thinking toggle. So, it's not taking your query, classifying it based off of difficulty, and figuring out whether it should use a thinking version of the

Speaker A

problem. It's really about Claude having the option to think at every single step of the process.

Speaker A

So I want to dig a little bit more into effort and contrast it to the ways in which we've used thinking in the past.

Speaker A

spend more time working in order to give you a better result. That's a pretty reasonable instinct.

Speaker A

constraining how it's allowed to work, not how hard you want it to work. An effort dial is a much better expression of the idea of spend more tokens in order to get a better answer.

Speaker A

based off of the problem at hand. And that's really what allows Claude to be agentic in response to your query.

Speaker A

teammates decide how hard they are going to think on that problem and what actions that they will take in response.

Speaker A

First, whenever possible, it's always best to run evals and then chart performance, where you compare on your x-axis here something like total tokens, time, or cost, and on your y-axis, performance. This allows you to create an effort curve like this and get a

Speaker A

will spend whatever tokens I need in order to get the best intelligence. Or you might say, the relative improvement in performance between extra high and max is not worth the difference in tokens that I will spend, and so extra

Speaker A

is that when using low effort, Claude is really trying to save tokens as much as possible. And so sometimes you may catch it taking shortcuts that you didn't expect it to. And so in addition to looking at evals, we always think

Speaker A

On the flip side, low efforts have also surprised us in some really interesting ways. So, one of my favorite evaluations that we've created is called Claude Plays Pokemon, where Claude gets the opportunity to work its way

Speaker A

When we ran Claude Plays Pokemon on low effort, something really interesting happened and it actually ended up treating the game much like a speedrun.

Speaker A

So, it would skip trainer battles in order to save itself some time. It would use healing items that it stocked up on instead of wasting time going back to Pokemon healing centers. And it would spam an item called a repel that would

Speaker A

limit disruptive encounters with other Pokemon, making it through caves much more quickly. And what I find most interesting about this is oftentimes we might correlate low effort with lower intelligence. But for any of us that grew up playing this

Speaker A

game, what you really realize is this is a super clever strategy. It takes a certain amount of intelligence to figure out how you might minimize token spend in order to get through these levels as fast as possible. And so it was

Speaker A

interesting to see how Claude's interpretation of low effort translated to beat the game as fast as possible, employing actually very clever strategies along the way.

Speaker A

So evals, lean on them any time that you have them. But I wanted to just give you some quick rules of thumb on how you might select an effort setting in the absence of evals, or even if you have them, this is

Speaker A

just a good way to think about things. First, max effort, no surprise, can deliver gains on your hardest tasks, but as I mentioned before, it can sometimes show signs of diminishing returns. I recommend testing it for your most

Speaker A

intelligence-demanding use cases, but don't always assume that this is going to be either the ceiling on performance or really the best bang for your buck. It could be the case that a level down is going to give you roughly

Speaker A

equivalent performance at a real fraction of the cost. Extra high effort is a new setting that we introduced with Claude Opus 4.7 and we found this to be the best setting for most coding and agentic use cases. This

Speaker A

is currently our default in Claude Code and claude.ai for Opus 4.7. And like I said, it really does a good job maximizing intelligence without kind of going overboard.

Speaker A

High effort is a great setting if you're trying to balance token usage and intelligence. And this is probably the value that I would recommend for any intelligence-sensitive use case. High is a good place to start and test up from

Speaker A

there. Medium is good for cost-sensitive use cases where you're willing to trade a little bit of intelligence in order to get a much faster result. And then low is good for reserving for kind of your short scope tasks and latency

Speaker A

sensitive workloads. However, as I mentioned before, it's always good to just put it in practice and see what actually happens because it might surprise you.
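These rules of thumb can be summarized as a small lookup. This is only a sketch of the guidance as spoken: the use-case keys and the effort labels (including `extra_high`) are hypothetical strings chosen for this example, not confirmed API values, and the fallback to `high` reflects the "good place to start" remark.

```python
# Rules of thumb from the talk, keyed by a hypothetical use-case label.
RULES_OF_THUMB = {
    "hardest_tasks": "max",                # test it; returns may diminish
    "coding_or_agentic": "extra_high",     # described as best for most coding work
    "intelligence_sensitive": "high",      # balances token usage and intelligence
    "cost_sensitive": "medium",            # trades some intelligence for speed
    "short_or_latency_sensitive": "low",   # short scope, latency-sensitive work
}


def suggest_effort(use_case, default="high"):
    """Return a starting effort label; evals should confirm the choice."""
    return RULES_OF_THUMB.get(use_case, default)


print(suggest_effort("coding_or_agentic"))
print(suggest_effort("something_else"))
```

A lookup like this is only a starting point; the talk's advice is to try the setting in practice and adjust from there.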

Speaker A

I mentioned at the start of the talk that test time compute is a second way of scaling intelligence as compared to training time compute. So it kind of begs the question if both give similar trade-offs with respect to performance,

Speaker A

speed and cost, when should I use a smaller model or when should I use a lower effort level on a bigger model?

Speaker A

As some quick guidelines, I would say, first, low effort on a bigger model is good for an intelligence-demanding use case where you're trying to optimize for speed.

Speaker A

Going back to our example here of our traffic light simulation, you can see that Opus 4.7 on low effort spent about the same amount of output tokens and only took a little bit longer than Haiku 4.5 on max effort, but I would say it

Speaker A

achieved a much better result. So often the low effort on the larger, more intelligent model can give you a better bang for your buck when trading off speed versus intelligence on an intelligence-demanding use case.

Speaker A

On the flip side, smaller models can be really good if you're trying to optimize cost and your use case is not too intelligence-demanding. So if you have some simpler LLM tasks, especially if you need to do them in bulk,

Speaker A

something like classification, information extraction, basic summarization, that's where small models are going to come in handy, and they're going to be able to save you a lot of cost when you don't need peak intelligence.

Speaker A

Another case where small models are really useful is if your application demands a really low time to first token. So if you want Claude to give responses as fast as possible in response to a user query, the nature of

Speaker A

the smaller models means that they will produce tokens oftentimes much sooner and give you a better time to first token. The way that I think about this is use small models for a fast time to first token. Use bigger models at

Speaker A

lower effort for a fast time to last token. Wherever possible, as I said before, I recommend evaluating both. It's good to build these eval curves across a few different model types and across various effort levels and then look at what the

Speaker A

trade-offs give you for your use case that you're trying to optimize. All right, so before closing the talk, I wanted to just summarize three key actionable items that I hope that you take away from this. One, enable thinking

Speaker A

whenever possible in order to give Claude that space to reason. Thinking, like I said, is really core to how Claude works and gives it that space and that inner monologue to be able to work through your problem as efficiently as

Speaker A

possible. If you want to modulate the amount of time that Claude is spending thinking on your problem, I recommend using effort levels or budgets as your way of influencing Claude's behavior.

Speaker A

Second, I might sound like a broken record here, but if you have evals, use them. Use that to find your ideal balance. Chart your curves. Test on different effort settings. Test with different budgets. Test with different models. Look at what the performance

Speaker A

gives you and decide what makes sense for your use case without forgetting to always dig in and read those transcripts.
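Charting an effort curve and reading off the balance point can be sketched in a few lines. Everything here is an assumption for illustration: the token counts and scores in `curve` are invented, the effort labels are hypothetical strings, and "cheapest setting within a tolerance of the best score" is just one reasonable way to decide that extra tokens are no longer worth it.

```python
def marginal_gains(points):
    """Score gained per extra 1,000 tokens between consecutive settings.

    points: list of (label, total_tokens, score) tuples.
    """
    ordered = sorted(points, key=lambda p: p[1])
    return [
        (nxt[0], (nxt[2] - prev[2]) / ((nxt[1] - prev[1]) / 1000))
        for prev, nxt in zip(ordered, ordered[1:])
    ]


def cheapest_within(points, tolerance):
    """Label of the cheapest setting scoring within `tolerance` of the best."""
    best = max(score for _, _, score in points)
    good_enough = [p for p in points if best - p[2] <= tolerance]
    return min(good_enough, key=lambda p: p[1])[0]


# Invented eval results: (effort label, total tokens, score in percent).
curve = [
    ("low", 5_000, 60),
    ("medium", 9_000, 68),
    ("high", 15_000, 74),
    ("extra_high", 28_000, 79),
    ("max", 50_000, 80),
]
gains = marginal_gains(curve)
choice = cheapest_within(curve, tolerance=2)
print(gains[0], choice)
```

With these made-up numbers the last step up costs many more tokens for one extra point, which is the diminishing-returns pattern the talk warns about. The same functions could be run per model to compare curves across model sizes.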

Speaker A

And lastly, if you're not going to do any of that and you just need to make a choice and you're working on anything coding and software engineering related, my advice would be go with extra high.

Speaker A

It's a pretty good setting and gives a great bang for your buck while delivering great intelligence.

Speaker A

Our north star for Claude overall is that it allocates compute incredibly well when asked for it and that you can set a quality bar and a budget and Claude will just go ahead and figure out the rest and give you the best performance for

Speaker A

your use case. Adaptive thinking and effort levels and budgets are all a step in this direction. But they're really just the beginning. And there's a lot more to come. I'm excited to share more with you in the future. So stay tuned

Speaker A

and thanks so much for taking the time. If you want to chat more about this, I'll be around the conference in the audience. I'm always happy to nerd out about these things. So thank you.

Topics: Claude, test time compute, reasoning models, large language models, software engineering, token budgeting, effort levels, Anthropic, Opus 4.7, AI inference
