The thinking lever

Alexander Bricken from Anthropic explains how Claude uses test time compute to improve reasoning and performance by spending more tokens.

Ask about this video. Answers come from its transcript only — with the timestamp, so you can check them.

Generated from the transcript and can be wrong — check the timestamp.

Key Takeaways

Test time compute is crucial for improving Claude's reasoning by allowing it to spend more tokens thinking through problems.
Increasing token usage at inference time leads to better performance across various complex benchmarks.
Claude can adjust its effort level dynamically, trading off between latency and intelligence.
Using tools and external resources at test time enhances Claude's problem-solving capabilities.
Balancing token count and compute time is key to optimizing Claude's effectiveness in real-world applications.

What the video covers

Alexander Bricken from Anthropic discusses the concept of the thinking lever in Claude, focusing on test time compute.
Test time compute involves using more tokens at inference time to enhance Claude's reasoning and problem-solving abilities.
Increasing model size and token usage leads to improved performance across benchmarks like agentic coding and PhD-level tests.
Claude's simulation of cars on a one-way street demonstrates how higher token usage results in more realistic and intelligent outcomes.
Different effort levels (low, high, max) control the amount of tokens and compute Claude uses, balancing latency and intelligence.
Test time compute can involve various tools and interactions, such as searching or calling external APIs, allowing Claude to reason about when to use them.
Performance improvements from test time compute are analogous to scaling model size and training compute.
Claude can dynamically decide how much effort to put into a task, optimizing token usage for better results.
The video includes real-time examples and benchmarks to illustrate the impact of test time compute on Claude's capabilities.
The default setting for Claude balances token usage and latency to achieve efficient and intelligent responses.

Chapters

What is test time compute in the context of Claude?

Test time compute refers to the tokens and compute resources Claude uses at inference time to think through problems more thoroughly, improving its reasoning and performance.

How does increasing token usage affect Claude's performance?

Spending more tokens allows Claude to perform more detailed reasoning, which leads to higher accuracy and better outcomes on complex benchmarks and tasks.

What are the different effort levels mentioned for Claude?

Claude can operate at low, high, or max effort levels, which control how many tokens and how much compute time it uses, balancing speed and intelligence.

Full Transcript — Download SRT & Markdown

Speaker A

Good to see you all today. I'm happy to have so many Claude lovers in one room. My name is Alexander Bricken, and I'm on the Applied AI research team here at Anthropic. Today, we're going to be talking about the thinking lever.

Speaker A

Specifically, we're going to talk about how Claude leverages compute at runtime, at inference time, typically called test time compute to make more effective use of tokens in solving some of the hardest problems that Claude has in front of it.

Speaker A

Specifically, we're going to talk about how Claude leverages compute at runtime, at inference time, typically called test time compute, to make more effective use of tokens in solving some of the hardest problems that Claude has in front of it.

Speaker A

So looking back a couple years, one of the key developments in large language models has been this idea of reasoning models, which is using test time compute to spend more tokens for a model to become more efficient at answering a

Speaker A

I'm going to share some of the best practices when it comes to using different levers and using different tokens to essentially try to solve those problems better. Hopefully, you'll learn something as well.

Speaker A

So as you can see here on the left we have our different models in our typical uh range from haiku sonnet to opus and as you increase the size of the model or the number of um parameters you can see

Speaker A

Looking back a couple of years, one of the key developments in large language models has been this idea of reasoning models, which is using test time compute to spend more tokens for a model to become more efficient at answering a question.

Speaker A

Equally on the right hand side here we have a logarithmic axis on the on the x- axis and you can see that as claude spends more tokens we see the actual performance increase as well and so both of these the max on the right and the

Speaker A

Similar to how we can scale model performance at training time, such as train time compute, test time compute also results in higher intelligence results.

Speaker A

or humanity's last last exam which is a PhD level series of test cases. In all of those results we see that the model becomes better at producing outcomes when it uses more tokens to think through the problem before answering the

Speaker A

As you can see here on the left, we have our different models in our typical range from Haiku, Sonnet to Opus. As you increase the size of the model or the number of parameters, you can see that the performance increases up to nearly roughly below 80% for an internal agentic coding benchmark that we run.

Speaker A

run at three different levels of effort for Claude. Low, high, and max. And I'll show you how the performance increases as Claude spends more tokens. So this is our prompt. Creating a realistic simulation of cars on a one-way street

Speaker A

Equally, on the right-hand side here, we have a logarithmic axis on the x-axis, and you can see that as Claude spends more tokens, we see the actual performance increase as well. Both of these, the max on the right and the result that you're seeing here on the left, are actually the same score.

Speaker A

it's actually quite a good simulation. We have a one-way road. The cars are on two lanes. Um, they're pulling up to the traffic light, stopping at a regular cadence, and then I think we have a few kind of adjustments we can make in this

Speaker A

This is actually true of every knowledge domain work. So you take a reasoning problem in DARC QA, which is a benchmark computer use through OS world, or humanity's last exam, which is a PhD-level series of test cases.

Speaker A

And so, as you can see, the the cars are kind of just moving through and then stop when the the light becomes yellow.

Speaker A

In all of those results, we see that the model becomes better at producing outcomes when it uses more tokens to think through the problem before answering the questions.

Speaker A

Now, you can see on the left, Claude is actually spending double the amount of time it takes to run this sim uh to create this simulation and roughly double the amount of tokens. And I would say that this simulation's a little bit

Speaker A

As a tangible example, and even though we love looking at graphs at Anthropic, I wanted to show this in real time. I have this prompt in front of us here, which is going to be run at three different levels of effort for Claude: low, high, and max.

Speaker A

isn't in the middle of the road. Like if we flip back to the previous example, you'll see that the traffic light was positioned in the middle road, which makes absolutely no sense. whereas now cla to itself, okay, I should probably

Speaker A

I'll show you how the performance increases as Claude spends more tokens. So this is our prompt: creating a realistic simulation of cars on a one-way street at a traffic light. Note the one-way street and the traffic light.

Speaker A

hey, I've actually worked a little bit through making the drivers a bit more intelligent. So depending on how the cars move, the cars around it also react, which is a more intelligent simulation than than the previous version.

Speaker A

Okay, so our first one, low effort on Opus 47, took roughly 50 seconds, and there were roughly 4,600 output tokens. As you can see, it's actually quite a good simulation.

Speaker A

Arguably, this is the best traffic light we've seen. I like that it's sort of up upside down hanging following the laws of physics. We also have this beautiful skyscape in the back. Um, and the cars also reflect this more intelligent

Speaker A

We have a one-way road. The cars are on two lanes. They're pulling up to the traffic light, stopping at a regular cadence, and then I think we have a few kind of adjustments we can make in this simulation to change the spawn rate of cars or how often it turns red or green.

Speaker A

weeks or months of work. And so this is the meter benchmark. We're showing that over generations of models and this is a combination of both train time compute, so larger models, as well as better test time compute, so spending more tokens on

Speaker A

As you can see, the cars are kind of just moving through and then stop when the light becomes yellow.

Speaker A

human work to a 50% uh level of accuracy. Now test time compute can be any form of spending tokens.

Speaker A

Cool. It's quite simplistic, though. Now, let's move over to high effort. Let's turn that effort level up.

Speaker A

The first way is thinking space for reasoning. It's basically a scratch where Claude considers the question that was asked of it, uses whatever data it has available to it in the prompt and thinks about the next steps it should take to solve a problem.

Speaker A

Now, you can see on the left, Claude is actually spending double the amount of time it takes to run this sim to create this simulation and roughly double the amount of tokens.

Speaker A

really be anything though. It's, you know, 1 million types of tools. Interact with your Salesforce, call the MCP server, even write into a file system.

Speaker A

I would say that this simulation's a little bit more detailed. We now have the same one-way road, different types of vehicles showing up. There's a few lorries in there for the Brits out in the crowd.

Speaker A

question up front in its in its response to gather more information from the user. Test time compute has direct costs in the form of tokens token count and time that it takes. And so naturally your might your mind might be coming to

Speaker A

As well as that, there's a traffic light, but the traffic light isn't in the middle of the road. If we flip back to the previous example, you'll see that the traffic light was positioned in the middle of the road, which makes absolutely no sense.

Speaker A

compute that cla to max and depending on the effort you assign model it will work for a longer amount of time and spend more tokens. So often you're kind of asking yourself the question of okay do I trade intelligence

Speaker A

Whereas now Claude to itself, okay, I should probably position it sort of overhanging the road, but it's sort of upside down, which I don't love.

Speaker A

spend the rest of the session elaborating a little more on effort. Now the ideal state is you ask cloud a question and it knows how much effort it should put into it. Uh but humans always want to have that additional lever they

Speaker A

However, I would say it's a more complex simulation. One of the things when we ran this prompt as well that Opus said in this version is, "Hey, I've actually worked a little bit through making the drivers a bit more intelligent."

Speaker A

allocated to thinking and then it would execute a series of tool calls reading each one until the output was formulated and then you get the response. Now if you think of the analogy of like how humans work typically we don't do that.

Speaker A

So depending on how the cars move, the cars around it also react, which is a more intelligent simulation than the previous version.

Speaker A

Right? Instead, which is how we resulted in developing interled thinking, you do something, you think about it, you do another thing, you think about it, and then you come back with a result. And that's exactly what interle thinking does. So it allows Claude to

Speaker A

Finally, we ran Claude Opus 47 on max effort. This is using roughly 10 times the number of tokens and time to run to create this simulation.

Speaker A

depending on the question at hand claude will choose to call either a tool call output some text like that question I was referring to earlier or even think in it in whatever order it likes. And so looking back to the analogy I was

Speaker A

As you can see, it's much more detailed. Arguably, this is the best traffic light we've seen. I like that it's sort of upside down hanging, following the laws of physics.

Speaker A

not thinking in that instance. If I'm doing an academic problem set, though, I am probably thinking at every step of the process.

Speaker A

We also have this beautiful skyscape in the back. The cars also reflect this more intelligent motion of vehicles.

Speaker A

what is 10 + 10? You'd immediately spawn respond with 20. Whereas, if I asked you, you know, work through this really difficult PhD level problem set, you'd probably think a lot, but different members of the audience here might think

Speaker A

So, what does this mean? Well, arguably, the more tokens it spends, the more time it takes.

Speaker A

Adaptive thinking isn't a model reader. We're not classifying the request that comes through the door. Instead, it's actually telling Claude, "Hey, you have this thinking tool.

Speaker A

Over time, we might see Claude eventually go from seconds, minutes, or hours of work to even days, weeks, or months of work.

Speaker A

to interle thinking the former way in which we served our models. So historically users had thought about thinking as this effort dialer you can turn on thinking for a better answer.

Speaker A

This is the meter benchmark. We're showing that over generations of models, and this is a combination of both train time compute, so larger models, as well as better test time compute, so spending more tokens on our higher reasoning models.

Speaker A

You're not expressing how hard you want Claude to think when you turn a thinking toggle on or off. You're actually just turning off a core capability. As I mentioned, there's these three capabilities: thinking, tool calling, and text. And so when you turned

Speaker A

We see that Claude is able to work more autonomously to cover human-level tasks to a higher degree of hours.

Speaker A

analogy with tool use, we don't tell Claude to either never search or always search the web. We just give Claude a search tool and allow it to reason as to when it should search.

Speaker A

Mythos, which is one of our latest models, works to an extent of roughly 16 hours of human work to a 50% level of accuracy.

Speaker A

Don't think about this problem set. Just come up with an answer for me. I'll tell you the constraints of the problem, some sort of knowledge worker task. You'll go off and execute on it pending who you are and what context you have, and then

Speaker A

Now, test time compute can be any form of spending tokens.

Speaker A

Ideally, we want Claude to work in the same way. So I want to dig a little bit more into some of the best practices around using effort and this graph is an articulation as we saw before of effort levels increasing

Speaker A

This is typically at inference time, and there are three ways in which we like to break this down.

Speaker A

challenges that you're proposing to the model and evaluating at different effort levels how well the model performs is one of the best ways to just figure out what e effort level you should start with. Now one of the things you might

Speaker A

The first way is thinking space for reasoning. It's basically a scratch where Claude considers the question that was asked of it, uses whatever data it has available to it in the prompt, and thinks about the next steps it should take to solve a problem.

Speaker A

in performance. Now loweffort levels I would say can accomplish a lot of things but you're often trading off intelligence for speed. And so sometimes you might want to think about loweffort things as things that aren't intelligencebound.

Speaker A

Equally, there's tool calling right after. This is Claude's interface with the outside world.

Speaker A

be familiar with Claude plays Pokémon. It's probably my favorite eval. This eval is we put Claude into Pokemon Red and we gave it access to tools to trigger buttons on like a Game Boy for example and we gave it vision over the

Speaker A

In this example, we're asking Claude to do a web search, learning more about the Anthropic Developer Conference. Funnily enough, we're all here right now.

Speaker A

is it would run all sorts of mechanisms like using repels, using potions to have to avoid going back to the Pokemon center, using escape routes to get out of caves quickly, running away from poking Pokemon battles whenever encountered one in the in the shrubs.

Speaker A

Tools can really be anything though. It's, you know, a million types of tools. Interact with your Salesforce, call the MCP server, even write into a file system.

Speaker A

doing loweffort because you're explicitly constraining how much the model is thinking through the problem set and maybe it does end up in really unique attractor states. Now while evals are always ideal, I understand they're quite hard. I speak to customers a lot

Speaker A

All of those things are tool calls.

Speaker A

I mentioned, max effort can typically deliver gains on the hardest tasks, but they might show diminishing marginal returns. And so, I wouldn't recommend starting here unless you absolutely know that the intelligence of your use case is necessary. You know, the problem set

Speaker A

Finally, there's text. This is the output that Claude makes whenever you ask a question of it, and it responds with something.

Speaker A

We would argue that this is one of the best trade-offs between intelligence, speed, and number of tokens. As you move down to high, this is a still a good balance of token usage and intelligence.

Speaker A

It might be a summary of all the work it did. It might be a question up front in its response to gather more information from the user.

Speaker A

Medium and low are ways to just toggle down that amount of tokens used. And as a result of that at low as I mentioned you're really looking at latency sensitive use cases where maybe it's classification summarization or data

Speaker A

Test time compute has direct costs in the form of tokens, token count, and time that it takes.

Speaker A

do I know whether or not I should use a really small model like Haiku and make that effort level really high versus having a really big model and making the effort level really low? Like what are the differences there? So I want to give

Speaker A

Naturally, your mind might be coming to the conclusion of, "Hey, wait a minute, as a user, I want more control over what Claude actually does on a day-to-day basis."

Speaker A

of time developing the simulation but the same number of tokens and I would say the result is not nearly as good. I don't even know if those are cars to be honest. Um, and so the conclusion that we come to here is arguably if the if

Speaker A

There are essentially two ways in which users can change the number or the amount of test time compute that Claude maxes.

Speaker A

Now, the way that you should think about using smaller models though is these low intelligence use cases where you're not really caring as much about the outcome because the outcome is so simplistic that you know cla evals to figure out what the best way to

Speaker A

Depending on the effort you assign, the model will work for a longer amount of time and spend more tokens.

Speaker A

thumb in a second. You should enable Claude that space to reason. give it the scratch pad so it knows that it can use that thinking tool when it needs to. You can control that length through the effort levels that I described.

Speaker A

Often, you're kind of asking yourself the question of, "Okay, do I trade intelligence off for speed?"

Speaker A

And then finally, when in doubt, go with extra high. It's the default that we've set for our products and I would argue that it's a great kind of purto efficient outcome between latency and number of tokens and intelligence. The

Speaker A

Secondly, we have budgets. Budgets are basically a way of assigning more strict constraints to the way Claude works.

Speaker A

Claude, "Hey, I'm only going to spend this amount on whatever you do." or hey only take this long a week to do it and then eventually claude just knows how to allocate that compute appropriately. So it knows how many tokens it should spend

Speaker A

That might be through max token constraints or what we call task budgets, which is another feature in our API.

Speaker A

to solve them. Thanks so much for listening. I'm going to be around the conference if anyone has any questions and uh hope you enjoy the rest of the conference.

Topics:ClaudeAnthropictest time computethinking leverlarge language modelstoken usageAI reasoningmodel performanceinference timesimulation

Get More with the SozAI App

Transcribe recordings, audio files, and YouTube videos — with AI summaries, speaker detection, and unlimited transcriptions.

App Store Google Play

Or transcribe another YouTube video here →

Free tools: TXT to SRT · SRT Validator · Merge SRT · Subtitle to Text · All tools