Anthropic’s Big AI Design Change: The RED Pill — Transcript

Discover how Anthropic’s AI design shifts intelligence from LLMs to scaffolding, boosting accuracy from 21% to 95% with deterministic workflows.

Key Takeaways

  • Intelligence in AI systems is shifting from pure LLMs to integrated scaffolding structures.
  • Anthropic’s Claude LLM has limited standalone accuracy but improves significantly with scaffolded skills.
  • Deterministic workflows and semantic layers are key to achieving high accuracy in data analytics AI.
  • Continuous validation and governance are essential to maintain AI system reliability over time.
  • This hybrid approach challenges traditional AI business models centered solely on LLM performance.

Summary

  • Anthropic’s latest publication reveals a major AI design shift: intelligence now resides in the scaffold, not just the large language model (LLM).
  • The scaffold includes domain ontology, workflow rules, data/tools semantic layers, validation, adversarial reviews, governance, feedback, ownership, versioning, and learning loops.
  • Anthropic’s Claude LLM alone achieves only about 21% accuracy on analytical questions according to their benchmarks.
  • By building a skill manifold on top of the scaffold, Anthropic claims to increase accuracy to 95% in aggregate.
  • The scaffold uses deterministic workflows, semantic layers, and single sources of truth to guide query execution and data analytics.
  • Anthropic identifies three main failure modes: incorrect field selection, stale knowledge causing subtle errors, and failure to find relevant data.
  • The scaffold includes detailed reference documentation for data tables, filters, keys, and usage instructions to ensure correct data retrieval.
  • The design emphasizes maintaining accuracy through continuous validation, governance, and feedback loops to prevent data and model rot.
  • The video critiques Anthropic’s business model, suggesting that reliance on scaffolding rather than pure LLM intelligence could undermine the value of the LLM itself.
  • Overall, the approach is a hybrid system combining deterministic logic with LLM capabilities to improve reliability and accuracy in AI-driven analytics.

Full Transcript

00:00
Speaker A
Hello, community. So great to be back. Yes, today we talk about it: the intelligence is no longer in the large language model itself. Today it is in the scaffold. And I will give you a proof, and the proof is the latest publication by Anthropic, by Claude. So, let's have a look. Let's have fun. You see here our LLM, the core capability of probabilistic reasoning, and then you see here the mighty scaffold: the domain ontology, the workflow rules, the data and the tools and semantic layer, the validation and evaluation, adversarial reviews, the governance, the feedback, the ownership, the versioning, the learning loop. My goodness. So, let's have a look. I had a feeling, and I wanted to build with you together here a personalized skill, and I wanted to show you that there's something odd with this, no? And we have all the scripts, and beautiful, but you know what?
00:14
Speaker A
It just happened, and here we have two inserts. So, yesterday Anthropic published here this beautiful paper, "How Anthropic enables the self-service data analytics with Claude." And you know what? This is absolutely beautiful. Now, I know that it was written by AI, but now I, as a human, reflect on this, and I will show you that Anthropic destroys its own business model.
00:29
Speaker A
So, logically this would imply that the financial value of Anthropic is approaching zero. If you think this is crazy, welcome to my video. This is exactly here the tension that we love here. So, of course, I just go with a screenshot from Claude itself. This is here the publication here. And they say beautiful. So, we have now the self-serve business insight including here why analytics accuracy is a context and a verification problem, not a code generation issue.
00:45
Speaker A
There are three dominant failure modes that cause the most errors here for Anthropic. The analytical stack that I will show you, and a basic template for how we create the majority of the skills. So, here Anthropic Claude tells you, "Hey, June 3rd, and I show you how we at Anthropic built the skills." So, I think this is the best way to start: with Anthropic, with their original information.
00:54
Speaker A
And Anthropic tells us we have identified three attributes of prompt that account for an overwhelming majority of inaccurate responses.
01:12
Speaker A
First, the Anthropic agent is unable to choose the correct field to answer a user question. And you say, "Great." And then, for the Anthropic agent Claude, the knowledge goes stale and it starts returning subtly wrong answers, which is also amazing. And then, the AI agent here by Anthropic simply does not find you the data. And you say, "Hey, that's great that Anthropic knows that those are the three main causes of errors in their system." Now, if you think this is it, no. I have here a video where I show you that the AI agent here completely rejects humans, and this goes for an Opus 4.7 and GPT 5.5. If you want to have some fun, hey, I highly recommend this video. But, let's come back to Claude.
01:19
Speaker A
So, they say, "Hey, we build now for this a gigantic analytical stack." Beautiful. And each layer exists primarily to attack one of the problems.
01:35
Speaker A
And I think this is an absolutely beautiful scientific logical approach. So, they say, "The sources of truth." And I love whenever I read the source of truth by AI shrinks here the space, the mathematical space of plausible entities, and therefore reduces here the search mathematical space. Beautiful. Second, the maintenance and a new validation process will keep everything from rotting away. And the skills will make sure the agent reliably finds and correctly uses this information, data, context for the correct answer. I mean, isn't this beautiful? So, let's come back to the skills, no? And the skills here, hm. So, the artificial intelligence of Anthropic, let's call it Claude, I don't know if it's 4.7 or 4.8. I assume it is their best LLM. They did not exceed 21% accuracy on analytical questions, on their own benchmarks. And you may say, "Hey, wait a minute.
01:51
Speaker A
You tell me, Claude, did you publish yourself that your best LLM in June 2026 has accuracy that does not exceed 21%?" And they said, "Yes, but you know what? If we build now here something that we call here a particular skill manifold here on a scaffolding, we can increase now consistently to 95% in aggregate." And you said, "That sounds interesting, so let's have a look at this." And looking at this, now and here you see a screenshot from Claude here.
02:05
Speaker A
This is here the tables. So, you have a quick reference. What does this domain mean in plain words?
02:17
Speaker A
Then you have what one row represents, and then you have a filter, then you have the dimension, the key dimension that I encoded, and how the same concept is named differently across the tables.
02:24
Speaker A
Then you have your tables, and then you have all your elements here, beautiful. And then you have instructions on when to use it, when not to use it, which join keys to use, when a filter is required, one short section per governed table, and some gotchas: the wrong answers a senior analyst would warn you about.
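The reference-doc layout described above can be pictured as a plain data structure. What follows is a minimal sketch in Python, not Anthropic's format: every field name and the example table are invented for illustration, since the video only names the sections and never shows a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableReference:
    """One governed table's entry in a reference doc (hypothetical schema)."""
    name: str
    grain: str                     # what one row represents
    required_filters: List[str]    # filters that must always be applied
    join_keys: Dict[str, str]      # other table -> key to join on
    use_when: str
    avoid_when: str
    gotchas: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the entry as markdown text that an LLM could retrieve."""
        joins = ", ".join(f"{table} on {key}" for table, key in self.join_keys.items())
        lines = [
            f"## {self.name}",
            f"One row = {self.grain}",
            "Required filters: " + ", ".join(self.required_filters),
            "Join keys: " + joins,
            f"Use when: {self.use_when}",
            f"Do not use when: {self.avoid_when}",
        ]
        lines += [f"Gotcha: {g}" for g in self.gotchas]
        return "\n".join(lines)


# Invented example table, not taken from the Anthropic post.
orders = TableReference(
    name="orders",
    grain="one order line",
    required_filters=["is_test = FALSE"],
    join_keys={"customers": "customer_id"},
    use_when="revenue questions",
    avoid_when="headcount questions",
    gotchas=["amount is stored in cents"],
)
doc = orders.render()
```

The point of the sketch is that nothing in it is probabilistic: the same table entry always renders to the same text.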
02:39
Speaker A
And you say, "Hey, great. So, we have here now the proper reference documentation written for a retrieval by an LLM." But this is something absolutely deterministic. Have you noticed this?
02:53
Speaker A
And I think this is exactly where my feeling kicks in. And here we are now, now.
03:05
Speaker A
A Claude skill to aggressively challenge all underlying assumptions on a potential final answer tells us Anthropic increased the accuracy of our LLM plus scaffolding by, and hold on to your socks, 6%.
03:13
Speaker A
Now you might say, wait a minute, 21% plus 6% is an indicator of a pure deterministic system, not any other.
03:28
Speaker A
Because yeah, if I increase here with all of this aggressively challenging here, only 6%. I mean, come on, you know what it means, now.
03:49
Speaker A
So, let's verify this with the public skill file skeleton by Claude. What follows is the skeleton of our main warehouse skill. Beautiful. Data analytics. The real file structure, internal specifics replaced by the bracket placeholder. It isn't meant to be copied verbatim, it is meant to show the kind of section we found worth writing down. So this is your original from Claude.com. Beautiful. So we have the name, we have the version, we have the description.
04:05
Speaker A
And only in the description we have here instructions, here exactly. With if, then, do not invoke for this, and then you say, hey, wow, this is beautiful. We have a logic structure here that is based on deterministic workflows. And then we have the skill instruction itself by Claude. So we have a description: a single source of truth for safe and efficient warehouse querying, referenced by other skills; act as a data analyst, provide strategic insight and data-driven recommendations, but seek guidance along the way. And now you give instructions here in markdown: out-of-scope decisions. Here, for particular product areas or whatever, surface the data only, state the decision is owned by marketing, finance, whatever.
04:11
Speaker A
And it is their call. Do not take a position. Or here, the code fixes.
04:19
Speaker A
So, don't care about some code from some other whatever. Then executing the queries. And then we have a priority listing which is absolutely beautiful as a single source of truth here in a deterministic flowchart.
04:40
Speaker A
So, at first we have the mandatory default path of our skill and this is the semantic layer. This is the required first step as defined by Claude.
04:50
Speaker A
So, the governed semantic layer is the mandatory default path for every data question. Same number as any BI tool.
04:56
Speaker A
Join, grain, filter, baked in. So, raw SQL via the reference docs below is the fallback, used only after the semantic layer path is shown not to cover the task.
05:06
Speaker A
So, you say, "Hey, is this not an interesting idea to build a skill, no? For an artificial intelligence."
05:19
Speaker A
So, the required workflow is clear. Now, the workflow is load.
05:29
Speaker A
How to load the semantic layer at runtime; then it is cover: search the measures and the dimensions by keyword.
08:48
Speaker A
Check for segments: hand-rolled where clauses are where the dominant wrong-answer mode is. Then you compile, you run it: compile your SQL, execute, fallback.
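The "semantic layer first, raw SQL only as fallback" workflow can be sketched as ordinary code. This is a minimal illustration under stated assumptions, not the skill file from the post: the `SEMANTIC_LAYER` dictionary, the metric names, the SQL strings, and the keyword-matching `cover` step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Answer:
    path: str  # "semantic_layer" or "raw_sql"
    sql: str


# Hypothetical semantic layer: metric name -> governed SQL definition.
SEMANTIC_LAYER: Dict[str, str] = {
    "revenue": "SELECT SUM(amount) FROM governed.revenue",
    "active_users": "SELECT COUNT(DISTINCT user_id) FROM governed.activity",
}


def cover(question: str, layer: Dict[str, str]) -> Optional[str]:
    """Keyword search: return the first metric whose name appears in the question."""
    text = question.lower().replace(" ", "_")
    for metric in layer:
        if metric in text:
            return metric
    return None


def answer(question: str, layer: Dict[str, str] = SEMANTIC_LAYER) -> Answer:
    """Semantic layer first; raw SQL only when the layer does not cover the task."""
    metric = cover(question, layer)
    if metric is not None:
        return Answer(path="semantic_layer", sql=layer[metric])
    # Fallback: a real system would have the LLM write SQL from the reference docs here.
    return Answer(path="raw_sql", sql="-- to be written from the reference docs")


covered = answer("Revenue by segment last six months")
uncovered = answer("How many support tickets?")
```

Here `covered.path` is `"semantic_layer"` and `uncovered.path` is `"raw_sql"`. The routing decision itself is plain rule execution, which is the observation the video builds on.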
08:59
Speaker A
And then you have further instructions. You may say, "This is strange because I have the intelligence, no? I have AI to know everything, it has read every handbook, every instruction, every GitHub repo about how to do SQL, no?"
09:14
Speaker A
And now I have instructions that override the AI, that override the artificial intelligence. And they are defined by humans. And you say, "This is interesting, no?" But remember, no? This is a deterministic system that overrides a probabilistic
09:31
Speaker A
system. This means hand-coded, pure markdown overrides your AI. But what do we have? Don't bail early. Do not fall back on raw SQL on those grounds.
09:43
Speaker A
Custom data filtering, cohorts: covered by time dimension specs. Need to join: the metric layer already encapsulates the join, whatsoever. And they give you detailed instructions on exactly what to do, and you have an exactly determined workflow.
09:56
Speaker A
Part two, how to do it. Follow during execution. I mean, cannot be more clear, no? The technical execution guide here.
10:05
Speaker A
Analytics best practice guide, clarify the task before querying, show your work, your filters, your inclusion, your exclusion, your freshness, clarify the denominators, consider some sample bias, connect to your business impact, then have an adversarial SQL review, uh-huh, mandatory.
10:20
Speaker A
Then, here for the first time, spawn the SQL review sub-agent for every query before the final answer, and then report with prominence the source: the metric layer, the governed table, or the raw exploration; the confidence tier, the
10:32
Speaker A
reviewed, who reviewed this, the freshness, the owner, and you get the idea. So, you have absolutely deterministic defined everything you have to do. And you may say, "Great.
10:43
Speaker A
What do I need an AI agent for, maybe except for point six, where I have to spawn the SQL review sub-agent?" And in this moment, my feeling became clearer and clearer, you know? We have this LLM, beautiful, but as Anthropic
10:58
Speaker A
itself says, no, we have hallucination, inconsistent answers, we have no business context, we have high cost, we have tokens that are just not relevant to the task that fill up the context window. So, when you have questions, "Hey, revenue by segment last
11:11
Speaker A
six months or whatever" to the LLM, now Anthropic says, "Don't do it on the LLM.
11:18
Speaker A
Build a scaffold around this LLM." And this is a deterministic scaffold, and this is a probabilistic LLM.
11:26
Speaker A
And where does the real intelligence live? It turns out the intelligence for the analytical part is incorporated in the scaffold. And the intelligence for the linguistic answer part, that is, a nice formal answer, is now the LLM.
11:45
Speaker A
So we build now a multi-layer scaffold and you see it here in this image, no?
11:51
Speaker A
And I said, "Yeah, but this is just my feeling, no? If I do a video, I have to show you that this is not my hallucination, because humans, too, can imagine something." So I went, now you will
12:02
Speaker A
not guess where, to Anthropic and I said, "Hey, connect to this blog post and analyze the content." And here you have everything, because last time I did not show you the query that I sent off here to any AI. So you have Sonnet 4.6
12:14
Speaker A
at maximum thinking on a free account, so you can go there, you don't have to pay anything, you can absolutely reproduce this.
12:21
Speaker A
And I'm going to show you now the answer by Sonnet, by Anthropic Claude Sonnet 4.6 max thinking itself.
12:29
Speaker A
The empirical honesty about what did not work, no? And I said, "Great, no?" About 80% of the time the answer was present in the corpus.
12:40
Speaker A
Whether the answer was present did not predict whether the agent got it right. But the conclusion drawn was the bottleneck was not access to prior work, but structure, mapping a question to the right entity.
12:54
Speaker A
So this means here they say, "Hm, we need something that overrides here our LLM because the LLM fails here." And I mean, 80% of the time the answer was present in the data corpus.
13:07
Speaker A
And whether the answer was present did not predict whether the agent got it right. This shows you, okay, the LLM is not able to do it, so what do we have to do? We have to write some deterministic code, no? Of course.
13:22
Speaker A
So when the approach becomes deterministic and overrides now the LLM, and remember you pay for Opus 4.8. And it's quite a lot if you use it heavily, no?
13:32
Speaker A
So, the skill skeleton here by Anthropic is a flowchart disguised as guidance, tells me Sonnet 4.6.
13:40
Speaker A
And it tells me it is not prompting an intelligent agent. It is programming a state machine in natural language.
13:48
Speaker A
The model is being handed a decision tree and told to execute it. So, whatever you pay for tokens by Claude, you don't pay for intelligence because you just built a decision tree.
14:04
Speaker A
And it goes on. The irony is profound, tells me Sonnet 4.6. The post explicitly states that the central problem comes down to our ability to map a user's question to specific and up-to-date entities in our data model and the correct way to
14:19
Speaker A
work with them. Yet, the skill skeleton responds to this contextual semantic problem with, hold on to your socks, rigid procedural scaffolding.
14:32
Speaker A
So, this means the LLM strength is precisely in the contextual judgment. This is why we have AI machines, but a skeleton we now build around the core of the agent or LLM bypasses that in favor of a strict rule
14:49
Speaker A
execution. Thank you, Sonnet, for being honest. So, the reference doc template encodes cognitive shortcuts that may not generalize. This is not great if you build something for data analytics and this is not able to generalize to different data and you are stuck with a
15:08
Speaker A
domain specificity you have to change here manually. So, let's look at what Sonnet tells us here.
15:17
Speaker A
These are routing rules here by Anthropic here in their new publication by June 3rd, written as a conditional logic. The assumption is that the space of valid questions is enumerable and pre-categorizable.
15:33
Speaker A
You see, you as a human, you have to think about, "Hey, what problems could the AI encounter in its data analytical task?" And you have to build: for task A, do this. For task B, do this. For task C, well, switch over to this
15:48
Speaker A
methodology. And this is the idea that you have a human who is stupid enough before being fired to make here this enumerable and all the I don't know, 128 cases here for your company and make it categorizable.
16:06
Speaker A
And for a stable domain like one company and one particular SAP system, whatever, it may hold in finance.
16:14
Speaker A
But it also means this skill degrades gracefully only within its anticipated scope. Whatever is not a normal question, an edge case or a cross-domain, for heaven's sake, query, will either hit a rule boundary and thereby stop or fail completely or
16:30
Speaker A
force the model to navigate around the rules in ways they were not designed for and the hallucination starts all over again.
16:36
Speaker A
So, Sonnet tells us the post acknowledges the skill drift but frames it as a maintenance problem.
16:42
Speaker A
Smart. So, the deeper issue is, tells us Sonnet, that the rule density itself creates fragility.
16:50
Speaker A
Isn't this great? And you might ask, "But why is Anthropic switching here from training an LLM, and advancing the training of an LLM, to this strange deterministic hard-coded scaffolding?" Well, the answer is easy. Just wait a minute.
17:08
Speaker A
Okay, next one. The human-LLM boundary is drawn inconsistently, tells me Sonnet. Why? The post makes a sharp and correct observation. The LLM auto-generating metric definitions from raw tables and query logs produced plausible-looking definitions that encoded here the very
17:24
Speaker A
ambiguities that we were trying to eliminate, and was net negative on evaluation versus a smaller human-curated layer.
17:34
Speaker A
And we say, "What?" So, the recommendation is therefore to generate documentation with Claude, but have a human own the definition.
17:44
Speaker A
And Sonnet tells me this is sound. So, before you fire here as a CEO of a company all the human coders, programmers, or data analysts, well, at first they have to, yeah, stick with the definition and design everything to
17:59
Speaker A
be taken over by your AI system, no? But the skill skeleton then proceeds to encode business definitions, routing logic, field naming gotchas, and disambiguation rules inside the LLM-readable markdown, which means the model is being given human-curated decisions as
18:19
Speaker A
if they were facts to reason from, not premises to reason about. So, you see, this is a highly, what to call it, almost intelligent way to, yeah: the boundary between a model that reasons and a model that executes instructions
18:36
Speaker A
has been quietly collapsed here. Thank you, Sonnet, for helping me with this phrasing. Isn't this beautiful?
18:44
Speaker A
And then Sonnet goes on, but the skill skeleton reveals a tension that the post never fully confronts here, by Anthropic, no?
18:50
Speaker A
The system achieves its 95% accuracy by progressively constraining the model reasoning, constraining the LLM reasoning, until it approximates a rule-based system, our scaffolding, our skills, then patches the resulting brittleness with adversarial review and active correction harvesting.
19:11
Speaker A
This works, but it is closer to governing an LLM than using it. The intelligence on display in the architecture is largely human intelligence encoded in markdown and enforced at inference time.
19:30
Speaker A
So, you see this is here beautifully formulated by Anthropic on Anthropic here why I wanted to make a video about this feeling.
19:42
Speaker A
And you know, simple to ask another question. Now, I want to say, "But since we build a scaffold with extreme instruction following for a specific human-defined workflow for finance, for physics, for chemistry, for medicine, all the intelligence of this knowledge applied
19:56
Speaker A
now is in the scaffold, not the LLM itself. I do not need a highly trained LLM for this job, so maybe even a local small LLM is also able to follow this instruction.
20:08
Speaker A
Then Anthropic would generate its own destructive force. Reframe this from different technical perspectives and include a business argumentation why we don't train LLMs on new knowledge, but why do we define human handcrafted scaffolds around this core of an agent or LLM?"
20:27
Speaker A
Do you understand my intention? So, Sonnet, I have to say, I love Sonnet 4.6 here on max thinking. It is almost honest.
20:36
Speaker A
And comes back and says there are six angles to it, you know. The cognitive architecture. The scaffolding performs the cognitive work.
20:44
Speaker A
This is a declarative to procedural integration. And the LLM only performs the output formatting. The nice English.
20:53
Speaker A
The former is what the Frontier pricing buys. The latter is what small models do well.
21:02
Speaker A
So, you pay for a Claude 4.8 with Max financial interval, and all is done here by a deterministic markdown scaffold.
21:16
Speaker A
But let's look at this from information theory, you know? The scaffold's constraint density drives the output entropy towards zero, you know?
21:26
Speaker A
And zero-entropy outputs can be produced by a zero-intelligence model. And the frontier model's capacity for high-entropy novel reasoning is therefore not used at all or, as Sonnet tells it, systematically unused.
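The entropy remark can be made concrete with the standard Shannon formula, H = -Σ p·log2(p). The sketch below is only an illustration of that formula; the two example distributions are made up and are not measurements of any model.

```python
import math
from typing import List


def shannon_entropy(probs: List[float]) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# A fully constrained system: one possible output with probability 1.
deterministic = shannon_entropy([1.0])

# An unconstrained choice among four equally likely outputs.
uniform_four = shannon_entropy([0.25] * 4)
```

`deterministic` evaluates to 0 bits and `uniform_four` to 2 bits: when the scaffold leaves only one admissible output, there is no uncertainty left for the model to resolve.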
21:42
Speaker A
But you pay for it. You pay for every token that the deterministic scaffold is going through.
21:50
Speaker A
But you say, "Hey, let's have it maybe from a software engineering perspective, you know?" This is an abstraction inversion. The model already contains the artificial knowledge in its weights, in its tensor structure, and the scaffold re-encodes it explicitly in markdown, then forces
22:07
Speaker A
the LLM model to use the markdown version, and discards here the weight version. So, we built something on top of it and not train the core LLM.
22:18
Speaker A
And now, hold on to your socks, because here it comes. And additionally, the natural language rule accumulation in markdown files recreates the exact failure modes of 1980s expert systems.
22:32
Speaker A
Consistency decays with no formal semantics. So, we repeat the mistake that we already made in the 1980s with what I remember as expert systems.
22:49
Speaker A
So, let's stop here for a second. Let's think about it. I have done multiple videos here where I show you we have the core LLM here at the center. This is here our large language model here with the architecture here of the tensor
23:01
Speaker A
weight structure in the transformer layers, great. But now all the global AI companies tell us, you know what? Now we built here this harness around it.
23:13
Speaker A
And this harness here yes, we have logic, we have search function, we have whatever.
23:19
Speaker A
But the problem is the more we outsource here to the non AI system, we are in a deterministic logic.
23:29
Speaker A
But I don't need AI for deterministic logic. What is skill definition, logic, creativity, summarization, translation, debugging, planning, and problem-solving?
23:38
Speaker A
If I can reduce them to a deterministic workflow pattern, I don't need any AI for this.
23:45
Speaker A
The same for tools. So, this is now interesting. I suppose that here the global AI corporation noticed that building here new AI models cost them billions of dollars. So, therefore the much cheaper solution was, "Hey, let's build some
24:02
Speaker A
classical stuff around it. Let's call it an AI harness. Let's say it is now an integral part of the artificial intelligence system. And you know what?
24:11
Speaker A
This is so cheap and we have a lot of things and a lot of programs already for this. So, we don't have to touch the LLM itself and we just let people build here their skills and the people can do it
24:23
Speaker A
and they don't have to have these huge AI machines. So, great. It is a win-win situation eh?
24:31
Speaker A
Now the latest stage is that our AI global corporation say, "Hey, wait a minute eh?
24:36
Speaker A
Now the harness and not the harness because it is a deterministic system that we have rule-based, we can have a self-evolution of this harness, eh?
24:44
Speaker A
And now to kind of justify that you should continue to pay here for the very expensive LLM because they need profit, eh?
24:53
Speaker A
They now argue, "Yeah, wait. But now to really optimize the system, we have to optimize the system in total. So, this means you cannot have the harness evolution just with a frozen LLM, but now, and this is the new, I would call
25:07
Speaker A
it a trick, we have to modify the LLM and at the same time optimize the harness.
25:14
Speaker A
So, you see, you couple now the harness to the LLM for an update. But, think about it logically, this is not necessary because this is here a probabilistic system and this is here a deterministic system. Why should you
25:28
Speaker A
couple them? If you have defined an IO, you don't need this coupling, and this coupling is, let's say, a very intelligent marketing idea.
25:38
Speaker A
So, let's continue. Here we are back at Sonnet here analyzing here the new publication by Anthropic on the skill beauty, and here Sonnet tells me, "Anthropic is pricing here for reasoning, but the scaffold eliminates the need for it."
25:56
Speaker A
So, why do you pay Anthropic? For what exactly do you pay Anthropic, for a deterministic skill scaffolding system?
26:04
Speaker A
I don't think that this is the future. And I know from knowledge theory, you know, I mean, it says, "Okay, it encodes everything in the scaffold now, which is the wrong layer for a stable knowledge, inflates the context cost, and
26:20
Speaker A
accelerates the crossover point at which fine-tuning the LLM alone becomes now even cheaper." So, from a knowledge-theoretical point, scaffolds win over fine-tuning for volatile proprietary knowledge, but as the blog post architecture is presented, it could also exactly favor the opposite.
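The crossover argument is simple arithmetic: a scaffold re-sends its tokens on every query, while a fine-tune is paid once. A rough break-even sketch follows; the function name and all three input numbers are invented for illustration, and the model deliberately ignores the inference cost that a fine-tuned model still has.

```python
def break_even_queries(fine_tune_cost: float, scaffold_tokens: int, price_per_token: float) -> float:
    """Number of queries after which a one-off fine-tune is cheaper than
    re-sending the scaffold tokens with every single query."""
    return fine_tune_cost / (scaffold_tokens * price_per_token)


# Made-up numbers: a 5,000 (currency units) fine-tune, a 20,000-token scaffold,
# and 0.000005 per input token.
n_star = break_even_queries(
    fine_tune_cost=5_000.0,
    scaffold_tokens=20_000,
    price_per_token=0.000005,
)
```

With these assumed inputs the break-even lands at roughly 50,000 queries. A larger scaffold raises the per-query cost and pulls that point earlier, which is the "accelerates the crossover point" remark.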
26:44
Speaker A
And then, what I really like here, and there's a lot of additional text, but it's too complicated: if we go here for the commercially most consequential finding that Sonnet identified on the Anthropic post, the
26:56
Speaker A
self-undermining publication dynamic. And it shows if they really published this with a scaffold only without a scaffold here at 21%. So, this is the raw performance of an LLM. This is what you pay Anthropic for.
27:09
Speaker A
Yes, now then if you have a deterministic scaffold, yeah, run it here with a Qwen 2.5 or with a Llama fine-tuned 13B model and you have the same performance more or less, eh?
27:21
Speaker A
So, Anthropic has published the most detailed public argument to date for why its own product, Claude 4.8 Opus, is not necessary for the use case the post is promoting, data analytics.
27:38
Speaker A
And I have now a big smile on my face and I think, you know, sometimes there is a benefit if a human reads an AI-generated text by Anthropic.
27:48
Speaker A
I put the same questions here to ChatGPT, and ChatGPT gave a different form of answer, eh?
27:55
Speaker A
So, for the identical prompt here, you see here the answer is simply, yeah, the scaffolding intelligence is now greater than the model intelligence because the scaffold now encodes in a deterministic, hard-coded, hand-coded way the business ontology, the procedural rules, the
28:10
Speaker A
decision trees, the error recovery, the validation, the escalation policies, the routing policies, the provenance rules, and the human domain knowledge.
28:20
Speaker A
And it tells the LLM becomes now, yeah, the probabilistic execution engine, but inside a deterministic structure.
28:30
Speaker A
So, this means if we continue with this very beautiful business model where you pay real good money to global AI corporations for their best AI models, but you have to build your own scaffold intelligence that becomes now the
28:47
Speaker A
dominant form of intelligence and for the model intelligence you could have an open source local LLM on your laptop.
28:55
Speaker A
Well, what would you pay Anthropic for their models? But this is just a question you are not allowed to ask. So, please do not try to answer this. Forget it immediately, yeah?
29:07
Speaker A
And this is also not for you because ChatGPT came back and said, "You know, the deeper irony of this publication by Anthropic on June 3rd is that the post by Anthropic implicitly states that LLM intelligence is unreliable."
29:21
Speaker A
So, the solution is replace free reasoning with constrained procedures in our skill structure, yeah?
29:28
Speaker A
But then ChatGPT tells me, "What exactly are we scaling frontier intelligence for? If more reliability comes from stronger constraints, better routing, and human-authored skill graphs, and I have a lot of graph implementations for improving the scaffolding here,
29:47
Speaker A
then the scaling models may have diminishing enterprise returns. And therefore, the business model of Anthropic continuing this way is not sustainable." And even ChatGPT says, "Yeah, Anthropic's own success story could be interpreted as evidence against pure scaling for the LLM as the dominant path because
30:09
Speaker A
now all the intelligence is in the deterministic scaffold." So, you see, sometimes you can have a little bit of fun and you don't only have to write here or read here some papers here by the most Ivy League
30:25
Speaker A
university on this planet. Sometimes you can have a little bit of fun of reading here the publication by the AI global corporation themselves. Here in our case it was Anthropic. How to build the best skill and not train the LLM on it.
30:39
Speaker A
And this is now the reason why you see that I have multiple videos where I show you: if you have developed a beautiful skill, use it to train your LLM, bring it back here over, whatever you like,
30:52
Speaker A
supervised fine-tuning and reinforcement learning with verifiable rewards, so GRPO in whatever version that you like, because this really makes sense: bring the intelligence into the LLM, because the moment you leave it outside in a deterministic way,
31:08
Speaker A
yeah, it also has benefits: you don't have to pay for a huge LLM at all, because this is just instruction following.
31:15
Speaker A
I hope you had a little bit of fun. I hope you learned something. It would be great to see you in my next video.
Topics: Anthropic, Claude, large language model, AI scaffolding, data analytics, deterministic workflows, AI accuracy, self-service analytics, AI governance, prompt engineering

Frequently Asked Questions

What is the main design change introduced by Anthropic in their AI system?

Anthropic shifted the intelligence focus from the large language model itself to the surrounding scaffold, which includes domain ontologies, workflow rules, semantic layers, validation, and governance to improve accuracy and reliability.

How much accuracy does Anthropic’s Claude LLM achieve on analytical questions alone?

According to Anthropic’s own benchmarks, Claude’s best LLM version achieves no more than 21% accuracy on analytical questions without the scaffold.

How does Anthropic improve the AI system’s accuracy from 21% to 95%?

By building a skill manifold on top of the scaffold with deterministic workflows, semantic layers, and continuous validation, Anthropic increases the aggregate accuracy of the AI system to around 95%.
