最近爆火的 Harness Engineering 到底是个啥?一期讲透! — Transcript

An in-depth explanation of Harness Engineering in AI, covering its evolution from prompt to context to harness engineering for real-world AI applications.

Key Takeaways

  • Harness Engineering addresses the limitations of prompt and context engineering by focusing on execution and long-term task management.
  • Effective AI systems require not just good prompts but also dynamic context management and real-time feedback integration.
  • Optimizing information flow to the model involves strategic timing and selective data presentation rather than maximal input.
  • AI engineering has evolved to meet increasingly complex real-world demands, moving from simple language prompts to sophisticated multi-agent coordination.
  • Harness Engineering is essential for building reliable AI agents that can perform complex, multi-step tasks in dynamic environments.

Summary

  • Harness Engineering is a newly popular concept in AI focusing on improving the reliability and execution of AI models in real-world applications.
  • The video explains the evolution of AI engineering through three stages: Prompt Engineering, Context Engineering, and Harness Engineering.
  • Prompt Engineering focuses on language design to guide models by creating partial probability spaces through prompts.
  • Context Engineering addresses the need for accurate and timely information delivery to the model, managing complex task chains and dynamic data.
  • Harness Engineering solves the problem of maintaining correct execution over long tasks and multi-step processes beyond input optimization.
  • The video highlights challenges like managing historical dialogue, tool integration, and balancing information quantity to optimize model performance.
  • It discusses practical techniques like RAG and agent skills that optimize information delivery and task execution in AI systems.
  • The importance of giving the model the right information at the right time, rather than overwhelming it, is emphasized.
  • The video also covers how OpenAI and Anthropic have contributed to the development and application of Harness Engineering.
  • Overall, Harness Engineering is presented as a critical advancement for deploying AI agents capable of sustained, accurate, and context-aware task completion.

Full Transcript

00:00
Speaker A
Today we are going to talk about a term that has become very popular in AI circles recently, though many people don't really understand it: Harness Engineering. If you have been building agents lately, or focusing on shipping AI applications, you may
00:10
Speaker A
have run into problems like this: why can others run the same model continuously for long stretches with a high success rate, while in your hands it keeps falling over? Many people conclude that the
00:19
Speaker A
model is not strong enough, the prompt is not good, or the RAG retrieval is not accurate. Of course, these all have an impact.
00:24
Speaker A
Uh, to be honest, I didn't have a specific word for this until recently, when the concept of "Harness Engineering" took off and I realized that what I had been changing back then was essentially the harness. So in today's video,
01:30
Speaker A
I want to explain the concept thoroughly, in three parts: how did the harness emerge step by step? What does a mature harness include? And what have OpenAI and Anthropic actually done? In the past two years, AI engineering has gone through three
01:44
Speaker A
clear shifts: from Prompt Engineering to Context Engineering to the latest, Harness Engineering. On the surface this looks like a parade of new buzzwords, but dismissing them that way badly underestimates them. The three terms correspond to the
01:58
Speaker A
three successive problems of AI system development: does the model understand what you are saying? Does the model have enough correct information? And can the model keep doing the right thing during real execution? You will find that these problems are
02:11
Speaker A
expanding layer by layer. When large models first arrived, the most intuitive observation was that changing how you phrase a request can change the result dramatically. For example, if you say, "Please help me summarize this article," it may only
02:23
Speaker A
give you a very flat summary. But phrase the request differently and the effect changes immediately. So at that stage everyone believed one thing: it's not that the model doesn't know, it's that you haven't asked properly.
02:34
Speaker A
So everyone started studying prompting techniques: role setting, style constraints, few-shot examples, distribution steering, output formats, and so on. Why do these things work? Because a large model is essentially a probability-generation system that is extremely sensitive to its context.
02:45
Speaker A
Give it an identity, and it tends to answer in that identity. Give it examples, and it tends to continue in that style. Emphasize a constraint, and it tends
02:56
Speaker A
to treat that part as the focus. So the essence of a prompt is not a magic spell, but a way of carving out a local probability space. The key skill at this stage was not system design but language design. But prompt engineering
03:08
Speaker A
soon hit a ceiling, because many tasks are not "just phrase it clearly" but "you actually have to know the facts." For example: have the model analyze an internal company document, answer questions about a product's latest configuration, write code against a
03:20
Speaker A
very long specification, or complete complex tasks across multiple tools. Here you find that however beautifully the prompt is written, it cannot replace the facts themselves. So what are prompts good at? Short, single-shot tasks, constraining output, and activating the model's existing abilities. But they are not good
03:31
Speaker A
at conjuring missing knowledge out of thin air, managing large amounts of dynamic information, or handling the state of long-running tasks. Bluntly, prompts solve a problem of expression, not a problem of information. So the second
03:44
Speaker A
stage began. When everything was just a chatbot, wording mattered enormously: tasks were short, chains were short, there was little state, and many problems could be solved by phrasing alone. But then agents became popular. The model
03:56
Speaker A
no longer just answers questions; it acts in a real environment. It holds multi-turn conversations, drives browsers, writes code, calls tools, and passes intermediate results between steps. It also has to keep adjusting its plans
04:07
Speaker A
based on external feedback. Then the problem changes: the system no longer faces a single question and answer, but an entire task chain. Suppose you don't ask a simple question like "summarize this article," but give it a more realistic task.
04:17
Speaker A
For example: analyze this report, find the potential risks, combine analysts' opinions, give recommendations, and then produce a draft reply to send to the product manager. You will find that this is not a problem that can be
04:28
Speaker A
solved with a single question. The system needs current data, history, evaluation records, and the current goals, and it must work out the conclusions, who the audience is, how to pitch the tone, and so on. So the core of context engineering comes down to one sentence: the model may
04:41
Speaker A
not know, so the system must deliver the correct information at the right time. "Context" here is not just a few pieces of background. In the engineering sense, it is the totality of information affecting the model's current decision, including input from
04:55
Speaker A
users, dialogue history, retrieval results, tool returns, current task state, intermediate artifacts, system rules, safety restrictions, and outputs from other agents. So the prompt is actually just one part of the context. And because of this, the assembly mechanism of this
05:09
Speaker A
context is critically important. Speaking of context engineering, RAG is a typical practice. Its value is very direct: knowledge missing from the model's parameters gets filled in at run time. Everyone knows
05:20
Speaker A
the method: retrieve first, then insert the relevant content into the context.
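The retrieve-then-insert pattern can be sketched in a few lines. This is a toy illustration, not any specific framework's API: `embed` here is a deliberately crude stand-in for a real embedding model, and `build_prompt` just shows where the retrieved passages land in the context.

```python
# Minimal sketch of the RAG pattern described above: retrieve first,
# then insert the relevant passages into the context before asking.

def embed(text: str) -> list[float]:
    # Toy embedding: normalized character-frequency vector.
    # Real systems use a trained embedding model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query, keep the best k.
    q = embed(query)
    scored = sorted(docs, key=lambda d: -sum(a * b for a, b in zip(q, embed(d))))
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # The retrieved content is inserted into the context ahead of the question.
    context = "\n".join(f"- {d}" for d in top_k(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The point of the sketch is the shape of the pipeline, not the similarity metric: retrieval happens outside the model, and only the selected passages enter the context window.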
05:24
Speaker A
But truly mature context engineering is not only about retrieval; it cares about the whole chain. How do you chunk documents? How do you rank results? How do you compress long passages? When do you keep dialogue history?
05:35
Speaker A
When do you summarize it away? Do you expose all the tools to the model? Between multiple agents, do you pass the original text, a summary, or a rewritten version? Even the recently popular agent skills, I think it
05:44
Speaker A
is essentially an advanced form of context engineering, because it solves a particularly practical problem. If you take the full descriptions and parameters of dozens of different tools and all
05:53
Speaker A
of it is loaded into the model up front, then in theory the model knows more, but in practice it performs worse. Why? Because the context window is a scarce resource: the more information, the more thinly attention is
06:03
Speaker A
spent. So skills use a classic idea called progressive disclosure. Instead of showing the model its full capabilities at the start, you show only minimal metadata. When it actually starts to exercise some capability,
06:14
Speaker A
the system loads that skill's SOP, detailed reference material, and scripts into the context.
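As a sketch, progressive disclosure is just lazy loading applied to context. Everything here (the skill names, the `SKILLS` table, both functions) is illustrative, not Anthropic's or anyone's actual implementation:

```python
# Sketch of "progressive disclosure" for agent skills: the model first
# sees only one-line summaries; a skill's full SOP and reference
# material enter the context only when that skill is actually invoked.

SKILLS = {
    "pdf_report": {
        "summary": "Generate a PDF report from structured data.",
        "details": "Full SOP: validate schema, render template, run checks...",
    },
    "web_search": {
        "summary": "Search the web and return cited snippets.",
        "details": "Full SOP: formulate query, fetch, deduplicate, cite...",
    },
}

def initial_context() -> str:
    # Minimal metadata only: one line per skill, cheap on tokens.
    return "\n".join(f"{name}: {s['summary']}" for name, s in SKILLS.items())

def load_skill(name: str) -> str:
    # Detailed instructions are disclosed on demand, not up front.
    return SKILLS[name]["details"]
```

The design choice is the same one the video describes: keep the always-on context tiny, and pay the token cost of details only when they are needed.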
06:18
Speaker A
This idea matters because it tells us that optimizing context is not about giving more, but about giving selectively, and giving at the right time. But context engineering is not
06:29
Speaker A
the end point, because a more troublesome problem emerged later. Even when the information is correct, the model may still fail to execute correctly. It may plan well but execute badly; a tool call may succeed
06:40
Speaker A
but its result gets misread; over a long chain it slowly drifts off course without the system noticing. At this point we realize that prompts and context mainly solve the input problem. The prompt is
06:50
Speaker A
optimized for expression, and the context is optimized for information. But neither answers:
06:57
Speaker A
When the model acts continuously, who supervises it, constrains it, and corrects it? That is where the third stage comes in. The word "harness" originally refers to the tack and reins used to control a horse. The image is a simple reminder
07:10
Speaker A
that when a model goes from answering questions to performing tasks, being able to answer is not enough; the whole process has to be controllable. This is the starting point of Harness Engineering. If the previous two stages
07:23
Speaker A
of engineering focused on making the model smarter, the harness cares about making it steadier, more controllable, and recoverable when it goes wrong. Let me use an analogy to explain these three concepts.
07:34
Speaker A
Suppose you send a new hire on a very important customer visit. Prompt engineering is explaining the task to him clearly first.
07:41
Speaker A
For example: exchange pleasantries first, then introduce the solution, then probe the requirements, and finally confirm next steps. That is the prompt; the point is expressing the task clearly. Context engineering is making sure his materials are fully prepared, such as the
07:52
Speaker A
client's background, past communication records, product pricing and status, and the goal of this meeting.
07:56
Speaker A
These are all context; the key is delivering the right information. But if the meeting really matters, you do much more: you give him a checklist, have him report back at key checkpoints. After the meeting,
08:08
Speaker A
verify the details against the recording. Deviations get corrected immediately, and finally the results are checked against clear standards. That is the harness. The point is no longer whether the information was clear, but whether there is a set
08:21
Speaker A
of continuous observation, continuous evaluation, and final verification of results. So the three are not substitutes for one another; they nest. The prompt is the instruction, the context is the environmental input, and the harness is the entire operating
08:34
Speaker A
system, each boundary one layer wider than the last. LangChain's engineers gave the harness a memorable definition: agent = model + harness; equivalently, harness = agent − model. In plain language: in an agent system, apart from the model itself,
08:51
Speaker A
almost everything that determines whether it is stable and reliable counts as the harness. From my own angle, I would divide mature harness engineering into six layers. The first layer stands in the
09:03
Speaker A
harness's perspective, looking at context again. Whether a model is stable often depends not only on how smart it is but on what it sees. So the harness's first responsibility is to keep the model thinking within the
09:14
Speaker A
correct information boundary. This layer usually includes three things. First, goal and role definition: the model needs to know who it is, what the task is, and what counts as success. Second, information selection and filtering: more is not
09:26
Speaker A
automatically better. Third, structural organization: where the fixed rules live,
09:30
Speaker A
where the current task goes, where task state goes, where external evidence goes; it is best to make all of this explicit. Once the information gets jumbled, the model easily loses focus, forgets constraints,
09:40
Speaker A
or even starts making things up. The second layer is the tool system. Without tools, a large model is essentially a text predictor: it can explain and summarize, but it cannot touch the real world. Once tools are connected, the model can do things for
09:50
Speaker A
real: search the web, read files, write code, call APIs, and so on. But the harness's job here is not simply bolting tools on; it must solve three problems. First, which tools to give it: too few and it lacks
10:00
Speaker A
capability; too many and it uses them haphazardly. Second, when tools should be used: don't search when there's no need, and don't answer from memory when you should search. Third, how tool results are fed back to the model: dozens of retrieved results
10:11
Speaker A
shouldn't be dumped back raw; they need to be cleaned, filtered, summarized, and tied to the task. The third layer is execution and orchestration. The core problem it solves is what the model should do next. Many agent failures
10:22
Speaker A
are not a single bad step, but steps that never connect. It can search, summarize, and write code, yet the process wanders wherever its attention goes, and it finally hands over a pile of half-finished work. So a complete
10:33
Speaker A
task usually needs a path like this: understand the goal, check whether the information is sufficient, retrieve more if not, analyze the results, generate output, check the output, and if it falls short, revise or redo. At this point,
10:44
Speaker A
you notice this looks a lot like how a person works. The difference is that people rely on experience, while agents rely on the harness. The fourth layer is memory and state. A stateless agent forgets everything each round. It doesn't know
10:54
Speaker A
what it just did, which conclusions have been confirmed, or which problems remain open. So the harness must also manage state, and at minimum keep three things distinct: the current task state, the intermediate
11:05
Speaker A
results in progress, and long-term memory with user preferences. Mix the three together and the system grows messier and messier; keep them distinct and the agent becomes a stable collaborator. The fifth layer is evaluation and observation. This is the
11:17
Speaker A
layer teams most easily ignore. The problem in many systems isn't that they can't generate; it's that after generating they don't know whether the result is any good. Without independent evaluation and observability, the agent stays pleased with itself for
11:30
Speaker A
a long time. This layer usually includes output validation, environment verification, automated testing, logs and metrics, error handling, and so on. In other words, the system must not only do the work but know whether it was done right. The sixth layer is the last
11:42
Speaker A
layer, and it often decides whether the system survives, because in real environments failure is not the exception but the norm. The search may be inaccurate, the API may time out, the file format may be wrong, or the model may misunderstand
11:54
Speaker A
the task. Without a recovery mechanism, the agent has to start over from scratch every time it fails. So a mature harness must include three things: constraints, defining what may and may not be done; checks, such
12:07
Speaker A
as validation before and after output; and recovery, how to return to a stable state after failure. With the concepts covered, let's look at the most valuable part:
12:17
Speaker A
Real practice at first-tier companies. The word "harness" is popular not because of the methodology itself, but because companies have actually put it into products and engineering systems. For example, LangChain, keeping
12:30
Speaker A
the base model unchanged, reportedly moved its own agent from outside the top 30 into the top 5 purely by iterating on the harness. OpenAI, with a team of only a handful of human engineers, built a production application of over
12:41
Speaker A
a million lines of code using agents: 100% of the code written by agents, in roughly one tenth the time pure human development would take. Anthropic likewise built a fully autonomous coding system: given just a natural-language requirement,
12:53
Speaker A
it can run for hours without human intervention and deliver a complete game or a complete digital audio workstation. Let's look at Anthropic's practice first.
13:02
Speaker A
First, they identified two very typical problems in long-horizon autonomous tasks. The first I would translate as "context anxiety." As the context fills up, the model starts losing details and focus. There will
13:14
Speaker A
even be a curious phenomenon: the model seems to sense it is running out of room, so it rushes to finish. Many systems respond with context compaction: compress the earlier history, then keep running.
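A toy version of compaction looks like this; `summarize` stands in for a model call, and the budget numbers are arbitrary:

```python
# Sketch of context compaction: when history exceeds a budget, replace
# the older turns with a single summary and keep recent turns verbatim.

def summarize(turns: list[str]) -> str:
    # Placeholder for an LLM summarization call.
    return f"[summary of {len(turns)} earlier turns]"

def compact(history: list[str], budget: int = 4, keep_recent: int = 2) -> list[str]:
    if len(history) <= budget:
        return history  # still fits, nothing to do
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

The text gets shorter, but as the video notes next, shorter is not the same as a truly fresh start.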
13:26
Speaker A
But Anthropic found that for some models this is not enough: compression only makes the text shorter; it doesn't mean the burden truly disappears. So they did something more radical, called context reset. Instead of continuing to squeeze the original
13:37
Speaker A
context, they hand the work over to a fresh, clean new agent. What does that resemble? A program with a memory leak: rather than patching it in place, you restart the process and restore from a saved state. This
13:50
Speaker A
is a very typical harness design. The second problem Anthropic tackled is self-evaluation bias. When a model does the work and then grades itself, it tends to be optimistic, especially in the field of design
14:02
Speaker A
aesthetics, user experience, product completeness, and other questions without standard answers, where the bias is most obvious. So they adopted one key idea: separate those who do the work from those who verify it, splitting the system into three roles. The Planner turns
14:13
Speaker A
vague requirements into complete specifications. The Generator implements them incrementally. The Evaluator does real testing, like QA. Crucially, the Evaluator doesn't just read the code; it drives the actual running page, exercises the real interactions, and
14:26
Speaker A
checks the actual results. In other words, this is not an abstract review but verification in a concrete environment. That matters because of the clear engineering principle behind it: production and inspection must be separated. As long as the evaluator
14:40
Speaker A
is independent enough, the system can form a genuinely effective loop. OpenAI, meanwhile, has redefined the engineer's job in the age of agents. Their idea is striking: humans don't need to write a single line of code in this environment; humans
14:55
Speaker A
are only responsible for designing the environment. In short, the engineer's work becomes three things. First, break product goals into small tasks the agent can solve. Second, when the agent fails, don't push it to try
15:08
Speaker A
harder; ask what capability the environment is missing. Third, build a feedback chain so the agent can truly see the results of its work. I agree with this line: when the agent has a problem, the fix is almost never
15:21
Speaker A
to try harder, but to identify what structural capability it lacks. That is textbook harness thinking. OpenAI also has a signature practice, again a form of progressive disclosure.
15:29
Speaker A
Early on they made the mistake many teams make: they wrote a huge agents.md stuffed with every rule, framework, and convention. The agent only got more confused, because the context window is a scarce resource, and
15:42
Speaker A
when it is crammed full, nothing stands out. How did they fix it? They turned agents.md into an index page holding only the core content, and moved the details out into separate documents: design docs, implementation plans,
15:53
Speaker A
quality standards, safety rules, and other dedicated documents. The agent reads the index first and drills into details only when needed. You'll notice this is essentially the same idea as the skills we mentioned earlier: not a
16:03
Speaker A
one-shot dump, but on-demand loading. Another practice: OpenAI lets agents not only write code but also see the running application. Once output speed rises, the bottleneck is no longer writing code but reviewing it. Humans simply can't check
16:16
Speaker A
it at all, so they have agents check themselves. How? First, a browser tool clicks through pages and takes screenshots, replaying real user operations. Then the logging and metrics systems are wired up to the agent so it can
16:28
Speaker A
read logs and monitoring data. Finally, each task runs independently in its own isolated environment.
16:32
Speaker A
The result: the agent no longer writes code and declares itself done; it runs the application, sees the results, finds bugs, fixes them, and verifies again. This is a complete closed loop within the harness: execution and orchestration,
16:44
Speaker A
evaluation and observation, constraints and recovery. One more thing worth noting: OpenAI no longer relies on a final human code review to guarantee quality, because agents produce code faster than human
16:54
Speaker A
beings. So they encode the judgment of experienced engineers directly into system rules: how to split modules, which layer may not depend on which, when a change must be blocked, and how to fix a problem once found. The point is that
17:04
Speaker A
these rules don't just report errors; they feed the "how to fix it" back into the agent's context for the next round.
17:11
Speaker A
You'll see this is no longer traditional code review but a continuously running automated system, another typical form of harness. Finally, to sum up: prompt engineering solves how to state the task clearly. Context engineering solves how to give
17:26
Speaker A
all the information correctly. Harness engineering solves how to keep the model doing the right thing during real execution. So the harness does not replace the prompt or the context; it encloses both within a larger system boundary. When the task is still
17:42
Speaker A
simple single-turn generation, the prompt matters most. When the task starts depending on external knowledge and dynamic information, context matters most. And when the model enters real-world scenarios with long chains and low fault tolerance, a harness becomes almost inevitable. This
17:55
Speaker A
is why the same model performs so differently in different products: the model may set the ceiling, but whether the product actually lands is decided by the harness. Here we also see the reality
18:08
Speaker A
of the core challenge of deploying AI today. Thank you for watching. See you next time.
Topics: Harness Engineering, AI engineering, Prompt Engineering, Context Engineering, AI agents, RAG, OpenAI, Anthropic, AI task execution, AI application deployment

Frequently Asked Questions

What is Harness Engineering in AI?

Harness Engineering is an advanced stage of AI engineering that focuses on ensuring AI models can reliably execute complex, multi-step tasks by managing context, execution flow, and dynamic information.

How does Harness Engineering differ from Prompt and Context Engineering?

Prompt Engineering focuses on language design to guide AI responses, Context Engineering manages the delivery of accurate information, while Harness Engineering ensures correct execution and long-term task management beyond input optimization.

Why is managing the timing and amount of information important in AI systems?

Because AI models have limited attention windows, giving too much information at once can reduce performance. Harness Engineering emphasizes providing the right information at the right time to optimize model effectiveness.
