AI Engineering in 76 Minutes (Complete Course/Speedrun!) — Transcript

Comprehensive 76-minute overview of AI Engineering covering foundation models, Transformers, training, and challenges in the field.

Key Takeaways

AI Engineering leverages existing foundation models to build applications, focusing on adaptation rather than training from scratch.
Transformers and their attention mechanism are central to modern foundation models, enabling efficient and effective processing of large data sequences.
Training data quality and distribution significantly affect model knowledge and biases, necessitating filtering and specialized models.
Compute resources and energy consumption are major constraints in scaling AI models, with ongoing research into optimization and alternative architectures.
Understanding model architecture and training principles is essential for AI engineers to effectively build and improve AI systems.

Summary

AI Engineering focuses on building applications using foundation models rather than training models from scratch.
Foundation models are large AI systems trained via self-supervision on vast web-crawled data, enabling learning without manual labeling.
Transformers, based on the attention mechanism, revolutionized sequence-to-sequence tasks by allowing parallel input processing and dynamic token referencing.
Training data biases and quality issues, such as misinformation and language skew, impact foundation model performance and applicability.
Model size, parameter count, and compute resources critically influence training efficiency and model capabilities, with sparse models offering efficiency gains.
The Chinchilla scaling law guides optimal model and data size for given compute budgets, highlighting the trade-offs in training large models.
Future bottlenecks include scarcity of high-quality training data and the significant electricity consumption of data centers.
Alternative architectures like RWKV are emerging, combining RNNs with parallelization for specific use cases.
Foundation models power diverse applications including coding assistants, image generation, customer support, and data analysis.
Small improvements in model performance can have large impacts on downstream applications despite high costs.

Full Transcript

Speaker A

Hey everyone. Today we're diving into the book AI Engineering by Chip Huyen. 800 pages of really great content about this in-demand field that's offering salaries of $300,000 or more. In this video, I'm summarizing everything from the book to

Speaker A

help you get a high-level overview of the field. We'll talk about foundation models, prompt engineering, RAG, fine-tuning, agents, how to build a system, improving inference, and more. I also want to mention this is a super high-level overview of a very detailed technical

Speaker A

book. Don't expect to learn all the details just from watching this video. I really recommend using this as a way to get an overview of what the field looks like and use it as a jumping-off point for your own research and exploration. So

Speaker A

what exactly is AI engineering and how is it different from traditional machine learning? Let's break it down. AI engineering has exploded recently for two simple reasons. AI models have gotten dramatically better at solving real problems, while the barrier to building

Speaker A

with them has gotten much lower. This perfect storm has created one of the fastest-growing engineering disciplines today. At its core, AI engineering is about building applications on top of foundation models, those massive AI systems trained by companies like OpenAI

Speaker A

or Google. Unlike traditional machine learning engineers who build models from scratch, AI engineers leverage existing ones, focusing less on training and more on adaptation. These foundation models work through a process called self-supervision. Instead of requiring humans to painstakingly label data, these

Speaker A

models can learn by predicting parts of their input data. This breakthrough solved the data labeling bottleneck that held back AI for years. As these models scaled up with more data and computing power, they evolved from simple language models to what we now call large

Speaker A

language models or LLMs, and they didn't stop there. They've expanded to handle multiple types of data including images and video, often becoming large multimodal models. Nowadays, we're seeing foundation models power everything from coding assistants like GitHub Copilot to image generation tools, writing aids,

Speaker A

customer support bots, and sophisticated data analysis systems. Now that we've covered what AI engineering is, let's dig deeper into foundation models themselves, how they're trained, how they work, and why understanding their architecture matters for AI engineers. Foundation models at their core can only know what

Speaker A

they've been trained on. This might seem obvious, but it has profound implications. If a model hasn't seen examples of a specific language or concept during training, it simply won't have that knowledge. Most large foundation models are trained on web-crawled data, which

Speaker A

brings some inherent problems. This data often contains clickbait, misinformation, toxic content, and fake news. To combat this, teams use various filtering techniques. For instance, OpenAI only used Reddit links with at least three upvotes when training GPT-2. The language

Speaker A

distribution in training data is also heavily skewed. About half of all crawled data is in English, which means languages with millions of speakers are often underrepresented. This is why specialized models for specific languages and domains are becoming increasingly

Speaker A

important. Also, the distribution of domains in one of the main training data sets leans heavily towards business, tech news, and art. In terms of model architecture, most foundation models use Transformer architectures based on the attention mechanism. But to understand

Speaker A

why Transformers were such a breakthrough, we need to look at what came before. Transformers were invented to solve the problems of sequence-to-sequence models, which used recurrent neural networks for tasks like translation. These had two main components: an encoder that processes

Speaker A

inputs and a decoder that generates outputs. Both worked sequentially, token by token. The problem is that the decoder only has access to a compressed representation of the entire input. Imagine trying to answer detailed questions about a book when all you have

Speaker A

is a brief summary. Also, input processing and output generation are done sequentially, so it's slow for long sequences. Transformers solved this with the attention mechanism, which allows the model to weigh the importance of different input tokens when generating

Speaker A

each output token. It's like being able to reference any page in the book while answering questions. Plus, Transformers can process input tokens in parallel, making them much faster during inference. Transformer inference works in two steps: first, prefill, process all the input tokens in

Speaker A

parallel to create the intermediate state; and second, decode, generate one output token at a time. The attention mechanism uses three types of vectors: first, query vectors, which represent what information the model is looking for; next, key vectors, like indices of

Speaker A

previous tokens; and finally, value vectors, the actual content of the previous tokens. The model computes how much attention to give each input token by comparing the Q and K vectors. A high similarity score means that the token's content V will heavily influence the

Speaker A

output. This is why longer context windows are computationally expensive. More tokens mean more K and V vectors to compute and store. Attention is almost always multi-headed, allowing the model to focus on different groups of tokens simultaneously. In Llama 2-7B, there are 32

Speaker A

attention heads, for example. A complete Transformer consists of multiple Transformer blocks, each containing an attention module and a feedforward neural network module. The number of blocks is often called the number of layers. Before the first block, there's an embedding

Speaker A

module that converts tokens and their positions into vectors, and finally an unembedding layer that maps output vectors to token probabilities. So that's a super high-level look at this. I would really recommend either reading the book or checking out StatQuest for an awesome

Speaker A

overview of Transformers and the attention mechanism. I'll link that in the description. That's really how I learned. While Transformers dominate, they're not the only architecture. Models like RWKV, which combines RNN-based approaches with parallelization capabilities, are gaining traction for

Speaker A

certain applications. In general, larger models with more parameters have greater capacity to learn and perform better. The number of parameters helps us estimate the compute resources needed for training and inference as well. However, note the parameter count can be

Speaker A

misleading with sparse models, meaning those with a large share of zero-valued parameters, which can be more efficient. A large sparse model might require less compute than a smaller dense one. When designing models, compute is often the limiting factor. The Chinchilla scaling law helps calculate

Speaker A

the optimal model size and data size for a given compute budget. It suggests that the number of training tokens should be about 20 times the model size. So a 3 billion parameter model needs about 60 billion training tokens. While the cost

Speaker A

for achieving the same model performance is decreasing over time, the cost for improvements remains high. Going from a 3% to a 2% error rate might require an order of magnitude more data, compute, or energy. But even small performance

Speaker A

improvements can make a huge difference for downstream applications. As we keep scaling models, we're approaching two significant bottlenecks. First, training data. There's concern we will run out of high-quality internet data in the next few years, forcing models to train on AI-generated

Speaker A

content, potentially causing performance degradation or requiring access to proprietary data like copyrighted books and medical records. Second, electricity. Data centers already consume 1 to 2% of global electricity, limiting how much larger they can grow without significant energy breakthroughs.
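To make the Chinchilla rule of thumb quoted above concrete, here is a minimal arithmetic sketch. The 20-tokens-per-parameter ratio comes from the video; the `6 * parameters * tokens` estimate of training compute is a commonly used approximation that the video does not state, so treat it as an added assumption. The function names are made up for illustration.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in the video


def chinchilla_tokens(n_params: int) -> int:
    """Rough compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def approx_training_flops(n_params: int, n_tokens: int) -> int:
    """Assumption: the common '6 * parameters * tokens' estimate of
    training compute. This formula is not stated in the video."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    for n_params in (3_000_000_000, 7_000_000_000, 70_000_000_000):
        tokens = chinchilla_tokens(n_params)
        flops = approx_training_flops(n_params, tokens)
        print(f"{n_params / 1e9:.0f}B params -> {tokens / 1e9:.0f}B tokens, "
              f"~{flops:.2e} training FLOPs")
```

For the 3-billion-parameter example from the video, this gives 60 billion training tokens, matching the figure quoted there.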

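Stepping back to the attention mechanism described earlier in this section, the query, key, and value idea can be sketched in plain Python. This is a toy, single-head, single-query version under stated assumptions: the vectors are made up, the scaling by the square root of the vector size follows the standard Transformer formulation rather than anything said in the video, and real models learn Q, K, and V through trained projection matrices that are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query attention over a list of key and value vectors."""
    scale = math.sqrt(len(query))
    # Compare the query with every key (dot product): one score per token.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors according to the attention weights.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


if __name__ == "__main__":
    query = [1.0, 0.0]                              # what we are looking for
    keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # one key per earlier token
    values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]  # content of those tokens
    weights, output = attend(query, keys, values)
    print("weights:", [round(w, 3) for w in weights])
    print("output:", [round(x, 3) for x in output])
```

The token whose key is most similar to the query receives the largest weight, so its value dominates the output. Every extra token in the context adds another key and value to compare and store, which is the cost of longer context windows mentioned above.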
Speaker A

Pre-trained foundation models face two main issues: they're optimized for text completion, not conversation, and their outputs can be factually incorrect or ethically problematic. Post-training aims to address these issues through two main steps. First, supe...

Speaker A

model for conversations instead of completion. This requires high-quality instruction data showing the kinds of requests the model should handle and how it should respond. It's essentially teaching the model what good responses look like. Second, preference fine-tuning.

Speaker A

Preference fine-tuning aligns the model with human values using reinforcement learning, often called reinforcement learning from human feedback (RLHF). This involves training a reward model that scores outputs based on human preferences and optimizing the foundation model to generate responses

Speaker A

that maximize these scores. While reinforcement learning from human feedback has been the standard approach, newer methods like direct preference optimization (DPO) are gaining traction. Some companies even skip the reinforcement learning step entirely, instead generating multiple outputs and selecting those with high reward model

Speaker A

scores. This is a strategy called best of N. Foundation models don't just produce a single definitive answer; they generate probabilities for possible outputs. How we sample from these probabilities dramatically affects the model's responses. The simplest approach is greedy sampling, always picking the

Speaker A

highest-probability token, but this leads to repetitive, predictable text. To introduce creativity, we use sampling techniques. Temperature controls how confident the model is in its predictions. Higher temperature values like 0.7 to 1 make outputs more creative but potentially less accurate, while

Speaker A

lower temperatures close to zero make outputs more deterministic and focused. Top-k sampling restricts the model to choosing from only the k most likely next tokens, typically between 50 and 500 depending on how diverse you want the responses to be. Top-p sampling selects

Speaker A

the smallest set of tokens whose cumulative probability exceeds a threshold p. A value of 0.9 means the model will only consider tokens that together make up 90% of the probability mass. This probabilistic nature explains many of the behaviors we see in foundation

Speaker A

models, like inconsistency with minor input changes and hallucinations, where models confidently state incorrect information. Now that we understand foundation models a little more, let's talk about one of the most crucial yet underappreciated aspects of AI engineering: evaluation. For some

Speaker A

applications, figuring out evaluation can consume the majority of your development effort. It's how you mitigate risks, uncover opportunities, and gain visibility into where your system is failing. Evaluating AI systems is significantly harder than evaluating traditional ML models for several reasons. First, the

Speaker A

problems these models solve are often inherently complex. Evaluating a mathematical proof or the quality of a summary requires deep expertise. You might need to read an entire book just to judge if a summary captures the key points correctly. Second, tasks are

Speaker A

typically open-ended, with many possible correct responses. Unlike classification, where there's one right answer, a question like "write me a poem about resilience" has countless valid responses. Third, foundation models are black boxes. You can only evaluate them by observing

Speaker A

their outputs, not by understanding their internal workings. Fourth, publicly available evaluation benchmarks quickly become saturated, which is when models achieve perfect scores. What was a challenging test yesterday becomes an easy exercise today. And finally, for general-purpose models, you need to

Speaker A

evaluate not just known tasks but discover new capabilities that might extend beyond human abilities. All of this is made worse by a general underinvestment in evaluation compared to model development. So let's start with some fundamental metrics used to

Speaker A

evaluate language models during training. Most autoregressive language models are trained using cross-entropy or its relative, perplexity. These metrics essentially measure how well the model predicts the next token in a sequence. Entropy measures how much information, on

Speaker A

average, a token carries. The higher the entropy, the more information-dense each token is and the more unpredictable the language. If you can perfectly predict what I'll say next, what I say carries no new information. Language models learn

Speaker A

the distribution of their training data. The better a model learns this distribution, the better it becomes at predicting what comes next, resulting in lower cross-entropy. A perfectly trained model would achieve cross-entropy equal to the entropy of the training

Speaker A

data itself, and the KL divergence between the two will be zero. Perplexity is simply the exponential of cross-entropy. It measures the amount of uncertainty a model has when predicting the next token. Higher perplexity means that there are more possible options the

Speaker A

model is considering. What counts as good perplexity depends entirely on the data. More structured data has lower expected perplexity because it's more predictable. The larger the vocabulary, the higher the perplexity, because there are more possible options. And the longer the

Speaker A

context length, the lower the perplexity tends to be. While perplexity is useful for guiding training and serves as a proxy for a model's general capabilities, it becomes less reliable for models that have undergone significant post-training with SFT or RLHF. As models get better at

Speaker A

completing tasks, they might actually get worse at predicting the next token in a statistical sense. Perplexity can also be used to detect if a text was in a model's training data, because the model would be unusually good at predicting those

Speaker A

tokens, and to identify nonsensical text, which would have abnormally high perplexity. For some tasks, we can perform exact evaluation, where there's no ambiguity about the correct answer, like multiple-choice questions. This is in contrast to subjective evaluation, like

Speaker A

grading an essay. The gold standard here is functional correctness: evaluating whether the system performs its intended functionality. For example, if I ask a model to book a restaurant reservation, did it make the correct reservation? This is the ultimate metric for any

Speaker A

application, though it's not always clear how to measure it. In coding tasks, functional correctness translates to execution accuracy: does the code run and produce the expected output? For gaming bots, we can measure objective performance metrics like win rates. When

Speaker A

reference data is available, we can evaluate outputs by comparing their similarity to this ground truth. This approach is bottlenecked by how much and how fast reference data can be generated, either by humans or AI. There are three main ways to compare outputs to

Speaker A

references. First, exact match: a binary measure that works for simple questions with definitive answers, like "Who was the first woman to win the Nobel Prize?" Second, lexical similarity: a continuous measure of how much the tokens overlap between the output and reference.

Speaker A

This can use techniques like edit distance (how many changes are needed to transform one text into another) or n-gram overlap metrics like BLEU and ROUGE. The drawback is that you need a comprehensive set of reference responses, and the references themselves can be

Speaker A

wrong. Plus, higher lexical similarity doesn't necessarily mean a better response; there are many ways to express the same idea. Third, semantic similarity: this is a continuous measure of whether two texts have the same meaning, regardless of the specific words used.
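The three comparison styles just listed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ngram_overlap` is a simplified overlap ratio, not an implementation of BLEU or ROUGE, and the semantic case assumes the texts have already been turned into embedding vectors by some embedding model, so the vectors in the usage example are made up. All function names are hypothetical.

```python
import math
from collections import Counter


def exact_match(output: str, reference: str) -> bool:
    """Binary check after trimming whitespace and lowercasing."""
    return output.strip().lower() == reference.strip().lower()


def ngram_overlap(output: str, reference: str, n: int = 2) -> float:
    """Share of the output's n-grams that also occur in the reference.
    A simplified lexical-similarity score, not BLEU or ROUGE."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    out, ref = ngrams(output), ngrams(reference)
    total = sum(out.values())
    if total == 0:
        return 0.0
    matched = sum((out & ref).values())  # counts clipped to the reference
    return matched / total


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(exact_match("Marie Curie", " marie curie "))
    print(ngram_overlap("the cat sat on the mat", "the cat lay on the mat"))
    # Made-up vectors standing in for the output of a real embedding model.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

Two texts that say the same thing in different words would score low on `ngram_overlap` but can still score high on cosine similarity, provided the embedding model places them close together.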

Speaker A

Semantic similarity is typically implemented by comparing text embeddings using metrics like cosine similarity. The advantage is that it doesn't require as comprehensive a set of references, but it does depend on the quality of the underlying embedding algorithm. One of the most powerful and common methods for

Speaker A

evaluating AI models in production is using another AI model as a judge. These AI judges are fast, easy to use, and relatively cheap compared to human evaluators. They can work without reference data and can judge attributes like correctness, toxicity, hallucinations,

Speaker A

and more. Studies have shown that AI judges can correlate strongly with human evaluators, sometimes showing higher agreement than between different human judges. They can also explain their decisions, which helps with transparency. You can use AI judges to score outputs,

Speaker A

compare outputs to references, or pick the better of two responses. Since language models are generally better with text than numbers, AI judges tend to perform better with classification tasks than numerical scoring. When creating prompts for AI judges, you need to include the

Speaker A

evaluation task, criteria, and scoring system. Few-shot examples generally work better than zero-shot, which we'll talk about later in the prompt engineering section, though longer prompts do increase costs. Interestingly, you don't always need your strongest model as the

Speaker A

judge. Specialized smaller models can often perform evaluation tasks effectively, which helps reduce costs and latency. However, AI judges have limitations. Like all AI applications, they're probabilistic: the same judge given the same input can produce different scores if prompted differently or simply

Speaker A

run twice. This makes evaluation results harder to reproduce or trust. Additionally, metrics aren't standardized across different systems: one system's definition of faithfulness might differ from another's. Models also exhibit biases. They might prefer responses from the same model, which is

Speaker A

called self-bias; favor the first answer in a comparison, which is position bias; or prefer lengthier answers, which is verbosity bias. You can mitigate these biases through techniques like randomizing the order of responses, but this also increases costs. Now that we understand evaluation, let's

Speaker A

tackle one of the most crucial decisions in AI engineering: model selection. With the increasing number of readily available foundation models, the challenge isn't developing models but selecting the right one for your application. During application development, you'll go through model

Speaker A

selection multiple times as you progress through different adaptation techniques. For instance, when doing prompt engineering, you might start with the strongest model to evaluate feasibility, then work backwards to see if smaller models would suffice. If you decide to

Speaker A

fine-tune, you might start with a small model to test your code before moving to a larger one. The selection process typically involves two key steps: first, finding the best achievable performance on the task, and second, mapping models along a cost-performance axis and

Speaker A

choosing the model that gives the best performance for your budget. Your criteria for evaluating a model can be organized into four buckets. First, domain-specific capabilities: how well does the model understand your specific domain? For example, if you're summarizing legal

Speaker A

documents, how well does it understand legal terminology? Second, general capabilities: how coherent, faithful, or factually consistent are the outputs? Third, instruction-following capabilities: does the model follow the format and structure you requested? And fourth, cost and latency: how expensive is the model to run, and

Speaker A

how quickly does it respond? Sometimes, rather than evaluating absolute quality, you just need to determine which model is best for your use case. This can be done through pointwise evaluation, where you score each model independently, or comparative evaluation, where you

Speaker A

directly compare outputs. When evaluating models, you also need to differentiate between hard attributes and soft attributes. Hard attributes are impossible or impractical to change. These include license restrictions, training data composition, model size, privacy requirements, and the level of

Speaker A

control you need. These are often determined by the model providers or your own internal policies, and they can significantly limit your pool of options. Soft attributes, on the other hand, can be improved through adaptation techniques like prompt engineering or fine-tuning.
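The selection logic described so far, screening on hard attributes and then taking the best performance that fits the budget, might be sketched as follows. Everything here is hypothetical: the `Candidate` fields, the model names, the scores, and the costs are invented for illustration, and a real selection process would weigh more criteria than a single score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    license_ok: bool     # hard attribute: can we legally use it?
    score: float         # task metric from your own evaluation, higher is better
    cost_per_1k: float   # cost per 1,000 requests, in dollars


def select_model(candidates: List[Candidate],
                 budget_per_1k: float) -> Optional[Candidate]:
    """Drop models that fail the hard constraints or exceed the budget,
    then return the best-scoring survivor (None if nothing qualifies)."""
    eligible = [c for c in candidates
                if c.license_ok and c.cost_per_1k <= budget_per_1k]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.score)


if __name__ == "__main__":
    # Invented candidates; none of these refer to real models or prices.
    pool = [
        Candidate("large-api-model", True, 0.92, 40.0),
        Candidate("small-open-weight", True, 0.85, 4.0),
        Candidate("restricted-license", False, 0.95, 2.0),
    ]
    choice = select_model(pool, budget_per_1k=10.0)
    print(choice.name if choice else "no model fits the constraints")
```

Raising the budget changes the answer without touching the hard-attribute filter, which mirrors the idea of fixing the non-negotiables first and trading off cost against performance second.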

Speaker A

Soft attributes include things like accuracy, toxicity, and factual consistency. A high-level workflow for model selection looks like this. First, filter out models whose hard attributes don't work for you. Then use publicly available information like benchmark performance to narrow down to

Speaker A

the most promising candidates. Third, run your own experiments to find the best model given all of your objectives. Fourth, continually monitor your chosen model in production to detect failures and collect feedback. Most companies won't build foundation models from

Speaker A

scratch, so another question is whether to use commercial model APIs or host an open-source model yourself. Let's clarify some terminology first. Originally, open source meant any model you could download and use, but some argue that a model should only be considered truly

Speaker A

open source if its training data is also publicly available. This allows for more flexible usage, like retraining from scratch with modifications. Models with open weights but closed training data are sometimes called open-weight models, while those with both open weights and

Speaker A

open data are open models. So most so-called open-source models are actually just open-weight. These models also come with different licenses that may restrict commercial use or limit how you can use the model's outputs for training other models. For a model to be

Speaker A

accessible to users, a machine needs to host and run it. The service that hosts the model and handles queries is often called the inference service, while the interface the users interact with is the model API. After creating a model,

Speaker A

developers can choose to open-source it, make it accessible via an API, or both. Typically, model providers open-source their weaker models and keep their best ones behind paywalls. Whether to host a model yourself or use a model API

Speaker A

depends on several factors. First, data privacy: if your company has strict data privacy policies that prevent sending data outside the organization, externally hosted model APIs are not an option. There's also the risk that API providers might use your data to train their

Speaker A

models. Next, data lineage and copyright: most models aren't transparent about their training data, and intellectual property laws around AI are still evolving. It's unclear whether using a model trained on copyrighted data could create legal issues for your product.

Speaker A

Next, performance: the gap between open-source and proprietary models is closing, but the strongest models will likely remain proprietary. Commercial APIs often provide additional capabilities out of the box, like scalability, function calling (for example, accessing external tools),

Speaker A

structured outputs, and output guardrails. These can be challenging to implement yourself, so many companies turn to API providers. However, this means you'll be restricted to their functionality: you might not be able to fine-tune or access log probabilities,

Speaker A

for example. Typically, proprietary models are easy to start with and scale, but they can become expensive with heavy usage and offer less flexibility. It's wise to design your application with a standard internal API so you can easily swap between models if needed. Control is

Speaker A

another consideration. What happens if your API provider goes out of business, changes their terms of service, or is banned in certain regions? And if you want to run a model on device, third-party APIs aren't an option. There are numerous

Speaker A

benchmarks for different use cases, and a tool that helps you evaluate a model on multiple benchmarks is called an evaluation harness. For example, OpenAI Evals lets you run any of around 500 existing benchmarks to evaluate their

Speaker A

models. When using public leaderboards, you need to consider which benchmarks to include in your aggregated ranking, how to weigh different benchmarks, and how to handle benchmarks that use different metrics like accuracy, F1, BLEU, etc. Keep in mind that the goal is to select a

Speaker A

small subset of models for more rigorous testing with your own benchmarks and metrics. Public benchmarks rarely represent your application's needs perfectly, and they may suffer from data contamination, which is when the models were trained on the same data they're

Speaker A

being evaluated on. To deal with contamination, you first need to detect it using heuristics like n-gram overlap and perplexity. If perplexity on the evaluation data is unusually low, it's possible the model has seen this data during training. Once you've narrowed

Speaker A

down your model candidates, you need a robust evaluation pipeline. Evaluate both the end-to-end output and each component's intermediate outputs independently. You can use something called turn-based evaluation, where you assess the quality of each output, and task-based evaluation, where you measure whether the system

Speaker A

completes a task and how many turns it takes. First, think about what makes a good response: factors like relevance, factual consistency, and safety. Then create test queries and generate multiple responses to see how models perform. Develop detailed rubrics with

Speaker A

examples for your scoring system. Whether you use binary scores, continuous scales, or something else depends on your data and your needs. The key is to make your rubric unambiguous so that human evaluators can follow it consistently. Most importantly, tie your evaluation

Speaker A

metrics to business metrics. If your customer support chatbot's factual consistency is 80%, what does that mean for the business? Perhaps you can automate 30% of customer support requests at that level, but at 90% consistency you could automate 50%. This

Speaker A

lets you quantify the business impact of model improvements. You'll also need to establish a usefulness threshold: for instance, your chatbot must be 90% factually consistent to be viable in production. Different criteria might require different evaluation methods. You might use a specialized toxicity

Speaker A

classifier, semantic similarity metrics to measure relevance, and an AI judge to assess factual consistency. You can even mix and match evaluation methods for the same criteria. For example, maybe use a cheap classifier on all your data and an

Speaker A

expensive AI judge on just 1% for high-quality signals. While automated metrics are preferable for scale, don't hesitate to include human evaluation, even in production. Just do it on a subset of data to keep costs manageable. It's also

Speaker A

crucial to evaluate your application on different slices of data or users to ensure it performs well across segments and avoids biases. This helps you identify areas for improvement and guard against Simpson's paradox, where a model performs better on aggregate but worse on each

Speaker A

individual subset. How much evaluation data you need depends on your application and methods. Generally, you want enough to be reliable but not so much that costs become prohibitive. A good way to test reliability is to create multiple bootstrap samples of

Speaker A

your evaluation set and see if they yield similar results. If you get 90% on one bootstrap but 70% on another, your evaluation pipeline isn't trustworthy. Finally, evaluate the reliability of your pipeline itself. First, is it getting signals right: do better responses indeed

Speaker A

get higher scores? Next, do better evaluation metrics lead to better business outcomes? Third, how reliable is the pipeline: if you run it twice, do you get the same results? Fourth, how correlated are your metrics? You don't need two metrics if they're perfectly

Speaker A

correlated, but completely uncorrelated metrics might indicate problems. And finally, what cost and latency does your evaluation pipeline add to your application? Model selection remains one of the hardest but most important topics in AI engineering. With the rapidly growing number of foundation models

Speaker A

available, your challenge isn't developing models but selecting the right one for your specific needs, balancing performance, cost, privacy, and control. Now let's dive into what might be the most accessible yet surprisingly nuanced aspect of AI engineering: prompt engineering. If you've ever used ChatGPT,

Speaker A

you've already done some form of prompt engineering, but there's much more to it than just typing questions. Prompt engineering refers to the process of crafting instructions that guide a model to generate your desired outcome. It's the easiest and most common model

Speaker A

adaptation technique because, unlike fine-tuning, it doesn't change the model's weights. You're just telling the model what you want it to do. While it's the most accessible entry point to AI engineering, don't be fooled into thinking that it's simplistic. Effective

Speaker A

prompt engineering requires the same experimental rigor as any machine learning task. You should extract maximum value from prompting before moving to more resource-intensive techniques like fine-tuning. That said, understanding prompt engineering alone isn't enough for production-ready systems. You'll

Speaker A

still need knowledge of statistics, engineering, and classical ML for experiment tracking, evaluation, and dataset curation. Prompts typically consist of one or more of these components. First, the task description. This includes the model's role and expected output format.

Speaker A

For example: "You are a helpful medical assistant. Analyze the following symptoms and suggest possible conditions, listing them in order of likelihood." Next, examples. These show the model how to perform the task. For instance, if you want a model to classify text as toxic

Speaker A

or non-toxic, you might include examples of each. Third, the concrete task. This is the specific job you want the model to do, like answering a question or summarizing a book. How much prompt engineering you need depends on the

Speaker A

model's robustness to prompt perturbation. A robust model shouldn't produce dramatically different outputs if you write the number as the digit "5" versus spelling it out as "five". This robustness is strongly correlated with a model's overall capability. It's also worth noting that different models have

Speaker A

different preferred prompt structures. For example, GPT-4 typically performs better when the task description is at the beginning of the prompt, while Llama 3 does better when the task appears at the end. Teaching models what to do via

Speaker A

prompts is known as in-context learning. Each example in your prompt is called a shot, so we get the terms few-shot, zero-shot, and one-shot learning. How many examples you need depends on both the model and your application, so

Speaker A

experimentation is necessary. The number of examples you can include is limited by the model's context length and, for API models, your cost constraints. Many modern models distinguish between system and user prompts. The system prompt contains the task description,

Speaker A

telling the model what role to play, its goals, and its constraints. The user prompt contains the specific task or query. Almost all applications, like ChatGPT, have system prompts, usually created by the application developers rather than end users. These system and user prompts are

Speaker A

combined using a template that can vary between models and versions. If you use the wrong template, you might experience unexpected performance issues. Even small mistakes, like an extra newline, can cause problems. When constructing inputs, make sure to follow the model's chat

Speaker A

template exactly. This is especially important if you're using third-party tools to construct prompts, as template mismatches often lead to silent failures. Models typically understand instructions better when they appear at the beginning or end of the prompt

Speaker A

rather than buried in the middle. Let's go through some key strategies for effective prompt engineering. First, write clear and explicit instructions. If you want a model to score an essay, explain the scoring system you want it to use.

Speaker A

Should it allow fractional scores? What should it do if it can't determine an answer? Be specific to reduce ambiguity. Second, ask the model to adopt a persona. Asking a model to respond as a particular character or expert can

Speaker A

significantly change its output style and focus. For example, "respond as an experienced pediatrician" or "answer as if you were explaining it to a 10-year-old". Third, provide examples. Examples can dramatically shift a model's response style. For instance, asking whether Santa will

Speaker A

bring me presents, without examples, might get a straight "no, Santa is fictional" response. But if you provide an example of a whimsical answer about the Tooth Fairy, the model is more likely to play along. Fourth, specify the output format.

Speaker A

Tell the model exactly how you want the response structured. This might mean requesting things like no preambles, so none of this "Based on the content of this essay, I'd give it a score of..." You can also ask for specific

Speaker A

formats like JSON or Markdown, and particular sections or headings. Fifth, break complex tasks into simpler subtasks. This not only improves performance but also makes monitoring, debugging, and parallelization easier. However, it can increase the latency perceived by users if they don't see the

Speaker A

intermediate outputs. You can also use cheaper models for simpler steps to reduce cost. Sixth, give the model time to think. Several techniques can improve model reasoning: chain-of-thought prompting, so "think this through step by step"; process instructions, so something

Speaker A

like "first analyze the key themes, second identify the author's perspective", and so on; and self-critique, where you ask the model to check its own work. These approaches generally improve quality but increase latency and token usage. Seventh, iterate systematically. This is so important.

Speaker A

Different techniques work better for different models, so experimentation is crucial. Always version your prompts, and use an experiment tracking tool with standardized evaluation metrics and data. Also, separate prompts from code: store them in configuration files rather than hardcoding them. This will make it way

Speaker A

easier to update. Various tools aim to automate the prompt engineering workflow, including OpenPrompt and DSPy. These tools let you specify input and output formats, evaluation metrics, and evaluation data. Then, essentially, they perform AutoML to find the optimal

Speaker A

prompts. However, these tools can be expensive if they make many API calls under the hood. They also might produce prompts with typos or other issues, and they may not keep up with changing model requirements. For these reasons, it's best

Speaker A

to start with manual prompt engineering before moving to automated tools. You can also use AI models themselves to write and refine prompts. Once your application is available to users, it may face attacks from malicious actors trying to exploit it. Three main types of prompt

Speaker A

attacks include, first, prompt extraction, where attackers try to extract your system prompt to either replicate or exploit your application. Second, jailbreaking and prompt injection. These attacks attempt to subvert the model's safety features or get it to perform unauthorized actions, like providing

Speaker A

instructions for harmful activities or executing dangerous code. And third, information extraction. These attacks try to get the model to reveal sensitive information from its training data or context. To defend against these attacks, consider the following strategies. Use benchmarks to evaluate safety against

Speaker A

adversarial attacks. Conduct security red teaming to proactively find weaknesses. Be explicit in your prompts about what information the model should not return. Repeat the system prompt before and after user inputs to remind the model of its constraints. Design systems with

Speaker A

safety boundaries, like running generated code only in isolated environments. Require human approval for potentially impactful actions. Define out-of-scope topics for your application. Use anomaly detection to identify unusual prompts, and implement guardrails on both inputs and outputs. When evaluating your system

Speaker A

security, track both the violation rate, so how often attacks succeed, and the false refusal rate, how often the model incorrectly refuses legitimate requests. You need to balance these metrics. Perfect security with too many false refusals creates a really frustrating

Speaker A

user experience. By approaching prompt engineering with this combination of creativity and rigor, you can extract remarkable performance from foundation models without the complexity and expense of fine-tuning. Remember that small changes in your prompts can lead to significant improvements in output

Speaker A

quality, so experiment widely and measure carefully. Now that we've covered prompt engineering, let's explore how to give foundation models access to information beyond what they were trained on. To solve a task effectively, a model needs two things: instructions on how to

Speaker A

perform the task and the necessary information to complete it. Two dominant patterns have emerged for providing models with the information they need: retrieval-augmented generation, or RAG, and the agentic pattern. RAG allows models to retrieve relevant information from external data sources, while the

Speaker A

agentic pattern enables models to use tools like web search and APIs to gather information actively. While RAG is primarily used for context construction, the agentic pattern can do much more. Let's start with RAG. So what is RAG? Retrieval-augmented generation is a

Speaker A

technique that enhances a model's generation capabilities by retrieving relevant information from external memory sources. These sources could be an internal database, a user's previous chat sessions, or even the internet. You can think of RAG as a technique to construct

Speaker A

context specific to each query, connecting the model with information it wasn't trained on or might have forgotten. A RAG system consists of two main components: a retriever that fetches the information from the external memory source, and a generator, the foundation

Speaker A

model that produces a response based on the retrieved information. In today's RAG systems, these components are often trained separately, with many teams using off-the-shelf retrievers and models. However, fine-tuning the entire RAG system end to end can significantly

Speaker A

improve performance. The success of a RAG system heavily depends on its retriever. A retriever performs two main functions: indexing and querying. Indexing involves processing data so that it can be retrieved quickly later. This is the preparatory step where you organize your

Speaker A

knowledge base. Querying is the process of sending a search query to retrieve data relevant to it. How you index your data determines how you retrieve it later. Let's walk through a simple example. Imagine your external memory as a database of documents, like contracts

Speaker A

or meeting notes. These documents can range from 10 tokens to a million tokens in length. Naively retrieving whole documents would make your context arbitrarily long, potentially exceeding the model's context window. To avoid this, you typically split each document into

Speaker A

smaller chunks, which we'll discuss later. For each user query, your goal is to retrieve the data chunks most relevant to that query. Then, with some post-processing to join the retrieved chunks with the user's prompt, you get the final prompt that goes to the model. Many

Speaker A

existing retrieval algorithms can be used for RAG. Retrieval works by ranking documents based on their relevance to a given query, and algorithms differ in how they compute these relevance scores. First, term-based retrieval. This is also called lexical retrieval, and this approach

Speaker A

finds relevant documents based on keywords. While this is straightforward, it has several limitations. Many documents might contain a term without truly being about it, and queries can be long, with many terms that aren't equally important. TF-IDF can help address this.

Speaker A

Also, simple tokenization can miss semantic relationships. Term-based retrieval is generally faster than embedding-based approaches during both indexing and querying. It also works well out of the box with existing systems like Elasticsearch. Embedding-based retrieval is another option. This

Speaker A

approach computes relevance at the semantic level rather than the lexical one, ranking documents based on how closely their meaning aligns with the query. The process works like this. Convert your original data to embeddings using an embedding model. Store these embeddings

Speaker A

in a vector database. When a query comes in, convert it to an embedding using the same model. Fetch the k data chunks whose embeddings are closest to the query embedding and return them. Vector search is typically framed as a k-nearest

Speaker A

neighbor search problem. This can be computationally expensive for large datasets, so approximate nearest neighbor algorithms are often used instead. In practice, most developers won't implement vector search themselves but will use existing vector databases. These databases organize vectors into buckets,

Speaker A

trees, or graphs, using various heuristics to increase the likelihood that similar vectors are stored close to each other. Embedding-based retrieval can significantly outperform term-based retrieval over time, especially if you fine-tune your embedding model and retriever. But it has its downsides. It

Speaker A

can make it harder to search for specific names or error codes, and generating embeddings can be expensive and introduce latency. A production retrieval system typically combines several approaches. For example, a cheaper, less precise retriever like term-based search might first fetch candidates, and

Speaker A

then a more precise but expensive mechanism like k-nearest neighbors finds the best options among those candidates. Depending on your task, certain tactics can increase the chance of retrieving relevant documents. The simplest chunking approach is to divide documents into chunks of

Speaker A

equal length based on characters, words, sentences, or paragraphs. Overlapping chunks can ensure that important boundary information is included in at least one chunk. Smaller chunk sizes allow for more diverse information, since you can fit more chunks into the model's

Speaker A

context, but this can also result in the loss of important context. Smaller chunks also increase computational overhead, especially for embedding-based retrieval. There's no universal best chunk size or overlap percentage. You just need to experiment based on your specific data

Speaker A

and task. The initial document rankings generated by the retriever can be further refined to be more accurate. This is especially useful when you need to reduce the number of retrieved documents due to context window limitations. Documents could be reranked based on

Speaker A

various factors, such as recency, so maybe you give more weight to newer data, or additional relevance signals. Next, let's talk about query rewriting, also known as query reformulation, normalization, or expansion. This technique involves rewriting queries to include necessary

Speaker A

context. For example, if a user asks "what's its population?" after previously asking about Paris, the query might be expanded to "what's the population of Paris?" Each chunk can be augmented with relevant context to make it easier to retrieve. This might include metadata

Speaker A

like tags and keywords, or, for e-commerce products, information like descriptions and reviews. You can also augment chunks with context from the full document to help them retain more of the original meaning, for example, a summary of the entire document.
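
The chunking-with-overlap and term-based ranking ideas described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`chunk_words`, `tf_idf_scores`, `retrieve`), the word-based chunk size, and the smoothed IDF formula are choices made here, not something specified in the talk, and a production system would more likely rely on an existing search engine or vector database.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def tf_idf_scores(query, chunks):
    """Score each chunk as the sum of term frequency * IDF over query terms."""
    tokenized = [Counter(chunk.lower().split()) for chunk in chunks]
    n = len(chunks)
    query_terms = set(query.lower().split())
    scores = []
    for counts in tokenized:
        score = 0.0
        for term in query_terms:
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term appears nowhere, so it cannot help ranking
            # Smoothed IDF (an assumption): rarer terms get a higher weight.
            idf = math.log((1 + n) / (1 + df)) + 1.0
            score += counts[term] * idf
        scores.append(score)
    return scores


def retrieve(query, chunks, k=2):
    """Return the k chunks with the highest term-based score."""
    scores = tf_idf_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```

Calling `chunk_words(document, chunk_size=200, overlap=20)` and then `retrieve(query, chunks, k=3)` gives the top chunks to join with the user's prompt. The chunk size and overlap are exactly the knobs the talk says you have to tune by experiment.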

Speaker A

When choosing a retrieval solution, consider, first, what retrieval mechanisms it supports: term-based, embedding-based, or hybrid. For vector databases, what embedding models and vector search algorithms are supported? Also consider scalability, both for data storage and query traffic. You'll need to

Speaker A

think about indexing speed and batch processing capabilities, query latency, pricing structure, and compliance requirements as well. It's also important to note that RAG isn't limited to just text. It can also be used with multimodal and tabular data. For instance, if a user

Speaker A

asks "what's the color of the house in the Pixar movie Up?", a multimodal RAG system might first retrieve an image of the house to help the model answer. Similarly, RAG can work with tabular data using text-to-SQL conversion. The

Speaker A

system can execute a query on a database and then generate a response based on the results. For complex database schemas, you might need an intermediate step to predict which table to use for each query, especially if there are too many

Speaker A

tables to fit all the schemas in the context window. In the next part, we'll explore the agentic pattern, which goes beyond passive retrieval to actively interact with external tools and APIs. The agentic pattern is a more active approach to extending AI capabilities.
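
Before moving on, the tabular RAG flow described above (turn the question into SQL, execute it, answer from the results) can be sketched as follows. This is a minimal illustration, not a real text-to-SQL system: `generate_sql` is a hypothetical stand-in for a model call that returns a canned query, and the `sales` table and its columns are invented for the example.

```python
import sqlite3


def generate_sql(question):
    """Hypothetical stand-in for a model call that maps a question to SQL."""
    canned = {
        "total sales by region": (
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        ),
    }
    return canned[question.lower()]


def build_demo_db():
    """Create an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 100), ("south", 50), ("north", 25)],
    )
    return conn


def answer_question(conn, question):
    """Run the three steps: generate SQL, execute it, build a response."""
    sql = generate_sql(question)         # step 1: text to SQL
    rows = conn.execute(sql).fetchall()  # step 2: execute the query
    # Step 3: a real system would pass the rows back to the model;
    # here the response is just formatted directly.
    summary = ", ".join(f"{region}: {total}" for region, total in rows)
    return sql, f"Results for '{question}': {summary}"
```

Keeping the generated SQL in the return value makes it easy to log or review the query before trusting the answer.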

Speaker A

This is a rapidly evolving field, so consider this section more experimental than the others we've covered. At its broadest definition, an agent is anything that can perceive its environment and act upon it. For AI systems, this means that a model can observe its environment,

Speaker A

make decisions based on those observations, take actions that affect the environment, and learn from the outcomes of those actions. The environment is defined by the use case. For a game-playing agent, the game is the environment. For a web-scraping agent, the

Speaker A

internet is the environment. What makes agents powerful is the set of tools they have access to. For example, ChatGPT is an agent that can search the web, execute Python code, and generate images, among other capabilities. Remember our RAG

Speaker A

example with tabular data? That was actually a simple agent with three actions: generating SQL queries, executing those queries, and producing a response. Let's see how this works in practice. If a user asks "project the sales revenue over the next 3 months", the agent might

Speaker A

first reason about how to accomplish the task, then generate a SQL query to fetch historical sales data. Next, it would execute that query against the database, analyze whether the retrieved information is sufficient, possibly generate and execute additional queries, and then create a

Speaker A

projection based on the gathered data. Finally, it would conclude that the task has been successfully completed. Compared to simpler AI applications, agents require more powerful models because they often need to perform multiple steps to complete a task. The overall

Speaker A

success rate decreases with each step because of compounding errors, and the stakes are higher since agents have access to potentially powerful tools. Speaking of tools, agents can be equipped with various tools, which fall into several categories. First, knowledge

Speaker A

augmentation tools. These could be things like text or image retrievers as in RAG, SQL executors for database access, web search capabilities, APIs for accessing inventory systems, email readers, etc., and web browsers for navigating online content, whether public or private. Next,

Speaker A

we have capability extension tools, like calculators (since AI models often struggle with complex math), time zone or unit converters, translation services, and code interpreters. We also have write-action tools, so tools that enable the agent not just to read but also to write to systems.

Speaker A

These can automate workflows but require strong security protocols. Complex tasks require planning, and there are many possible ways to decompose a task. Not all approaches will be successful, and not all will be efficient. To help with debugging and to prevent cases where a

Speaker A

model executes unnecessary API calls, planning should be decoupled from execution. The process typically works like this: first, ask the agent to generate a plan; then validate the plan before execution; and only execute once it's validated. Plans can be validated

Speaker A

using heuristics, like removing plans with invalid actions or too many steps, or by using another AI model as a judge. You can even generate several plans in parallel and then ask an evaluator to pick the most promising one. For

Speaker A

particularly important or sensitive tasks, you might want a human in the loop to review plans before execution. While foundation model agents use the model itself as the planner, reinforcement learning agents are trained using reinforcement learning algorithms. This approach uses more resources than

Speaker A

foundation models but could offer performance improvements in the future. The simplest way to turn a model into a plan generator is through prompt engineering. You tell the model what functionality it has available and the expected inputs and outputs for each

Speaker A

tool. You can improve plan generation by writing better system prompts with more examples, providing clearer descriptions of tools and their parameters, simplifying functions as much as possible, using a stronger model, or fine-tuning a model specifically for plan generation. As a practical tip,

Speaker A

always ask the system to report what parameter values it uses for each function call. This provides a sanity check that can catch many issues before execution. Another useful approach is to generate plans in natural language first, then translate them to the exact

Speaker A

function calls in a second step. This helps if function names change over time or if you fine-tune a model specifically for plan creation. The translation can often be done by a smaller, cheaper model. Agents can fail in various ways, so it's

Speaker A

important to have robust evaluation methods. There are lots of different things that can go wrong. We could have planning failures, like using invalid tools, using valid tools but with invalid parameters, using valid tools with incorrect parameter values, or

Speaker A

failing to achieve the goal or satisfy constraints. To evaluate planning capability, create a dataset where each example is a tuple of task, available tools, and constraints. For each task, use the agent to generate multiple plans and compute metrics like the percentage of

Speaker A

generated plans that are valid, how many attempts it takes to get a valid plan, the percentage of tool calls that are valid, and how often invalid tools are called. You could also have tool failures, so that could include things like bad

Speaker A

translation from high-level plans to specific function names, no access to the required tools, or tools giving incorrect outputs, like poorly generated SQL queries. For efficiency, your metrics might be: how many steps does the agent need on average to

Speaker A

complete a task? What's the cost to complete a task? How long does each action typically take? Are there particularly slow or expensive actions? And how does the agent compare to baselines, which might be another agent or a human? One of the key challenges for

Speaker A

agents is remembering information over time. A memory system allows a model to retain and utilize information across interactions. AI models typically have three main memory mechanisms. There's the internal knowledge embedded in the model itself through training. There's the

Speaker A

context window, which is kind of your short-term memory for immediate, session-specific information. And finally, external data sources like RAG systems. This is kind of like your long-term memory. Information that is essential to all tasks should be incorporated via

Speaker A

training. Rarely needed information should reside in long-term memory, while short-term memory is for immediate, context-specific information. Benefits of a well-designed memory management system include storing information longer than the context window allows, persisting information between sessions, and making a

Speaker A

model more consistent in its responses and actions. By combining RAG for information access, tools for capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle increasingly sophisticated problems. While this field is still evolving rapidly, it represents one of

Speaker A

the most promising frontiers in AI engineering. As with all powerful technologies, agent systems require careful consideration of safety, security, and ethical use. The more capable an agent becomes, the more critical it is to ensure it operates within appropriate

Speaker A

boundaries and with proper oversight. Now let's explore fine-tuning, the process of adapting a model to a specific task by further training it and adjusting its weights. While prompt engineering and RAG are relatively lightweight techniques, fine-tuning offers deeper customization

Speaker A

but requires more resources and expertise. So when should you fine-tune? Fine-tuning can improve a model's performance in two ways: first, by enhancing domain-specific capabilities, like coding or answering medical questions, and second, by improving instruction-following abilities, like adhering to specific

Speaker A

output formats. However, fine-tuning requires significant upfront investment. It often needs more memory than what's available on a single GPU, making it expensive. This is why reducing memory requirements has become a primary motivation for many fine-tuning techniques that we'll discuss later. So

Speaker A

you should consider fine-tuning when you've already exhausted what you can achieve with prompt-based methods, you need to produce consistent structured outputs, and you're working with smaller models that need to perform better on specific tasks. A common approach is

Speaker A

model distillation: fine-tuning a small model to imitate a larger model's behavior using data generated by the large model. On specific tasks, a small fine-tuned model may outperform a larger general-purpose model. On the other hand, you should avoid fine-tuning if you need

Speaker A

a general-purpose model, since fine-tuning can improve performance on specific tasks but degrade performance on others, or if you're just starting to experiment with a project. Many teams jump straight to fine-tuning before thoroughly exploring simpler approaches. So what about

Speaker A

fine-tuning versus RAG? After you've maximized performance gains from prompting, choosing between RAG and fine-tuning depends on whether your model's failures are information-based or behavior-based. If the model fails because it lacks information, like private company data or recent events,

Speaker A

RAG gives the model better access to that information. If the model has behavioral issues (which I think is very funny to say), like outputs that are factually correct but irrelevant or in the wrong format, fine-tuning might help more. If your model has both

Speaker A

issues, start with RAG because it's easier. Begin with a simple term-based solution and evolve from there. In many cases, combining RAG and fine-tuning will give you the biggest performance boost. So the workflow to adapt a model to a

Speaker A

task might be: first, design evaluation criteria and an evaluation pipeline. Then try to get the model to perform the task with prompting alone. Add more examples to the prompt from there. At that point, if the model continues to have

Speaker A

information-based failures, try more advanced RAG, like embedding-based retrieval. If it continues to have behavioral issues, opt for fine-tuning. Finally, combine RAG and fine-tuning for a bigger performance boost. Because of the scale of foundation models, memory is

Speaker A

a major bottleneck for both inference and fine-tuning. The memory requirements for fine-tuning are typically much higher than for inference due to how neural networks are trained. Neural networks are typically trained using backpropagation. Each training step consists of a forward pass, where we

Speaker A

compute the output from the input, and a backward pass, where we update the model's weights using signals from the forward pass. During inference, only the forward pass is executed. During training, both passes are needed. The key contributors to a model's memory

Speaker A

footprint during fine-tuning are the total number of parameters, the number of trainable parameters, and the numerical representation of these parameters. A trainable parameter is one that can be updated during fine-tuning. So during pre-training, all model parameters are

Speaker A

updated; during inference, no parameters are updated; and during fine-tuning, some or all of the parameters may be updated. Parameters that remain unchanged are called frozen parameters. One way to reduce training memory is through gradient checkpointing, also called activation recomputation, where

Speaker A

activations aren't stored but recomputed as needed. This increases training time but reduces memory requirements. The key insight here is that the more trainable parameters we have, the higher the memory footprint. Reducing the number of trainable parameters reduces memory

Speaker A

requirements. This is the motivation behind parameter-efficient fine-tuning, which we'll talk about in a bit. Another way to reduce the memory footprint is through quantization: converting a model from a format with more bits to one with fewer bits. For a 13-billion-parameter

Speaker A

model using 32-bit floating point, each parameter requires 4 bytes, resulting in 52 GB total for the weights. If you reduce each value to 16 bits, the memory needed drops to 26 GB. Inference is typically done using as few bits as possible: 16, 8, or even

Speaker A

4 bits. Training is more sensitive to numerical precision, so it's usually done in mixed precision, with some operations in higher precision, like 32-bit, and others in lower precision, like 16-bit or 8-bit. Different numerical formats balance range, so the span of values that can be

Speaker A

represented, and precision, how exactly a number can be represented. There are a few different formats. Reducing precision can cause values to change or result in errors, so it's important to load models in their intended format. For example, when Llama 2 was released, its weights were

Speaker A

optimized for BF16, causing significantly worse quality when loaded with FP16. Now let's talk about PEFT. In the early days of smaller models, full fine-tuning, so updating all the model parameters, was common. This required a lot of high-quality annotated data and

Speaker A

substantial computational resources. As models grew, people started using partial fine-tuning, focusing on specific layers, like only the last layer. This reduces memory requirements, but it isn't very parameter-efficient. Parameter-efficient fine-tuning techniques insert additional parameters into strategic locations

Speaker A

in the model to achieve strong fine-tuning performance with a small number of trainable parameters. While this can increase inference latency slightly, as adapters add computational steps, PEFT methods are generally not only parameter-efficient but also sample-efficient. They can work with just a few

Speaker A

thousand examples, compared to the millions potentially needed for full fine-tuning. PEFT methods fall into two categories. We have adapter-based methods, also called additive methods, that add new model weights, and then we have soft-prompt-based methods

Speaker A

that introduce special trainable tokens. The most popular adapter-based method is LoRA, low-rank adaptation. Unlike traditional adapters, LoRA incorporates additional parameters without increasing inference latency. Instead of adding new layers, LoRA uses modules that can be merged back into the original layers.
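
As a rough numerical sketch of why low-rank adapters are cheap to train and free at inference time, the snippet below counts trainable parameters and merges a low-rank update into a frozen weight matrix. It assumes the convention used in this talk (an n by r matrix A and an r by m matrix B whose product is added to the original n by m weights), uses plain Python lists instead of a tensor library, and leaves out details such as any scaling factor on the update. The helper names are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def lora_param_counts(n, m, r):
    """Return (full, low_rank) trainable parameter counts for an n x m matrix."""
    return n * m, r * (n + m)


def merge(weights, a, b):
    """Fold the low-rank update into the frozen weights: W + A @ B."""
    return add(weights, matmul(a, b))


if __name__ == "__main__":
    full, low_rank = lora_param_counts(n=4096, m=4096, r=8)
    print(f"full: {full:,} trainable values, rank-8 update: {low_rank:,}")

    W = [[1, 2], [3, 4]]   # frozen original weights (n = 2, m = 2)
    A = [[1], [2]]         # n x r, with rank r = 1
    B = [[10, 20]]         # r x m
    x = [[1, 1]]           # one input row

    # The merged layer gives the same output as base plus adapter,
    # which is why merging adds no extra inference step.
    merged_out = matmul(x, merge(W, A, B))
    separate_out = add(matmul(x, W), matmul(matmul(x, A), B))
    print(merged_out == separate_out)
```

The parameter count is the part that drives memory savings during fine-tuning, while the merge step is what avoids the extra latency that traditional adapters add.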

Speaker A

LoRA works by decomposing weight matrices into products of smaller matrices, then updating only these smaller matrices. For a weight matrix with dimensions n by m, LoRA first chooses a smaller dimension r, the rank, then creates two matrices: A, which is n

Speaker A

by r, and B, which is r by m. During fine-tuning, only A and B are updated, while the original weights remain frozen. For inference, A and B can be multiplied together and added to the original weights. The efficiency of LoRA depends

Speaker A

both on the chosen rank and on which matrices it's applied to. It's primarily used for Transformer models, in the attention modules. If you want to fine-tune a model for multiple tasks, you have several options. First, simultaneous fine-tuning: training on a dataset with

Speaker A

examples from all tasks at once. This is harder and requires more data. Or you could do sequential fine-tuning, where you first train on task A and then on task B, but this can cause catastrophic forgetting, where the model loses its

Speaker A

ability on earlier tasks. Or you can try model merging, where you fine-tune on different tasks separately, then combine the resulting models. Model merging offers greater flexibility than fine-tuning alone. If you have two models that excel at different aspects of the same

Speaker A

task, you can merge them into a single model that outperforms both. This approach can be done without GPUs, and it can improve performance while reducing the memory footprint. It's an excellent option for on-device deployment, and it can facilitate federated learning, where

Speaker A

multiple devices train using separate data unlike ensembling which combines the outputs of multiple models merging combines the models themselves this improves performance without the higher inference cost of running multiple models several merging approaches exist so we have summing where we just add the

Speaker A

weight values of the constituent models together; this is the most common. We could have layer stacking, where we take different layers from different models and stack them; this is also called frankenmerging. Or concatenation, where we just combine the parameters; this is

Speaker A

less recommended because it doesn't reduce memory compared to separate models so here's a practical fine-tuning approach and what a typical development path might look like first test your fine-tuning code using the cheapest fastest model you have and ensure it

Speaker A

works then test your data by fine-tuning a midsize model if training loss doesn't decrease with more data something might be wrong after that run experiments with your Target Model to see how far you can push performance and then map the price

Speaker A

performance Frontier and select the model that makes the most sense for your use case alternatively a distillation path looks like this start with a small data set and the strongest model you can afford then train the best possible

Speaker A

model with this small data set. Use this fine-tuned model to generate more training data, then use the expanded data set to train a cheaper model. When choosing fine-tuning methods, here are some things to consider. For beginners, start with

Speaker A

adapter techniques like LoRA before attempting full fine-tuning. Understand that data volume matters: full fine-tuning typically requires thousands to millions of examples, while PEFT can work with hundreds. You'll also need to know how many fine-tuned models you need.

Speaker A

Adapter methods let you serve multiple variants that share a base model. There are also some key hyperparameters that you should know; these ones in particular significantly impact fine-tuning results. First, the learning rate: just like in machine learning, if the loss curve

Speaker A

fluctuates the learning rate is likely too high if it's stable but decreases very slowly the rate's probably too low generally start larger and decrease over time we also have batch size larger batches process training examples faster but require more memory small batches

Speaker A

lead to more unstable training, so to address instability you can accumulate gradients across several batches. We also need to think about the number of epochs: smaller data sets typically need more epochs than larger ones. For millions of examples, one to two epochs might suffice;

Speaker A

for thousands of examples, 4 to 10 may be needed. Reduce the number of epochs if you see overfitting. We also have prompt loss weight: for instruction fine-tuning, this determines how much prompts should contribute to the loss compared to the responses. If it's set to 100%, prompts

Speaker A

and responses contribute equally if it's 0% the model learns only from responses the default is typically 10% while the technical process of fine-tuning has been simplified by Frameworks that handle the training process and suggest sensible defaults the Strategic

Speaker A

decisions around fine-tuning remain complex the key is knowing when to fine-tune which technique to use and how to balance the trade-offs between performance resources and data requirements While most companies can't afford to train Foundation models from scratch nearly all can differentiate

Speaker A

themselves through high-quality data sets for adaptation. As the saying goes, garbage in, garbage out, and nowhere is this more true than in data set engineering. We're witnessing a shift from model-centric to data-centric approaches in AI development. Model

Speaker A

Centric AI tries to improve performance by enhancing the models themselves so designing new architectures increasing model sizes or developing new training techniques data Centric AI on the other hand focuses on improving performance by enhancing the data developing better

Speaker A

data processing techniques and creating high quality data sets that allow Superior models to be trained with fewer resources for companies adapting Foundation models rather than building them from scratch the data Centric approach offers the greatest competitive Advantage the type of data you need

Speaker A

depends on your adaptation task. For self-supervised fine-tuning, you need sequences of relevant domain data. For instruction fine-tuning, you need data in (instruction, response) format. For preference fine-tuning, you need (instruction, winning response, losing response) format. For reward modeling, you

Speaker A

need either preference data or examples with explicit scores your training data should exhibit the behaviors you want your model to learn this can be particularly challenging for complex behaviors like Chain of Thought reasoning or tool use in agent workflows
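As a rough illustration of the data shapes just described, here is a small Python sketch with one record per adaptation task and a helper that checks a record for missing fields. The field names ("text", "instruction", "response", "chosen", "rejected", "score") and the example contents are assumptions made for this sketch; actual fine-tuning frameworks each define their own schema.

```python
# Hypothetical field names per adaptation task (an assumption, not a standard).
REQUIRED_FIELDS = {
    "self_supervised": ("text",),
    "instruction": ("instruction", "response"),
    "preference": ("instruction", "chosen", "rejected"),
    "reward_score": ("instruction", "response", "score"),
}

def missing_fields(example, kind):
    """Return the required fields that are absent or empty in one record."""
    return [f for f in REQUIRED_FIELDS[kind] if example.get(f) in (None, "")]

instruction_example = {
    "instruction": "Summarize the refund policy in one sentence.",
    "response": "Customers can request a full refund within 30 days of purchase.",
}

preference_example = {
    "instruction": "Explain what an API is to a beginner.",
    "chosen": "An API is a set of rules that lets two programs talk to each other.",
    "rejected": "API.",
}

incomplete_example = {"instruction": "Translate 'hello' to French."}

print(missing_fields(instruction_example, "instruction"))
print(missing_fields(preference_example, "preference"))
print(missing_fields(incomplete_example, "preference"))
```

A check like this is cheap to run over a whole data set before training and catches format problems early.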

Speaker A

When developing conversational applications, you need to consider whether you require single-turn data, multi-turn data, or both. Single-turn data helps train a model to respond to individual instructions, while multi-turn data teaches the model how to solve

Speaker A

tasks through dialogue, like clarifying user intent before addressing the task or incorporating corrections. A small amount of high-quality data can outperform a large amount of noisy data, a principle confirmed by teams working on models like Llama 3. They found that

Speaker A

human-generated data is often prone to errors and inconsistencies, particularly for nuanced policies, leading them to develop AI-assisted annotation tools to ensure high quality, which is interesting to me. But what makes data high quality? There are several factors to consider.

Speaker A

first relevance the examples should be relevant to your target task legal text from the 19th century might not be relevant for answering contemporary legal questions you'll also need alignment with task requirements if your task focuses on factual consistency

Speaker A

annotations need to be factually correct if it demands creativity annotations should be creative we also need to think about consistency annotations should be consistent across examples and annotators they need to be correctly formatted so data should adhere to the

Speaker A

expected structure they need to be sufficiently unique you want minimal duplicates in your data set they need to be compliant and follow internal and external policies and you need coverage your training data needs to cover the range of possible problems you want to

Speaker A

solve requiring sufficient diversity missing coverage in important areas will result in poor performance for those cases no matter how much data you have overall but how much data do you need asking how much data you need is kind of

Speaker A

like asking how much money you need the answer varies widely depending on your situation several factors influence data requirements so if you're fine-tuning then the fine-tuning technique matters full fine tuning typically requires orders of magnitude more data than

Speaker A

parameter-efficient methods like LoRA. With tens of thousands to millions of examples, full fine-tuning might be appropriate. With just hundreds to a few thousand examples, PEFT methods will likely work better. It also depends on your task complexity: a simple sentiment

Speaker A

classification task requires much less data than complex question answering about financial filings for example the base model performance also makes a difference so the closer the base model is to your desired performance the fewer examples you'll need larger more capable

Speaker A

base models generally require fewer examples to fine-tune effectively. OpenAI's fine-tuning guide demonstrates that with fewer examples, around 100, more advanced models give better fine-tuning results. However, after fine-tuning on a large data set, around 550,000 examples, all models perform similarly regardless

Speaker A

of their initial capabilities. So in short: with limited data, use PEFT methods on more advanced models; with abundant data, full fine-tuning on smaller models becomes viable. Before investing in a large data set, start with a small, well-crafted set of around 50 examples

Speaker A

to see if fine tuning improves your model if you see clear improvements more data will likely help further if you see no improvement with a small data set a larger one rarely solves the problem though be careful to rule out other

Speaker A

issues like poor hyperparameters or data quality first in most cases you should see improvements after fine-tuning with just 50 to 100 examples you can also reduce the amount of high quality data you need by first fine-tuning on more

Speaker A

accessible data. One path might be self-supervised to supervised: first fine-tune on domain-specific documents, then on targeted question-answer pairs. Or less relevant to more relevant data: first fine-tune on adjacent domains with abundant data, then on your specific

Speaker A

domain. Or synthetic to real data: first fine-tune on AI-generated examples, then on limited real examples. Experimenting with subsets of your current data set, so maybe 25, 50, and 100%, can help estimate how much more data you'll need. A steep

Speaker A

performance gain with increasing data set size suggests significant improvement from doubling your data a plateau indicates diminishing returns so let's say you need more data how can you get it if you don't have enough for your use case if possible you'll want to

Speaker A

create a data flywheel that leverages user interactions to continually improve your product; this offers a significant competitive advantage. Or you could also just check available data sets. You can often mix and match different sources, though all data must

Speaker A

be thoroughly verified for quality and appropriate licensing when annotating your own data the challenge isn't just The annotation process but creating clear guidelines you need to explicitly Define what makes a good response can a response be correct but unhelpful what

Speaker A

distinguishes a score of 3 versus 4? These guidelines are crucial for both human and AI-powered annotations. Trust me, one of the hardest machine learning problems I've ever had to solve was an issue with human labelers. Data augmentation creates new examples from

Speaker A

existing data which is another option so you could do things like flipping an image to create a new variant or you could use data synthesis this generates artificial data that mimics real data properties like simulating Mouse movements on a web page the key

Speaker A

difference between augmented data and synthetic data is that augmented data is derived from real data while synthetic data is Created from scratch data synthesis therefore is particularly valuable for addressing privacy concerns when working with sensitive information together some combination of these

Speaker A

techniques should allow you to produce data at scale increase coverage across your problem space and possibly improve quality with AI generated data since humans aren't always great at creating consistent data but of course make sure to measure the quality of your AI

Speaker A

generated data just like you would for human-generated data. Once you have your data, you need to process it. Data processing can be time-consuming, but it is critical for quality. Here are some best practices: start with filtering

Speaker A

tasks and test scripts before big runs avoid changing data in place so you want to make sure to keep the originals perform exploratory data analysis on distributions and outliers examine interannotator disagreement and resolve conflicts fact check and manually

Speaker A

inspect examples. Deduplicate data to prevent overrepresentation. Clean formatting tokens like HTML and Markdown, which can improve performance and reduce input size. Remove non-compliant data, so anything like PII, toxic material, or copyrighted content. Filter out low-quality data identified during

Speaker A

verification if you have more data than your compute budget allows use active learning to select the most helpful examples and ensure data is in the right format for your model using the appropriate tokenizer and chat template while all these steps require a lot of

Speaker A

effort they're essential for creating data sets that will help your model to shine in the competitive landscape of AI applications well-engineered data sets often make the difference between mediocre and exceptional performance now let's dive into one of the most

Speaker A

practical aspects of AI engineering: inference optimization. After all, a model's real-world usefulness boils down to two crucial factors: how much it costs to run and how quickly it responds. These characteristics, inference cost and latency, ultimately determine which

Speaker A

applications can practically use Ai and at what scale let's start by understanding what we mean by inference in the AI life cycle there are two distinct phases in an AI model's Journey training and inference training builds the model while inference uses the model

Speaker A

to compute outputs for given inputs in a production environment the component responsible for running the model inference is called an inference server This Server hosts available models allocates Hardware resources to execute them and returns responses to users the

Speaker A

inference server is part of a broader inference service that also handles receiving, routing, and pre-processing requests. So what does this mean for you? Well, if you're using a model API like those from OpenAI or Google, you're essentially outsourcing this inference

Speaker A

service but if you decide to host models yourself you'll need to build optimize and maintain your own inference infrastructure to optimize inference we first need to understand what's slowing things down generally speaking AI workloads face two types of bottlenecks

Speaker A

compute bound bottlenecks occur when the limiting factor is the computational power available tasks requiring intensive calculations like image generation are typically compute bound memory bandwidth bound bottlenecks occur when the limiting factor is how quickly data can move between memory and

Speaker A

processors. Autoregressive language model inference is typically memory bandwidth-bound. Profiling tools like NVIDIA Nsight can help determine which bottleneck affects your workload through something called a roofline chart. What's important to understand is that different optimization techniques address different bottlenecks. A compute

Speaker A

bound workload might benefit from more powerful chips or Distributing work across multiple chips meanwhile a memory bandwidth bound workload might see better results from chips with higher memory bandwidth now that we understand bottlenecks let's look at how inference

Speaker A

is actually served. Many providers offer two distinct types of inference APIs, each optimized for different use cases. Online APIs optimize for latency, processing requests as soon as they arrive. Chatbots typically use online APIs, since users expect quick responses.

Speaker A

Batch APIs, on the other hand, optimize for cost, processing multiple requests together more efficiently but with higher latency. Applications without strict response time requirements, like periodic report generation or synthetic data creation, can benefit from batch processing. The key is matching

Speaker A

your inference type to your application's needs. So now, how do we measure if our inference is performing well? That brings us to our next section. Here are some key inference performance metrics. To optimize effectively, we need to know

Speaker A

what we're measuring. Several metrics help us evaluate inference performance. The first and perhaps most notable metric is latency: the time from when users send a query until they receive a complete response. For autoregressive models like LLMs, latency breaks down into

Speaker A

two components. We have the time to first token (TTFT), which is how quickly the first token is generated after receiving a query, and then we have time per output token (TPOT), which is how long it takes to generate each subsequent token.
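These two components combine into an end-to-end estimate: the time to first token plus the time per output token multiplied by the number of output tokens. A minimal Python sketch of that arithmetic follows; the function name and the example numbers are assumptions chosen only for illustration.

```python
def total_latency(ttft_s, tpot_s, output_tokens):
    """Estimate end-to-end latency in seconds.

    ttft_s: time to first token, in seconds
    tpot_s: time per output token, in seconds
    output_tokens: number of tokens in the response
    """
    return ttft_s + tpot_s * output_tokens

# Hypothetical measurements: same model, short answer versus long answer.
short_answer = total_latency(ttft_s=0.5, tpot_s=0.05, output_tokens=20)
long_answer = total_latency(ttft_s=0.5, tpot_s=0.05, output_tokens=400)

print(f"short answer: {short_answer:.2f} s, long answer: {long_answer:.2f} s")
```

The comparison shows why the time to first token dominates perceived speed for short answers, while the per-token time dominates for long ones.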

Speaker A

The total latency then equals TTFT plus TPOT times the number of output tokens. Some teams also measure time to publish (TTP), because the first generated token isn't always immediately shown to users, especially when the model first generates a plan or uses Chain of

Speaker A

Thought reasoning one important note about latency since it varies across requests looking at percentiles gives you much more meaningful information than simple averages Beyond latency we also care about throughput which is the number of output tokens per second an

Speaker A

inference service can generate across all requests. Higher throughput typically means lower cost, which is why optimizing for it matters for production systems. It's worth mentioning that most AI applications face a fundamental latency-throughput tradeoff: techniques like batching can improve throughput but may

Speaker A

increase latency for individual requests. Your optimization strategy needs to balance these competing priorities based on your specific application needs. Finally, utilization metrics tell us how efficiently we're using our resources. We have model FLOP/s utilization (MFU), which is the ratio of observed

Speaker A

throughput relative to the theoretical maximum at peak computing power, and model bandwidth utilization (MBU), which measures the percentage of available memory bandwidth being used. Now that we know what to measure, let's look at the hardware that powers inference. At the heart of

Speaker A

inference performance is specialized hardware. An accelerator is a chip designed to speed up specific types of computation. For AI workloads, the dominant accelerators are GPUs, though specialized AI chips are growing in popularity. You might be wondering about the difference between CPUs and GPUs. It

Speaker A

comes down to their architecture CPUs have a few powerful cores typically up to 64 for high-end machines which are optimized for general purpose Computing gpus on the other hand have thousands of smaller cores optimized for parallel processing this makes them ideal for

Speaker A

matrix multiplication operations that dominate ML workloads. Interestingly, training and inference have different hardware requirements. Training demands more memory due to backpropagation and is generally more difficult to perform in lower precision. Inference often emphasizes latency over throughput, since

Speaker A

users are typically waiting for responses when evaluating hardware for inference consider three key questions can it run your workloads how long does it take to do so and how much does it cost the specific Hardware specifications to focus on include flops

Speaker A

computing power memory size and memory bandwidth for compute bound workloads prioritize chips with more flops for memory bound workloads focus on higher bandwidth and more memory with the hardware foundations covered let's move on to techniques for optimizing at the

Speaker A

model level now we're getting into the real tactics for speeding up inference let's start with model level optimizations techniques that make the models themselves more efficient model compression reduces a model's size potentially making it faster there are several approaches here quantization

Speaker A

which we already discussed reduces numerical Precision pruning removes less important parameters or sets them to zero and distillation which we also already discussed trains a smaller model to mimic a larger one among these options weight only quantization is by

Speaker A

far the most popular because it's relatively easy to implement, works well for many models out of the box, and delivers significant benefits without that much effort. Another challenge specific to language models is their autoregressive nature: they generate

Speaker A

text one token at a time which creates a sequential bottleneck several Techniques address this limitation speculative decoding uses a faster but less powerful model to generate candidate tokens which are then verified by the Target Model it's like having an assistant draft

Speaker A

responses and a manager quickly review and approve. Inference with reference copies tokens from the input when appropriate, for example when answering questions about a document, rather than generating them from scratch. This can significantly speed up responses for

Speaker A

document-based queries. Parallel decoding aims to generate multiple tokens simultaneously, breaking the sequential constraint. Additionally, attention mechanism optimization improves the efficiency of transformer models' attention calculations, which can be particularly memory-intensive. At an even lower level, kernels and compilers

Speaker A

optimize how models run on specific hardware. Kernels are specialized code optimized for hardware accelerators; common optimization techniques include vectorization, parallelization, loop tiling, and operator fusion. Compilers bridge machine learning models and hardware, converting model operations into optimized code for specific

Speaker A

accelerators but optimization doesn't stop at the model level let's look at how we can optimize the entire inference service we can achieve significant performance gains by efficiently managing resources across an entire inference service one of the most powerful techniques is batching which

Speaker A

combines multiple requests to be processed together. Batching can be implemented in different ways. We have static batching, which groups a fixed number of inputs, but all requests must wait until the batch is full; this is simple but can lead to inconsistent latency. Dynamic

Speaker A

batching sets a maximum time window processing the batch when either it's full or the time limit has been reached this provides more consistent latency guarantees finally we have continuous batching which allows responses to be returned as soon as they're completed

Speaker A

with new requests added to maintain batch size. This provides the best user experience but is more complex to implement. Another powerful technique is decoupled prefill and decode, which separates these two phases of LLM inference. Since they have different

Speaker A

computational needs handling them separately prevents resource competition and improves overall efficiency for applications with repetitive patterns prompt caching stores overlapping text segments like system prompts or reference documents to avoid reprocessing them with each query this is particularly valuable for

Speaker A

applications with long conversations or multiple queries about the same document. As models grow larger, a single machine may not be sufficient. This is where parallelism comes in, distributing work across multiple machines. Replica parallelism creates multiple copies of

Speaker A

the model each handling different requests this is the simplest approach and works well for high throughput scenarios model parallelism splits a single model across machines either through tensor parallelism which is breaking operations into smaller pieces pipeline parallelism dividing the model

Speaker A

into sequential stages context parallelism splitting input sequences across devices or sequence parallelism splitting different operations across machines so what technique should you implement we just talked about a lot the optimal combination depends on your specific workloads and performance

Speaker A

requirements. For applications prioritizing low latency, replica parallelism may be best despite higher costs. For most use cases, the most impactful techniques are typically quantization, tensor parallelism, replica parallelism, and attention mechanism optimization. By thoughtfully applying these techniques, you can dramatically

Speaker A

improve both the speed and cost effectiveness of your AI applications making them more responsive to users while keeping your infrastructure cost manageable in our next and final section we'll see how all these components come together in a complete AI application

Speaker A

architecture and how user feedback creates a virtuous cycle of continuous improvement. Now that we've explored all the individual components of AI engineering, it's time to pull everything together. Let's see how these pieces fit into a complete architecture and how

Speaker A

user feedback creates a powerful Loop that helps these systems improve over time the simplest AI application architecture looks like this your application receives a query sends it to a model either through a third party API or self-hosted model and Returns the

Speaker A

response to the user no Bells no whistles just direct input and output but real world applications rarely stay this simple let's walk through how these architectures typically evolve as your needs grow more sophisticated the first enhancement most applications need is

Speaker A

better context construction: giving the model access to the information required to produce useful outputs. This is essentially feature engineering for foundation models. So you might add RAG systems to search and retrieve information from your knowledge base, agent capabilities to gather information

Speaker A

from external tools document uploading functionality to analyze specific content or more these additions ensure the model has the necessary context to provide accurate relevant responses step two add guard rails for protection as your application grows in capability you'll need guard rails to protect both

Speaker A

your system and your users input guard rails protect against leaking private information to external apis and malicious prompts that could compromise your system output guard rails catch different types of failures quality failures like empty responses incorrect formatting or factually incorrect

Speaker A

content or security failures like toxic content pii exposure or unauthorized actions the key again is balancing protection with user experience overly restrictive guardrails create frustrating experiences while inadequate ones could leave you vulnerable stage three Implement model routing and

Speaker A

gateways as your application matures you may discover that one model doesn't fit all your needs different queries require different approaches and this is where model routing comes into play a model router typically includes an intent classifier that predicts what the user

Speaker A

is trying to do and then directs the query to the appropriate model or pipeline these routers should be fast and inexpensive so you can use multiple of them without adding significant latency or cost along with routing you'll need a model Gateway this is an

Speaker A

intermediate layer that provides a unified interface to different models, both self-hosted and commercial; access control and cost management; fallback policies to handle rate limits or API failures; and load balancing, logging, and analytics. The gateway approach makes your codebase much more maintainable: if

Speaker A

a model API changes, you only need to update the gateway, not every application that uses it. It's a classic example of separation of concerns in software engineering. Next, stage four: optimize with caching. As your user base grows, performance and cost optimization become

Speaker A

increasingly important this is where caching enters the picture inference caching includes techniques like KV caching to optimize the attention mechanism and prompt caching to avoid reprocessing identical prompt components caching is particularly valuable for multi-step processes like Chain of

Speaker A

Thought reasoning, or queries requiring time-consuming actions like retrieval or web searches. For implementation, your options range from in-memory storage, which is fast but has limited capacity, to databases like PostgreSQL and Redis. You'll also need an eviction

Speaker A

policy, like least recently used or least frequently used, to manage cache sizes as you scale. Stage five: add complex logic and write actions. The most sophisticated AI applications go beyond simple question answering to incorporate complex multi-step reasoning flows,

Speaker A

agentic patterns with loops and decision-making, and write actions that make changes to the environment. Write actions like sending emails, placing orders, or initiating transfers dramatically increase your system's capabilities but also introduce significant risks. These should be implemented with sufficient caution and

Speaker A

appropriate safeguards as your architecture grows in complexity keeping track of everything becomes increasingly challenging this is where monitoring and observability become critical while related they serve slightly different purposes monitoring tracks external outputs to detect when something goes wrong but doesn't necessarily help

Speaker A

identify the cause it's like knowing your car broke down but not why observability on the other hand ensures that sufficient information about your system's internal state is collected so that when something goes wrong you can diagnose the issue without deploying new

Speaker A

code. It's like having sensors throughout your car that can pinpoint exactly what failed. There are three key metrics that can help you evaluate your observability: MTTD, or mean time to detection, which is how long it takes to detect an issue; MTTR,

Speaker A

mean time to response; and CFR, change failure rate, which is the percentage of deployments that result in failures. Each component in your pipeline should have its own metrics, and you should understand how these metrics correlate to your business's north star metrics.
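The three observability metrics above can be computed from simple incident and deployment records. The Python sketch below is one possible way to do it; the Incident record, its field names, and the choice to measure response time from detection to response are assumptions for illustration, not a standard definition.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    started_at: float    # when the problem began (seconds since some reference time)
    detected_at: float   # when monitoring flagged it
    responded_at: float  # when the team responded to it

def mean_time_to_detection(incidents):
    """MTTD: average delay between a problem starting and being detected."""
    return sum(i.detected_at - i.started_at for i in incidents) / len(incidents)

def mean_time_to_response(incidents):
    """MTTR: average delay between detection and response (assumed definition)."""
    return sum(i.responded_at - i.detected_at for i in incidents) / len(incidents)

def change_failure_rate(failed_deployments, total_deployments):
    """CFR: fraction of deployments that resulted in failures."""
    return failed_deployments / total_deployments

# Hypothetical data: two incidents and 20 deployments, 3 of which failed.
incidents = [
    Incident(started_at=0.0, detected_at=60.0, responded_at=360.0),
    Incident(started_at=100.0, detected_at=220.0, responded_at=400.0),
]

print("MTTD (s):", mean_time_to_detection(incidents))
print("MTTR (s):", mean_time_to_response(incidents))
print("CFR:", change_failure_rate(3, 20))
```

Tracking these per pipeline component, rather than only for the whole system, makes it easier to see which part is slow to surface problems.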

Speaker A

remember the golden rule of observability just log everything when metrics indicate a problem detailed logs help you identify exactly what went wrong as your application evolves to include multiple models data sources and tools managing these interactions can become increasingly complex this is

Speaker A

where an orchestrator becomes valuable, helping you specify how these components work together. AI orchestrator tools like LangChain, LlamaIndex, Flowise, Langflow, and Haystack help manage these complex pipelines. However, it's often wise to start building your application

Speaker A

without an orchestrator first to understand the core mechanics before adding another layer of abstraction now let's talk about what might be the most valuable asset in AI engineering user feedback this feedback provides proprietary data that can give you a

Speaker A

genuine competitive advantage while everyone can access the same Foundation models only you have access to how your specific users interact with your system user feedback comes in two main forms explicit feedback is directly provided by users this is things like Thumbs Up

Speaker A

and down ratings, star ratings, or written comments. Implicit feedback is inferred from user behavior: this could be things like early termination, error corrections, question clarifications, complaint messages, sentiment, frequency of regenerating responses, and conversation length. When designing your feedback

Speaker A

systems, consider carefully when to request input. You could ask for feedback at the beginning of the experience, like asking for skill level in a language learning app, or when something unexpected happens, like slow response time, or at natural decision points, like

Speaker A

offering a choice between two alternative responses. The goal is to gather valuable insights without disrupting the user experience. Remember that every request for feedback creates friction, so use these opportunities wisely. While we've covered each component separately, a mature AI application integrates all

Speaker A

these elements into a cohesive system. The architecture you choose should align with your specific use case, technical constraints, and business objectives. One important thing to remember is that complexity should serve a purpose: only add components that solve real problems

Speaker A

for your application. Sometimes a simpler architecture with fewer moving parts is more reliable and easier to maintain than a complex one with every bell and whistle. The field of AI engineering is still rapidly evolving, with new techniques and best practices emerging

Speaker A

daily. The most successful AI engineers maintain flexibility in their architecture, allowing them to incorporate new advances while providing stable, reliable experiences to their users. And that wraps up our journey through AI engineering. We've covered an incredible amount of ground, from

Speaker A

understanding foundation models and evaluation to mastering prompt engineering, RAG, agents, fine-tuning, dataset engineering, and optimization techniques. Of course, this was a super high-level overview of a very detailed book, so I really recommend using this as a starting point to check out the book

Speaker A

on your own. I had a great time putting this together, and I plan to do more videos covering technical content like this in the future, so let me know in the comments which book you want me to summarize next, and don't forget to

Speaker A

subscribe so you don't miss it when the next one comes out. Thanks so much for watching, and I'll see you next time.

Topics: AI Engineering, Foundation Models, Transformers, Attention Mechanism, Large Language Models, Self-Supervised Learning, Model Training, Chinchilla Scaling Law, RWKV, AI Applications

What is AI engineering and how does it differ from traditional machine learning?

AI engineering focuses on building applications using pre-trained foundation models rather than training models from scratch, emphasizing adaptation and integration over model creation.

Why are Transformers important in AI models?

Transformers use the attention mechanism to process input tokens in parallel and dynamically weigh their importance, making them faster and more effective than previous sequential models.

What are the main challenges facing the future of foundation models?

Key challenges include the potential shortage of high-quality training data and the high electricity consumption of data centers, which limit the scalability of larger models.
