RAG vs Fine-Tuning vs Prompt Engineering: Optimizing AI… — Transcript

Explore RAG, fine-tuning, and prompt engineering to optimize AI model responses with IBM technology insights.

Key Takeaways

  • RAG is best for incorporating fresh, domain-specific data dynamically but has higher latency and infrastructure costs.
  • Fine-tuning is ideal for deep domain expertise with faster responses but requires significant training effort and resources.
  • Prompt engineering is a low-cost method to improve outputs by carefully designing queries without retraining or external data.
  • Each method has trade-offs in complexity, cost, and performance depending on the use case.
  • Understanding these approaches helps optimize AI model deployment for specific organizational needs.

Summary

  • The video compares three methods to improve large language model outputs: Retrieval Augmented Generation (RAG), fine-tuning, and prompt engineering.
  • RAG enhances responses by retrieving up-to-date, domain-specific information using vector embeddings and incorporating it into the query.
  • Fine-tuning involves additional training of a pre-trained model on specialized datasets to embed domain expertise directly into model weights.
  • Prompt engineering improves outputs by crafting precise queries that guide the model’s attention without additional training or data retrieval.
  • RAG is valuable for real-time, domain-specific info but adds latency and infrastructure costs due to retrieval and vector storage.
  • Fine-tuning offers faster inference and deep domain knowledge but requires extensive training data, computational resources, and risks catastrophic forgetting.
  • Prompt engineering is cost-effective and flexible but depends on the skill of query formulation to direct model reasoning effectively.
  • RAG uses semantic search via vector embeddings rather than keyword matching to find relevant documents.
  • Fine-tuning modifies model parameters through supervised learning with input-output pairs to improve domain-specific accuracy.
  • Prompt engineering leverages attention mechanisms in the model to highlight relevant patterns learned during training.

Full Transcript

00:00
Speaker A
Remember how back in the day people would Google themselves? You'd type your name into a search engine.
00:08
Speaker A
And you see what it knows about you. Well, the modern equivalent of that is to do the same thing with a chatbot.
00:13
Speaker A
So when I ask a large language model, who is Martin Keen, well, the response varies greatly depending upon which model I'm asking.
00:22
Speaker A
Because different models, they have different training data sets, they have different knowledge cutoff dates, so what a given model knows about me, well, it differs greatly.
00:32
Speaker A
But how could we improve the model's answer?
00:36
Speaker A
Well, there's three ways, so let's start with a model here, and we're going to see how we can improve its responses.
00:44
Speaker A
Well, the first thing it could do is it could go out and it could perform a search, a search for new data that either wasn't in its training data set, or it was just data that became available after the model finished training.
00:58
Speaker A
And then it could incorporate those results from the search back into its answer.
01:05
Speaker A
That is called RAG, or retrieval augmented generation.
01:11
Speaker A
That's one method.
01:12
Speaker A
Or we could pick a specialized model, a model that's been trained on, let's say, transcripts of these videos.
01:20
Speaker A
That would be an example of something called fine tuning.
01:24
Speaker A
Or we could ask the model a query that better specifies what we're looking for.
01:30
Speaker A
So maybe the LLM already knows plenty about the Martin Keens of the world, but let's tell the model that we're referring to the Martin Keen who works at IBM, rather than the Martin Keen that founded Keen shoes.
01:39
Speaker A
That is an example of prompt engineering.
01:43
Speaker A
Three ways to get better outputs out of large language models, each with their pluses and minuses.
01:49
Speaker A
So let's start with RAG.
01:51
Speaker A
So let's break it down, first there's retrieval.
01:55
Speaker A
So retrieval of external up-to-date information.
01:58
Speaker A
Then there's augmentation, that's augmentation of the original prompt.
02:02
Speaker A
With the retrieved information added in, and then finally there's generation.
02:07
Speaker A
That's generation of a response based on all of this enriched context.
02:11
Speaker A
So we can think of it like this.
02:15
Speaker A
So we start with a query, and the query comes in to a large language model.
02:20
Speaker A
Now, what RAG is going to do is it's first going to go searching through a corpus of information.
02:27
Speaker A
So we have this corpus here full of some sort of data.
02:32
Speaker A
Now, perhaps that's your organization's documents, so it might be spreadsheets, PDFs, internal wikis, you know, stuff like that.
02:38
Speaker A
But unlike a typical search engine that just matches keywords, RAG converts both your question, the query, and all of the documents into something called vector embeddings.
02:50
Speaker A
So these are all converted into vectors.
02:54
Speaker A
Essentially turning words and phrases into long lists of numbers that capture their meaning.
03:00
Speaker A
So when you ask a query like, what was our company's revenue growth last quarter?
03:07
Speaker A
Well, RAG will find documents that are mathematically similar in meaning to your question, even if they don't use the exact same words.
03:16
Speaker A
So it might find documents mentioning fourth quarter performance or quarterly sales.
03:22
Speaker A
Those don't contain the keyword revenue growth, but they are semantically similar.
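This semantic matching can be sketched in a few lines. The vectors below are hypothetical toy embeddings (real systems use hundreds or thousands of dimensions produced by an embedding model), chosen so that finance-related texts point in a similar direction:

```python
import math

# Toy 3-dimensional "embeddings" (illustrative values only; real embeddings
# come from a trained model and have far more dimensions).
EMBEDDINGS = {
    "what was our company's revenue growth last quarter?": [0.9, 0.8, 0.1],
    "fourth quarter performance summary":                  [0.8, 0.9, 0.2],
    "quarterly sales figures":                             [0.7, 0.9, 0.1],
    "office holiday party photos":                         [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Similarity of two vectors by angle: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query, top_k=2):
    """Return the top_k documents closest in meaning to the query."""
    q = EMBEDDINGS[query]
    docs = [d for d in EMBEDDINGS if d != query]
    return sorted(docs, key=lambda d: cosine_similarity(q, EMBEDDINGS[d]),
                  reverse=True)[:top_k]
```

With these values, `retrieve` surfaces the two finance documents and skips the party photos, even though none of them contain the words "revenue growth".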
03:27
Speaker A
Now, once RAG finds the relevant information, it adds this information back into your original query.
03:35
Speaker A
Before passing it to the language model.
03:38
Speaker A
So instead of the model just kind of guessing based on its training data, it can now generate a response that incorporates your actual facts and figures.
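The augmentation step itself is simple in principle: the retrieved passages get stitched into the prompt before it reaches the model. A minimal sketch (the template wording here is an assumption, not a fixed standard):

```python
def augment_prompt(query, retrieved_docs):
    """Build an enriched prompt by prepending retrieved context to the query."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer using the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The enriched string, not the bare question, is what the language model finally sees.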
03:46
Speaker A
So this makes RAG particularly valuable when you are looking for information that is up-to-date.
03:53
Speaker A
And it's also very valuable when you need to add in information that is domain specific as well.
03:59
Speaker A
But there are some costs to this.
04:01
Speaker A
Let's get out the red pen. So one cost would be the cost of performance for performing all of this.
04:07
Speaker A
Because you have this retrieval step here, and that adds latency to each query compared to a simple prompt to a model.
04:14
Speaker A
There are also costs related to just kind of the processing of this as well.
04:20
Speaker A
So if we think about what we're having to do here, we've got documents that need to be converted into vector embeddings.
04:28
Speaker A
And we need to store these vector embeddings in a database.
04:33
Speaker A
All of this adds to processing costs, it adds to infrastructure costs to make this solution work.
04:38
Speaker A
All right, next up, fine tuning.
04:40
Speaker A
So remember how we discussed getting better answers about me by training a model specifically on, let's say, my video transcripts?
04:47
Speaker A
Well, that is fine tuning in action.
04:50
Speaker A
So what we do with fine tuning is we take a model.
04:56
Speaker A
But specifically an existing model.
05:00
Speaker A
And that existing model has broad knowledge.
05:03
Speaker A
And then we're going to give it additional specialized training on a focused data set.
05:09
Speaker A
So this is now specialized to what we want to develop particular expertise on.
05:14
Speaker A
Now, during fine tuning, we're updating the model's internal parameters through additional training.
05:20
Speaker A
So the model starts out with some weights here.
05:24
Speaker A
And those weights were optimized during its initial pre-training.
05:29
Speaker A
And as we fine tune, we're making small adjustments here to the model's weights using this specialized data set.
05:37
Speaker A
So this is being incorporated.
05:43
Speaker A
Now, this process typically uses supervised learning, where we provide input-output pairs that demonstrate the kind of responses we want.
05:50
Speaker A
So, for example, if we're fine tuning for technical support, we might provide thousands of examples of customer queries.
05:58
Speaker A
And those would be paired with correct technical responses.
06:02
Speaker A
The model adjusts its weights through back propagation to minimize the difference between its predicted outputs and the targeted responses.
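A toy stand-in for this process: start from a "pre-trained" weight and nudge it with gradient descent on a small set of input-output pairs. This one-parameter model is purely illustrative; real fine-tuning updates billions of weights via backpropagation, but the principle of minimizing prediction error on supervised pairs is the same:

```python
# Hypothetical toy example: a one-weight "model" fine-tuned on (input, target)
# pairs. The true relationship in the data is target = 2 * input.
pretrained_weight = 1.0  # weight the model arrived at during "pre-training"
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # specialized training examples

def fine_tune(w, pairs, lr=0.05, epochs=200):
    for _ in range(epochs):
        for x, y in pairs:
            pred = w * x                # forward pass: model's prediction
            grad = 2 * (pred - y) * x   # gradient of squared error w.r.t. w
            w -= lr * grad              # small adjustment to the weight
    return w

tuned = fine_tune(pretrained_weight, pairs)
```

After training, `tuned` converges near 2.0: the specialized data has been "baked into" the weight.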
06:09
Speaker A
So we're not just teaching the model new facts here, we're actually modifying how it processes information.
06:16
Speaker A
The model is learning to recognize domain specific patterns.
06:20
Speaker A
So fine tuning shows its strengths when you particularly need a model that has very deep domain expertise.
06:26
Speaker A
That's what we can really add in with fine tuning.
06:30
Speaker A
And also, it's much faster specifically at inference time.
06:35
Speaker A
So when we are putting the queries in, it's faster than RAG because it doesn't need to search through external data.
06:40
Speaker A
And because the knowledge is kind of baked into the model's weights, you don't need to maintain a separate vector database.
06:45
Speaker A
But there are some downsides as well.
06:47
Speaker A
Well, there's certainly issues here with the training complexity of all of this.
06:52
Speaker A
You're going to need thousands of high quality training examples.
06:57
Speaker A
There are also issues with computational cost.
07:02
Speaker A
The computational cost for training this model can be substantial and it's going to require a whole bunch of GPUs.
07:07
Speaker A
And there's also challenges related to maintenance as well.
07:13
Speaker A
Because unlike RAG where you can easily add new documents to your knowledge base at any point, updating a fine-tuned model requires another round of training.
07:20
Speaker A
And then perhaps most importantly of all, there is a risk of something called catastrophic forgetting.
07:27
Speaker A
Now that's where the model loses some of its general capabilities while it's busy learning these specialized ones.
07:33
Speaker A
So finally, let's explore prompt engineering.
07:36
Speaker A
Now specifying Martin Keen who works at IBM versus Martin Keen who founded Keen shoes, that's prompt engineering.
07:44
Speaker A
But at its most basic.
07:46
Speaker A
Prompt engineering goes far beyond simple clarification.
07:49
Speaker A
So let's think about when we input a prompt.
07:54
Speaker A
The model receives this prompt.
07:58
Speaker A
And it processes it through a series of layers.
08:03
Speaker A
And these layers are essentially attention mechanisms.
08:09
Speaker A
And each one focuses on different aspects of your prompt text that came in.
08:14
Speaker A
And by including specific elements in your prompt, so examples, or context, or how you want the format to look, you're directing the model's attention to relevant patterns it learned during training.
08:23
Speaker A
So, for example, telling a model to think about this step by step, that activates patterns it learned from training data where methodical reasoning led to accurate results.
08:31
Speaker A
So a well-engineered prompt can transform a model's output without any additional training or without data retrieval.
08:38
Speaker A
So take an example of a prompt.
08:41
Speaker A
Let's say we say, is this code secure?
08:43
Speaker A
Not a very good prompt.
08:45
Speaker A
An engineered prompt, it might read a bit more like this.
08:50
Speaker A
It's much more detailed.
08:52
Speaker A
Now, we haven't changed the model, we haven't added new data.
08:58
Speaker A
We've just better activated its existing capabilities.
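One way to picture an engineered prompt is as an assembly of the elements just described: context, examples, a desired output format, and a step-by-step instruction. The helper below is a hypothetical sketch (the function and its wording are assumptions for illustration):

```python
def build_prompt(task, context=None, examples=None, output_format=None, steps=False):
    """Assemble a prompt from optional elements: context, few-shot examples,
    a target format, and a step-by-step instruction."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(task)
    if examples:
        parts.append("Examples:")
        parts.extend(f"- {e}" for e in examples)
    if output_format:
        parts.append(f"Respond in this format: {output_format}")
    if steps:
        parts.append("Think about this step by step.")
    return "\n".join(parts)

vague = "Is this code secure?"
engineered = build_prompt(
    "Review the following function for security flaws.",
    context="a Python web handler that builds SQL queries from user input",
    output_format="a numbered list of issues with severity ratings",
    steps=True,
)
```

Same model, same knowledge; the engineered version simply directs attention to the patterns we care about.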
09:02
Speaker A
Now, I think the benefits to this are pretty obvious.
09:05
Speaker A
One is that we don't need to change any of our back end infrastructure here.
09:11
Speaker A
Because there are no infrastructure changes at all.
09:14
Speaker A
In order to prompt better, it's all on the user.
09:16
Speaker A
There's also the benefit that by doing this, you get to see immediate responses.
09:21
Speaker A
And immediate results to what you do.
09:25
Speaker A
We don't have to add in new training data or any kind of data processing.
09:30
Speaker A
But of course, there are some limitations to this as well.
09:32
Speaker A
Prompt engineering is as much an art as it is a science.
09:38
Speaker A
So there is certainly a good amount of trial and error in this sort of process to find effective prompts.
09:43
Speaker A
And you're also limited in what you can do here.
09:49
Speaker A
You're limited to existing knowledge.
09:53
Speaker A
Because you're not able to actually add anything else in here.
10:00
Speaker A
No additional amount of prompt engineering is going to teach it truly new information.
10:04
Speaker A
And you're not going to fix anything that's outdated in the model.
10:07
Speaker A
So we've talked about now RAG as being one option.
10:12
Speaker A
And we talked about fine tuning as being another one.
10:16
Speaker A
And now just now, we've talked about prompt engineering as well.
10:20
Speaker A
And I've really talked about those as three different distinct things here.
10:24
Speaker A
But they're commonly used actually in combination.
10:27
Speaker A
We might use all three together.
10:28
Speaker A
So consider a legal AI system. RAG could retrieve specific cases and recent court decisions.
10:34
Speaker A
The prompt engineering part, that could make sure that we follow proper legal document formats by asking for it.
10:40
Speaker A
And then fine tuning, that could help the model master firm specific policies.
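The legal-assistant combination could be wired together roughly as below. Everything here is a hypothetical stand-in: `retrieve_cases` stands in for a vector-store search, and `fine_tuned_model` stands in for a model already fine-tuned on firm-specific policies:

```python
def retrieve_cases(query):
    # RAG step: stand-in for semantic search over a case-law corpus.
    return ["Smith v. Jones (2023): NDAs require consideration"]

def fine_tuned_model(prompt):
    # Stand-in for a model fine-tuned on firm-specific policies.
    return "MEMO DRAFT:\n" + prompt

def legal_assistant(query):
    cases = retrieve_cases(query)  # RAG: fetch relevant, up-to-date cases
    prompt = (
        # Prompt engineering: request proper legal memo structure.
        "Draft a legal memo in Issue/Rule/Analysis/Conclusion format.\n"
        "Relevant cases:\n" + "\n".join(f"- {c}" for c in cases) +
        f"\n\nQuestion: {query}"
    )
    return fine_tuned_model(prompt)  # fine-tuned model generates the answer

memo = legal_assistant("Is our NDA enforceable?")
```

Each method handles the part it is best at: fresh knowledge from retrieval, structure from the prompt, and domain behavior from the tuned weights.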
10:44
Speaker A
I mean, basically we can think of it like this.
10:50
Speaker A
We can think that prompt engineering offers flexibility and immediate results.
10:55
Speaker A
But it can't extend knowledge.
10:57
Speaker A
RAG, that can extend knowledge, it provides up-to-date information.
11:02
Speaker A
But with computational overhead, and then fine tuning.
11:06
Speaker A
That enables deep domain expertise, but it requires significant resources and maintenance.
11:11
Speaker A
Basically, it comes down to picking the methods that work for you.
11:14
Speaker A
You know, we've sure come a long way from vanity searching on Google.
Topics: Retrieval Augmented Generation, RAG, Fine-tuning, Prompt Engineering, Large Language Models, AI Optimization, IBM Technology, Vector Embeddings, Domain Expertise, Model Training

Frequently Asked Questions

What is Retrieval Augmented Generation (RAG)?

RAG is a method where a model retrieves relevant external information using vector embeddings and incorporates it into the prompt to generate more accurate and up-to-date responses.

How does fine-tuning improve AI model performance?

Fine-tuning adjusts a pre-trained model’s internal parameters using specialized datasets, enabling it to develop deep domain expertise and provide more accurate responses specific to that domain.

What role does prompt engineering play in optimizing AI models?

Prompt engineering involves crafting precise and context-rich queries that guide the model’s attention to relevant patterns, improving output quality without additional training or data retrieval.
