Speaker A
How is this possible? This 110-billion-parameter model, Qwen 110 billion, is running on this machine. But even my 5090 can't run this model. I'll tell you how. This little device has 128 GB of unified memory on board. Oh, unified memory, you say? Is that kind of like Apple's unified memory in this M4 Max? Not exactly. And this does affect LLM performance, which is what I'm going to be digging into here right now. This thing, of course, is the Asus Flow Z13 2025 edition. And that's the first device to have AMD's new Ryzen AI Max Plus 395 APU, also known as an accelerated processing unit, which combines both the CPU and the GPU together on a single chip. And that means that the GPU can have access to the absurd amount of memory that's shared with the CPU. Shared, maybe. And as of about a week ago, these were the only two laptops that had 128 GB of memory. But this is becoming more common these days, especially with more and more people interested in running local models, local LLMs. But wait a minute, my phone with 8 gigs of memory, or however much is in the iPhone, I don't know, that thing can run LLMs, and so can this 16 GB MacBook Air. In fact, all these laptops can run LLMs. So why would I care to get 128 GB? Isn't that expensive? Well, yes, it's very expensive for now. And the reason you care is because larger models like Llama 3 70 billion, for example, often outperform smaller ones since they store more knowledge and maintain longer reasoning chains. That just means fewer hallucinations and better performance on complex tasks like math or legal arguments, for example. We're going to make that assumption here. We're going to assume that bigger models give you better results because there's a ton of quality benchmarks out there that show those results. And going forward, we're going to focus on the other performance side of things, which is how fast they can do stuff. Oh, look at that. 2.12 tokens per second.
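For reference, the tokens-per-second figure quoted here is just the number of generated tokens divided by the time it took to generate them. A minimal sketch in Python, where the function names and the 500-token answer length are illustrative assumptions and not something taken from the tools on screen:

```python
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Generation speed: tokens produced divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def seconds_for_answer(answer_tokens: int, rate_tokens_per_second: float) -> float:
    """How long an answer of a given length takes at a steady rate."""
    if rate_tokens_per_second <= 0:
        raise ValueError("rate_tokens_per_second must be positive")
    return answer_tokens / rate_tokens_per_second


if __name__ == "__main__":
    # 2.12 tokens per second is the rate read off the screen;
    # the 500-token answer length is a made-up example.
    wait = seconds_for_answer(500, 2.12)
    print(f"500 tokens at 2.12 tok/s takes about {wait / 60:.1f} minutes")
```

At that rate a 500-token answer takes just under four minutes, which is why the speed side of performance matters as much as the quality side.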
So, this whole thing depends on the hardware you're using: what the CPU is, what the GPU is, how much memory bandwidth you have, and so on. So, what I did was pick a bunch of these models that I've been collecting over some time. I don't keep all the ones that I collect, but you know, just a few. I don't have all these models on all these machines, but I have enough to get a sample so we can compare things. And some of these I run on a regular basis, especially vision models where I need to automate some local scripts that look at thousands of pictures and then describe them. This stuff would be really expensive to do in the cloud. So this is 128 gigs on this one. This one is another model of the same thing. It's the Flow Z13, but it has 32 gigs of memory. Now, the reason I found out about the Qwen model is because a regular commenter on the channel asked me specifically about that one. And that happened to coincide with the size of model that just barely fits on this machine. And if you go bigger, it does not work even though it looks like it should. It looks like it should, but it doesn't. So, for example, here's ollama run. And I'm going to try this one, which is 73 GB in size on disk, not in memory. By the way, I've got to mention this. The size of the model on disk is not necessarily going to be the size of the model when it's running in memory. It's just going to give you an approximate starting point. So don't always go by that. In fact, I made an LLM calculator recently in one of the videos. I'll link to that down below so you can check out what the running memory footprint could be. Also an approximation, by the way. So let's do this one. Mistral Large, latest. And this machine should be able to handle that, having, you know, 128 GB available. But that's not how it works. It's not even how it works on Apple Silicon either. This MacBook Pro with the M4 Max also has 128 GB, but not all of that is available for doing LLMs on the GPU. By default, you get about 96 GB.
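To make the "will it fit" question concrete, here is a rough sketch in Python of the kind of estimate such a calculator might make. It is not the calculator from the video. The 4-bit quantization and the flat 20% allowance for context cache and runtime buffers are assumptions for illustration, and the 0.75 GPU share is simply 96 GB out of 128 GB, the default mentioned above.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def running_footprint_gb(
    params_billion: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.2,  # assumed allowance, not a measured value
) -> float:
    """Weights plus a flat allowance for context cache and runtime buffers."""
    return weights_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


def fits_on_gpu(footprint_gb: float, total_memory_gb: float, gpu_fraction: float) -> bool:
    """True if the estimate fits in the share of memory the GPU may use."""
    return footprint_gb <= total_memory_gb * gpu_fraction


if __name__ == "__main__":
    # 110 billion parameters at an assumed 4 bits per weight.
    estimate = running_footprint_gb(110, 4)
    budget = 128 * 0.75  # about 96 GB usable by the GPU by default
    print(f"estimate: {estimate:.0f} GB, budget: {budget:.0f} GB")
    print("fits" if fits_on_gpu(estimate, 128, 0.75) else "does not fit")
```

Under those assumptions a 110-billion-parameter model comes out around 66 GB, under the 96 GB budget, while the same model at 16 bits per weight would not come close. Like any such estimate, it is only an approximation.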
You can override that, of course, and get it a little bit closer, but then you run the risk of the system becoming unstable. Now, if we take a look at this, you'll see that Mistral Large is not giving us the best results, but it is running. However, this is not running on the GPU. This is mostly running on the CPU. Take a look at the memory footprint. You'll see that system memory has 128 GB and 125 of that is being used up. However, the GPU is not being used at all. 288 megabytes. What's that? And that's because Ollama, the tool that we use to run LLMs, sometimes decides on its own how to distribute the load between the GPU and the CPU. If you want more control over that, you have to use other tools. For example, LM Studio is another one. And if I load a model in here, I'll be able to decide how much should be offloaded to the GPU. All of it or some of it or none of it. If you're not going to offload anything to the GPU, it is going to run fully on the CPU, which means it's not going to be giving you the best performance speed-wise. If you go half and half, still not the best performance. The best performance is if you're running fully on the GPU. So, here the model is loading and you'll see that we're getting some memory usage on the GPU, but we're still pretty high on that CPU memory usage. We shouldn't be, but I think Ollama didn't unload fully. So let's quit it. Okay, good. For some reason, Ollama didn't let go of the model even though I said bye to it. Look at that. Now the memory is going down. Now we should be able to load things in LM Studio. Let's load this smaller one so we can just see it quickly. I'm going to load 34 layers onto the GPU. And there it is. It loaded. So if I say something like hi, it's going to answer me pretty quickly. 51 tokens per second for the Gemma 3 4 billion model. Now, the reason I said hi to it is because I want to get the fastest possible speed. It's not a realistic example of what you're going to be doing with an LLM.
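As a side note on the offload setting, here is a small Python sketch of how one might reason about how many layers to put on the GPU. It is a hypothetical helper, not the logic LM Studio or Ollama actually use. It assumes every layer is the same size and ignores the context cache, and the 3,400 MB model size and the memory budgets are made-up numbers.

```python
def gpu_layers_that_fit(total_layers: int, model_mb: int, gpu_budget_mb: int) -> int:
    """Estimate how many equal-sized layers fit within a GPU memory budget.

    Assumes the model's size is spread evenly across its layers.
    Integer arithmetic is used so the result is exact.
    """
    if total_layers <= 0 or model_mb <= 0:
        raise ValueError("total_layers and model_mb must be positive")
    if gpu_budget_mb <= 0:
        return 0
    fit = gpu_budget_mb * total_layers // model_mb
    return min(total_layers, fit)


if __name__ == "__main__":
    total = 34       # the layer count loaded onto the GPU above
    size_mb = 3400   # hypothetical model size
    for budget_mb in (0, 2000, 8000):  # hypothetical budgets
        on_gpu = gpu_layers_that_fit(total, size_mb, budget_mb)
        print(f"{budget_mb} MB budget: {on_gpu} layers on GPU, {total - on_gpu} on CPU")
```

Whatever lands on the CPU side of that split is what drags the speed down, which is why loading all the layers onto the GPU gives the best result.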
You're going to be asking it much deeper questions, probably, like what's the meaning of life? But hi is like a one-token prompt, and smaller prompts yield faster results. Watch, I'll prove it to you. Write a 1,000-word story. This is a little bit of a tangent, but I think it's worth mentioning here. Whether I ask it to write a 1,000-word story or I ask it to write a piece of code should not matter that much. All that matters is the length of the prompt. And if you're using just the model without speculative decoding, which I'll go into in another video, then the output you're getting should pretty much be the same speed. So here we got 42 tokens per second instead of the 52 or so we had previously, because the prompt was longer. What I can do further is just copy this story and send that in. Now you can see the speed of this has slowed down. And if I stop this now, 41 tokens per second. So a bit slower. If we go to Task Manager, you'll see that we have 288 megabytes of dedicated GPU memory. What's up with that? Why not 128 GB? That's what this system has, right? Well, you can actually predetermine how much of the machine's memory goes to the GPU. And that's where the big difference comes in between this new APU by AMD and something like Apple Silicon. I'll get into that momentarily.
Tell me, which developer builds trust? The one typing or the one actually listening? This tiny Plaud NotePin turned me from a distracted notetaker into a focused problem solver. I've actually been enjoying using it for the last few months, and I can start recording anywhere, anytime with a single tap. Hands-free, no fumbling for a phone app, and zero risk of an incoming call wrecking the capture. Because the audio lives on the NotePin itself, it doesn't use up the phone's battery or storage. I stay focused on my meeting, not the battery gauge. NotePin gives me 300 free minutes of clear speaker-labeled transcripts each month, and that's perfect for standup meetings.
Unlike Apple Notes, which needs iOS 18 and a $1,000 iPhone, Plaud supports 112 languages and works on almost any device. Plaud