This Laptop Runs LLMs Better Than Most Desktops — Transcript

Alex Ziskind reviews the Asus Flow Z13 laptop with AMD Ryzen AI Max Plus 395 APU, showing it runs large LLMs efficiently with 128GB unified memory.

Key Takeaways

  • 128GB unified memory on AMD Ryzen AI Max Plus APU enables running very large LLMs locally on a laptop.
  • Shared memory architecture between CPU and GPU is crucial for efficient LLM performance.
  • Software choice impacts GPU utilization; LM Studio offers more control over GPU offloading than Ollama.
  • Larger LLMs provide better results but require significant memory and compute resources.
  • Prompt length affects token generation speed more than the type of task (story, code, etc.).

Summary

  • The Asus Flow Z13 2025 edition features AMD's Ryzen AI Max Plus 395 APU, combining CPU and GPU on one chip with shared 128GB unified memory.
  • This laptop can run large LLMs like the 110 billion parameter Qwen model, which many desktops and GPUs like the RTX 5090 cannot handle.
  • Unified memory architecture allows the GPU to access large memory pools, improving performance for local LLM inference.
  • Larger models (e.g., Llama 3 70 billion) generally provide better accuracy and reasoning but require more memory and compute power.
  • Performance depends on hardware specs including CPU, GPU, memory bandwidth, and software tools used to manage workload distribution.
  • Ollama sometimes decides on its own how to split work between CPU and GPU and can end up running mostly on the CPU, limiting GPU utilization and performance.
  • LM Studio allows manual control over GPU offloading, enabling better speed by running models fully or partially on the GPU.
  • Token generation speed varies with prompt length and model size; smaller prompts yield faster token rates.
  • Apple Silicon with 128GB of memory makes only about 96GB available to the GPU by default, while the AMD APU needs GPU memory assigned ahead of time (up to 96GB, with a reboot), which affects how large models load.
  • The video also briefly mentions a Plaud NotePin device for focused audio note-taking, unrelated to LLM performance.

Full Transcript

00:00
Speaker A
How is this possible? This 110 billion parameter model, Qwen 110 billion, is running on this machine. But even my 5090 can't run this model. I'll tell you how. This little device has 128 GB of unified memory on board. Oh, unified
00:17
Speaker A
memory, you say? Is that kind of like Apple's unified memory in this M4 Max?
00:22
Speaker A
Not exactly. And this does affect LLM performance, which is what I'm going to be digging into here right now. This thing, of course, is the Asus Flow Z13 2025 edition. And that's the first device to have AMD's new Ryzen AI Max
00:36
Speaker A
Plus 395 APU or known as accelerated processing unit that combines both the CPU and the GPU together on a single chip. And that means that the GPU can have access to the absurd amount of memory that's shared with the CPU.
00:51
Speaker A
Shared maybe. And as of about a week ago, these were the only two laptops that had 128 GB of memory. But this is becoming more common these days especially with more and more people interested in running local models
01:03
Speaker A
local LLMs. But wait a minute, my phone with 8 gigs of memory or however much is in the iPhone, I don't know, that thing can run LLMs, and so can this 16 GB MacBook Air. In fact, all these laptops
01:15
Speaker A
can run LLMs. So why would I care to get 128 GB? Isn't that expensive? Well, yes, it's very expensive for now. And the reason you care is because larger models like Llama 3 70 billion, for example, they often outperform smaller ones since
01:33
Speaker A
they store more knowledge and maintain longer reasoning chains. That just means fewer hallucinations and better performance on complex tasks like math or legal arguments, for example. We're going to make that assumption here.
01:44
Speaker A
We're going to assume that bigger models give you better results because there's a ton of quality benchmarks out there that show results. And going forward we're going to focus on the other performance side of things, which is how
01:54
Speaker A
fast they can do stuff. Oh, look at that. 2.12 tokens per second. So, this whole thing depends on the hardware you're using, what the CPU is, what the GPU is, how much memory bandwidth you have, and so on. So, what I did was I
02:05
Speaker A
picked a bunch of these models that I've been collecting over some time. I don't keep all the ones that I collect, but you know, just a few. I don't have all these models on all these machines, but I have enough to get a sample so we can
02:17
Speaker A
compare things. And some of these I run on a regular basis, especially vision models where I need to automate some local scripts that look at thousands of pictures and then describe them. This stuff would be really expensive doing in
02:29
Speaker A
the cloud. So this is 128 gigs on this one. This one is another model of the same thing. It's the Flow Z13, but it has 32 gigs of memory. Now, the reason I found out about the Qwen model is
02:39
Speaker A
because a regular commenter on the channel asked me specifically about that one. And that happened to coincide with the size of the model that's just big enough to fit on this machine. And if you go bigger, it does not work even
02:51
Speaker A
though it looks like it should. It looks like it should, but it doesn't. So, for example, here's ollama run. And I'm going to try this one, which is 73 GB in size on disk, not in memory. By the way, I
03:04
Speaker A
got to mention this. The size on disk of the model is not necessarily going to be the size of the model when it's running in memory. It's just going to give you an approximate starting point. So don't always go by that. In fact, I made an LLM
03:16
Speaker A
calculator recently in one of the videos. I'll link to that down below so you can check out what the running memory footprint could be. Also an approximation, by the way. So let's do this one. Mistral Large, latest. And this
03:26
Speaker A
machine should be able to handle that having, you know, 128 GB available. It's not even how it works on Apple Silicon either. This MacBook Pro with the M4 Max also has 128 GB available, but not all of that is available for doing LLMs on
03:42
Speaker A
the GPU. By default, you get about 96 GB. You can override that, of course, get it a little bit closer, but then you're running at the risk of the system becoming unstable. Now, if we take a look at this, you'll see that Mistral Large
03:54
Speaker A
is not giving us the best results, but it is running. However, this is not running on the GPU. This is mostly running on the CPU. Take a look at the memory footprint. You'll see that system memory has 128 GB and 125 of that is
04:08
Speaker A
being used up. However, the GPU is not being used at all. 288 megabytes. What's that? And that's because Ollama, the tool that we use to run LLMs, sometimes decides on its own how to distribute the load between the GPU and the CPU. If you
04:21
Speaker A
want more control over that, you have to use other tools. For example, LM Studio is another one. And if I load a model in here, I'll be able to decide how much should be offloaded to the GPU. All of
04:34
Speaker A
it or some of it or none of it. If you're not going to offload anything to the GPU, it is going to run fully on the CPU, which means it's not going to be giving you the best performance speed-wise. If you go half and half,
04:44
Speaker A
still not the best performance. The best performance is if you're running fully on the GPU here. So, here the model is loading and you'll see that we're getting some memory usage on the GPU but we're still pretty high on that CPU
04:56
Speaker A
memory usage. We shouldn't be, but I think Ollama didn't unload fully. So let's quit it. Okay, good. For some reason, Ollama didn't let go of the model even though I said bye to it. Look at that. Now, the memory is going down.
05:08
Speaker A
Now, we should be able to load things in LM Studio. Let's load this smaller one so we can just see it quickly. I'm going to load 34 layers onto the GPU.
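A rough way to picture partial offloading is to assume every layer of the model is about the same size, so the share of memory that lands on the GPU follows the share of layers offloaded. The sketch below is only an approximation under that assumption; the function name and the example numbers are hypothetical and are not taken from LM Studio.

```python
def split_by_layers(model_gb: float, total_layers: int, gpu_layers: int) -> tuple:
    """Estimate (gpu_gb, cpu_gb) for a partially offloaded model.

    Assumes all layers are roughly equal in size, which is only an
    approximation (embeddings and output layers differ in practice).
    """
    if total_layers <= 0:
        raise ValueError("total_layers must be positive")
    if not 0 <= gpu_layers <= total_layers:
        raise ValueError("gpu_layers must be between 0 and total_layers")
    gpu_gb = model_gb * gpu_layers / total_layers
    return gpu_gb, model_gb - gpu_gb


# Hypothetical 40 GB model with 80 layers, half of them offloaded.
gpu_gb, cpu_gb = split_by_layers(40.0, total_layers=80, gpu_layers=40)
print(f"GPU: {gpu_gb:.1f} GB, CPU: {cpu_gb:.1f} GB")
```

Offloading all layers puts the whole estimate on the GPU side, which is the case described here as giving the best speed.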
05:22
Speaker A
And there it is. It loaded. So if I say something like hi, it's going to answer me pretty quickly. 51 tokens per second for the Gemma 3 4 billion model. Now the reason I said hi to it is because I want
05:34
Speaker A
to get the fastest possible speed. It's not a realistic example of what you're going to be doing with an LLM. You're going to be asking it much deeper questions probably like what's the meaning of life? But hi, it's like a
05:44
Speaker A
one token prompt, and smaller prompts yield faster results. Watch, I'll prove it to you. Write a 1,000-word story. This is a little bit of a tangent, but I think it's worth mentioning here. And whether I ask it to write a 1,000-word
05:59
Speaker A
story or I ask it to write a piece of code should not matter that much. All that matters is the length of the prompt. And if you're using just the model without speculative decoding, I'll go into that in another video, then the
06:10
Speaker A
output you're getting should pretty much be the same speed. So here we got 42 tokens per second instead of like we had previously 52 because the prompt was longer. What I can do further is just copy this story and send that in. Now
06:24
Speaker A
you can see the speed of this has drastically slowed down. And if I stop this now, 41 tokens per second. So a bit slower. If we go to Task Manager, you'll see that we have 288 megabytes dedicated GPU memory. What's up with that? Why not
06:39
Speaker A
128 GB? That's what this system has right? Well, you can actually predetermine how much memory goes into the machine. And that's where the big difference comes in between this new APU by AMD and something like Apple Silicon.
06:52
Speaker A
I'll get into that momentarily. Tell me, which developer builds trust? The one typing or the one actually listening?
06:58
Speaker A
This tiny Plaud NotePin turned me from a distracted notetaker to a focused problem solver. I've actually been enjoying using it for the last few months, and I can start recording anywhere, anytime with a single tap.
07:09
Speaker A
Hands-free, no fumbling for a phone app and zero risk of an incoming call wrecking the capture. Because the audio lives on the Notepin itself, it doesn't use up the phone battery or storage. I stay focused on my meeting, not the
07:22
Speaker A
battery gauge. NotePin gives me 300 free minutes of clear speaker-labeled transcripts each month, and that's perfect for standup meetings. Unlike Apple Notes, which needs iOS 18 and a $1,000 iPhone, Plaud supports 112 languages and works on almost any
07:37
Speaker A
device. Plaud AI, built on top of GPT-4, o3-mini, and Claude 3.7 Sonnet, instantly pulls out key takeaways, decisions, and action items. You can pick from over 30 templates or customize your own. Later I just ask it, when did we discuss adding
07:53
Speaker A
dark mode? And Ask AI jumps to that precise second and can pipe the answer into Jira or Slack. It weighs just 16 grams, records 20 straight hours, and clips anywhere. Lapel, lanyard. I prefer wearing mine on my wrist. Hands-free,
08:09
Speaker A
distraction-proof, and totally discreet. But always make sure and ask for consent before hitting record. And by the way, everything is locked in AWS with HIPAA-grade security, so client info stays private. Let Plaud Note handle your sprint notes. Link is in the
08:24
Speaker A
description. And in two weeks, I'll send a free NotePin to one random commenter from this video. Got to live in the US though. Now, back to the show. For now, I want you to see that there is a piece
08:33
Speaker A
of software called Armoury Crate on Asus machines, which allows you to free up memory and to basically assign the GPU a certain amount of memory. So, here I can assign it half a gig, 1 gig, all the way
08:46
Speaker A
up to 96 GB. But notice when I select an option, I have to reboot my computer. On this 32 GB machine, I don't have the same options. I have up to 24 GB available for the GPU. And still, if I
09:00
Speaker A
select something else, I have to reboot my machine. So, why wouldn't I just leave that on auto? After all, isn't auto going to determine the best course of action for me? Well, not always. If you ask people that drive automatic
09:12
Speaker A
transmission cars, they're going to say "Hey, I like it." If you ask people that drive manual transmission cars, they're going to say, "Ah, automatic doesn't let me do what I want." Same thing here.
09:21
Speaker A
auto. Having that setting, it looks like we're getting barely anything in the dedicated GPU memory here in Task Manager for this Radeon 8060S. But our model still runs, so that's good, right?
09:34
Speaker A
Well, not exactly. I ran a bunch of these models. I ran all of them and on all the different memory settings. If we take a look at this chart for example and we take a look at auto, which is
09:46
Speaker A
this light blue one, you'll see that the performance of auto is always way worse.
09:52
Speaker A
Well, not always. In this case, 64 was worse on Llama 3.2, but generally the performance of auto is worse than manually selecting one of the memory options. However, if you select the memory option, it's generally better.
10:07
Speaker A
Some of them are better than others. 16 seems to be doing pretty well. 8 is doing well. 96 is doing well and so is 32. 1 GB, if you select 1 GB for the memory option, it's not doing well. And
10:20
Speaker A
neither is 64 in these two cases for these two models, Gemma 3 and Llama 3.2.
10:27
Speaker A
Not sure why that is, but down here 64 seems to be doing okay. Just up here it's not. But generally seems like auto is not the best option. So, let's pick a different option. And I would tend to
10:37
Speaker A
think that if I'm running a large language model, then perhaps I want to have 96 GB, right? I want as much as possible available. So, I'm going to select that and reboot my machine. Why did I have to do that though? That's
10:50
Speaker A
kind of a bummer. Well, it comes down to architecture. While they call it unified memory here, it's not a true unified memory architecture where the CPU and the GPU share a space, minimizing data movement between, well, the CPU and the
11:04
Speaker A
GPU. It's more like they can share memory here, but they have to figure out who gets what at the beginning and only use what they have access to. Log in with my face. Yes, Apple. Why don't you have that? You could put that somewhere
11:17
Speaker A
in that giant notch. Imagine Apple's like two chefs sharing one big countertop. Both can grab the flour, the knives, or the mixing bowl anytime they want. AMD's is like two chefs with separate tables, and they agree ahead of
11:31
Speaker A
time. You get the flour and you get the bowls and you get the knives and no swapping mid recipe. So, check this out.
11:38
Speaker A
Now, if we go to our GPU, we got 96 GB available for the GPU. And you think this is great, right? Well, there's a problem. If I try to load a model here let's say, I don't know, a larger model
11:51
Speaker A
because after all, that's why I want so much memory, so I can run larger models. Let's go with this Llama 3.3 70 billion. And we're going to try to offload 80 layers. It's 74 GB on disk.
12:02
Speaker A
Should fit no problem, right? Oops. What's going on here? Failed to load the model. Model loading aborted due to insufficient system resources. What? I thought I had enough. What's going on here? I bought this computer. Now I can't use it. All right, let's try this
12:16
Speaker A
on the Mac. I have a couple of 70 billion ones here. Let's try this one.
12:20
Speaker A
69 GB on disk. Load model. All 80 layers offloaded to the GPU. Let's see what happens now. Oh, this is Q8. Wow, that's a quantization level. So, pretty good high quality model here. And look at that. Memory used 120 GB. I didn't have
12:36
Speaker A
to say, oh, GPU gets this much. It just automatically does it because it shares the same memory. And if I say hi to here, how's it going? Is there something I can help you with? It works. But, you
12:47
Speaker A
know, we're not totally dead in the water here on this machine. There is a hardware setting you can adjust here called guardrails in LM Studio where right now it's set to strict. So, it's going to prevent you from shooting
13:00
Speaker A
yourself in the foot by loading too big a model and making your system freeze.
13:04
Speaker A
We can turn that off. You know, there's a bunch of other settings in between but let's go for it. And let's go for this big one here. Again, offload everything to the GPU and load. If we take a look at what's
13:21
Speaker A
happening here and why it's taking so long, you'll see that system memory is still being used. Why is system memory being used? I thought we had enough available memory in the GPU to just load it in there. Well, we're using that
13:32
Speaker A
too. But we're also using system memory. Failed to load the model. That's too bad. And that's 69 GB on disk. We do have other 70 billion parameter models.
13:42
Speaker A
They're just quantized more. So, instead of Q8, or quantized to 8-bit, we can do Q4, which is quantized to 4-bit. It's half the size. So, let's load that one.
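The "half the size" point can be checked with simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by 8 to get bytes. The sketch below assumes nominal bit widths (exactly 8 or 4 bits per weight); real quantized files carry extra scale data, and a running model also needs room for context, so the `headroom_gb` value here is a hypothetical placeholder, not a measured figure.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * bits / 8 bytes each, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8


def fits(model_gb: float, budget_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom fit in the memory budget.

    headroom_gb is a hypothetical allowance for context and buffers.
    """
    return model_gb + headroom_gb <= budget_gb


q8 = weights_gb(70, 8)  # 70 billion parameters at 8 bits per weight
q4 = weights_gb(70, 4)  # same model at 4 bits per weight
print(f"Q8: about {q8:.0f} GB, Q4: about {q4:.0f} GB")
print("Q4 fits in a 96 GB GPU allocation:", fits(q4, 96))
print("Q8 fits in a 24 GB GPU allocation:", fits(q8, 24))
```

This is the same kind of approximation as file size on disk: a starting point, not a guarantee that a model will load.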
13:56
Speaker A
And again, the system memory is growing and so is the GPU memory. And I'll tell you what's happening here. This is my assumption, but I don't see any other explanation. And if you know better then you let me know in the comments
14:07
Speaker A
down below. And hopefully everybody else will get to learn. But because I think this is not a true unified memory architecture, it first has to do what other systems with a discrete GPU do.
14:19
Speaker A
They copy your model to memory on the system, and then that gets copied to the GPU memory, which is a separate space. And when it's a discrete GPU, that's obvious: that data has to be copied from system RAM to the GPU RAM, and then it stays
14:32
Speaker A
there to be quick. Here, it wasn't so obvious, because you think that, well, we're on the same chip, we're on the same APU, they're unified memory, why not just use the same thing? Well, I don't think that's what's happening here. So if we
14:45
Speaker A
say hi here, it looks like it's working fine. And now that it's copied things over to the GPU, it's going to be in there. Memory is still being used, but not as much as initially. As you can see
14:55
Speaker A
here, initially we had a big bump. Now we're down to about half. But if we take a look at the GPU memory, now it's being used quite a bit. And there is our compute from the GPU when I ask my
15:07
Speaker A
prompt, when I ask my query. So that's essentially the big difference here. Could I get the 32 gig version instead of the 128 gig version and be able to run smaller models? Sure, you could. But the 32 gig version is not going to give
15:17
Speaker A
you everything that the 128 gig model will. For example, it will not run this model because this model is way too big for it. And I made a few charts here. I collected some data. I can actually post
15:27
Speaker A
some links down below here. You'll see the different machines in different colors. Let's clean it up a little bit here. Let's take a look at the two Flow Z13 machines. And you'll see that, yeah, when it comes to these larger models,
15:40
Speaker A
only the 128 GB version will run those. However, the models that do run on both of them, the smaller models like this one right here, Gemma 12 billion runs on both, we're getting 24 tokens per second here on the 128 GB machine. And we're
15:57
Speaker A
getting 25.5 tokens when we set the 32 GB machine to 24 GB allocated to the GPU. A little bit even better here in this particular case of this model. If we zoom in on the 32 GB machine, you can
16:11
Speaker A
see the little bump that we get from going to the 16 gigabyte setting right there across the board from all the models. And that's actually consistent when we look at the 128 GB machine as well. The 16 GB setting gives us really
16:26
Speaker A
good results. 64 GB setting does not for some reason. Now, you might wonder how the AMDs do against Apple Silicon. It depends, because one other thing plays a very key role in performance, and by performance I mean speed, tokens per
16:40
Speaker A
second. This is the M4 MacBook Air and it has 100 GB per second memory bandwidth. The M4 Max 530 something.
16:49
Speaker A
This machine is pretty decent. It's got 230 something, and in fact I ran the STREAM benchmark, which is an industry standard for measuring memory bandwidth, and for the Triad I got 235 GB per second. There's another machine coming
17:03
Speaker A
very soon that's a little bit less than the M4 Pro chip. The M4 Pro chip has a little bit more memory bandwidth. So but it's going to be about the same speed for tokens generation. There's another machine that's coming out very
17:14
Speaker A
close numbers and I'm planning to get that in here to test and that's the DGX Spark, Nvidia's own machine. Now, if I run STREAM on my M4 Max, I'm not getting close to the advertised numbers here.
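A common rule of thumb for why memory bandwidth matters: when generating text with a dense model, each new token requires reading roughly all of the weights once, so bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below applies that rule to the bandwidth figures mentioned so far; the 35 GB model size is a hypothetical example, and measured speeds will come in below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on generation speed for a dense model.

    Assumes every generated token reads all model weights from memory once
    and ignores compute time, caches, and any other overhead.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


model_gb = 35.0  # hypothetical model size in memory
for name, bandwidth in [("M4 MacBook Air", 100.0), ("Flow Z13 (STREAM Triad)", 235.0)]:
    ceiling = max_tokens_per_second(bandwidth, model_gb)
    print(f"{name}: at most about {ceiling:.1f} tokens per second")
```

The ratio between two machines' ceilings is just the ratio of their bandwidths, which is why measured bandwidth is a useful first predictor of relative token speed.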
17:25
Speaker A
For the Triad, I'm only getting 316. For copy, I'm getting 355, but that's not even close to the 500 plus that we're supposed to be getting. And finally, the M4 Air is giving us about 100 in the STREAM results. So, we know the Apple
17:37
Speaker A
Silicon measurement of the STREAM benchmark is pretty close to accurate. And that's why the MacBook Air numbers for the tokens per second are actually much lower than the AMD Ryzen AI Max Plus 395. Hey, I'm getting it. I'm
17:49
Speaker A
getting it. So now, right off the bat, to get a sense of what these machines are capable of, we might as well run llama-bench, which is part of llama.cpp, a pretty common tool that actually a lot
18:00
Speaker A
of people use, including LM Studio under the hood. So, we're going to go llama-bench. I compiled it, by the way, using Vulkan so that it uses the GPU on the AMD machines, and by default, it uses Metal on the MacBooks. All we need to do
18:14
Speaker A
is run llama-bench and point it to the model that we want to test. I'm just going to do one as an example here so we can get a relative sense of how these models perform on these machines. And
18:25
Speaker A
I'm going to pass it the GGUF file. All right. So, let's go. Here we go. On all these. And here's the result on the MacBook Air. You can see the PP512 result, which is about 1500.
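A run like this one can be scripted. The sketch below only builds the command line and prints it; it assumes a `llama-bench` binary from llama.cpp is on the PATH, and the model path is a hypothetical placeholder. The `-m` flag points at the GGUF file and `-ngl` sets the number of layers to offload to the GPU (in llama.cpp this is a layer count). The pp512 and tg128 rows in the output correspond to the tool's default tests of 512 prompt tokens and 128 generated tokens.

```python
import shlex
from typing import List, Optional


def llama_bench_command(model_path: str,
                        gpu_layers: Optional[int] = None,
                        binary: str = "llama-bench") -> List[str]:
    """Build an argument list for llama-bench without running it."""
    args = [binary, "-m", model_path]
    if gpu_layers is not None:
        # -ngl: number of model layers to offload to the GPU
        args += ["-ngl", str(gpu_layers)]
    return args


# Hypothetical model path; substitute the GGUF file you want to test.
cmd = llama_bench_command("models/example.gguf", gpu_layers=100)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a layer count larger than the model has is a common way to request that every layer be offloaded.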
18:39
Speaker A
That's the prompt processing result. And what's good about this particular benchmark is that it comes out pretty even across the machines. PP512. TG128 is text generation, and we're getting 101 tokens per second here. On Windows machines, you can pass in the -ngl flag,
18:55
Speaker A
which is how many layers you want to offload to the GPU. Not how many, but what the percentage is. Let's see if that works on the Mac as well. We're getting slightly better numbers, but nothing drastic here. You can see the
19:07
Speaker A
results are a little bit better on the AMD machine. NGL 999. Let's redo that with NGL 100. So, we're offloading all the layers, and we're sure to be doing that. And we're getting slightly better numbers here. For PP512, we're getting
19:23
Speaker A
6,173. And for text generation, 159, 162 for the previous run. As you can see, we're using Vulkan here. Vulkan is a library that's cross-platform, so you can use it on Linux, Windows, AMD, RTX, all those machines. And here is the result
19:39
Speaker A
for the M4 Max. The first result. I'm running it again with NGL 100 and we're getting slightly better numbers with that. At least for the prompt processing, we're getting better numbers, but they're pretty close. 7,000 or so. And text generation is the
19:53
Speaker A
highest at about 240. That's the Gemma 3 1 billion model. I'm not going to do llama-bench for all of them, just to give you a baseline here of how these machines relate to each other. I did recently review this machine from a developer
20:04
Speaker A
perspective. Check that video out. I'll link to it right over here. Thanks for watching. Hope you enjoyed this one and I'll see you in the next one.
Topics: Asus Flow Z13, AMD Ryzen AI Max Plus, LLM performance, large language models, unified memory, local AI models, GPU offloading, Ollama, LM Studio, AI laptop

Frequently Asked Questions

Why can the Asus Flow Z13 run the 110 billion parameter Qwen model when other desktops cannot?

Because it has 128GB of unified memory shared between the CPU and GPU via AMD's Ryzen AI Max Plus APU, allowing the large model to fit and run efficiently.

How does unified memory affect LLM performance on this laptop?

Unified memory allows the GPU direct access to a large pool of memory shared with the CPU, reducing bottlenecks and improving speed when running large models.

What software tools are recommended for better GPU utilization when running LLMs?

LM Studio is recommended because it allows manual control over how much of the model is offloaded to the GPU, whereas Ollama sometimes decides on its own how to split the load and can end up running mostly on the CPU.
