Blue Intel X Solidigm Meeting – Final — Transcript

Discussion on KV cache research, SSD testing, and collaboration between Blue Intelligence and Solidigm for AI and GPU workloads.

Key Takeaways

  • Full GPU plus SSD instances are critical for accurate KV cache benchmarking and validation.
  • Direct SSD access is necessary to measure system-level optimizations and endurance accurately.
  • Blue Intelligence’s KV cache solution significantly reduces inference costs without compromising accuracy.
  • The solution is designed for easy integration via a simple software plug-in into existing SSD hardware.
  • Collaboration and sharing of updated research and samples are essential for advancing the joint project.

Summary

  • Blue Intelligence and Solidigm discuss testing SSD samples (QLC and TLC) for KV cache research in AI workloads.
  • They explore providing a full GPU instance with attached solid-state drives for detailed benchmarking and validation.
  • The conversation covers why reducing dollar cost per million tokens in inference matters more than time-to-first-token improvements.
  • They emphasize the need for direct SSD access to measure garbage collection, endurance, and tail latency improvements.
  • The discussion includes technical details about SSD connectivity, PCIe local connection, and challenges with current server hardware.
  • Blue Intelligence highlights their KV cache benchmarks, including NIXL bench and an MLPerf 3.0 benchmark co-developed with Micron.
  • The go-to-market strategy involves a simple software plug-in for Solidigm SSDs, targeting Neo clouds and enterprise customers.
  • Internal tests showed zero accuracy loss and zero hallucination when running large datasets, boosting confidence in the solution.
  • They request updated white papers and sample SSDs to proceed with further testing and collaboration.
  • The meeting concludes with next steps focusing on technical requirements and sample provision for continued research.
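
The dollar-per-million-tokens metric mentioned in the summary can be illustrated with a minimal sketch. The function name, the hourly prices, and the throughput figures below are all hypothetical placeholders, not numbers from the meeting; the only assumption is that cost is the instance price divided by the tokens produced in the same period.

```python
def cost_per_million_tokens(instance_usd_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollar cost to produce one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return instance_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: a baseline instance versus the same instance with
# added SSD cost but higher sustained throughput from KV cache reuse.
baseline = cost_per_million_tokens(instance_usd_per_hour=3.0,
                                   tokens_per_second=1000.0)
with_offload = cost_per_million_tokens(instance_usd_per_hour=3.2,
                                       tokens_per_second=1500.0)

print(f"baseline:     ${baseline:.4f} per million tokens")
print(f"with offload: ${with_offload:.4f} per million tokens")
```

Under this framing, a storage tier pays for itself only if the throughput gain outweighs the added hourly cost, which is a different question from whether time to first token improves.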

Full Transcript

00:00
Speaker A
Oh, the H100 is reasonable, but the B200 is quite expensive. Do you think that's also something that's possible with the B200? Because they have run the tests on the H100, and yeah, we'd like to... I think I'd like to
00:19
Speaker A
see some of the detailed test reports. We are the AI lab.
00:25
Speaker A
We run that team within our organization, and we have engineers doing KV cache research.
00:31
Speaker A
Oh yeah, great. As part of the—yeah, I got your question. It was really very deep, so I enjoyed it. I also started with your question from your prior conversation, right? Yeah, to share the question that Mate has shared. Okay.
00:53
Speaker A
Yeah, we've—yeah, yeah, we can share some of our research as well. Yeah, very helpful. But JM, how—like I see the request as—that's fair. But how will Dr. Lee take those
01:11
Speaker A
samples and put them in a Lambda dataset? No, no, I mean, we could provide a GPU instance with a couple of solid drives. No problem.
01:18
Speaker A
Oh, for a couple—if it's a couple of weeks, you're saying the full instance along with a GPU?
01:23
Speaker A
Yeah, as long as you get one GPU with SSDs attached, because you don't necessarily need... well, for the QLC WAF validation, you'll need direct access to a passthrough SSD, but obviously we have been
01:40
Speaker A
doing KV cache scale at cluster level. Oh yeah, and with ISV partners AIS.
02:30
Speaker A
Thank you very much. Thanks to—but if you go back, ask—could you ask Dr.
02:37
Speaker A
to go back two slides and slide up? I read, because I think what I read was maybe one more—one more please. Where you said two samples, you just—yeah, you needed two samples from us: one P,
02:53
Speaker A
QLC sample and one TLC sample. How would you take the question? How would you, if we give you the samples, put it in that Lambda environment? Or do you want from us a full M
03:07
Speaker A
GPU plus storage? We can give you that instance for your engineers' instance host, then run it first.
03:28
Speaker A
Yeah. Yeah. Yes, the latter would be best if that's possible. The full instance environment. Yes. And yeah, I don't know if the version, the white paper version that we were sent, was maybe an older version. This is
04:15
Speaker A
slightly different as far as the proposals, but updated through the lab, we've independently discovered some of the same breakthroughs in the lab research a couple, so we could share some of the research
04:33
Speaker A
that we—Rosie, I think this is the first time the team is seeing this. The last versions of the white paper did not cover this. Could you, if there is an updated white paper, we would love to.
04:44
Speaker A
Yeah. Yeah. Yeah. For KV cache, the benefit on reduction in dollar per million tokens in inference is more important than the time to first token improvement. I think that this aligns with that vision. I believe
05:38
Speaker A
test architecture, if that were possible, as was being suggested, to provide Blue Intelligence with that sample—not just the single SSD as you pointed out, but the whole instance, you know, the GPU plus the solid SSDs. That would be very
06:03
Speaker A
helpful if that's possible on three counts. First, it would really help improve the garbage collection measurement aspect. And then second, in the endurance testing, we would be able to figure out what the baseline is
06:17
Speaker A
endurance. And then thirdly, it would most closely emulate the reference architecture or layer that we want to get to, in combination with, in partnership with Solidigm. So yes, if that's possible, that'd be
06:37
Speaker A
5336. So, Dr. Lee really agrees with what you’ve shared where your findings—I'm sorry, at the central lab—because maybe, but this would also be read as an additional ask if that's possible, because Dr. Lee asked for two,
07:29
Speaker A
I guess separate TLC and QLC SSD samples, the 5336 and the PS10, for three reasons. The first is we need a QLC and TLC SSD, or at least more than one SSD, to test how the
07:45
Speaker A
routing really is implemented across different tiers. And then secondly, also to test out the impact of the load on multiple SSDs when there are more and more concurrent users. And then thirdly,
08:04
Speaker A
it really is to validate the effect of the Blue Intelligence/Solidigm solution on reducing tail latency. We would need the key innovation, which is the quantization technique with the values in the KV cache.
08:20
Speaker A
Yeah. It is true in a sense: it is core, but it is also only part of their core technology/solution.
08:46
Speaker A
The other are the optimizations at the system level to improve bandwidth or—yeah, what?
09:11
Speaker A
The other, I guess, side of that technology would be the parallel retrieval mechanism that has really been able to test and implement, and that will be based on the Hopfield network study that was
09:31
Speaker A
published recently and it—want to—but against sparse par. JM, a question to you: where is this SSD physically in the pipeline? Is it direct connected to the GPU, or is it on a network, um, on the network store?
09:46
Speaker A
Yes. So this is one of the major questions in KV cache offload: whether there's this local tier, which in the Nvidia CMX they will call tier 3.5, that will sit as a BlueField and the off... but
10:02
Speaker A
my question is on this piece. On this particular PoC, it sounds like, well, to measure write amp and, you know, indirection unit efficiency for writes, you need disk-level access. You can't just have like a
10:17
Speaker A
volume or share. Where is the drive connected? Yeah, this is just local. Well, PCIe connected to the CPU on the host.
10:27
Speaker A
Oh, this is tier three. But I mean, H100 servers today only have E1.S support.
10:41
Speaker A
So we can't drive. But oh, I see. Okay, got it. But if the PC, the market, what's the little—how do you do it?
11:04
Speaker A
How do you do it? We have an NDA with Solidigm. Yes. We'd be interested in reading a more detailed benchmark report with our engineers.
11:32
Speaker A
We have two KV cache benchmarks that we're working on. As part of NIXL bench, there's KVBench. Oh, it's a benchmark built into NIXL.
11:47
Speaker A
And then MLPerf 3.0 has a KV cache benchmark that was written by Micron that we're co-developing for MLPerf. Those are the two benchmarks that we're using right now, but there are going to be lots more, Rosie. And once the PoC and the collaboration are successful, and the FMS report is successful afterwards, what is your plan to really get it adopted by the market? Mir and other Neo clouds, you have to embed it into the SSD. What is
12:03
Speaker A
your go-to-market to get to some of those projected market estimates? ML Famous.
12:19
Speaker A
Yeah. Go to market. Yeah. Telemetry PS10 intern. P 5316. And yeah, as to your question, to Yankee's question, before the go-to-market model, how it's packaged fundamentally is it's just basically a plug-in, one line of code, and you can plug Blue
13:08
Speaker A
Intelligence's software into the server. And stepwise, what we want to do first, what Blue wants to do first, is to try integrating, plugging in Blue software into the Solidigm SSD, starting with the PS10 to really test out and
15:17
Speaker A
complete that WAF reduction to a very competitive level, then expand that to the more target SKU, the 55 prefixes, and then the final go-to-market would be again to use that packaging model, i.e., you have the Solidigm SSD, and with a
15:34
Speaker A
single line of code, you plug in, very simply plug in Blue software, and that becomes the whole package product that would be sold. Our plan is firstly to the Neo clouds and then enterprise, and the reason that Blue has more confidence in
15:52
Speaker A
its solutions once it arrives at the go-to-market stage is that they've run an internal test where they found with their software, when they ran Wikipedia, which amounts to, it said, 6,000 books of data, which is very heavy context and
16:08
Speaker A
very heavy data, the result was zero loss in accuracy and zero hallucination. So it gives them extra confidence about their competitiveness at the go-to-market stage with Solidigm.
16:26
Speaker A
Thank you for that input. Thank you. I think we're on top of the hour.
16:41
Speaker A
So, let's take our next steps. I think we'd like, Dr. Lee, from your team, obviously this slide and any associated white paper which we could
16:58
Speaker A
Yeah. And then maybe, you know, sample two... we'd like to understand exactly from Dr. and his team what is required: one SSD or multiple SSDs connected to an H100 or B200,
17:19
Speaker A
and then we can work with the AI central lab team to see whether that infrastructure is as per their expectations, and then we can take it. I think this is a six-week activity. I think we'd like to
17:35
Speaker A
go into it. Kevin, it's not a big investment on our end, and at this point I think the team is not asking for engineering support. It's just infrastructure support, which we should be able to do. Great. You know, there would be
17:48
Speaker A
minimal engagement of our engineers with this PoC process. That's why I allowed him to talk about something with our engineers. So yes, if it is too much burden for
18:03
Speaker A
our engineers, and if we do have other priorities together with, you know, wa and MinIO, then priority is priority. But if the investment from our side is minimal, then why not?
18:18
Speaker A
Absolutely. I think that's my current stance as well. I think it's not a problem. Based on what I've heard today, we should be able to give one instance. We'll have to understand a little more about what exactly is required,
18:32
Speaker A
and at what granularity, because I think he also wanted to collect write amplification and storage metrics of the SSD. So we need to understand in a little more detail the test infrastructure
18:47
Speaker A
requirement. But in parallel... I think Nate dropped to get on to another call. Nate, who was on the call asking questions, is one of our industry experts in storage software. And then we have JM here
19:03
Speaker A
and Gabe, who are infrastructure experts. I think we have everybody here from a data center standpoint. Now, if I see a reason for us to include our core engineering, like Robbie Friy's team, I'll pull them in at the right
19:18
Speaker A
time. But at this point, like we can work directly with Dr. and

Topics: KV cache, SSD testing, QLC SSD, TLC SSD, GPU instance, AI inference, Blue Intelligence, Solidigm, benchmarking, MLPerf

Frequently Asked Questions

What types of SSD samples are requested for testing?

They requested two SSD samples: one QLC and one TLC SSD to test routing, load impact, and tail latency improvements.
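
Tail latency is usually reported as a high percentile (for example p99) of per-request latency. The following is a minimal sketch of the nearest-rank percentile, assuming latencies have already been collected as a list of numbers; the function name and the sample values are hypothetical and not taken from the meeting.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical read latencies in microseconds: mostly fast, with two
# slow outliers of the kind a garbage-collection stall might cause.
latencies_us = [100] * 98 + [5000, 20000]

print("p50:", percentile(latencies_us, 50))
print("p99:", percentile(latencies_us, 99))
```

The median hides the outliers entirely, which is why a comparison across concurrent-user loads would look at p99 or higher instead of the average.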

Why is a full GPU instance with SSDs important for the tests?

A full GPU instance with attached SSDs allows for accurate benchmarking of garbage collection, endurance, and system-level optimizations in a realistic environment.
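
Write amplification, which the transcript asks to measure at the drive level, is commonly expressed as the ratio of bytes written to NAND over bytes written by the host during the same interval. The sketch below assumes two counter snapshots taken before and after a test run; the `WriteCounters` type and its field names are hypothetical, and how such counters are read from a real drive is vendor-specific and not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteCounters:
    """Hypothetical snapshot of cumulative drive write counters."""
    host_bytes_written: int  # bytes the host asked the drive to write
    nand_bytes_written: int  # bytes the drive actually wrote to NAND


def write_amplification(before: WriteCounters, after: WriteCounters) -> float:
    """Write amplification factor over the interval between two snapshots."""
    host_delta = after.host_bytes_written - before.host_bytes_written
    nand_delta = after.nand_bytes_written - before.nand_bytes_written
    if host_delta <= 0:
        raise ValueError("no host writes recorded in the interval")
    return nand_delta / host_delta


# Hypothetical snapshots around a benchmark run.
start = WriteCounters(host_bytes_written=1_000, nand_bytes_written=1_500)
end = WriteCounters(host_bytes_written=2_000, nand_bytes_written=4_000)

print("WAF:", write_amplification(start, end))
```

Both counters live on the device itself, which is consistent with the point made in the meeting that a network volume or share is not enough and passthrough access to the SSD is needed.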

What is Blue Intelligence’s go-to-market strategy for their KV cache solution?

The strategy involves integrating their software as a simple plug-in into Solidigm SSDs, targeting Neo clouds and enterprise customers with a packaged product.
