DeepSeek V4: Cheapest Frontier Model Ever Released? — Transcript

DeepSeek V4 offers frontier-level AI models with massive context windows, open-source licensing, and drastically reduced inference costs.

Key Takeaways

  • DeepSeek V4 delivers frontier-level AI performance at a fraction of the cost of closed models.
  • Open-source MIT licensing enables commercial use and customization without restrictions.
  • Hybrid attention and advanced training techniques enable massive context windows with efficient inference.
  • V4 Pro and Flash are practical options for coding, agentic tasks, and long-context applications.
  • The release marks a significant generational step in open-weight model capabilities and economics.

Summary

  • DeepSeek V4 introduces two models: V4 Pro (1.6 trillion parameters) and V4 Flash (284 billion parameters), both with 1 million token context windows.
  • Both models use mixture of experts (MoE) architecture and are released under the MIT license, allowing commercial use and fine-tuning.
  • V4 Pro is the largest open-weight model released, trained on over 30 trillion tokens with advanced training optimizations for stability and efficiency.
  • The models feature a new hybrid attention mechanism (compressed sparse attention and heavily compressed attention) to reduce memory and compute costs significantly.
  • Inference costs are drastically lower than comparable frontier models, with V4 Flash nearly 100x cheaper than GPT 5.5 and Claude Opus 4.7 on input plus output token pricing.
  • Benchmarks show V4 Pro competitive with top models on coding and agentic tasks, though it lags slightly on knowledge benchmarks compared to closed models.
  • The model supports mixed precision (FP4 + FP8), enabling more feasible deployment, though V4 Pro requires high-end hardware setups for local use.
  • Three reasoning modes are available in the API, with recommendations for context window sizes based on task complexity.
  • Community reception is strong, with high rankings on open-weight model leaderboards and integration support across multiple platforms.
  • The release compresses frontier AI economics into a much lower price band, making it attractive for teams needing long-context and agentic AI capabilities.

Full Transcript

00:00
Speaker A
DeepSeek just shipped V4 and honestly, this release is the open-source moment a lot of people have been waiting for since R1.
00:06
Speaker A
There are two models in the family, the bigger one is V4 Pro, sitting at 1.6 trillion total parameters with 49 billion active per token.
00:16
Speaker A
The smaller one is V4 Flash at 284 billion total with 13 billion active.
00:22
Speaker A
Both are mixture of experts, both ship with a 1 million token context window by default, and both come under the MIT license, which means you can take the weights, fine-tune them, and put them in a commercial product without asking anyone for permission.
00:32
Speaker A
V4 Pro is now officially the largest open-weight model ever released. It was pre-trained on 33 trillion tokens, and Flash on roughly 32 trillion.
00:46
Speaker A
So they are sitting on a similar data foundation, just with a very different parameter budget.
00:51
Speaker A
Okay, so the part that actually matters for your wallet.
00:54
Speaker A
At a 1 million token context, V4 Pro uses about 27% of the per token inference flops compared to V3.2 and only 10% of the KV cache.
01:04
Speaker A
The Flash model is even more aggressive, using around 10% of the flops and only about 7% of the KV cache for the same context length, that is not a tweak, that is an order of magnitude jump on the memory side.
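To put those KV-cache percentages in perspective, here is a rough sizing sketch. The layer count, head count, and head dimension below are hypothetical placeholders, not DeepSeek's actual configuration; only the 10% and 7% fractions come from the numbers above.

```python
# Rough KV-cache sizing, just to show why a ~10x reduction matters at 1M tokens.
# All architecture numbers below are hypothetical placeholders, not DeepSeek's real config.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # 2x accounts for storing both keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128,
                          seq_len=1_000_000, bytes_per_value=2)
print(f"dense-attention baseline:      {baseline / 1e9:.0f} GB")   # ~246 GB with these placeholders
print(f"at 10% of that (V4 Pro figure): {baseline * 0.10 / 1e9:.0f} GB")
print(f"at 7% of that (Flash figure):   {baseline * 0.07 / 1e9:.0f} GB")
```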
01:56
Speaker A
The reason is the new hybrid attention setup.
02:00
Speaker A
They built something they call compressed sparse attention combined with heavily compressed attention: CSA basically picks the top-K chunks of the context that actually matter, while HCA is a denser pass over heavily reduced inputs.
02:14
Speaker A
They also keep a small sliding window for recent tokens and use learnable attention sinks to hold the logits stable. The whole point is to break the quadratic wall that kills long context inference everywhere else, and there is more under the hood.
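A minimal sketch of the chunk-selection idea, purely illustrative: score whole chunks cheaply, then attend only within the top-K winners. This is not the published CSA/HCA kernel, and it omits the sliding window and attention sinks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_topk_attention(q, K, V, chunk_size=64, top_k=4):
    """Toy single-query sparse attention: cheap per-chunk scores first,
    full attention only over the top-k most relevant chunks."""
    n, d = K.shape
    n_chunks = n // chunk_size
    # cheap chunk summary: the mean key of each chunk
    chunk_keys = K[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(axis=1)
    chunk_scores = chunk_keys @ q                       # (n_chunks,)
    keep = np.argsort(chunk_scores)[-top_k:]            # indices of the winning chunks
    idx = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size) for c in keep])
    scores = (K[idx] @ q) / np.sqrt(d)                  # dense attention, but only over kept tokens
    return softmax(scores) @ V[idx]

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))
print(chunked_topk_attention(q, K, V).shape)   # (64,)
```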
02:27
Speaker A
They replaced AdamW with the Muon optimizer for most of the training, which uses Newton-Schulz iterations to keep the singular values of the gradient matrix close to one, that gave them faster convergence and more stable training at the trillion parameter scale.
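For reference, this is roughly what a Newton-Schulz orthogonalization step of the kind Muon uses looks like. The coefficients below come from the public Muon reference implementation; whether DeepSeek's variant matches exactly is not stated in the transcript.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately push the singular values of a gradient matrix toward 1
    with a quintic Newton-Schulz iteration (Muon-style coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                 # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # work on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = torch.randn(512, 1024)
O = newton_schulz_orthogonalize(G)
print(torch.linalg.svdvals(O)[:5])           # singular values should sit close to 1
```

In Muon this orthogonalized matrix replaces the raw momentum direction for 2D weight matrices, which is where the faster, more stable convergence comes from.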
03:21
Speaker A
They also brought in something called manifold-constrained hyper-connections to widen the residual stream into four parallel paths while keeping signal amplification under control, in their own ablations the unconstrained version blew up to 3000x amplification and crashed training.
03:38
Speaker A
The constrained version sits at 1.6x, which is what let them push to 1.6 trillion parameters without the loss spike disaster you usually see at this scale.
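The exact manifold constraint is not described in this transcript, so the following is only a toy reading of the idea: four parallel residual paths mixed by a learnable matrix that is renormalized every step so the combined gain stays bounded instead of compounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedHyperConnection(nn.Module):
    """Toy reading of 'widen the residual stream into parallel paths'.
    The mixing matrix is row-normalized on every forward pass to keep the
    residual gain bounded; the real constraint is likely more involved."""
    def __init__(self, n_paths=4, d_model=256):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_paths) + 0.01 * torch.randn(n_paths, n_paths))
        self.block = nn.Linear(d_model, d_model)    # stand-in for an attention/MLP block

    def forward(self, paths):                       # paths: (n_paths, batch, d_model)
        mix = self.mix / self.mix.abs().sum(dim=-1, keepdim=True)   # bounded mixing weights
        mixed = torch.einsum("pq,qbd->pbd", mix, paths)
        return mixed + F.gelu(self.block(mixed))    # residual update on every path

layer = ConstrainedHyperConnection()
x = torch.randn(4, 2, 256)
print(layer(x).shape)                               # torch.Size([4, 2, 256])
```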
03:46
Speaker A
For training stability they also added what they call anticipatory routing, where the MoE router uses slightly older weights so the routing decisions do not feed back into themselves and amplify outliers.
03:59
Speaker A
They put a clamp on the SwiGLU gate to stop activations from exploding mid-training too.
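A sketch of how those two stabilizers could look in practice, under my own assumptions about the details: routing uses a slowly updated copy of the router weights, and the SwiGLU gate pre-activation is clamped. The EMA rate and clamp value are illustrative, not values from the report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableMoEGate(nn.Module):
    """Illustrative stabilizers: (1) route with a stale (EMA) copy of the router
    weights so routing decisions do not chase their own updates, and
    (2) clamp the SwiGLU gate so activations cannot explode mid-training."""
    def __init__(self, d_model=256, n_experts=8, ema=0.99, gate_clamp=10.0):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.register_buffer("router_ema", self.router.weight.detach().clone())
        self.w_gate = nn.Linear(d_model, d_model)
        self.w_up = nn.Linear(d_model, d_model)
        self.ema, self.gate_clamp = ema, gate_clamp

    @torch.no_grad()
    def update_ema(self):
        self.router_ema.mul_(self.ema).add_(self.router.weight, alpha=1 - self.ema)

    def forward(self, x):
        # routing uses the stale (EMA) weights, not the live ones
        routing_logits = F.linear(x, self.router_ema)
        top_expert = routing_logits.argmax(dim=-1)
        # clamped SwiGLU: silu(gate) * up, with the gate pre-activation clamped
        gate = torch.clamp(self.w_gate(x), -self.gate_clamp, self.gate_clamp)
        hidden = F.silu(gate) * self.w_up(x)
        return hidden, top_expert

m = StableMoEGate()
h, e = m(torch.randn(4, 256))
m.update_ema()
print(h.shape, e.shape)   # torch.Size([4, 256]) torch.Size([4])
```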
04:04
Speaker A
None of this is sexy on a thumbnail, but it is the reason a 1.6 trillion parameter MoE actually finished training in one piece.
04:13
Speaker A
Pricing is the part that is going to make every API team reconsider their stack on Monday morning.
04:19
Speaker A
V4 Pro is around $1.75 per million input tokens on a cache miss, with cache hit dropping to about 15 cents, and output sits near $3.50 to $4 per million.
05:11
Speaker A
Flash is way cheaper, with input at 14 cents per million and output at 28 cents.
05:17
Speaker A
Compare that to GPT 5.5 sitting near $1.75 input and $14 output, and Claude Opus 4.7 at $5 input and $25 output.
05:29
Speaker A
On output tokens, V4 Pro ends up roughly 1/6 the cost of Opus 4.7 and about 1/7 the cost of GPT 5.5, with cache hits that gap stretches to nearly 1/10, Flash on its own is almost 100x cheaper than the western frontier models on raw input plus output math.
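Quick arithmetic on the quoted rates, using the midpoint of the $3.50 to $4 output range for V4 Pro. The multiple you actually see depends on your input/output mix; output-heavy workloads are where Flash gets closest to the 100x figure.

```python
# Back-of-the-envelope comparison at the per-million-token rates quoted above (USD).
prices = {                                 # (input, output) per million tokens
    "V4 Pro (cache miss)": (1.75, 3.75),   # midpoint of the $3.50-$4 output range
    "V4 Flash":            (0.14, 0.28),
    "GPT 5.5":             (1.75, 14.00),
    "Claude Opus 4.7":     (5.00, 25.00),
}

for name, (inp, out) in prices.items():
    print(f"{name:22s} in ${inp:5.2f}  out ${out:5.2f}  1M in + 1M out = ${inp + out:6.2f}")

pro_out = prices["V4 Pro (cache miss)"][1]
flash_out = prices["V4 Flash"][1]
print(f"Opus 4.7 output / V4 Pro output: {25.00 / pro_out:.1f}x")   # ~6.7x, the 'roughly 1/6' claim
print(f"Opus 4.7 output / Flash output:  {25.00 / flash_out:.0f}x") # ~89x on output pricing alone
```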
05:47
Speaker A
Now the benchmarks, because this is where the picture gets honest, on agentic coding the scores are genuinely close to frontier.
05:55
Speaker A
V4 Pro Max scores 80.6% on SWE Bench Verified, which is essentially tied with Claude Opus 4.6 at 80.8 and Gemini 3.1 Pro at 80.6.
06:45
Speaker A
LiveCodeBench is 93.5%, which puts it ahead of Gemini at 91.7 and well ahead of Claude at 88.8.
06:57
Speaker A
On Codeforces, it hits a rating of 3206, which beats GPT 5.4X high at 3168, so for code generation, competitive programming and single-shot software tasks, this model is sitting in frontier territory.
07:12
Speaker A
But against the newer GPT 5.5 and Opus 4.7, the picture pulls back.
07:18
Speaker A
On SWE Bench Pro, V4 Pro Max gets 55.4%, GPT 5.5 hits 58.6, Opus 4.7 leads at 64.3.
07:30
Speaker A
On Terminal Bench 2.0, V4 Pro Max scores 67.9%, and Opus 4.7 is 69.4, GPT 5.5 jumps ahead on Terminal Bench at 82.7.
08:24
Speaker A
On MCP Atlas for tool orchestration, V4 sits at 73.6% behind GPT 5.5 at 75.3 and Opus 4.7 at 79.1.
08:37
Speaker A
The interesting outlier is BrowseComp for agentic web browsing, V4 Pro Max scores 83.4% there, beating Opus 4.7 at 79.3 and almost matching GPT 5.5 at 84.4, so agentic browsing is where this model really pushes.
08:55
Speaker A
On knowledge, it lags a bit: Gemini 3.1 Pro still leads on MMLU Pro, SimpleQA, and HLE.
09:03
Speaker A
DeepSeek themselves write in the technical report that the open-source frontier trails the closed one by roughly 3 to 6 months, that is unusually honest framing for a launch and I think it is closer to the truth than most marketing claims I have seen this year.
09:18
Speaker A
The hardware story is also worth a beat, they mention validating the model on both Nvidia GPUs and Huawei Ascend NPUs, but they do not say a word about which chips were actually used for training, that is a change from V3 where the training stack was documented in detail.
10:14
Speaker A
The model uses FP4 + FP8 mixed precision, with MoE expert weights running in FP4 and most other parameters in FP8, FP4 cuts memory roughly in half compared to FP8, so deployment becomes much more feasible, people will assume Blackwell, but Hopper can do FP4 in a weights-only mode too, so we cannot really tell from outside.
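A rough weight-memory estimate under the assumption that the MoE experts hold most of the parameters; the 95% expert share below is a guess for illustration, not a number from the report.

```python
# Rough weight-memory estimate for a 1.6T-parameter MoE in FP4 + FP8 mixed precision.
total_params = 1.6e12
expert_frac  = 0.95                                        # assumed: experts dominate the count
expert_bytes = total_params * expert_frac * 0.5            # FP4 = 4 bits = 0.5 bytes per weight
other_bytes  = total_params * (1 - expert_frac) * 1.0      # FP8 = 1 byte per weight
print(f"expert weights (FP4): {expert_bytes / 1e12:.2f} TB")
print(f"other weights  (FP8): {other_bytes / 1e9:.0f} GB")
print(f"total:                {(expert_bytes + other_bytes) / 1e12:.2f} TB")
# versus ~1.6 TB in all-FP8, so FP4 on the experts saves close to half the weight memory
```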
10:37
Speaker A
For local deployment, V4 Pro is basically out of reach for any normal setup, you are looking at an H200, dual A100 80GB or a quantized rig with multiple 4090s.
10:50
Speaker A
V4 Flash is a bit more practical, and there is already an 8-bit MLX build on Hugging Face, sitting around 302GB, so a 128GB Apple Silicon Mac Studio with quantization is the closest thing to a home setup that can host this.
11:46
Speaker A
Three reasoning modes ship in the API, non-think for fast answers, think high for normal reasoning, and think max for the heavy stuff.
11:54
Speaker A
For Think Max, they recommend running with at least a 384K context window, because the chains of thought are long, in my own light testing on the chat interface, the model is fast on output, but it spends a lot of tokens thinking, which is going to drive up your output bill on real workloads even with the cheap rates, so benchmark on your own tasks before you switch defaults.
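A hypothetical request sketch for the three modes. The endpoint, model identifier, and the reasoning-mode field name are all assumptions for illustration; check DeepSeek's API documentation for the real parameter names.

```python
import requests

def ask(prompt, mode="non-think", max_context=128_000):
    # All field names and the endpoint below are placeholders, not the real API schema.
    payload = {
        "model": "deepseek-v4-pro",                            # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": mode,                                     # "non-think" | "think-high" | "think-max" (assumed)
        "max_context": max_context,
    }
    resp = requests.post("https://api.example.com/v1/chat/completions",
                         json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

# Per the recommendation above, give think-max plenty of room for its long chains of thought:
# ask("Refactor this module and explain the changes.", mode="think-max", max_context=384_000)
```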
12:16
Speaker A
The community reaction has been pretty strong, Hugging Face welcomed the whale back, Vals AI listed V4 as the number one open-weight model on their Vibe Code benchmark.
12:26
Speaker A
Arena AI's live code leaderboard placed V4 Pro thinking at number three across all open models with an 88 Elo jump over V3.2, that is a real generational step, not a half update.
13:20
Speaker A
Integration support is already there for Claude Code, Open Claw and Open Code, and you can route through Open Router if you do not want to send traffic to a Chinese hosted endpoint for compliance reasons.
13:32
Speaker A
The one thing this release does not do is dethrone GPT 5.5 or Claude Opus 4.7 across the board, it does not need to.
13:40
Speaker A
If you can get 80 to 90% of frontier performance on coding and agentic tasks for 1/6 or 1/7 the price, with weights you can self-host, that is a different conversation than whether your provider is on a closed model leaderboard. That is the real story here: the economics of frontier class AI just got compressed into a much lower price band, and any team running heavy agent loops, codebase analysis, or long-context RAG should at least put V4 Flash or V4 Pro into their evaluation harness this week.
14:13
Speaker A
Alright, so that's it from the video and I hope you enjoyed it, if you did, please like this video and subscribe to the channel and I'll see you in the next video.
Topics: DeepSeek V4, open-source AI model, mixture of experts, long context window, AI inference cost, coding AI benchmarks, agentic AI, hybrid attention, AI model deployment, AI training optimization

Frequently Asked Questions

What are the main differences between DeepSeek V4 Pro and V4 Flash?

V4 Pro has 1.6 trillion parameters with 49 billion active per token and is the largest open-weight model released, while V4 Flash has 284 billion parameters with 13 billion active. Both share the same 1 million token context window but differ significantly in parameter budget and inference cost.

How does DeepSeek V4 reduce inference costs compared to other frontier models?

DeepSeek V4 uses a new hybrid attention mechanism combining compressed sparse attention and heavily compressed attention, drastically reducing memory and compute requirements. This results in inference costs that are roughly 1/6 to 1/7 the price of models like GPT 5.5 and Claude Opus 4.7.

Is DeepSeek V4 suitable for local deployment?

V4 Pro requires very high-end hardware such as Nvidia H200 or dual A100 80GB GPUs, making it impractical for most local setups. V4 Flash is more feasible for local deployment, with quantized versions available that can run on machines like a 128GB Apple Silicon Mac Studio.
