SPIN Processed

Source Reddit r/LocalLLaMA reddit.com Forum

July 5, 2026 · community · benchmarking

DeepSeek-V4-Flash in MXFP4 is too slow on CPU

Uses technical specificity (model name, quant format, hardware specs, token/s metric) to imply rigor while omitting essential implementation context: runtime version, compilation flags, kernel optimizations, memory layout, or verification that MXFP4 decoding is active.

Overview

A Reddit user reports unexpectedly low inference speed (3.2 tokens/sec) for DeepSeek-V4-Flash quantized in MXFP4 on CPU-only hardware. The user expected more, based on GLM-5.2's performance on the same machine, and asks whether MXFP4 is the bottleneck.

TL;DR

User benchmarks DeepSeek-V4-Flash (13B, MXFP4) on a legacy Xeon CPU with DDR4, reaching only 3.2 t/s
Finds this disappointing next to GLM-5.2 (40B, Q4_K_XL), which runs at 1.8 t/s on the same hardware, given DeepSeek-V4-Flash's smaller size and newer quantization format
Asks whether the MXFP4 format is responsible and where to obtain Q4 variants

Key Stats

3.2 tokens/sec: Reported inference speed on E5-2699v4 CPU with DDR4-2133

1.8 tokens/sec: Baseline GLM-5.2 Q4_K_XL speed on same hardware

Questions Answered

What happened?
Who is involved?
Why does this matter?

Keywords

MXFP4
DeepSeek-V4-Flash
CPU inference
quantization
token throughput

Narrative Frame

performance framing

The Fog

Spin Score

20%

Emphasizes observed slowness and comparative expectation; minimizes uncertainty around whether the reported speed reflects MXFP4’s intrinsic limitations or unreported software/hardware mismatches.

What the story wants you to believe

That the observed slowdown is likely attributable to MXFP4 format limitations rather than configuration, tooling, or runtime issues.

What it makes harder to question

Whether the user’s environment actually supports or correctly executes MXFP4 — shifting focus to the format itself instead of implementation fidelity.

How the spin works

The story redirects attention toward process, intent, scale, mission, or future benefits instead of unresolved concerns. Watch for loaded terms such as "miserable performance", "disappointing", and "too slow". The distribution reads as community reporting. A pressure point: which llama.cpp (or other runtime) version was used.

Who Benefits If This Frame Spreads

u/perelmanych

Gains visibility, peer validation, and targeted technical assistance

Framing as a precise benchmark invites expert response and positions the poster as technically competent.

The Frame

Empirical troubleshooting report from an experienced hobbyist deploying frontier models on constrained hardware.

Missing Context

llama.cpp or other runtime version used
exact quantization toolchain and commit hash
whether MXFP4 support is enabled/verified in the runtime
memory bandwidth measurement methodology
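
The memory bandwidth gap can be made concrete with a back-of-envelope estimate. The sketch below is an illustration under stated assumptions, not a reconstruction of the poster's setup. It assumes a single socket with four populated DDR4-2133 channels and 8 bytes per transfer per channel. It treats the "13B" figure from the TL;DR as the number of weights read per generated token. It takes MXFP4 as 4.25 bits per weight (4-bit elements plus one 8-bit shared scale per 32-element block). The function names are hypothetical, and the estimate ignores KV-cache traffic, activations, and NUMA effects.

```python
def peak_bandwidth_gbs(transfers_mt_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


def decode_ceiling_tps(bandwidth_gbs: float, weights_read_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/sec if every generated token streams the weights once."""
    bytes_per_token = weights_read_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Assumptions (not stated in the source): 4 channels, 13B weights read per token.
    peak = peak_bandwidth_gbs(transfers_mt_s=2133, channels=4)
    ceiling = decode_ceiling_tps(peak, weights_read_billion=13, bits_per_weight=4.25)
    reported = 3.2  # t/s, the figure quoted in the post
    print(f"peak bandwidth: {peak:.1f} GB/s")
    print(f"bandwidth-bound ceiling: {ceiling:.1f} t/s")
    print(f"reported / ceiling: {reported / ceiling:.2f}")
```

Under these assumptions the peak comes out near 68 GB/s and the ceiling just under 10 t/s, so the reported 3.2 t/s would sit at roughly a third of the bandwidth bound. A measured bandwidth figure from the actual machine would replace the theoretical peak and tighten the comparison.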

SpinGraph

How this belief gets built

Claim → Frame → Beneficiary → Gap → AI Risk

The post frames a single-user performance issue as evidence against MXFP4’s viability on CPU

Claim: "The maximum I can get is 3.2 t/s of tg"

Frame: Key details stay obscured. Empirical troubleshooting report from an experienced hobbyist deploying frontier models on constrained hardware.

Beneficiary: u/perelmanych, who gains visibility, peer validation, and targeted technical assistance

Gap: llama.cpp or other runtime version used

AI Risk: AI may repeat “MXFP4 quantization of DeepSeek-V4-Flash runs slowly on CPU hardware.”

Claim Ledger

Claim	Evidence	Verification	Risk	Evidence Gaps
The maximum I can get is 3.2 t/s of tg	Self-reported token/s metric	Claim Present in Source	Low	Timing logs; Runtime version; Memory bandwidth benchmark output; Verification that MXFP4 decoding path was engaged

01 · Primary Technical · Claim Present in Source · risk: Low

The maximum I can get is 3.2 t/s of tg

evidence: Self-reported token/s metric

"Unfortunately, the maximum I can get is 3.2 t/s of tg, which is very disappointing."

Evidence Gaps

Timing logs
Runtime version
Memory bandwidth benchmark output
Verification that MXFP4 decoding path was engaged
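
As an illustration of what a timing log backing the claim might contain, the sketch below computes tokens per second per phase from raw token counts and elapsed time. The `TimingRecord` type, the `tokens_per_second` helper, and the sample numbers are all hypothetical and do not come from the post. "tg" follows the post's abbreviation for token generation, and "pp" is assumed here to mean prompt processing.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TimingRecord:
    phase: str      # "pp" (prompt processing, assumed label) or "tg" (token generation)
    tokens: int     # tokens handled in this run
    seconds: float  # wall-clock time for this run


def tokens_per_second(records: Iterable[TimingRecord], phase: str) -> float:
    """Aggregate throughput for one phase across all matching runs."""
    selected = [r for r in records if r.phase == phase]
    total_tokens = sum(r.tokens for r in selected)
    total_seconds = sum(r.seconds for r in selected)
    if total_seconds <= 0:
        raise ValueError(f"no timed runs recorded for phase {phase!r}")
    return total_tokens / total_seconds


if __name__ == "__main__":
    # Made-up sample values, chosen only to show the calculation.
    log = [
        TimingRecord(phase="pp", tokens=512, seconds=16.0),
        TimingRecord(phase="tg", tokens=64, seconds=20.0),
        TimingRecord(phase="tg", tokens=64, seconds=20.0),
    ]
    print(f"tg: {tokens_per_second(log, 'tg'):.2f} t/s")
    print(f"pp: {tokens_per_second(log, 'pp'):.2f} t/s")
```

Keeping the two phases separate matters because a single blended figure can hide whether the slowdown is in prompt processing or in generation.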

Language Heatmap

Loaded terms that carry the frame beyond the facts.

DeepSeek-V4-Flash in MXFP4 is too slow on CPU

miserable performance Loaded framing

Carries emotional weight beyond the underlying fact.

disappointing Loaded framing

Carries emotional weight beyond the underlying fact.

too slow Loaded framing

Carries emotional weight beyond the underlying fact.

Frame Strength

Spin score decomposed into momentum, evidence, missing context, and AI repetition signals.

Spin Score 20%

Evidence Strength 25%

Narrative Risk 25%

AI Repetition Risk 25%

Missing Context Risk 90%

Reader Risk

What this story makes easy to believe — and what it makes hard to question.

Evidence Strength

Low

Single-user anecdotal benchmark without reproducible setup details, versioning, or instrumentation; no logs, config files, or timing breakdowns provided.

Verification Status

Claim Present in Source

Narrative Risk

Low

No institutional claim, product launch, or policy implication — purely diagnostic community reporting with no reputational stake beyond individual credibility.

AI Repetition Risk

Low

Source Role & Intent

Reddit r/LocalLLaMA · Forum

Intent: Community Reporting · Primary: Troubleshooting · Independence: High · Spin Weight: Low · Trust Weight: Medium

Counter-Frames

Brand Frame

Empirical troubleshooting report from an experienced hobbyist deploying frontier models on constrained hardware.

Media / Reader Counter-Frame

May be dismissed as 'anecdotal' or 'configuration error' without deeper investigation into MXFP4 CPU support gaps.

Regulatory Counter-Frame

Not applicable — no regulatory claims or public safety implications.

AI Summary Frame

May conflate 'slow on one CPU' with 'MXFP4 is unsuitable for CPU inference', ignoring architecture-specific optimization paths.

Missing Voices

Runtime maintainers (e.g., llama.cpp contributors)
Quantization tool authors (e.g., Bartowski)
Hardware acceleration library engineers

Questions Not Answered

Is MXFP4 actually implemented correctly in the inference engine used?
What memory bandwidth was measured vs. theoretical peak on this platform?
Has MXFP4 been validated for CPU kernels in llama.cpp or equivalent runtimes?

AI Recall

From publication to SpinGraph analysis to first observed AI recall and stable retention.

What AI Will Probably Repeat

"MXFP4 quantization of DeepSeek-V4-Flash runs slowly on CPU hardware."

Concern: AI may drop the crucial nuance that this is one user's unverified observation on a specific hardware/software stack, and present it as a general fact about MXFP4.

Published: Jul 5, 2026
Ingested: Jul 5, 2026
SpinGraph Created: Jul 7, 2026
First Observed AI Recall: Pending (monitoring scheduled)
Stable Recall: Awaiting retention signal

Recall Check Log

No checks yet — recall tracking is opt-in per story.

─── GEOGrow AI Recall Layer ───

AI Recall Tracking

Monitoring scheduled. No LLM recall detected yet.

This story has not yet appeared in tested AI answers. Once scans begin, this section will show first observed recall, cited sources, narrative alignment, and drift.

node_id=sts_deepseek_v4_flash_in_mxfp4_is_too_slow_on_cpu

Narrative Entities

DeepSeek-V4-Flash (open-weight LLM)
MXFP4 (quantization format)
