Deepseek V4 Flash running on RTX 5090 MoE
Summary
Here are the results of optimizing it for my setup. The benchmark shows token generation (TG) going from 22.7 to 21.3 T/s and prompt processing (PP) going from 1105 to 927 T/s, with the prompt-processing test range running from 8192 tokens to 65536 tokens. It is set to MoE with no unified KV, no memory map, and n-cpu-moe 37.

My setup:
- X870 AORUS ELITE WIFI7
- AMD Ryzen 9 9900X3D (24) @ 4.40 GHz
- NVIDIA GeForce RTX 5090 [Discrete]
- DDR5 RAM: 18.80 GiB / 125.39 GiB (15%)
- OS: Bazzite (bazzite-dx-nvidia-gnome:testing)

This was possible using this f
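To put the quoted numbers in perspective, here is a minimal Python sketch that works out the relative throughput drop and a rough prompt-processing time. It assumes the first figure of each pair (22.7 TG, 1105 PP) belongs to the 8192-token test and the second (21.3 TG, 927 PP) to the 65536-token test, which the post implies but does not state outright. The function names are made up for illustration, and the time estimate assumes the whole prompt is processed at the quoted average rate.

```python
# Rough arithmetic on the benchmark figures quoted in the post.
# Assumption: the first value of each pair is the 8192-token run,
# the second is the 65536-token run.

def relative_drop(before: float, after: float) -> float:
    """Fractional decrease going from `before` to `after`."""
    return (before - after) / before


def prefill_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Estimated seconds to process a prompt at a constant PP rate."""
    return prompt_tokens / pp_tokens_per_s


if __name__ == "__main__":
    tg_drop = relative_drop(22.7, 21.3)   # token generation, T/s
    pp_drop = relative_drop(1105, 927)    # prompt processing, T/s
    print(f"TG drop: {tg_drop:.1%}")
    print(f"PP drop: {pp_drop:.1%}")
    print(f"8192-token prefill:  {prefill_seconds(8192, 1105):.1f} s")
    print(f"65536-token prefill: {prefill_seconds(65536, 927):.1f} s")
```

Under that assumption, generation speed barely moves as the prompt gets eight times longer, while prompt processing slows down more noticeably.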