Qwen3.6 27B on a 5090, 6.4k sample tok/s distribution after tuning MTP/cache settings
Summary
Spent a while tuning llama.cpp for Qwen3.6 27B on a 9800X3D / 64GB / 5090 box and wanted to share the real distribution instead of just a headline number, since averages hide a lot. Ran with q8 KV cache, 192k context, MTP draft=10, spec-draft-p-min=0.5, batch/ubatch 512. Logged 6,454 samples across a mixed agentic coding + debugging + doc session over roughly 20 hours. Peak bucket sits at 120-130 tok/s, mean 140.7 tok/s, median 134.9 tok/s, with a long tail up to 233 tok/s. Worth noting the hybrid attention/SWA cache
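For anyone wanting to report their own runs the same way, here is a minimal sketch of how a peak bucket, mean, and median could be pulled out of a list of logged tok/s samples. The `summarize` function, the 10 tok/s bucket width, and the `demo` values are all assumptions for illustration; the demo numbers are made up and are not the data from this post.

```python
from collections import Counter
from statistics import mean, median


def summarize(samples, bucket_width=10):
    """Summarize tok/s samples: count, mean, median, max, and the fullest bucket.

    Hypothetical helper, not part of llama.cpp. Buckets are half-open
    ranges [start, start + bucket_width).
    """
    if not samples:
        raise ValueError("no samples")
    # Map each sample to the lower edge of its bucket, e.g. 127.9 -> 120.
    buckets = Counter(int(s // bucket_width) * bucket_width for s in samples)
    # Fullest bucket wins; ties go to the lower bucket.
    peak_start, peak_count = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
    return {
        "n": len(samples),
        "mean": mean(samples),
        "median": median(samples),
        "max": max(samples),
        "peak_bucket": (peak_start, peak_start + bucket_width),
        "peak_count": peak_count,
        "buckets": dict(sorted(buckets.items())),
    }


# Made-up demo values, chosen only to show a right-skewed shape.
demo = [118.0, 122.5, 124.0, 127.9, 131.0, 135.5, 150.0, 180.0, 233.0]
stats = summarize(demo)
print(stats["peak_bucket"], stats["median"], round(stats["mean"], 1))
```

A mean sitting above the median, as in the numbers reported above (140.7 vs 134.9), is what a long right tail looks like: a minority of fast samples pulls the average up while the typical sample stays near the peak bucket.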