Qwen3.6 27B on a 5090, 6.4k sample tok/s distribution after tuning MTP/cache settings

Summary

Spent a while tuning llama.cpp for Qwen3.6 27B on a 9800X3D / 64GB / 5090 box and wanted to share the real distribution instead of just a headline number, since averages hide a lot. Ran with q8 KV cache, 192k context, MTP draft=10, spec-draft-p-min=0.5, batch/ubatch 512. Logged 6,454 samples across a mixed agentic coding + debugging + doc session over roughly 20 hours. Peak bucket sits at 120-130 tok/s, mean 140.7, median 134.9, with a long tail up to 233 tok/s. Worth noting the hybrid attention/SWA cache
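For anyone who wants to produce the same kind of summary from their own logs, here is a minimal sketch in Python. It assumes you already have one tok/s reading per generation in a list; the `summarize` helper, the 10 tok/s bucket width, and the demo values are made up for illustration and are not the post's data or tooling.

```python
from collections import Counter
from statistics import mean, median


def summarize(samples, bucket_width=10):
    """Summarize tok/s samples: mean, median, max, and the fullest bucket."""
    if not samples:
        raise ValueError("no samples to summarize")
    # Map each sample to the lower edge of its bucket, e.g. 125.5 -> 120.
    buckets = Counter(int(s // bucket_width) * bucket_width for s in samples)
    # Pick the bucket with the most samples; ties go to the lower bucket.
    peak_start, peak_count = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
    return {
        "n": len(samples),
        "mean": mean(samples),
        "median": median(samples),
        "max": max(samples),
        "peak_bucket": (peak_start, peak_start + bucket_width),
        "peak_count": peak_count,
    }


# Hypothetical readings, chosen only to show a right-skewed shape.
demo = [122.0, 125.5, 128.0, 131.0, 150.0, 233.0]
stats = summarize(demo)
print(stats)
```

With a long right tail, a few fast samples pull the mean above the median, and the median sits at or above the peak bucket. That is the same ordering as in the numbers above (peak 120-130, median 134.9, mean 140.7), which is why the bucket view says more than a single average.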
