I merged fixes for quantized KV cache into my DeepSeek V4 branch
Check it out: https://github.com/fairydreaming/llama.cpp/tree/dsv4

They are PRs #25247, #25303 (mine) and #25202 (from am17an), but I omitted some padding changes from the last one that I think are not necessary. So if it crashes for you, let me know.

Also some perplexity values:

f16:

$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1
0.00.474.417 W llama_model_loader: tensor overrides to CPU ar