Ran a classic (medieval Europe) fantasy RP/agentic benchmark across 8 local models: Qwen3.6-27B held up better than its size suggests
Summary
Threw together a benchmark suite (quest completion, scene endings, item/time tracking, character detection, storytelling, drafting) and ran it across 8 models people talk about a lot on here. Judged with an external LLM grader; N varies per category (shown on the chart). Overall pass rates: gemma-4-31B on top at 87%, Qwen3.6-27B close behind at 82%, then gemma-4-12B at 80%, followed by a pretty steep drop-off to the smaller/looser models in the 55-70% range. But oh well, that's expected. The interesti
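Since N varies per category, an "overall pass rate" can mean two different things: pooled over every graded item, or averaged across categories. Here's a rough sketch of both, not the actual scoring script from the post. The model name, category names, and verdicts below are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical grader verdicts: (model, category, passed).
# Made-up data for illustration, NOT the results from the post.
verdicts = [
    ("model-a", "quest_completion", True),
    ("model-a", "quest_completion", True),
    ("model-a", "quest_completion", False),
    ("model-a", "quest_completion", True),
    ("model-a", "item_tracking", True),
    ("model-a", "item_tracking", False),
]


def pass_rates(rows):
    """Return {(model, category): (passes, n, rate)} from grader verdicts."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passes, total]
    for model, category, passed in rows:
        counts[(model, category)][0] += int(passed)
        counts[(model, category)][1] += 1
    return {key: (p, n, p / n) for key, (p, n) in counts.items()}


def overall(rows, model):
    """Return (pooled, macro) overall pass rates for one model.

    pooled: total passes / total items, so big categories weigh more.
    macro:  plain mean of per-category rates, so every category weighs the same.
    """
    per_cat = [v for (m, _), v in pass_rates(rows).items() if m == model]
    pooled = sum(p for p, _, _ in per_cat) / sum(n for _, n, _ in per_cat)
    macro = sum(rate for _, _, rate in per_cat) / len(per_cat)
    return pooled, macro


pooled, macro = overall(verdicts, "model-a")
print(f"pooled={pooled:.3f} macro={macro:.3f}")
```

With unequal N the two numbers drift apart, so it's worth saying which one a headline percentage refers to.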