I benchmarked 13 models at 65K-128K context to find out what actually matters for agentic workloads

Summary

Prefill dominates everything, and KV head count beats parameter count. I've been running local LLMs for agentic workflows (tool use, coding agents, RAG) and kept seeing people obsess over tg128 (token generation speed) as the headline performance metric. So I ran a structured long-context benchmark to figure out what actually matters when your context window is full. The answer surprised me.
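For a rough feel of why these two factors could matter at long context, here is a back-of-the-envelope sketch. It uses the standard KV cache size formula for grouped-query attention and a simple prefill-plus-generation time split. Every number in it (layer count, head counts, tokens per second) is a made-up assumption for illustration, not a result from the benchmark, and the function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def request_seconds(prompt_tokens, output_tokens, prefill_tps, generation_tps):
    """Split one request into prefill time and generation time, in seconds."""
    prefill = prompt_tokens / prefill_tps
    generation = output_tokens / generation_tps
    return prefill, generation


# Hypothetical model shapes: same depth and head size, different KV head counts.
CONTEXT = 65536
few_kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=CONTEXT)
many_kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=CONTEXT)

# Hypothetical speeds: 1000 tok/s prefill, 30 tok/s generation.
prefill_s, generation_s = request_seconds(
    prompt_tokens=CONTEXT, output_tokens=256, prefill_tps=1000.0, generation_tps=30.0
)

if __name__ == "__main__":
    gib = 2**30
    print(f"KV cache,  8 KV heads: {few_kv / gib:.1f} GiB")
    print(f"KV cache, 32 KV heads: {many_kv / gib:.1f} GiB")
    print(f"prefill: {prefill_s:.1f} s, generation: {generation_s:.1f} s")
```

Under these assumed numbers the cache grows linearly with KV head count, and a full 65K prompt spends far longer in prefill than in generating a short tool-call-sized reply. A tg128 figure measures only the second term.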


