Are dSpark, dflash, MTP, QAT, and similar tech going to increase inference speed enough that model spillover to disk becomes more tolerable?
We’re seeing a lot of inference performance boosts lately from things like dSpark, dflash, MTP, etc. Spillover to disk has always been the inflection point where a model goes from a barely acceptable 4 to 5 tokens per second to a completely unusable 0.5 tokens per second. Has this changed now? Do these new speed boosters push inference speed to the point where model spillover to disk isn’t as bad of a performance
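To put rough numbers on the question, here is a back-of-the-envelope sketch. It assumes decoding is purely bandwidth-bound, that every weight is read once per forward pass, and that a speculative or multi-token method (such as MTP) yields some average number of accepted tokens per pass. The function name, its parameters, and every figure in the usage note are hypothetical, not measurements of any of the techniques named above.

```python
def tokens_per_second(model_gb, ram_gb, ram_bw_gbps, disk_bw_gbps,
                      tokens_per_pass=1.0):
    """Estimate decode speed for a model that may spill from RAM to disk.

    Assumptions (illustrative only):
      - decoding is limited by how fast weights can be read, not by compute
      - each forward pass reads every weight exactly once
      - tokens_per_pass is the average number of tokens accepted per pass
        (1.0 means plain autoregressive decoding)
    """
    in_ram_gb = min(model_gb, ram_gb)          # portion served from memory
    on_disk_gb = max(model_gb - ram_gb, 0.0)   # portion that spilled to disk
    # Time for one forward pass is the sum of the read times of both tiers.
    seconds_per_pass = in_ram_gb / ram_bw_gbps + on_disk_gb / disk_bw_gbps
    return tokens_per_pass / seconds_per_pass


if __name__ == "__main__":
    # Hypothetical 12 GB model, 60 GB/s memory, 3 GB/s SSD.
    print("fits in RAM:      ", tokens_per_second(12, 16, 60, 3))
    print("4 GB spilled:     ", tokens_per_second(12, 8, 60, 3))
    print("spilled, 2 tok/pass:", tokens_per_second(12, 8, 60, 3, 2.0))
```

Under these assumptions the disk term dominates as soon as anything spills, so a multiplier on tokens per pass scales the spilled figure but does not remove the cliff. Techniques that shrink the weights (for example lower-bit quantization made viable by QAT) act on `model_gb` instead, which can reduce or avoid the spill altogether.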