Concurrency plus nvfp4 on Blackwell
Summary
Bulk image captioning reaches roughly 2000 tokens/s of aggregate generation throughput, as parsed from the vLLM log file. The figures were logged while a client ran 30 concurrent streams. Each stream sends one request with an image and a prompt, then a second request on the same stream (so the first Q:A would be cached). Typical log line: Engine 000: Avg prompt throughput: 1301.0 tokens/s, Avg generation throughput: 1924.0 tokens/s, Running: 30 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.8%, Prefix cache hit rate: 0.0%, MM ca
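For anyone wanting to reproduce the aggregate number, here is a minimal sketch of how such log lines could be parsed. It assumes the field layout matches the sample line quoted above (other vLLM versions may format it differently), and the names `parse_line` and `summarize` are made up for illustration, not part of vLLM.

```python
import re
from statistics import mean

# Pattern built from the sample log line in the post; the layout is an
# assumption and may differ between vLLM versions.
LINE_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen_tps>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, "
    r"Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv_pct>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<prefix_hit_pct>[\d.]+)%"
)


def parse_line(line):
    """Return the metrics from one log line as a dict, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "prompt_tps": float(d["prompt_tps"]),
        "gen_tps": float(d["gen_tps"]),
        "running": int(d["running"]),
        "waiting": int(d["waiting"]),
        "kv_pct": float(d["kv_pct"]),
        "prefix_hit_pct": float(d["prefix_hit_pct"]),
    }


def summarize(lines):
    """Aggregate generation throughput over all matching lines."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    if not rows:
        return None
    return {
        "samples": len(rows),
        "mean_gen_tps": mean(r["gen_tps"] for r in rows),
        "peak_gen_tps": max(r["gen_tps"] for r in rows),
        # Rough per-stream rate: generation throughput divided by the
        # number of running requests, skipping idle samples.
        "mean_gen_tps_per_req": mean(
            r["gen_tps"] / r["running"] for r in rows if r["running"] > 0
        ) if any(r["running"] > 0 for r in rows) else 0.0,
    }


sample = (
    "Engine 000: Avg prompt throughput: 1301.0 tokens/s, "
    "Avg generation throughput: 1924.0 tokens/s, Running: 30 reqs, "
    "Waiting: 0 reqs, GPU KV cache usage: 4.8%, Prefix cache hit rate: 0.0%"
)

if __name__ == "__main__":
    print(parse_line(sample))
    print(summarize([sample]))
```

To run it over a whole log, pass the open file handle to `summarize` (for example `summarize(open("vllm.log"))`, where the file name is a placeholder); lines that do not match the pattern are skipped.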