Quantization, Determinism, and Thinking Tokens: Running Open-Source LLMs in Production

I run Qwen 3.5 and 3.6 (35B MoE, 3B active parameters) in production across three nodes — two DGX Spark (FP8, vLLM) and one RTX 5090 (Q4, llama.cpp). After 100+ benchmark scenarios and thousands of inference calls, three problems dominated my debugging time:

1. Quantization loss is not uniform — MoE models at Q4 lose 16% on CJK tasks.
2. vLLM is non-deterministic under speculative decoding — identical prompts produce different outputs.
3. Thinking tokens consume 60–90% of the budget on tasks where they provide zero benefit.

None of these show up in standard benchmarks. All of them break production workflows. ...
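The non-determinism claim is easy to verify for your own stack: fire the same temperature-0 prompt at the server several times and count distinct completions. A minimal sketch of that check (the sample outputs below are hypothetical, not real vLLM responses):

```python
import hashlib

def distinct_outputs(completions):
    """Count distinct completions among repeated runs of the same
    prompt; a fully deterministic server should always return 1."""
    return len({hashlib.sha256(c.encode()).hexdigest() for c in completions})

# Hypothetical outputs from three identical temperature-0 requests:
runs = [
    "The answer is 42.",
    "The answer is 42.",
    "The answer is forty-two.",
]
print(distinct_outputs(runs))  # → 2: the server drifted between runs
```

Hashing rather than comparing raw strings keeps the check cheap even for long completions.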

April 25, 2026 · 6 min · ArkNill

What tok/s Doesn't Tell You: Measuring LLM Speed That Matters

I run Qwen 3.6 35B on three machines. The RTX 5090 generates at 204 tok/s. The DGX Spark pair generates at 65 tok/s. By every benchmark leaderboard metric, the 5090 is 3x faster. But for multi-step coding tasks with thinking enabled, the DGX pair completes the job faster. And for single-turn questions, the 5090 delivers the answer in under 2 seconds while the DGX takes 8–12 seconds. tok/s alone told me nothing useful about actual user experience. Here’s what I learned building benchmarks for all three nodes. ...
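The gap between raw decode speed and perceived speed can be captured with a one-line latency model: wall time is time-to-first-token plus every generated token (thinking included) divided by decode throughput. The numbers below are made up for illustration, not measurements from my nodes:

```python
def answer_time(ttft_s, gen_tps, visible_tokens, thinking_tokens=0):
    """End-to-end wall time for one response: time to first token,
    plus decode time for all generated tokens, thinking included."""
    return ttft_s + (visible_tokens + thinking_tokens) / gen_tps

# Hypothetical scenario: the fast node burns far more thinking tokens.
fast_node = answer_time(ttft_s=0.3, gen_tps=204, visible_tokens=300,
                        thinking_tokens=2400)  # ≈ 13.5 s
slow_node = answer_time(ttft_s=0.8, gen_tps=65, visible_tokens=300,
                        thinking_tokens=400)   # ≈ 11.6 s
```

Under these assumptions the 204 tok/s node finishes later than the 65 tok/s node, which is why tok/s alone is a poor proxy for user-visible latency.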

April 24, 2026 · 5 min · ArkNill