I run Qwen 3.5 and 3.6 (35B MoE, 3B active parameters) in production across three nodes — two DGX Spark (FP8, vLLM) and one RTX 5090 (Q4, llama.cpp).
After 100+ benchmark scenarios and thousands of inference calls,
three problems dominated my debugging time:
- Quantization loss is not uniform — MoE models at Q4 lose 16% on CJK tasks
- vLLM is non-deterministic under speculative decoding — identical prompts produce different outputs
- Thinking tokens consume 60–90% of the budget on tasks where they provide zero benefit
None of these show up in standard benchmarks.
All of them break production workflows.
1. Quantization: FP8 Is the Production Floor
The common wisdom — “30B+ models lose less than 1% at Q4” — is only true for Dense models on English benchmarks.
MoE Models Are Different
A 35B MoE with 3B active parameters is effectively a small model from a quantization perspective.
The “35B total” is misleading: you’re quantizing expert weights that route dynamically,
and quantization error in the router logits causes discrete misrouting,
not the smooth degradation you get with Dense models.
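You can see the discreteness in a toy example. A minimal sketch, assuming a hypothetical round-to-nearest 4-bit quantizer and random logits for an 8-expert router (nothing here is Qwen's actual routing kernel):

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int = 4) -> np.ndarray:
    # Round-to-nearest onto a symmetric integer grid (illustrative, not Q4_K).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

flips = 0
for seed in range(1000):
    logits = np.random.default_rng(seed).normal(size=8)  # hypothetical router logits
    q_logits = fake_quantize(logits)
    # Compare top-2 expert selection before and after quantization:
    if set(np.argsort(logits)[-2:]) != set(np.argsort(q_logits)[-2:]):
        flips += 1  # a different set of experts would process this token

print(f"top-2 expert selection changed in {flips}/1000 random draws")
# A dense layer under the same rounding error just gets slightly noisier
# activations; here the computation path itself changes.
```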
From 25+ papers (Dettmers 2022, Ouyang 2024, APEX 2025) and my own measurements:
| Architecture | Q4 English Loss | Q4 CJK Loss | Mechanism |
|---|---|---|---|
| Dense 30B+ | <1% | 4–5% | Smooth degradation |
| MoE 35B-A3B | 1–14% | 10–17% | Router misrouting (discrete) |
The CJK Penalty
Marchisio et al. (EMNLP 2024) found that automatic benchmarks underestimate CJK loss by 10–15 percentage points
compared to human evaluation:
| Language | Auto Benchmark | Human Eval | Gap |
|---|---|---|---|
| Japanese | -1.7% | -16.0% | 14.3pp |
| Korean | minor | -4.6% | — |
| Chinese (MGSM) | -17.3% | — | — |
My measurement on Qwen 3.5 35B-A3B: FP8 vs Q4 on Korean factual tasks showed -16% on human-evaluated quality
while automatic metrics showed only -1.2%.
The Shared Expert Problem
MoE architectures have “shared experts” (always active, regardless of routing).
These shared weights have kurtosis 13.10, nearly 4x that of routed experts (3.41).
High kurtosis means outlier values that get clipped by low-bit quantization.
APEX (2025) showed that per-expert quantization — Q6_K for routed experts, Q8_0 for shared experts — matches F16 quality.
Uniform Q4 across all experts loses 1–14% depending on the task.
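If you want to check this on your own checkpoint, per-tensor kurtosis is cheap to compute. A minimal sketch; the tensor-name pattern, the 8.0 cutoff, and the bit-width mapping are my illustrative assumptions, not APEX's published recipe. Kurtosis here is the Pearson convention, where a normal distribution scores 3.0, matching the 13.10 vs 3.41 figures above:

```python
import numpy as np

def pearson_kurtosis(w: np.ndarray) -> float:
    # Fourth standardized moment; a normal distribution scores 3.0.
    w = w.ravel().astype(np.float64)
    z = (w - w.mean()) / w.std()
    return float(np.mean(z ** 4))

def pick_quant_type(name: str, w: np.ndarray) -> str:
    # Heavy-tailed tensors (shared experts) keep more bits.
    # The "shared_expert" pattern and 8.0 cutoff are assumptions.
    if "shared_expert" in name or pearson_kurtosis(w) > 8.0:
        return "Q8_0"
    return "Q6_K"

# Hypothetical usage over a loaded state dict:
# for name, w in state_dict.items():
#     print(name, pick_quant_type(name, w.float().numpy()))
```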
What I Run
- DGX (production): FP8, quality baseline, 65 tok/s
- 5090 (interactive): Q4, speed priority, 204 tok/s, accepting the CJK penalty
FP8 is the production floor.
Below that, you’re trading quality for speed in ways that don’t show up until users notice degraded Korean/Japanese output.
2. vLLM Non-Determinism: The MTP Problem
During my Local-Trinity benchmark (55 tasks, n=3 per node),
I discovered that vLLM produces different outputs from identical prompts:
| Node | Backend | Stability (n=3) |
|---|---|---|
| Desktop 5090 | llama-server | STABLE 12/12 |
| DGX2 | vLLM + MTP | UNSTABLE 4/12 |
Same Qwen 3.6 model, same prompts.
The 5090 produced identical outputs across all 3 runs for every task.
The DGX varied wildly — one task scored 100, 100, 9 across three runs.
Root Cause: Triple Non-Determinism
vLLM has three sources of non-determinism that compound:
- CUDA kernel non-determinism — FP8 matrix multiply accumulation order varies with thread scheduling
- Batch scheduling — continuous batching means different request interleaving on each run
- MTP (Multi-Token Prediction) speculation — speculated tokens accepted/rejected differently based on timing
Thinking Machines Lab confirmed this at scale:
running Qwen3-235B through vLLM 1,000 times produced 80 distinct output variations from the same prompt.
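This is easy to verify yourself against any OpenAI-compatible endpoint (vLLM and llama-server expose the same API). A minimal sketch; the URL, model name, and prompt are placeholders:

```python
import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint

def complete(prompt: str) -> str:
    r = requests.post(URL, json={
        "model": "qwen",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding: nominally deterministic
        "max_tokens": 512,
    }, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

outputs = {complete("Explain the CAP theorem in one paragraph.") for _ in range(10)}
print(f"{len(outputs)} distinct outputs from 10 identical greedy requests")
# In my runs llama-server prints 1 here; vLLM with MTP often does not.
```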
MTP Is the Primary Cause
I tested with MTP disabled on DGX2:
- MTP ON: UNSTABLE 4/12, HMT-03 scored 100→39 on one run
- MTP OFF: STABLE, HMT-03 passed 3/3 (std 0.15s)
But MTP OFF costs 14% in throughput (61 → 52.7 tok/s),
and in my runs it caused timeouts on thinking-heavy tasks.
Operational Implication
- Production: MTP ON + n=3 mean scoring (accept the variance)
- Single-shot trust: the desktop 5090 (llama-server) only
If you need deterministic outputs (test suites, reproducible research), don’t use vLLM with MTP.
Use llama.cpp or disable MTP and accept the speed penalty.
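The n=3 mean policy is a few lines in a harness. A sketch, assuming you supply the `run` and `score` callables:

```python
from statistics import mean, stdev
from typing import Callable

def score_n(run: Callable[[], str], score: Callable[[str], float],
            n: int = 3) -> tuple[float, float]:
    # Run the identical task n times; report mean score and spread.
    # A large spread flags tasks where MTP non-determinism bites.
    scores = [score(run()) for _ in range(n)]
    return mean(scores), (stdev(scores) if n > 1 else 0.0)

# Usage (hypothetical task and grader):
# m, s = score_n(lambda: complete(task_prompt), grade_output)
# if s > 20.0: rerun the task, or pin it to llama-server
```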
3. Thinking Tokens: When “Smarter” Makes Things Worse
Qwen 3.6 supports per-request thinking control.
I ran every task type with thinking ON and OFF:
| Task Type | Thinking ON | Thinking OFF | Speed Difference |
|---|---|---|---|
| Factual queries | 9/9 | 9/9 | OFF 2.5x faster |
| Code generation | 5/5 | 5/5 | OFF 4.2x faster |
| Code debugging | 3/3 | 3/3 | OFF 6.0x faster |
| Multi-turn coding | 1/2 | 2/2 | OFF 7.3x faster |
| Complex math | 3/3 | 0/3 | ON required |
Thinking OFF is the correct default for coding.
It produces equal or better results at 4–7x the speed.
Thinking ON only helps for complex mathematical reasoning — and even there, it has a catastrophic failure mode.
The Thinking Runaway Problem
With thinking enabled on complex coding tasks,
Qwen 3.6 sometimes enters a thinking loop — spending the entire token budget on internal reasoning,
leaving no capacity for actual code output:
- bugfix task: TTFT 141s, completion_tokens=1, empty output (3/3 repro)
- refactor task: TTFT 130s, completion_tokens=291, incomplete code
This is a known issue (QwenLM/Qwen3.6#88): 17.4% of hard coding tasks trigger thinking token exhaustion.
Of those, 84% show repetitive loops within the <think> block.
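In a harness, the practical mitigation is a retry fallback: if a thinking-enabled request returns (near-)empty visible content, rerun it once with thinking disabled. A sketch; the endpoint, model name, and 20-character threshold are placeholders, and chat_template_kwargs is the same per-request field shown in the Per-Request Control section below:

```python
import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint

def complete(prompt: str, thinking: bool) -> str:
    r = requests.post(URL, json={
        "model": "qwen",  # placeholder
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32768,  # bounds the damage if thinking runs away
        "chat_template_kwargs": {"enable_thinking": thinking},
    }, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def complete_with_fallback(prompt: str) -> str:
    out = complete(prompt, thinking=True)
    if len(out.strip()) < 20:  # illustrative "runaway" heuristic
        out = complete(prompt, thinking=False)  # retry without thinking
    return out
```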
The vLLM Bug That Makes It Worse
vLLM issue #39573: when MTP is active, thinking_token_budget is silently ignored.
You cannot cap thinking tokens while using speculative decoding.
The parameter is accepted without error and does nothing.
This means:
- MTP ON + thinking ON = no budget control, runaway possible
- MTP OFF + thinking ON + budget = works, but 14% slower
- MTP ON + thinking OFF = safe and fastest for coding (a guard sketch follows this list)
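A small guard keeps a harness out of the unsafe cell of that matrix. Sketch only; mtp_enabled is state you track in your own deployment config, and the task dict shape is hypothetical:

```python
def build_request_extras(task: dict, mtp_enabled: bool) -> dict:
    # Encode the safe combinations above. Never rely on thinking_token_budget
    # under MTP: vLLM #39573 accepts it and silently does nothing.
    thinking = bool(task.get("needs_thinking", False))
    extras = {"chat_template_kwargs": {"enable_thinking": thinking}}
    if thinking and mtp_enabled:
        extras["max_tokens"] = 32768  # the only remaining cap on runaway thinking
    return extras
```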
Per-Request Control
The solution is dynamic per-request thinking control:
```jsonc
// Default (all coding tasks):
{"chat_template_kwargs": {"enable_thinking": false}}

// Complex math only: omit extra_body → server default (thinking ON)
```
No server restart needed.
Thinking is toggled per API call.
My harness classifies task complexity upfront and routes accordingly.
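The classifier doesn't need to be clever; the point is the routing shape. A toy sketch with illustrative keyword rules (the real harness can use whatever complexity signal it trusts):

```python
def needs_thinking(prompt: str) -> bool:
    # From the table above, thinking only pays off on complex math.
    # These keywords are illustrative, not my production rules.
    math_markers = ("prove", "integral", "theorem", "olympiad")
    return any(m in prompt.lower() for m in math_markers)

def request_extras(prompt: str) -> dict:
    return {"chat_template_kwargs": {"enable_thinking": needs_thinking(prompt)}}
```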
Summary: Production Configuration
After v1–v7 of each harness (100+ scenarios per node, 290+ unit tests),
this is the configuration that survived testing:
| Decision | Choice | Evidence |
|---|---|---|
| Quantization floor | FP8 | Q4 MoE: CJK -16%, router misrouting |
| Speed tier | Q4 for interactive, FP8 for production | Accept CJK penalty for 3x speed |
| vLLM determinism | MTP ON + n=3 mean | Single-shot determinism only with llama-server |
| Thinking default | OFF | ON only for complex math (3/3 vs 0/3) |
| Thinking budget | Not available with MTP | vLLM #39573 open |
| Max tokens (thinking ON) | 32768 | Prevents runaway (80% → 0% failure) |
None of these decisions came from reading papers or following leaderboards.
All of them came from running the same tasks repeatedly until the failure modes revealed themselves.
References
- Dettmers & Zettlemoyer (2022): LLM.int8() — 35,000 experiments on quantization scaling
- Marchisio et al. (EMNLP 2024): CJK quality loss under quantization
- APEX (2025): Per-expert quantization for MoE models
- Thinking Machines Lab: vLLM non-determinism at scale (Qwen3-235B, 1000 runs)
- vLLM #39573: MTP + thinking_budget incompatibility
- QwenLM Qwen3.6#88: Thinking token runaway on LiveCodeBench
Measured on DGX Spark (128GB, vLLM 0.19.1) and RTX 5090 (32GB, llama.cpp). Your configuration may differ.