[{"content":"I run Claude Code (Opus 4.6) as my primary coding tool and pay $200/month for it.\nI also run Qwen 3.5/3.6 35B locally on two DGX Sparks and an RTX 5090.\nNatural question: how does a local 35B model compare to the commercial tool I\u0026rsquo;m paying for?\nTo find out, I built three separate benchmark harnesses over 10 days.\nThe journey taught me more about evaluation methodology than about the models themselves — because the harness had more bugs than the models did.\nThe Three Harnesses Harness Nodes Focus Tests cc-crosscheck DGX1 (3.5) CC vs Codex vs Local — same task, three tools 254 dgx-duo DGX1 + DGX2 Builder (3.5) + Reviewer (3.6) pair mode 254 Local-Trinity All 3 nodes Unified cross-node comparison 290 All three are zero-dependency Python (stdlib + urllib only).\nNo frameworks, no pip installs on the inference nodes.\ncc-crosscheck: The 3-Peer Protocol The core idea: run the same coding task on three \u0026ldquo;peers\u0026rdquo; — Claude Code (Opus 4.6),\nCodex (GPT-5.3), and a local 35B model — then compare outputs.\nTask: \u0026#34;Implement a thread-safe LRU cache with TTL expiration\u0026#34; │ ├── Claude Code (Opus 4.6, API) ──→ solution_cc.py ├── Codex (GPT-5.3, API) ──→ solution_codex.py └── Local (Qwen 3.5, DGX1) ──→ solution_local.py │ ▼ Validator: AST analysis + test execution + consensus scoring The validator doesn\u0026rsquo;t just check \u0026ldquo;does it run\u0026rdquo; — it performs AST consensus analysis,\ncomparing structural patterns across all three solutions.\nIf two out of three agree on an approach and one diverges, that divergence gets flagged.\nWhat I Found (v0.9.6, 8 scenarios) Scenario CC (Opus 4.6) Local (Qwen 3.5) Notes LRU Cache PASS PASS Structural consensus Plugin Registry PASS PASS — Code Review PASS PASS — Race Condition Fix PASS PASS — Dijkstra PASS PASS — Data Pipeline PASS PASS — Config Merger PASS PASS (98.7) Minor: nested validation edge case Cache Refactor PASS PASS — Result: 24/24 ALL PASSED (n=3), mean score 99.6.\nThe local 35B matched Claude Code quality on all 8 coding scenarios.\nThe gap isn\u0026rsquo;t in single-task quality — it\u0026rsquo;s in speed (CC responds faster due to optimized infrastructure) and in handling ambiguous,\nmulti-step tasks where the commercial tool\u0026rsquo;s longer context and tool-use integration matter.\ndgx-duo: When Two Models Beat One Instead of comparing against commercial tools,\ndgx-duo asks: can two local models working together outperform one?\nThe protocol:\nDGX1 (Qwen 3.5, thinking OFF) — Builder. Generates code fast. DGX2 (Qwen 3.6, thinking ON) — Reviewer. Reads the code, finds bugs, suggests fixes. If the reviewer finds issues → builder gets a second pass with feedback. 
Results (v0.9.6, 14 E2E scenarios) Mode Scenarios Pass Rate Mean Score Single (DGX1 alone) 8 75% 89.2 Pair (builder + reviewer) 8 100% 99.6 Multi-step 6 83% 96.1 The reviewer catches bugs that a single-pass generator misses.\nThe most common catches:\nMissing error handling (pytest.raises coverage) Edge cases in concurrent code (race conditions) Incomplete interface implementations The Rubber Stamp Problem (v0.9.3→v0.9.4) Early versions had a critical flaw: the reviewer always approved.\nIt would say \u0026ldquo;looks good\u0026rdquo; even when the code had obvious bugs.\nThis isn\u0026rsquo;t a model limitation — it\u0026rsquo;s a prompt engineering failure.\nThe fix (v0.9.4): changed the reviewer from \u0026ldquo;verify this code works\u0026rdquo; to \u0026ldquo;review this code — you are empowered to fail it.\u0026rdquo;\nGiving the model explicit authority to reject transformed its behavior.\nRubber stamp rate dropped from \u0026gt;90% to \u0026lt;10%.\nLocal-Trinity: The Unified Benchmark Local-Trinity combines all three nodes into a single benchmark suite.\nSame task, all three nodes, n=3 for stability:\nArchitecture Local-Trinity (ZBook orchestrator) ├── 17 source modules, 5,850 LOC ├── 290 unit tests ├── 19 suites / 55 tasks │ ├── Node: DGX1 (Qwen 3.5 FP8, 65 tok/s) ├── Node: DGX2 (Qwen 3.6 FP8, 65 tok/s) └── Node: Desktop5090 (Qwen 3.6 Q4, 204 tok/s) Task Categories Category Suites Tasks Examples Factual (KO/EN) 5 16 Korean trivia, English facts, ambiguous questions Math 2 7 Arithmetic, modular, combinatorics Coding 6 14 Single-file, debug, multi-turn, algorithms Hard 4 12 3-turn state machines, B-Trees, regex engines Agent 2 6 Speed-focused, algorithm-heavy The v1→v7 Journey This is where the real learning happened:\nVersion ALL_FAIL What Broke Root Cause v1 7 Models \u0026ldquo;failing\u0026rdquo; tasks they could clearly solve Harness bugs: wrong assertions, broken scoring v2 1 Still one persistent failure Difficulty too high for Q4 v3 2 Regression from \u0026ldquo;improvements\u0026rdquo; AST scoring introduced new bugs v4 2 Same two won\u0026rsquo;t go away try_pass counting was wrong v5 0 Finally all passing atexit counting, prompt rewording v6 0 Stable — added features Performance timing, thinking A/B v7 0 Stable — final polish assert_total denominator fix The harness had more bugs than the models.\nEvery time I thought a model was failing, the actual problem was:\nAssert matching that rejected valid alternative formats Scoring that penalized correct-but-different approaches Timeouts too short for thinking-heavy tasks Prompts that were ambiguous to the model but clear to me v5 Benchmark Results (12 Hard Tasks, n=3) Task 5090 (Q4) DGX1 (3.5 FP8) DGX2 (3.6 FP8) Verdict EventEmitter 98.8 97.9 97.7 UNANIMOUS StateMachine 97.1 100.0 97.1 UNANIMOUS Mini ORM 100.0 70.6 100.0 MAJORITY TaskScheduler 100.0 76.5 100.0 MAJORITY ExprParser 100.0 99.5 99.9 UNANIMOUS KV Store 98.0 97.6 98.0 UNANIMOUS System Design 100.0 97.7 97.4 UNANIMOUS Combinatorics 100.0 100.0 99.9 UNANIMOUS Code Review 80.0 99.2 79.1 MAJORITY Regex Engine 99.6 97.3 98.3 UNANIMOUS JSON Parser 98.5 97.9 98.1 UNANIMOUS B-Tree 97.8 39.0 97.7 MAJORITY UNANIMOUS 8 / MAJORITY 4 / ALL_FAIL 0\nThe MAJORITY cases reveal real model differences:\nDGX1 (Qwen 3.5) struggles with multi-turn tasks requiring state tracking B-Tree: DGX1 consistently fails on node splitting (structural limitation) Code Review: 5090 and DGX2 both miss pickle security check (non-deterministic) What I Learned 1. 
Benchmark Quality \u0026gt; Model Quality The models were fine from v1.\nMy evaluation framework was broken.\nIf your benchmark shows a model \u0026ldquo;failing\u0026rdquo; a task it should clearly handle, the bug is in your benchmark.\n2. Scores ≠ Quality A score of 100.0 means \u0026ldquo;passed all structural checks\u0026rdquo; — not \u0026ldquo;produced good code.\u0026rdquo;\nI found cases where code scored 100.0 but had subtle bugs (nested validation edge cases) that the validator couldn\u0026rsquo;t catch.\nThe validator measures what it measures, not what you think it measures.\n3. n=3 Is Mandatory With n=1, you can\u0026rsquo;t distinguish model limitation from inference non-determinism.\nvLLM\u0026rsquo;s MTP causes one run in four to produce wildly different output.\nn=3 with mean scoring separates signal from noise.\n4. The 35B Sweet Spot For coding tasks with clear specifications,\nQwen 35B MoE (3B active) at FP8 matches commercial tools on quality.\nThe gap appears in:\nAmbiguous multi-step tasks (commercial tools have better instruction following) Long-context retrieval (128GB vs 1M tokens cloud-side) Tool use and file system interaction (Claude Code\u0026rsquo;s integration layer) For everything else — writing functions, debugging, algorithms, refactoring —\nthe local model works.\nCode All three harnesses are zero-dependency Python:\ncc-crosscheck: ~/GitHub/cc-crosscheck/ — 254 tests, dual-35B architecture dgx-duo: ~/GitHub/dgx-duo/ — 9 modules, 254 tests, pair mode Local-Trinity: ~/GitHub/local-trinity/ — 17 modules, 290 tests, 3-node unified Public release planned after documentation pass.\nPersonal benchmarks on personal hardware.\nModels: Qwen 3.5/3.6 35B-A3B (FP8 and Q4).\nCommercial comparison: Claude Code with Opus 4.6 (v2.1.109).\nNot affiliated with Anthropic, OpenAI, or Alibaba.\n","permalink":"https://arknill.github.io/blog/testing-claude-code-against-local-35b/","summary":"\u003cp\u003eI run Claude Code (Opus 4.6) as my primary coding tool and pay $200/month for it.\u003cbr\u003e\nI also run Qwen 3.5/3.6 35B locally on two DGX Sparks and an RTX 5090.\u003cbr\u003e\nNatural question: \u003cstrong\u003ehow does a local 35B model compare to the commercial tool I\u0026rsquo;m paying for?\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo find out, I built three separate benchmark harnesses over 10 days.\u003cbr\u003e\nThe journey taught me more about evaluation methodology than about the models themselves — because \u003cstrong\u003ethe harness had more bugs than the models did\u003c/strong\u003e.\u003c/p\u003e","title":"Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness"},{"content":"I run a 3-node local LLM inference cluster at home.\nTwo NVIDIA DGX Sparks (128GB unified memory each) and one RTX 5090 desktop (32GB VRAM).\nAll three serve Qwen 3.5/3.6 35B MoE models 24/7 over my local network.\nThis isn\u0026rsquo;t a weekend experiment — it\u0026rsquo;s my daily development infrastructure.\nEvery code review, every research query, every benchmark runs against these nodes.\nHere\u0026rsquo;s what the setup looks like, what it costs, and what I learned that no spec sheet tells you.\nThe Hardware Node CPU GPU/Memory Storage Role DGX Spark #1 NVIDIA Grace (ARM, 20C) GB10, 128GB LPDDR5X UMA 932GB NVMe Coding — Qwen 3.5 FP8, vLLM DGX Spark #2 NVIDIA Grace (ARM, 20C) GB10, 128GB LPDDR5X UMA 932GB NVMe Agentic — Qwen 3.6 FP8, vLLM Desktop 5090 (Windows host) RTX 5090, 32GB GDDR7 — Interactive — Qwen 3.6 MoE Q4, llama.cpp ZBook Ultra Ryzen AI 
MAX+ PRO 395 Radeon 8060S (integrated) 2TB NVMe Orchestrator — SSH hub, Claude Code, dev env All nodes on the same LAN (192.168.0.x), gigabit wired. The ZBook orchestrates everything via SSH.\nWhy Three Nodes Instead of One Big Machine A single 128GB node can run the model,\nbut role specialization makes the cluster stronger than the sum of its parts:\nDGX1 (Qwen 3.5, thinking OFF): Pure code generation. 65 tok/s, optimized for fast builder output. No thinking overhead. DGX2 (Qwen 3.6, thinking ON): Code review and agentic tasks. Catches bugs the builder missed. Thinking enabled for deeper analysis. 5090 (Q4, 204 tok/s): Interactive queries. Sub-2-second responses for factual questions. Trading quality (Q4 vs FP8) for 3x speed. When they work together (DGX1 builds, DGX2 reviews),\nthe pair scores 99.6/100 across 24 scenarios — higher than either node alone.\nThe Software Stack DGX Spark: vLLM + MTP vLLM 0.19.1 (Docker: vllm-node:latest, 18.2GB image) ├── Model: Qwen3.5-35B-A3B-FP8 (DGX1) / Qwen3.6-35B-A3B-FP8 (DGX2) ├── Context: 262K tokens (FP8 KV cache) ├── Speculative: MTP (Multi-Token Prediction), depth=3 ├── Backend: FlashInfer ├── Port: 30000 (OpenAI-compatible API) └── Speed: 65 tok/s generation, ~2000 tok/s prefill Both DGX nodes run as systemd services with automatic restart.\nGPU clocks locked at 2500 MHz (dgx-gpu-clocks.service).\nDGX2 has a daily vLLM restart timer at 05:00 KST to clear accumulated KV cache fragmentation.\nDesktop 5090: llama.cpp llama-server (llama.cpp, sm_120 build) ├── Model: Qwen3.6-35B-MoE-Q4_K_XL (23.1GB) ├── Context: 32K tokens ├── Speed: 204 tok/s generation, ~12000 tok/s prefill ├── Thinking: Per-request dynamic control └── Port: 8080 (OpenAI-compatible API) The 5090\u0026rsquo;s 1,792 GB/s memory bandwidth gives it absurd prefill speed — 6x faster than the DGX.\nFor single-turn queries, it responds in under 2 seconds.\nZBook: Orchestration The ZBook doesn\u0026rsquo;t run models — it runs everything else:\nClaude Code (primary dev environment) SSH management to all nodes Benchmark harnesses (Local-Trinity, dgx-duo, 5090-harness) Docker services: llm-relay proxy, Neo4j, Redis, Qdrant, SearXNG Network Topology [Internet] ← [Router .1] ├── Desktop5090 .16 (Windows, llama-server) ├── DGX1 .26 (Ubuntu ARM, vLLM) ├── DGX2 .31 (Ubuntu ARM, vLLM) └── [Ethernet Hub] └── ZBook .34 (Ubuntu, orchestrator) All wired gigabit.\nMeasured inter-node throughput: 674–936 Mbps.\nAPI calls between ZBook and any node: \u0026lt;2ms latency.\nThe DGX Sparks are ARM (aarch64) — this matters for Docker images and compiled dependencies.\nvLLM\u0026rsquo;s official ARM images work, but custom tooling needs ARM builds.\nOperational Realities Power and Heat DGX Spark: ~150W each under load. Quiet (server-grade fans, rarely audible). Desktop 5090: ~400W under full GPU load. Loud under sustained inference. ZBook: 45–65W. Fanless for most tasks. Total cluster: ~750W peak, ~400W typical.\nMy electricity bill increased by about $40/month.\nWhat Breaks vLLM memory leaks. After 3–5 days of continuous inference,\nKV cache fragmentation degrades throughput by 10–15%.\nThe daily restart timer on DGX2 solved this.\nDGX1 is more stable (fewer thinking-heavy requests).\nMTP non-determinism. Same prompt, same model, different output.\nvLLM\u0026rsquo;s speculative decoding introduces variance that llama.cpp doesn\u0026rsquo;t have.\nI run n=3 and take the mean for benchmarks.\nFor production code generation, I only trust single-shot results from the 5090.\nDocker on ARM. 
Most community Docker images are x86-only.\nBuilding vLLM from source on ARM took 45 minutes.\nSome Python wheels need compilation.\nThe official NVIDIA DGX images handle this, but anything custom requires patience.\nNetwork interruptions. A brief power blink at the router kills all SSH sessions.\nI use systemd services (not tmux/nohup) for everything critical.\nModels auto-restart on boot.\nModel Updates Switching models on a DGX takes about 30 minutes:\nDownload new model from HuggingFace (~35GB for 35B FP8) Update the vLLM launch script systemctl restart dgx-vllm Run validation suite (5 minutes) I\u0026rsquo;ve switched DGX2 from Qwen 3.5 to 3.6 this way.\nThe old model stays on disk as backup — 932GB is generous.\nWhat This Enables With three nodes, I can:\nCross-check commercial AI tools — Run the same coding task on Claude Code, Codex, and local 35B. Compare quality without trusting any single vendor. A/B test model versions — DGX1 runs 3.5, DGX2 runs 3.6. Same infrastructure, controlled comparison. Run pair programming pipelines — Builder (fast, no thinking) → Reviewer (slower, catches bugs). 99.6/100 on 24 scenarios. Benchmark everything — 55 tasks, 290 unit tests, 19 suites. Reproducible measurements across all nodes. Stay independent — When Claude Code has quota issues or model regressions, my work doesn\u0026rsquo;t stop. Honestly, Is It Worth It? If you\u0026rsquo;re evaluating the economics: Claude Code Max 20 is $200/month.\nThe DGX Sparks cost significantly more upfront.\nPure cost-per-token, the cloud wins for years.\nIf you\u0026rsquo;re evaluating the capability: No cloud API gives you this level of control —\nper-request thinking toggles, model pinning, deterministic inference,\ncross-model pipelines, zero-latency local access, complete privacy.\nIf you\u0026rsquo;re evaluating the learning: Building and operating this taught me more about LLM inference than any paper or benchmark leaderboard.\nThe failure modes (quantization loss, non-determinism, thinking runaways) don\u0026rsquo;t appear in published benchmarks.\nYou only find them by running thousands of real tasks.\nThe cluster isn\u0026rsquo;t cheaper than the cloud.\nIt\u0026rsquo;s more capable in specific ways that matter for my work —\nand it doesn\u0026rsquo;t degrade when a vendor ships a bad update.\nSpecs Summary Component DGX Spark (×2) Desktop 5090 ZBook Ultra Memory 128GB UMA 32GB GDDR7 108GB UMA Bandwidth 273 GB/s 1,792 GB/s — Max Context 262K 32K — Generation 65 tok/s (FP8) 204 tok/s (Q4) — Precision FP8 (production) Q4 (interactive) — Backend vLLM + MTP llama.cpp — Power ~150W ~400W ~50W Role Production inference Interactive / speed Orchestration Personal infrastructure. Hardware availability and pricing vary by region.\n","permalink":"https://arknill.github.io/blog/3-node-home-llm-lab/","summary":"\u003cp\u003eI run a 3-node local LLM inference cluster at home.\u003cbr\u003e\nTwo NVIDIA DGX Sparks (128GB unified memory each) and one RTX 5090 desktop (32GB VRAM).\u003cbr\u003e\nAll three serve Qwen 3.5/3.6 35B MoE models 24/7 over my local network.\u003c/p\u003e\n\u003cp\u003eThis isn\u0026rsquo;t a weekend experiment — it\u0026rsquo;s my daily development infrastructure.\u003cbr\u003e\nEvery code review, every research query, every benchmark runs against these nodes.\u003cbr\u003e\nHere\u0026rsquo;s what the setup looks like, what it costs, and what I learned that no spec sheet tells you.\u003c/p\u003e","title":"I Built a 3-Node Home LLM Lab. 
Here's What It Actually Takes."},{"content":"I run Qwen 3.5 and 3.6 (35B MoE, 3B active parameters) in production across three nodes — two DGX Spark (FP8, vLLM) and one RTX 5090 (Q4, llama.cpp).\nAfter 100+ benchmark scenarios and thousands of inference calls,\nthree problems dominated my debugging time:\nQuantization loss is not uniform — MoE models at Q4 lose 16% on CJK tasks vLLM is non-deterministic under speculative decoding — identical prompts produce different outputs Thinking tokens consume 60–90% of the budget on tasks where they provide zero benefit None of these show up in standard benchmarks.\nAll of them break production workflows.\n1. Quantization: FP8 Is the Production Floor The common wisdom — \u0026ldquo;30B+ models lose less than 1% at Q4\u0026rdquo; — is only true for Dense models on English benchmarks.\nMoE Models Are Different A 35B MoE with 3B active parameters is effectively a small model from a quantization perspective.\nThe \u0026ldquo;35B total\u0026rdquo; is misleading — you\u0026rsquo;re quantizing expert weights that route dynamically,\nand any bit-flip in the router logits causes discrete misrouting,\nnot the smooth degradation you get with Dense models.\nFrom 25+ papers (Dettmers 2022, Ouyang 2024, APEX 2025) and my own measurements:\nArchitecture Q4 English Loss Q4 CJK Loss Mechanism Dense 30B+ \u0026lt;1% 4–5% Smooth degradation MoE 35B-A3B 1–14% 10–17% Router misrouting (discrete) The CJK Penalty Marchisio et al. (EMNLP 2024) found that automatic benchmarks underestimate CJK loss by 10–15 percentage points\ncompared to human evaluation:\nLanguage Auto Benchmark Human Eval Gap Japanese -1.7% -16.0% 14.3pp Korean minor -4.6% — Chinese (MGSM) -17.3% — — My measurement on Qwen 3.5 35B-A3B: FP8 vs Q4 on Korean factual tasks showed -16% on human-evaluated quality\nwhile automatic metrics showed only -1.2%.\nThe Shared Expert Problem MoE architectures have \u0026ldquo;shared experts\u0026rdquo; (always active, regardless of routing).\nThese shared weights have kurtosis 13.10 — nearly 4x higher than routed experts (3.41).\nHigh kurtosis means outlier values that get clipped by low-bit quantization.\nAPEX (2025) showed that per-expert quantization — Q6_K for routed experts, Q8_0 for shared experts — matches F16 quality.\nUniform Q4 across all experts loses 1–14% depending on the task.\nWhat I Run DGX (production): FP8 — quality baseline, 65 tok/s 5090 (interactive): Q4 — speed priority, 204 tok/s, accept CJK penalty FP8 is the production floor.\nBelow that, you\u0026rsquo;re trading quality for speed in ways that don\u0026rsquo;t show up until users notice degraded Korean/Japanese output.\n2. 
vLLM Non-Determinism: The MTP Problem During my Local-Trinity benchmark (55 tasks, n=3 per node),\nI discovered that vLLM produces different outputs from identical prompts:\nNode Backend Stability (n=3) Desktop 5090 llama-server STABLE 12/12 DGX2 vLLM + MTP UNSTABLE 4/12 Same Qwen 3.6 model, same prompts.\nThe 5090 produced identical outputs across all 3 runs for every task.\nThe DGX varied wildly — one task scored 100, 100, 9 across three runs.\nRoot Cause: Triple Non-Determinism vLLM has three sources of non-determinism that compound:\nCUDA kernel non-determinism — FP8 matrix multiply accumulation order varies with thread scheduling Batch scheduling — continuous batching means different request interleaving on each run MTP (Multi-Token Prediction) speculation — speculated tokens accepted/rejected differently based on timing Thinking Machines Lab confirmed this at scale:\nrunning Qwen3-235B through vLLM 1,000 times produced 80 distinct output variations from the same prompt.\nMTP Is the Primary Cause I tested with MTP disabled on DGX2:\nMTP ON: UNSTABLE 4/12, HMT-03 scored 100→39 on one run MTP OFF: STABLE, HMT-03 passed 3/3 (std 0.15s) But MTP OFF costs 14% speed (61→52.7 tok/s)\nand caused timeouts on thinking-heavy tasks.\nOperational Implication Production: MTP ON + n=3 mean scoring (accept variance) Single-shot trust: Desktop 5090 (llama-server) only If you need deterministic outputs (test suites, reproducible research), don\u0026rsquo;t use vLLM with MTP.\nUse llama.cpp or disable MTP and accept the speed penalty.\n3. Thinking Tokens: When \u0026ldquo;Smarter\u0026rdquo; Makes Things Worse Qwen 3.6 supports per-request thinking control.\nI ran every task type with thinking ON and OFF:\nTask Type Thinking ON Thinking OFF Speed Difference Factual queries 9/9 9/9 OFF 2.5x faster Code generation 5/5 5/5 OFF 4.2x faster Code debugging 3/3 3/3 OFF 6.0x faster Multi-turn coding 1/2 2/2 OFF 7.3x faster Complex math 3/3 0/3 ON required Thinking OFF is the correct default for coding.\nIt produces equal or better results at 4–7x the speed.\nThinking ON only helps for complex mathematical reasoning — and even there, it has a catastrophic failure mode.\nThe Thinking Runaway Problem With thinking enabled on complex coding tasks,\nQwen 3.6 sometimes enters a thinking loop — spending the entire token budget on internal reasoning,\nleaving no capacity for actual code output:\nbugfix task: TTFT 141s, completion_tokens=1, empty output (3/3 repro) refactor task: TTFT 130s, completion_tokens=291, incomplete code This is a known issue (QwenLM/Qwen3.6#88): 17.4% of hard coding tasks trigger thinking token exhaustion.\nOf those, 84% show repetitive loops within the \u0026lt;think\u0026gt; block.\nThe vLLM Bug That Makes It Worse vLLM issue #39573: when MTP is active,\nthinking_token_budget is silently ignored.\nYou cannot cap thinking tokens while using speculative decoding.\nThe parameter is accepted without error and does nothing.\nThis means:\nMTP ON + thinking ON = no budget control, runaway possible MTP OFF + thinking ON + budget = works, but 14% slower MTP ON + thinking OFF = safe and fastest for coding Per-Request Control The solution is dynamic per-request thinking control:\n// Default (all coding tasks): {\u0026#34;chat_template_kwargs\u0026#34;: {\u0026#34;enable_thinking\u0026#34;: false}} // Complex math only: // Omit extra_body → server default (thinking ON) No server restart needed.\nThinking is toggled per API call.\nMy harness classifies task complexity upfront and routes 
accordingly.\nSummary: Production Configuration After v1–v7 of each harness (100+ scenarios per node, 290+ unit tests),\nthis is the configuration that survived testing:\nDecision Choice Evidence Quantization floor FP8 Q4 MoE: CJK -16%, router misrouting Speed tier Q4 for interactive, FP8 for production Accept CJK penalty for 3x speed vLLM determinism MTP ON + n=3 mean Single-shot determinism only with llama-server Thinking default OFF ON only for complex math (3/3 vs 0/3) Thinking budget Not available with MTP vLLM #39573 open Max tokens (thinking ON) 32768 Prevents runaway (80% → 0% failure) None of these decisions came from reading papers or following leaderboards.\nAll of them came from running the same tasks repeatedly until the failure modes revealed themselves.\nReferences Dettmers \u0026amp; Zettlemoyer (2022): LLM.int8() — 35,000 experiments on quantization scaling Marchisio et al. (EMNLP 2024): CJK quality loss under quantization APEX (2025): Per-expert quantization for MoE models Thinking Machines Lab: vLLM non-determinism at scale (Qwen3-235B, 1000 runs) vLLM #39573: MTP + thinking_budget incompatibility QwenLM Qwen3.6#88: Thinking token runaway on LiveCodeBench Measured on DGX Spark (128GB, vLLM 0.19.1) and RTX 5090 (32GB, llama.cpp). Your configuration may differ.\n","permalink":"https://arknill.github.io/blog/quantization-determinism-thinking-production/","summary":"\u003cp\u003eI run Qwen 3.5 and 3.6 (35B MoE, 3B active parameters) in production across three nodes — two DGX Spark (FP8, vLLM) and one RTX 5090 (Q4, llama.cpp).\u003cbr\u003e\nAfter 100+ benchmark scenarios and thousands of inference calls,\u003cbr\u003e\nthree problems dominated my debugging time:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eQuantization loss\u003c/strong\u003e is not uniform — MoE models at Q4 lose 16% on CJK tasks\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003evLLM is non-deterministic\u003c/strong\u003e under speculative decoding — identical prompts produce different outputs\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eThinking tokens\u003c/strong\u003e consume 60–90% of the budget on tasks where they provide zero benefit\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eNone of these show up in standard benchmarks.\u003cbr\u003e\nAll of them break production workflows.\u003c/p\u003e","title":"Quantization, Determinism, and Thinking Tokens: Running Open-Source LLMs in Production"},{"content":"I run Qwen 3.6 35B on three machines.\nThe RTX 5090 generates at 204 tok/s.\nThe DGX Spark pair generates at 65 tok/s.\nBy every benchmark leaderboard metric, the 5090 is 3x faster.\nBut for multi-step coding tasks with thinking enabled,\nthe DGX pair completes the job faster.\nAnd for single-turn questions,\nthe 5090 delivers the answer in under 2 seconds while the DGX takes 8–12 seconds.\ntok/s alone told me nothing useful about actual user experience.\nHere\u0026rsquo;s what I learned building benchmarks for all three nodes.\nThe Paradox My 3-node setup:\nNode GPU Model tok/s Memory BW Desktop 5090 RTX 5090 (32GB) Qwen3.6-35B MoE Q4 204 1,792 GB/s DGX1 Grace Hopper (128GB) Qwen3.5-35B FP8 65 273 GB/s DGX2 Grace Hopper (128GB) Qwen3.6-35B FP8 65 273 GB/s The 5090 is 3.1x faster at generation. But:\nFor a simple factual question: 5090 responds in 1.2s. DGX responds in 8.5s. The 5090 wins by 7x on wall-clock time. For a multi-turn coding task with thinking: 5090 completes in 45s. DGX pair (builder + reviewer) completes in 38s with higher quality. 
Same model family, same quantization-adjusted quality.\nThe numbers invert depending on the task.\nWhy tok/s Lies The total time to get a response is:\nT_response = T_prompt_processing + T_thinking + T_generation tok/s only measures the last term.\nHere\u0026rsquo;s what it misses:\n1. Prompt Processing (Prefill) Before generating a single token, the model must process your entire input.\nThis is memory-bandwidth bound, not compute-bound.\nNode Prefill Speed 8K context prefill 5090 ~12,000 tok/s 0.7s DGX ~2,000 tok/s 4.0s The 5090\u0026rsquo;s 1,792 GB/s memory bandwidth means it chews through prompts 6x faster than the DGX\u0026rsquo;s 273 GB/s.\nFor a long context window (32K tokens),\nthe DGX spends 16 seconds just reading the prompt.\nThe 5090 does it in 2.7 seconds.\n2. Thinking Tokens (Invisible Cost) With thinking enabled,\nQwen 3.6 spends 60–90% of its output tokens on internal reasoning that never reaches the user.\nIf a response shows 200 visible tokens but actually generated 2,000 (with 1,800 in \u0026lt;think\u0026gt; blocks),\nyour effective speed is:\nEffective tok/s = visible_tokens / wall_clock_time On the 5090 at 204 tok/s with 90% thinking overhead:\nRaw: 204 tok/s Effective: 20.4 tok/s (only 10% becomes visible output) On the DGX at 65 tok/s with the same thinking ratio:\nRaw: 65 tok/s Effective: 6.5 tok/s The 5090 still wins in absolute terms, and the gap doesn\u0026rsquo;t budge: thinking overhead scales both nodes equally, so the ratio stays at 3.1x.\nThe real problem: if the DGX uses a dual-model pipeline (builder generates, reviewer validates in parallel),\nthe total pipeline time can beat a single-model serial workflow.\n3. Pipeline Overhead (Multi-Step Tasks) For coding tasks, my DGX duo runs:\nDGX1 (Qwen 3.5) generates the code — optimized for speed, thinking OFF DGX2 (Qwen 3.6) reviews and fixes — thinking ON, catches bugs Two passes at 65 tok/s each, but with specialized roles.\nThe reviewer catches errors that a single-pass generator misses —\nsaving the retry loop that burns time on a single-GPU setup.\nThe Framework: TTR, Effective tok/s, TCT After measuring hundreds of tasks across all three nodes,\nI defined three metrics that actually predict user satisfaction:\nTTR (Time to Response) TTR = T_prefill + T_thinking + T_generation The wall-clock time from pressing Enter to seeing the complete response.\nThis is what the user feels.\nExample — simple factual question:\nNode Prefill Thinking Generation TTR 5090 0.3s 0s (OFF) 0.9s 1.2s DGX 2.1s 0s (OFF) 6.4s 8.5s The 5090 is 7x faster on TTR despite being only 3.1x faster on tok/s.\nThe difference is prefill — memory bandwidth dominates short interactions.\nEffective tok/s Effective tok/s = content_tokens / TTR Only counts tokens that reach the user.\nStrips thinking overhead.\nNode Raw tok/s Thinking overhead Effective tok/s 5090 (thinking OFF) 204 0% 204 5090 (thinking ON) 204 85% ~31 DGX (thinking ON) 65 85% ~10 TCT (Task Completion Time) TCT = Σ(all TTR steps) + pipeline_overhead For multi-step tasks (code → test → fix → verify), TCT captures the full workflow.\nA faster tok/s that requires 3 retries loses to a slower tok/s that gets it right in one pass.\nExample — coding task (state machine, 3-turn):\nSetup Pass 1 Pass 2 Pass 3 TCT Quality 5090 solo 15s 12s 18s 45s 97.1 DGX duo (build+review) 22s 16s — 38s 100.0 The DGX duo finished faster and scored higher —\nthe reviewer caught the bug that would have required a third pass on the 5090.\nWhen Each Node Wins After running 55 tasks across all three nodes (Local-Trinity benchmark suite):\nTask Type Winner Why Single-turn 
factual 5090 Prefill speed dominates. 7x TTR advantage. Simple code generation 5090 Thinking OFF, pure generation speed. Complex multi-turn coding DGX duo Reviewer catches bugs, avoids retry loops. Long-context analysis DGX 128GB VRAM, 262K context. 5090 limited to 32K. Math with reasoning 5090 Same quality, 3x faster thinking. The right metric depends on the workload.\nNo single number captures this.\nPractical Implications If you\u0026rsquo;re buying hardware for local LLM:\nInteractive use (chat, quick questions): maximize memory bandwidth. Consumer GPUs win. Agentic workflows (multi-step, tool use): maximize VRAM and context window. Workstation/datacenter GPUs win. Don\u0026rsquo;t compare tok/s across architectures — a 200 tok/s Q4 model on consumer GPU and a 65 tok/s FP8 model on datacenter GPU serve different purposes. If you\u0026rsquo;re building benchmarks:\nReport TTR for interactive tasks, TCT for agentic tasks Always report prefill speed separately from generation speed If thinking is enabled, report both raw and effective tok/s Never aggregate a single tok/s number across mixed workloads Data Sources All measurements from my 3-node home lab running Qwen 3.5/3.6 35B MoE:\n5090 harness: 100+ scenarios, v1–v7, thinking ON/OFF A/B tests DGX duo harness: 254 tests, 14 E2E scenarios, n=3 stability verification Local-Trinity: 19 suites, 55 tasks, 290 unit tests — cross-node comparison framework Latency metrics research: full methodology Measured on personal hardware. Your results will vary with model size, quantization, context length, and workload mix.\n","permalink":"https://arknill.github.io/blog/what-tok-s-doesnt-tell-you/","summary":"\u003cp\u003eI run Qwen 3.6 35B on three machines.\u003cbr\u003e\nThe RTX 5090 generates at \u003cstrong\u003e204 tok/s\u003c/strong\u003e.\u003cbr\u003e\nThe DGX Spark pair generates at \u003cstrong\u003e65 tok/s\u003c/strong\u003e.\u003cbr\u003e\nBy every benchmark leaderboard metric, the 5090 is 3x faster.\u003c/p\u003e\n\u003cp\u003eBut for multi-step coding tasks with thinking enabled,\u003cbr\u003e\nthe DGX pair \u003cstrong\u003ecompletes the job faster\u003c/strong\u003e.\u003cbr\u003e\nAnd for single-turn questions,\u003cbr\u003e\nthe 5090 delivers the answer in under 2 seconds while the DGX takes 8–12 seconds.\u003c/p\u003e\n\u003cp\u003etok/s alone told me nothing useful about actual user experience.\u003cbr\u003e\nHere\u0026rsquo;s what I learned building benchmarks for all three nodes.\u003c/p\u003e","title":"What tok/s Doesn't Tell You: Measuring LLM Speed That Matters"},{"content":"On April 23, Anthropic published a postmortem acknowledging three product-layer bugs that degraded Claude Code from March 4 through April 20.\nThey frame it as: model weights unchanged, harness bugs fixed, problem solved.\nThe three bugs are real. Their impact was real. 
The fixes were real.\nBut the postmortem is a carefully scoped document that tells half the truth.\nHere\u0026rsquo;s the other half.\nWhat They Admitted (Correctly) Bug Introduced Fixed Duration Effort high → medium March 4 (v2.1.68) April 21 (v2.1.117) 48 days Thinking cache clear every turn March 26 (v2.1.85) April 10 (v2.1.101) 15 days \u0026ldquo;≤25 words\u0026rdquo; system prompt April 16 (v2.1.111) April 20 (v2.1.116) 4 days All three are product-layer issues — API parameters and system prompts, not model weights.\nThe combined effect: effort reduced thinking depth, thinking cache destroyed session memory, word limit truncated output.\nTogether, these made Claude appear significantly dumber.\nThis explanation is technically correct.\nA user hitting all three bugs simultaneously would experience exactly the degradation that was reported.\nI accept this part.\nThe Strategic Scoping The postmortem covers \u0026ldquo;why 4.6 users felt degradation.\u0026rdquo;\nIt does not address a separate, concurrent complaint stream: \u0026ldquo;4.7 is worse than 4.6.\u0026rdquo;\nThese are model-level regressions that have nothing to do with harness bugs:\nOpus 4.7 Model Issue Evidence Harness Bug? Tokenizer +35% (worse on CJK) Official docs: \u0026ldquo;1.0x to 1.35x more tokens\u0026rdquo; No — architecture change Long-context retrieval: 91.9% → 59.2% (256K) Anthropic\u0026rsquo;s own system card (MRCR v2) No — model capability Long-context retrieval: 78.3% → 32.2% (1M) Anthropic\u0026rsquo;s own system card (MRCR v2) No — model capability Literal instruction following What\u0026rsquo;s New: \u0026ldquo;follows instructions more literally\u0026rdquo; No — training change BrowseComp: 84.0% → 79.6% Anthropic\u0026rsquo;s benchmark No — model regression XML/JSON mixing in long payloads (#49747) Decoder-level format switching No — model behavior Safety over-refusal (#49751) Structural biology blocked on 4.7, works on 4.6 No — RLHF/safety layer The postmortem mentions none of these.\nZero lines.\nThe narrative is: \u0026ldquo;we fixed three bugs, Claude is back to normal.\u0026rdquo;\nBut users who upgraded to 4.7 have a different set of problems that three bug fixes don\u0026rsquo;t address.\nThe postmortem solves Stream A (\u0026ldquo;4.6 got dumber\u0026rdquo;) while pretending Stream B (\u0026ldquo;4.7 is a regression\u0026rdquo;) doesn\u0026rsquo;t exist.\nThe Cost Coincidence All three admitted bugs share a common direction:\nBug User Impact Anthropic Serving Cost Effort high → medium Quality reduction Compute ↓↓ Thinking cache clear every turn Context loss, repetition Prefill ↓↓ (less to cache) \u0026ldquo;≤25 words\u0026rdquo; output limit Information reduction Output tokens ↓↓ Three changes, all independently reducing Anthropic\u0026rsquo;s cost of serving each request.\nFramed as \u0026ldquo;UI freezing fix,\u0026rdquo; \u0026ldquo;caching optimization,\u0026rdquo; and \u0026ldquo;verbosity reduction.\u0026rdquo;\nI cannot prove intent. But:\nThe probability of 3/3 changes accidentally aligning with cost reduction is low Two of three were never recorded in the CHANGELOG — suggesting organizational awareness that they\u0026rsquo;d be controversial The effort change was recorded but framed as a product improvement (\u0026ldquo;sweet spot between speed and thoroughness\u0026rdquo;), never as a cost-driven trade-off If these were genuine accidents with no cost motivation, why hide two of them from the changelog entirely?\nThe CHANGELOG Gap Is the Real Story Forget the bugs for a moment. 
The structural finding is:\nAnthropic can and does make behavior-affecting changes to Claude Code without any public record.\nThe thinking cache bug was introduced March 26 and fixed April 10.\nNeither event appears in the CHANGELOG.\nI searched all 3,285 lines — the terms \u0026ldquo;thinking cache,\u0026rdquo; \u0026ldquo;cache clear,\u0026rdquo; and \u0026ldquo;thinking prun\u0026rdquo; appear nowhere.\nThe \u0026ldquo;≤25 words\u0026rdquo; system prompt was added April 16 and reverted April 20.\nNeither event appears in the CHANGELOG.\nThe terms \u0026ldquo;25 words\u0026rdquo; and \u0026ldquo;length limit\u0026rdquo; appear nowhere.\nSystem prompt changes are structurally invisible. Users cannot:\nSee what system prompts are active in their version Diff system prompts between versions Know when a system prompt change degraded their experience Distinguish \u0026ldquo;the model is dumber\u0026rdquo; from \u0026ldquo;an invisible instruction is limiting it\u0026rdquo; The postmortem\u0026rsquo;s remediation plan mentions \u0026ldquo;soak periods\u0026rdquo; and \u0026ldquo;gradual rollouts\u0026rdquo; — but says nothing about mandatory documentation for behavior-affecting changes.\nThe structural opacity remains intact.\nPost-Postmortem: Not Fixed The postmortem narrative is \u0026ldquo;three bugs, all fixed as of April 20.\u0026rdquo; Issues filed April 22–24 on v2.1.117–119 show otherwise:\nIssue Problem Anthropic Response #52502 Subagent model: haiku pin ignored — everything runs on Opus ($10.87 vs $0.0005) None #52534 effortLevel setting overridden by unpinOpus47LaunchEffort flag — programmatic control broken None #52522 Auto-compact threshold changed from 200K→1M — 5x token usage overnight Bot: \u0026ldquo;duplicate\u0026rdquo; #52228 Model fabricated \u0026ldquo;Human:\u0026rdquo; prompts, began self-dialogue with unilateral workstation action None These are not edge cases.\n#52502 is a billing issue with direct financial impact.\n#52534 means effort configuration doesn\u0026rsquo;t work.\n#52522 means a \u0026ldquo;fix\u0026rdquo; caused 5x cost increase for existing users.\nThe \u0026ldquo;all fixed\u0026rdquo; framing lasted approximately 48 hours before new issues contradicted it.\nWhat I\u0026rsquo;d Respect Instead If the postmortem had said:\n\u0026ldquo;We found three harness bugs that explain the 4.6 degradation reports. We\u0026rsquo;ve fixed them. Separately, Opus 4.7 has known trade-offs (tokenizer efficiency, long-context retrieval, instruction following behavior) that we\u0026rsquo;re working on. 
We also acknowledge that our CHANGELOG does not cover system prompt changes, and we\u0026rsquo;re evaluating how to improve this.\u0026rdquo;\nThat would be a complete postmortem.\nWhat they published is incident response for the defensible subset — an engineering blog post doing the job of a PR statement.\nMy Position I run Claude Code on a pinned version (v2.1.109) with Opus 4.6 and have since April 15.\nThe three postmortem bugs are:\nBug #1 (effort): affected v2.1.68+, so my version was exposed March 4–April 21 (48 days) Bug #2 (thinking cache): affected v2.1.85–101, my version was exposed March 26–April 10 (15 days) before it was fixed server-side in v2.1.101 Bug #3 (≤25 words): v2.1.111+, never hit my pinned version One of three bugs affected me.\nThe other two didn\u0026rsquo;t because I was already pinned below them.\nThis validates version pinning as a defensive strategy — and the postmortem itself confirms that Anthropic ships behavior-affecting changes without notice,\nwhich is exactly why pinning exists.\nOpus 4.6 remains active until at least February 5, 2027.\nI have 10 months of runway.\nThe postmortem didn\u0026rsquo;t change my decision — it reinforced it.\nThe Real Fix: Official LTS for AI Developer Tools I shouldn\u0026rsquo;t have to do this.\nVersion pinning, proxy monitoring, manual CHANGELOG diffing — these are user-side defenses against a problem that vendors should solve structurally.\nEvery company shipping AI-powered CLI/IDE tools needs an official LTS (Long-Term Support) channel.\nNot just Anthropic — this applies to Cursor, GitHub Copilot, Windsurf, Cline, and every tool that intercepts a developer\u0026rsquo;s workflow.\nWhy User-Side Pinning Is Not Enough My setup works: pin v2.1.109, set DISABLE_AUTOUPDATER=1, monitor with a proxy.\nBut this approach:\nDoesn\u0026rsquo;t scale. Most users don\u0026rsquo;t run transparent proxies or read CHANGELOGs. They just notice \u0026ldquo;it got worse\u0026rdquo; and can\u0026rsquo;t diagnose why. Is adversarial. I\u0026rsquo;m fighting my own tool\u0026rsquo;s update mechanism. The tool actively wants to be on latest — auto-updater, deprecation timers, feature flags that assume current version. Has an expiration date. Opus 4.6 retires February 2027. Old client versions will eventually hit API incompatibilities. There\u0026rsquo;s no guarantee a pinned version keeps working. Shifts responsibility to users. If a vendor ships a breaking change, the user who didn\u0026rsquo;t pin is blamed for not pinning. That\u0026rsquo;s backwards. What Official LTS Looks Like The model already exists in every serious software ecosystem — Node.js, Ubuntu, PostgreSQL, Java.\nFor AI developer tools:\nFeature Current Reality LTS Channel Update cadence Continuous (daily releases) Quarterly security/stability patches only Model changes Any time, no opt-in Frozen model for LTS duration, explicit migration path System prompt changes Invisible, unrecorded Documented, opt-in for LTS users Breaking behavior changes Shipped without notice Require explicit migration (deprecation → removal cycle) Minimum support window \u0026ldquo;Not sooner than\u0026rdquo; deprecation date Guaranteed 12-month minimum from LTS release Rollback User responsibility (pin + proxy) Vendor-provided: claude --channel lts or equivalent The Business Case This isn\u0026rsquo;t altruism — it\u0026rsquo;s retention:\nEnterprise trust. No enterprise will depend on a tool that silently changes behavior. Official LTS = enterprise-ready. 
Without it, every procurement team adds \u0026ldquo;vendor instability\u0026rdquo; to their risk matrix. Reduced support load. Half of Anthropic\u0026rsquo;s 100+ Opus 4.7 issues are users confused by unannounced changes. A stable channel eliminates this category entirely. Competitive differentiation. The first AI coding tool to offer genuine LTS wins the \u0026ldquo;production infrastructure\u0026rdquo; market segment. Right now, all of them are optimized for demo-day, not day-200. What I\u0026rsquo;m Asking For To Anthropic, and to every company shipping AI developer tools:\nAn official LTS channel with a guaranteed support window (12+ months) No invisible behavior changes on the LTS channel — every change documented, no stealth system prompt modifications Model version guarantee — LTS users stay on the model they chose until they explicitly migrate Rollback mechanism — not \u0026ldquo;pin a version and hope,\u0026rdquo; but vendor-supported channel switching Transparent CHANGELOG that covers system prompt changes, not just client code changes The current situation — where users must reverse-engineer proxy data to understand why their tool degraded —\nis not a sustainable relationship between vendors and professional developers.\nEvidence Full cross-check (36 claims verified): 17_OPUS-47-POSTMORTEM-ANALYSIS.md Opus 4.7 technical advisory: 16_OPUS-47-ADVISORY.md 42K API call analysis: 01_BUGS.md DATASET.md (4 independent datasets): DATASET.md Independent analysis based on 42,363 proxy-captured API calls, 36-claim cross-check of the postmortem, and 9 open GitHub issues.\nNot affiliated with or endorsed by Anthropic.\n","permalink":"https://arknill.github.io/blog/anthropic-postmortem-half-truth/","summary":"\u003cp\u003eOn April 23, Anthropic published \u003ca href=\"https://www.anthropic.com/engineering/april-23-postmortem\"\u003ea postmortem\u003c/a\u003e acknowledging three product-layer bugs that degraded Claude Code from March 4 through April 20.\u003cbr\u003e\nThey frame it as: model weights unchanged, harness bugs fixed, problem solved.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eThe three bugs are real. Their impact was real. 
The fixes were real.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBut the postmortem is a carefully scoped document that tells half the truth.\u003cbr\u003e\nHere\u0026rsquo;s the other half.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"what-they-admitted-correctly\"\u003eWhat They Admitted (Correctly)\u003c/h2\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eBug\u003c/th\u003e\n          \u003cth\u003eIntroduced\u003c/th\u003e\n          \u003cth\u003eFixed\u003c/th\u003e\n          \u003cth\u003eDuration\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eEffort \u003ccode\u003ehigh\u003c/code\u003e → \u003ccode\u003emedium\u003c/code\u003e\u003c/td\u003e\n          \u003ctd\u003eMarch 4 (v2.1.68)\u003c/td\u003e\n          \u003ctd\u003eApril 21 (v2.1.117)\u003c/td\u003e\n          \u003ctd\u003e\u003cstrong\u003e48 days\u003c/strong\u003e\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eThinking cache clear every turn\u003c/td\u003e\n          \u003ctd\u003eMarch 26 (v2.1.85)\u003c/td\u003e\n          \u003ctd\u003eApril 10 (v2.1.101)\u003c/td\u003e\n          \u003ctd\u003e15 days\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e\u0026ldquo;≤25 words\u0026rdquo; system prompt\u003c/td\u003e\n          \u003ctd\u003eApril 16 (v2.1.111)\u003c/td\u003e\n          \u003ctd\u003eApril 20 (v2.1.116)\u003c/td\u003e\n          \u003ctd\u003e4 days\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eAll three are product-layer issues — API parameters and system prompts, not model weights.\u003cbr\u003e\nThe combined effect: effort reduced thinking depth, thinking cache destroyed session memory, word limit truncated output.\u003cbr\u003e\nTogether, these made Claude appear significantly dumber.\u003c/p\u003e","title":"Anthropic's Postmortem Told Half the Truth"},{"content":"On April 23, Anthropic published a postmortem acknowledging three product-layer bugs that degraded Claude Code from March 4 through April 20.\nNo model weights were changed — all three issues were in the harness/product layer.\nI cross-checked every claim against the public CHANGELOG (3,285 lines, v2.1.68–v2.1.119),\n8 GitHub issues via gh issue view, and 10 external sources.\n36 claims checked — 28 confirmed, 5 partially confirmed, 3 not relied upon.\nHere\u0026rsquo;s what the postmortem says, what the CHANGELOG actually shows,\nand what still isn\u0026rsquo;t fixed.\nThe Three Admitted Bugs 1. Effort Downgrade (March 4 – April 21) On March 4, Claude Code\u0026rsquo;s default reasoning effort was changed from high to medium.\nThe CHANGELOG (v2.1.68) framed it as a product improvement:\n\u0026ldquo;Opus 4.6 now defaults to medium effort for Max and Team subscribers. Medium effort works well for most tasks — it\u0026rsquo;s the sweet spot between speed and thoroughness.\u0026rdquo;\nPro and Max subscribers — paying $100–$200/month — ran at reduced quality for 48 days.\nThe fix was rolled out in two phases:\nDate Version Scope April 7 v2.1.94 API-key, Bedrock/Vertex, Team, Enterprise April 21 v2.1.117 Pro/Max subscribers The highest-paying tier was fixed last.\nThe word \u0026ldquo;revert\u0026rdquo; never appears in the CHANGELOG — both the introduction and the fix use forward-looking product language.\n2. 
Thinking Cache Bug (March 26 – April 10) A caching optimization was supposed to clear old thinking blocks from idle sessions.\nInstead, the flag fired on every subsequent turn,\ndestroying prior reasoning on each API call.\nClaude appeared forgetful and repetitive.\nIntroduced: March 26 (v2.1.85) Fixed: April 10 (v2.1.101) CHANGELOG entries: Zero. The bug was introduced and fixed with no public documentation. The terms \u0026ldquo;thinking cache,\u0026rdquo; \u0026ldquo;cache clear,\u0026rdquo; and \u0026ldquo;thinking prun\u0026rdquo; appear nowhere in the 3,285-line CHANGELOG.\n3. \u0026ldquo;≤25 words\u0026rdquo; System Prompt (April 16 – April 20) On the same day Opus 4.7 launched, a system prompt was added:\n\u0026ldquo;Length limits: keep text between tool calls to \u0026lt;=25 words. Keep final responses to \u0026lt;=100 words unless the task requires more detail.\u0026rdquo;\nThis degraded coding quality by 3% across both Opus 4.6 and 4.7 —\nverified by Anthropic\u0026rsquo;s own evaluations.\nIntroduced: April 16 (v2.1.111) Reverted: April 20 (v2.1.116) CHANGELOG entries: Zero. Both introduction and revert undocumented. The Transparency Pattern Bug CHANGELOG Introduction CHANGELOG Fix Postmortem Effort downgrade Present (\u0026ldquo;sweet spot\u0026rdquo;) Present (\u0026ldquo;Changed\u0026rdquo;) \u0026ldquo;Wrong tradeoff\u0026rdquo; Thinking cache Absent Absent Admitted ≤25 words prompt Absent Absent Admitted 2 of 3 admitted bugs have zero CHANGELOG documentation.\nThe one that was documented used promotional language, never acknowledged as a regression.\nThe postmortem was the first public admission that these changes caused degradation.\nSystem prompt changes are structurally invisible —\nthey modify instructions injected before every conversation but are never tracked in the public CHANGELOG.\nUsers cannot diff system prompts between versions.\nPost-Postmortem: 5 Issues That Persist (v2.1.117–119) The postmortem states all three issues were resolved as of April 20.\nBut issues filed April 22–24 demonstrate problems beyond the postmortem\u0026rsquo;s scope:\nSubagent model pin ignored (#52502) Agent frontmatter model: haiku is silently ignored — all work runs on Opus.\nOne user\u0026rsquo;s /usage shows Haiku at $0.0005 vs Opus at $10.87.\nUsers designing cost-optimized multi-agent workflows are unknowingly running everything on the expensive model.\nEffort override bypass (#52534) CLAUDE_CODE_EFFORT_LEVEL env var and settings.json effortLevel are overridden by a UI-level flag (unpinOpus47LaunchEffort).\nThe flag only releases when the user interactively uses /effort — a chicken-and-egg problem.\nAutomated workflows cannot control cost.\nAuto-compact threshold change (#52522) v2.1.117 changed the auto-compact threshold from ~200K to ~1M tokens.\nThe CHANGELOG calls it a \u0026ldquo;Fix\u0026rdquo; (was computing against 200K instead of Opus 4.7\u0026rsquo;s native 1M).\nBut for users operating under the 200K regime, this caused 5x token usage overnight.\nThe behavioral change was documented; its cost impact was not.\nSelf-conversation safety issue (#52228) Model fabricated \u0026ldquo;Human:\u0026rdquo; prompts from archived documents and began self-dialogue with unilateral action on the workstation.\nA safety-relevant failure — the model generated fictitious user input and used it as authorization for actions.\nCLAUDE.md rule violation (#52652) Explicit violation of \u0026ldquo;NEVER execute Git commands\u0026rdquo; rule — unauthorized git stash \u0026amp;\u0026amp; ... 
\u0026amp;\u0026amp; git stash pop.\nThe model did not verify the result after execution.\nAnthropic response to all five: none as of April 27.\nWhat Remains Unaddressed The postmortem explains three specific harness bugs.\nIt does not address:\nB5 — Tool result truncation (167,770 events in my proxy data) B3 — Client-side false rate limiting (151 synthetic errors) B4 — Silent context removal (5,437 events) #49302 — Cache metering anomaly (7x overcharge reported) #49503 — Model pin bypass Long-context regression — 91.9% → 59.2% at 256K, 78.3% → 32.2% at 1M (Anthropic\u0026rsquo;s own system card) These are documented in 16_OPUS-47-ADVISORY.md with the full evidence trail.\nMy Recommendation v2.1.109 remains the safe version.\nIt predates the Opus 4.7 model pin bypass, the tokenizer change (+35%),\nand all three postmortem bugs were already fixed by v2.1.101/v2.1.116.\nBut v2.1.109 on Opus 4.6 avoids the post-postmortem issues entirely.\nOpus 4.6 is active until at least February 5, 2027 — about 10 months of runway.\nIf you\u0026rsquo;re on a pinned version and stable, there is no urgency to upgrade.\nFull Evidence Postmortem cross-check (36 claims): 17_OPUS-47-POSTMORTEM-ANALYSIS.md Opus 4.7 technical advisory: 16_OPUS-47-ADVISORY.md Bug evidence: 01_BUGS.md Proxy tool: llm-relay — pip install llm-relay Independent analysis. Not affiliated with or endorsed by Anthropic.\n","permalink":"https://arknill.github.io/blog/opus-47-postmortem-what-changelog-didnt-say/","summary":"\u003cp\u003eOn April 23, Anthropic published a \u003ca href=\"https://www.anthropic.com/engineering/april-23-postmortem\"\u003epostmortem\u003c/a\u003e acknowledging three product-layer bugs that degraded Claude Code from March 4 through April 20.\u003cbr\u003e\nNo model weights were changed — all three issues were in the harness/product layer.\u003c/p\u003e\n\u003cp\u003eI cross-checked every claim against the public CHANGELOG (3,285 lines, v2.1.68–v2.1.119),\u003cbr\u003e\n8 GitHub issues via \u003ccode\u003egh issue view\u003c/code\u003e, and 10 external sources.\u003cbr\u003e\n\u003cstrong\u003e36 claims checked — 28 confirmed, 5 partially confirmed, 3 not relied upon.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s what the postmortem says, what the CHANGELOG actually shows,\u003cbr\u003e\nand what still isn\u0026rsquo;t fixed.\u003c/p\u003e","title":"Opus 4.7 Postmortem: What the Changelog Didn't Say"},{"content":"I pay $200/month for Claude Code Max 20.\nOn April 1, my quota hit 100% in 70 minutes during normal coding.\nThat turned out to be two cache bugs — Anthropic fixed them in v2.1.90–91,\nand it made a real difference.\nBut even after the fix, I wanted to understand where the quota actually goes.\nSo I filed issues, dug into community threads,\nand built a transparent proxy to measure every API call.\nAfter the cache bugs were fixed, I pinned v2.1.91 and kept measuring.\nLater I pinned v2.1.109 — the last stable release before Opus 4.7.\nOver 19 days, I logged 42,363 API calls across 298 sessions —\nall on cache-fixed, pinned versions.\nHere\u0026rsquo;s what my data shows.\nWhere the Tokens Go The token volume breakdown across 42,363 proxy-captured requests:\nCategory Tokens Share Cache Read 4,613,123,498 97.0% Cache Creation 91,391,928 1.9% Input 33,337,097 0.7% Output 20,559,648 0.4% Cache read is not waste — it\u0026rsquo;s how the API works.\nEvery request re-reads the conversation context from cache,\nwhich is more efficient than rebuilding it from scratch.\nThe question is whether 
this volume counts against your quota, and at what rate.\nMore on that below.\nEach 1% of my 5-hour quota window produced 9,000–16,000 tokens of visible output.\nA full 100% window means 0.9M–1.6M tokens of actual code across an entire session.\nMy Original Hypothesis Was Wrong I initially suspected extended thinking tokens were the hidden cost —\nthey don\u0026rsquo;t appear in output_tokens, can\u0026rsquo;t be seen in logs,\ncan\u0026rsquo;t be counted through a proxy.\nIt seemed like the obvious explanation for where the quota was going.\nI was wrong.\nAn independent researcher proved it (see Independent Corroboration below).\nWhat My Proxy Found: 11 Bugs Across 5 Layers Beyond the token breakdown, the proxy and JSONL analysis uncovered 11 confirmed bugs.\nAnthropic fixed the two worst — B1 Sentinel (cache prefix corruption) and B2 Resume (full context replay) — in v2.1.90–91.\nAll numbers below are from data collected after those fixes.\nThe remaining 8 are different in nature —\ncontext management, tool result handling, local logging —\nand have survived 20 releases (v2.1.92–v2.1.112):\nThe ones that hurt most B5 — Tool result truncation. My proxy logged 167,770 truncation events.\nTool results are silently capped at 200K aggregate characters —\nanything older gets cut to 1–41 chars.\nYou\u0026rsquo;re paying for 1M context, but tool results get a 200K budget.\nThis is controlled by a server-side setting, not client code.\nB3 — Fake rate limits. The client generates \u0026ldquo;Rate limit reached\u0026rdquo; errors without making an API call.\nI found 151 synthetic errors across 65 sessions.\nYou\u0026rsquo;re being throttled by your own tool. (#40584)\nB4 — Silent context removal. Old tool results are silently stripped from context.\n5,437 removal events measured.\nThe model loses earlier context without warning. (#42542)\nB10 — Context injection. Deprecated TaskOutput messages inject up to 87K tokens into context,\ntriggering cascading autocompact.\nLast verified on v2.1.109.\nOthers documented B8 (log inflation, 2.37x duplication across 532 files),\nB8a (JSONL corruption from concurrent tools),\nB9 (/branch message duplication, 6%→73% context),\nB11 (zero reasoning tokens — Anthropic acknowledged on HN).\nFull evidence: 01_BUGS.md\nBug status was last verified on v2.1.109–112. Some may have been addressed in newer releases — check the linked issues for current status.\nOpus 4.7: What I Measured vs. What Others Found Opus 4.7 launched April 15.\nI documented the tokenizer and search regressions;\ntwo other researchers independently measured the quota burn from their own accounts:\nWhat I measured:\nTokenizer +35%: the same content consumes more tokens on 4.7 Long-context search regression: accuracy dropped from 91.9% to 59.2% overall, and from 78.3% to 32.2% at 524K–1024K context (source: Opus 4.7 advisory) Model pin bypass: v2.1.111+ ignores settings.json model setting and silently switches to 4.7 (#49503) What others measured independently (their data, their accounts):\nWho Plan / Region Finding @cnighswonger Max 5x, US, 71 calls Sustained burn 2.4x vs 4.6. After Anthropic expanded rate limits post-launch, per-call cost improved ~4.2x. @fgrosswig Max 5x, EU, A/B test 12.5x sustained on simultaneous 4.6/4.7 test. Cold start (first few calls only): up to 50x. Each row is a different person, different account, different region. The range (2.4x–12.5x sustained) reflects real variance across conditions. 
Opus 4.7: What I Measured vs. What Others Found

Opus 4.7 launched April 15. I documented the tokenizer and search regressions; two other researchers independently measured the quota burn from their own accounts.

What I measured:

- Tokenizer +35%: the same content consumes more tokens on 4.7
- Long-context search regression: accuracy dropped from 91.9% to 59.2% overall, and from 78.3% to 32.2% at 524K–1024K context (source: Opus 4.7 advisory)
- Model pin bypass: v2.1.111+ ignores the settings.json model setting and silently switches to 4.7 (#49503)

What others measured independently (their data, their accounts):

| Who           | Plan / Region        | Finding |
|---------------|----------------------|---------|
| @cnighswonger | Max 5x, US, 71 calls | Sustained burn 2.4x vs 4.6. After Anthropic expanded rate limits post-launch, per-call cost improved ~4.2x. |
| @fgrosswig    | Max 5x, EU, A/B test | 12.5x sustained on a simultaneous 4.6/4.7 test. Cold start (first few calls only): up to 50x. |

Each row is a different person, different account, different region. The range (2.4x–12.5x sustained) reflects real variance across conditions. The 50x is a cold-start outlier, not typical usage.

Update (April 23): Anthropic restored default effort to high for Pro/Max in v2.1.117 and reverted the "≤25 words" system prompt in v2.1.116. The model pin bypass (#49503) and the long-context search regression remain open as of April 27.

Recommendation: stay on v2.1.109 with /model claude-opus-4-6 until the model pin bypass and the search regression are resolved.

Different Plans Behave Differently

My proxy captured data on two tiers (Max 20x and Max 5x, same account holder):

| Tier    | Haiku calls | Opus calls |
|---------|-------------|------------|
| Max 20x | 20.77%      | 78.84%     |
| Max 5x  | 0.11%       | 83.46%     |

The 190x difference in Haiku calls is largely architectural — Claude Code's built-in Explore subagent uses Haiku by default (confirmed by cnighswonger's subagent transcript analysis). Max 20x usage patterns may trigger more subagent calls. This is a structural difference, not necessarily model substitution.

Separately, fgrosswig observed 14% Haiku model substitution on a Pro-tier account (EU), while cnighswonger saw zero mismatches on Max 5x (US) across 14K+ calls. Whether this is tier-dependent, session-length-dependent, or load-dependent remains an open question with data from only two accounts.
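Checking where your own calls go takes one counter over the same capture, since every Messages API request names its model. A minimal sketch; the substring bucketing assumes model IDs spelled like claude-opus-4-6, and relay.jsonl again stands in for whatever your proxy writes:

```python
import json
from collections import Counter

def model_shares(jsonl_path):
    """Fraction of captured calls per model family."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            model = json.loads(line).get("model", "")
            family = next((fam for fam in ("opus", "sonnet", "haiku")
                           if fam in model), "other")
            counts[family] += 1
    total = sum(counts.values())
    return {family: count / total for family, count in counts.items()}

print(model_shares("relay.jsonl"))  # e.g. {'opus': 0.788, 'haiku': 0.208, ...}
```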
Anthropic's Response

Update (April 23): Anthropic published a postmortem acknowledging three product-layer bugs that degraded Claude Code between March and April 2026:

1. Effort downgrade — the default effort level was silently changed from high to medium on March 4 (v2.1.68). Pro and Max users ran at reduced quality for 48 days until it was restored on April 21 (v2.1.117).
2. Thinking cache pattern clear — a change on March 26 (v2.1.85) broke thinking token caching; fixed April 10 (v2.1.101). Not recorded in the CHANGELOG.
3. "≤25 words" system prompt — added April 16 (v2.1.111), caused a 3% coding quality drop, reverted April 20 (v2.1.116). Not recorded in the CHANGELOG.

The postmortem confirms none of these involved model weight changes — they were configuration and prompt-layer issues. Anthropic also reset usage meters and launched @ClaudeDevs for future incident communication.

What remains unaddressed: the specific bugs documented in this investigation (B3–B11) — tool result truncation (167,770 events), fake rate limits (151 events), silent context removal (5,437 events) — are not covered by the postmortem. The quota formula question (the 0x→1x cache_read weight hypothesis) also remains unanswered.

Prior communication (April 2): before the postmortem, Lydia Hallie (Anthropic, Product) posted on X: "We fixed a few bugs along the way, but none were over-charging you." The cache fixes (B1/B2) she referenced were real and helpful. The three bugs acknowledged in the postmortem came after this statement.

What You Can Do

- Pin v2.1.109 — the last version before the Opus 4.7 model pin bypass
- Pin your model: /model claude-opus-4-6 at session start
- Start fresh sessions — don't use --resume or --continue
- Rotate sessions — the 200K budget cap silently truncates older tool results
- One terminal only — multiple terminals don't share cache

Self-diagnosis guide: 09_QUICKSTART.md

Independent Corroboration

After I published the original version of this analysis, several independent researchers brought their own data. Their findings both confirmed and corrected mine.

seanGSISG: 178K calls that changed my conclusion

@seanGSISG contributed 178,009 API calls from a separate Max 20x account spanning December 2025 through April 2026 — the five months of "before-data" my proxy couldn't capture. (Full analysis with 6 scripts.)

Their JSONL logs contain actual content blocks, allowing direct measurement of thinking tokens via character-based heuristics. Thinking tokens account for an estimated 0.0–0.1% of total quota — not the primary cause I originally hypothesized.

Instead, their data is consistent with a quota formula change: modeling the cache_read weight at 0x (old) vs 1x (new) produces a 10–15x multiplier that matches observed behavior. Under the 0x model, zero days exceeded the budget in their entire 5-month dataset. Under the 1x model, 18 days exceeded it.

Our per-1% measurements converge independently:

| Metric           | My data (42K calls, Apr) | seanGSISG (178K calls, Dec–Apr) |
|------------------|--------------------------|---------------------------------|
| CacheRead per 1% | 1.5M–2.1M                | 1.62M–1.72M                     |
| cache_read share | 96–99%                   | 89.8–95.2%                      |

Anthropic has not confirmed or denied a quota formula change. The 0x→1x model is seanGSISG's best-fit hypothesis based on observed data — not a confirmed fact. The slight gap in cache_read share reflects different measurement approaches (active-window per-1% vs monthly averages including cold starts).
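To see why the weighting dominates, here is the hypothesis as plain arithmetic. The daily mix below is illustrative, not seanGSISG's data; it simply places cache_read at 92% of volume, inside their observed 89.8–95.2% band:

```python
def quota_units(day, cache_read_weight):
    """Quota units consumed in one day under a given cache_read weight.
    The other three categories are weighted 1x under both hypotheses;
    only the cache_read weight differs (0x = old, 1x = new)."""
    return (day["input"] + day["output"] + day["cache_creation"]
            + cache_read_weight * day["cache_read"])

# Illustrative day: cache_read at 92% of total volume.
day = {"input": 4e6, "output": 2e6, "cache_creation": 10e6,
       "cache_read": 184e6}

old = quota_units(day, cache_read_weight=0)   # 16M units
new = quota_units(day, cache_read_weight=1)   # 200M units
print(f"{new / old:.1f}x")                    # 12.5x, inside the 10-15x band
```

A day that sits comfortably under any fixed budget at 0x can blow through the same budget at 1x, which is the shape of their zero-days-exceeded vs 18-days-exceeded split.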
Other independent datasets:

- @cnighswonger — 14K+ calls with an in-process interceptor (claude-code-cache-fix). Measured Opus 4.7 burn, confirmed the Explore subagent = Haiku by default, compared 4 agents across 21 days.
- @fgrosswig — 18-day gateway forensics (claude-gateway). Ran a simultaneous Opus 4.6/4.7 A/B test on the same account.
- @wpank — 47,810 requests, $10,700 total spend across version comparisons.
- @edimuj — 3.5M tokens measuring token waste. Built tokenlean.

19 more contributors discovered bugs, built tools, and verified findings — credited in the full contributor list.

Each dataset was collected independently — different people, machines, subscription plans (Max 20x / Max 5x / Pro), and regions. Numbers are never aggregated across datasets.

Full Data

Everything is open. Reproduce it yourself.

- Analysis repository: github.com/ArkNill/claude-code-hidden-problem-analysis
- Consolidated dataset: DATASET.md — four independent datasets with methodology notes and cross-comparison constraints
- Independent corroboration: Issue #3 — seanGSISG's 178K-call analysis
- Proxy tool: llm-relay — pip install llm-relay
- This post: https://arknill.github.io/blog/claude-code-thinking-token-blind-spot/

My research, supported by independent contributors. Not affiliated with or endorsed by Anthropic. All monitoring uses the official ANTHROPIC_BASE_URL proxy mechanism.
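That mechanism is small enough to sketch end to end. ANTHROPIC_BASE_URL is the official override; everything else here (the port, the relay.jsonl path, the record shape) is illustrative rather than llm-relay's actual implementation, and real captures also need SSE streaming handled, which this sketch only flags:

```python
import json
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api.anthropic.com"
SKIP_HEADERS = ("host", "content-length", "accept-encoding")

class Relay(BaseHTTPRequestHandler):
    """Forward every request to the real API, logging model and usage."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        headers = {k: v for k, v in self.headers.items()
                   if k.lower() not in SKIP_HEADERS}
        req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                     headers=headers, method="POST")
        try:
            with urllib.request.urlopen(req) as resp:
                status, payload = resp.status, resp.read()
        except urllib.error.HTTPError as err:
            status, payload = err.code, err.read()
        try:  # non-streaming responses carry a usage block we can log
            record = {"model": json.loads(body).get("model", ""),
                      "usage": json.loads(payload).get("usage", {})}
            with open("relay.jsonl", "a") as log:
                log.write(json.dumps(record) + "\n")
        except json.JSONDecodeError:
            pass  # SSE streams would need per-event parsing instead
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8377), Relay).serve_forever()
```

Run it, export ANTHROPIC_BASE_URL=http://127.0.0.1:8377, start a session, and every call lands in relay.jsonl ready for the share, truncation, and model tallies above.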