Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness

I run Claude Code (Opus 4.6) as my primary coding tool and pay $200/month for it. I also run Qwen 3.5/3.6 35B locally on two DGX Sparks and an RTX 5090. Natural question: how does a local 35B model compare to the commercial tool I’m paying for? To find out, I built three separate benchmark harnesses over 10 days. The journey taught me more about evaluation methodology than about the models themselves — because the harness had more bugs than the models did. ...
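To make that concrete, here is the cross-check loop in miniature. Everything in it is a placeholder: the endpoint URLs, the model ids, and the toy checker standing in for real grading logic.

```python
# Sketch of the cross-check loop: same task to each endpoint, same checker
# on every answer. URLs, model ids, and the checker are placeholders.
import requests

ENDPOINTS = {
    "local-qwen": ("http://192.168.1.10:8000/v1/chat/completions", "qwen-35b"),
    "rtx5090":    ("http://192.168.1.12:8080/v1/chat/completions", "qwen-35b-q4"),
}

def run(url: str, model: str, prompt: str) -> str:
    r = requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def check(answer: str) -> bool:
    # A real harness would execute the returned code against unit tests.
    return "def " in answer  # toy stand-in

task = "Write a Python function that reverses a singly linked list."
for name, (url, model) in ENDPOINTS.items():
    verdict = "PASS" if check(run(url, model, task)) else "FAIL"
    print(f"{name}: {verdict}")
```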

April 27, 2026 · 6 min · ArkNill

I Built a 3-Node Home LLM Lab. Here's What It Actually Takes.

I run a 3-node local LLM inference cluster at home. Two NVIDIA DGX Sparks (128GB unified memory each) and one RTX 5090 desktop (32GB VRAM). All three serve Qwen 3.5/3.6 35B MoE models 24/7 over my local network. This isn’t a weekend experiment — it’s my daily development infrastructure. Every code review, every research query, every benchmark runs against these nodes. Here’s what the setup looks like, what it costs, and what I learned that no spec sheet tells you. ...
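Both vLLM and llama.cpp's server can expose an OpenAI-compatible API, so a cluster health check fits in a few lines. A sketch, with illustrative IPs and ports rather than my actual network layout:

```python
# Poll each node's OpenAI-compatible /v1/models endpoint.
# Addresses are illustrative placeholders.
import requests

NODES = {
    "spark-1": "http://192.168.1.10:8000",  # DGX Spark, vLLM
    "spark-2": "http://192.168.1.11:8000",  # DGX Spark, vLLM
    "rtx5090": "http://192.168.1.12:8080",  # llama.cpp server
}

for name, base in NODES.items():
    try:
        r = requests.get(f"{base}/v1/models", timeout=5)
        r.raise_for_status()
        ids = [m["id"] for m in r.json().get("data", [])]
        print(f"{name}: up, serving {ids}")
    except requests.RequestException as exc:
        print(f"{name}: DOWN ({exc})")
```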

April 26, 2026 · 5 min · ArkNill

Quantization, Determinism, and Thinking Tokens: Running Open-Source LLMs in Production

I run Qwen 3.5 and 3.6 (35B MoE, 3B active parameters) in production across three nodes: two DGX Sparks (FP8, vLLM) and one RTX 5090 (Q4, llama.cpp). After 100+ benchmark scenarios and thousands of inference calls, three problems dominated my debugging time:

- Quantization loss is not uniform: MoE models at Q4 lose 16% on CJK tasks.
- vLLM is non-deterministic under speculative decoding: identical prompts produce different outputs.
- Thinking tokens consume 60–90% of the budget on tasks where they provide zero benefit.

None of these show up in standard benchmarks. All of them break production workflows. ...
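The determinism issue is the easiest of the three to probe for. A minimal sketch, assuming a vLLM node at a placeholder address: send the identical request repeatedly at temperature 0 and count distinct outputs.

```python
# Determinism probe: N identical requests, count distinct completions.
# URL and model id are placeholders.
import requests

URL = "http://192.168.1.10:8000/v1/chat/completions"  # vLLM node
payload = {
    "model": "qwen-35b",
    "messages": [{"role": "user", "content": "List three prime numbers."}],
    "temperature": 0,
    "seed": 42,        # honored by vLLM, but not a determinism guarantee
    "max_tokens": 64,
}

outputs = set()
for _ in range(20):
    r = requests.post(URL, json=payload, timeout=120)
    r.raise_for_status()
    outputs.add(r.json()["choices"][0]["message"]["content"])

print(f"{len(outputs)} distinct output(s) across 20 identical calls")
```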

April 25, 2026 · 6 min · ArkNill

What tok/s Doesn't Tell You: Measuring LLM Speed That Matters

I run Qwen 3.6 35B on three machines. The RTX 5090 generates at 204 tok/s. The DGX Spark pair generates at 65 tok/s. By every benchmark leaderboard metric, the 5090 is 3x faster. But for multi-step coding tasks with thinking enabled, the DGX pair completes the job faster. And for single-turn questions, the 5090 delivers the answer in under 2 seconds while the DGX takes 8–12 seconds. tok/s alone told me nothing useful about actual user experience. Here’s what I learned building benchmarks for all three nodes. ...
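What did track user experience was time-to-first-token together with end-to-end completion time. A sketch of measuring both over a streaming request, with a placeholder URL and model id:

```python
# Measure TTFT and total latency over an OpenAI-compatible SSE stream.
import json
import time
import requests

URL = "http://192.168.1.12:8080/v1/chat/completions"  # placeholder node

start = time.perf_counter()
first_token_at = None
chunks = 0
with requests.post(URL, json={
    "model": "qwen-35b",  # placeholder model id
    "messages": [{"role": "user", "content": "Explain TCP slow start briefly."}],
    "stream": True,
    "max_tokens": 256,
}, stream=True, timeout=300) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        delta = json.loads(line[len(b"data: "):])["choices"][0]["delta"]
        if delta.get("content"):
            chunks += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()

total = time.perf_counter() - start
ttft = first_token_at - start if first_token_at else float("nan")
print(f"TTFT {ttft:.2f}s | total {total:.2f}s | {chunks} content chunks")
```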

April 24, 2026 · 5 min · ArkNill

Anthropic's Postmortem Told Half the Truth

On April 23, Anthropic published a postmortem acknowledging three product-layer bugs that degraded Claude Code from March 4 through April 20. They frame it as: model weights unchanged, harness bugs fixed, problem solved. The three bugs are real. Their impact was real. The fixes were real. But the postmortem is a carefully scoped document that tells half the truth. Here’s the other half.

What They Admitted (Correctly)

Bug                              Introduced            Fixed                 Duration
Effort high → medium             March 4 (v2.1.68)     April 21 (v2.1.117)   48 days
Thinking cache clear every turn  March 26 (v2.1.85)    April 10 (v2.1.101)   15 days
“≤25 words” system prompt        April 16 (v2.1.111)   April 20 (v2.1.116)   4 days

All three are product-layer issues: API parameters and system prompts, not model weights. The combined effect: the effort downgrade reduced thinking depth, the cache clear destroyed session memory, and the word limit truncated output. Together, these made Claude appear significantly dumber. ...

April 23, 2026 · 8 min · ArkNill

Opus 4.7 Postmortem: What the Changelog Didn't Say

On April 23, Anthropic published a postmortem acknowledging three product-layer bugs that degraded Claude Code from March 4 through April 20. No model weights were changed — all three issues were in the harness/product layer. I cross-checked every claim against the public CHANGELOG (3,285 lines, v2.1.68–v2.1.119), 8 GitHub issues via gh issue view, and 10 external sources. 36 claims checked — 28 confirmed, 5 partially confirmed, 3 not relied upon. Here’s what the postmortem says, what the CHANGELOG actually shows, and what still isn’t fixed. ...
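The mechanics behind those numbers are reproducible by anyone. A sketch of the two lookups involved, with a placeholder changelog path and issue number:

```python
# Cross-check helpers: grep the changelog for a version a claim names,
# and pull a cited GitHub issue via the GitHub CLI.
import subprocess

def changelog_mentions(version: str, path: str = "CHANGELOG.md") -> bool:
    # Does the release show up in the public changelog at all?
    with open(path, encoding="utf-8") as f:
        return any(version in line for line in f)

def fetch_issue(number: int) -> str:
    # `gh issue view N --json title,body` returns the issue as JSON.
    result = subprocess.run(
        ["gh", "issue", "view", str(number), "--json", "title,body"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

for version in ("v2.1.68", "v2.1.85", "v2.1.111"):
    print(version, "found" if changelog_mentions(version) else "NOT FOUND")
print(fetch_issue(1234))  # placeholder issue number
```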

April 22, 2026 · 5 min · ArkNill

I Tracked 42,363 Claude Code API Calls. Here's Where Your Quota Actually Goes.

I pay $200/month for Claude Code Max 20. On April 1, my quota hit 100% in 70 minutes during normal coding. That turned out to be two cache bugs — Anthropic fixed them in v2.1.90–91, and it made a real difference. But even after the fix, I wanted to understand where the quota actually goes. So I filed issues, dug into community threads, and built a transparent proxy to measure every API call. ...
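A transparent proxy here just means a pass-through server that forwards /v1/messages upstream and records the usage block from each response before returning it. A minimal non-streaming sketch (Flask; SSE responses need separate handling), which works by pointing the client's base URL, e.g. ANTHROPIC_BASE_URL, at it:

```python
# Pass-through proxy: forward /v1/messages to the Anthropic API,
# log each response's `usage` block to a JSONL file.
from flask import Flask, Response, request
import json
import time
import requests

app = Flask(__name__)
UPSTREAM = "https://api.anthropic.com"

@app.post("/v1/messages")
def messages():
    upstream = requests.post(
        f"{UPSTREAM}/v1/messages",
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        data=request.get_data(),
        timeout=600,
    )
    try:
        usage = upstream.json().get("usage", {})
        with open("usage.jsonl", "a") as f:
            f.write(json.dumps({"ts": time.time(), **usage}) + "\n")
    except ValueError:
        pass  # streaming (SSE) bodies are not JSON; handle separately
    return Response(
        upstream.content,
        status=upstream.status_code,
        content_type=upstream.headers.get("content-type"),
    )

if __name__ == "__main__":
    app.run(port=8787)
```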

April 6, 2026 · 8 min · ArkNill