Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness

I run Claude Code (Opus 4.6) as my primary coding tool and pay $200/month for it. I also run Qwen 3.5/3.6 35B locally on two DGX Sparks and an RTX 5090. Natural question: how does a local 35B model compare to the commercial tool I’m paying for? To find out, I built three separate benchmark harnesses over 10 days. The journey taught me more about evaluation methodology than about the models themselves — because the harness had more bugs than the models did. ...
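
Below is a minimal sketch of the shape the harness took: fire the same coding task at two OpenAI-compatible chat endpoints and record both answers for a later scoring pass. The URLs and model names are placeholders, not my actual wiring, and Claude Code itself is a CLI rather than a raw endpoint, so in practice it needs its own adapter.

```python
# Sketch of a cross-check harness: same task, two backends, raw outputs
# saved for a separate scoring pass. All endpoints/models are hypothetical.
import json
import os
import requests

ENDPOINTS = {
    "local-qwen": {"url": "http://192.168.1.20:8000/v1/chat/completions",
                   "model": "qwen-35b"},
    "claude":     {"url": "https://api.example.com/v1/chat/completions",
                   "model": "claude-opus"},
}

def run_task(name: str, prompt: str, timeout: int = 120) -> str:
    cfg = ENDPOINTS[name]
    resp = requests.post(
        cfg["url"],
        json={"model": cfg["model"],
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0.0},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def cross_check(task_id: str, prompt: str) -> dict:
    # Run the same prompt on both backends; scoring (tests, diffs)
    # happens in a separate pass over the saved records.
    os.makedirs("results", exist_ok=True)
    record = {"task": task_id,
              "answers": {name: run_task(name, prompt) for name in ENDPOINTS}}
    with open(f"results/{task_id}.json", "w") as f:
        json.dump(record, f, indent=2)
    return record
```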

April 27, 2026 · 6 min · ArkNill

I Built a 3-Node Home LLM Lab. Here's What It Actually Takes.

I run a 3-node local LLM inference cluster at home. Two NVIDIA DGX Sparks (128GB unified memory each) and one RTX 5090 desktop (32GB VRAM). All three serve Qwen 3.5/3.6 35B MoE models 24/7 over my local network. This isn’t a weekend experiment — it’s my daily development infrastructure. Every code review, every research query, every benchmark runs against these nodes. Here’s what the setup looks like, what it costs, and what I learned that no spec sheet tells you. ...
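
As a taste of what "daily infrastructure" means in practice, here is a sketch of the kind of health check that runs against the three nodes. The hostnames and ports are illustrative, not my actual network layout; both vLLM and llama.cpp's server expose an OpenAI-compatible `/v1/models` route.

```python
# Poll each node's OpenAI-compatible /v1/models route and report status.
# Hostnames/ports are placeholders for the three-node setup described above.
import requests

NODES = {
    "spark-1": "http://spark-1.local:8000",   # DGX Spark, vLLM
    "spark-2": "http://spark-2.local:8000",   # DGX Spark, vLLM
    "rtx5090": "http://rtx5090.local:8080",   # desktop, llama.cpp
}

def check_nodes() -> dict:
    status = {}
    for name, base in NODES.items():
        try:
            r = requests.get(f"{base}/v1/models", timeout=5)
            status[name] = "up" if r.ok else f"http {r.status_code}"
        except requests.RequestException as exc:
            status[name] = f"down ({type(exc).__name__})"
    return status

if __name__ == "__main__":
    for node, state in check_nodes().items():
        print(f"{node:8s} {state}")
```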

April 26, 2026 · 5 min · ArkNill

Quantization, Determinism, and Thinking Tokens: Running Open-Source LLMs in Production

I run Qwen 3.5 and 3.6 (35B MoE, 3B active parameters) in production across three nodes: two DGX Sparks (FP8, vLLM) and one RTX 5090 (Q4, llama.cpp). After 100+ benchmark scenarios and thousands of inference calls, three problems dominated my debugging time:

1. Quantization loss is not uniform: MoE models at Q4 lose 16% on CJK tasks.
2. vLLM is non-deterministic under speculative decoding: identical prompts produce different outputs.
3. Thinking tokens consume 60–90% of the budget on tasks where they provide zero benefit.

None of these show up in standard benchmarks. All of them break production workflows. ...
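
The non-determinism claim is easy to probe yourself. A minimal sketch, assuming an OpenAI-compatible vLLM endpoint (the URL and model name are placeholders): send the same prompt N times at temperature 0 with a fixed seed and count distinct completions.

```python
# Determinism probe: N identical requests at temperature 0, fixed seed,
# then count how many distinct completions come back. URL/model are
# placeholders for one of the vLLM nodes.
from collections import Counter
import requests

URL = "http://spark-1.local:8000/v1/chat/completions"  # hypothetical
MODEL = "qwen-35b"                                     # hypothetical

def probe(prompt: str, n: int = 20) -> Counter:
    outputs = Counter()
    for _ in range(n):
        r = requests.post(URL, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
            "seed": 42,   # fixed seed; still no guarantee under
        }, timeout=120)   # speculative decoding or dynamic batching
        r.raise_for_status()
        outputs[r.json()["choices"][0]["message"]["content"]] += 1
    return outputs

if __name__ == "__main__":
    results = probe("List the first 10 primes.")
    print(f"{len(results)} distinct outputs across {sum(results.values())} runs")
```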

April 25, 2026 · 6 min · ArkNill

What tok/s Doesn't Tell You: Measuring LLM Speed That Matters

I run Qwen 3.6 35B on three machines. The RTX 5090 generates at 204 tok/s. The DGX Spark pair generates at 65 tok/s. By every benchmark leaderboard metric, the 5090 is 3x faster. But for multi-step coding tasks with thinking enabled, the DGX pair completes the job faster. And for single-turn questions, the 5090 delivers the answer in under 2 seconds while the DGX takes 8–12 seconds. tok/s alone told me nothing useful about actual user experience. Here’s what I learned building benchmarks for all three nodes. ...
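
The numbers that did predict user experience were time-to-first-token and end-to-end latency, not raw generation speed. A minimal sketch of measuring both over a streaming OpenAI-compatible endpoint (the URL and model are placeholders, and counting SSE chunks is only a rough token proxy):

```python
# Measure time-to-first-token (TTFT) and end-to-end time for one streaming
# request. One SSE "data:" chunk is treated as roughly one token.
import time
import requests

def measure(url: str, model: str, prompt: str) -> dict:
    t0 = time.perf_counter()
    ttft = None
    tokens = 0
    with requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }, stream=True, timeout=300) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":
                break
            if ttft is None:
                ttft = time.perf_counter() - t0
            tokens += 1
    total = time.perf_counter() - t0
    return {"ttft_s": ttft, "total_s": total,
            "tok_per_s": tokens / total if total else 0.0}
```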

April 24, 2026 · 5 min · ArkNill