Testing

I run Claude Code (Opus 4.6) as my primary coding tool and pay $200/month for it. I also run Qwen 3.5/3.6 35B locally on two DGX Sparks and an RTX 5090. The natural question: how does a local 35B model compare to the commercial tool I’m paying for? To find out, I built three separate benchmark harnesses over 10 days. The journey taught me more about evaluation methodology than about the models themselves, because the harnesses had more bugs than the models did. ...