Testing on ArkNill

Recent content in Testing on ArkNill: https://arknill.github.io/tags/testing/

Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness
https://arknill.github.io/blog/testing-claude-code-against-local-35b/
Mon, 27 Apr 2026 00:00:00 +0000

I built three benchmark harnesses to compare Claude Code and Codex against local Qwen 35B models. The harness had more bugs than the models. Here's the v1→v7 journey: 55 tasks, 290 tests, and what 'ALL_FAIL 7→0' taught me about evaluation.