Benchmarks on ArkNill

Recent content in Benchmarks on ArkNill
https://arknill.github.io/tags/benchmarks/

Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness
Mon, 27 Apr 2026 00:00:00 +0000
https://arknill.github.io/blog/testing-claude-code-against-local-35b/
I built three benchmark harnesses to compare Claude Code and Codex against local Qwen 35B models. The harness had more bugs than the models. Here's the v1→v7 journey: 55 tasks, 290 tests, and what 'ALL_FAIL 7→0' taught me about evaluation.

What tok/s Doesn't Tell You: Measuring LLM Speed That Matters
Fri, 24 Apr 2026 00:00:00 +0000
https://arknill.github.io/blog/what-tok-s-doesnt-tell-you/
My 204 tok/s GPU feels slower than a 65 tok/s one for some tasks. tok/s alone is a misleading metric. Here's a framework (TTR, Effective tok/s, TCT) that measures what users actually experience.