<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Benchmarks on ArkNill</title><link>https://arknill.github.io/tags/benchmarks/</link><description>Recent content in Benchmarks on ArkNill</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 27 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://arknill.github.io/tags/benchmarks/index.xml" rel="self" type="application/rss+xml"/><item><title>Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness</title><link>https://arknill.github.io/blog/testing-claude-code-against-local-35b/</link><pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate><guid>https://arknill.github.io/blog/testing-claude-code-against-local-35b/</guid><description>I built three benchmark harnesses to compare Claude Code and Codex against local Qwen 35B models. The harness had more bugs than the models. Here&amp;#39;s the v1→v7 journey — 55 tasks, 290 tests, and what &amp;#39;ALL_FAIL 7→0&amp;#39; taught me about evaluation.</description></item><item><title>What tok/s Doesn't Tell You: Measuring LLM Speed That Matters</title><link>https://arknill.github.io/blog/what-tok-s-doesnt-tell-you/</link><pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate><guid>https://arknill.github.io/blog/what-tok-s-doesnt-tell-you/</guid><description>My 204 tok/s GPU feels slower than a 65 tok/s one for some tasks. tok/s alone is a misleading metric — here&amp;#39;s a framework (TTR, Effective tok/s, TCT) that measures what users actually experience.</description></item></channel></rss>