<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Testing on ArkNill</title><link>https://arknill.github.io/tags/testing/</link><description>Recent content in Testing on ArkNill</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 27 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://arknill.github.io/tags/testing/index.xml" rel="self" type="application/rss+xml"/><item><title>Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness</title><link>https://arknill.github.io/blog/testing-claude-code-against-local-35b/</link><pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate><guid>https://arknill.github.io/blog/testing-claude-code-against-local-35b/</guid><description>I built three benchmark harnesses to compare Claude Code and Codex against local Qwen 35B models. The harness had more bugs than the models. Here's the v1→v7 journey — 55 tasks, 290 tests, and what 'ALL_FAIL 7→0' taught me about evaluation.</description></item></channel></rss>