<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Local-Llm on ArkNill</title>
    <link>https://arknill.github.io/tags/local-llm/</link>
    <description>Recent content in Local-Llm on ArkNill</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 27 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://arknill.github.io/tags/local-llm/index.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Testing Claude Code Against Local 35B Models: Building a Cross-Check Harness</title>
      <link>https://arknill.github.io/blog/testing-claude-code-against-local-35b/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arknill.github.io/blog/testing-claude-code-against-local-35b/</guid>
      <description>I built three benchmark harnesses to compare Claude Code and Codex against local Qwen 35B models. The harness had more bugs than the models. Here's the v1→v7 journey — 55 tasks, 290 tests, and what 'ALL_FAIL 7→0' taught me about evaluation.</description>
    </item>
    <item>
      <title>I Built a 3-Node Home LLM Lab. Here's What It Actually Takes.</title>
      <link>https://arknill.github.io/blog/3-node-home-llm-lab/</link>
      <pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arknill.github.io/blog/3-node-home-llm-lab/</guid>
      <description>Two DGX Sparks (128GB each) and an RTX 5090 desktop — running Qwen 3.5/3.6 35B in production. Hardware choices, real costs, operational lessons, and why three nodes beat one big one.</description>
    </item>
    <item>
      <title>Quantization, Determinism, and Thinking Tokens: Running Open-Source LLMs in Production</title>
      <link>https://arknill.github.io/blog/quantization-determinism-thinking-production/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arknill.github.io/blog/quantization-determinism-thinking-production/</guid>
      <description>FP8 is the production floor. Q4 MoE loses 16% on CJK. vLLM is non-deterministic under MTP. Thinking tokens eat 90% of your budget on the wrong tasks. Hard lessons from operating Qwen 3.5/3.6 35B across 3 nodes.</description>
    </item>
    <item>
      <title>What tok/s Doesn't Tell You: Measuring LLM Speed That Matters</title>
      <link>https://arknill.github.io/blog/what-tok-s-doesnt-tell-you/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://arknill.github.io/blog/what-tok-s-doesnt-tell-you/</guid>
      <description>My 204 tok/s GPU feels slower than a 65 tok/s one for some tasks. tok/s alone is a misleading metric — here's a framework (TTR, Effective tok/s, TCT) that measures what users actually experience.</description>
    </item>
  </channel>
</rss>