Open Source · Importance: High

Qwen local agent tests show long context hurts tool calling

r/LocalLLaMA · Jun 9, 2026

Several smaller, quantized GGUF versions of Qwen3.6-35B-A3B were compared on tool calling, meaning how well an AI agent chooses and uses external tools. The test covered three ByteShape versions and five Unsloth versions, with file sizes ranging from 13.2GB to 29.3GB. There were 144 runs in total.

The short-context setup used about 5,000 tokens. The long-context setup added about 122,000 tokens on top of that to simulate a busy agent with a long conversation and tool-call history. ByteShape GPU-5 had the best average score, and ByteShape CPU-5 had the worst. There was no clear overall winner between ByteShape and Unsloth.

Quantizing the KV cache to q8_0 scored almost the same as f16, while q4_0 was slightly worse, though not dramatically. Long context caused a large drop in tool-calling scores across all setups, with an average gap of almost 10 points.
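To show why q8_0 is attractive for memory, here is a minimal sketch that estimates KV cache size for a context of roughly 127,000 tokens (the 5,000-token base plus the 122,000 added tokens). It assumes ggml's block layouts: q8_0 stores 32 int8 values plus one fp16 scale (34 bytes), and q4_0 stores 32 4-bit values plus one fp16 scale (18 bytes). The layer count, KV head count, and head size below are hypothetical placeholders, not Qwen3.6-35B-A3B's real configuration. If a model mixes in linear-attention layers, only its full-attention layers should be counted.

```python
# Bytes per block of 32 elements, assuming ggml-style KV cache types.
BLOCK = 32
BYTES_PER_BLOCK = {
    "f16": 2 * BLOCK,        # 64 bytes: 32 half-precision values
    "q8_0": BLOCK + 2,       # 34 bytes: 32 int8 values + one fp16 scale
    "q4_0": BLOCK // 2 + 2,  # 18 bytes: 32 packed 4-bit values + one fp16 scale
}


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, cache_type):
    """Approximate combined K and V cache size in bytes for standard attention."""
    elems = 2 * n_tokens * n_layers * n_kv_heads * head_dim  # factor 2: K and V
    assert elems % BLOCK == 0, "quantized types store whole 32-element blocks"
    return elems // BLOCK * BYTES_PER_BLOCK[cache_type]


if __name__ == "__main__":
    # Hypothetical model dimensions, for illustration only.
    dims = dict(n_layers=48, n_kv_heads=4, head_dim=128)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(127_000, cache_type=cache_type, **dims)
        print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

Whatever the real dimensions, the ratios hold: q8_0 needs 34/64 (about 53%) of the f16 cache, and q4_0 needs 18/64 (about 28%). That is why a q8_0 cache that scores close to f16 frees a large share of memory at long context lengths.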

Key points

The tests compared three ByteShape GGUF models and five Unsloth GGUF models of Qwen3.6-35B-A3B.
ByteShape GPU-5 had the best average result, while ByteShape CPU-5 had the worst.
q8_0 KV cache performed almost the same as f16, making it a useful memory-saving option.
q4_0 KV cache was slightly weaker, but the loss was small.
Long context reduced tool-calling scores across all tested setups.
