Open Source · Importance: High

OSCAR shares 2-bit KV cache tools for cheaper long AI runs

r/LocalLLaMA · Jun 10, 2026

An r/LocalLLaMA post shares new OSCAR materials for KV cache quantization. The post links GGUF downloads for Gemma and Qwen models, code for llama.cpp and SGLang, and the OSCAR paper. The paper says its INT2 method cuts KV cache memory by about 8x and can raise throughput under the same memory limit.

Key points

OSCAR targets KV cache size during long-context LLM serving.
The post links GGUF files for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking-2507.
It also links code branches for llama.cpp and SGLang.
The paper says INT2 KV cache can reduce memory use by about 8x.
The paper reports speed gains in some serving settings compared with BF16.
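
The "about 8x" figure matches simple bit-width arithmetic: BF16 stores each cached value in 16 bits, INT2 in 2. The sketch below works through that calculation. The model shape is hypothetical and chosen only for illustration, not taken from the paper or from any of the linked models, and the function ignores the per-group scale metadata that quantized formats typically store alongside the codes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes needed to hold the keys and values for n_tokens of context."""
    # Factor 2 = one key vector plus one value vector per head, per layer.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * n_tokens * bits_per_value / 8

# Hypothetical model shape (an assumption, not from the paper).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=131072)

bf16 = kv_cache_bytes(**shape, bits_per_value=16)
int2 = kv_cache_bytes(**shape, bits_per_value=2)

print(f"BF16: {bf16 / 2**30:.1f} GiB")   # BF16: 16.0 GiB
print(f"INT2: {int2 / 2**30:.1f} GiB")   # INT2: 2.0 GiB
print(f"ratio: {bf16 / int2:.0f}x")      # ratio: 8x
```

Under a fixed memory budget, a smaller cache per request leaves room for more concurrent requests or longer contexts, which is one plausible reason for the throughput gains the paper reports.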

Quick term guide

r/LocalLLaMA: A Reddit community focused on running AI language models on personal hardware.
quantization: A way to shrink an AI model or its working data by reducing the precision of its numbers, trading a little quality for a much smaller memory footprint.
llama.cpp: A free, open-source program for running AI language models on local hardware, including machines without a GPU.
AI agents: AI tools that can carry out steps toward a goal, not just answer once.
workloads: The tasks a computer is expected to handle.
long context: A large amount of text or conversation history that an AI keeps in view and processes at once.
compressed: Reduced so it takes less data or processing work.
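
To make the quantization entry concrete, here is a minimal sketch of generic 2-bit quantization: each number in a small group is mapped to one of four integer codes on a min/max grid, then mapped back approximately. This is a textbook scheme shown only for illustration. It is not the OSCAR method, and the function names are hypothetical.

```python
def quantize_2bit(values):
    """Map floats to integer codes 0..3 on a min/max grid (generic scheme)."""
    lo, hi = min(values), max(values)
    # Four levels means three steps; fall back to 1.0 if all values are equal.
    scale = (hi - lo) / 3 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo):
    """Recover approximate floats from the 2-bit codes."""
    return [lo + c * scale for c in codes]

group = [-1.5, -0.5, 0.1, 0.4, 1.5]
codes, scale, lo = quantize_2bit(group)
restored = dequantize_2bit(codes, scale, lo)

print(codes)     # [0, 1, 2, 2, 3]
print(restored)  # [-1.5, -0.5, 0.5, 0.5, 1.5]
```

Each code fits in 2 bits, but 0.1 and 0.4 both come back as 0.5. That rounding error is the quality cost that low-bit methods aim to keep small.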
