Open SourceImportance: Medium

A dual-GPU setup question for faster local AI models

r/LocalLLaMAJun 12, 2026 · 4h ago

A user says they are running local AI models with a 12GB 3080 Ti and a 20GB 3080. They report that speed changes a lot depending on whether the model data and KV cache fit inside GPU memory. After changing cache settings so more data stayed in GPU memory, speed rose from about 20t/s to 70t/s. The user asks for advice on using two uneven GPUs together.

Key points

The user is using two different GPUs: one with 12GB memory and one with 20GB memory.
They say performance improves when the model data and KV cache fit in GPU memory.
Changing cache settings reportedly raised speed from about 20t/s to 70t/s.
Changing split mode and main GPU settings did not make much difference for them.
The main question is whether a 17GB model file can need much more memory during inference.

Quick term guide

local AI models: AI programs that run directly on your computer hardware instead of over the internet.
local AI model: An AI model that runs on your own computer or company hardware instead of a cloud service.
AI models: The core brain or underlying program that powers an artificial intelligence tool.
AI agents: AI agents are AI tools that can carry out steps toward a goal, not just answer once.
local models: AI models that run on your own computer or device instead of a company server.
local model: An AI model you run directly on your own computer, with no internet connection or external service needed.
performance: How fast and smoothly a site loads and works.
inference: The step where a trained AI model actually produces answers or results in real use.

Read original ↗