A dual-GPU setup question for faster local AI models

A user says they are running local AI models on two GPUs: a 12GB 3080 Ti and a 20GB 3080. They report that speed changes a lot depending on whether the model data and KV cache fit inside GPU memory. After changing cache settings so more data stayed in GPU memory, speed rose from about 20 t/s to 70 t/s (tokens per second). The user asks for advice on using two GPUs with unequal memory together.

Key points

  • The user is using two different GPUs: one with 12GB memory and one with 20GB memory.
  • They say performance improves when the model data and KV cache fit in GPU memory.
  • Changing cache settings reportedly raised speed from about 20 t/s to 70 t/s.
  • Changing split mode and main GPU settings did not make much difference for them.
  • The main question is whether a model with a 17GB file can need much more than 17GB of memory during inference.
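
The last point can be illustrated with a rough back-of-the-envelope estimate. The sketch below uses the standard size formula for a transformer KV cache (keys plus values, per layer, per token) and adds it to the 17GB of weights. The layer count, KV head count and head dimension are hypothetical placeholders, not values from the user's model, and the estimate ignores compute buffers and other runtime overhead, so real usage would be somewhat higher.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    Each token stores one key vector and one value vector
    (n_kv_heads * head_dim elements each) in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimate_total_gib(weights_gib, n_layers, n_kv_heads, head_dim,
                       context_len, bytes_per_elem=2):
    """Weights plus KV cache, in GiB. Ignores compute buffers and overhead."""
    kv_gib = kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                            context_len, bytes_per_elem) / GIB
    return weights_gib + kv_gib


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    n_layers, n_kv_heads, head_dim = 48, 8, 128

    weights_gib = 17.0         # model file size from the question (GB treated as GiB)
    total_vram_gib = 12 + 20   # the two cards from the question

    for context_len in (8192, 32768, 131072):
        # 2 bytes per element = 16-bit cache; 1 byte approximates an 8-bit
        # cache, ignoring the small overhead of quantization scales.
        for label, bytes_per_elem in (("16-bit", 2), ("8-bit", 1)):
            total = estimate_total_gib(weights_gib, n_layers, n_kv_heads,
                                       head_dim, context_len, bytes_per_elem)
            verdict = "fits" if total <= total_vram_gib else "does not fit"
            print(f"context {context_len:>6}, {label:>6} cache: "
                  f"{total:5.2f} GiB total, {verdict} in {total_vram_gib} GiB")
```

Under these assumed numbers the cache grows linearly with context length, so the same 17GB file can fit comfortably at a short context and overflow both cards combined at a very long one. A smaller cache element size halves the cache, which is consistent with the user's report that cache settings changed how much stayed in GPU memory.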

Quick term guide

local AI model
An AI model that runs on your own computer or company hardware instead of a cloud service.
AI model
The trained program that powers an AI tool and produces its output.
AI agents
AI tools that can carry out steps toward a goal, not just answer once.
performance
How fast a model generates output, measured here in tokens per second (t/s).
inference
The step where a trained AI model actually produces answers or results in real use.