A dual-GPU setup question for faster local AI models
A user reports running local AI models on two GPUs: a 12GB 3080 Ti and a 20GB 3080. They say generation speed depends heavily on whether the model data and KV cache fit entirely in GPU memory. After changing cache settings so that more data stayed in GPU memory, speed rose from about 20 t/s to about 70 t/s (tokens per second). The user asks for advice on using two GPUs with unequal memory together.
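As a rough illustration of what an uneven split could look like, the sketch below divides work in proportion to each card's usable memory. The post does not name the inference software, so this is only an assumption: some runtimes accept per-GPU proportions (llama.cpp's `--tensor-split` option is one example). The function name `split_proportions` and the 1 GiB per-card reserve are hypothetical choices made for this example, not values from the post.

```python
def split_proportions(vram_gib, reserve_gib=1.0):
    """Return each GPU's share of the model, proportional to usable memory.

    reserve_gib is an assumed safety margin per card for the driver,
    display output, and scratch buffers. It is not a measured value.
    """
    usable = [max(v - reserve_gib, 0.0) for v in vram_gib]
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable GPU memory after applying the reserve")
    return [u / total for u in usable]


# The two cards from the question: 12 GB and 20 GB.
proportions = split_proportions([12, 20])

for size, share in zip([12, 20], proportions):
    print(f"{size} GB card: {share:.0%} of the model")
```

A proportional split is only a starting point. The card that also holds extra buffers or drives a display may need a smaller share than its raw memory size suggests.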
Key points
- The user is using two different GPUs: one with 12GB memory and one with 20GB memory.
- They say performance improves when the model data and KV cache fit in GPU memory.
- Changing cache settings reportedly raised speed from about 20 t/s to about 70 t/s (tokens per second).
- Changing split mode and main GPU settings did not make much difference for them.
- The main question is whether a model stored as a 17GB file can require much more than 17GB of memory during inference.
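The last point can be made concrete with a back-of-the-envelope estimate. In a standard transformer, the KV cache stores one key vector and one value vector per layer, per KV head, per token, so it grows linearly with context length on top of the model file itself. The sketch below uses a hypothetical model shape (48 layers, 8 KV heads, head dimension 128), since the post does not say which model is in use. It also treats an 8-bit cache as exactly 1 byte per value and ignores compute buffers and other runtime overhead, so real usage will be somewhat higher.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 128

MODEL_FILE_GIB = 17        # file size from the question
TOTAL_VRAM_GIB = 12 + 20   # both cards combined, ignoring overhead

for context_len in (8192, 32768, 131072):
    for label, bytes_per_value in (("16-bit cache", 2), ("8-bit cache", 1)):
        cache_gib = kv_cache_bytes(
            N_LAYERS, N_KV_HEADS, HEAD_DIM, context_len, bytes_per_value
        ) / GIB
        needed_gib = MODEL_FILE_GIB + cache_gib
        verdict = "fits" if needed_gib <= TOTAL_VRAM_GIB else "does not fit"
        print(
            f"context {context_len:>6}, {label}: "
            f"cache {cache_gib:.2f} GiB, total {needed_gib:.2f} GiB, {verdict}"
        )
```

Under these assumptions, the cache is small at short context lengths but can exceed the size of the model file at very long ones, which is consistent with the user's report that cache settings changed how much stayed in GPU memory.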
Quick term guide
- local AI models (also: local models)
- AI models that run on your own computer or company hardware instead of a cloud service.
- AI models
- The trained programs that power an artificial intelligence tool.
- AI agents
- AI tools that can carry out steps toward a goal, not just answer once.
- performance
- How fast the model produces output, measured here in tokens per second (t/s).
- inference
- The step where a trained AI model actually produces answers or results in real use.