Mac memory limits make huge local llama.cpp runs harder

On an M2 Max Mac with 96GB of memory, llama.cpp can start pushing the KV cache into swap once a Qwen 3.5 122B q4 model reaches about 91 to 92GB of memory use. The setup uses a very large context of 150,000 tokens, one parallel request, `--mlock`, `-ngl 99`, `-fa on`, and `--cache-ram 6000`. The main point is that llama.cpp flags, `--mlock` included, cannot fully control how macOS decides to move memory around.
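To make the setup concrete, here is a minimal sketch that assembles the flags listed above into one command line. The binary name `llama-server`, the `-m`, `-c`, and `-np` spellings for model, context, and parallel requests, and the model path are assumptions for illustration; only the remaining flags are taken from the description above.

```python
import shlex

def build_llama_command(model_path, ctx_size=150_000, parallel=1,
                        cache_ram=6000, extra_args=()):
    """Return the argument list for the run described in the text.

    The binary name and model path are assumptions; adjust to your build.
    """
    args = [
        "llama-server",              # assumed binary name
        "-m", model_path,            # hypothetical model file
        "-c", str(ctx_size),         # context size in tokens
        "-np", str(parallel),        # one parallel request
        "--mlock",                   # ask the OS to keep memory resident
        "-ngl", "99",                # offload all layers to the GPU
        "-fa", "on",                 # flash attention
        "--cache-ram", str(cache_ram),
    ]
    args.extend(extra_args)
    return args

# Hypothetical path, shown only so the command prints as one line.
cmd = build_llama_command("models/qwen-122b-q4.gguf")
print(shlex.join(cmd))
```

Passing `extra_args=("-ctk", "q8_0", "-ctv", "q8_0")` appends extra flags without changing the rest of the setup.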

Linux has a direct way to turn swap off, but the same command does not work on macOS. One practical workaround is to quantize the KV cache with `-ctk q8_0` and `-ctv q8_0`, which can cut KV memory use by roughly half compared with the default f16 cache. The tradeoff is possible accuracy loss for precision-heavy work like coding, and slower token generation at very long context.
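The "roughly half" figure can be checked with a little arithmetic. The sketch below estimates KV cache size for f16 and q8_0, assuming every layer uses standard attention with a full KV cache (hybrid architectures may need less). The model dimensions are hypothetical placeholders, not the real values for the model discussed here; the real ones come from the model's metadata.

```python
# Bytes per cached element. q8_0 stores blocks of 32 int8 values plus one
# 2-byte f16 scale, so 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate combined K and V cache size in bytes for one sequence."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]

def to_gib(n_bytes):
    return n_bytes / 1024**3

# HYPOTHETICAL dimensions, for illustration only.
model = dict(n_layers=48, n_kv_heads=8, head_dim=128)

f16 = kv_cache_bytes(150_000, cache_type="f16", **model)
q8 = kv_cache_bytes(150_000, cache_type="q8_0", **model)

print(f"f16 : {to_gib(f16):.1f} GiB")
print(f"q8_0: {to_gib(q8):.1f} GiB")
print(f"q8_0 / f16 = {q8 / f16:.3f}")
```

Whatever the real dimensions are, the ratio stays at 34/64, or about 0.53, because only the bytes per element change. The estimate is also linear in `n_ctx`, so halving the context halves the cache.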

Another option is raising the Mac's GPU wired-memory limit with a command such as `sudo sysctl iogpu.wired_limit_mb=92160` (90GB), but setting it too high can leave too little room for the operating system, and the setting resets after a restart. Lowering the 150,000-token context is the simplest fix if that much context is not truly needed.
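The value 92160 in that command is 90 × 1024, which on a 96GB machine leaves 6GB outside the limit. A small sketch of that arithmetic follows; the 6GB reserve is inferred from the example value rather than a documented recommendation, and the function names are illustrative.

```python
def wired_limit_mb(total_gb, os_reserve_gb):
    """Return a wired-memory limit in MB that leaves os_reserve_gb for macOS."""
    if not 0 < os_reserve_gb < total_gb:
        raise ValueError("reserve must be positive and smaller than total memory")
    return (total_gb - os_reserve_gb) * 1024

def sysctl_command(limit_mb):
    """Format the command from the text; it only prints, it does not run."""
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"

# 96GB machine with an assumed 6GB left for the operating system.
print(sysctl_command(wired_limit_mb(96, 6)))
# sudo sysctl iogpu.wired_limit_mb=92160
```

Because the setting resets after restart, the command would need to be reapplied each boot.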

Key points

  • A 96GB M2 Max can still hit swap around 91 to 92GB when running a very large llama.cpp model with a 150,000-token context.
  • macOS gives less direct control over swap behavior than Linux, so llama.cpp flags may not fully prevent it.
  • `-ctk q8_0` and `-ctv q8_0` quantize the KV cache and can cut its memory use by roughly half.
  • KV cache quantization may hurt accuracy for coding and can slow generation on long context.
  • Reducing context size is the most straightforward way to lower memory use if a 150,000-token context is not necessary.

Quick term guide

workaround
An alternative way to get something done when the normal way doesn't work.
long context
The total amount of text or conversation history an AI can remember and process at once.
operating system
The core software that manages and coordinates all other programs in a computer system.
local model
An AI model you run directly on your own computer, with no internet connection or external service needed.
memory pressure
A condition where the computer is close to running out of working memory, which can slow it down.
quantization
A way to shrink an AI model, or data it keeps in memory such as the KV cache, by reducing the precision of its numbers, trading a little quality for a much smaller size.
context window
The maximum amount of text an AI can process and remember in a single request or chat.