Luce Spark runs large MoE models with less GPU memory
Luce Spark is a tool that aims to run 33B to 35B models within the memory of a 16 GB GPU. For Qwen3.6 35B-A3B, the GPU memory needed dropped from about 20.5 GiB to 13.3 GiB, and for Laguna XS.2 33B-A3B it dropped from 18.8 GiB to 14.6 GiB. The method keeps only the frequently used expert parts on the GPU, while less-used parts stay off the GPU and are brought in when needed.
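As a rough illustration of the idea of keeping hot experts resident and bringing others in on demand, here is a minimal sketch. It is not Spark's code: the `ExpertCache` class, the pinned set, and the least-recently-used eviction policy are all assumptions made for the example, and the actual weight copy is replaced by a comment.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model of GPU residency for MoE experts (hypothetical, not Spark's code)."""

    def __init__(self, pinned, cache_slots):
        self.pinned = set(pinned)        # experts kept on the GPU at all times
        self.cache_slots = cache_slots   # extra GPU slots for less-used experts
        self.cache = OrderedDict()       # LRU order: oldest entry first
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        """Return where the expert was found: 'pinned', 'cached', or 'loaded'."""
        if expert_id in self.pinned:
            self.hits += 1
            return "pinned"
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return "cached"
        # Miss: a real system would copy the expert weights onto the GPU here.
        self.misses += 1
        if self.cache_slots > 0:
            if len(self.cache) >= self.cache_slots:
                self.cache.popitem(last=False)  # evict the least recently used
            self.cache[expert_id] = True
        return "loaded"


cache = ExpertCache(pinned={0, 1}, cache_slots=2)
trace = [0, 5, 5, 6, 7, 5, 1]  # made-up sequence of routed expert ids
results = [cache.fetch(e) for e in trace]
print(results)
print("hits:", cache.hits, "misses:", cache.misses)
```

In this toy trace the pinned experts never cost a transfer, while expert 5 is loaded, reused once, evicted, and loaded again. That is the kind of traffic a good placement tries to minimize.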
Spark watches real requests, learns which expert parts are used most, and saves that layout for future runs. It does not require a separate training set or a separate calibration step. In the shared test, all-GPU running reached 119, while Spark with learned placement, caching, and a fused execution path reached about 100.
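The learning step described above can be pictured as simple counting. The sketch below assumes routing decisions are available as (layer, expert) pairs and that the layout is stored as JSON. Both are assumptions for illustration, as are all function names; the source does not describe Spark's actual data format or API.

```python
import json
import os
import tempfile
from collections import Counter


def record_request(counts, routed_experts):
    """Add one request's routing decisions, given as (layer, expert) pairs."""
    counts.update(routed_experts)


def choose_gpu_experts(counts, gpu_slots):
    """Pick the most frequently used (layer, expert) pairs that fit the budget."""
    return [key for key, _ in counts.most_common(gpu_slots)]


def save_layout(path, gpu_experts):
    """Persist the chosen placement so a later run can start from it."""
    with open(path, "w") as f:
        json.dump({"gpu_experts": [list(k) for k in gpu_experts]}, f)


def load_layout(path):
    """Read a saved placement back as a list of (layer, expert) tuples."""
    with open(path) as f:
        return [tuple(k) for k in json.load(f)["gpu_experts"]]


# Made-up routing traces standing in for live requests.
counts = Counter()
record_request(counts, [(0, 3), (0, 7), (1, 2)])
record_request(counts, [(0, 3), (1, 2), (1, 5)])
record_request(counts, [(0, 3), (0, 7), (1, 2)])

layout = choose_gpu_experts(counts, gpu_slots=2)
path = os.path.join(tempfile.mkdtemp(), "layout.json")
save_layout(path, layout)
print(load_layout(path))
```

Because the counts come from the requests themselves, no separate calibration data is involved. The trade-off is that the placement is only as representative as the traffic seen so far.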
The results were measured on an RTX 3090, not yet on an actual 16 GB card. There is also no direct same-settings comparison yet against llama.cpp CPU offload.
Key points
- Spark reduced GPU memory needs for two 33B to 35B models to between 13.3 and 14.6 GiB.
- It keeps frequently used expert parts on the GPU and brings in less-used parts when needed.
- It learns from live requests, so no separate calibration data is required.
- In the reported RTX 3090 test, Spark reached about 100 versus 119 for all-GPU running.
- Actual 16 GB GPU testing and a direct llama.cpp CPU offload comparison are still missing.