Run Large AI Models Faster on Older Hardware with New Techniques

New AI models like Qwen3.6-MTP can produce text much faster, even on older graphics cards. This could make building capable AI agents more affordable and more responsive.

The Qwen3.6-MTP-27B model uses a technique called Multi-Token Prediction to guess several tokens (small pieces of text) at once instead of one at a time. Users are reporting speeds of 55 tokens per second on older Tesla V100 hardware using the llama.cpp software. This is significant because it lets a medium-sized, powerful model run at speeds usually associated with much smaller models. While some are comparing it to other specialized versions like qwopus, the focus remains on squeezing more performance out of existing chips. For those building AI agents, this means you can get smarter answers without needing the newest, most expensive hardware.
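
To make the "guess several at once" idea concrete, here is a toy sketch of draft-and-verify decoding, one common way multi-token guesses are put to use. It is not how Qwen3.6-MTP or llama.cpp are implemented (the reports above do not describe that); the names SENTENCE, main_model, draft, and generate are made up for this illustration, and both "models" are simple stand-in functions.

```python
from typing import List, Tuple

# Hypothetical stand-in for "what the full model would say".
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def main_model(tokens: List[str]) -> str:
    """Toy full model: always returns the correct next token."""
    return SENTENCE[len(tokens) % len(SENTENCE)]


def draft(tokens: List[str], draft_len: int) -> List[str]:
    """Toy multi-token guesser: proposes draft_len tokens, wrong at every 4th position."""
    guesses = []
    for pos in range(len(tokens), len(tokens) + draft_len):
        if pos % 4 == 3:
            guesses.append("???")  # deliberately wrong guess
        else:
            guesses.append(SENTENCE[pos % len(SENTENCE)])
    return guesses


def generate(prompt: List[str], num_new: int, draft_len: int = 3) -> Tuple[List[str], int]:
    """Return the generated tokens and the number of verification passes used."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    passes = 0
    while len(tokens) < goal:
        passes += 1  # one pass checks a whole batch of guesses
        for guess in draft(tokens, draft_len):
            # A real system would check all guesses in a single batched
            # model pass; this toy calls the stand-in once per token.
            truth = main_model(tokens)
            tokens.append(truth)  # the full model's token is always kept
            if guess != truth or len(tokens) == goal:
                break  # stop at the first wrong guess or when done
    return tokens, passes


if __name__ == "__main__":
    out, passes = generate([], num_new=len(SENTENCE))
    print(" ".join(out))
    print(f"{len(out)} tokens in {passes} passes")
```

The output is the same text the full model would have produced one token at a time, because every guess is checked. The saving comes from needing fewer passes whenever the guesses are right.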

Key points

  • Multi-Token Prediction allows models to work faster by guessing several words at a time.
  • Older hardware like the Tesla V100 can still run modern, powerful 27B models efficiently.
  • Higher processing speeds reduce the waiting time for AI-generated responses.
  • llama.cpp remains a vital tool for running AI locally on various types of equipment.
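
As a rough illustration of the waiting-time point, the arithmetic can be sketched as below. Only the 55 tokens per second figure comes from the reports above; the 500-token response length and the 20 tokens per second comparison speed are made-up assumptions, and the calculation ignores the time spent reading the prompt.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady generation speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPORTED_SPEED = 55.0   # tokens per second, as reported for the Tesla V100
SLOWER_SPEED = 20.0     # hypothetical slower setup, for comparison only
RESPONSE_TOKENS = 500   # hypothetical length of one response

if __name__ == "__main__":
    fast = seconds_to_generate(RESPONSE_TOKENS, REPORTED_SPEED)
    slow = seconds_to_generate(RESPONSE_TOKENS, SLOWER_SPEED)
    print(f"At {REPORTED_SPEED:.0f} tokens/s: {fast:.1f} s")
    print(f"At {SLOWER_SPEED:.0f} tokens/s: {slow:.1f} s")
```

For an agent that makes many model calls in a row to finish one task, a difference like this adds up on every step.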

Quick term guide

AI models
The core brain or underlying program that powers an artificial intelligence tool.
AI agents
AI tools that can inspect information and carry out steps toward a goal, rather than just answering once.
Multi-Token Prediction
A method where an AI predicts several upcoming words at the same time to speed up its work.
tokens per second
A measurement of how many pieces of text an AI can generate in one second.
llama.cpp
A free, open-source program that lets you run AI language models locally on CPUs and GPUs, including older graphics cards.
software
Programs or apps that run on a computer or smartphone.
responses
The text an AI model generates in reply to a prompt.
