MTP doubles generation speed, but saves only ~3% total time at 64k context

Turning on MTP (multi-token prediction) makes text generation roughly twice as fast. In the author's test at a 64,000-token context length, however, the overall wait time dropped by only about 3%. The culprit is the prefill stage, which dominates total latency when context is long.

When an AI generates a response, it works in two stages. First, it reads and processes all the text you gave it; this is called the prefill stage. Then it generates the reply one token (roughly a word or word fragment) at a time or, with MTP, several tokens at a time. MTP targets only the second stage, so any speedup there is limited by how much of the total time that stage actually takes.
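This limit is the familiar Amdahl's law argument. The snippet below is a minimal sketch of it, assuming total wait time is simply prefill plus generation and that MTP speeds up generation by a fixed factor while leaving prefill untouched. The function name, parameter names, and the example shares are illustrative and do not come from the original article.

```python
def total_time_saved(generation_share: float, generation_speedup: float) -> float:
    """Fraction of total wait time saved when only generation gets faster.

    generation_share: fraction of the original total time spent generating (0..1).
    generation_speedup: how many times faster generation becomes (2.0 = twice as fast).
    """
    if not 0.0 <= generation_share <= 1.0:
        raise ValueError("generation_share must be between 0 and 1")
    if generation_speedup <= 0:
        raise ValueError("generation_speedup must be positive")
    # Prefill time is unchanged; only the generation slice shrinks.
    new_total = (1.0 - generation_share) + generation_share / generation_speedup
    return 1.0 - new_total


# Hypothetical generation shares, chosen only to show the trend.
for share in (0.9, 0.5, 0.1):
    saved = total_time_saved(share, generation_speedup=2.0)
    print(f"generation share {share:.0%}: total time saved {saved:.0%}")
```

With generation twice as fast, the saving is always half of the generation share, so a request that spends little of its time generating has little to gain.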

At a 64,000-token context — the kind you get with long documents or an AI agent that has accumulated a lot of conversation history — the prefill stage takes so long that doubling generation speed barely moves the needle on total wait time. The author measured this directly on an RTX 3090 GPU. The practical takeaway: MTP is most valuable for short-context tasks. For agents or pipelines handling large contexts, reducing the prefill cost (e.g., caching, shorter prompts) matters far more than faster generation.
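Working backward from the two reported figures gives a rough sense of how lopsided the split is. This is a back-of-the-envelope sketch, assuming generation is exactly 2x faster, total time drops by exactly 3%, and prefill is unaffected by MTP. The original gives only approximate values, so the result is an estimate and not a measurement; the function name is illustrative.

```python
def implied_generation_share(total_saved: float, generation_speedup: float) -> float:
    """Estimate the share of original total time spent generating.

    Inverts: total_saved = generation_share * (1 - 1 / generation_speedup).
    """
    if generation_speedup <= 1.0:
        raise ValueError("generation_speedup must be greater than 1")
    if not 0.0 <= total_saved < 1.0:
        raise ValueError("total_saved must be in [0, 1)")
    return total_saved / (1.0 - 1.0 / generation_speedup)


# Assumed inputs taken from the approximate figures in the summary.
share = implied_generation_share(total_saved=0.03, generation_speedup=2.0)
print(f"generation: about {share:.0%} of the original wait")
print(f"prefill:    about {1.0 - share:.0%} of the original wait")
```

Under those assumptions, generation accounts for about 6% of the original wait and prefill for about 94%, which is why doubling generation speed changes the total so little.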

Key points

  • MTP doubles the raw token generation speed
  • At 64k-token context, total response latency drops by only ~3%
  • The prefill stage (reading all input) is the real bottleneck at long context lengths
  • AI agents with long conversation histories will see little benefit from MTP
  • Shorter contexts get proportionally more benefit from MTP

Quick term guide

MTP (multi-token prediction)
A technique where the AI predicts several words at once instead of one at a time, speeding up text generation.
context
The information an AI uses to understand your request, such as files, notes, and past messages.
prefill
The stage where the AI reads and processes your entire input before it starts writing a reply.
latency
The total time you wait from sending a request to getting a complete response.
AI agent
An AI program that can inspect information and carry out steps toward a goal, not just answer once.
caching
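Saving work the AI has already done so it can be reused. Here, that means keeping the processed form of input that repeats across requests, so the prefill stage does not have to read it again.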
prompts
Instructions you give to an AI tool.