Open Source · Importance: Medium

MTP doubles generation speed, but saves only ~3% total time at 64k context

r/ollama · Jun 10, 2026

Turning on MTP (multi-token prediction) makes text generation roughly twice as fast. However, at a 64,000-token context length, the overall wait time dropped by only about 3%. The culprit is the prefill stage, which dominates total latency when context is long.

When an AI generates a response, it works in two stages. First, it reads and processes all the text you gave it — this is called the prefill stage. Then it generates the reply word by word (or, with MTP, several words at a time). MTP targets only the second stage, so any speedup there is limited by how much of the total time that stage actually takes.
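The limit described above is the usual "speed up one part of a pipeline" arithmetic, often called Amdahl's law. The sketch below is illustrative only: the function names are made up for this example, and the only inputs taken from the post are the rough 2x generation speedup and the roughly 3% total saving. The implied generation share is a back-of-the-envelope inference, not a figure the author reported.

```python
def total_time_saving(decode_fraction: float, decode_speedup: float) -> float:
    """Fraction of total latency removed when only the generation (decode) stage speeds up."""
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    # Only the decode share shrinks; the prefill share is untouched.
    return decode_fraction * (1.0 - 1.0 / decode_speedup)


def implied_decode_fraction(saving: float, decode_speedup: float) -> float:
    """Invert the formula: what share of total time was decode, given an observed saving?"""
    return saving / (1.0 - 1.0 / decode_speedup)


# Rough figures from the summary: ~2x generation speed, ~3% less total time.
fraction = implied_decode_fraction(saving=0.03, decode_speedup=2.0)
print(f"Implied decode share of total time: {fraction:.0%}")

# Hypothetical contrast: a short-context request where decode is half the total time.
print(f"Saving if decode were half of total time: {total_time_saving(0.5, 2.0):.0%}")
```

Under these assumptions, generation would account for only about 6% of the total wait at 64k context, which is why halving it removes only about 3%. If generation were half of the total, the same 2x speedup would remove 25%.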

At a 64,000-token context — the kind you get with long documents or an AI agent that has accumulated a lot of conversation history — the prefill stage takes so long that doubling generation speed barely moves the needle on total wait time. The author measured this directly on an RTX 3090 GPU. The practical takeaway: MTP is most valuable for short-context tasks. For agents or pipelines handling large contexts, reducing the prefill cost (e.g., caching, shorter prompts) matters far more than faster generation.
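To see why trimming prefill can matter more than faster generation, here is a rough comparison. It is a sketch under stated assumptions, not a measurement: the 6% generation share is inferred from the post's figures (a 2x speedup that saves about 3% implies generation is about 6% of the total), the reuse shares are hypothetical, and prefill time is assumed to shrink in proportion to the share of input that is skipped, which is a simplification.

```python
def saving_from_faster_decode(decode_fraction: float, decode_speedup: float) -> float:
    """Total-latency saving when only generation gets faster."""
    return decode_fraction * (1.0 - 1.0 / decode_speedup)


def saving_from_prefill_reuse(decode_fraction: float, reused_share: float) -> float:
    """Total-latency saving when a share of prefill work is skipped.

    reused_share is a hypothetical value: the portion of the input whose
    processing can be skipped, e.g. via a cached prefix or a shorter prompt.
    Assumes prefill time scales linearly with the amount of input processed.
    """
    prefill_fraction = 1.0 - decode_fraction
    return prefill_fraction * reused_share


# Inferred from the post's ~2x speedup and ~3% saving; not a reported measurement.
DECODE_FRACTION = 0.06

mtp = saving_from_faster_decode(DECODE_FRACTION, decode_speedup=2.0)
print(f"2x faster generation: {mtp:.1%} of total time saved")

for reused in (0.25, 0.50, 0.90):  # hypothetical reuse shares
    saved = saving_from_prefill_reuse(DECODE_FRACTION, reused)
    print(f"skip {reused:.0%} of prefill: {saved:.1%} of total time saved")
```

Even the smallest hypothetical case here, skipping a quarter of the prefill work, saves several times more total time than doubling generation speed.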

Key points

MTP doubles the raw token generation speed
At 64k-token context, total response latency drops by only ~3%
The prefill stage (reading all input) is the real bottleneck at long context lengths
AI agents with long conversation histories will see little benefit from MTP
Shorter contexts get proportionally more benefit from MTP

Quick term guide

MTP (multi-token prediction): A technique where the AI predicts several tokens (words or word pieces) per step instead of one at a time, speeding up text generation.
context: The information an AI uses to understand your request, such as files, notes, and past messages.
prefill: The stage where the AI reads and processes your entire input before it starts writing a reply.
latency: The total time you wait from sending a request to getting a complete response.
AI agent: An AI program that can inspect information and carry out steps toward a goal, not just answer once.
caching: Here, saving the work the AI has already done to process an input (for example, a long shared prompt) so it does not have to re-read the same text on the next request.
prompts: Instructions you give to an AI tool.
