Multi-Token Prediction
- Better & Faster Large Language Models via Multi-token Prediction
- DeepSeek-V3 Technical Report
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- vLLM / MTP (Multi-Token Prediction)
- SGLang / DeepSeek V3 / Multi-token Prediction
- Self-Distillation for Multi-Token Prediction
Multi-Token Prediction (MTP) trains an LLM to predict more than one future token from the same context. Standard next-token prediction optimizes only x[t + 1] at position t. MTP adds auxiliary prediction targets such as x[t + 2], x[t + 3], and x[t + 4].
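As a minimal sketch of what those auxiliary targets look like, the snippet below pairs each position t with its next n tokens. The function name `mtp_targets` and the choice to skip positions near the end of the sequence are assumptions made for illustration; a real training pipeline might pad or mask those positions instead.

```python
def mtp_targets(tokens, n_future):
    """Return (t, [x[t+1], ..., x[t+n_future]]) pairs.

    Hypothetical helper: positions without n_future following tokens
    are skipped here for simplicity.
    """
    pairs = []
    for t in range(len(tokens) - n_future):
        pairs.append((t, tokens[t + 1 : t + 1 + n_future]))
    return pairs


tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
for t, future in mtp_targets(tokens, n_future=3):
    # n_future=1 would reduce to standard next-token prediction.
    print(t, tokens[t], "->", future)
```

With `n_future=1` the same helper yields ordinary next-token targets, which makes the relationship between the two objectives easy to see.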
The main goals are:
- denser supervision during pre-training;
- better representations for future-token planning;
- native proposal tokens for speculative decoding at inference time.
Basic Idea
The design of Gloeckle et al. (ICML 2024) uses n independent output heads on top of a shared model trunk. At every corpus position, the model predicts the next n tokens. The authors report improved sample efficiency and downstream quality, especially for larger models and code generation, plus up to 3x faster inference for 4-token prediction under their inference setup.
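The shared-trunk idea can be sketched as one hidden state feeding n heads, with the per-head cross-entropy losses summed. In this toy version each head is reduced to a plain linear map and all weights are made-up numbers; both are assumptions of the sketch, not a description of how the paper implements its heads.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def multi_head_loss(trunk_hidden, heads, future_token_ids):
    """Sum of cross-entropy losses, one per head.

    heads[i] is a (vocab x hidden) matrix predicting the token i+1 steps
    ahead from the same shared trunk hidden state (illustrative only).
    """
    loss = 0.0
    for head, target_id in zip(heads, future_token_ids):
        probs = softmax(matvec(head, trunk_hidden))
        loss += -math.log(probs[target_id])
    return loss


# Toy sizes: hidden=2, vocab=3, n=2 heads. All numbers are invented.
trunk_hidden = [1.0, -0.5]
heads = [
    [[0.2, 0.1], [1.0, -1.0], [-0.3, 0.4]],  # head for x[t+1]
    [[0.5, 0.5], [-0.2, 0.3], [0.1, -0.7]],  # head for x[t+2]
]
loss = multi_head_loss(trunk_hidden, heads, future_token_ids=[1, 0])
print(loss)
```

The point of the sketch is only that the trunk is computed once per position while each head contributes its own loss term.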
DeepSeek-V3 MTP
DeepSeek-V3 adopts MTP as part of its model architecture and training objective. The report describes MTP as both a training-quality technique and an inference-acceleration technique for speculative decoding.
DeepSeek-V3 differs from the independent-head design:
- it predicts additional tokens sequentially rather than only through parallel independent heads;
- it keeps the causal chain for each prediction depth;
- it uses MTP modules that can act as native draft-token producers during decoding.
This means DeepSeek-V3 MTP is closer to a model-integrated speculative proposer than a generic post-hoc serving trick. The MTP weights ship as part of the model's own artifacts, and serving frameworks need explicit support for them.
Inference Path
At inference time, MTP is usually consumed through speculative decoding: the MTP modules propose draft tokens and the target model verifies them.
The speedup depends on the acceptance rate. If the proposed future tokens match what the target model would have generated, multiple tokens can be accepted from one verification pass. If predictions are rejected often, the extra proposer work can reduce or erase the benefit.
MTP and speculative decoding can be understood as moving some expensive target-model decoding work into a prefill-like verification pass. Normal decoding asks the target model to emit one token, append it, and run again. Speculative decoding first lets a cheap proposer guess a block of future tokens, then asks the target model to process the prompt plus that drafted block in one pass. The causal mask still preserves autoregressive correctness, but the target gets logits for several draft positions at once. This is why accepted draft tokens can reduce the number of sequential target decode steps.
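The draft-then-verify loop described above can be sketched with two toy stand-ins. `target_next` and `proposer_next` are hypothetical functions that look up a fixed token list by context length; they are not real models. A real system would obtain all of the target's predictions for a draft block from one forward pass, which the sketch only counts.

```python
TARGET_STORY = ["The", "capital", "of", "France", "is", "Paris", ".", "It", "is", "big", "."]
DRAFT_STORY = ["The", "capital", "of", "France", "is", "Paris", ".", "The", "is", "big", "."]


def target_next(context):
    """Toy target model: greedy next token depends only on context length."""
    return TARGET_STORY[len(context) % len(TARGET_STORY)]


def proposer_next(context):
    """Toy cheap proposer: agrees with the target except at one position."""
    return DRAFT_STORY[len(context) % len(DRAFT_STORY)]


def speculative_generate(prompt, k, max_new_tokens):
    new_tokens = []
    target_passes = 0
    while len(new_tokens) < max_new_tokens:
        context = list(prompt) + new_tokens
        # 1) The proposer drafts k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(proposer_next(context + draft))
        # 2) One verification pass (counted once; a real target would
        #    return all of these predictions from a single forward pass).
        target_passes += 1
        for i in range(k):
            expected = target_next(context + draft[:i])
            if draft[i] != expected:
                new_tokens.append(expected)  # target's own token replaces the mismatch
                break
            new_tokens.append(draft[i])
        else:
            # Whole draft accepted: the same pass also yields one extra token.
            new_tokens.append(target_next(context + draft))
    return new_tokens[:max_new_tokens], target_passes


prompt = ["The", "capital", "of", "France", "is"]
tokens, passes = speculative_generate(prompt, k=4, max_new_tokens=6)
print(tokens, passes)
```

In this toy run six tokens are produced with two verification passes, and the output is identical to what plain greedy decoding with the target alone would give.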
Verification Mechanism
MTP does not make the draft tokens authoritative. The target model remains the verifier. For a prompt such as:
The capital of France is
an MTP module or draft model may propose:
Paris. The city
The target model receives the prompt plus the drafted tokens and returns logits for each position:
input ids -> embeddings -> transformer -> hidden states -> LM head -> logits
The raw target output is a tensor similar to:
[batch_size, sequence_length, vocabulary_size]
Each position has a vocabulary-sized logit vector. The serving code compares the draft token ids with the target model's distributions at the matching positions.
The key tensor flow for a normal next-token step is:
token_ids: [1, 5]
embeddings: [1, 5, 4096]
hidden states: [1, 5, 4096]
last hidden: [4096]
logits: [50257]
next token id: scalar
For speculative verification with a draft:
draft = " Paris. The city"
the target input becomes:
"The capital of France is Paris. The city"
Assume the combined input has nine tokens:
input_ids.shape
# [1, 9]
After the transformer:
hidden.shape
# [1, 9, 4096]
After the LM head:
all_logits.shape
# [1, 9, 50257]
The verifier checks several positions:
all_logits[0, 4] # predicts token after "is"
all_logits[0, 5] # predicts token after "Paris"
all_logits[0, 6] # predicts token after "."
all_logits[0, 7] # predicts token after "The"
A simplified greedy result might be:
| Position/context | Target top token |
|---|---|
| after The capital of France is | Paris |
| after ... is Paris | . |
| after ... is Paris. | It |
| after ... is Paris. The | capital |
Compared with the draft:
| Source | Token 1 | Token 2 | Token 3 | Token 4 |
|---|---|---|---|---|
| draft token | Paris | . | The | city |
| target top | Paris | . | It | capital |
| accept | yes | yes | no | no |
That is how one target forward pass returns enough tensors to verify multiple draft tokens.
| Target context | Draft token | Target signal | Greedy decision |
|---|---|---|---|
| The capital of France is | Paris | top token is Paris | accept |
| The capital of France is Paris | . | top token is . | accept |
| The capital of France is Paris. | The | top token might be It | reject |
| The capital of France is Paris. The | city | ignored after the previous reject | reject suffix |
After the first rejection, the remaining drafted suffix is discarded because it was conditioned on tokens that are no longer part of the accepted sequence. Generation then continues from the accepted prefix.
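The accept-until-first-mismatch rule can be sketched as a small function over per-position logit vectors. The six-word vocabulary and the logit values are invented for illustration, and returning the target's own top token at the mismatch is one plausible convention for how a serving loop could continue, not something the source prescribes.

```python
VOCAB = ["Paris", ".", "The", "It", "city", "capital"]  # toy vocabulary


def greedy_verify(draft_ids, target_logits):
    """Accept the longest draft prefix that matches the target's argmax.

    target_logits[i] is the target's logit vector at the position that
    predicts draft_ids[i]. Returns (accepted_ids, correction_id), where
    correction_id is the target's top token at the first mismatch, or
    None if the whole draft was accepted.
    """
    accepted = []
    for draft_id, logits in zip(draft_ids, target_logits):
        top_id = max(range(len(logits)), key=logits.__getitem__)
        if draft_id != top_id:
            return accepted, top_id  # the rest of the draft is discarded
        accepted.append(draft_id)
    return accepted, None


draft_ids = [0, 1, 2, 4]  # "Paris", ".", "The", "city"
target_logits = [
    [5.0, 0.1, 0.2, 0.3, 0.0, 0.1],  # top: "Paris"
    [0.1, 4.0, 0.5, 0.2, 0.0, 0.3],  # top: "."
    [0.0, 0.1, 2.0, 2.5, 0.3, 0.1],  # top: "It" (draft said "The")
    [0.2, 0.0, 0.1, 0.3, 1.0, 3.0],  # never consulted after the reject
]
accepted, correction = greedy_verify(draft_ids, target_logits)
print([VOCAB[i] for i in accepted], VOCAB[correction])
```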
Causal Mask During Verification
The target can verify several drafted tokens in one forward pass because causal attention prevents future-token leakage. In a sequence such as:
The capital of France is Paris. The city
the logit vector after "is" can see only the prompt up through "is", not "Paris. The city". The logit vector after "Paris" can see the prompt plus "Paris", but not later tokens. This lower-triangular attention rule lets a single target pass produce valid next-token distributions for many positions:
after "is" -> logits for the token after "is"
after "Paris" -> logits for the token after "Paris"
after "." -> logits for the token after "."
after "The" -> logits for the token after "The"
The verifier is therefore not asking whether the whole draft is factually reasonable. It is checking whether the draft prefix is compatible with what the target model would have produced under the current decoding algorithm.
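To make the lower-triangular rule concrete, the sketch below builds a boolean causal mask and lists which tokens each position may attend to. It assumes one token per word (nine tokens, matching the example above); real tokenizers will split the text differently.

```python
def causal_mask(seq_len):
    """mask[row][col] is True when position `row` may attend to `col`."""
    return [[col <= row for col in range(seq_len)] for row in range(seq_len)]


tokens = ["The", "capital", "of", "France", "is", "Paris", ".", "The", "city"]
mask = causal_mask(len(tokens))


def visible(position):
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


# Position 4 ("is") sees only the prompt, so its logits are a valid
# prediction for the first draft token even though the draft is in the input.
print(visible(4))
# Position 5 ("Paris") additionally sees the first draft token.
print(visible(5))
```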
Why It Can Be Faster
Normal autoregressive decoding requires one target-model step per emitted token:
target run -> token 1
target run -> token 2
target run -> token 3
target run -> token 4
Speculative decoding with MTP changes the expensive part:
cheap proposer -> draft tokens 1..k
target run -> verify tokens 1..k in one pass
If the target accepts four drafted tokens, the system has emitted four tokens with one expensive verification pass instead of four sequential target decode passes. The verification pass is not free because it processes the draft block, but it can be cheaper than repeated target calls when enough drafted tokens are accepted. The realized speedup is approximately controlled by:
- accepted tokens per target verification
- proposer cost
- verification overhead
A low acceptance rate removes the benefit because the system pays for draft generation but still emits only a small prefix.
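A back-of-envelope way to combine these three factors is sketched below. The cost model is an assumption introduced here, not a formula from the cited sources: one plain target decode step costs 1 unit, a verification pass costs `verify_cost` units, and drafting one block costs `draft_cost` units.

```python
def estimated_speedup(tokens_per_round, verify_cost, draft_cost):
    """Rough speedup over plain decoding under an assumed unit-cost model.

    tokens_per_round: average tokens emitted per draft-and-verify round.
    verify_cost, draft_cost: costs relative to one plain target decode step.
    """
    return tokens_per_round / (verify_cost + draft_cost)


# High acceptance: about 3 tokens per round pays for the overhead.
print(estimated_speedup(3.0, verify_cost=1.25, draft_cost=0.25))
# Low acceptance: about 1.2 tokens per round ends up slower than plain decoding.
print(estimated_speedup(1.2, verify_cost=1.25, draft_cost=0.25))
```

A value below 1 means the speculative path costs more than it saves, which matches the low-acceptance case described above.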
Greedy And Sampling Acceptance
For greedy decoding, acceptance can be understood as top-1 matching:
accept draft token if draft_token == argmax(target_logits)
For sampling, top-1 matching would be wrong because target sampling can legitimately choose non-top-1 tokens. Sampling-based speculative decoding compares the draft distribution q with the target distribution p. A drafted token x is accepted with probability:
min(1, p(x) / q(x))
If the target model assigns at least as much probability to x as the draft did, the token is accepted. If the draft overestimated x, the token is accepted only sometimes. On rejection, the serving algorithm samples from a corrected residual distribution so the final output distribution matches the target model's intended sampling behavior.
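The acceptance rule and the residual resampling step can be sketched as follows. The residual is taken to be the normalized positive part of p - q, which is the usual construction in the speculative sampling literature; the function names and the three-token distributions are invented for illustration.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # Only happens when p == q, in which case nothing is ever rejected.
        return list(p)
    return [r / total for r in raw]


def accept_or_resample(x, p, q, rng):
    """Verify drafted token id x (sampled from q) against target p.

    Returns (token_id, accepted_flag).
    """
    accept_prob = min(1.0, p[x] / q[x])
    if rng.random() < accept_prob:
        return x, True
    residual = residual_distribution(p, q)
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return replacement, False


p = [0.5, 0.3, 0.2]  # target distribution (toy numbers)
q = [0.2, 0.6, 0.2]  # draft distribution (toy numbers)
rng = random.Random(0)
counts = [0, 0, 0]
trials = 20000
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q, k=1)[0]
    token, _ = accept_or_resample(drafted, p, q, rng)
    counts[token] += 1
# Empirical output frequencies should be close to p, not q.
print([c / trials for c in counts])
```

Here token 1 is overestimated by the draft, so it is accepted only half of the time, and every rejection is redirected to token 0, the only token the draft underestimated.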
Relation to Speculative Decoding
| Technique | Draft source | Model changes | Typical tradeoff |
|---|---|---|---|
| Draft-model speculative decoding | Separate smaller model | No target-model architecture change | Operationally flexible, but requires serving and maintaining a draft model |
| Medusa-style heads | Extra decoding heads | Fine-tuned heads, optionally joint training | Avoids a separate draft model, but head quality controls acceptance rate |
| Native MTP | MTP heads or modules trained with the model | Requires model-family support | Minimal serving configuration when supported, but not portable to arbitrary models |
MTP is best understood as a native way to produce speculative tokens. It does not remove the need for verification when exact sampling behavior matters.
Serving Support
As of April 30, 2026, serving support is model-family-specific.
vLLM documents MTP as a speculative decoding method for models with native MTP support. It uses speculative_config with "method": "mtp" and a num_speculative_tokens value. The vLLM documentation recommends a small value such as 1 as a starting point and says to use another speculative method when the model family does not support MTP.
vllm serve XiaomiMiMo/MiMo-7B-Base \
--tensor-parallel-size 1 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
SGLang documents DeepSeek V3 MTP through EAGLE speculative decoding. Its DeepSeek V3 guide reports speedups of 1.8x for batch size 1 and 1.5x for batch size 32 on an H200 TP8 setting.
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--speculative-algorithm EAGLE \
--trust-remote-code \
--tp 8
Practical Limits
- MTP is not a universal runtime flag. The model must expose compatible MTP weights or modules.
- Throughput gains depend on acceptance rate, batch size, hardware, attention backend, and decoding parameters.
- More speculative tokens are not always better because deeper predictions are usually harder to accept.
- Fine-tuning or quantization can drop or damage MTP weights, or leave them unsupported, unless the tooling preserves them.
- Exact output distribution still depends on the verifier and sampling algorithm, not just the MTP proposer.
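The diminishing return from deeper speculation can be illustrated with a small calculation. It assumes, purely for illustration, that each draft depth has its own acceptance probability given that all earlier draft tokens were accepted; the probabilities below are invented, not measured.

```python
def expected_accepted(accept_probs):
    """Expected number of accepted draft tokens.

    accept_probs[i] is the assumed probability that draft token i+1 is
    accepted, given that all earlier draft tokens were accepted.
    """
    expected = 0.0
    prefix_prob = 1.0
    for a in accept_probs:
        prefix_prob *= a  # probability the prefix up to this depth survives
        expected += prefix_prob
    return expected


# Deeper predictions are assumed harder, so acceptance drops with depth.
probs = [0.8, 0.6, 0.4, 0.2]
for depth in range(1, len(probs) + 1):
    print(depth, expected_accepted(probs[:depth]))
```

Under these made-up numbers the fourth draft token adds fewer than 0.04 expected accepted tokens, while still adding proposer and verification work.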
Research Direction
Recent work after the original MTP paper focuses on improving MTP head acceptance rates and reducing training cost. For example, MTP-D proposes self-distillation to improve acceptance while preserving main-head quality, and a looped extension strategy to add more MTP depth economically.
The central open question is not whether multiple-token proposals are useful. The practical question is how to keep the proposal distribution close enough to the target model that the verification step accepts enough tokens to pay for the extra compute.