LK losses: Training speculative decoding draft models to directly maximize acceptance rate
We introduce LK losses — training objectives that directly optimize the acceptance rate for speculative decoding draft models. They are a drop-in replacement for KL divergence, with no computational overhead, and they work with any draft architecture and any target model size, to deliver consistent improvements in inference throughput across models ranging from 8B to 685B parameters. We are open-sourcing our trained draft models (LK-Speculators) and training datasets (Infinity-Instruct-Completions). An implementation of LK losses is also available as a pull request to SpecForge.
April 10, 2026
16 mins to read
Speculative decoding has become one of the most widely used techniques for accelerating LLM inference in production. The approach splits token generation into two stages: a small, fast draft model proposes a batch of candidate tokens, and the large target model verifies all of them in a single forward pass. The target model evaluates the drafted tokens and accepts a prefix of them using a rejection sampling procedure that preserves the target's output distribution exactly. This yields multiple tokens per target forward pass, substantially improving throughput without changing model quality.
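The verification step can be sketched in a few lines. This is a minimal illustration of the rejection rule from Leviathan et al. (2023) over toy dict-based distributions, not a production implementation; the extra "bonus" token the target emits when every draft is accepted is omitted:

```python
import random

def verify_draft(draft_tokens, q_dists, p_dists, seed=0):
    """Sketch of speculative verification: accept draft token x with
    probability min(1, p(x)/q(x)); on the first rejection, resample from
    the renormalized residual max(p - q, 0) and stop."""
    rng = random.Random(seed)
    out = []
    for x, q, p in zip(draft_tokens, q_dists, p_dists):
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)  # accepted: move on to the next drafted position
        else:
            # rejected: replacement comes from the renormalized residual,
            # which preserves the target distribution exactly
            residual = {t: max(p[t] - q[t], 0.0) for t in p}
            z = sum(residual.values())
            out.append(rng.choices(list(residual),
                                   [w / z for w in residual.values()])[0])
            break  # everything after the first rejection is discarded
    return out

# when draft and target agree exactly, every drafted token is accepted
dist = {0: 0.5, 1: 0.3, 2: 0.2}
assert verify_draft([0, 1, 2], [dist] * 3, [dist] * 3) == [0, 1, 2]
```

When `q` matches `p`, the acceptance probability is 1 for every token, which is the degenerate case where drafting is free.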
One of the key performance metrics in speculative decoding is the acceptance rate α — the probability that a given draft token is accepted by the target model. Everything else equal, higher acceptance rate means more tokens per forward pass and better end-to-end speedup. Drafter latency and batch size also matter for real deployment efficiency, but acceptance rate is the quantity that draft model training most directly controls.
Against this background, there is a notable mismatch in standard practice: most draft models are trained by minimizing KL divergence from the target distribution, an objective that does not directly maximize acceptance rate.
The justification for KL-based training is straightforward: the global optimum of KL divergence is where the draft distribution perfectly matches the target. This is also the global optimum of acceptance rate α. The two objectives agree when a perfect solution is achievable.
Draft models, however, are small by design. In practice, they have roughly 1–5% of the target model’s parameter count and cannot perfectly replicate its output distribution. At these suboptimal solutions, minimizing KL offers no formal guarantee of maximizing the acceptance rate.
The connection between acceptance rate and Total Variation (TV) distance makes this concrete. Acceptance rate is defined as
α = ∑_{x∈V} min(p(x), q(x))
where q is the draft distribution and p is the target distribution over the vocabulary V. This quantity is the total probability mass that the two distributions share. It is directly related to TV distance:
α = 1 − TV(p, q)
This identity, established in the original speculative decoding paper by Leviathan et al. (2023), means that maximizing acceptance rate is exactly equivalent to minimizing TV distance. KL divergence is an indirect surrogate that happens to share the global optimum, but leads to different suboptimal solutions when model capacity is constrained.
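The identity is easy to verify numerically. A minimal sketch with toy three-token distributions (illustrative numbers, not from the paper):

```python
import numpy as np

def acceptance_rate(p, q):
    # shared probability mass: alpha = sum_x min(p(x), q(x))
    return np.minimum(p, q).sum()

def tv_distance(p, q):
    return 0.5 * np.abs(p - q).sum()

p = np.array([0.7, 0.2, 0.1])  # toy target distribution
q = np.array([0.5, 0.3, 0.2])  # toy draft distribution

alpha = acceptance_rate(p, q)
# alpha = 1 - TV(p, q): maximizing acceptance = minimizing TV
assert abs(alpha - (1.0 - tv_distance(p, q))) < 1e-12
```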
This relationship also motivates the name. In the original speculative decoding paper, TV distance appears as DLK — a notation that reverses “KL.” Our losses are designed as an alternative to the standard KL optimization approach, replacing a proxy objective with the quantity we actually care about.
To build intuition for why the choice of objective matters at suboptimal solutions, consider a simple experiment: fitting a single Gaussian to a three-mode Gaussian mixture. This is an analog of what draft models face: a low-capacity approximation to a multi-modal target.
Motivating figure: KL, Reverse KL, and TV distance fitting a single Gaussian to a Gaussian mixture. Green areas show density overlap (= acceptance rate in the speculative sampling sense).
The results under three different objectives:
Forward KL divergence is mode-covering: it spreads probability mass broadly to avoid placing zero mass anywhere the target has support. The result is a wide distribution centered between the modes, with an acceptance rate of 50.2%;
Reverse KL divergence is mode-seeking: it collapses onto a single mode, yielding an acceptance rate of 50.8%;
TV distance places the distribution to maximize the area of overlap, achieving a substantially higher acceptance rate of 60.2%.
The parametric family and parameter count are identical across all three. The large difference is entirely explained by the objective. When capacity is limited and the draft model cannot perfectly match the target, the different objectives converge to different solutions and KL’s solution does not necessarily correspond to an optimal acceptance rate.
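The phenomenon is easy to reproduce in a discretized toy version. The sketch below fits a single fixed-width Gaussian to a symmetric two-mode mixture (an illustrative variant, not the exact three-mode setup behind the figure) and compares the fits selected by forward KL and by overlap:

```python
import numpy as np

# Discretized toy: grid-search the mean of a single Gaussian (fixed sigma)
# against a two-mode mixture; compare forward-KL fit vs. overlap fit.
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gauss(mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / (d.sum() * dx)           # normalized density on the grid

target = 0.5 * gauss(-3, 0.7) + 0.5 * gauss(3, 0.7)

def forward_kl(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / np.maximum(q[m], 1e-300))) * dx

def overlap(p, q):                      # acceptance rate = 1 - TV
    return np.sum(np.minimum(p, q)) * dx

mus = np.linspace(-5, 5, 201)
kl_mu = min(mus, key=lambda m: forward_kl(target, gauss(m, 1.0)))
ov_mu = max(mus, key=lambda m: overlap(target, gauss(m, 1.0)))

# KL sits between the modes (mode-covering); overlap picks one mode
assert abs(kl_mu) < 1.0 and abs(ov_mu) > 1.5
assert overlap(target, gauss(ov_mu, 1.0)) > overlap(target, gauss(kl_mu, 1.0))
```

Same parametric family, same parameter count; only the objective differs, and the overlap-optimized fit achieves strictly higher acceptance.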
If TV distance is the right objective, the natural question is why it is not already in use. The obstacle is gradient behavior when we train draft models from scratch.
The gradient of TV distance with respect to the draft logits zq is:
∇_{z_q} TV(p, q) = (1/2) · q ⊙ (s − E_q[s])
where s_i = sign(q_i − p_i). Several problems make this impractical for training from random weights.
Sign-only gradient direction. The TV gradient encodes only the sign of the prediction error for each token, not its magnitude. A token with probability q slightly below p receives the same gradient as one severely under-predicted. In contrast, the KL gradient q−p scales linearly with the gap, providing natural prioritization.
Small gradients at initialization. A randomly initialized draft model spreads q approximately uniformly over the vocabulary. For a vocabulary of size V with the target p concentrating on k≪V tokens, the gradient norm satisfies:
∥∇_z L_TV∥ = O(k/V)
With k ≈ 100 and V = 128,000, the norm is on the order of 10⁻⁵: essentially no useful signal at the start of training. In comparison, the KL gradient magnitude is
∥∇_z KL(p∥q)∥ = O(1/k)
which is on the order of 10⁻¹ in this case.
Non-smooth loss landscape. The TV loss contains non-differentiable points along the manifold {zq:qi=pi}, where gradients change discontinuously.
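The closed-form TV gradient above can be sanity-checked against central finite differences, away from the non-smooth set where q_i = p_i. A numpy sketch with toy sizes, not the training code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def tv_grad(p, z):
    # closed form: (1/2) * q ⊙ (s - E_q[s]),  s_i = sign(q_i - p_i)
    q = softmax(z)
    s = np.sign(q - p)
    return 0.5 * q * (s - np.dot(q, s))

rng = np.random.default_rng(0)
z = rng.normal(size=6)              # toy draft logits
p = softmax(rng.normal(size=6))     # toy target distribution

# central finite differences (valid away from the set where q_i = p_i)
eps = 1e-6
num = np.array([(tv(p, softmax(z + eps * np.eye(6)[j]))
                 - tv(p, softmax(z - eps * np.eye(6)[j]))) / (2 * eps)
                for j in range(6)])
assert np.allclose(num, tv_grad(p, z), atol=1e-5)
```

The check also makes the sign-only behavior visible: `s` carries no magnitude information, only which side of `p` each coordinate of `q` sits on.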
These properties make pure TV optimization unstable when starting from random weights. Our experiments confirm this: draft models trained with TV loss from scratch fall far short of KL baselines.
Since α is a probability, its negative logarithm is a natural training objective:
L_{LK-α}(p, q) = −log α = −log ∑_{x∈V} min(p(x), q(x))
The gradient of this loss is:
∇_{z_q} L_{LK-α} = (1/α) · ∇_{z_q} TV(p, q)
The 1/α factor is the key mechanism. When alignment is poor (low α, early training), it amplifies the otherwise negligibly small TV gradient to a magnitude matching KL's O(1/k) regime. The gradient direction follows TV throughout, so the objective directly targets the acceptance rate. Note that in the degenerate case where p is a point mass (for example, when training for a temperature = 0 setup), L_{LK-α} reduces to the standard cross-entropy loss −log q(x).
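A minimal numpy sketch of the LK-α objective and its amplified gradient, with a finite-difference check. This illustrates the formulas above and is not the SpecForge implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lk_alpha(p, z):
    """L = -log sum_i min(p_i, q_i) with q = softmax(z); returns the loss
    and its closed-form gradient (1/alpha) * grad_TV."""
    q = softmax(z)
    alpha = np.minimum(p, q).sum()
    s = np.sign(q - p)
    grad_tv = 0.5 * q * (s - np.dot(q, s))
    return -np.log(alpha), grad_tv / alpha   # 1/alpha amplifies the TV gradient

rng = np.random.default_rng(1)
z = rng.normal(size=8)              # toy draft logits
p = softmax(rng.normal(size=8))     # toy target distribution
loss, grad = lk_alpha(p, z)

# finite-difference check of the amplified gradient
eps = 1e-6
for j in range(8):
    dz = eps * np.eye(8)[j]
    num = (-np.log(np.minimum(p, softmax(z + dz)).sum())
           + np.log(np.minimum(p, softmax(z - dz)).sum())) / (2 * eps)
    assert abs(num - grad[j]) < 1e-5
```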
The second variant explicitly interpolates between KL and TV:
L_{LK-λ}(p, q) = λ · KL(p∥q) + (1 − λ) · TV(p, q)
with an adaptive weight driven by the current acceptance rate:
λ = exp(−η · sg[α]),  η > 0
where sg[·] denotes stop-gradient; the weight λ is computed independently for each draft decoding position. This schedule satisfies λ → 1 as α → 0 (early training, poor alignment) and λ → e^{−η} ≈ 0 as α → 1 (late training, near-perfect alignment). The KL component dominates at the start, providing stable gradients that bring the draft distribution close enough to the target for TV gradients to become effective. As alignment improves, the balance shifts toward direct acceptance rate optimization.
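The schedule can be sketched as follows, with sg[α] realized by simply treating α as a constant weight; η = 4.0 is an arbitrary demo value, not the paper's setting:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lk_lambda_grad(p, z, eta=4.0):
    """Hybrid gradient sketch: lam * grad_KL + (1 - lam) * grad_TV with
    lam = exp(-eta * sg[alpha]). Illustrative only."""
    q = softmax(z)
    alpha = np.minimum(p, q).sum()   # stop-gradient: used only as a weight
    lam = np.exp(-eta * alpha)
    grad_kl = q - p                  # grad of KL(p || q) w.r.t. z, q = softmax(z)
    s = np.sign(q - p)
    grad_tv = 0.5 * q * (s - np.dot(q, s))
    return lam * grad_kl + (1.0 - lam) * grad_tv, lam

p = np.array([0.97, 0.01, 0.01, 0.01])
_, lam_early = lk_lambda_grad(p, np.zeros(4))   # uniform draft: alpha small
_, lam_late = lk_lambda_grad(p, np.log(p))      # q = p exactly: alpha = 1
assert lam_early > lam_late                     # KL dominates early, TV late
```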
Modern draft models often use a truncated vocabulary, restricting the draft’s LM head to the most frequent tokens, following FR-Spec (Zhao et al., 2025). This reduces the cost of the LM head substantially, but introduces a complication for KL-based training.
When q_i = 0 (tokens outside the draft vocabulary) but p_i > 0, KL divergence is infinite. The standard workaround is to renormalize the target: p̃ = softmax(m ⊙ z_p), masking out-of-vocabulary logits. The training objective becomes KL(p̃∥q), a proxy for the true KL(p∥q), which is itself already a proxy for acceptance rate.
LK losses handle this naturally. Tokens outside the draft vocabulary contribute min(pi,0)=0 to the acceptance sum and simply drop out. No renormalization of p is required, and LK losses optimize acceptance rate with respect to the original target distribution regardless of vocabulary truncation.
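A small numerical illustration of this point, with hypothetical vocabulary sizes:

```python
import numpy as np

# Toy shapes: full vocabulary of 10 tokens, draft LM head restricted to
# the first 6. q is exactly zero outside the draft vocabulary.
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(10))                   # full target distribution
q = np.concatenate([rng.dirichlet(np.ones(6)),   # draft over its own head
                    np.zeros(4)])                # out-of-vocab mass is zero

# min(p_i, 0) = 0: out-of-vocabulary tokens drop out of alpha automatically,
# so no renormalization of p is needed
alpha = np.minimum(p, q).sum()
assert np.isclose(alpha, np.minimum(p[:6], q[:6]).sum())
```

The target's out-of-vocabulary mass is still correctly counted against the draft: it bounds α below 1 rather than being renormalized away.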
We evaluate LK losses across a broad range of configurations.
Target models: Six models across three orders of magnitude — Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct, gpt-oss-20b, gpt-oss-120b, Qwen3-235B-A22B-Instruct and DeepSeek-V3-0324 (685B).
Draft architectures: EAGLE-3, MLP Speculator, MEDUSA and DeepSeek’s native MTP module. For Llama-3.1-8B we train all four; for larger models we use EAGLE-3 and, for DeepSeek-V3, MTP fine-tuning.
Training: 660K instruction-response pairs generated from each target model using prompts from Infinity-Instruct-0625. All draft models except DeepSeek-MTP are trained from scratch for 10 epochs; DeepSeek-MTP is fine-tuned from pretrained weights for 1 epoch.
Evaluation: We use vLLM with a patch implementing correct rejection sampling at non-zero temperatures. The default vLLM implementation samples draft tokens greedily regardless of the temperature setting, which systematically underestimates acceptance rate at T>0. We evaluate on MT-bench, HumanEval and GSM8K.
Metric: Average acceptance length τ = K · (accepted tokens / drafted tokens) + 1, where K is the maximum draft length.
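A hypothetical helper (not from the paper's code) makes the metric concrete, assuming the token counts are aggregated over all decoding steps:

```python
def acceptance_length(accepted, drafted, k):
    """tau = K * (accepted / drafted) + 1, counts aggregated over all
    decoding steps (hypothetical helper for illustration)."""
    return k * accepted / drafted + 1

# K = 7 drafts per step: all accepted gives tau = 8 (7 drafts + 1 target token)
assert acceptance_length(700, 700, 7) == 8.0
# 3 of 7 accepted on average gives tau = 4
assert acceptance_length(300, 700, 7) == 4.0
```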
Architecture-agnostic improvements (Llama-3.1-8B, temperature = 1)
Both LK variants consistently outperform the KL baseline across all architectures. The hybrid objective with adaptive scheduling achieves the strongest results. Lower-capacity architectures benefit most: MEDUSA and MLP Speculator see average improvements of +7.8% and +8.3%, compared to +3.9% for EAGLE-3. This aligns with the theoretical analysis — low-capacity draft models converge further from the global optimum, where direct acceptance optimization provides the greatest leverage.
Scalability across target model sizes (EAGLE-3, temperature = 1, K = 7)
Improvements are consistent across every model. All EAGLE-3 draft models in this experiment are single-layer dense transformers, while target models range from 32 layers (Llama-3.1-8B) to 94 layers (Qwen3-235B), and include large MoE architectures. The harder the approximation task, the further the KL solution sits from the true acceptance optimum. GPT-OSS 120B achieves +7.7% and Qwen3-235B achieves +8.2%, the largest improvements among EAGLE-3 configurations.
DeepSeek-V3 ships with a pretrained MTP module, but this module was originally trained to predict only the next token and is reused autoregressively for later speculative positions. This mismatch causes sharp acceptance rate degradation beyond the first position.
Fine-tuning with KL divergence substantially improves performance, while LK loss achieves an additional +5.6% on top of this already-improved KL baseline, confirming that the benefit of direct acceptance optimization is not limited to random-initialization training.
LK-Speculators — Trained draft model weights for all six target models on HuggingFace.
Infinity-Instruct-Completions — 660K instruction-response pairs generated from our target models for draft model training.
SpecForge PR — Implementation of LK losses in the SpecForge training framework.
Our KL baselines trained from scratch on target-model-generated responses already significantly outperform the best publicly available EAGLE-3 checkpoints on HuggingFace. LK losses extend this lead further.
Standard speculative decoding training uses KL divergence as a proxy for acceptance rate. This works well when draft models can closely match the target, but introduces systematic suboptimality under capacity constraints, which is always the case in practice.
LK losses bridge this gap directly, at no additional cost. Our method applies to any draft architecture and target model, introduces no computational overhead and delivers consistent improvements across model scales from 8B to 685B parameters, with the largest gains precisely where the approximation task is hardest.
Full gradient derivations, ablation studies and per-checkpoint comparisons are available in the paper.
Contributors
Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev
Citation information
@misc{samarin2026lklosses,
  title={LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding},
  author={Samarin, Alexander and Krutikov, Sergei and Shevtsov, Anton and Skvortsov, Sergei and Fisin, Filipp and Golubev, Alexander},
  year={2026},
  eprint={2602.23881},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.23881}
}