¹UC Berkeley  ²Apple  ³ICSI  ⁴LBNL
*Equal contribution
TL;DR: We propose Arbitrage, a step-level speculative decoding framework that routes between draft and target LLMs based on expected quality advantage. This achieves ~2× latency reduction on math reasoning benchmarks at matched accuracy.
Modern LLMs achieve impressive reasoning capabilities through long Chain-of-Thought generation, but this comes at substantial computational cost. Speculative Decoding (SD) offers a solution by using a fast draft model to propose tokens, which are then verified by a larger target model.
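To make the draft-and-verify idea concrete, here is a minimal sketch of the standard token-level acceptance rule used in speculative decoding. The toy distributions and the `accept_or_resample` helper are illustrative only, not code from the paper; real systems would use the draft and target models' actual next-token probabilities.

```python
# Toy sketch of the standard token-level speculative-decoding acceptance rule.
# The 4-token "vocabulary" and its probabilities are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(p_draft: np.ndarray, p_target: np.ndarray, token: int) -> int:
    """Accept the drafted `token` with prob min(1, p_target/p_draft); otherwise
    resample from the residual distribution max(0, p_target - p_draft)."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token  # keep the draft token; the target distribution is preserved in expectation
    residual = np.clip(p_target - p_draft, 0.0, None)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))

# Example: draft proposes a token, target distribution decides whether it survives.
p_draft = np.array([0.70, 0.10, 0.10, 0.10])
p_target = np.array([0.40, 0.30, 0.20, 0.10])
drafted = int(rng.choice(4, p=p_draft))
print("drafted:", drafted, "-> final:", accept_or_resample(p_draft, p_target, drafted))
```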
However, traditional token-level SD struggles on reasoning tasks: minor token mismatches in semantically equivalent reasoning steps lead to unnecessary rejections. Recent step-level methods improve on this by accepting or rejecting entire reasoning steps, but they still regenerate many rejected steps that the target model barely improves, wasting valuable compute.
Existing step-level methods such as RSD use a fixed quality threshold: if a draft step's score falls below the threshold, the step is regenerated with the target model. This rule is advantage-blind: it never asks whether the target model would actually produce a better step.
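For illustration, here is a minimal sketch of such a fixed-threshold rule. The `draft_step`, `score_step` (e.g., a process reward model), and `target_step` callables are hypothetical stand-ins, and the threshold value is made up; this is not RSD's exact scoring function.

```python
# Hedged sketch of fixed-threshold (advantage-blind) step routing.
# draft_step, score_step, and target_step are hypothetical callables standing in
# for the draft model, a step-quality scorer, and the target model.

THRESHOLD = 0.7  # illustrative value, not taken from the paper

def route_step_threshold(context: str, draft_step, score_step, target_step) -> str:
    step = draft_step(context)            # cheap proposal from the draft model
    if score_step(context, step) >= THRESHOLD:
        return step                       # good enough: keep the draft step
    return target_step(context)           # below threshold: regenerate with the target,
                                          # even if the target would not do better
```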
Our key insight is simple: only invoke the target model when it's expected to provide a meaningfully better step than the draft. We call this advantage-aware routing.
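A minimal sketch of advantage-aware routing under the same assumptions as above: `predict_target_score` is a hypothetical stand-in for whatever estimator predicts how well the target model would do on this step without actually running it, and the margin is illustrative, not the paper's setting.

```python
# Hedged sketch of advantage-aware step routing: invoke the target model only
# when its expected advantage over the draft step exceeds a margin.

MARGIN = 0.1  # illustrative: minimum expected improvement worth paying the target's cost for

def route_step_advantage(context: str, draft_step, score_step,
                         predict_target_score, target_step) -> str:
    step = draft_step(context)
    draft_score = score_step(context, step)
    expected_advantage = predict_target_score(context) - draft_score
    if expected_advantage <= MARGIN:
        return step                 # the target is unlikely to do meaningfully better
    return target_step(context)     # pay the target's cost only where it should help
```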
Arbitrage introduces two key components:
We evaluate Arbitrage across multiple mathematical reasoning benchmarks using LLaMA and Qwen model families, in three routing settings:

- a small draft model (1B) routing to a larger target (8B),
- an 8B draft routing to a 70B target, and
- a quantized draft model routing to its full-precision target.

Wall-clock time comparisons show that Arbitrage achieves better accuracy-latency trade-offs.
@misc{maheswaran2025arbitrageefficientreasoningadvantageaware,
      title={Arbitrage: Efficient Reasoning via Advantage-Aware Speculation},
      author={Monishwaran Maheswaran and Rishabh Tiwari and Yuezhou Hu and Kerem Dilmen and Coleman Hooper and Haocheng Xi and Nicholas Lee and Mehrdad Farajtabar and Michael W. Mahoney and Kurt Keutzer and Amir Gholami},
      year={2025},
      eprint={2512.05033},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.05033},
}