DuoAttention: Single GPU Achieves 3.3 Million Token Context Inference
Boost long-context reasoning with DuoAttention! Reduce memory usage and enhance decoding speeds while maintaining accuracy for tasks involving millions of tokens.
DuoAttention significantly improves the efficiency of long-context reasoning by dividing the attention heads of large language models into Retrieval Heads (which require a complete KV cache) and Streaming Heads (which need only a small, fixed-size KV cache). This split sharply reduces memory consumption and speeds up both decoding and pre-filling, while maintaining accuracy on long- and short-context tasks.
Demo video: long-context reasoning over 3.3 million tokens on a single GPU.
As large language models (LLMs) are applied to an ever wider range of tasks, especially long-context scenarios that require processing massive amounts of text, reducing memory and computational costs without compromising model performance has become a pressing challenge.
To tackle this, research teams from MIT, Tsinghua University, Shanghai Jiao Tong University, the University of Edinburgh, and NVIDIA jointly proposed the DuoAttention framework.
By redesigning how the attention mechanism manages its KV cache, DuoAttention substantially lowers memory requirements and improves long-context inference efficiency without sacrificing accuracy, advancing the use of LLMs in long-context tasks.
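To make the retrieval/streaming split described above concrete, here is a minimal Python sketch of how a per-head KV cache might be pruned under the two policies. The function name, tensor layout, and the sink/recent window sizes are illustrative assumptions, not the authors' implementation; they only show the idea that retrieval heads keep the full cache while streaming heads keep a constant-size one.

```python
import torch

# Hypothetical illustration of DuoAttention's two KV-cache policies.
# The constants below are assumed values for illustration only.
SINK_TOKENS = 4       # assumed number of initial "attention sink" tokens kept
RECENT_TOKENS = 256   # assumed sliding window of most recent tokens kept


def update_kv_cache(cache_k, cache_v, new_k, new_v, is_retrieval_head):
    """Append new key/value states, then prune according to the head type.

    cache_k, cache_v: [seq_len, head_dim] tensors for a single attention head.
    new_k, new_v:     [num_new, head_dim] tensors for newly processed tokens.
    """
    cache_k = torch.cat([cache_k, new_k], dim=0)
    cache_v = torch.cat([cache_v, new_v], dim=0)

    if is_retrieval_head:
        # Retrieval heads keep the full KV cache: memory grows with context length.
        return cache_k, cache_v

    # Streaming heads keep only a constant-size cache:
    # the first few tokens plus a window of the most recent tokens.
    if cache_k.size(0) > SINK_TOKENS + RECENT_TOKENS:
        cache_k = torch.cat([cache_k[:SINK_TOKENS], cache_k[-RECENT_TOKENS:]], dim=0)
        cache_v = torch.cat([cache_v[:SINK_TOKENS], cache_v[-RECENT_TOKENS:]], dim=0)
    return cache_k, cache_v
```

Under this scheme, only the retrieval heads' caches grow with the context, so total KV memory scales with the fraction of heads identified as retrieval heads rather than with the full head count.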