Table of Links
2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
6 Evaluation and 6.1 Experimental Setup
6.3 Parallel Sampling and Beam Search
10 Conclusion, Acknowledgement and References
2.2 LLM Service & Autoregressive Generation
Once trained, LLMs are often deployed as a conditional generation service (e.g., completion API [34] or chatbot [19, 35]). A request to an LLM service provides a list of input prompt tokens (π₯1, . . . , π₯π), and the LLM service generates a list of output tokens (π₯π+1, . . . , π₯π+π ) according to Eq. 1. We refer to the concatenation of the prompt and output lists as sequence.
Due to the decomposition in Eq. 1, the LLM can only sample and generate new tokens one by one, and the generation process of each new token depends on all the previous tokens in that sequence, specifically their key and value vectors. In this sequential generation process, the key and value vectors of existing tokens are often cached for generating future tokens, known as KV cache. Note that the KV cache of one token depends on all its previous tokens. This means that the KV cache of the same token appearing at different positions in a sequence will be different.
Given a request prompt, the generation computation in the LLM service can be decomposed into two phases:
The prompt phase takes the whole user prompt (π₯1, . . . , π₯π) as input and computes the probability of the first new token π (π₯π+1 | π₯1, . . . , π₯π). During this process, also generates the key vectors π1, . . . , ππ and value vectors π£1, . . . , π£π. Since prompt tokens π₯1, . . . , π₯π are all known, the computation of the prompt phase can be parallelized using matrixmatrix multiplication operations. Therefore, this phase can efficiently use the parallelism inherent in GPUs.
The autoregressive generation phase generates the remaining new tokens sequentially. At iteration π‘, the model takes one token π₯π+π‘ as input and computes the probability π (π₯π+π‘+1 | π₯1, . . . , π₯π+π‘) with the key vectors π1, . . . , ππ+π‘ and value vectorsπ£1, . . . , π£π+π‘ . Note that the key and value vectors at positions 1 to π + π‘ β 1 are cached at previous iterations, only the new key and value vector ππ+π‘ and π£π+π‘ are computed at this iteration. This phase completes either when the sequence reaches a maximum length (specified by users or limited by LLMs) or when an end-of-sequence (<eos>) token is emitted. The computation at different iterations cannot be parallelized due to the data dependency and often uses matrix-vector multiplication, which is less efficient. As a result, this phase severely underutilizes GPU computation and becomes memory-bound, being responsible for most portion of the latency of a single request.
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Woosuk Kwon, UC Berkeley with Equal contribution;
(2) Zhuohan Li, UC Berkeley with Equal contribution;
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Cody Hao Yu, Independent Researcher;
(8) Joseph E. Gonzalez, UC Berkeley;
(9) Hao Zhang, UC San Diego;
(10) Ion Stoica, UC Berkeley.