LLM Service & Autoregressive Generation: What This Means

cover
14 Dec 2024

Abstract and 1 Introduction

2 Background and 2.1 Transformer-Based Large Language Models

2.2 LLM Service & Autoregressive Generation

2.3 Batching Techniques for LLMs

3 Memory Challenges in LLM Serving

3.1 Memory Management in Existing Systems

4 Method and 4.1 PagedAttention

4.2 KV Cache Manager

4.3 Decoding with PagedAttention and vLLM

4.4 Application to Other Decoding Scenarios

4.5 Scheduling and Preemption

4.6 Distributed Execution

5 Implementation

6 Evaluation and 6.1 Experimental Setup

6.2 Basic Sampling

6.3 Parallel Sampling and Beam Search

6.4 Shared prefix

6.5 Chatbot

7 Ablation Studies

8 Discussion

9 Related Work

10 Conclusion, Acknowledgement and References

2.2 LLM Service & Autoregressive Generation

Once trained, LLMs are often deployed as a conditional generation service (e.g., completion API [34] or chatbot [19, 35]). A request to an LLM service provides a list of input prompt tokens (π‘₯1, . . . , π‘₯𝑛), and the LLM service generates a list of output tokens (π‘₯𝑛+1, . . . , π‘₯𝑛+𝑇 ) according to Eq. 1. We refer to the concatenation of the prompt and output lists as sequence.

Due to the decomposition in Eq. 1, the LLM can only sample and generate new tokens one by one, and the generation process of each new token depends on all the previous tokens in that sequence, specifically their key and value vectors. In this sequential generation process, the key and value vectors of existing tokens are often cached for generating future tokens, known as KV cache. Note that the KV cache of one token depends on all its previous tokens. This means that the KV cache of the same token appearing at different positions in a sequence will be different.

Given a request prompt, the generation computation in the LLM service can be decomposed into two phases:

The prompt phase takes the whole user prompt (π‘₯1, . . . , π‘₯𝑛) as input and computes the probability of the first new token 𝑃 (π‘₯𝑛+1 | π‘₯1, . . . , π‘₯𝑛). During this process, also generates the key vectors π‘˜1, . . . , π‘˜π‘› and value vectors 𝑣1, . . . , 𝑣𝑛. Since prompt tokens π‘₯1, . . . , π‘₯𝑛 are all known, the computation of the prompt phase can be parallelized using matrixmatrix multiplication operations. Therefore, this phase can efficiently use the parallelism inherent in GPUs.

The autoregressive generation phase generates the remaining new tokens sequentially. At iteration 𝑑, the model takes one token π‘₯𝑛+𝑑 as input and computes the probability 𝑃 (π‘₯𝑛+𝑑+1 | π‘₯1, . . . , π‘₯𝑛+𝑑) with the key vectors π‘˜1, . . . , π‘˜π‘›+𝑑 and value vectors𝑣1, . . . , 𝑣𝑛+𝑑 . Note that the key and value vectors at positions 1 to 𝑛 + 𝑑 βˆ’ 1 are cached at previous iterations, only the new key and value vector π‘˜π‘›+𝑑 and 𝑣𝑛+𝑑 are computed at this iteration. This phase completes either when the sequence reaches a maximum length (specified by users or limited by LLMs) or when an end-of-sequence (<eos>) token is emitted. The computation at different iterations cannot be parallelized due to the data dependency and often uses matrix-vector multiplication, which is less efficient. As a result, this phase severely underutilizes GPU computation and becomes memory-bound, being responsible for most portion of the latency of a single request.

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Woosuk Kwon, UC Berkeley with Equal contribution;

(2) Zhuohan Li, UC Berkeley with Equal contribution;

(3) Siyuan Zhuang, UC Berkeley;

(4) Ying Sheng, UC Berkeley and Stanford University;

(5) Lianmin Zheng, UC Berkeley;

(6) Cody Hao Yu, Independent Researcher;

(7) Cody Hao Yu, Independent Researcher;

(8) Joseph E. Gonzalez, UC Berkeley;

(9) Hao Zhang, UC San Diego;

(10) Ion Stoica, UC Berkeley.