Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Conclusion, References

3 Oct 2024


(1) Yinwei Dai, Princeton University (Equal contributions);

(2) Rui Pan, Princeton University (Equal contributions);

(3) Anand Iyer, Georgia Institute of Technology;

(4) Ravi Netravali, Princeton University.

Abstract and 1 Introduction

2 Background and Motivation and 2.1 Model Serving Platforms

2.2 Early-Exit Models

2.3 Challenges

3 Design

3.1 Preparing Models with Early Exits

3.2 Accuracy-Aware Threshold Tuning

3.3 Latency-Focused Ramp Adjustments

4 Implementation

5 Evaluation and 5.1 Methodology

5.2 Overall Results

5.3 Comparison with Existing EE Strategies

5.4 Microbenchmarks

6 Additional Related Work

7 Conclusion, References, Appendix


We present Apparate, the first system that automatically injects and manages early exiting for ML inference. Key to Apparate’s ability to alleviate latency-throughput tensions in serving is its use of exiting only to return results faster (not to save compute). This provides continual feedback on exits, and powers Apparate’s novel adaptation strategies for EE ramps and thresholds. Apparate lowers median latencies by 40.5-91.5% for CV and 10.0-24.2% for NLP workloads, while meeting accuracy constraints and preserving throughput.
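The idea of exiting for fast results while still collecting feedback can be illustrated with a small sketch. This is not Apparate's implementation: the names `serve_one`, `exit_stats`, and `ExitRecord` are hypothetical, the per-ramp confidence score and the "confidence >= threshold" rule are assumptions, and agreement with the full model's output is used here as a stand-in for the accuracy constraint. The sketch assumes that the full model's output is available for every input, so each released early result can be checked against it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ExitRecord:
    released_label: int        # label returned to the client
    exit_ramp: Optional[int]   # index of the ramp that released it, None if no ramp fired
    agrees_with_final: bool    # feedback: does the released label match the full model?


def serve_one(ramp_outputs: List[Tuple[int, float]],
              final_label: int,
              thresholds: List[float]) -> ExitRecord:
    """Release the first sufficiently confident ramp result.

    Assumption: final_label (the full model's output) is always computed,
    so it can be compared against whatever was released early.
    """
    for i, ((label, confidence), threshold) in enumerate(zip(ramp_outputs, thresholds)):
        if confidence >= threshold:
            return ExitRecord(label, i, label == final_label)
    # No ramp was confident enough: the full model's result is released.
    return ExitRecord(final_label, None, True)


def exit_stats(records: List[ExitRecord]) -> Tuple[float, float]:
    """Return (fraction of inputs that exited early, fraction agreeing with the full model)."""
    exited = [r for r in records if r.exit_ramp is not None]
    exit_rate = len(exited) / len(records)
    agreement = sum(r.agrees_with_final for r in records) / len(records)
    return exit_rate, agreement


# Hypothetical data: two ramps, each output is (predicted label, confidence).
thresholds = [0.9, 0.8]
records = [
    serve_one([(1, 0.95), (1, 0.99)], final_label=1, thresholds=thresholds),  # ramp 0 fires, agrees
    serve_one([(2, 0.50), (0, 0.85)], final_label=0, thresholds=thresholds),  # ramp 1 fires, agrees
    serve_one([(3, 0.92), (1, 0.70)], final_label=1, thresholds=thresholds),  # ramp 0 fires, disagrees
    serve_one([(0, 0.10), (0, 0.20)], final_label=0, thresholds=thresholds),  # no ramp fires
]
exit_rate, agreement = exit_stats(records)
print(f"exit rate: {exit_rate:.2f}, agreement with full model: {agreement:.2f}")
```

In a sketch like this, a drop in the agreement figure is the kind of signal that could prompt raising a threshold, while a low exit rate at a ramp could suggest that the ramp is not worth its overhead.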


Table 5: Different SLOs used. All numbers are in ms, measured on the A6000.


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.