Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Conclusion, References

3 Oct 2024

Authors:

(1) Yinwei Dai, Princeton University (Equal contributions);

(2) Rui Pan, Princeton University (Equal contributions);

(3) Anand Iyer, Georgia Institute of Technology;

(4) Ravi Netravali, Princeton University.

Abstract and 1 Introduction

2 Background and Motivation and 2.1 Model Serving Platforms

2.2 Early-Exit Models

2.3 Challenges

3 Design

3.1 Preparing Models with Early Exits

3.2 Accuracy-Aware Threshold Tuning

3.3 Latency-Focused Ramp Adjustments

4 Implementation

5 Evaluation and 5.1 Methodology

5.2 Overall Results

5.3 Comparison with Existing EE Strategies

5.4 Microbenchmarks

6 Additional Related Work

7 Conclusion, References, Appendix

7 CONCLUSION

We present Apparate, the first system that automatically injects and manages early exiting for ML inference. Key to Apparate's ability to alleviate latency-throughput tensions in serving is its use of exiting only for fast results (not compute savings). This provides continual feedback on exits and powers Apparate's novel adaptation strategies for EE ramps and thresholds. Apparate lowers median latencies by 40.5-91.5% for CV and 10.0-24.2% for NLP workloads, while meeting accuracy constraints and preserving throughputs.
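To ground the mechanism the conclusion summarizes, below is a minimal PyTorch sketch of confidence-thresholded early exiting in the spirit of Apparate's design: a ramp emits a fast result when its confidence clears a threshold, but the full model still runs to completion, so every exit can be audited against the final output (the continual feedback that drives threshold and ramp adaptation). The names here (ExitRamp, EarlyExitNet) and the single fixed threshold are illustrative assumptions, not Apparate's actual API.

```python
# Minimal sketch of confidence-thresholded early exiting in the spirit of
# Apparate. ExitRamp/EarlyExitNet and the fixed threshold are hypothetical,
# not Apparate's actual implementation.
import torch
import torch.nn as nn

class ExitRamp(nn.Module):
    """Lightweight classifier attached to an intermediate activation."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # collapse spatial dims
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))

class EarlyExitNet(nn.Module):
    """Backbone with ramps at chosen block indices. A ramp's prediction is
    emitted early once its confidence clears the threshold, but the full
    model still runs, so the exit can be checked against the final output."""
    def __init__(self, blocks, ramps, final_head, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.ramps = nn.ModuleDict(ramps)              # {"block_idx": ExitRamp}
        self.final_head = final_head
        self.threshold = threshold                     # Apparate tunes thresholds per ramp

    @torch.no_grad()
    def forward(self, x):
        early = None
        for i, block in enumerate(self.blocks):
            x = block(x)
            key = str(i)
            if early is None and key in self.ramps:
                logits = self.ramps[key](x)
                conf = logits.softmax(-1).max(-1).values
                if bool((conf >= self.threshold).all()):
                    early = logits                     # fast result would be served here
        final = self.final_head(x)                     # computation continues regardless
        # Feedback signal: did the exit agree with the unmodified model?
        agreed = early is not None and bool(
            (early.argmax(-1) == final.argmax(-1)).all())
        return final, early, agreed

# Toy usage: two conv blocks, one ramp after the first.
blocks = [nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(16, 32, 3, padding=1)]
ramps = {"0": ExitRamp(16, num_classes=10)}
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
model = EarlyExitNet(blocks, ramps, head).eval()
final, early, agreed = model(torch.randn(1, 3, 32, 32))
```

In Apparate proper, thresholds are tuned per ramp under an accuracy constraint and ramps are added, moved, or removed at runtime; this sketch fixes both for brevity.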

REFERENCES

[1] Apache TVM: An End to End Machine Learning Compiler Framework for CPUs, GPUs and accelerators. https://tvm.apache.org/.

[2] Neural Network Exchange Format (NNEF). https://www.khronos.org/nnef/.

[3] NVIDIA TensorRT: Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt.

[4] NVIDIA Triton Inference Server. https://developer.nvidia.com/nvidia-triton-inference-server.

[5] ONNX Runtime. https://github.com/microsoft/onnxruntime.

[6] Open Neural Network Exchange (ONNX). https://onnx.ai/.

[7] PyTorch. https://pytorch.org/.

[8] The Yelp Reviews Dataset. https://www.yelp.com/dataset.

[9] TorchServe. https://pytorch.org/serve/.

[10] Web data: Amazon reviews. https://snap.stanford.edu/data/web-Amazon.html.

[11] N. Agarwal and R. Netravali. Boggart: Towards General-Purpose acceleration of retrospective video analytics. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 933–951, Boston, MA, Apr. 2023. USENIX Association.

[12] S. Ahmed, A. R. Chowdhury, K. Fawaz, and P. Ramanathan. Preech: A system for Privacy-Preserving speech transcription. In 29th USENIX Security Symposium (USENIX Security 20), pages 2703–2720. USENIX Association, Aug. 2020.

[13] R. Bhardwaj, Z. Xia, G. Ananthanarayanan, J. Jiang, Y. Shu, N. Karianakis, K. Hsieh, P. Bahl, and I. Stoica. Ekya: Continuous learning of video analytics models on edge compute servers. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 119–135, Renton, WA, Apr. 2022. USENIX Association.

[14] T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama. Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527–536. PMLR, 2017.

[15] Y. Choi, Y. Kim, and M. Rhu. Lazy batching: An SLA-aware batching system for cloud machine learning inference. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 493–506, Los Alamitos, CA, USA, Mar. 2021. IEEE Computer Society.

[16] D. Crankshaw, G.-E. Sela, X. Mo, C. Zumar, I. Stoica, J. Gonzalez, and A. Tumanov. InferLine: Latency-aware provisioning and scaling for prediction serving pipelines. In Proceedings of the 11th ACM Symposium on Cloud Computing, SoCC '20, page 477–491, New York, NY, USA, 2020. Association for Computing Machinery.

[17] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A Low-Latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 613–627, Boston, MA, Mar. 2017. USENIX Association.

[18] W. Cui, Z. Han, L. Ouyang, Y. Wang, N. Zheng, L. Ma, Y. Yang, F. Yang, J. Xue, L. Qiu, L. Zhou, Q. Chen, H. Tan, and M. Guo. Optimizing dynamic neural networks with brainstorm. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 797–815, Boston, MA, July 2023. USENIX Association.

[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, page arXiv:1810.04805, Oct. 2018.

[20] Gigaspaces. Amazon Found Every 100ms of Latency Cost them 1% in Sales. https://www.gigaspaces.com/blog/amazon-found-every-100ms-of-latency-cost-them-1-in-sales, 2023.

[21] A. Gujarati, S. Elnikety, Y. He, K. S. McKinley, and B. B. Brandenburg. Swayam: Distributed autoscaling to meet slas of machine learning inference services with resource efficiency. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, Middleware ’17, page 109–120, New York, NY, USA, 2017. Association for Computing Machinery.

[22] A. Gujarati, R. Karimi, S. Alzayat, W. Hao, A. Kaufmann, Y. Vigfusson, and J. Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 443–462. USENIX Association, Nov. 2020.

[23] P. Guo, B. Hu, and W. Hu. Mistify: Automating DNN model porting for On-Device inference at the edge. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 705–719. USENIX Association, Apr. 2021.

[24] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. Applied machine learning at facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 620–629, 2018.

[26] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[27] K. Hsieh, G. Ananthanarayanan, P. Bodik, S. Venkataraman, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, page 269–286, USA, 2018. USENIX Association.

[28] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense networks for resource efficient image classification, 2018.

[29] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pages 103–112, 2019.

[30] HuggingFace. Pretrained Models. https://huggingface.co/transformers/v3.3.1/pretrained_models.html, 2023.

[31] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang. Analysis of large-scale multitenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 947–960, Renton, WA, July 2019. USENIX Association.

[32] M. Jeon, S. Venkataraman, J. Qian, A. Phanishayee, W. Xiao, and F. Yang. Multi-tenant gpu clusters for deep learning workloads: Analysis and implications. Technical report, Microsoft Research, 2018.

[33] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica. Chameleon: Scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18, page 253–266, New York, NY, USA, 2018. Association for Computing Machinery.

[34] Y. Kaya, S. Hong, and T. Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In International conference on machine learning, pages 3301–3310. PMLR, 2019.

[35] M. Khani, G. Ananthanarayanan, K. Hsieh, J. Jiang, R. Netravali, Y. Shu, M. Alizadeh, and V. Bahl. RECL: Responsive Resource-Efficient continuous learning for video analytics. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 917–932, Boston, MA, Apr. 2023. USENIX Association.

[36] W. Liu, P. Zhou, Z. Wang, Z. Zhao, H. Deng, and Q. Ju. FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6035–6044, Online, July 2020. Association for Computational Linguistics.

[37] P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, G.-Y. Wei, and C.-J. Wu. Mlperf: An industry standard benchmark suite for machine learning performance. IEEE Micro, 40(2):8–16, 2020.

[38] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, pages 1–15, New York, NY, USA, 2019. Association for Computing Machinery.

[39] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke. Tensorflow-serving: Flexible, high-performance ml serving, 2017.

[40] A. Padmanabhan, N. Agarwal, A. P. Iyer, G. Ananthanarayanan, Y. Shu, N. Karianakis, G. H. Xu, and R. Netravali. GEMEL: Model merging for memory-efficient, real-time video analytics at the edge. CoRR, abs/2201.07705, 2022.

[41] A. Pal, A. Barigidad, and A. Mustafi. Imdb movie reviews dataset, 2020.

[42] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, et al. Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886, 2018.

[43] PyTorch. Model Zoo. https://pytorch.org/serve/model_zoo.html, 2023.

[44] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis. INFaaS: Automated model-less inference serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 397–411, 2021.

[45] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[46] R. Schwartz, G. Stanovsky, S. Swayamdipta, J. Dodge, and N. A. Smith. The right tool for the job: Matching model and instance complexities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Online, July 2020. Association for Computational Linguistics.

[47] B. Selman and C. P. Gomes. Hill-climbing search. Encyclopedia of cognitive science, 81:82, 2006.

[48] J. Sevilla, P. Villalobos, and J. Cerón. Parameter counts in Machine Learning. https://www.lesswrong.com/posts/GzoWcYibWYwJva8aL/parameter-counts-in-machine-learning, 2021.

[49] H. Shen, L. Chen, Y. Jin, L. Zhao, B. Kong, M. Philipose, A. Krishnamurthy, and R. Sundaram. Nexus: A gpu cluster engine for accelerating dnn-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, page 322–337, New York, NY, USA, 2019. Association for Computing Machinery.

[50] C. Sima, Y. Fu, M.-K. Sit, L. Guo, X. Gong, F. Lin, J. Wu, Y. Li, H. Rong, P.-L. Aublin, and L. Mai. Ekko: A Large-Scale deep learning recommender system with Low-Latency model update. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 821–839, Carlsbad, CA, July 2022. USENIX Association.

[51] A. Suprem, J. Arulraj, C. Pu, and J. Ferreira. Odin: Automated drift detection and recovery in video analytics. Proc. VLDB Endow., 13(12):2453–2465, July 2020.

[52] S. Tanwar, S. Tyagi, I. Budhiraja, and N. Kumar. Tactile internet for autonomous vehicles: Latency and reliability analysis. IEEE Wireless Communications, 26(4):66–72, 2019.

[53] S. Teerapittayanon, B. McDanel, and H. T. Kung. Branchynet: Fast inference via early exiting from deep neural networks, 2017.

[54] Think with Google. The Need for Mobile Speed: How Mobile Latency Impacts Publisher Revenue. https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-speed-latency-impacts-publisher-revenue/, 2017.

[55] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang. Machine learning at facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 331–344, 2019.

[56] K. Xie, S. Lu, M. Wang, and Z. Wang. Elbert: Fast albert with confidence-window based early exit. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7713–7717, 2021.

[57] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin. Deebert: Dynamic early exiting for accelerating bert inference, 2020.

[58] J. Xin, R. Tang, Y. Yu, and J. Lin. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Online, Apr. 2021. Association for Computational Linguistics.

[59] S.-y. Chang, B. Li, D. J. Rybach, W. Li, Y. R. He, T. N. Sainath, and T. D. Strohman. Low latency speech recognition using end-to-end prefetching. In Interspeech 2020, 2020.

[60] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association.

[61] C. Zhang, L. Ma, J. Xue, Y. Shi, Z. Miao, F. Yang, J. Zhai, Z. Yang, and M. Yang. Cocktailer: Analyzing and optimizing dynamic control flow in deep learning. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 681–699, Boston, MA, July 2023. USENIX Association.

[62] C. Zhang, M. Yu, W. Wang, and F. Yan. MArk: Exploiting cloud services for Cost-Effective, SLO-Aware machine learning inference serving. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 1049–1062, Renton, WA, July 2019. USENIX Association.

[63] H. Zhang, Y. Tang, A. Khandelwal, and I. Stoica. SHEPHERD: Serving DNNs in the wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 787–808, Boston, MA, Apr. 2023. USENIX Association.

[64] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei. Bert loses patience: Fast and robust inference with early exit. In Advances in Neural Information Processing Systems, volume 33, pages 18330–18341. Curran Associates, Inc., 2020.

A APPENDIX

Table 5: Different SLOs used. All numbers are in ms, measured on the A6000.

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.