Layercake: Efficient Inference Serving with Cloud and Mobile Resources
Abstract
Many mobile applications now integrate deep learning models into their core functionality. These features have diverse latency requirements yet demand high-accuracy results. Currently, mobile applications statically decide either to use in-cloud inference, relying on a fast and consistent network, or on-device execution, relying on sufficient local resources. In practice, however, neither mobile networks nor local computation resources deliver consistent performance. Consequently, when inference execution decisions are not made dynamically, mobile inference suffers variable performance and often fails to meet its performance goals. In this paper, we introduce Layercake, a deep-learning inference framework that dynamically selects the best model and location for executing each inference. Layercake accomplishes this by tracking model state and availability, both locally and in the cloud, as well as network bandwidth, allowing it to accurately estimate model response time. By doing so, Layercake meets latency targets in up to 96.4% of cases, an improvement of 16.7% over similar systems, while reducing the cost of cloud-based resources by over 68.33% compared to in-cloud inference.
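
To make the selection mechanism summarized above concrete, the following is a minimal, hypothetical Python sketch (not code from the paper): it assumes each candidate (model, location) pair is profiled for compute latency, load cost, and input size, estimates end-to-end response time from the current network bandwidth, and then picks the most accurate candidate predicted to meet the latency target. All names (`Candidate`, `estimate_response_time`, `pick_target`) are illustrative assumptions, not Layercake's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    model: str          # model variant, e.g. a small vs. a large network
    location: str       # "device" or "cloud"
    accuracy: float     # expected accuracy of this model variant
    compute_ms: float   # profiled compute latency at this location
    loaded: bool        # is the model already resident (warm) at this location?
    load_ms: float      # cost to load the model if it is cold
    input_bytes: int    # payload shipped over the network for cloud execution


def estimate_response_time(c: Candidate, bandwidth_bps: float) -> float:
    """Estimate end-to-end latency in ms: network transfer + cold-start load + compute."""
    transfer_ms = 0.0
    if c.location == "cloud":
        transfer_ms = c.input_bytes * 8 / bandwidth_bps * 1000.0
    load_ms = 0.0 if c.loaded else c.load_ms
    return transfer_ms + load_ms + c.compute_ms


def pick_target(candidates: List[Candidate],
                latency_target_ms: float,
                bandwidth_bps: float) -> Optional[Candidate]:
    """Among candidates predicted to meet the target, prefer the most accurate;
    if none qualifies, fall back to the predicted-fastest option."""
    feasible = [c for c in candidates
                if estimate_response_time(c, bandwidth_bps) <= latency_target_ms]
    if not feasible:
        return min(candidates,
                   key=lambda c: estimate_response_time(c, bandwidth_bps),
                   default=None)
    return max(feasible, key=lambda c: c.accuracy)


if __name__ == "__main__":
    candidates = [
        Candidate("small-model", "device", accuracy=0.71, compute_ms=40.0,
                  loaded=True, load_ms=300.0, input_bytes=0),
        Candidate("large-model", "cloud", accuracy=0.80, compute_ms=15.0,
                  loaded=True, load_ms=800.0, input_bytes=150_000),
    ]
    # 20 Mbps uplink, 100 ms latency target: both options are feasible here,
    # so the more accurate in-cloud model is chosen.
    choice = pick_target(candidates, latency_target_ms=100.0, bandwidth_bps=20e6)
    print(choice.model, choice.location)
```

Under these assumptions, the same logic re-run as bandwidth drops or the cloud model goes cold would shift execution back to the on-device model, which is the kind of dynamic decision the abstract describes.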