MLaaS (ML-as-a-Service) offerings by cloud computing platforms are becoming increasingly popular. Hosting pre-trained machine learning models in the cloud enables elastic scalability as demand grows, but providing low latency and low latency variance is a key requirement. Variance is harder to control in a cloud deployment due to uncertainties in resource allocation across many virtual instances. We propose the collage inference technique, which uses a novel convolutional neural network model, collage-CNN, to provide low-cost redundancy. A collage-CNN model takes a collage image formed by combining multiple images and performs multi-image classification in one shot, albeit at slightly lower accuracy. We augment a collection of traditional single-image classifier models with a single collage-CNN classifier, which acts as their low-cost redundant backup. The collage-CNN provides backup classification results if any single-image classification request experiences a slowdown. Deploying the collage-CNN models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication-based approaches while providing high accuracy, and that variation in inference latency can be reduced by 1.8x to 15x.


Computer Vision, Pattern Recognition, Distributed, Parallel and Cluster Computing,  Information Theory, Machine Learning


Deep learning is used across many fields such as live video analytics, autonomous driving, health care, data center management, and machine translation. Providing low-latency, low-variance inference is critical in these applications. On the deployment front, machine learning as a service (MLaaS) platforms (azu; goo; aws) are being introduced by many data center operators. Prediction serving, namely inference, on MLaaS platforms is attractive for scaling inference traffic. Inference requests can be served by deploying the trained models on the MLaaS platforms, and to make the prediction serving system scalable, incoming queries are distributed across multiple replicas of the trained model. As inference demand grows, an enterprise can simply add cloud instances to meet it. However, virtualized and distributed services are prone to slowdowns, which lead to high variability and a long tail in inference latency. Slowdowns and failures are more acute in cloud-based deployments because of the widespread sharing of computing, memory, and network resources (Dean & Barroso, 2013).

Existing straggler mitigation techniques can be broadly classified into three categories: replication (Dean & Barroso, 2013; Zaharia et al., 2008), approximation (Goiri et al., 2015), and coded computing (Lee et al., 2016; Li et al., 2015; 2016b). In replication-based techniques, additional resources are used to add redundancy during execution: either a task is replicated at its launch, or a task is replicated upon detection of a straggler node. Approximation techniques ignore the results from tasks running on straggler nodes. Coded computing techniques add redundancy in a coded form at the launch of tasks and have proven useful for linear computing tasks. In deep learning, several of these techniques have been studied for mitigating stragglers in the training phase. However, these solutions need to be revisited when using MLaaS for inference. For example, proactively replicating every request as a straggler mitigation strategy could lead to a significant increase in resource costs. Replicating a request reactively upon detecting a straggler, on the other hand, can increase latency.

In this work, we argue that, while prediction serving on MLaaS platforms is more prone to slowdowns, it is also more amenable to low-cost redundancy schemes. Prediction serving systems deploy a front-end load balancer that receives requests from multiple users and submits them to the back-end cloud instances. In this setting, the load balancer has the unique advantage of treating multiple requests as a single collective and creating a more cost-effective redundancy strategy. We propose the collage inference technique as a cost-effective redundancy strategy to deal with variance in inference latency. Collage inference uses a unique convolutional neural network (CNN) based coded redundancy model, referred to as a collage-CNN, that can perform multiple predictions in one shot, albeit at some reduction in prediction accuracy.
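The load balancer's collective redundancy strategy can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `serve_batch`, `s_cnn`, and `collage_cnn` are stand-in names, and toy functions simulate fast replicas, one straggler, and the collage backup.

```python
import concurrent.futures as cf
import time

def serve_batch(images, s_cnn, collage_cnn, deadline_s=0.1):
    """Dispatch each image to an s-CNN replica and, in parallel, issue one
    redundant collage-CNN request covering the whole batch. Any single-image
    prediction that misses the deadline is filled in from the collage result."""
    with cf.ThreadPoolExecutor(max_workers=len(images) + 1) as pool:
        singles = [pool.submit(s_cnn, img) for img in images]
        backup = pool.submit(collage_cnn, images)   # one backup for N requests
        done, _ = cf.wait(singles, timeout=deadline_s)
        collage_preds = backup.result()             # per-cell predictions
        return [f.result() if f in done else collage_preds[i]
                for i, f in enumerate(singles)]

# Toy demo: the replica serving image "b" is a straggler.
def fast(img): return f"s-CNN:{img}"
def slow(img): time.sleep(0.5); return f"s-CNN:{img}"
def collage(imgs): return [f"collage:{im}" for im in imgs]

replicas = {"a": fast, "b": slow, "c": fast, "d": fast}
preds = serve_batch(["a", "b", "c", "d"],
                    lambda im: replicas[im](im),
                    collage)
# The straggler's slot is served by the collage-CNN backup.
```

In this sketch the batch completes at roughly the deadline plus the collage-CNN latency, rather than waiting on the slowest replica, which is the essence of the low-cost redundancy argument above.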

Collage-CNN is like a parity model: its input encoding is a collection of images spatially arranged into a collage, as depicted in Figure 1, and its output is decoded to recover the missing predictions for images whose requests are taking too long to complete. This coded redundancy model runs concurrently as a single backup service for a collection of individual image inference models. We refer to an individual single-image inference model as an s-CNN model. In this paper, we describe the design of the collage-CNN model and demonstrate the effectiveness of collage inference on cloud deployments. The main contributions of this paper are as follows:

• We propose a collage-CNN model that performs multi-image classification as a low-cost redundant solution to mitigate slowdowns in distributed inference systems.

• We describe the design and architecture of the collage-CNN models. We describe techniques to generate large training datasets for training the collage-CNN models.

• We evaluate the collage-CNN models by deploying them in the cloud and show their effectiveness in mitigating slowdowns without compromising prediction accuracy.
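The spatial encoding behind these contributions can be illustrated with a minimal sketch: tiling N square images into a grid-shaped collage whose cell positions index back to the original requests. The function name and zero-padding choice are assumptions for illustration, not the authors' exact construction.

```python
import numpy as np

def make_collage(images):
    """Tile n square images (each h x w x c) into a g x g collage, where
    g = ceil(sqrt(n)); unused cells are zero-padded. Cell (r, col) of the
    collage corresponds to image index r * g + col, which is how a
    collage-CNN prediction is decoded back to a single-image request."""
    n = len(images)
    g = int(np.ceil(np.sqrt(n)))
    h, w, c = images[0].shape
    grid = np.zeros((g * h, g * w, c), dtype=images[0].dtype)
    for i, img in enumerate(images):
        r, col = divmod(i, g)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = img
    return grid

# Four 32x32 "images", each filled with its own index value.
imgs = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(4)]
collage = make_collage(imgs)  # a 64x64x3 collage; image 3 sits in cell (1, 1)
```

Because each image occupies a fixed cell, the collage-CNN's per-cell classification output maps directly back to the straggling request it backs up.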

We demonstrate that collage-CNN models can reduce 99th percentile latency by 1.2x to 2x compared to alternate approaches. The rest of the paper is organized as follows. Section 2 provides background and motivation. Section 3 describes the collage inference technique. Section 4 describes the model architectures and implementation. Section 5 provides experimental evaluations and a design space exploration, Section 6 discusses related work, and Section 7 draws conclusions.


Cloud-based prediction serving systems are increasingly being used to serve image classification requests, and serving those requests at low latency, high accuracy, and low resource cost is very important. In this paper, we described collage inference, in which a coded redundancy model is used to reduce the tail latency of inference while maintaining high accuracy. Collage inference uses novel collage-CNN models to provide runtime recovery from slowdowns, offering a good tradeoff between accuracy, resource cost, and tail latency. Deploying the models in the cloud, we demonstrate that the 99th percentile latency can be reduced by up to 2x compared to replication-based approaches while maintaining high prediction accuracy. We conclude that collage inference is a promising approach to mitigating stragglers in distributed inference. Our future work includes extending the coded redundancy approach to more deep learning applications.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. The members of the community had formed a virtual group on the Viber social network, and the core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

The KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for all people across the world. We want to integrate research and education. We believe education is a fundamental human right, so our services concentrate on inclusive education.

The KSRA team partners with underserved local communities around the world to improve access to and the quality of knowledge-based education, to amplify and augment learning programs where they exist, and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Krishna Giri Narra, Zhifeng Lin, Ganesh Ananthanarayanan, Salman Avestimehr, Murali Annavaram







Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship. He holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.