Performance Analysis of Distributed and Scalable Deep Learning (DL)

With renewed global interest in Artificial Intelligence (AI) methods, the past decade has seen a myriad of new programming models and tools that enable better and faster Machine Learning (ML). More recently, a subset of ML known as Deep Learning (DL) has raised increased interest due to its inherent ability to tackle novel cognitive computing applications efficiently. DL allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction in an automated way, and can deliver higher predictive accuracy when trained on larger data sets. Based on Artificial Neural Networks (ANNs), DL is now at the core of state-of-the-art voice recognition systems (which enable easy control over, e.g., Internet-of-Things (IoT) smart home appliances), self-driving car engines, and online recommendation systems. The ecosystem of DL frameworks is evolving fast, as are the DL architectures that are shown to perform well on specialized tasks and to exploit GPU accelerators. For this reason, frequent performance evaluation of the DL ecosystem is required, especially since the advent of novel distributed training frameworks such as Horovod, which allow for scalable training across multiple computing resources. In this paper, a scalability evaluation of the reference DL frameworks (TensorFlow, Keras, MXNet, and PyTorch) is performed over up-to-date High-Performance Computing (HPC) resources to compare the efficiency of different implementations across several hardware architectures (CPU and GPU). Experimental results demonstrate that the DistributedDataParallel features of the PyTorch library seem to be the most efficient for distributing the training process across many devices, reaching a throughput speedup of 10.11 when using 12 NVIDIA Tesla V100 GPUs to train ResNet44 on the CIFAR10 dataset.

  • IEEE Keywords

    • Training,
    • Computational modeling,
    • C++ languages,
    • Graphics processing units,
    • Biological system modeling,
    • Machine learning,
    • Data models
  • Controlled Indexing

    • artificial intelligence,
    • cognitive systems,
    • graphics processing units,
    • Internet of Things,
    • learning (artificial intelligence),
    • neural nets,
    • parallel processing,
    • performance evaluation
  • Non-Controlled Indexing

    • artificial neural networks,
    • voice recognition,
    • online recommendation systems,
    • DL frameworks,
    • GPU accelerators,
    • performance evaluation,
    • DL ecosystem,
    • distributed training frameworks,
    • multiple computing resources,
    • high-performance computing resources,
    • hardware architectures,
    • renewed global interest,
    • artificial intelligence,
    • AI,
    • programming models,
    • ML,
    • cognitive computing applications,
    • computational models,
    • Resnet44,
    • Internet of Things smart home appliances


The past decade has seen a renewed interest in Artificial Intelligence (AI), often presented as the study of "intelligent agents", i.e., any device that perceives its environment, learns from it, and takes actions based on this knowledge to maximize its chance of successfully achieving its goals [1]. A myriad of new programming models and tools were developed to enable better and faster Machine Learning (ML), and more precisely a subset of this field known as Deep Learning (DL). Based on Artificial Neural Networks (ANNs), DL has emerged as an effective analysis method loosely inspired by the biological neural networks that constitute animal brains. DL allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction in an automated way, and can deliver higher predictive accuracy when trained on larger data sets. This approach proved to be particularly well suited to providing new solutions to cognitive problems tied to image, sound, and text analysis. In practice, DL is nowadays at the core of state-of-the-art voice recognition and Natural Language Processing (NLP) systems (which enable easy control over Internet of Things (IoT) smart home appliances, for instance), self-driving car engines, and online recommendation systems. More generally, domain scientists are embracing DL both as a standalone data science method and as an effective approach to reducing dimensionality when solving problems in academia and in industry. However, training deep neural networks, which often contain hundreds of thousands, if not millions, of parameters, can be extremely computationally expensive (for example, transformers [2]). For this reason, and besides the quest for algorithmic innovation, the application of DL paradigms is limited by the available computational resources. Indeed, the resources needed for training increase further with the additional data required to fit larger models.
Exploiting the parallelism in training data (i.e., the batch) is a sound approach, as the quality and quantity of data are reported to be more important for the accuracy of a statistical model than the classifier's design [3]. For the specific case of DL, similar observations were made: more data improves the generalization performance of the model [4]. Therefore, it is natural to attempt to distribute training over multiple processors where possible. This explains the observed convergence between DL and High-Performance Computing (HPC), as supercomputers demonstrate an unparalleled capacity to reduce DL training time from days to minutes by distributing training over HPC clusters, and more specifically over GPU-accelerated resources.

There are multiple ways of approaching distributed training, with the simplest and most common being data parallelism [5]. This approach is implemented in a number of common DL frameworks, and their efficiency was compared by training ResNet models, a common architecture, on CIFAR10, a well-known dataset in computer vision. Other approaches are proposed in the literature, often related to the distribution of computed gradients across model instances. In this context, this paper reports on the scalability of different distributed data-parallelism frameworks.
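To make the data-parallel pattern concrete, the following toy NumPy sketch (our own illustration, not code from any of the evaluated frameworks) splits one batch across simulated workers, computes per-worker gradients of a linear least-squares model, and averages them. The averaging step is exactly what an allreduce achieves across real devices:

```python
import numpy as np

# Toy data-parallel SGD step for a linear model (illustrative only; the
# evaluated frameworks hide this behind APIs such as DistributedDataParallel).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))               # one global batch of 64 samples
y = X @ np.array([1.0, -2.0, 0.5])         # targets from a known weight vector
w0 = np.zeros(3)                           # model replica, identical on each worker

n_workers = 4
shards = np.array_split(np.arange(64), n_workers)  # batch split across workers

def local_grad(idx, w):
    """Mean-squared-error gradient computed on one worker's shard."""
    Xs, ys = X[idx], y[idx]
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(idx)

# Each worker computes a gradient on its shard; averaging the shard gradients
# (the role of the allreduce) recovers the full-batch gradient, so every
# replica applies the same update and the models stay synchronized.
avg_grad = np.mean([local_grad(idx, w0) for idx in shards], axis=0)
full_grad = 2.0 * X.T @ (X @ w0 - y) / len(y)
assert np.allclose(avg_grad, full_grad)
w = w0 - 0.01 * avg_grad                   # identical SGD update on every replica
```

Since the shards are equally sized, the mean of the per-worker gradients equals the full-batch gradient, which is why this scheme leaves the model mathematically unchanged while distributing the compute.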

This paper is organized as follows: Section 2 details the background of this work, reviews the main DL frameworks, and surveys the approaches proposed in the literature to enable scalable and distributed DL. Implementation details of the performance evaluation are provided in Section 3, which also reviews the experimental setup used to conduct this study. The experimental results obtained are discussed in Section 4. Section 5 reviews the related work. Finally, Section 6 concludes the paper and provides some future directions and perspectives opened by this study.


This paper presented the scalability evaluation of a selected set of reference DL frameworks (TensorFlow, Keras, MXNet, and PyTorch) over up-to-date HPC resources to compare the efficiency of different implementations across several hardware architectures (CPU and GPU). Whereas each of these frameworks features a native MPI-based training distribution capability, the evaluation of the Horovod [9] paradigm, which relies on a novel and efficient distributed training algorithm, is proposed. When used in combination with the above-mentioned DL frameworks, Horovod is claimed to allow for faster and easier training across multiple nodes and GPUs. With the objective of asserting this claim, this article expounds a performance evaluation campaign performed using the reference CIFAR10 [15] dataset. While some aspects of the results found are likely to be specific to this particular neural network, it appears that, for problems of this type, the DistributedDataParallel features of the PyTorch library are the most efficient of the frameworks considered for distributing the training process across many devices. TensorFlow with Horovod also scales well and was easy to use, but it appears that some of the shortcuts taken to improve throughput had a negative effect on test accuracy. The other frameworks tried generally did not scale as well as these two. It should be noted that, for both PyTorch and TensorFlow, some of the common changes to hyperparameters, such as increasing the batch size, caused the time-to-accuracy results to stop improving once too many processors were used, at least in this case. Similar results were found when using CPUs and GPUs. In the case of the former, little improvement was seen in the time-to-accuracy results once more than 56 CPUs were used. However, despite being lower than in the GPU case in general, the speedup in throughput when more CPUs were used was very close to ideal, likely because the proportion of time spent on communication was lower.
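Horovod's communication efficiency comes from the ring-allreduce pattern, in which each of n workers exchanges only 1/n of the gradient tensor per step with its ring neighbor. The following single-process NumPy simulation is our own sketch of that pattern, for illustration only (the function name and structure are assumptions, not Horovod's implementation):

```python
import numpy as np

def ring_allreduce(tensors):
    """Single-process simulation of a ring-allreduce over len(tensors) workers.

    Each tensor is split into n chunks. In n-1 reduce-scatter steps every
    worker passes one chunk to its ring neighbor and accumulates the chunk it
    receives; in n-1 further allgather steps the fully reduced chunks are
    circulated. Per step a worker sends only 1/n of the data, which is what
    makes the ring pattern bandwidth-efficient for large gradient tensors.
    """
    n = len(tensors)
    chunks = [list(np.array_split(t.astype(float).copy(), n)) for t in tensors]
    # Reduce-scatter: afterwards worker r holds the full sum of chunk (r+1) % n.
    for s in range(n - 1):
        for r in range(n):
            idx = (r - s) % n              # chunk worker r sends at this step
            chunks[(r + 1) % n][idx] = chunks[(r + 1) % n][idx] + chunks[r][idx]
    # Allgather: circulate the reduced chunks until every worker has them all.
    for s in range(n - 1):
        for r in range(n):
            idx = (r + 1 - s) % n          # reduced chunk worker r forwards
            chunks[(r + 1) % n][idx] = chunks[r][idx].copy()
    return [np.concatenate(c) for c in chunks]

# Every worker ends up with the element-wise sum of all workers' gradients.
rng = np.random.default_rng(1)
grads = [rng.normal(size=10) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

Because the per-step message size shrinks as 1/n while the number of steps grows as 2(n-1), the total bytes sent per worker stay roughly constant as workers are added, which is consistent with the near-ideal CPU throughput scaling observed above when communication is cheap relative to compute.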

The future work induced by this study includes more large-scale experiments, collecting a wider variety of metrics, and examining how well our results generalize to different network architectures. Furthermore, while this article covers some of the most widely used DL frameworks, there are more which could also be considered, arguably the most prominent being distributed TensorFlow. It may also be interesting to consider a wider variety of tasks than supervised learning in future experiments. Indeed, there are many improvements that could be made to the existing codebase which could facilitate further interesting results and ease of use for other users. In general, we would like to perform further experimentation on a larger set of applications and machines. In particular, it may be insightful to consider larger neural networks, which would likely have a substantially lower throughput on one device. In this scenario, larger messages would likely have to be transferred less often by the workers, potentially creating different bottlenecks and affecting which frameworks perform best. Some of the frameworks also include additional parameters giving the user extra control over the distribution of the training process, such as the option to compress data to 16-bit precision during allreduce operations in Horovod. While some tuning of these parameters was done to obtain the best possible results in this project, it may also be worthwhile to further examine the effect that changing parameters of this type has on the speed and efficiency of training.
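The effect of 16-bit compression during the allreduce can be illustrated with a small NumPy sketch (our own illustration; `compressed_average` is an invented name, not Horovod's API): gradients are cast to float16 before the simulated communication, halving the bytes on the wire, and cast back to float32 before the optimizer update:

```python
import numpy as np

# Sketch of fp16 compression during gradient averaging (illustrative only;
# this is not Horovod's actual API, just the idea behind its compression option).
def compressed_average(grads_fp32):
    """Average worker gradients, transmitting them in float16 instead of float32."""
    sent = [g.astype(np.float16) for g in grads_fp32]     # halves message size
    return np.mean([s.astype(np.float32) for s in sent], axis=0)

rng = np.random.default_rng(2)
grads = [rng.normal(scale=0.1, size=1000).astype(np.float32) for _ in range(8)]

exact = np.mean(grads, axis=0)        # uncompressed reference average
approx = compressed_average(grads)    # what a compressed allreduce would yield

# Half the communication volume, at the cost of a small rounding error.
assert grads[0].astype(np.float16).nbytes * 2 == grads[0].nbytes
assert np.max(np.abs(approx - exact)) < 1e-2
```

Whether the halved bandwidth outweighs the rounding error is exactly the kind of parameter-tuning question raised above, and the answer likely depends on model size and interconnect speed.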

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. Its members had initially formed a virtual group on the Viber social network, and this core group became the founders of the association. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

As a non-profit research firm, the KSRA research association is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for all people around the world. We want to integrate research and education. We believe education is a fundamental human right, so our services concentrate on inclusive education.

The KSRA team partners with local under-served communities around the world to improve access to and the quality of knowledge-based education, to amplify and augment learning programs where they exist, and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.

FULL Paper PDF file:

Performance Analysis of Distributed and Scalable Deep Learning



S. Mahon, S. Varrette, V. Plugaru, F. Pinel and P. Bouvry, "Performance Analysis of Distributed and Scalable Deep Learning,"

Published in

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, Australia, 2020, pp. 760-766



PDF reference and original file: Click here


Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship, and he holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.