A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications


With the explosive growth of big data, large-scale data processing systems have become necessary for analyzing it. Spark is arguably the state of the art among large-scale data computing systems today, owing to its generality, fault tolerance, high-performance in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of transformation and action operators whose operating functions users can customize for their applications. It was originally positioned as a fast and general data processing system. Since its introduction, a large body of research has sought to make it more efficient (faster) and more general under various circumstances. In this survey, we thoroughly review the optimization techniques that improve the generality and performance of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate the solution techniques proposed in the literature. We also introduce the data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss the open issues and challenges for Spark.
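The RDD model described above, in which transformation operators take user-supplied functions and action operators produce results, can be illustrated with a small sketch. This is a toy, single-process stand-in written in plain Python, not Spark's API: the class name `MiniRDD` and the helper `square` are hypothetical, and distribution and fault tolerance are left out entirely. It only shows that transformations are recorded lazily and that an action triggers the computation.

```python
from functools import reduce as _reduce
from typing import Any, Callable, Iterable, List


class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable[Any]]) -> None:
        # Store a recipe (a zero-argument function) rather than the data,
        # so that chaining transformations computes nothing.
        self._source = source

    @classmethod
    def parallelize(cls, data: Iterable[Any]) -> "MiniRDD":
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: build a new MiniRDD from a user-supplied function.
    def map(self, f: Callable[[Any], Any]) -> "MiniRDD":
        return MiniRDD(lambda: (f(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "MiniRDD":
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole chain and return an ordinary Python value.
    def collect(self) -> List[Any]:
        return list(self._source())

    def reduce(self, f: Callable[[Any, Any], Any]) -> Any:
        return _reduce(f, self._source())


calls: List[int] = []  # records every element the user function touches


def square(x: int) -> int:
    calls.append(x)
    return x * x


even_squares = MiniRDD.parallelize([1, 2, 3, 4, 5]).map(square).filter(lambda v: v % 2 == 0)
assert calls == []                              # transformations alone ran nothing
print(even_squares.collect())                   # [4, 16]
print(even_squares.reduce(lambda a, b: a + b))  # 20
print(len(calls))                               # 10: each action re-ran the chain
```

In this sketch each action recomputes the chain from the source, which is why `square` runs ten times for two actions over five elements. Keeping intermediate results in memory avoids that repetition.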


  • Author Keywords

    • Spark
    • Shark
    • RDD
    • In-Memory Data Processing
  • IEEE Keywords

    • Sparks
    • Task analysis
    • Computational modeling
    • Programming
    • Big Data
    • Ecosystems



In the current era of ‘big data’, data is collected at an unprecedented scale in many application domains, including e-commerce [112], social networks [140], and computational biology [146]. Given the unprecedented volume of data, the speed at which it is produced, and the variety of its structure, large-scale data processing is essential for analyzing and mining such data in a timely manner. A number of large-scale data processing frameworks have therefore been developed, such as MapReduce [87], Storm [14], Flink [1], Dryad [102], Caffe [103], and TensorFlow [64]. Specifically, MapReduce is a batch processing framework, while Storm is a stream processing system. Flink is a big data computing system for both batch and stream processing. Dryad is an execution engine for data-parallel applications expressed as dataflow graphs. Caffe and TensorFlow are deep learning frameworks used for model training and inference in computer vision, speech recognition, and natural language processing. However, none of these frameworks is a general computing system, since each of them supports only certain kinds of computation. In comparison, Spark [160] is a general and fast large-scale data processing system that is widely used in both industry and academia. For example, Spark can be much faster than MapReduce because it processes data in memory. Moreover, as a general system, it supports batch, interactive, iterative, and streaming computations in the same runtime, which is useful for complex applications that combine different computation modes. Despite its popularity, Spark still has many limitations. For example, its RDD programming model requires considerable learning and programming effort, and it does not support emerging heterogeneous computing platforms such as GPUs and FPGAs by default.
Although it is a general computing system, it still does not support certain types of applications, such as deep learning-based applications [25].

To make Spark more general and faster, much work has addressed the limitations mentioned above [121], [63], [94], [115], and this remains an active research area. A number of efforts have targeted performance optimization of the Spark framework, including proposals for more sophisticated scheduling strategies [137], [150] and efficient memory I/O support (e.g., RDMA support). Other studies extend Spark to more sophisticated algorithms and applications (e.g., deep learning algorithms, genomics, and astronomy). To improve ease of use, several high-level declarative languages [156], [23], [129] and procedural languages [54], [49] have also been proposed and supported by Spark.

Still, the emergence of new hardware, software, and application demands brings new opportunities as well as challenges for extending Spark toward better generality and performance. In this survey, to understand these demands and opportunities systematically, we classify the study of the Spark ecosystem into six support layers, as illustrated in Figure 1: the Storage Supporting Layer, Processor Supporting Layer, Data Management Layer, Data Processing Layer, High-level Language Layer, and Application Algorithm Layer.

The aim of this paper is two-fold. First, we investigate the latest studies on the Spark ecosystem. We review related work on Spark and classify it by optimization strategy, so that the survey can serve as a guidebook for users on the problems and solution techniques in data processing with Spark, and as a systematic summary of existing techniques that expert researchers can consult like a dictionary.
Second, we present and discuss the development trends, new demands, and challenges at each support layer of the Spark ecosystem illustrated in Figure 1, which gives researchers insights into and potential directions for future work on Spark.

The rest of this survey is structured as follows. Section 2 introduces the Spark system, including its programming model, runtime computing engine, pros and cons, and various optimization techniques. Section 3 describes new caching devices for Spark in-memory computation. Section 4 discusses extensions of Spark that improve performance by using new accelerators. Section 5 presents distributed data management, followed by the processing systems supported by Spark in Section 6. Section 7 covers the languages supported by Spark. Section 8 reviews Spark-based machine learning libraries and systems, Spark-based deep learning systems, and the major applications to which Spark is applied. Section 9 discusses open and challenging issues. Finally, we conclude this survey in Section 10.
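The introduction's point that in-memory processing benefits iterative computation can be made concrete with a small sketch. It is a toy model in plain Python and not Spark code: the class `CachedDataset`, the function `iterate_mean`, and the sample data are hypothetical, and the method name `cache` is chosen only to echo the idea of keeping a dataset in memory. The sketch counts how often an iterative loop has to go back to the data source with and without caching.

```python
from typing import Callable, List, Optional


class CachedDataset:
    """Toy dataset that re-reads its source on every scan unless cache() was called.

    Hypothetical illustration only; this is not Spark's API.
    """

    def __init__(self, load: Callable[[], List[float]]) -> None:
        self._load = load
        self._use_cache = False
        self._cached: Optional[List[float]] = None
        self.loads = 0  # how many times the source was actually read

    def cache(self) -> "CachedDataset":
        # Mark the dataset to be kept in memory after its first scan.
        self._use_cache = True
        return self

    def values(self) -> List[float]:
        if self._use_cache and self._cached is not None:
            return self._cached      # served from memory
        self.loads += 1              # stands in for an expensive read
        data = self._load()
        if self._use_cache:
            self._cached = data
        return data


def iterate_mean(ds: CachedDataset, steps: int, lr: float = 0.5) -> float:
    """Gradient-descent-style loop: every step scans the full dataset."""
    estimate = 0.0
    for _ in range(steps):
        data = ds.values()
        grad = sum(estimate - x for x in data) / len(data)
        estimate -= lr * grad
    return estimate


source = [1.0, 2.0, 3.0, 4.0]
uncached = CachedDataset(lambda: list(source))
cached = CachedDataset(lambda: list(source)).cache()
iterate_mean(uncached, steps=10)
iterate_mean(cached, steps=10)
print(uncached.loads, cached.loads)  # 10 1
```

Both runs compute the same estimate. The difference is that the uncached loop reads the source once per iteration, while the cached loop reads it once in total, which is the kind of saving that makes in-memory systems attractive for iterative workloads.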


Spark has attracted significant interest and contributions from both industry and academia because of its simplicity, generality, fault tolerance, and high performance. However, little work has summarized and classified these contributions comprehensively, which motivated us to investigate the related work on Spark. We first give an overview of the Spark framework and present its pros and cons. We then provide a comprehensive review of the current state of Spark studies and related work that aims to improve and enhance the Spark framework, and finally discuss the open issues and challenges facing Spark today. We hope this work will be a useful resource for users who are interested in Spark and want to study it further.


Full paper:

S. Tang, B. He, C. Yu, Y. Li and K. Li, "A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications," in IEEE Transactions on Knowledge and Data Engineering.
