Distributed Feature Extraction on Apache Spark for Human Action Recognition



Local feature extraction is one of the most important tasks in building robust video representations for human action recognition. Recent advances in computing visual features, especially deep-learned features, have achieved excellent performance on a variety of action datasets. However, the extraction process is compute-intensive and extremely time-consuming on large-scale video data. Consequently, most existing methods, which run on a single machine, become inefficient at extracting video features over big data because of limited computation power and memory capacity. In this paper, we propose elastic solutions for feature extraction based on the Spark framework. In particular, exploiting the in-memory computing capability of Spark, the feature computation is parallelized by partitioning video data into videos or frames and placing them into resilient distributed datasets (RDDs) for subsequent processing. Then, we present parallel algorithms to extract state-of-the-art deep-learned features on a Spark cluster. Subsequently, using distributed encoding, the extracted features are aggregated into a global representation, which is fed into a learned classifier to recognize actions in videos. Experimental results on a benchmark dataset demonstrate that our proposed methods significantly speed up the extraction process and achieve promising scalability.


  • Author Keywords

    • Human action recognition
    • local feature
    • deep learning
    • Spark
    • MapReduce
  • IEEE Keywords

    • Feature extraction
    • Sparks
    • Trajectory
    • Task analysis
    • Liquid crystal displays
    • Visualization
    • Streaming media



In recent years, the rapid growth of surveillance, social media, and even robotics has created vast volumes of video data. The potential for work in the computer vision community to intelligently analyze and gain insights into video content has grown dramatically. Accordingly, human action recognition (HAR) [1]–[5], with its ability to understand complex actions, plays an important role in many multimedia applications, including smart surveillance, web-video search and retrieval, patient monitoring, sports analysis, and human-computer interaction.

Although much research has achieved reliable HAR in simple scenarios, action recognition remains challenging, especially in complex scenarios, for several reasons: videos taken with background clutter and motion; a large variety of lighting conditions and viewpoints; limited quantities of labeled data for learning; and large intra-class variability versus small inter-class variability.

Over the past decade, much work has been carried out to leverage local space-time features in handling these challenges and to build efficient action representations. Different feature types have been presented in the literature. Generally, they can be divided into two main types: hand-crafted local features and deep-learned local features. The first type, such as Space-Time Interest Points (STIP) [3] and Dense Trajectories (DT) [15], can be extracted directly from a video without tracking human bodies and then fed into generative or discriminative recognition models. Despite their success on benchmark datasets, the discriminative power of this type is still limited because the visual representation is not optimized. The second type is extracted by end-to-end training of neural networks that learn automatically from raw video, using large and well-annotated datasets. The visual representation learned by deep models captures more discriminative information and enables a high recognition rate.

Many works [2], [10] using local features have achieved excellent performance in recognizing complex actions. At the same time, these local-feature-based methods generate a huge amount of visual data. Managing storage and computation for data at this scale is a challenging task. However, the number of works aiming at system efficiency is limited. Single-machine systems with multiple processors usually face space-efficiency and time-efficiency problems due to limited memory and computation capability. The limited effectiveness of parallel solutions on single machines also hinders the development of scalable HAR applications.

To meet elastic computing requirements, a potential solution is to deploy vision algorithms on computer clusters, which makes it possible to build an HAR framework for production data and workflows on Big Data platforms. Hadoop MapReduce [8] and Spark [9] are the most popular Big Data stacks for developing parallel computing frameworks. Several recent works [6], [7], [12] have exploited these platforms to develop video analytics systems. In particular, the authors of [7] proposed a video processing platform that employs Hadoop MapReduce to perform distributed computing. Despite its potential efficiency, the vision tasks presented in their platform are rather simple, and the applicability to high-level video understanding is not verified. In [12], distributed real-time video processing was proposed for object and event detection, but like [7], this work is still limited to a few applications. Moreover, Hadoop MapReduce is not well suited to iterative computation, which is a key factor in vision algorithms. Spark later emerged as a solution that overcomes these limitations while inheriting the properties of Hadoop MapReduce. Apache Spark is a fast and general-purpose engine for large-scale data storage and processing. The Spark programming model both supports a wide variety of data analytics applications and maintains automatic fault tolerance. This is done through an abstraction called resilient distributed datasets (RDDs), which allows distributed applications to be carried out efficiently. Hence, Spark is particularly suitable for developing distributed vision algorithms.
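The partition-then-process pattern that RDDs provide can be illustrated with a minimal plain-Python sketch. It does not use Spark: the helper names (`frame_records`, `split_into_partitions`, `map_partitions`, `toy_extract`) are hypothetical, the round-robin split only stands in for Spark's own partitioning, and the "extractor" merely counts frames. On a real cluster the same roles would be played by PySpark calls such as `SparkContext.parallelize` and `RDD.mapPartitions`.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# A record is (video_id, frame_index); real records would also carry pixel data.
Record = Tuple[str, int]


def frame_records(videos: Dict[str, int]) -> List[Record]:
    """Flatten {video_id: frame_count} into one record per frame."""
    return [(vid, i) for vid, n in sorted(videos.items()) for i in range(n)]


def split_into_partitions(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Round-robin split (an assumption; Spark's partitioner may differ)."""
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for idx, rec in enumerate(records):
        parts[idx % num_partitions].append(rec)
    return parts


def map_partitions(parts: List[List[Record]],
                   fn: Callable[[Iterator[Record]], Iterator[Tuple[str, int]]]
                   ) -> List[List[Tuple[str, int]]]:
    """Apply fn to each partition independently; a cluster would run these in parallel."""
    return [list(fn(iter(p))) for p in parts]


def toy_extract(part: Iterator[Record]) -> Iterator[Tuple[str, int]]:
    """Hypothetical stand-in for a feature extractor: emits (video_id, 1) per frame."""
    for vid, _ in part:
        yield (vid, 1)


videos = {"v1": 5, "v2": 3}
parts = split_into_partitions(frame_records(videos), num_partitions=3)
mapped = map_partitions(parts, toy_extract)

# Reduce step: combine per-partition outputs into a per-video frame count.
frames_per_video: Dict[str, int] = {}
for part in mapped:
    for vid, count in part:
        frames_per_video[vid] = frames_per_video.get(vid, 0) + count
```

Because each partition is processed without reference to the others, the map step can be spread across workers, and only the small per-video totals need to be combined at the end.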

In this paper, we propose elastic solutions to extract local features from video sequences in a distributed environment. Our proposed method is based on the Spark paradigm and the state-of-the-art bag of visual words (BoV) model [2], [10], which has been widely used for human action recognition. First, we present the BoV framework built on top of Apache Spark. Then, we present how to extract local features in a distributed fashion. Specifically, feature extraction is parallelized by partitioning video datasets into videos or frames, which are placed into RDDs for subsequent distributed processing. We primarily focus on designing parallel algorithms to extract a popular deep-learned local feature, namely the Trajectory-pooled Deep-convolutional Descriptor (TDD) [16]. This type of feature is a deep variant of improved dense trajectories (IDT) and is computed by pooling feature maps along trajectories. For distributed processing, we employ the implementation presented in [13] to extract the trajectories and then adopt a distributed deep learning library, BigDL [19], to compute CNN feature maps. We also present a simplified version of TDD, called the Latent Concept Descriptor (LCD), which uses only CNN feature maps, together with its parallel algorithm. Furthermore, we introduce a distributed variant of VLAD encoding to aggregate the extracted local features into a global representation. Finally, these representations are fed into a trained classifier to recognize actions in videos. We experimentally demonstrate the efficiency of the proposed method when running on Spark clusters.

The remainder of this paper is organized as follows. Section II presents our proposed framework and methodology. Experimental results on benchmark datasets are presented and discussed in Section III. Finally, conclusions are presented in Section IV.
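The reason VLAD encoding lends itself to a distributed variant can be seen in a small sketch. Standard VLAD sums, for each codeword, the residuals of the descriptors assigned to it; because that sum is associative, each partition can compute a partial accumulator that is merged afterwards. The code below is a plain-Python illustration under that assumption, not the paper's algorithm: the function names, the two-word codebook and the toy descriptors are all hypothetical, and only L2 normalization is applied.

```python
import math
from functools import reduce
from typing import List

Vector = List[float]


def nearest(x: Vector, codebook: List[Vector]) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])))


def partial_vlad(descriptors: List[Vector], codebook: List[Vector]) -> List[Vector]:
    """Per-partition step: sum residuals (x - c_k) under each descriptor's nearest codeword."""
    acc = [[0.0] * len(codebook[0]) for _ in codebook]
    for x in descriptors:
        k = nearest(x, codebook)
        for d, (a, b) in enumerate(zip(x, codebook[k])):
            acc[k][d] += a - b
    return acc


def merge(a: List[Vector], b: List[Vector]) -> List[Vector]:
    """Combine two partial accumulators element-wise (the reduce step)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def finalize(acc: List[Vector]) -> Vector:
    """Concatenate the per-codeword sums and L2-normalize the result."""
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical toy data: a 2-word codebook and descriptors split over two partitions.
codebook = [[0.0, 0.0], [10.0, 10.0]]
partitions = [[[1.0, 0.0], [9.0, 10.0]], [[0.0, 2.0]]]

merged = reduce(merge, (partial_vlad(p, codebook) for p in partitions))
vlad_distributed = finalize(merged)

# Single-pass reference over all descriptors, for comparison.
all_descriptors = [x for p in partitions for x in p]
vlad_single = finalize(partial_vlad(all_descriptors, codebook))
```

The merged accumulator matches the single-pass one, so the partitioning affects only where the work happens, not the resulting representation.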


In this paper, we proposed parallel algorithms to extract popular deep-learned features for HAR, including LCD and TDD. We show that efficiency can be significantly improved by executing feature extraction on Spark clusters. The experiments on the benchmark dataset UCF101 not only demonstrate the scalability of our Spark framework but also verify the effectiveness of the distributed variants of LCD and TDD, which balance accuracy and time efficiency well.
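The difference between the two descriptors can be made concrete with a toy sketch. It assumes feature maps stored as nested lists indexed `[t][y][x]`, trajectories already expressed in feature-map coordinates, and plain sum-pooling; the normalization steps of the original TDD are omitted, and treating every spatial cell as one LCD descriptor is an assumption. All names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]   # indexed [y][x] -> channel vector
Point = Tuple[int, int, int]      # (t, x, y) in feature-map coordinates


def trajectory_pooled_descriptor(feature_maps: List[FeatureMap],
                                 trajectory: List[Point]) -> Vector:
    """TDD-style sketch: sum the channel vectors found along one trajectory."""
    channels = len(feature_maps[0][0][0])
    pooled = [0.0] * channels
    for t, x, y in trajectory:
        for c, v in enumerate(feature_maps[t][y][x]):
            pooled[c] += v
    return pooled


def latent_concept_descriptors(feature_map: FeatureMap) -> List[Vector]:
    """LCD-style sketch: every spatial cell of one frame's map is a local descriptor."""
    return [cell for row in feature_map for cell in row]


# Toy input: 2 frames, 2x2 spatial cells, 2 channels per cell.
feature_maps = [
    [[[float(t * 4 + y * 2 + x), 1.0] for x in range(2)] for y in range(2)]
    for t in range(2)
]

trajectory = [(0, 0, 0), (1, 1, 0)]   # one point per frame
tdd = trajectory_pooled_descriptor(feature_maps, trajectory)
lcd = latent_concept_descriptors(feature_maps[0])
```

In this sketch the LCD path needs only the feature maps, while the TDD path also depends on trajectories being available for the same frames, which is where the extra extraction cost comes from.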


Full paper:

N. A. Tu, T. Huynh-The, K. Wong, D. Bui and Y. Lee, "Distributed Feature Extraction on Apache Spark for Human Action Recognition," 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM), Taichung, Taiwan, 2020, pp. 1-6.
