Abstract
Human activity recognition is a challenging problem with many applications including visual surveillance, human-computer interaction, autonomous driving, and entertainment. In this study, we propose a hybrid deep model to understand and interpret videos, focusing on human activity recognition. The proposed architecture is constructed by combining a dense optical flow approach with auxiliary movement information in video datasets using deep learning methodologies. To the best of our knowledge, this is the first study based on a novel combination of 3D convolutional neural networks (3D-CNNs) fed by optical flow and long short-term memory networks (LSTMs) fed by auxiliary information over video frames for the purpose of human activity recognition. The contributions of this paper are sixfold. First, a 3D-CNN operating on multiple frames is employed to determine the motion vectors. Second, with the same purpose, the 3D-CNN is fed by dense optical flow, which is the distribution of apparent velocities of movement in the captured imagery across video frames. Third, an LSTM is employed on auxiliary information in the video to recognize hand tracking and objects. Fourth, the support vector machine algorithm is utilized for the task of video classification. Fifth, a wide range of comparative experiments is conducted on two newly generated chess datasets, namely the magnetic wall chess board video dataset (MCDS) and the standard chess board video dataset (CDS), to demonstrate the contributions of the proposed study. Finally, the experimental results reveal that the proposed hybrid deep model exhibits remarkable performance compared to state-of-the-art studies.
IEEE Keywords
- Videos
- Optical imaging
- Deep learning
- Feature extraction
- Activity recognition
- Support vector machines
- Optical computing
Introduction
Board games have been played for centuries in all cultures and societies. Such games involve elegant rules, deep strategies, and numerous tactical possibilities. Chess is one of the most popular board games in the world; the World Chess Federation (FIDE) reports that 605 million people play chess [1]. Chess has always been a challenge for computer hardware designers and software developers. Robot systems interacting with people in complex environments need the capability to correctly interpret and respond to human behaviors. Entertainment is a natural way of integrating social robots into our lives, and playing chess with a social robot requires the recognition of human behavior using computer vision. Chess is therefore an attractive topic for research on human-machine interaction, requiring solutions in areas such as image processing and strategy formation.
Human activity recognition (HAR) is a challenging task that provides recognition of activities in complex interactions without verbal communication. HAR has numerous potential applications, such as devices that mediate communication between humans and the environment, surveillance systems, and video understanding applications including online advertising and video retrieval. Depending on their complexity, human activities are categorized as gestures, atomic actions, human-to-object or human-to-human interactions, group actions, behaviors, and events [2]. A software system that handles HAR should perform three functions: background subtraction, to separate the parts of the image that are invariant over time; human tracking, in which the system locates human motion; and human action and object detection, in which the system localizes human activity in an image [3]. To localize action recognition, local representations are used, following the pipeline of interest point detection, local descriptor extraction, and aggregation of local descriptors.
Local descriptors based on pixel values and optical flow are used for HAR tasks. Local representations for action recognition on Space-Time Interest Points (STIPs) follow the pipeline of interest point detection, local descriptor extraction, and aggregation of local descriptors. Klaser et al. [4] suggest the histogram of gradient orientations as a motion descriptor. Optical flow is the distribution of apparent velocities of movement in captured imagery data. The motions are extracted by comparing two images, which may be two images captured at different times (temporal optical flow) or two images captured at exactly the same time by two different cameras with known camera parameters (static optical flow). Generally, optical flow is a 2D vector field where each vector is a displacement showing the movement of pixels from the first frame to the second along the perpendicular image axes. The Farnebäck method [5] is a two-frame motion estimation algorithm that uses polynomial expansion, in which the neighborhood of each image pixel is approximated by a polynomial. In this work, we focus on quadratic polynomials, which give the local signal model represented in a local coordinate system.
The application of deep learning (DL) algorithms has become very popular in recent years in different research fields such as image processing, natural language processing, speech recognition, and machine translation. Deep neural network models are now also being proposed for the human activity recognition field. DL methods are preferred by researchers because they provide better predictions and results compared with traditional machine learning algorithms. Deep learning models are mainly used to provide automatic feature extraction by training complex features with minimal external support to obtain a meaningful representation of the data through deep neural networks. Furthermore, deep learning methods are also employed for classification tasks in many fields. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and deep belief networks (DBNs) are well-known architectures. In this work, a novel technique is presented for recognizing human activity from videos using deep learning algorithms.
In this study, the main purpose is to design a hybrid video-based HAR classification system for complex interactions, inspired by the study in [6]. To the best of our knowledge, this is the first study to consolidate optical flow and auxiliary movement information in video using deep learning models. Furthermore, the hybrid deep model classifies videos and generates video subtitles in addition to the consolidation process. The proposed system is referred to as flexible because it can determine many complex activities such as playing chess, playing checkers, playing peg solitaire, or playing cards. It can also be called extendable because any number of features can be added as auxiliary information to the long short-term memory network. The system is simulated for recognizing chess playing in order to demonstrate the efficiency of the proposed model. For this purpose, a 3D-CNN model for the optical flow part and the LSTM algorithm for auxiliary information in videos are employed in the feature extraction step. Hand tracking, movement based on chessboard recognition, and chess piece recognition are evaluated as auxiliary information in the LSTM. The features extracted by the 3D-CNN from optical flow are then consolidated with the features extracted from hand tracking, chessboard recognition, and chess piece recognition in our proposed system. After the feature extraction step, the Support Vector Machine (SVM) algorithm is utilized for video classification. To demonstrate the effectiveness of the proposed system, two types of chess playing datasets are employed in the experiments.
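The consolidation and classification steps described above can be sketched as feature concatenation followed by an SVM, here with scikit-learn. The feature dimensions, video count, and random vectors are purely illustrative stand-ins for the real 3D-CNN and LSTM outputs, which the paper does not specify here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_videos = 40

# Stand-ins for the two feature streams (dimensions are illustrative):
# 3D-CNN features from the optical-flow stream, and LSTM features from
# the auxiliary stream (hand tracking, board and piece recognition).
cnn_feats = rng.normal(size=(n_videos, 128))
lstm_feats = rng.normal(size=(n_videos, 64))
labels = rng.integers(0, 2, size=n_videos)  # e.g. "playing chess" vs. other

# Consolidation: concatenate the two streams into one descriptor per video.
fused = np.concatenate([cnn_feats, lstm_feats], axis=1)

# Classification: an SVM over the fused descriptors.
clf = SVC(kernel="rbf").fit(fused, labels)
preds = clf.predict(fused)
print(fused.shape, preds.shape)
```

Concatenation is the simplest fusion strategy; it lets the SVM weight both streams jointly without retraining either feature extractor.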
The rest of this paper is organized as follows: Section 2 gives a summary of related work on human activity recognition, optical flow, and video classification. Section 3 explains the deep learning methods used in the experiments. Section 4 describes the proposed framework. Sections 5 and 6 present the experimental setup and results, and the conclusions, respectively.
Conclusion
Human activity recognition is a challenging problem with many applications in fields such as visual surveillance, human-computer interaction, autonomous driving, and entertainment. Many motion estimation approaches have been proposed to address this problem.
In this study, a hybrid deep model is proposed for the purpose of HAR. The proposed architecture is built by combining a dense optical flow approach and auxiliary movement information in videos using deep learning methodologies. First, deep learning models, namely a 3D convolutional neural network (3D-CNN), a 3D-CNN with optical flow, and a long short-term memory network (LSTM), are combined to determine the motion vectors. The classification task for videos is then performed by a support vector machine algorithm. A wide range of comparative experiments is conducted on two newly generated chess datasets, namely the magnetic wall chess board video dataset (MCDS) and the standard chess board video dataset (CDS), to demonstrate the contributions of the proposed study. Finally, the experimental results show that the proposed hybrid deep model achieves considerable classification success compared to state-of-the-art studies. Furthermore, to the best of our knowledge, this is the first study based on a novel combination of a 3D-CNN, a 3D-CNN with optical flow, and an LSTM over video frames to recognize human activity.
In conclusion, the experimental results demonstrate that the proposed architecture offers significant advantages for recognizing and classifying human activities in videos. First, the proposed hybrid deep model is flexible, extendable, and customizable, as it is able to determine many complex activities in various video datasets, including playing chess, playing checkers, playing peg solitaire, and playing cards. Second, any number of features can be easily consolidated as auxiliary information in the proposed architecture. In addition to these advantages, the proposed hybrid deep model allows other deep learning models to be connected to it as auxiliary feature extractors for purposes such as object recognition and hand tracking.
As future work, we plan to enrich the set of features by employing other deep learning techniques. Furthermore, heuristic optimization-based algorithms can also be used to enrich the feature set for the purpose of improving the classification performance of HAR.
About KSRA
The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. The members of the community had formed a virtual group on the Viber social network, and the core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.
The KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.
Our main services are based on education for all people across the world. We want to integrate research and education. We believe education is a fundamental human right, so our services are concentrated on inclusive education.
The KSRA team partners with under-served local communities around the world to improve access to and the quality of education-based knowledge, to amplify and augment learning programs where they exist, and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.
FULL Paper PDF file:
A Hybrid Deep Model Using Deep Learning and Dense Optical Flow Approaches for Human Activity Recognition
Bibliography
- Author:
- Year: 2020
- Title: A Hybrid Deep Model Using Deep Learning and Dense Optical Flow Approaches for Human Activity Recognition
- Published in: IEEE Access, vol. 8, pp. 19799-19809, 2020
- DOI: 10.1109/ACCESS.2020.2968529
Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.
Professor Siavosh Kaviani was born in 1961 in Tehran. He holds a professorship, a Ph.D. in Software Engineering from the QL University of Software Development Methodology, and an honorary Ph.D. from the University of Chelsea.
Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.