M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Abstract

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis. M3ER is robust to sensor noise because it introduces a check step that uses Canonical Correlation Analysis (CCA) to differentiate between ineffectual and effectual modalities; it also generates proxy features in place of the ineffectual modalities. We demonstrate the effectiveness of our network through experiments on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.

Introduction

The perception of human emotions plays a vital role in our everyday lives. People modify their responses and behaviors based on their perception of the emotions of those around them. For example, one might cautiously approach a person they perceive to be angry, whereas they might be more forthcoming when approaching a person they perceive to be happy and calm. Given the importance of emotion perception, emotion recognition from sensor data is important for various applications, including affective computing (Yates et al. 2017; Atcheson, Sethu, and Epps 2017), human-computer interaction (Cowie et al. 2001), surveillance (Clavel et al. 2008), robotics, games and entertainment, and more. In this work, we address the problem of perceived emotion recognition rather than recognition of the actual emotional state.

One of the primary tasks in developing efficient AI systems for perceiving emotions is to combine and collate information from the various modalities by which humans express emotion. These modalities include, but are not limited to, facial expressions, speech and voice modulations, written text, body postures, gestures, and walking styles. Many researchers have advocated combining more than one modality to infer perceived emotion for various reasons, including:

(a) Richer information: Cues from different modalities can augment or complement each other, and hence lead to more sophisticated inference algorithms.

(b) Robustness to Sensor Noise: Information from different modalities captured through sensors can often be corrupted by signal noise, or be missing altogether when the particular modality is not expressed or cannot be captured due to occlusion, sensor artifacts, etc. We call such modalities ineffectual. Ineffectual modalities are especially prevalent in in-the-wild datasets.

However, multimodal emotion recognition comes with its own challenges. At the outset, it is important to decide which modalities should be combined and how. Some modalities are more likely to co-occur than others, and are therefore easier to collect and use together. For example, some of the most popular benchmark datasets with multiple modalities, such as IEMOCAP (Busso et al. 2008) and CMU-MOSEI (Zadeh et al. 2018b), contain the commonly co-occurring modalities of facial expressions with associated speech and transcribed text. With the growing number of social media sites and the amount of data on the internet (e.g., YouTube), often equipped with automatic caption generation, it is easier to collect data for these three modalities. Many of the other existing multimodal datasets (Ringeval et al. 2013; Kossaifi et al. 2017) also cover a subset of these three modalities. Consequently, these are the modalities we use in our work.

Another challenge is the current lack of agreement on the most efficient mechanism for combining (also called “fusing”) multiple modalities (Baltrusaitis, Ahuja, and Morency 2017). The most commonly used techniques are early fusion (also called “feature-level” fusion) and late fusion (also called “decision-level” fusion). Early fusion combines the input modalities into a single feature vector on which a prediction is made. In late fusion methods, each of the input modalities is used to make an individual prediction, which is then combined for the final classification. Most prior work on emotion recognition has explored early fusion (Sikka et al. 2013) and late fusion (Gunes and Piccardi 2007) techniques with additive combinations. Additive combinations assume that every modality is always potentially useful and hence should be used in the joint representation. This assumption makes additive combinations ill-suited to in-the-wild datasets that are prone to sensor noise. Hence, in our work, we use a multiplicative combination, which does not make such an assumption. Multiplicative methods explicitly model the relative reliability of each modality on a per-sample basis, so that reliable modalities are given higher weight in the joint prediction.
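To make the additive-versus-multiplicative distinction concrete, here is a minimal NumPy sketch. It is not M3ER's actual fusion layer (M3ER learns its fusion weights from data); the multiplicative-combination loss below is one common formulation in which each modality's cross-entropy term is down-weighted by how confidently the other modalities already predict the true class. The exponent beta, the toy probabilities, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def additive_fusion(probs):
    """Late additive fusion: average the per-modality class probabilities."""
    return probs.mean(axis=0)

def multiplicative_loss(probs, true_class, beta=2.0):
    """One common multiplicative-combination training loss (illustrative only).

    Each modality's cross-entropy term is down-weighted by how confidently the
    OTHER modalities already predict the true class, so a noisy modality is
    suppressed instead of dragging the joint objective down.
    probs: (M, C) array of per-modality class probabilities.
    """
    M = probs.shape[0]
    p_true = probs[:, true_class]                       # true-class probability per modality
    loss = 0.0
    for i in range(M):
        others = np.delete(p_true, i)                   # the other modalities' true-class probs
        weight = np.prod(1.0 - others) ** (beta / (M - 1))
        loss -= weight * np.log(p_true[i] + 1e-12)
    return loss

# Toy per-modality predictions for one sample (face, speech, text) over 4 classes.
probs = np.array([[0.85, 0.05, 0.05, 0.05],   # face: confident and correct
                  [0.25, 0.25, 0.25, 0.25],   # speech: uninformative (noisy sensor)
                  [0.60, 0.20, 0.10, 0.10]])  # text: moderately confident
print(additive_fusion(probs))                 # additive: the noisy modality dilutes the average
print(multiplicative_loss(probs, true_class=0))
```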

Main Contributions: We make the following contributions:

1. We present a multimodal emotion recognition algorithm called M3ER, which uses a data-driven multiplicative fusion technique with deep neural networks. Our input consists of the feature vectors for three modalities — face, speech, and text.

2. To make M3ER robust to noise, we propose a novel preprocessing step in which we use Canonical Correlation Analysis (CCA) (Hotelling 1936) to differentiate between ineffectual and effectual input modality signals.

3. We also present a feature transformation method to generate proxy feature vectors for ineffectual modalities, given the true feature vectors for the effectual modalities. This enables our network to work even when some modalities are corrupted or missing. (A minimal sketch of this check-and-regenerate idea follows below.)
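As a rough illustration of the modality check and proxy generation (not the paper's exact procedure), the sketch below uses scikit-learn's CCA to score how strongly the features of two co-occurring modalities correlate, flagging the pair as ineffectual when the mean canonical correlation falls below a threshold; a plain linear regression then stands in for the feature transformation that regenerates a proxy feature vector from an effectual modality. The threshold tau, the component count, the linear proxy model, and the toy features are all assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

def modality_check(feat_a, feat_b, n_components=2, tau=0.1):
    """Flag a modality pair as ineffectual if its canonical correlations are low.

    feat_a, feat_b: (n_samples, d_a) and (n_samples, d_b) feature matrices
    for two co-occurring modalities of the same samples.
    """
    cca = CCA(n_components=n_components)
    a_c, b_c = cca.fit_transform(feat_a, feat_b)
    # Correlation between each pair of canonical variates.
    corrs = [np.corrcoef(a_c[:, k], b_c[:, k])[0, 1] for k in range(n_components)]
    return float(np.mean(corrs)) > tau, corrs

def fit_proxy_generator(feat_src, feat_dst):
    """Illustrative proxy generator: a linear map from an effectual modality's
    features into the ineffectual modality's feature space, fit on clean data."""
    return LinearRegression().fit(feat_src, feat_dst)

# Toy "training" features: speech is a noisy linear function of face features.
rng = np.random.default_rng(0)
face = rng.normal(size=(200, 32))
speech = face @ rng.normal(size=(32, 24)) + 0.1 * rng.normal(size=(200, 24))

effectual, corrs = modality_check(face, speech)
proxy_model = fit_proxy_generator(face, speech)
speech_proxy = proxy_model.predict(face[:1])   # stand-in speech features for one sample
print(effectual, np.round(corrs, 2), speech_proxy.shape)
```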

We compare our work with prior methods by evaluating performance on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report an accuracy of 82.7% on the IEMOCAP dataset and 89.0% on the CMU-MOSEI dataset, which is a collective improvement of about 5% in absolute accuracy over prior methods. We show ablation experiment results on both datasets, where almost 75% of the data has at least one modality corrupted or missing, to demonstrate the importance of our contributions. As per the annotations in the datasets, we classify IEMOCAP into 4 discrete emotions (angry, happy, neutral, sad) and CMU-MOSEI into 6 discrete emotions (anger, disgust, fear, happy, sad, surprise). In the continuous space representation, emotions are seen as points in a 3D space of arousal, valence, and dominance (Ekman and Friesen 1967); two of these dimensions are depicted in Figure 2 of the paper. Our algorithm is not limited to specific emotion labels, as any combination of these emotions can be used to represent other emotions.
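For reference, this small snippet spells out the two discrete label sets and a per-class binary target encoding consistent with the per-class classification mentioned in the conclusion; the helper function is an illustrative assumption, not code from the paper.

```python
import numpy as np

# Emotion label sets used for the two benchmarks, as annotated in the datasets.
IEMOCAP_CLASSES = ["angry", "happy", "neutral", "sad"]
MOSEI_CLASSES = ["anger", "disgust", "fear", "happy", "sad", "surprise"]

def to_per_class_targets(labels, classes=MOSEI_CLASSES):
    """Encode a sample's annotated emotions as one binary target per class,
    matching the per-class binary classification described in the conclusion."""
    return np.array([1.0 if c in labels else 0.0 for c in classes])

# A CMU-MOSEI sample annotated with both 'happy' and 'surprise'.
print(to_per_class_targets({"happy", "surprise"}))   # -> [0. 0. 0. 1. 0. 1.]
```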

Conclusion, Limitations, and Future Work

We present M3ER, a multimodal emotion recognition model that uses a multiplicative fusion layer. M3ER is robust to sensor noise because of a modality check step that distinguishes between effectual and ineffectual signals and regenerates proxy feature vectors for the ineffectual ones. We use multiplicative fusion to decide, on a per-sample basis, which modality should be relied on more for making a prediction. Currently, we have applied our method to datasets with three input modalities, namely face, speech, and text. Our model has limitations and sometimes confuses certain class labels. Further, we currently perform binary classification per class; however, human perception is subjective in nature and would resemble a probability distribution over these discrete emotions. Thus, it would be useful to consider multi-class classification in the future. As part of future work, we would also explore more elaborate fusion techniques that can help improve accuracy. We would like to extend M3ER to more than three modalities. As suggested in psychological studies, we would like to explore more naturalistic modalities, such as walking styles, and even contextual information.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. Its members initially formed a virtual group on the Viber social network, and this group became the core of the Kavian Scientific Association, with those members as its founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

The KSRA research association, as a non-profit research firm, is committed to providing knowledge-oriented research services. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for people across the world. We want to integrate research and education. We believe education is a fundamental human right, so our services are focused on inclusive education.

The KSRA team partners with local under-served communities around the world to improve access to and the quality of education-based knowledge, amplify and augment existing learning programs, and create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Full Paper PDF File:

M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Bibliography

Author

Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha
Department of Computer Science, University of Maryland College Park, Maryland 20740, USA {trisha, uttaranb, rohan, ab, dm}@cs.umd.edu

Year

2020

Title

M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Published in

Electrical Engineering and Systems Science > Signal Processing

PDF reference and original file: Click here

 


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship. He holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.