Temporarily-Aware Context Modeling Using Generative Adversarial Networks for Speech Activity Detection



This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a joint learning framework for SAD. We utilize generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/non-speech classifications together with the next audio segment. In order to exploit the temporal relationships within the input signal, we propose a temporal discriminator which aims to ensure that the predicted signal is temporally consistent. We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT'17, AMI Meeting, and HAVIC, where we demonstrate its capability to outperform state-of-the-art SAD approaches. Furthermore, our cross-database evaluations demonstrate the robustness of the proposed approach across different languages, accents, and acoustic environments.


  • Author Keywords

    • Speech activity detection,
    • generative adversarial networks,
    • context modeling
  • IEEE Keywords

    • Task analysis,
    • Speech processing,
    • Generative adversarial networks,
    • Gallium nitride,
    • Mel frequency cepstral coefficient,
    • Generators,
    • Context modeling



Speech Activity Detection (SAD) plays a pivotal role in many speech processing systems. Despite the consistent progress attained in this subject, the problem is far from solved, as evidenced by evaluation results across the vast variety of acoustic conditions featured in challenging benchmarks such as HAVIC [1] and NIST OpenSAT'17 [2]. Our work is inspired by recent observations in speech processing where multi-task learning approaches have been shown to outperform single-task learning methods in numerous areas, including speech synthesis [3], speech recognition [4], speech enhancement [5], and speech emotion recognition [6]. For instance, the seminal work by Pironkov et al. [4] demonstrated that significant improvements in the accuracy of Automatic Speech Recognition (ASR) can be obtained by combining the ASR task with context recognition and gender classification as auxiliary tasks, as opposed to performing ASR alone. Furthermore, the evaluations in [5], [6] suggested that methods learned using the multi-task learning paradigm are not only robust when evaluated in cross-database scenarios, but also learn more powerful and discriminative features that facilitate both tasks. Inspired by these findings, we exploit the power of Generative Adversarial Networks (GAN) [7], [8] to accurately perform speech/non-speech classification together with an auxiliary task. In choosing the appropriate auxiliary task for SAD, we draw inspiration from a conclusion in the field of neuroscience that humans recognize speech in noisy conditions through awareness of the next segment of speech that is most likely to be heard [9], [10]. We therefore chose the prediction of the next audio segment as the auxiliary task, as it also complements the primary SAD task by learning the context of the input audio embedding.
Through the prediction of the next audio segment our model tries to learn a contextual mapping between the input audio segments and the next segment which is likely to be heard.

Even though the final speech activity decision is agnostic to the actual content of speech, there are reasons to conjecture that SAD accuracy could be improved by making use of the semantic information of the speech. It is known that humans make use of semantic information to understand speech that is significantly affected by noise [9], [10]. In [11], the authors demonstrated that our inferior-frontal cortex predicts what someone is likely to hear next even before the actual sound reaches the superior temporal gyrus, allowing us to separate noise from what is actually spoken. One of our aims in this paper is to investigate how and to what extent we could improve the performance of SAD if we were to use semantic information to predict the next speech segment. Current SAD methods simply classify whether a sample is speech or non-speech, without paying attention to the temporal context. Even though current state-of-the-art SAD systems extract features from a sliding window surrounding the event frame of interest, they treat the frame as an isolated event and do not consider the entire sequence when detecting speech activity. We show in this paper that through the prediction of the next audio segment, exploiting the task-specific loss function learning capability of the GAN framework, we can improve SAD accuracy by a significant amount.

The proposed architecture is shown in Fig. 1. The model utilizes audio, Mel-Frequency Cepstral Coefficients (MFCC), and deltas of MFCC as the inputs and encodes them into an encoded representation, C. The generators receive this input embedding, C, and a noise vector, z, as inputs. We utilize two generators: Gη, which synthesizes the frame-wise speech/non-speech classifications, and Gw, which synthesizes the audio signal for the next time window. It should be noted that in Fig. 1 the generators Gη and Gw are denoted as two separate LSTM blocks, each comprising two LSTM cells. The static discriminator, Dη, receives the current input embeddings and either the synthesized or ground truth speech classification sequences and tries to discriminate between the two. The temporal discriminator, Dw, also receives the current input embeddings and either the synthesized or ground truth future audio segments and learns to classify them, considering the temporal consistency of those signals.
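The components described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the layer sizes, the 39-dimensional MFCC+delta input, the 16-dimensional noise vector, and the pooling choices in the discriminators are all assumptions made for the example; only the overall structure (shared encoder, two generators Gη/Gw, a static discriminator Dη, and a sequence-reading temporal discriminator Dw) follows the description.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes MFCC (+ delta) frames into a context embedding C (assumed sizes)."""
    def __init__(self, in_dim=39, hid=64):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hid, batch_first=True)
    def forward(self, x):
        out, _ = self.rnn(x)                 # (B, T, hid)
        return out

class Generator(nn.Module):
    """Two stacked LSTM cells, per the paper's description; out_dim selects
    between frame-wise SAD logits (Geta) and next-segment features (Gw)."""
    def __init__(self, hid=64, noise_dim=16, out_dim=1):
        super().__init__()
        self.rnn = nn.LSTM(hid + noise_dim, hid, num_layers=2, batch_first=True)
        self.head = nn.Linear(hid, out_dim)
    def forward(self, C, z):
        out, _ = self.rnn(torch.cat([C, z], dim=-1))
        return self.head(out)

class Discriminator(nn.Module):
    """Conditioned on C. The temporal variant reads the whole sequence through
    an LSTM before scoring, so its decision depends on temporal consistency."""
    def __init__(self, hid=64, sig_dim=1, temporal=False):
        super().__init__()
        self.temporal = temporal
        if temporal:
            self.rnn = nn.LSTM(hid + sig_dim, hid, batch_first=True)
            self.head = nn.Linear(hid, 1)
        else:
            self.head = nn.Sequential(
                nn.Linear(hid + sig_dim, hid), nn.ReLU(), nn.Linear(hid, 1))
    def forward(self, C, y):
        h = torch.cat([C, y], dim=-1)
        if self.temporal:
            out, _ = self.rnn(h)
            return torch.sigmoid(self.head(out[:, -1]))    # one score per sequence
        return torch.sigmoid(self.head(h)).mean(dim=1)     # frame-wise, then pooled

B, T = 4, 50
x = torch.randn(B, T, 39)             # MFCC + delta features (assumed dim)
z = torch.randn(B, T, 16)             # noise vector
C = Encoder()(x)
sad = Generator(out_dim=1)(C, z)      # frame-wise speech/non-speech logits
nxt = Generator(out_dim=39)(C, z)     # predicted next audio segment (features)
d_eta = Discriminator(sig_dim=1)(C, torch.sigmoid(sad))
d_w = Discriminator(sig_dim=39, temporal=True)(C, nxt)
```

Conditioning both discriminators on C is what makes the learned loss context-aware: each discriminator judges the synthesized sequence jointly with the embedding it was generated from, rather than in isolation.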


In this paper, we propose a novel multi-task learning framework for speech activity detection, by properly analyzing the context of the input embeddings and their temporal accordance. We contribute a novel data-driven method to capture salient information from the observed audio segment by jointly predicting the speech activity classification sequence and the audio for the next time frame. Additionally, we introduce a temporal discriminator to enforce these relationships in the synthesized data. Our quantitative evaluations using multiple supervised SAD benchmarks, including NIST OpenSAT'17 [2], AMI Meeting [50], OpenKWS'13 [51], and HAVIC [1], demonstrated the utility of the proposed multi-task learning framework compared to single-task supervised SAD baselines. Furthermore, through the ablation model evaluations presented in Sec. V-E, we demonstrate that the automatic learning of a loss function specifically considering the task at hand, as opposed to using hand-engineered losses, has significantly contributed to the superior performance attained by the proposed multi-task learning framework. In addition, in Tab. IV we provide comparisons between systems with and without the proposed temporal discriminator. The evaluation of the temporal discriminator, which enforces the temporal relationships between audio frames of the synthesized outputs, demonstrates the utility of incorporating this intelligence in the discriminator, guiding the generator toward realistic outputs. With empirical evaluations we illustrate that the future audio segment prediction auxiliary task helps to improve the performance of the SAD task, demonstrating the utility of multi-task learning and the importance of the future audio segment prediction task for learning the context of the input embeddings.
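As a rough, hypothetical illustration of how such a multi-task adversarial objective is typically assembled (this is not the paper's exact formulation; the L1 reconstruction term and the weight `lam` are assumptions made for the example), the generator side is trained to fool both discriminators while an auxiliary term ties the predicted next segment to the ground truth:

```python
import torch
import torch.nn.functional as F

def generator_loss(d_eta_fake, d_w_fake, pred_next, true_next, lam=10.0):
    """Hypothetical multi-task generator objective: fool the static (d_eta)
    and temporal (d_w) discriminators, plus an L1 term on the predicted
    next audio segment. lam weights the reconstruction term (assumed value)."""
    adv = F.binary_cross_entropy(d_eta_fake, torch.ones_like(d_eta_fake)) + \
          F.binary_cross_entropy(d_w_fake, torch.ones_like(d_w_fake))
    rec = F.l1_loss(pred_next, true_next)
    return adv + lam * rec

# toy check: discriminators nearly fooled, next segment predicted exactly
d1 = torch.full((4, 1), 0.99)
d2 = torch.full((4, 1), 0.99)
x = torch.randn(4, 50, 39)
loss = generator_loss(d1, d2, x, x.clone())  # small: only mild BCE terms remain
```

The key point the ablation supports is that the adversarial terms are learned, task-specific critics rather than fixed hand-engineered distances, while the temporal critic's sequence-level score is what propagates temporal-consistency pressure back to the generator.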
To better demonstrate the robustness of the proposed framework, we conducted a cross-database evaluation in which we trained the model on one dataset and tested it on another. This experiment revealed that the proposed multi-task learning framework learns more discriminative features that are more robust across multiple datasets, compared to current state-of-the-art supervised SAD models. We would like to emphasize that these evaluated datasets span different languages, accents, and acoustic conditions, and the proposed method exhibits a 37-52% relative gain over the best alternative approach (ACAM [28]) when evaluated on NIST OpenSAT'17 [2].


This research was supported by an Australian Research Council (ARC) Discovery grant DP140100793.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. Its members initially formed a virtual group on the Viber social network, and the core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

The KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public or private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for everyone, everywhere. We want to integrate research and education. We believe education is a fundamental human right, so our services are concentrated on inclusive education.

The KSRA team partners with under-served local communities around the world to improve access to and the quality of education-based knowledge, to amplify and augment learning programs where they exist, and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Full paper PDF file:

Temporarily-Aware Context Modeling Using Generative Adversarial Networks for Speech Activity Detection



T. Fernando, S. Sridharan, M. McLaren, D. Priyasad, S. Denman, and C. Fookes




Temporarily-Aware Context Modeling Using Generative Adversarial Networks for Speech Activity Detection

Published in

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1159-1169, 2020



PDF reference and original file: Click here





Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship and holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.