Novel Frameworks for Attribute-Based Speech Emotion Recognition using Time-Continuous Traces and Sentence-Level Annotations







Speech emotion recognition (SER) plays an important role in a growing world of automation and artificial intelligence. Robust and accurate SER systems are crucial for enhancing human-computer interaction. Emotional attributes such as arousal (calm versus excited), valence (negative versus positive), and dominance (weak versus strong) provide a powerful representation for describing the wide variety of complex emotions expressed in everyday interactions. However, SER systems for emotional attributes face several key challenges. First, defining an effective temporal granularity for the analysis and recognition of emotion is an open research question. Most of the emotions expressed during human interactions are neutral; while existing SER frameworks often analyze isolated sentences, it is therefore important to identify and focus the analysis on emotionally salient regions. Second, emotional attribute labels are collected through perceptual evaluations by multiple annotators. The subjectivity of the annotators and the complex nature of natural human interaction make the evaluation process noisy, leading to low inter-evaluator agreement that degrades the quality of SER systems. It is important to define formulations that remain effective in the presence of noisy labels. A third major challenge is generalization across multiple conditions: SER systems have to maintain performance in the presence of different speakers, channels, and recording conditions. This dissertation proposes novel frameworks to address these open challenges. For time-continuous annotations, this dissertation proposes the definition of emotionally salient regions (hotspots) using the qualitative agreement (QA) method. The QA method combines annotations from multiple evaluators by identifying agreeable trends. We illustrate the benefits of the QA method over averaging the absolute values of the traces without considering trends across evaluators.
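The dissertation specifies the full QA rules; as a rough illustration of the underlying idea (agreement on trends rather than on absolute trace values), a simplified trend-agreement check over annotator traces might look like the sketch below. The slope-sign rule and the fixed-window parameter are assumptions for illustration, not the dissertation's exact formulation.

```python
import numpy as np

def trend_agreement(traces, window=5):
    """Simplified sketch of the QA idea: for each window, record each
    annotator's local trend (+1 rising, -1 falling, 0 flat) and keep
    only windows where every annotator agrees on a non-flat trend."""
    traces = np.asarray(traces)              # shape: (annotators, frames)
    agreed = []
    for t in range(traces.shape[1] - window):
        slopes = traces[:, t + window] - traces[:, t]
        trends = np.sign(slopes)
        if trends[0] != 0 and np.all(trends == trends[0]):
            agreed.append((t, int(trends[0])))
    return agreed                            # list of (start_frame, trend)

# Three annotators whose traces share a rising trend early on
a = [[0.0, 0.1, 0.2, 0.3, 0.4, 0.4, 0.3],
     [0.1, 0.2, 0.2, 0.4, 0.5, 0.5, 0.5],
     [0.0, 0.0, 0.1, 0.2, 0.3, 0.3, 0.2]]
print(trend_agreement(a, window=3))  # → [(0, 1), (1, 1), (2, 1)]
```

Only the windows where all three annotators rise together survive; the final window, where one annotator still rises while the others flatten, is discarded rather than averaged in.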
After defining these hotspots, we propose fusion techniques to predict their presence with novel machine learning formulations. The detection method combines multiple deep learning regressors, either by averaging their predictions or by applying the QA method again over the predicted traces. The results indicate that hotspots can be reliably identified with the proposed methods. To address the noisy nature of emotional labels, we formulate the SER task as an ordinal problem whose objective is to rank emotional recordings according to a given emotional attribute. We obtain relative scores by establishing preferences between absolute scores. First, we explore whether a preference learning framework relying on deep learning can outperform conventional ranking algorithms (e.g., RankSVM). The proposed approach, implemented with the RankNet algorithm, achieves state-of-the-art results. Likewise, we modify the QA method to learn reliable preferences from sentence-level annotations, avoiding learning preferences from averaged absolute labels. We compare the QA-based preference labels with absolute preference labels for rank-ordering emotional attributes using the RankNet and RankMargin algorithms, obtaining more accurate predictions. Finally, this dissertation proposes frameworks to increase the generalization of SER systems. First, this dissertation presents methods to jointly learn multiple emotional attributes by exploiting their interdependencies. The framework relies on multi-task learning (MTL) with shared hidden layers to learn rich representations for attribute prediction. Next, we improve our MTL framework by adding an unsupervised auxiliary task that reconstructs hidden-layer representations with an autoencoder. The framework relies on ladder networks, which use skip connections between encoder and decoder layers to learn powerful representations of emotional attributes.
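The step of establishing preferences between absolute scores can be sketched concretely. The margin-based filter below is a hypothetical simplification (the threshold value and pairing rule are assumptions, not the dissertation's exact criteria): pairs whose scores are too close, and hence most affected by annotation noise, are discarded before training a ranker such as RankNet.

```python
import itertools

def preference_pairs(scores, margin=0.5):
    """Hypothetical sketch: convert absolute attribute scores into
    ordinal preferences, keeping only pairs whose scores differ by at
    least `margin` so that ambiguous (noisy) pairs are discarded."""
    pairs = []
    for (i, si), (j, sj) in itertools.combinations(enumerate(scores), 2):
        if si - sj >= margin:
            pairs.append((i, j))   # sentence i preferred over sentence j
        elif sj - si >= margin:
            pairs.append((j, i))
    return pairs

# Arousal scores for four sentences (illustrative values on a 1-7 scale)
print(preference_pairs([5.2, 2.1, 4.9, 2.0], margin=1.0))
# → [(0, 1), (0, 3), (2, 1), (2, 3)]
```

Note that the near-tie between sentences 0 and 2 (5.2 versus 4.9) produces no preference at all, which is exactly the kind of unreliable supervision an ordinal formulation avoids.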
This approach achieves state-of-the-art performance for within-corpus and cross-corpus experiments. Collectively, this dissertation addresses various challenges affecting SER systems. Our novel contributions produce state-of-the-art performance for emotional attribute predictions, making clear advances toward SER-based applications.



Automatic speech recognition, Emotion recognition, Human-computer interaction, Qualitative reasoning, Computer multitasking, Supervised learning (Machine learning)


©2019 Srinivas Parthasarathy