Novel Frameworks for Attribute-Based Speech Emotion Recognition using Time-Continuous Traces and Sentence-Level Annotations
Abstract
Speech emotion recognition (SER) plays an important role in a growing world of automation and artificial intelligence. Robust and accurate SER systems are crucial for enhancing
human-computer interaction. Emotional attributes such as arousal (calm versus excited),
valence (negative versus positive), and dominance (weak versus strong) provide a powerful
representation to describe a wide variety of complex emotions expressed in everyday interactions. However, SER systems for emotional attributes face several key challenges. First,
defining an effective temporal granularity for the analysis and recognition of emotion is an
open research question. Most of the emotions expressed during human interactions are neutral. While existing SER frameworks often analyze isolated sentences, it is important to
identify and focus the analysis on emotionally salient regions. Second, emotional attribute labels are collected through perceptual evaluations from multiple annotators. The subjectivity
of the annotators and the complex nature of natural human interaction make the evaluation
process noisy, leading to low inter-evaluator agreement, which degrades the quality of SER systems.
It is important to define formulations that are effective in the presence of noisy labels. Another major challenge facing SER systems is their generalization across multiple conditions.
SER systems have to maintain performance in the presence of different speakers, channels
and recording conditions. This dissertation proposes novel frameworks to address these open
challenges.
For time-continuous annotations, this dissertation proposes the definition of emotionally
salient regions (hotspots) using the qualitative agreement (QA) method. The QA method
combines annotations from multiple evaluators by identifying agreeable trends. We illustrate
the benefits of the QA method over averaging absolute values of the traces without considering trends across evaluators. After defining these hotspots, we propose fusion techniques
to predict their presence with novel machine learning formulations. The detection method
combines multiple deep learning regressors by averaging their predictions, or relying again
on the QA method over the predicted traces. The results indicate that hotspots can be
reliably identified with the proposed methods.
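The idea of combining evaluators by trend rather than by absolute value can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the tolerance `tol`, the agreement threshold `min_agreement`, and the use of consecutive windowed values are all assumptions made for illustration.

```python
def trend(a, b, tol):
    """Return +1 (rise), -1 (fall), or 0 (flat) between two windowed values."""
    diff = b - a
    if diff > tol:
        return 1
    if diff < -tol:
        return -1
    return 0


def consensus_trends(traces, tol=0.05, min_agreement=0.6):
    """Keep a trend between consecutive windows only if enough evaluators share it.

    traces: list of equal-length lists, one per evaluator (hypothetical
    window-averaged annotation values). tol and min_agreement are assumed
    illustrative settings, not values from the dissertation.
    """
    n_steps = len(traces[0]) - 1
    result = []
    for t in range(n_steps):
        votes = [trend(tr[t], tr[t + 1], tol) for tr in traces]
        best = max(set(votes), key=votes.count)  # most common trend
        share = votes.count(best) / len(votes)
        # None marks a step where evaluators do not agree on a trend
        result.append(best if share >= min_agreement else None)
    return result


traces = [
    [0.10, 0.50, 0.50, 0.20],
    [0.00, 0.40, 0.42, 0.30],
    [0.20, 0.60, 0.30, 0.28],
]
print(consensus_trends(traces))
```

In this toy example the three evaluators use different absolute levels, yet they agree that the trace rises at the first step and falls at the last one; steps without sufficient agreement would be returned as `None` and ignored.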
To address the noisy nature of emotional labels, we formulate the SER task as an ordinal
problem where the objective is to rank emotional recordings according to a given emotional
attribute. We obtain relative scores by establishing preferences between absolute scores.
First, we explore whether a preference learning framework relying on deep learning can
outperform conventional ranking algorithms (e.g., RankSVM). The proposed approach, implemented with the RankNet algorithm, achieves state-of-the-art results. Likewise, we modify
the QA method to learn reliable preferences for sentence-level annotations, avoiding learning
preferences from averaged absolute labels. We compare the QA-based preference labels with
the absolute preference labels for rank-ordering emotional attributes using the RankNet and
RankMargin algorithms; the QA-based labels yield more accurate predictions.
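As a rough illustration of how preferences can be derived from absolute scores and used for ranking, the sketch below builds (winner, loser) pairs and evaluates the standard RankNet pairwise cross-entropy on model outputs. The `margin` value, the function names, and the toy scores are assumptions for illustration, not settings reported in this dissertation.

```python
import math
from itertools import combinations


def preference_pairs(scores, margin=0.5):
    """Return (winner, loser) index pairs whose absolute scores differ by more than margin.

    margin is an assumed illustrative threshold; pairs with closer scores are
    skipped because their ordering is considered unreliable.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] - scores[j] > margin:
            pairs.append((i, j))
        elif scores[j] - scores[i] > margin:
            pairs.append((j, i))
    return pairs


def ranknet_loss(s_winner, s_loser):
    """RankNet pairwise loss: -log(sigmoid(s_winner - s_loser))."""
    return math.log1p(math.exp(-(s_winner - s_loser)))


# Hypothetical sentence-level arousal scores for four recordings
labels = [4.0, 2.0, 3.8, 1.0]
pairs = preference_pairs(labels)
print(pairs)

# Hypothetical model outputs for the same recordings
outputs = [1.2, 0.1, 0.9, -0.5]
mean_loss = sum(ranknet_loss(outputs[w], outputs[l]) for w, l in pairs) / len(pairs)
print(mean_loss)
```

Recordings 0 and 2 differ by less than the margin, so no preference is formed between them; the loss is small when the model scores each winner above its loser and grows when the order is reversed.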
Finally, this dissertation proposes frameworks to increase the generalization of SER systems.
First, this dissertation presents methods to jointly learn multiple emotional attributes by
exploiting their interdependencies. The framework relies on multi-task learning (MTL) with
shared hidden layers to learn rich representations for attribute prediction. Next, we improve our MTL framework by including an unsupervised auxiliary task to reconstruct hidden layer
representations of an autoencoder. The framework relies on ladder networks that utilize
skip connections between encoder and decoder layers to learn powerful representations of
emotional attributes. This approach achieves state-of-the-art performance for within-corpus
and cross-corpus experiments.
Collectively, this dissertation addresses various challenges affecting SER systems. Our novel
contributions produce state-of-the-art performance for emotional attribute predictions, making clear advances toward SER-based applications.