Unsupervised Personalization and Deep Uncertainty Modeling for Speech Emotion Recognition
Abstract
Robust, reliable and generalizable speech emotion recognition (SER) systems have wide application in areas such as healthcare, security and defense. These areas involve mission-critical applications where accuracy, test-retest reliability and scalability are of high importance. This dissertation aims to develop SER solutions that address these requirements. We focus on two main directions to improve SER performance: personalizing a SER system to target individuals, and modeling uncertainty to quantify the reliability of the SER predictions. We formulate the prediction of emotional attributes as a regression problem and the recognition of primary emotions as a classification problem, both implemented using deep neural networks (DNNs). In our first research direction, we develop personalization approaches that adapt SER models to improve their performance on target speakers. We study the role of regularization to understand the need for personalization. We focus on the prediction of valence (negative versus positive), which has often been shown to have lower recognition performance than other emotional attributes such as arousal (calm versus active) and dominance (weak versus strong). Through an exhaustive analysis, we demonstrate that the prediction of valence needs higher regularization in DNNs than the prediction of arousal and dominance. We explore the nature of the valence cues conveyed in speech, finding that they possess stronger speaker-dependent traits. Higher regularization forces the network to learn global patterns that generalize across speakers. This finding suggests that the accuracy of SER models, especially in the valence dimension, can be improved by developing a personalization strategy that enables the models to capture the speaker-dependent traits of the target subjects.
We explore an unsupervised learning approach to adapt DNN models to target speakers in the test set by searching the train set for speakers whose acoustic patterns are similar to those of the speaker in the test set. With the data from the selected speakers in the train set (i.e., the closest speakers to the speakers in the test set), we propose transfer learning and weighting strategies to adapt the SER system to target speakers, achieving statistically significant gains in the prediction of valence. This approach is an effective personalization method for SER problems.
The second direction pursued in this dissertation is the modeling of uncertainty in the SER predictions. SER is a difficult task when the recordings come from everyday spontaneous interactions, which involve complex emotional behaviors in human communication. Therefore, it is imperative to develop systems that can provide a confidence score for their predictions, leading to better calibrated SER systems. We explore approaches for reliability estimation in SER using Bayesian learning methods, demonstrating the benefits of a machine learning formulation with a reject option. This strategy enables a SER model to abstain from prediction when the confidence of the prediction is low. It therefore selectively trades coverage, the portion of the test set for which the system provides a prediction, for better performance. We propose uncertainty modeling approaches for SER classification and regression tasks using criteria such as empirical risk minimization and methods such as Monte Carlo dropout (MCD). We analyze the prediction uncertainties as a function of the predicted emotional attribute scores and the inter-evaluator agreement of the corresponding ground truth annotations. We find that speech segments that are harder for the model to evaluate are also harder for the evaluators to annotate.
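To make the reject option concrete, the following is a minimal sketch of Monte Carlo dropout combined with coverage-based abstention. It is an illustration under stated assumptions, not the dissertation's implementation: the two-layer network with random, untrained weights, the use of predictive entropy as the uncertainty score, and the names `mc_dropout_predict`, `predictive_entropy` and `predict_with_reject` are all hypothetical choices made here for clarity.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, rng, n_passes=50, drop_rate=0.5):
    """Average class probabilities over stochastic forward passes.

    Dropout stays active at inference time, so each pass samples a
    different sub-network (toy two-layer classifier, assumed here).
    """
    probs = []
    for _ in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)                    # ReLU layer
        mask = rng.random(hidden.shape) >= drop_rate        # random dropout mask
        hidden = hidden * mask / (1.0 - drop_rate)          # inverted dropout scaling
        logits = hidden @ w2
        logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
        p = np.exp(logits)
        probs.append(p / p.sum(axis=1, keepdims=True))      # softmax
    return np.mean(probs, axis=0)

def predictive_entropy(mean_probs):
    """Entropy of the averaged prediction; higher means less confident."""
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)

def predict_with_reject(mean_probs, coverage=0.8):
    """Predict only on the most confident `coverage` fraction of samples.

    Abstained samples are labeled -1.
    """
    entropy = predictive_entropy(mean_probs)
    n_keep = int(np.ceil(coverage * len(entropy)))
    keep_idx = np.argsort(entropy)[:n_keep]                 # lowest-entropy samples
    labels = np.full(len(entropy), -1)
    labels[keep_idx] = mean_probs[keep_idx].argmax(axis=1)
    return labels, entropy

# Toy data: 10 utterances, 8 features, 4 emotion classes (all synthetic).
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 4))

mean_probs = mc_dropout_predict(x, w1, w2, rng)
labels, entropy = predict_with_reject(mean_probs, coverage=0.8)
print(labels)
```

Lowering `coverage` makes the model abstain on more samples; with a trained model, the expectation is that accuracy on the retained samples rises as coverage falls, which is the trade-off described above.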
Using probabilistic graphical networks, we demonstrate a novel uncertainty modeling scheme where subjective uncertainties can be learned from the true annotator label distributions. Likewise, we propose a novel teacher-student ensemble formulation that performs SER in a scalable and consistent manner while capturing the uncertainty in the predictions. Finally, we provide a comprehensive study of the benefits of uncertainty modeling techniques for SER in active learning, reject option and curriculum learning algorithms.