|dc.description.abstract||One of the main barriers to the deployment of speech emotion recognition systems in real
applications is the lack of generalization of the emotion classifiers. The recognition performance achieved in controlled recordings drops when the models are tested with different
speakers, channels, environments and domain conditions. Annotating new data in the
new domain is expensive and time-consuming. Therefore, it is important to design strategies
that efficiently use limited amounts of labeled data in the new domain and extract as much
useful information as possible from the available unlabeled data to improve the robustness of
the system. This thesis studies approaches to generalize emotion classifiers to new domains.
First, we explore supervised model adaptation, which modifies the trained model using labeled data from the new domain. We study the data requirements and different approaches
for SVM adaptation in the context of supervised adaptation for speech-based emotion recognition. The results indicate that even a small portion of data used for adaptation can significantly improve the performance. Increasing the speaker diversity in the labeled data used
for adaptation does not provide a significant gain in performance. Also, we observe that classifiers trained with naturalistic or acted data achieve similar performance after adapting the
models to the target domain.
Second, we propose solutions for semi-supervised domain adaptation. We explore the use
of active learning (AL) in speech emotion recognition. Active learning selects the samples in
the new domain that are used to adapt the classification models.
We consider two approaches. The first approach focuses on selecting samples that are more
beneficial to the classifier. We propose a novel iterative, fast-converging incremental adaptation algorithm that only uses correctly classified samples at each iteration. This conservative
framework creates sequences of smooth changes in the decision hyperplane, resulting in statistically significant improvements over conventional schemes that adapt the models at once
using all the available data. The second approach focuses on selecting the features that
optimize the performance in the new domain. The method combines AL with feature
selection to build a diverse ensemble that performs well in the new domain. The use of
ensembles is an attractive solution, since they can be built to perform well across different
mismatches. We study various data selection criteria and different sample sizes to determine
the best approach toward building a robust and diverse ensemble of classifiers. The results
demonstrate that we can achieve a significant improvement by performing feature selection
on a small set from the target domain.
Finally, we explore unsupervised adaptation for speech emotion recognition. We propose
to use adversarial multitask training to extract a common representation between training
and testing domains. The primary task is to predict emotional attribute-based descriptors
for arousal, valence or dominance. The secondary task is to learn a common representation
where the training and testing domains cannot be distinguished. By using a gradient reversal
layer, the gradients coming from the domain classifier are used to bring the source and target
domain representations closer. We show that exploiting unlabeled data consistently leads to
better emotion recognition performance across all emotional dimensions. We visualize the
effect of adversarial training on the feature representation across the proposed deep learning
architecture. The analysis shows that the data representations for the training and testing domains
converge as the data is passed to deeper layers of the network.
The proposed advances create appealing strategies to build robust speech emotion classifiers that generalize across domains, providing practical affective-aware solutions to real-life