Machine Learning Solutions for Emotional Speech: Exploiting the Information of Individual Annotations
Speech emotion recognition solutions designed under controlled conditions do not translate well into real-life applications. The first step to address this issue is developing a large, natural emotional database. Current approaches used to collect spontaneous databases tend to provide unbalanced emotional content, which is dictated by the given recording protocol. Their size and speaker diversity are also limited. We propose a novel approach to effectively build a large, naturalistic emotional database with balanced emotional content, reduced cost, and reduced manual labor. It relies on existing spontaneous recordings obtained from audio-sharing websites. To balance the emotional content, we apply preference learning methods to retrieve speech samples with emotions that are less frequent in everyday conversation. The target emotions in this approach can be represented with attribute-based labels (e.g., arousal, valence, dominance) or with categorical emotions (e.g., happy, angry, sad). We address both formulations. Motivated by positive results from applying preference learning to the emotion recognition problem, we propose to use a neutral reference to detect the absolute emotional content of speech samples. We synthesize speech from the same transcript and use it as a reference to account for lexical dependencies, which are considered a nuisance factor in detecting emotions. The collected MSP-PODCAST database opens new research opportunities to address some challenging aspects of emotion recognition. One challenge is that expressive behaviors tend to be ambiguous, with blended emotions, during spontaneous conversations. As a result, evaluators disagree on the perceived emotion, assigning multiple emotional classes to the same stimulus. These observations have clear implications for emotion classification, where assigning a single descriptor per stimulus oversimplifies the intrinsic subjectivity in emotion perception.
We propose a new formulation, where the emotional perception of a stimulus is a multidimensional Gaussian random variable with an unobserved distribution. Each dimension corresponds to an emotion characterized by a numerical scale. We train a deep network to predict the mean vector of this distribution instead of a single label. The second challenge we address is the limited number of training examples from minority emotional classes. We argue that individual labels convey more information than the consensus labels. We present a novel over-sampling approach, where the samples are over-sampled according to the labels from individual evaluations. This approach (1) increases the number of samples from classes with underrepresented consensus labels, and (2) efficiently uses samples with ambiguous emotional content. The next machine learning concept we discuss in this study is the order in which the samples are introduced to the learning process. We introduce a method to design a curriculum for machine learning that maximizes efficiency during learning. The curriculum is arranged so that samples are learned gradually, from easy to difficult. For emotion recognition, the challenge is to establish an order of difficulty in the training set. We propose to use the disagreement between evaluators as a measure of the difficulty of the classification. Our experimental results show that relying on a curriculum based on human judgment of emotion consistently improves classification performance across emotion recognition tasks and increases the convergence rate.
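The over-sampling and curriculum ideas described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the class list, the toy annotations, and the function names are hypothetical, and two design choices are assumptions made only for illustration. First, over-sampling is taken to mean emitting one training copy of a sample per individual evaluation. Second, evaluator disagreement is measured as the normalized entropy of the individual labels, which is one possible measure among several.

```python
import numpy as np

# Hypothetical class inventory; the actual label set may differ.
EMOTIONS = ["angry", "happy", "sad", "neutral"]

def label_distribution(annotations):
    """Fraction of evaluators who chose each class (a soft target)."""
    counts = np.bincount(annotations, minlength=len(EMOTIONS))
    return counts / counts.sum()

def disagreement(annotations):
    """Normalized entropy of the individual labels (assumed measure).

    0 means all evaluators agree; 1 means the labels are spread
    uniformly over all classes.
    """
    p = label_distribution(annotations)
    nonzero = p[p > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return float(entropy / np.log(len(EMOTIONS)))

def oversample_by_individual_labels(samples):
    """Emit one (sample_id, label) pair per individual evaluation."""
    return [(sid, int(label))
            for sid, annotations in samples.items()
            for label in annotations]

def curriculum_order(samples):
    """Sort sample ids from low to high disagreement (easy to difficult)."""
    return sorted(samples, key=lambda sid: disagreement(samples[sid]))

# Toy annotations: five evaluators per utterance, labels index EMOTIONS.
samples = {
    "utt1": [1, 1, 1, 1, 1],  # unanimous: happy
    "utt2": [0, 2, 0, 3, 2],  # ambiguous: angry, sad, angry, neutral, sad
    "utt3": [3, 3, 3, 1, 3],  # mostly neutral
}

pairs = oversample_by_individual_labels(samples)
order = curriculum_order(samples)
class_counts = np.bincount([label for _, label in pairs],
                           minlength=len(EMOTIONS))
```

In this toy set, a majority vote would give "utt2" no consensus label and would yield no angry or sad training examples at all, whereas the individual evaluations contribute examples of both classes. The curriculum presents the unanimous utterance first and the ambiguous one last.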