Perception and Production of Emotional Prosody in the Speech of Mandarin-Speaking Adults with Cochlear Implants





Emotional prosody, which refers to the expression of emotion through spoken language, is essential for correctly recognizing speakers’ emotional states during spoken communication. Previous research has shown that individuals with cochlear implants (CIs) who speak non-tonal languages demonstrate deficits in perceiving and producing emotional prosody compared to their typical hearing (TH) counterparts. These studies, however, did not explore how well individuals with CIs who speak tonal languages (e.g., Mandarin) perceive and produce emotional prosody in speech relative to their TH counterparts. Additionally, no data are available to clarify whether CI adults who speak tonal languages differ from those who speak non-tonal languages in the extent of emotional prosodic processing. This dissertation addressed these questions through four experiments. The first experiment explored differences in emotional prosody perception between 15 TH adults and 15 CI adults, all native Mandarin-speaking adults. The TH listeners heard both natural speech stimuli and noise-vocoded speech stimuli designed to simulate CI input; the CI adults heard only natural speech. Overall emotional prosody recognition for natural speech was 72.8% for TH listeners and 50.3% for CI listeners, indicating that CI adults demonstrate deficits in perceiving emotional prosody. TH listeners performed better with natural speech than with noise-vocoded speech, and their accuracy dropped as the number of noise-vocoder filter channels was reduced. In addition, CI listeners’ performance with natural speech was similar to TH listeners’ performance at a lower channel setting (4 channels), in contrast to the 8-channel setting reported in comparable studies of non-tonal languages (e.g., English).
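As an illustrative sketch only (not the stimulus-generation pipeline used in the dissertation), an N-channel noise vocoder of the kind used to simulate CI input can be approximated in Python with NumPy: the signal is split into frequency bands, each band’s amplitude envelope is extracted, and the envelope modulates band-limited noise. The logarithmic band edges and the ~10 ms envelope-smoothing window below are assumptions, not the dissertation’s parameters.

```python
import numpy as np

def noise_vocode(signal, sr, n_channels=4, lo=100.0, hi=7000.0):
    """Replace each frequency band of `signal` with band-limited noise
    modulated by that band's amplitude envelope (a crude CI simulation)."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    # Logarithmically spaced band edges between lo and hi Hz (an assumption).
    edges = np.geomspace(lo, hi, n_channels + 1)
    rng = np.random.default_rng(0)
    noise_spec = np.fft.rfft(rng.standard_normal(n))
    out = np.zeros(n)
    for k in range(n_channels):
        band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        # Band-limited speech and noise carriers via FFT masking.
        band_sig = np.fft.irfft(np.where(band, spec, 0), n)
        band_noise = np.fft.irfft(np.where(band, noise_spec, 0), n)
        # Crude amplitude envelope: rectify, then smooth (~10 ms window).
        win = max(1, int(0.01 * sr))
        env = np.convolve(np.abs(band_sig), np.ones(win) / win, mode="same")
        out += env * band_noise
    return out
```

Reducing `n_channels` coarsens the spectral detail available to the listener, which is the manipulation behind the 4-channel versus 8-channel comparison described above.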
This finding provides evidence consistent with a “functional view” hypothesis, which holds that Mandarin, a tonal language that uses pitch to signal linguistic tone, leaves relatively little prosodic space for signaling emotional prosody through the pitch dimension. The second experiment was intended to determine whether enhancing secondary cues (duration and amplitude) can help CI listeners perceive emotions. This was explored by modifying the prosodic cues of two contrasting emotions, “happy” and “sad”, and observing how CI listeners perceived these modifications. The results showed that increased duration cues slightly improved recognition of the “sad” emotion and increased amplitude improved identification of the “happy” emotion, suggesting that selective enhancement of secondary cues could benefit CI listeners. The third experiment investigated whether TH and CI talkers differ in acoustic cue production and examined which acoustic measures are most predictive of the emotions these talkers produce. This was done by analyzing the fundamental frequency (F0), intensity, and duration patterns of short sentences spoken by TH and CI talkers in “angry”, “happy”, and “sad” emotional contexts. The results indicated that, compared to their TH counterparts, CI talkers produced emotional prosody with decreased mean intensity, increased intensity range, and increased sentence duration. In addition, a machine learning (decision tree) model of emotion classification was used to determine which acoustic measures were most predictive of the three emotions produced by TH and CI talkers. For TH talkers, intensity was the most important predictor of the three emotions, followed by F0; for CI talkers, duration was the most important predictor, followed by intensity.
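The cue-ranking idea behind the decision-tree analysis can be illustrated with a minimal, hypothetical sketch: score each acoustic cue (F0, intensity, duration) by the largest Gini-impurity reduction that a single threshold split on that cue achieves over the emotion labels. This is a one-level simplification of a full decision tree, and the feature values in the usage example are fabricated for demonstration, not the dissertation’s measurements.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_gain(x, y):
    """Largest Gini-impurity reduction over all threshold splits on cue x."""
    base = gini(y)
    best = 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        best = max(best, base - (w * gini(left) + (1 - w) * gini(right)))
    return best

def rank_cues(X, y, names):
    """Order cue names from most to least predictive of the labels."""
    gains = [best_split_gain(X[:, j], y) for j in range(X.shape[1])]
    return [names[j] for j in np.argsort(gains)[::-1]]
```

A full tree (e.g., scikit-learn’s `DecisionTreeClassifier` with its `feature_importances_` attribute) generalizes this by splitting recursively, but the single-split gain already captures why one cue can emerge as “most important.”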
The model’s findings thus indicated that the secondary cues (duration and intensity) are the most predictive measures for classifying the three emotions in CI talkers’ productions. The fourth experiment examined the production data perceptually, to determine whether the deficits of CI talkers described acoustically in Experiment 3 could be perceived by TH listeners. The results confirmed that CI users show impaired emotional prosody production and that this deficit is reflected in lowered perception scores from TH listeners. In addition, CI talkers’ sentences received more “neutral” judgments than TH talkers’ sentences did, even though none of the sentences was intended to express a “neutral” emotion. This pattern of results suggests that CI users produced speech with reduced F0 variation, which listeners perceived as more monotone and therefore “neutral”. Finally, there was a significant correlation (r = 0.524, p < 0.05) between the emotional prosody perception ability of CI individuals and TH listeners’ rating scores of those same individuals’ productions. This implies that difficulty with emotional prosody perception contributes to imprecise speech prosody production, through a reduced ability to form accurate internal models of speech and/or through problems in monitoring auditory feedback related to prosodic cues in speech.
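The reported r = 0.524 is a standard Pearson correlation. As a sketch, assuming one perception score and one mean production-rating score per CI talker (the paired-score layout is an assumption; the values below are illustrative, not the dissertation’s data), it can be computed as:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))
```

In practice this is equivalent to `numpy.corrcoef(x, y)[0, 1]` or `scipy.stats.pearsonr`, the latter of which also returns the p-value used for the significance claim.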



Cochlear implants, Versification, Language and emotions, Speech perception, Machine learning, Mandarin dialects