Lexical Dependent Emotion Detection Using Synthetic Speech Reference

dc.contributor.ORCID: 0000-0003-1891-0267 (Lotfian, R)
dc.contributor.ORCID: 0000-0002-4075-4072 (Busso, CA)
dc.contributor.author: Lotfian, Reza
dc.contributor.author: Busso, Carlos A.
dc.contributor.utdAuthor: Lotfian, Reza
dc.contributor.utdAuthor: Busso, Carlos A.
dc.date.accessioned: 2020-08-15T21:45:24Z
dc.date.available: 2020-08-15T21:45:24Z
dc.date.issued: 2019-02-08
dc.description.abstract: This paper aims to create neutral reference models from synthetic speech to contrast the emotional content of a speech signal. Modeling emotional behaviors is a challenging task due to the variability in how emotions are perceived and described. Previous studies have indicated that relative assessments are more reliable than absolute assessments. These studies suggest that comparing a target sentence against a reference signal with known emotional content (e.g., neutral emotion) may produce more reliable metrics for identifying emotional segments. Ideally, we would like an emotionally neutral sentence with the same lexical content as the target sentence, temporally aligned with it. In this fictitious scenario, we could identify localized emotional cues by contrasting, frame by frame, the acoustic features of the target and reference sentences. This paper explores the idea of building such reference sentences by leveraging advances in speech synthesis: it builds a synthetic speech signal that conveys the same lexical information as, and is temporally aligned with, the target sentence in the database. Since a single synthetic voice is not expected to capture the full range of variability observed in neutral speech, we build multiple synthetic sentences using various voices and text-to-speech approaches. The paper analyzes whether the synthesized signals provide valid template references for describing neutral speech, using feature analysis and perceptual evaluations. Finally, we demonstrate how this framework can be used in emotion recognition, achieving improvements over classifiers trained with state-of-the-art features in detecting low versus high levels of arousal and valence.
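
The frame-level contrast described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes librosa and numpy are installed, uses MFCCs as a stand-in for the paper's acoustic features, and uses dynamic time warping for alignment (the paper instead synthesizes references that are already temporally aligned with the target). The file names and the helper frame_contrast are hypothetical.

# Sketch: contrast a target utterance against synthetic neutral
# references of the same text, frame by frame (assumptions noted above).
import numpy as np
import librosa

def frame_contrast(target_wav, reference_wav, sr=16000, n_mfcc=13):
    """Align a synthetic neutral reference to the target with DTW, then
    return feature differences (target minus reference) along the path."""
    y_t, _ = librosa.load(target_wav, sr=sr)
    y_r, _ = librosa.load(reference_wav, sr=sr)
    F_t = librosa.feature.mfcc(y=y_t, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T_t)
    F_r = librosa.feature.mfcc(y=y_r, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T_r)
    _, wp = librosa.sequence.dtw(X=F_t, Y=F_r)  # warping path, end-to-start
    wp = wp[::-1]                               # reorder start-to-end
    # Aligned (target, reference) frame pairs along the warping path.
    return F_t[:, wp[:, 0]] - F_r[:, wp[:, 1]]

# Pool deviations across several synthetic voices, mirroring the paper's
# use of multiple references to cover neutral-speech variability
# (file names are placeholders).
refs = ["tts_voice_a.wav", "tts_voice_b.wav"]
deltas = [frame_contrast("target.wav", r) for r in refs]
summary = np.mean([np.abs(d).mean(axis=1) for d in deltas], axis=0)
print(summary)  # per-coefficient mean deviation from the neutral references

Large frame-wise deviations would flag localized emotional cues; a classifier over such deviation features is one plausible reading of how the framework feeds the arousal/valence detection the abstract reports.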
dc.description.department: Erik Jonsson School of Engineering and Computer Science
dc.identifier.bibliographicCitation: Lotfian, Reza, and Carlos Busso. 2019. "Lexical Dependent Emotion Detection Using Synthetic Speech Reference." IEEE Access 7: 22071-22085. doi: 10.1109/ACCESS.2019.2898353.
dc.identifier.issn: 2169-3536
dc.identifier.uri: http://dx.doi.org/10.1109/ACCESS.2019.2898353
dc.identifier.uri: https://hdl.handle.net/10735.1/8802
dc.identifier.volume: 7
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.rights: Open Access Publishing Agreement. Commercial reuse is prohibited.
dc.rights: ©2019 IEEE
dc.rights.uri: https://open.ieee.org/index.php/about-ieee-open-access/faqs/
dc.source.journal: IEEE Access
dc.subject: Speech—Analysis
dc.subject: Speech perception
dc.subject: Speech synthesis
dc.subject: Computer science
dc.subject: Engineering
dc.subject: Telecommunication
dc.title: Lexical Dependent Emotion Detection Using Synthetic Speech Reference
dc.type.genre: article

Files

Original bundle

Name: JECS-6770-261799.46.pdf
Size: 12.32 MB
Format: Adobe Portable Document Format
Description: Article