Lexical Dependent Emotion Detection Using Synthetic Speech Reference

dc.contributor.ORCID: 0000-0003-1891-0267 (Lotfian, R)
dc.contributor.ORCID: 0000-0002-4075-4072 (Busso, CA)
dc.contributor.author: Lotfian, Reza
dc.contributor.author: Busso, Carlos A.
dc.contributor.utdAuthor: Lotfian, Reza
dc.contributor.utdAuthor: Busso, Carlos A.
dc.date.accessioned: 2020-08-15T21:45:24Z
dc.date.available: 2020-08-15T21:45:24Z
dc.date.issued: 2019-02-08
dc.description.abstract: This paper aims to create neutral reference models from synthetic speech to contrast the emotional content of a speech signal. Modeling emotional behaviors is a challenging task due to the variability in how emotions are perceived and described. Previous studies have indicated that relative assessments are more reliable than absolute assessments. These studies suggest that comparing a target sentence against a reference signal with known emotional content (e.g., neutral emotion) may produce more reliable metrics for identifying emotional segments. Ideally, we would like an emotionally neutral sentence with the same lexical content as the target sentence, temporally aligned with it. In this fictitious scenario, we could identify localized emotional cues by contrasting, frame by frame, the acoustic features of the target and reference sentences. This paper explores the idea of building such reference sentences by leveraging advances in speech synthesis: it builds a synthetic speech signal that conveys the same lexical information as, and is temporally aligned with, the target sentence in the database. Since a single synthetic voice is not expected to capture the full range of variability observed in neutral speech, we build multiple synthetic sentences using various voices and text-to-speech approaches. The paper analyzes whether the synthesized signals provide valid template references for describing neutral speech, using feature analysis and perceptual evaluations. Finally, we demonstrate how this framework can be used in emotion recognition, achieving improvements over classifiers trained with state-of-the-art features in detecting low versus high levels of arousal and valence.
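
The frame-level contrast described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes librosa and numpy are installed, uses MFCCs as a stand-in for the paper's acoustic features, and uses dynamic time warping for alignment (the paper instead synthesizes references that are already temporally aligned with the target). The file names and the helper frame_contrast are hypothetical.

# Sketch: contrast a target utterance against synthetic neutral
# references of the same text, frame by frame (assumptions noted above).
import numpy as np
import librosa

def frame_contrast(target_wav, reference_wav, sr=16000, n_mfcc=13):
    """Align a synthetic neutral reference to the target with DTW, then
    return feature differences (target minus reference) along the path."""
    y_t, _ = librosa.load(target_wav, sr=sr)
    y_r, _ = librosa.load(reference_wav, sr=sr)
    F_t = librosa.feature.mfcc(y=y_t, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T_t)
    F_r = librosa.feature.mfcc(y=y_r, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T_r)
    _, wp = librosa.sequence.dtw(X=F_t, Y=F_r)  # warping path, end-to-start
    wp = wp[::-1]                               # reorder start-to-end
    # Aligned (target, reference) frame pairs along the warping path.
    return F_t[:, wp[:, 0]] - F_r[:, wp[:, 1]]

# Pool deviations across several synthetic voices, mirroring the paper's
# use of multiple references to cover neutral-speech variability
# (file names are placeholders).
refs = ["tts_voice_a.wav", "tts_voice_b.wav"]
deltas = [frame_contrast("target.wav", r) for r in refs]
summary = np.mean([np.abs(d).mean(axis=1) for d in deltas], axis=0)
print(summary)  # per-coefficient mean deviation from the neutral references

Large frame-wise deviations would flag localized emotional cues; a classifier over such deviation features is one plausible reading of how the framework feeds the arousal/valence detection the abstract reports.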
dc.description.department: Erik Jonsson School of Engineering and Computer Science
dc.identifier.bibliographicCitation: Lotfian, Reza, and Carlos Busso. 2019. "Lexical Dependent Emotion Detection Using Synthetic Speech Reference." IEEE Access 7: 22071-22085. doi: 10.1109/ACCESS.2019.2898353.
dc.identifier.issn: 2169-3536
dc.identifier.uri: http://dx.doi.org/10.1109/ACCESS.2019.2898353
dc.identifier.uri: https://hdl.handle.net/10735.1/8802
dc.identifier.volume: 7
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.rights: Open Access Publishing Agreement. Commercial reuse is prohibited.
dc.rights: ©2019 IEEE
dc.rights.uri: https://open.ieee.org/index.php/about-ieee-open-access/faqs/
dc.source.journal: IEEE Access
dc.subject: Speech—Analysis
dc.subject: Speech perception
dc.subject: Speech synthesis
dc.subject: Computer science
dc.subject: Engineering
dc.subject: Telecommunication
dc.title: Lexical Dependent Emotion Detection Using Synthetic Speech Reference
dc.type.genre: article

Files

Original bundle

Name: JECS-6770-261799.46.pdf
Size: 12.32 MB
Format: Adobe Portable Document Format
Description: Article