Conversational Speech Understanding in Highly Naturalistic Audio Streams







Humans spend, on average, more than 35% of their waking time communicating with others through speech. Whether in individual or group conversations, meetings, or speech broadcasts, valuable information is disseminated. In today's highly socially active environment, there is a profound need to model and understand conversations, both to assess their quality and to provide feedback to the users.

Conversational speech understanding can also help in business communication, with a view to providing better service. This is a very challenging task because of the enormous variability introduced by speakers, accents, varying domains of communication, and natural background environments, together with the subtleties inherent in spoken language and the computational challenges involved. Conversational speech understanding systems lie at the intersection of speech processing, natural language understanding, and human psychology, which makes them very challenging to build. Recent advances in neural network modeling techniques have provided a fresh perspective for building conversational speech understanding systems. Modeling conversations in a multi-tier fashion requires enormous computational capability, which is now available. Conversational speech understanding comprises three basic modules: speaker recognition/verification, which addresses "who is speaking"; large vocabulary continuous speech recognition, which addresses "what is spoken"; and, last but not least, various natural language understanding methodologies (e.g., sentiment, topic, laughter and filler density, turns taken, word count, words per turn, term frequency, intent, tone, satisfaction), which convey "how the conversation went". Combining all of these modalities in a cohesive framework provides a complete picture of a conversation.

In this thesis, we propose various novel methods to implement the modules of conversational speech understanding systems (CSUS) discussed above in highly naturalistic environments. Towards building a CSUS, we have developed:

  1. A novel deep learning based application specific Large Vocabulary Continuous Speech Recognition (LVCSR) system.
  2. A novel long term Speech Activity Detection (SAD) system using curriculum learning.
  3. A new audio keyword spotting based method for audio sentiment/opinion detection.
  4. A novel laughter and filler detection system.
  5. A new method to analyze and improve dirty transcripts in multichannel environments.

Using all of the above methods, together with a few established methods for parameters such as topic, turns taken, and word count, we have built a functional conversational speech understanding system for highly naturalistic environments.
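As a rough illustration of the simpler conversation-level parameters mentioned above (turns taken, word count, words per turn, term frequency, filler density), the following minimal sketch computes them from a speaker-labelled transcript. It is not the system developed in the thesis: the `(speaker, text)` input format, the `FILLERS` word list, and the function name `conversation_stats` are all assumptions made for this example, and real systems would work from recognizer and diarization output rather than clean text.

```python
from collections import Counter

# Hypothetical filler inventory, chosen only for illustration.
FILLERS = {"uh", "um", "er", "hmm"}


def conversation_stats(segments):
    """Compute per-speaker statistics from time-ordered (speaker, text) pairs.

    Consecutive segments by the same speaker are counted as one turn.
    """
    stats = {}
    previous_speaker = object()  # sentinel that never equals a real label
    for speaker, text in segments:
        # Lowercase and strip simple punctuation so "Uh," matches "uh".
        words = [w.strip(".,!?;:") for w in text.lower().split()]
        words = [w for w in words if w]
        entry = stats.setdefault(
            speaker,
            {"turns": 0, "words": 0, "fillers": 0, "term_freq": Counter()},
        )
        if speaker != previous_speaker:
            entry["turns"] += 1
            previous_speaker = speaker
        entry["words"] += len(words)
        entry["fillers"] += sum(1 for w in words if w in FILLERS)
        entry["term_freq"].update(words)
    for entry in stats.values():
        entry["words_per_turn"] = entry["words"] / entry["turns"]
        entry["filler_density"] = (
            entry["fillers"] / entry["words"] if entry["words"] else 0.0
        )
    return stats
```

Called on a list such as `[("A", "uh hello there"), ("B", "hello")]`, the function returns one dictionary per speaker label. Merging consecutive same-speaker segments into a single turn is a design choice of this sketch; other definitions of a turn are equally plausible.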

The NASA Apollo missions are among the greatest accomplishments of mankind. Every minute of conversation between astronauts, ground mission control, and the supporting scientific staff was recorded on 30-track analog tapes (resulting in more than 150,000 hours of audio data), making it one of the most valuable and natural sources of realistic data for analyzing conversations in a professional environment. We have developed a new 30-track analog tape decoder and digitized 19,000 hours of audio data from the Apollo 11, Apollo 13, and Gemini 8 missions, which are used in the development and evaluation of conversational speech understanding systems.



Conversation analysis, Automatic speech recognition, Computational linguistics, Laughter


©2018 The Author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.