Deep Learning Strategies for Monaural Speech Enhancement in Reverberant Environments
Abstract
In naturalistic settings, reverberation and background noise distort speech captured by distant microphones, reducing both its quality and its intelligibility. Distant speech processing is gaining recognition as a critical component of robust human-machine interaction that does not require users to wear devices. Many back-end speech applications, including speech recognition, speaker identification, speaker verification, and speaker emotion/stress detection, benefit from a front-end speech enhancement system. A variety of single- and multi-microphone systems have been developed to address this difficult task. However, as devices become smaller, more mobile, and smarter, there has been a strong push to maintain their performance with fewer microphones.

The focus of this dissertation is to develop systems for enhancing reverberant and noisy speech signals, with the goal of improving human communication and human-machine interaction. We design real-valued, complex-valued, generative, and adaptive neural networks for dereverberation of speech captured by a single microphone. This dissertation advances robust front-end processing with the following contributions:

(i) an unsupervised, signal-processing-based speech activity detection (SAD) solution with dereverberation;
(ii) a supervised fully convolutional deep neural network (SkipConvNet) that enhances the magnitude spectrum of reverberant speech and reuses the observed phase to synthesize enhanced speech (a minimal sketch of this pipeline follows the abstract);
(iii) a supervised deep complex-valued network with self-attention adapted to the complex domain (FCSA), which simultaneously enhances the magnitude and phase of reverberant speech (see the second sketch below);
(iv) a generative adversarial complex-valued deep neural network (SkipConvGAN) that regenerates the formant structure lost to reverberation; and
(v) an inference-adaptive neural network that adapts its processing to the level of distortion present in a given speech utterance.

The proposed systems are evaluated on the REVERB challenge corpus and compared against signal processing and deep learning approaches previously proposed for speech dereverberation. We measure speech quality using cepstral distance (CD), signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), and the speech-to-reverberation modulation energy ratio (SRMR); an illustrative PESQ snippet also follows the abstract. The experimental results demonstrate that the proposed networks consistently outperform several previously proposed systems in overall speech quality. These solutions make important strides toward addressing the challenges of distant speech capture, including distortions due to reverberation and background noise, and addressing these distortions opens opportunities to advance subsequent speech technologies in naturalistic environments.
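To make the magnitude/phase split in contribution (ii) concrete, the sketch below shows the generic enhance-magnitude, reuse-phase pipeline in PyTorch. It is illustrative only: `model` stands in for the actual SkipConvNet (not reproduced here) and is assumed to map a log-magnitude spectrogram, shaped (batch, channel, freq, time), to an enhanced log-magnitude of the same shape.

```python
import torch

def enhance_with_phase_reuse(wave, model, n_fft=512, hop=128):
    """Enhance the STFT magnitude with a neural network and reuse the
    reverberant phase to resynthesize the waveform (SkipConvNet-style)."""
    window = torch.hann_window(n_fft)
    # Complex STFT of the reverberant input: (freq, frames)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mag, phase = spec.abs(), spec.angle()

    # Enhance the magnitude only; the network never sees the phase.
    log_mag = torch.log1p(mag).unsqueeze(0).unsqueeze(0)  # (1, 1, F, T)
    enhanced_log_mag = model(log_mag).squeeze(0).squeeze(0)
    enhanced_mag = torch.expm1(enhanced_log_mag).clamp(min=0.0)

    # Recombine the enhanced magnitude with the original reverberant phase.
    enhanced_spec = torch.polar(enhanced_mag, phase)
    return torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window)
```

The key design point this illustrates is that phase is copied through untouched, so any residual phase distortion caused by reverberation remains in the output; this is the limitation that motivates the complex-domain systems in contributions (iii) and (iv).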
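Contribution (iii) instead operates on the complex spectrum, so magnitude and phase are modified jointly. A standard building block for such networks, sketched below, is a complex convolution realized with two real-valued convolutions; this is a generic construction, not the specific FCSA architecture, whose complex self-attention blocks are not reproduced here.

```python
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex 2-D convolution built from two real convolutions.

    For input z = x + iy and kernel w = a + ib,
    w * z = (a*x - b*y) + i(a*y + b*x),
    the building block of complex-valued enhancement networks that
    modify magnitude and phase together.
    """
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.re = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)  # a
        self.im = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)  # b

    def forward(self, x_re, x_im):
        out_re = self.re(x_re) - self.im(x_im)  # a*x - b*y
        out_im = self.re(x_im) + self.im(x_re)  # a*y + b*x
        return out_re, out_im
```

Stacking such layers (with complex-domain self-attention in between) yields a network whose output can serve as a complex ratio mask; multiplying that mask with the input spectrum rotates the phase as well as rescaling the magnitude, which is how both are enhanced simultaneously.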
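For the intrusive quality metrics, off-the-shelf reference implementations exist. The snippet below shows how PESQ might be computed with the `pesq` package from PyPI; the file names are placeholders, and the clean reference is assumed to be available and time-aligned with the enhanced output.

```python
import soundfile as sf
from pesq import pesq

# Placeholder paths: a clean reference and the dereverberated output.
ref, fs = sf.read("clean.wav")
deg, _ = sf.read("enhanced.wav")

# Wide-band PESQ (ITU-T P.862.2) expects 16 kHz signals.
score = pesq(fs, ref, deg, "wb")
print(f"PESQ: {score:.2f}")
```

SRMR, by contrast, is non-intrusive (it needs no clean reference), which is why it is commonly reported for real recordings in the REVERB challenge corpus where no clean signal exists.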