Large Receptive Field Convolutional Neural Networks for Robust Speech Recognition







Despite significant efforts over the last few years to build robust automatic speech recognition (ASR) systems for different acoustic settings, the performance of current state-of-the-art technologies degrades significantly in noisy reverberant environments. Convolutional Neural Networks (CNNs) have been successfully used to achieve substantial improvements in many speech processing applications, including distant speech recognition (DSR). However, standard CNN architectures are not efficient at capturing long-term speech dynamics, which are essential to the design of a robust DSR system. In this thesis, we address this issue by investigating variants of large receptive field CNNs (LRF-CNNs), including deeply recursive networks, dilated convolutional neural networks, and stacked hourglass networks. To compare the efficacy of these architectures against the standard CNN on the Wall Street Journal (WSJ) corpus, we use a hybrid DNN-HMM speech recognition system. To evaluate system performance in reverberant conditions (the typical case for distant speech recognition), we test in both simulated and realistic reverberated environments. For the former, we use realistic room impulse responses (RIRs) to simulate reverberated versions of the clean channel; for realistic reverberation settings, we use the UTD-Distance corpus. Our experiments show that, with a fixed number of parameters across all architectures, the large receptive field networks yield consistent improvements over standard CNNs for both clean and distant speech. Among the explored LRF-CNNs, the stacked hourglass network achieves an 8.9% relative reduction in word error rate (WER) and a 10.7% relative improvement in frame accuracy over the standard CNN in the distant simulation setup. The stacked hourglass network also gives 13.68% and 12.90% relative WER reductions for microphones at 1 m and 3 m, respectively.
For microphones at 6 m, the recursive network achieves the largest WER gain, at 7.46%. This thesis is a study of a set of unsupervised techniques, realized through modifications to the acoustic modeling component of an HMM-based ASR engine, for robustness in reverberant environments. These techniques show consistent improvements in both simulated and realistic settings and demonstrate a promising track of research into alternative acoustic modeling structures.
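The simulated evaluation condition described above rests on a standard signal model: a reverberated recording is the convolution of the clean waveform with a room impulse response. A minimal sketch of this simulation step, using a synthetic decaying RIR in place of the measured RIRs the thesis uses (the function name and toy parameters are illustrative, not from the thesis):

```python
import numpy as np

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate a distant recording by convolving a clean waveform
    with a room impulse response, then peak-normalizing."""
    rev = np.convolve(clean, rir, mode="full")
    peak = np.max(np.abs(rev))
    return rev / peak if peak > 0 else rev

# Toy example: 1 s of noise at 16 kHz and a synthetic exponentially
# decaying RIR (a real setup would use measured RIRs).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
rir = np.exp(-np.linspace(0.0, 8.0, 4000)) * rng.standard_normal(4000)
rev = reverberate(clean, rir)
print(rev.shape)  # full convolution: 16000 + 4000 - 1 = (19999,)
```

In a real pipeline the reverberated waveforms would then be passed through the same feature extraction as the clean channel before acoustic model training.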
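The appeal of dilated convolutions among the LRF-CNN variants is that the receptive field grows with the dilation rate while the parameter count per layer stays fixed. A small sketch (assumed illustration, not the thesis code) comparing the one-dimensional receptive field of a stack of standard 3-tap convolutions against the same stack with dilation rates doubling per layer:

```python
def receptive_field(kernel_size: int, dilations) -> int:
    """Receptive field of a stack of 1-D convolutions:
    each layer adds (kernel_size - 1) * dilation input frames."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Same depth and same parameters per layer in both cases.
standard = receptive_field(3, [1, 1, 1, 1])  # dilation 1 everywhere
dilated = receptive_field(3, [1, 2, 4, 8])   # dilation doubles per layer
print(standard, dilated)  # 9 31
```

With dilations doubling per layer, the receptive field grows roughly exponentially with depth, which is what lets a fixed-size network capture the long-term speech dynamics that standard CNNs miss.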



Speech perception, Machine learning, Neural networks (Computer science)