Large Receptive Field Convolutional Neural Networks for Robust Speech Recognition

dc.contributor.advisor: Hansen, John H.L.
dc.creator: Jafarlou, Salar
dc.date.accessioned: 2021-12-17T16:09:38Z
dc.date.available: 2021-12-17T16:09:38Z
dc.date.created: 2020-12
dc.date.issued: 2020-08-18
dc.date.submitted: December 2020
dc.date.updated: 2021-12-17T16:09:40Z
dc.description.abstract: Despite significant efforts over the last few years to build robust automatic speech recognition (ASR) systems for different acoustic settings, the performance of current state-of-the-art technologies degrades significantly in noisy, reverberant environments. Convolutional neural networks (CNNs) have achieved substantial improvements in many speech processing applications, including distant speech recognition (DSR). However, standard CNN architectures are not efficient at capturing long-term speech dynamics, which are essential to the design of a robust DSR system. In this thesis, we address this issue by investigating variants of large receptive field CNNs (LRF-CNNs), including deeply recursive networks, dilated convolutional neural networks, and stacked hourglass networks. To compare the efficacy of these architectures with a standard CNN, we use a hybrid DNN-HMM speech recognition system trained on the Wall Street Journal (WSJ) corpus. We then evaluate the system in both simulated and realistic reverberant environments, the typical conditions for distant speech recognition. For the simulated setting, we use realistic room impulse responses (RIRs) to create reverberant versions of the clean channel; for the realistic setting, we use the UTD-Distance corpus. Our experiments show that, with a fixed number of parameters across all architectures, the large receptive field networks consistently outperform the standard CNNs on both clean and distant speech. Among the explored LRF-CNNs, the stacked hourglass network achieves an 8.9% relative reduction in word error rate (WER) and a 10.7% relative improvement in frame accuracy over the standard CNNs in the simulated distant setup. The stacked hourglass network also yields 13.68% and 12.90% relative WER reductions for microphones at 1 m and 3 m, respectively; for microphones at 6 m, the recursive networks give the largest gain, a 7.46% relative WER reduction. This thesis studies a set of unsupervised techniques, realized through modifications to the acoustic modeling component of an HMM-based ASR engine, for robustness in reverberant environments. These techniques show consistent improvements in both simulated and realistic settings and demonstrate a line of research into alternative acoustic modeling structures.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10735.1/9362
dc.language.iso: en
dc.subject: Speech perception
dc.subject: Machine learning
dc.subject: Neural networks (Computer science)
dc.title: Large Receptive Field Convolutional Neural Networks for Robust Speech Recognition
dc.type: Thesis
dc.type.material: text
thesis.degree.department: Electrical Engineering
thesis.degree.grantor: The University of Texas at Dallas
thesis.degree.level: Masters
thesis.degree.name: MSEE
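The abstract's premise is that dilated convolutions enlarge a CNN's temporal receptive field without adding parameters. The standard stride-1 calculation below illustrates this; it is a generic sketch, not code from the thesis, and the kernel size and dilation schedule are illustrative assumptions:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input frames) of a stack of stride-1 1-D convolutions.

    Each layer i widens the field by (kernel_size - 1) * dilations[i].
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Four standard conv layers (dilation 1 everywhere): linear growth in depth.
standard = receptive_field(3, [1, 1, 1, 1])  # -> 9 frames

# Four layers with exponentially increasing dilation: same parameter count,
# but the receptive field grows exponentially with depth.
dilated = receptive_field(3, [1, 2, 4, 8])   # -> 31 frames

print(standard, dilated)
```

With identical parameter budgets (four 3-tap kernels each), the dilated stack covers 31 frames against 9 for the standard stack, which is the kind of long-term context the abstract argues a robust DSR front end needs.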

Files

Original bundle

Name: JAFARLOU-THESIS-2020.pdf
Size: 6.67 MB
Format: Adobe Portable Document Format
