Large Receptive Field Convolutional Neural Networks for Robust Speech Recognition

dc.contributor.advisor: Hansen, John H.L.
dc.creator: Jafarlou, Salar
dc.date.accessioned: 2021-12-17T16:09:38Z
dc.date.available: 2021-12-17T16:09:38Z
dc.date.created: 2020-12
dc.date.issued: 2020-08-18
dc.date.submitted: December 2020
dc.date.updated: 2021-12-17T16:09:40Z
dc.description.abstract: Despite significant efforts over the last few years to build robust automatic speech recognition (ASR) systems for different acoustic settings, the performance of current state-of-the-art technologies degrades significantly in noisy, reverberant environments. Convolutional neural networks (CNNs) have achieved substantial improvements in many speech processing applications, including distant speech recognition (DSR). However, standard CNN architectures are not efficient at capturing long-term speech dynamics, which are essential to the design of a robust DSR system. In this thesis, we address this issue by investigating variants of large receptive field CNNs (LRF-CNNs), including deeply recursive networks, dilated convolutional neural networks, and stacked hourglass networks. To compare the efficacy of these architectures with a standard CNN, we use a hybrid DNN-HMM speech recognition system trained on the Wall Street Journal (WSJ) corpus. We then evaluate the system in both simulated and realistic reverberant environments, the typical conditions for distant speech recognition. For the simulated setting, we use realistic room impulse responses (RIRs) to create reverberant versions of the clean channel; for the realistic setting, we use the UTD-Distance corpus. Our experiments show that, with a fixed number of parameters across all architectures, the large receptive field networks consistently outperform the standard CNNs on both clean and distant speech. Among the explored LRF-CNNs, the stacked hourglass network achieves an 8.9% relative reduction in word error rate (WER) and a 10.7% relative improvement in frame accuracy over the standard CNNs in the simulated distant setup. The stacked hourglass network also yields 13.68% and 12.90% relative WER reductions for microphones at 1 m and 3 m, respectively; for microphones at 6 m, the recursive networks give the largest gain, a 7.46% relative WER reduction. This thesis studies a set of unsupervised techniques, realized through modifications to the acoustic modeling component of an HMM-based ASR engine, for robustness in reverberant environments. These techniques show consistent improvements in both simulated and realistic settings and demonstrate a line of research into alternative acoustic modeling structures.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10735.1/9362
dc.language.iso: en
dc.subject: Speech perception
dc.subject: Machine learning
dc.subject: Neural networks (Computer science)
dc.title: Large Receptive Field Convolutional Neural Networks for Robust Speech Recognition
dc.type: Thesis
dc.type.material: text
thesis.degree.department: Electrical Engineering
thesis.degree.grantor: The University of Texas at Dallas
thesis.degree.level: Masters
thesis.degree.name: MSEE
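The abstract's premise is that dilated convolutions enlarge a CNN's temporal receptive field without adding parameters. The standard stride-1 calculation below illustrates this; it is a generic sketch, not code from the thesis, and the kernel size and dilation schedule are illustrative assumptions:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input frames) of a stack of stride-1 1-D convolutions.

    Each layer i widens the field by (kernel_size - 1) * dilations[i].
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Four standard conv layers (dilation 1 everywhere): linear growth in depth.
standard = receptive_field(3, [1, 1, 1, 1])  # -> 9 frames

# Four layers with exponentially increasing dilation: same parameter count,
# but the receptive field grows exponentially with depth.
dilated = receptive_field(3, [1, 2, 4, 8])   # -> 31 frames

print(standard, dilated)
```

With identical parameter budgets (four 3-tap kernels each), the dilated stack covers 31 frames against 9 for the standard stack, which is the kind of long-term context the abstract argues a robust DSR front end needs.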

Files

Original bundle

Name: JAFARLOU-THESIS-2020.pdf
Size: 6.67 MB
Format: Adobe Portable Document Format
