Robust acoustic modeling and front-end design for distant speech recognition
MetadataShow full item record
In recent years, there has been a significant increase in the popularity of voice-enabled technologies which use human speech as the primary interface with machines. Recent advancements in acoustic modeling and feature design have increased the accuracy of Automatic Speech Recognition (ASR) to levels that enable voice interfaces to be used in many applications. However, much of the current performance is dependent on the use of close-talking microphones, (i.e., scenarios in which the user speaks directly into a hand-held or body-worn microphone). There is still a rather large performance gap experienced in distant-talking scenarios in which speech is recorded by far-field microphones that are placed at a distance from the speaker. In such scenarios, the distorting effects of distance (such as room reverberation and environment noise) make the recognition task significantly more challenging. In this dissertation, we propose novel approaches for designing a distant-talking ASR front-end as well as training robust acoustic models to reduce the existing gap between far-field and close-talking ASR performance. Specifically, we i) propose a novel multi-channel front-end enhancement algorithm for improved ASR in reverberant rooms using distributed non-uniform microphone arrays with random unknown locations; ii) propose a novel neural network model training approach using adversarial training to improve the robustness of multi-condition acoustic models that are trained directly on far-field data; iii) study alternate neural network adaptation strategies for far-field adaptation to the acoustic properties of specific target environments. Experimental results are provided based on far-field benchmark tasks and datasets which demonstrate the effectiveness of the proposed approaches for increasing far-field robustness in ASR. Based on experiments using reverberated TIMIT sentences, the proposed multi-channel front-end provides WER improvements of +21.5% and +37.7% in two-channel and four-channel scenarios over a single-channel scenario in which the channel with best signal quality is selected. On the acoustic modeling side and based on results of experiments on AMI corpus, the proposed multi-domain training approach provides a relative character error rate reduction of +3.3% with respect to a conventional multi-condition trained baseline, and +25.4% with respect to a clean-trained baseline.