Browsing by Author "Assmann, Peter F."
Now showing 1 - 14 of 14
Item: Advances in Methodologies Using EEG to Characterize the Cortical Processing of Speech and Its Perceived Sound Quality (2022-12-01)
Raghavendra, Shruthi; Tan, Chin-Tuan; Hansen, John H. L.; Marcus, Andrian; Martin, Brett A.; Nourani, Mehrdad; Assmann, Peter F.
Speech perception depends on access to the amplitude, spectral, and temporal information in speech. This dissertation focuses on the temporal structure of speech, which consists of a slowly varying amplitude (temporal envelope, ENV) and a rapidly varying frequency (temporal fine structure, TFS). Past studies on speech perception [for review, see Lorenzi and Moore 2008] suggest that ENV alone is sufficient for speech perception in quiet, while TFS is used to segregate speech from background noise (e.g., a competing-talker scenario). Reductions in subjective quality ratings obtained through behavioral quality assessment have been shown to correlate with the degree of degradation in the temporal envelope. However, the neural correlates of sound quality perception with continuous speech remain unclear. This dissertation pursues two complementary research goals concerning speech perception as it relates to ENV and TFS and to perceived sound quality. The dissertation comprises two studies: Study 1 characterizes the cortical processing of speech, and Study 2 characterizes perceived sound quality in normal-hearing listeners. Chapter 1 provides an overall introduction to both studies, and Chapter 2 lays out their background in detail. Chapter 3 presents Study 1 of this dissertation, which investigates the role and relative contribution of ENV and TFS to speech perception in normal-hearing listeners in quiet.
The synchronization between brain oscillations in different frequency bands is commonly used as a marker of the key mechanisms coordinating neural dynamics across temporal and spatial domains [Canolty and Knight 2010]. When neural oscillations in two different frequency bands synchronize, their "peak" frequencies usually exhibit a harmonic relationship. A recent study [Rodriguez and Alaerts 2019] showed a prominent 2:1 harmonic cross-frequency relationship between alpha (8-14 Hz) and theta (4-8 Hz) rhythms when task-relevant, efficient cognitive processing is engaged. Study 1 examined this power-power cross-frequency coupling (CFC) between the alpha and theta bands, and also between the gamma (30-100 Hz) and theta bands, of cortical activity in normal-hearing listeners using electroencephalography (EEG) signals recorded while processing the ENV and TFS of speech. The results showed relatively increased CFC when listening to ENV alone, suggesting greater synchrony across frequency bands of cortical activity when processing ENV than TFS. Recent studies have shown that cortical activity tracks the envelope of continuous natural speech, which could serve as a useful method for studying the processes underlying speech perception. Study 2 of the dissertation, presented in Chapters 4 and 5, investigates differences in cortical entrainment to the envelope of speech spoken by cochlear implant (CI) talkers (degraded speech) and normal-hearing (NH) talkers. Although a CI may help individuals with hearing loss restore or improve the ability to hear and provide the auditory feedback necessary for improved speech production, speech produced by CI users often deviates from that of normal-hearing individuals (Gautam et al., 2019). The motivation is to develop a metric to assess "how well" hard-of-hearing talkers speak given the auditory feedback they receive from their current aural compensation.
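A minimal sketch of the power-power CFC measure described for Study 1, computed here as the correlation between band-limited power envelopes. The band edges follow the abstract (theta 4-8 Hz, alpha 8-14 Hz), but the toy signal, filter order, and choice of Pearson correlation are illustrative assumptions, not the dissertation's exact analysis pipeline:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_power(x, fs, lo, hi, order=4):
    """Bandpass-filter x and return its instantaneous power envelope."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, x))) ** 2

def power_power_cfc(x, fs, band1=(4.0, 8.0), band2=(8.0, 14.0)):
    """Pearson correlation between the power envelopes of two bands."""
    return np.corrcoef(band_power(x, fs, *band1),
                       band_power(x, fs, *band2))[0, 1]

fs = 256  # Hz, a typical EEG sampling rate
t = np.arange(0, 10, 1 / fs)
# Toy "EEG": a 6 Hz theta and a 10 Hz alpha component sharing one slow
# amplitude modulation, so their power envelopes should covary strongly.
am = 1 + 0.5 * np.sin(2 * np.pi * 0.5 * t)
eeg = am * np.sin(2 * np.pi * 6 * t) + am * np.sin(2 * np.pi * 10 * t)
cfc = power_power_cfc(eeg, fs)
```

Because both bands inherit the same slow amplitude modulation in this toy signal, the coupling value comes out strongly positive; uncorrelated band envelopes would drive it toward zero.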
The results showed higher perceived sound quality and closer tracking of the speech envelope when normal-hearing listeners heard speech produced by NH talkers than by CI talkers. Finally, Chapter 6 presents overall conclusions, the contributions of the dissertation, and a discussion of possible directions for future work. The two key research aims, pertaining to Study 1 and Study 2 respectively, were (1) to examine the brain electrical activity and brain networks underlying the perception of ENV and TFS information as compared to processing the original speech itself, thereby investigating the relative roles of ENV and TFS in speech perception in normal-hearing listeners; and (2) to determine how well the envelope of speech is represented neurophysiologically by objectively quantifying the cortical tracking of the speech envelope, and to show how this cortical tracking differentiates speech produced by CI talkers from that produced by NH talkers in relation to its perceived sound quality. Together, the findings from Study 1 and Study 2 provide insight into the neural mechanisms involved in the cortical processing of the ENV and TFS of continuous speech and its perceived sound quality in normal-hearing listeners.

Item: Deep Convolutional Neural Network Encoding of Face Shape and Reflectance in Synthetic Face Images (August 2023)
Hill, Matthew Q.; O'Toole, Alice J.; Rugg, Michael; Assmann, Peter F.; Golden, Richard M.; Castillo, Carlos D.
Deep Convolutional Neural Networks (DCNNs) trained for face identification recognize faces across a wide range of imaging and appearance variations, including illumination, viewpoint, and expression. In the first part of this dissertation, I showed that identity-trained DCNNs retain non-identity information in their top-level face representations, and that this information is hierarchically organized within that representation (Hill et al., 2019).
Specifically, the similarity space separated into two large clusters by gender; identities formed sub-clusters within gender, illumination conditions clustered within identity, and viewpoints clustered within illumination conditions. In the second part of this dissertation, I further examined the representations generated by face identification DCNNs by separating face identity into its constituent signals of "shape" and "reflectance". Object classification DCNNs demonstrate a bias for "texture" over "shape" information, whereas humans show the opposite bias (Geirhos et al., 2018). No studies comparing "shape" and "texture" information had yet been performed on DCNNs trained for face identification. Here, I used a 3D Morphable Model (3DMM; Li, Bolkart, Black, Li, and Romero 2017) to determine the extent to which face identification DCNNs encode the shape and/or spectral reflectance information in a face. I also investigated the presence of illumination, expression, and viewpoint information in the top-level representations of face images generated by DCNNs. Synthetic face stimuli were generated using a 3DMM with separate components for a face shape's "identity" and "facial expression", as well as spectral reflectance information in the form of a "texture map". The dataset comprised ten randomized levels each of face shape, reflectance, and expression, with three levels of illumination (spotlight, ambient, 3-point), three levels of viewpoint pitch (-30°, 0°, 30°), and five levels of viewpoint yaw (0°, 15°, 30°, 45°, 60°) in a complete factorial design, for a total of 45,000 images.
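As a quick check of the stimulus count, the complete factorial design described above can be enumerated directly (the factor names here are shorthand labels, not identifiers from the dissertation):

```python
from itertools import product

# Factor levels from the stimulus description above.
levels = {
    "face_shape": 10, "reflectance": 10, "expression": 10,
    "illumination": 3, "pitch": 3, "yaw": 5,
}
# A complete factorial design crosses every level of every factor.
design = list(product(*(range(n) for n in levels.values())))
n_images = len(design)  # 10 * 10 * 10 * 3 * 3 * 5 = 45,000
```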
All analyses were conducted with an Inception-ResNet-V1-based network (Szegedy, Ioffe, Vanhoucke, & Alemi, 2017) trained on the VGGFace2 dataset (Cao, Shen, Xie, Parkhi, & Zisserman, 2018) and replicated with a ResNet-101-based network (He, Zhang, Ren, & Sun, 2016) trained on the University of Maryland's Universe dataset (Bansal, Castillo, Ranjan, & Chellappa, 2017; Bansal, Nanduri, Castillo, Ranjan, & Chellappa, 2017; Guo, Zhang, Hu, He, & Gao, 2016). Area under the receiver operating characteristic curve (AUC) was used as a measure of the information about each variable in the top-level representation, and t-distributed stochastic neighbor embedding (Van der Maaten & Hinton, 2008) was used to visualize the similarity space of top-level representations. The results showed that both shape and reflectance information were encoded in the top-level representation, and both signals were required for optimal performance. Shape-reflectance bias was mediated by illumination: the network showed a reflectance bias in ambient and 3-point (photography-style) illumination environments, whereas no bias was found under spotlight illumination. Consistent with Hill et al. (2019), we found information about all non-identity variables (illumination, expression, pitch, yaw) in the top-level representation, although each of these signals was weakly encoded.

Item: Evaluation of Consonant Error Patterns in Speech Perception and Production of Pediatric Cochlear Implant Users (2020-12-01)
Peskova, Olga; Assmann, Peter F.; Abdi, Herve; Geers, Ann E.; Goffman, Lisa A.; Warner-Czyz, Andrea D.; Katz, William F.
This dissertation uses the framework of the speech chain to examine the associations between perception/production and the associated error patterns in children using cochlear implants (CIs), children with normal hearing (NH), and children with NH listening to vocoder simulations (NHV). Chapter 1 introduces background information on the populations represented by the three groups.
This chapter focuses on typical perception and production development and introduces the challenges of evaluating perception and production patterns in children with hearing loss (HL). Chapter 2 examines the associations between consonant perception and production error patterns and speech intelligibility in two groups of children using CIs: one implanted at early ages with newer CI technologies, and the other implanted at later ages with older CI technologies. Data from Chapter 2 help establish the methodological basis for the new database described in Chapter 3. Chapter 3 examines perception and production error patterns in children using CIs who were implanted after 2010 with newer CI technologies, in comparison to the NH and NHV control groups. Chapter 4 provides a general discussion relating the findings of these three studies. Results from Chapter 2 indicate lower speech perception scores, with more numerous and more variable errors, in the CI group implanted with older technologies compared to the group implanted with newer technologies. Methodological limitations in the data presented in Chapter 2 did not permit a direct comparison of errors in consonant production and perception. The testing procedures developed in Chapter 3, which allowed such comparisons, showed that production and perception error patterns generally did not mirror one another. Although no overall differences in mean error rate were observed between the CI and NH groups, the error patterns for individual consonants differed between groups. Perception performance of children in the NHV group was worse than in the CI group, suggesting that caution is needed when interpreting CI vocoder simulation studies in children.
The results provide important clinical information, suggesting that intervention for children using CIs needs to consider how speech perception confusions interact with production errors in order to develop more effective techniques for including them in clinical protocols.

Item: Investigating Misconception Resolution and Learning From Scientific Texts Using a Cognitive Diagnostic Model (December 2023)
Pattisapu, Candice; Assmann, Peter F.; Golden, Richard M.; Brown, Matthew; Ackerman, Robert; Krawczyk, Daniel
Research shows that refutation texts are more effective than expository texts at helping students resolve science misconceptions. However, little work has been conducted on the effects of related thought-experiment discourse structures. To empirically investigate the effects of thought experiments on student science learning, high and low prior knowledge college students read refutation, expository, and thought-experiment physics texts. Their understanding was assessed using the Force Concept Inventory (FCI). Differences were evaluated using correct and misconception responses in ANOVA analyses. In addition, FCI response data were analyzed using a novel cognitive diagnostic model of skills and misconceptions. Effects of prior knowledge on learning were observed only in the ANOVA analyses. However, the item-specific probability parameters from the cognitive diagnostic model indicated that both high and low prior knowledge students learned more from thought-experiment texts than from expository texts for one of the three topics presented. These parameters also indicated that, for this topic, misconception possession was more prevalent among low prior knowledge participants reading thought-experiment texts.
Person-specific skill mastery probabilities indicated that high and low prior knowledge students learned the least from expository texts for a different topic, but that there was little difference in misconception resolution between discourse structures. The implications of these results for improving student learning using various discourse structures are discussed.

Item: Melody Recognition, Repeated Exposure, and Timbre Change (May 2023)
Gryder, Kieth Douglas; Dowling, W. Jay; Kroener, Sven; Warner-Czyz, Andrea; Assmann, Peter F.; Abdi, Herve
Previous research showed that people are less likely to recognize a recently heard melody if it has changed timbre. This dissertation investigates how changing timbre influences melody recognition. Participants in Experiment 1 (N = 33) sorted a series of timbres into groups. The data were analyzed with DiSTATIS, and five perceptually distinct timbres were derived for use in Experiment 2. In Experiment 2, participants (N = 145) heard a series of melodies repeatedly over five sessions. Participants were assigned to one of three conditions: one in which the timbre of the melodies never changed, one in which timbres changed once (at either Session 3 or Session 5), and one in which timbre changed every session. Participants rated melodies on a 4-point confidence scale of recognition. Results suggest that, regardless of the rate of timbre change, recognition improves over multiple exposures. Exploratory analyses suggest that high levels of music training may significantly improve performance and that different kinds of melody recognition are influenced differently by timbre change.

Item: Modeling the Perception of Children's Age from Speech Acoustics (Acoustical Society of America)
Barreda, S.; Assmann, Peter F.
Adult listeners were presented with /hVd/ syllables spoken by boys and girls ranging from 5 to 18 years of age.
Half of the listeners were informed of the sex of the speaker; the other half were not. Results indicate that veridical age in children can be predicted accurately from the acoustic characteristics of the talker's voice, and that listener behavior is highly predictable on the basis of speech acoustics. Furthermore, listeners appear to incorporate assumptions about talker sex into their estimates of talker age, even when information about the talker's sex is not explicitly provided. © 2018 Acoustical Society of America.

Item: Multidimensional Evaluation of Daily Device Use, Communication, and the Family System in Pediatric Cochlear Implant Users (2020-12-01)
Wiseman, Kathryn Beverly; Warner-Czyz, Andrea; Rugg, Michael; Assmann, Peter F.; Nelson, Jackie; Walker, Elizabeth
This dissertation investigates samples of children with hearing loss who use cochlear implants (CIs) through a family-systems lens. The aim is to understand the impact of pediatric hearing loss on the parents and siblings of the affected child, and the impact that families have on their children (by way of facilitating daily device use). A series of interconnected manuscripts centered on parents, siblings, and children with CIs combine to illuminate the multidimensional and bidirectional effects of pediatric hearing loss and the family system. Chapters 2 and 3 address the experiences of parents and siblings, respectively, of school-age and adolescent children with CIs, to understand the impact of hearing loss on other members of the family. Chapter 2 (Study 1) compares general and condition-specific stress (via the Family Stress Scale) in 31 parents of CI users (8-16 years) to previously published samples of children with hearing loss, finding similarities and differences across samples. Child temperament significantly predicted parental stress after controlling for other variables.
Chapter 3 (Study 2) examines quantitative and qualitative perspectives of 36 children and adolescents with typical hearing (TH; age 6-17 years) who have a sibling with CIs (age 7-17). Quantitative results indicated that siblings with TH express positive perspectives of their brother or sister with CIs and report that having a CI user in the family does not affect them much, particularly if the CI user has good speech understanding and intelligibility. Qualitative responses diverged from the quantitative data, with siblings expressing more negative feelings surrounding differential attention from parents and the CI user's social communication skills. Chapter 4 (Study 3) considers an aspect of parental involvement, daily device use, in 65 young children with CIs (< 5 years), exploring its impact on emerging communication skills. Results of this retrospective chart review indicate better early auditory skills, speech recognition in quiet, and expressive/receptive language outcomes in children who wear their CIs more hours per day on average. Chapter 5 provides pilot data for a prospective examination of daily device use, family-related variables (e.g., parental involvement, social support, family hardiness), and communication outcomes (i.e., speech recognition, spoken language) in a small group of young children who predominantly use CIs (n = 9, age < 6 years), representing the intersection of the topics in Studies 1-3. These data demonstrate variability in daily device use and communication outcomes but little difference in family variables (e.g., low parental stress, high parental involvement), revealing both the feasibility of and challenges to data collection moving forward.
These studies collectively highlight the importance of considering the entirety of the family system in cases of pediatric hearing loss, both to maximize communication outcomes in children with hearing loss and to optimize well-being in all members of the family.

Item: Perceiving Foreign-Accented Speech with Decreased Spectral Resolution in Single- and Multiple-Talker Conditions (Acoustical Society of America)
Kapolowicz, Michelle R.; Montazeri, Vahid; Assmann, Peter F.
To determine the effect of reduced spectral resolution on the intelligibility of foreign-accented speech, vocoder-processed sentences from native and Mandarin-accented English talkers were presented to listeners in single- and multiple-talker conditions. Reduced spectral resolution had little effect on native speech but lowered performance for foreign-accented speech, with a further decrease in multiple-talker conditions. Following initial exposure, foreign-accented speech with reduced spectral resolution was less intelligible than unprocessed speech in both single- and multiple-talker conditions. Intelligibility improved with extended exposure, but only in single-talker conditions. The results indicate a perceptual impairment when perceiving foreign-accented speech with reduced spectral resolution.

Item: Perception of Novel Sounds in the Presence of Background Noise: Comparison Between Individuals with Normal Hearing, Cochlear Implant Users, and Recurrent Neural Networks (2019-01-30)
Montazeri, Vahid; Assmann, Peter F.
The goal of this dissertation is to investigate how listeners and learning machines cope with the ambiguity caused by multiple interfering novel sound sources. Starting from an ambiguous auditory scene with competing sound sources, this dissertation investigates how a particular sound source draws listeners' attention while the remaining sources lose their salience and become background (noise).
Listeners' perception of competing novel sounds is investigated in a series of experiments that varied in listening conditions, simulating the difficulties experienced by hearing-impaired individuals in noise. Chapter 1 reviews the mechanisms behind listeners' perception of speech in the presence of competing sounds. Chapter 2 describes three experiments that investigated the recognition of novel sounds in the presence of background noise. The chapter begins with a replication of a previous study, providing evidence that listeners can segregate a novel target sound from a competing distractor only if the target repeats across different distractors. A subsequent experiment tested the hypothesis that listeners' ability to detect change in a sound depends on their knowledge of its source, which is gained through repetition. It is concluded that listeners are able to perceptually learn patterns of the repeating target while suppressing changes in the masker stream. Two neural network architectures previously employed to study mechanisms of learning, generalized Hebbian and anti-Hebbian, are evaluated; the generalized Hebbian learning network is shown to produce results similar to those obtained from the listeners. Experiments in Chapter 3 provide evidence that recognition of a novel target sound becomes robust against new (unheard) distractors when listeners go through an exposure stage in which the target is presented repeatedly across multiple distractors. Chapter 3 concludes with Experiments 3-2 and 3-3, which investigated recognition of consonant-vowel-consonant-vowel (CVCV) words in the presence of novel distractors. Experiment 3-2 showed that, after exposing listeners to target tokens across multiple distractors, the process of learning new CVCV tokens shifts from context-specificity to an adaptation-plus-prototype mechanism.
The goal of Experiment 3-3 was to investigate whether cochlear implant users, who have limited spectral resolution, would show the same behavior as the listeners with normal hearing in Experiment 3-2. The main goal of Chapter 4 is to investigate the extent to which the findings of Experiment 3-2 can be replicated by recurrent neural networks (RNNs). The chapter begins with a brief introduction to RNNs and long short-term memory networks (LSTMs). In Experiment 4-1, a recurrent LSTM autoencoder was trained to reconstruct an input CVCV target mixed with a distractor, with or without a context sequence presented before the input. The network reconstructed the input more accurately when the context sequence contained the CVCV target repeating across multiple distractors. Furthermore, paralleling the findings of Experiment 3-2, the presence of such a context sequence improved the network's generalization to unseen data (novel distractors). Experiment 4-2 showed that the presence of the context sequence led to an improved semi-supervised speech enhancement algorithm that recovered the target CVCV tokens while suppressing the distractors.

Item: Production and Perception of Affective Prosody by Adults with Autism Spectrum Disorder (2016-12)
Hubbard, Daniel James; Assmann, Peter F.
Affective prosody, defined as the use of paralinguistic elements in speech to convey emotion, is important for effective social functioning. While generally a trivial task for typically developing (TD) adults, individuals with autism spectrum disorder (ASD) face significant challenges in social communication and interaction, including prosody. Previous research has shown that talkers with ASD produce pragmatic prosody with increased variability in fundamental frequency (f0, which is closely correlated with voice pitch), but it was unclear whether those differences carry over to speaking tasks involving emotion elicitation.
A controlled set of expressive speech recordings was obtained from talkers with ASD and controls in five emotion contexts: angry, happy, interested, sad, and neutral. Emotion-specific group differences in f0, intensity, and duration were found in multiple speech types, and the pattern of results was characterized by inconsistent and exaggerated affective prosody production in talkers with ASD compared to controls. The perceptual relevance of these acoustic differences was tested in three listening experiments involving talkers and listeners with ASD and controls. The first two experiments, involving TD listeners, examined the perceptual impact of the increased f0 variability and intensity found in recordings produced by talkers with ASD. Compared to the intensity manipulation, modifying the f0 contour had a larger impact on emotion recognition accuracy. The third experiment compared perception of affective prosody in listeners with ASD and controls using unmodified stimuli, and revealed that differences in affective prosody perception were more closely related to talker-group production differences than to listener-group differences. The results are consistent with previous work in face perception showing increased emotion identification rates but lower naturalness ratings when listeners responded to stimuli produced by individuals with ASD. The findings are interpreted within the context of the speech attunement framework, which suggests that individuals with ASD lack the motivation to attune their prosody to sound like TD talkers.

Item: Speech Analysis and Single Channel Enhancement to Improve Speech Intelligibility for Cochlear Implant Recipients (2017-05)
Wang, Dongmei; Hansen, John H. L.; Panahi, Issa M. S.; Busso, Carlos A.; Assmann, Peter F.
Cochlear implant (CI) devices help deaf individuals recover hearing ability through electrode arrays surgically inserted into the inner ear, which stimulate the auditory nerve and transmit sound to the auditory cortex in the brain. CI listeners achieve high speech intelligibility in quiet environments, but their speech perception degrades dramatically in noisy backgrounds. This is especially true in fluctuating noise, such as competing-speaker or babble noise, where CI users have difficulty understanding speech. One reason is that the low spectral resolution provided by CI encoding strategies is insufficient to distinguish speech components from noise. In this dissertation, we propose a new speech enhancement solution to improve speech intelligibility for CI recipients in noise. Speech energy is carried primarily in the harmonic structure located at integer multiples of the fundamental frequency. To combat noise, we propose using this harmonic structure as a frequency-domain cue for estimating the degrading noise. The proposed speech enhancement is based on harmonic structure estimation combined with a traditional statistically based solution. This dissertation investigates robust fundamental frequency estimation in noise, along with integrating this information to improve harmonic-based speech enhancement in both stationary and non-stationary noise scenarios. Noise-robust pitch estimation is proposed based on temporal harmonic structure in local time-frequency (TF) segments. To reduce the effect of noise, we exploit the sparsity of the speech signal: only high signal-to-noise ratio (SNR) TF segments are used for pitch estimation. Robust harmonic features are investigated for neural network classification-based pitch estimation. The harmonic features map the pitch candidates into a more separable space for classification.
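The dissertation's method uses harmonic features and a neural network classifier. As a much simpler baseline illustrating the underlying idea (a voiced frame's periodicity shows up as an autocorrelation peak at the pitch period), here is a toy autocorrelation pitch estimator; the frame length, search range, and synthetic frame are illustrative assumptions, not the proposed algorithm:

```python
import numpy as np

def estimate_f0(frame, fs, f_min=60.0, f_max=400.0):
    """Return an f0 estimate (Hz) from the autocorrelation peak lag."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f_max), int(fs / f_min)  # lag search range (samples)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return fs / lag

fs = 16000
t = np.arange(0, 0.04, 1 / fs)  # one 40 ms analysis frame
f0_true = 220.0
# Harmonic complex (fundamental plus two harmonics) with mild noise,
# standing in for a voiced speech frame.
frame = sum(np.sin(2 * np.pi * k * f0_true * t) / k for k in (1, 2, 3))
frame = frame + 0.05 * np.random.default_rng(1).standard_normal(len(t))
f0_hat = estimate_f0(frame, fs)
```

The estimate is quantized to integer lags (here about 219-222 Hz for a 220 Hz input); real pitch trackers refine this with interpolation and, as in the approach above, restrict the analysis to high-SNR regions.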
Experimental results show that the proposed pitch estimation method reduces global pitch error in noise. Next, harmonic structure estimation is combined with a traditional statistically based speech enhancement method. Noise estimation is performed based on the harmonic structure, and the estimated noise variance is employed in a traditional MMSE framework for a priori and a posteriori SNR estimation, yielding a gain function for the target speech. Listening experiments with CI subjects demonstrated improved speech intelligibility in both stationary and non-stationary noise.

Item: The Effects of First Impressions on Predicting Honesty Outcomes Using Audiovisual Integration (2022-12-01)
Rakic, Jelena; Katz, William F.; Krawczyk, Daniel; Assmann, Peter F.; O'Toole, Alice J.; Spence, Jeffrey S.
People automatically and unconsciously process first impressions of faces and voices during social interactions. First impressions have consequences because they can influence and predict a variety of societal outcomes, including dating choices, voting behaviors, job interview success, and judicial court rulings. The implications of first impressions also extend to perceptions of honesty: the first impressions we form influence how we evaluate honesty. Perceptions of honesty involve evaluations of how honest, truthful, or deceptive someone appears in a situation. In first impression research, perceptions of honesty (i.e., honesty evaluations) usually focus on high-stakes honesty judgments (e.g., judicial case rulings, police lineups, and police investigations). However, the effects of first impressions on perceptions of honesty are also present in our daily social interactions and affect mundane aspects of our lives. For example, we form first impressions of the individuals we meet every day, and these impressions affect how we perceive the honesty of their opinions, preferences, excuses, or ideas (i.e., low-stakes honesty evaluations).
The first goal of this project was to examine whether instantaneous first-impression judgments of trustworthiness and dominance predict honesty outcomes in a low-stakes situation. The results suggest that initial perception of trustworthiness predicts honesty evaluations, but only at shorter evaluation durations: when participants are given longer durations to evaluate honesty, initial trustworthiness ratings no longer predict honesty outcomes. The second goal was to examine the early stages of first-impression trait judgments and honesty evaluations using thin slices (i.e., 8- and 15-second excerpts from video and audio). A minimum of 8 seconds was chosen to give participants enough time for top-down processing, so that they had enough information to evaluate honesty. A maximum of 15 seconds was chosen to give participants extra time to process the stimuli without risking attention loss. The results suggest that when participants are given more time to evaluate honesty and trustworthiness, the judgments are more positive, which supports the truth-bias theory. The third goal was to examine whether face and voice cues significantly influence perceptions of trustworthiness (video) and dominance (audio), respectively. Results were consistent with earlier findings linking voices to dominance. The stimuli were videos of individuals describing a movie they liked or disliked, consistent or inconsistent with their true opinion.

Item: The Role of Spectral Information in Foreign-Accented Speech Perception (2017-12)
Kapolowicz, Michelle Rae; Assmann, Peter F.
Source signals, vocal tract resonances, and articulatory movements encode talker-specific spectral information that allows a listener's perceptual system to adjust appropriately to the acoustic characteristics of a particular talker. This implicit learning of talker-specific properties is known as talker normalization.
Talker normalization requires prior experience, as well as structured knowledge about pronunciation variation across talkers who share the same native accent, to guide perception. This process becomes difficult when the talker has an accent that is perceived as foreign. Although research suggests that listeners can adapt to foreign accents, the time course and specificity of adaptation remain unclear, especially when listeners attend to speech produced by multiple alternating foreign-accented talkers. This dissertation focuses on the role of spectral cues in the perception of foreign-accented speech. While many factors contribute to the perception of foreign-accented speech, spectral cues are of particular interest because they play an important role in talker-specific phonetic recalibration in native speech, accommodating variation in vocal tract size across talkers. Through a series of experiments, we tested the hypothesis that listeners rely on talker-specific spectral cues when adapting to foreign-accented speech. We assessed the contribution of spectral resolution to the intelligibility of foreign-accented speech by varying the number of spectral channels in a tone vocoder. We also tested listeners' ability to discriminate between native- and foreign-accented speech to determine the effect of reduced spectral resolution on accent detection. Results showed a greater decrease in intelligibility when spectral resolution was reduced for foreign-accented speech than for native-accented speech. Listeners also found it harder to detect a foreign accent in spectrally reduced speech. We extended these findings by investigating the effects of changing the talker from trial to trial, a manipulation that reduces intelligibility compared to holding the talker constant within each block of trials.
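The tone-vocoder manipulation described above (varying the number of spectral channels to reduce spectral resolution) can be sketched roughly as follows. The frame length, linear band spacing, and RMS envelope extraction are illustrative assumptions rather than the dissertation's exact processing chain.

```python
import numpy as np

def tone_vocode(signal, fs, n_channels, frame_len=256, hop=128):
    """Minimal sketch of an n-channel tone vocoder.

    Each windowed analysis frame is split into n_channels contiguous
    frequency bands; the per-band envelope (RMS magnitude) modulates a
    sine carrier at the band's centre frequency, and the modulated
    carriers are summed via overlap-add.
    """
    edges = np.linspace(0.0, fs / 2.0, n_channels + 1)   # band edges (Hz)
    centres = (edges[:-1] + edges[1:]) / 2.0             # carrier tones
    t = np.arange(len(signal)) / fs
    out = np.zeros(len(signal))
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * window
        spec = np.abs(np.fft.rfft(frame))
        seg = slice(start, start + frame_len)
        for ch in range(n_channels):
            band = (freqs >= edges[ch]) & (freqs < edges[ch + 1])
            env = np.sqrt(np.mean(spec[band] ** 2)) if band.any() else 0.0
            # Envelope-modulated sine carrier, windowed for overlap-add.
            out[seg] += env * np.sin(2 * np.pi * centres[ch] * t[seg]) * window
    return out / (np.max(np.abs(out)) + 1e-12)           # peak-normalize
```

Reducing `n_channels` coarsens the spectral envelope while preserving the temporal envelope in each band, which is the degradation the intelligibility and accent-detection experiments manipulate.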
We hypothesized that limiting spectral resolution when listeners were exposed to multiple foreign-accented talkers would cause a further decrease in intelligibility. This prediction was confirmed, supporting the idea that detailed spectral resolution helps to maintain the intelligibility of foreign-accented speech when listeners are exposed to multiple interleaved talkers. Listeners were able to adapt with increased exposure when they heard a single foreign-accented talker, though not to the extent observed with unprocessed natural speech. Performance was higher for native-accented speech, with no difference between single- and multiple-talker conditions. Finally, we investigated how spectral shifting of foreign-accented speech affects intelligibility by scaling the fundamental frequency and spectral envelope to simulate multiple talkers. Consistent with the results for spectrally reduced speech, intelligibility was lower in the multiple-foreign-accented-talker condition than in the single-talker condition. Introducing frequency shifts reduced intelligibility to the levels observed in the multiple-talker condition. These results indicate that listeners depend on spectral cues when perceiving foreign-accented speech, and that spectral information is especially important when listening to speech spoken by different foreign-accented talkers. The results support a model of foreign-accented speech perception that relies on spectral cues to adjust to the deviations between foreign-accented and native speech.
Item Voice Gender and the Segregation of Competing Talkers: Perceptual Learning in Cochlear Implant Simulations(Acoustical Society of America) Sullivan, J. R.; Assmann, Peter F.; Hossain, Shaikat; Schafer, Erin C.
Two experiments explored the role of differences in voice gender in the recognition of speech masked by a competing talker in cochlear implant simulations.
Experiment 1 confirmed that listeners with normal hearing receive little benefit from differences in voice gender between a target and masker sentence in four- and eight-channel simulations, consistent with previous findings that cochlear implants deliver an impoverished representation of the cues for voice gender. However, gender differences led to small but significant improvements in word recognition with 16 and 32 channels. Experiment 2 assessed the benefits of perceptual training on the use of voice gender cues in an eight-channel simulation. Listeners were assigned to one of four groups: (1) word recognition training with target and masker differing in gender; (2) word recognition training with same-gender target and masker; (3) gender recognition training; or (4) control with no training. Significant improvements in word recognition were observed from pre- to post-test sessions for all three training groups compared to the control group. These improvements were maintained at the late session (one week following the last training session) for all three groups. There was an overall improvement in masked word recognition performance provided by gender mismatch following training, but the amount of benefit did not differ as a function of the type of training. The training effects observed here are consistent with a form of rapid perceptual learning that contributes to the segregation of competing voices but does not specifically enhance the benefits provided by voice gender cues.