It is possible that the frequency distributions could possibly generate the wrong number of phoneme categories owing to the acoustic variation from multiple speakers. For example, a bimodal distribution may still represent one phoneme in cases where two speakers who have different dialects speak the same phoneme in their own ways (between-speaker bimodal distribution), thus misleading listeners to infer two phonemes. It is also possible that a perceived unimodal distribution may actually be two underlying phonemes if two speakers have a similar pronunciation for different phonemes (e.g., “bet” from one speaker vs. “bat” from another), leading the listeners to infer the wrong number of categories.
2. A New Proposal: How Do Learners Deal with Speaker Variability in Phoneme Learning?
The first year of life is assumed to be a critical period for phonological learning [
3,
18,
32]. At around 9 months of age, infants begin to attend to their native phonology, while they become less sensitive to speech sounds that are not utilized as linguistic units (phonemes) in their native language (a loss of previous competence to non-native phonemes, a.k.a., perceptual narrowing) [
33,
34].
In the distributional-based learning account, the presence of a unimodal distribution (modal tokens in linguistic input fall somewhere between [da] and [ta], for example) leads to an interpretation of a single phoneme category within the continuum, whereas a bimodal distribution is interpreted as evidence for two phoneme categories. As stated earlier in this paper, such a strategy may be misleading. A bimodal distribution in linguistic input may indeed indicate different phonemic categories within the language, but may also indicate speaker differences in production (i.e., accents). In order to systematically learn the appropriate phoneme contrasts despite such variation, infants might rely on a multitude of cues in addition to the distribution of acoustic tokens.
One way to overcome this confusion with the distributional information in inferring phonemic categories may be to compute speaker-specific distributions. For example, there is a listener who computes whether a variety of /r/ pronunciation from one speaker is distributed unimodally or bimodally for that speaker and builds
speaker-specific distributional representations. It can prevent the listener from inferring the wrong number of phonemic categories owing to speaker variability. The listener can make an inference that only those distinctions that are bimodal within individual speakers indicate a phonemic distinction. There are important findings reporting that infants can extract speaker-specific features from short exposure and apply this knowledge to language learning. A recent study determined that 10- to 12-month-olds can build knowledge of speaker-specific variants and use the knowledge to make an inference of mapping appropriate objects [
35]. There is much research that has also demonstrated that it is within infants’ capacity to learn speaker-specific representations, leading to our hypothesis that it may be also possible for infants to compute speaker-specific distributions.
Our hypothesis that infants are able to compute speaker-specific distributions is centered around this assumption that infants rely on a multitude of cues in phonemic learning because they use not just distributions of tokens, but also their relationship to acoustic evidence that distinguishes two speakers. When we say that infants compute speaker-specific distributional representations, we are implicitly assuming that there are at least two separate computations: distributional computations and computations that link distributions to individual speakers. While speaker identity could be inferred from the acoustic characteristics of speech tokens, there is no reason to suspect that speech is the only source for indexing distinct individuals; for example, infants could use visual information in identifying distinct individuals.
In fact, prior research suggests that statistical learning is constrained by other cues, including visual ones. Teinonen and colleagues [
30] showed that English learning infants at 6 months of age were capable of using visually presented faces to learn two different categories in a unimodal frequency distribution. Moreover, Yeung and Werker found 9-month-old English learning infants could hear a difference between Hindi phonetic contrast (e.g., a dental alveolar stop [da] and a retroflex alveolar stop [ɖa]) only when the sounds were consistently paired with two different visual cues [
31]. Recent work also suggests contributions from computing audio–visual correspondences (see [
36,
37]), and perhaps even sensory-motor correspondences [
38].
With the evidence presented in the next section that listeners, in both adults and infants, are capable of dealing with speaker variability and this ability is systematic, we infer that (1) young listeners may be able to compute speaker-specific distributions as they could build speaker-specific representations and (2) computing speaker-specific distributions would also interact with other contextual cues. Taken together, these support our novel proposal that infants are possibly capable of using multiple cues to infer phonemic categories when they learn their native phonology.
3. A New Proposal: How Do Learners Deal with Speaker Variability in Phoneme Learning?
The first year of life is assumed to be a critical period for phonological learning [3,18,32]. At around 9 months of age, infants begin to attend to their native phonology, while they become less sensitive to speech sounds that are not utilized as linguistic units (phonemes) in their native language (a loss of previous competence to non-native phonemes, a.k.a., perceptual narrowing) [33,34]. In the distributional-based learning account, the presence of a unimodal distribution (modal tokens in linguistic input fall somewhere between [da] and [ta], for example) leads to an interpretation of a single phoneme category within the continuum, whereas a bimodal distribution is interpreted as evidence for two phoneme categories. As stated earlier in this paper, such a strategy may be misleading. A bimodal distribution in linguistic input may indeed indicate different phonemic categories within the language, but may also indicate speaker differences in production (i.e., accents).
In order to systematically learn the appropriate phoneme contrasts despite such variation, infants might rely on a multitude of cues in addition to the distribution of acoustic tokens. One way to overcome this confusion with the distributional information in inferring phonemic categories may be to compute speaker-specific distributions. For example, there is a listener who computes whether a variety of /r/ pronunciation from one speaker is distributed unimodally or bimodally for that speaker and builds speaker-specific distributional representations. It can prevent the listener from inferring the wrong number of phonemic categories owing to speaker variability. The listener can make an inference that only those distinctions that are bimodal within individual speakers indicate a phonemic distinction. There are important findings reporting that infants can extract speaker-specific features from short exposure and apply this knowledge to language learning. A recent study determined that 10- to 12-month-olds can build knowledge of speaker-specific variants and use the knowledge to make an inference of mapping appropriate objects [35]. There is much research that has also demonstrated that it is within infants’ capacity to learn speaker-specific representations, leading to our hypothesis that it may be also possible for infants to compute speaker-specific distributions.
Our hypothesis that infants are able to compute speaker-specific distributions is centered around this assumption that infants rely on a multitude of cues in phonemic learning because they use not just distributions of tokens, but also their relationship to acoustic evidence that distinguishes two speakers. When we say that infants compute speaker-specific distributional representations, we are implicitly assuming that there are at least two separate computations: distributional computations and computations that link distributions to individual speakers. While speaker identity could be inferred from the acoustic characteristics of speech tokens, there is no reason to suspect that speech is the only source for indexing distinct individuals; for example, infants could use visual information in identifying distinct individuals. In fact, prior research suggests that statistical learning is constrained by other cues, including visual ones. Teinonen and colleagues [30] showed that English learning infants at 6 months of age were capable of using visually presented faces to learn two different categories in a unimodal frequency distribution. Moreover, Yeung and Werker found 9-month-old English learning infants could hear a difference between Hindi phonetic contrast (e.g., a dental alveolar stop [da] and a retroflex alveolar stop [ɖa]) only when the sounds were consistently paired with two different visual cues [31]. Recent work also suggests contributions from computing audio–visual correspondences (see [36,37]), and perhaps even sensory-motor correspondences [38].
With the evidence presented in the next section that listeners, in both adults and infants, are capable of dealing with speaker variability and this ability is systematic, we infer that (1) young listeners may be able to compute speaker-specific distributions as they could build speaker-specific representations and (2) computing speaker-specific distributions would also interact with other contextual cues. Taken together, these support our novel proposal that infants are possibly capable of using multiple cues to infer phonemic categories when they learn their native phonology.
3.1. Evidence from Adults
Traditional views regarding how listeners solve the lack of invariance problem have focused on perceptual normalization processes [39,40]. The general assumption in these accounts is that, in the mental dictionary, each word is stored with only one canonical phonetic form, and speech signals that do not match the canonical form would be filtered. That is, idiosyncratic features of different speakers would not be regarded as necessary for further processing and be discarded. For example, Mullennix, Pisoni, and Martin showed perceptual performance in recognizing words decreased when the voice of the speaker changed across trials [41]. In their study, participants showed higher performance on identifying monosyllabic words after they heard the words—CVC format (i.e., consonant—vowel—consonant syllables) from a single talker than multiple talkers. In a subsequent study, they found that listeners had difficulty ignoring speaker variability in word recognition [42]. To summarize these and other results, this early literature seems to show that speaker variability decreases listeners’ ability to process speech signals [43–45].
However, this point of view that speaker variability is thought of as an uninformative cue in speech processing began to be challenged by the evidence that speaker-specific features are not discarded. For example, in Lively and colleagues’ work, Japanese listeners were trained on words containing either English /r/ or /l/ and were asked to identify whether the word contains /r/ or /l/ [16]. Importantly, the words were produced by multiple speakers. The results showed that not only did Japanese listeners have increased performance to discriminate English /r/ and /l/ after training, but also they could recognize English /r/ or /l/ embedded in new words pronounced by a new speaker. This was not observed in the other group of participants who were trained with a single speaker. Additionally, recent literature indicates that listeners are not only sensitive to speaker variability, they are also capable of storing speaker-specific variation [46,47]. Kraljic and Samuel [15] showed that listeners’ perception of a target token was influenced by its lexical context. For example, listeners tended to perceive tokens on a continuum from /d/ to /t/ as more /d/-like after being exposed to an ambiguous d-t sound (mixture of /d/ and /t/) in /d/ containing words (further, exposure to an ambiguous d-t sound in /t/ containing words led listeners to perceive more /t/ tokens on the continuum). This learning effect was still observed for a different voice. Their following study [46] further determined that this perceptual learning occurs in the speaker dimension as well. Listeners had lexical training with two speakers and the two speakers had systematically different pronunciations of fricative consonants. One speaker produced a sound in between /s/ and /ʃ/ in /s/ containing words and the other speaker produced the same sound, but in /ʃ/ containing words. Listeners tended to perceive the continuum differently according to each speakers’ feature; that is, they categorized the continuum as more /s/-like when it was produced by the speaker who spoke the midway sound in /s/ containing words, while they perceived the same tokens on the continuum as /ʃ/ when they were produced by the other speaker who had the tokens in /ʃ/-containing words. Listeners were able to maintain representations of multiple speakers and so their perception of the tokens on the continuum was adjusted according to their experience of each speaker.
Interestingly, listeners did not show such a pattern of perception in stop consonants. This shows that the experience of multiple speakers can lead listeners to build speaker-specific representation and use speaker-specific information in learning phonemic categories; although, at least in adults, this process might interact with the class of phonemes. Exposure to multiple speakers or experience with new speakers can sometimes also help listeners to learn better. For example, Nygaard and Pisoni [48] found that exposure to words spoken by multiple speakers improved listeners’ intelligibility to the words in novel sentences. Furthermore, listeners who had experience with multiple speakers from different regions could recognize each region according to the speaker’s dialects, but this learning did not occur when the listeners had a single speaker exposure of each region, suggesting that the advantage of experience multiple speakers on perceptual learning appears to be observed with dialects as well [49].
Overall, the findings above suggest that listeners can develop speaker-specific representations as well as linguistic context-dependent representations in phonemic processing [16,46]. This ability to deal with speaker variability seems very sophisticated and robust. Adult listeners do not simply discard the variation. Rather, they keep speaker-specific representations and utilize them to adapt to new learning situations. To build speaker-specific representations (systematic patterns of each speaker), surprisingly, they do not need considerable amounts of exposure.
3.2. Evidence from Toddlers
A similar pattern of results shown in adults has been observed with younger populations. There is clear evidence that young populations do not just discard speaker-specific variation (such as the speaker’s voice characteristics). It was already evident from studies in the 1980s that infants could remember fine acoustic details in distinguishing speakers. For example, DeCasper and Fifer [50] showed that newborns could remember a specific voice (e.g., mother’s voice); newborns seemed to prefer their mother’s voice to another female’s voice. Young learners could also distinguish their mother’s voice from other voices [51]. These findings suggest that they can in principle build speaker-specific representations.
In word recognition studies, toddlers appear to adapt speaker-specific features as quickly as adults do and they could recognize familiar words in new accents [52,53]. Van Heugten and colleagues [53] investigated whether Canadian English-learning toddlers recognize familiar words spoken by an unfamiliar Australian accent. After the toddlers were exposed to a child-friendly story, The Very Hungry Caterpillar, produced by a speaker with an unfamiliar accent (an Australian English-Cantonese bilingual female speaker), the toddlers were tested on their comprehension of words spoken either by the same or the other non-native speaker. During the test, two pictures (a target and a distractor) were presented on the screen side by side and a familiar word, such as “cake” was played. Twenty-five-month-old toddlers recognized the words correctly for both familiarized and non-familiarized non-native speakers (a male native speaker of Cantonese).
For toddlers at 2 years of age, it seems even possible to learn new words from multiple speakers. Schmale, Cristia, and Seidl [54] demonstrated that 24-month-old toddlers could always learn new words from either a single speaker or multiple speakers with a short exposure to the speaker(s). This ability seems to be maintained even when the accents were non-native (e.g., Spanish-accented). Potter and Saffran reported similar observations in word recognition in 18-month-olds [52]. Surprisingly, the toddlers had better performance at recognizing words for an unfamiliar accent after they were exposed to brief passages of multiple speakers with multiple accents (a mix of Australian, Southern-American, and Indian accents) than when they were exposed to multiple speakers of a single accent. These results demonstrate that, by the second year of life, toddlers seem not to have difficulty in learning new words as well as recognizing familiar words even if the words are produced by unfamiliar accents. It is also evident that experience of more variability may indeed be advantageous for recognizing words in unfamiliar accents.
Although there are mixed results reported in terms of the age that toddlers can take advantage of speaker variability in word recognition, it appears that both native/non-native accents and speaker variability contribute to the process. For example, while van Heugten and colleagues found that 20-month-olds failed to recognize the familiar words in the presence of a speaker with an unfamiliar accent [53], Potter and Saffran reported that younger infants (18 months of age) could recognize words in unfamiliar accents after having multiple accent exposures [52]. Schmale and Seidl [55] also indicated that 13-month-olds appeared to have no trouble recognizing familiar words embedded in sentences when the voice characteristics of the speaker changed (i.e., one native speaker during familiarization and another during the test) or accent changed (i.e., change from a native speaker to a non-native speaker). In contrast, younger, 9-month-old infants in this study could not succeed in this task when the speakers changed from native to non-native accent, and only succeeded for speaker changes in the native accent.Young children are also able to apply speaker-specific knowledge to learning situations [35,55–57]. White and Aslin found that infants as young as 19 months of age could quickly adjust to a speaker’s unique pronunciation—the speaker pronounced a vowel [a]-containing familiar words, such as dog, as a vowel [æ]- containing words (i.e., a shift from [a] to [æ])—and correctly map the vowel of the switched words to their proper referents based on their experience with the speaker [57]. Weatherhead and White further suggested that 10–12-month-old infants could not only remember speaker-specific features, but also utilize this information in learning new labels [35]. In the study, the infants heard two speakers whose pronunciations of a certain vowel (e.g., front vowels) were systematically different during exposure. For example, one speaker had higher front vowels compared with the other speaker (“t/I/pu” vs. “t/E/pu”) and infants were exposed to the two speakers saying the two words with an object. If infants could learn these productions are systematically different across two speakers and learn that the different pronunciations are speaker-specific variants, they could infer that the word “t/E/pu” from the speaker who said “t/I/pu” refers to the new object, while the same word (“t/E/pu”) from the other speaker (who said“t/E/pu” during the exposure) indicated the same object from the training. The results confirmed the prediction; infants showed longer looking times to the untrained object (i.e., new object) after they heard that the “t/E/pu” speaker said “t/I/pu”, but showed the reversed looking pattern (i.e., looked longer to the trained object) when the “t/I/pu” speaker said “t/I/pu”. While there is no direct observation reported, these findings support our hypothesis that young infants could keep speaker-specific variants and this ability can be applied to computing speaker-specific distribution.
It is important to note that this proposal can explain the phoneme acquisition in the presence of multiple speakers not only in monolinguals, but also in bilinguals. In fact, language learners who are raised in multicultural societies are more likely to be exposed to increased variability because of the possible presence of non-native accented speakers (see [16,58]). The classic distributional hypothesis may be difficult to apply in this case where the same phoneme might have acoustically distinct productions in speakers of different languages (e.g., an English /p/ versus French /p/).In bilingual native phonetic discrimination, there is one distinct pattern that is not observed in monolingual development, that is, the U-shaped developmental pattern. For instance, while bilingual Spanish–Catalan infants were able to discriminate a Catalan vowel contrast /e/-/ɛ/ at both 4 months and 12 months, they failed to do so at 8 months (U-shaped pattern). For Catalan and Spanish monolinguals, while Catalan monolingual infants maintained the sensitivity to the contrast at 12 months because it is their native, Spanish monolingual infants no longer discriminated the contrast at the same age [59]. There is also similar evidence found in French and English bilingual infants [60].
We suggest that this short-term delay in bilinguals may be related to their extensive use of other cues, such as audiovisual speech cues. Pons, Bosch, and Lewkowicz found that both 8-month-old Catalan and Spanish monolinguals exhibited longer looking times at the speaker’s mouth than the eyes in response to both native and non-native speech, but 12-month-old monolinguals looked longer at the mouth only to nonnative speech. However, bilinguals always looked longer at the speaker’s mouth for both native and nonnative speech at all ages [61]. If it is true that infants take more advantage of other resources (e.g., visual information from speakers) when their input language has high variability, then our proposal that infants can compute speaker-specific distribution in learning phonemes also seems to be plausible with bilinguals.
4. Conclusions: An Enriched View of Phoneme Acquisition
A rich psycholinguistic literature with adults has revealed that adults can rapidly adapt to a variety of speaker-specific variations [26,62,63]. We presented evidence that infants are also sensitive to acoustic differences that mark both language-relevant differences (e.g., different vowels) and accentual differences (see [64]). In addition, there is evidence that infants can integrate visually presented, socially relevant information about speakers with phonetic distributional information to make inferences about phonemes. For example, ter Schure, Junge, and Boersma exposed 8-month-old Dutch infants to English vowel contrasts presented by talking faces and found longest looking to the mouth only in the bimodal condition [37]. Moreover, phonetic, distributional information interacts with visual information about potential referents. Yeung and Werker found that 9-month-old English-learning infants, who typically cannot discriminate auditory tokens for a continuum between [da] and [ɖa], were able to do so when, in a familiarization phase, tokens from the two halves of the continuum were consistently paired with one of two distinct, novel objects [31]. The ability to utilize visually presented, socially relevant information about speakers indicates that infants can extract indexical information about speakers, potentially leading them to compute speaker-specific distributions.
In this project, we suggest that various aspects of language learning can be seen as inferential processes that weigh several sources of information. This proposal also infers that learning phonemic inventories would benefit from the social context in which it occurs and it could explain the rapid narrowing of phonemic categories in the face of acoustic variation that sometimes signals underlying phoneme categories and sometimes not.
What might be the driving force behind using social (auditory or visual) information to guide phoneme learning? Here, we borrow ideas from Relevance Theory and Natural Pedagogy (see [65]). We do so by first noting that, even though the majority of previous studies investigating early language acquisition have done so in experimental contexts that are devoid of any social cues, a study by Kuhl, Tsao, and Liu [66] found that, in the absence of contingent social cues (presentation of auditory material via a live person versus a recording of that person), distributional learning appeared to completely disappear. This observation dovetails with Natural Pedagogy, which aims to understand how infants acquire generic information from specific observations. It proposes that humans are specifically adapted for the transmission of generic information—teachers provide specific ostensive cues that accompany instances where they would like the learner to infer generic information, and infants are sensitive to such cues, and use their presence to make inferences that lead to the acquisition of generic information. For example, if an infant sees Bob looking disgustedly at a glass of a green liquid, the infant could either infer that Bob does not like that green liquid (Bob-specific interpretation) or that green liquids are bad for us (generic interpretation). In a series of studies, it has been shown that infants make generic inferences when these are accompanied by ostensive-communicative cues such as direct eye gaze, contingent interaction, ostensive pointing, or human speech [67].
Phoneme inventories and words are also examples of generic knowledge—if Arya and Sansa speak the same language, we can expect that Arya’s phonemic description of the word “bottle”, and what she intends by that word, generalizes to Sansa. Considering Natural Pedagogy as an appropriate framework suggests that ostensive cues might modulate how infants make inferences when acquiring phoneme categories and words. In fact, there is a slew of evidence that demonstrates that infants, even in their first year of life, consider the perspective of other agents (e.g., [68]), and use ostensive cues as a guide to determine categorization and generalization (see [69,70]). In a recent study, Kovács, Téglás, Gergely, and Csibra found that 12-month-olds were more likely to assign ambiguous objects (e.g., a plate that can turn into a cup) to a demonstrated category, when the demonstrator looked directly at the infant; direct gaze is another ostensive cue [71]. We believe that infants are capable of considering the linguistic role played by phonemes and words as meaningful categories that serve the social-communicative goals of language. Perhaps, the implications of this novel framework are not limited to phoneme learning, but are, in principle, widely applicable to all of language. This idea, therefore, can hold promise to transform how we view learning categories in all aspects of language, whether these are in phonology, syntax, or semantics, as outcomes of social-communicative inferences.