The effect of tone on Mandarin English learners’ perception and production of English consonant clusters
Yizhou Lan 1, Sunyoung Oh 2
1,2 Department of Chinese, Translation and Linguistics, City University of Hong Kong, Hong Kong
firstname.lastname@example.org, email@example.com
Abstract The present study investigates the effect of L1 tone on L2 acquisition at the segmental level. In particular, it examines how Mandarin tone may influence the production and perception of English initial Cr- clusters when an epenthetic vowel occurs, C[V]r-. First, the acoustic features of the epenthetic vowels were compared with those of full vowels produced by Mandarin speakers. Second, the perception of the epenthetic and full vowels by Mandarin speakers and native English speakers was compared. For perception, the epenthetic vowels were manipulated to have constant duration or constant tone and compared with the original ones. Both identification and discrimination tasks were conducted to find the differences between the three variations. Results show that variations in the tone and duration of the epenthetic vowels did not affect perception by the English speakers, who showed similarly low accuracy in the three tasks. However, the Mandarin speakers showed significantly lower accuracy when tone was manipulated in the tone-constant task. The findings suggest that an L1 feature, Mandarin tone, is significantly involved in L2 speech acquisition. The present study also intends to support the attention-based model over distance-based models in the theoretical debate on how L2 categories are acquired.
Index Terms: L2 acquisition, English consonant clusters, tonal effect, Mandarin tone.
1. Introduction The literature reports that Mandarin learners of English tend to insert a short vowel within initial consonant clusters [1-3]. This is believed to result from the transfer of Mandarin syllable structure (CV/CVN): because CC onsets are not allowed in Mandarin, English CC onsets are rendered as CV. Investigations of such epenthesis have largely stayed at the segment and syllable levels and seldom involve tone, an important feature that distinguishes Mandarin from English. Evidence shows that speakers of tone languages map the perception of stress in stress-timed languages (e.g., English) onto pitch variations in their native languages. Given that epenthesis only happens in unstressed syllables, we may predict that Mandarin speakers will realize the CC onset with an epenthetic vowel (i.e., CVC) if re-syllabification, which inserts an unstressed syllable, creates a low tone. Since Mandarin makes extensive use of low tone and neutral tone to contrast lexical meaning, such transfer is possible. In the current study, we examine how tone and epenthetic vowels are related in Mandarin speakers’ L2 production and perception of English initial stop-/r/ clusters. First, we compared the acoustic features of the epenthetic vowels with those of full vowels (e.g., krit vs. kerit). Second, we compared the perception of the epenthetic and full vowels by Mandarin learners of English with that of native English speakers, in order to find the role tone plays in helping Mandarin speakers form L2 segmental categories. Apart from investigating the tonal effect in L2 speech, we are also interested in identifying a suitable theoretical model that can accommodate such a tonal effect in second language speech acquisition. Previous studies on L2 speech learning were largely conducted within either distance-based models, e.g., the Speech Learning Model (SLM), or attention-based models, e.g., the Automatic Selective Perception (ASP) model. The SLM posits that experienced learners establish an intermediate category between L1 and L2 based on the distance between the L1 and L2 categories. However, it does not address features that cannot be described by distance, such as tone and intonation. By contrast, the ASP model bases its predictions of L2 category formation on how attention is distributed across the different dimensions of an L2 sound, and it is not restricted to segments. If the first type of model is correct, then segmental production cannot be influenced by tonal variations, because the perceptual distance between two segments is essentially unaffected by tone. If the second type is correct, it is possible that tone influences segmental production.
2. Method 2.1. Participants Three Mandarin speakers of English (two male, one female) participated in the production part of the study. Their productions were analyzed acoustically and served as the material for the perception tests. The speakers were all from the old districts of the city of Beijing and were pre-screened with a native-speaker perception test. None reported any history of speech or hearing impairment, and all had received training in singing. Ten listeners participated in the perception part of the study: five Mandarin advanced learners of English (three males, two females, mean age=22.5) and five native English speakers (three males, two females, mean age=24). The Mandarin listeners were all native speakers of Beijing Mandarin and were first-year students at Hong Kong universities. They started learning English at or before the age of 6, and all came from middle-class families. The native English listeners were exchange students from the east coast of the USA, all speaking with a standard American accent. None of the listeners reported hearing deficiencies and none had professional musical training. To ensure relative uniformity of dialectal and other linguistic background, all 30 Mandarin students interested in participating in the experiment were asked to complete a revised version of the LEAP-Q questionnaire for linguistic experience. The questionnaire recorded both linguistic and biographical information such as history of foreign language learning, age, GPA, and standardized English test scores. Only students with similar linguistic experience (std<1.5) were chosen for the perception tests. All participants, including speakers and listeners, were remunerated HK$50 for their participation.
2.2. Stimuli and Procedure We designed a production test and a perception test to find out whether L1 tonal variations affect the category formation of L2 English initial stop-/r/ clusters. The acoustic properties of epenthetic vowels in C-/r/ clusters were examined for Mandarin learners, and perceptual accuracy for the difference between epenthetic and full vowels was examined for Mandarin learners and English speakers. Stimuli were 27 pseudo-words in closed syllables. They included an initial cluster (CrVt), a regular plosive (CVt), and a cluster with a full vowel in between (CerVt). The initial consonant differed in place of articulation and the following vowel was /i/, /ɑ/ or /u/. Words were embedded in carrier sentences. We chose voiceless consonants because voiced consonants might introduce a duration effect when confounded with vowels. The production test was conducted in a sound-proof booth. Productions were recorded at a sampling frequency of 44100 Hz. The perception tests were held in the same booth, with head-mounted earphones provided to the listeners. In the production test, the productions of the 3 Mandarin advanced learners of English were analyzed for duration, F1 and F2. The acoustic detail of interest, i.e., the vowel portion of both epenthetic-vowel and full-vowel words, was measured from the end of the consonant release to the visible formant contour of /r/, marked by the rising of F3. The production stimuli obtained from the three speakers first went through a screening process. Words produced with a wrong vowel due to lexical effects were left out. Only words with clearly audible epenthesis were chosen by a phonetically trained listener, i.e., the first author, in order to attain maximal perceptual confusion. The duration of each produced word was also normalized by the average duration within and across the three speakers.
More importantly, the /r/+V+C parts of each pair of words were cut and pasted manually, aligned to the offset of /r/, to ensure that discrimination was not affected by differences in the non-cluster part of the word pair. The perception test contained productions with three types of manipulation: original, duration-constant and tone-constant. To create the duration-constant stimuli, the original productions underwent duration modification, with the portion from the release of the initial consonant to the onset of the vowel reset to the duration of a syllable in the speaker’s production. To create the tone-constant stimuli, the original productions were passed through a low-pass filter and the pitch of each production was set to 100 Hz. The sounds were manually checked to ensure that they sounded natural to the human ear, and all odd-sounding files were discarded. In the identification task, randomized pairs of stimuli were printed on paper and listeners were asked to identify which one they heard. In the discrimination task, each pair of stimuli was compiled into three combinations (ABA, ABB, and AAB) for listeners to discriminate. Each pair was repeated 10 times, with an equal number of fillers added. Theoretically, the number of tokens to be included in the analysis was 27 stimuli × 5 repetitions × 2 combinations + 27 stimuli × 5 repetitions × 3 combinations = 675 tokens for each speaker. After screening, a total of 620 tokens were selected as the perception test material. The within-trial inter-stimulus interval (ISI) was set at 50 ms and the between-trial ISI at 200 ms. We used a short ISI to reduce the time lag between stimuli and thereby allow for higher accuracy. All trials were randomized, with an equal number of fillers added.
Since participants expressed fatigue in a pilot study, we divided the trials into four chunks, each containing 360 tokens.
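The stimulus design and the theoretical token count described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure actually used to build the test: the consonant set /p, t, k/ is inferred from the voiceless, place-varying description and from the examples in the text, the orthographic spellings are hypothetical, and reading the "3 combinations" term as the ABA, ABB and AAB triads is an assumption.

```python
from itertools import product

# Assumed inventory: three voiceless stops differing in place, three vowels.
CONSONANTS = ("p", "t", "k")          # assumption, inferred from the text
VOWELS = ("i", "ɑ", "u")
# Three syllable templates from Section 2.2 (spellings are hypothetical).
TEMPLATES = {
    "CrVt": "{c}r{v}t",    # initial cluster
    "CVt": "{c}{v}t",      # regular plosive
    "CerVt": "{c}er{v}t",  # cluster broken up by a full vowel
}
DISCRIMINATION_PATTERNS = ("ABA", "ABB", "AAB")


def build_stimuli():
    """Return all template x consonant x vowel pseudo-words."""
    return [
        TEMPLATES[name].format(c=c, v=v)
        for name, c, v in product(TEMPLATES, CONSONANTS, VOWELS)
    ]


def theoretical_token_count(n_stimuli=27, n_repetitions=5,
                            n_first_term_combinations=2,
                            n_second_term_combinations=3):
    """Reproduce the two-term sum reported in the text."""
    first = n_stimuli * n_repetitions * n_first_term_combinations
    second = n_stimuli * n_repetitions * n_second_term_combinations
    return first + second


def build_triads(a, b, patterns=DISCRIMINATION_PATTERNS):
    """Expand one stimulus pair into its discrimination triads."""
    lookup = {"A": a, "B": b}
    return [tuple(lookup[slot] for slot in pattern) for pattern in patterns]


if __name__ == "__main__":
    print(len(build_stimuli()))            # number of pseudo-words
    print(theoretical_token_count())       # theoretical tokens per speaker
    print(build_triads("krit", "kerit"))   # triads for one target pair
```

Under these assumptions the grid yields the 27 pseudo-words and the sum yields the 675 theoretical tokens per speaker stated above; the 620 tokens retained after screening cannot be derived from the design alone.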
3. Results 3.1. Results on the production test Mandarin speakers’ productions of English contained a considerable number of epenthetic vowels: 38% of the tokens clearly contained an inserted vowel. Acoustically, such productions showed a dip and then a peak in F3, indicating the insertion of a vowel-like gesture. Such insertions are sometimes inaudible. We examined the acoustic qualities of epenthetic vowels (as in the epenthetic portion between /p/ and /r/ in the learners’ production of prit) and full vowels (as in the /ə/ in the first syllable of terrain) elicited from the words. Duration, tone, F1 and F2 were analyzed. The effects of subject and vowel were not significant for any dependent variable. However, the effect of consonant was significant for duration [F(2, 248)=4.216, p<.0001], with the post-hoc test showing that the duration of epenthetic vowels was significantly lower in the alveolar cluster (/tr/-) [md=.49, std.E=.118, p<.001] (see Figure 1). The average durations for epenthetic and full vowels were 49.01 and 150.47 respectively. Average tone was 34 Hz and 65 Hz respectively. For F1 and F2, the values were 1245 and 2356 respectively. Within-group variance tests show that the difference between epenthetic and full vowels was significant for duration [F(2, 248)=3.488, p<.0001] (see Figure 2) and tone [F(2, 248)=.287, p<.0001], but not significant for F1 and F2 (see Figure 3).
Figure 1: Comparison of duration for epenthetic and full vowels by consonants.
Figure 2: Comparison of duration and tone for epenthetic and full vowels.
Figure 3: Comparison of average F2 and F3 formants for epenthetic and full vowels.
3.2. Results on the perception test Overall, the three tasks were completed with average accuracy rates of 88.4%, 81.6% and 61.3% by the five Mandarin speakers, and 66.7%, 67.5% and 67.4% by the English speakers. For Mandarin speakers, the second and third tasks showed a substantial drop in accuracy [F(3, 617)=8.719, p<.0001], but this was not the case for English speakers [F(3, 617)=1.249, p=.576]. For Mandarin speakers, Tukey’s post-hoc test showed that the difference lay in the third task, in which tone was held constant [Task 1: md=.27, std.E=.214, p=.438; Task 2: md=.27, std.E=.214, p<.001; Task 3: md=.27, std.E=.214, p<.0001]. Consonant differences were drastic across all three tasks (86.6%, 96.1% and 83.3% for the first task; 85.4%, 100% and 81.7% for the second task; 56.7%, 99.1% and 56.7% for the third task) [F(2, 617)=9.467, p<.0001]. Tukey’s post-hoc test showed that the difference lay in the alveolar cluster /tr/. A similar situation held for English speakers as well. With the /tr/ cluster excluded, the percentages were quite different. The three tasks were completed with average accuracy rates of 80.4%, 74.6% and 49.7% by the five Mandarin speakers and 51.4%, 47.7%, and 46.9% by the English speakers. The inter-task difference for Mandarin speakers was even larger, and the chance-level performance in the third task was even more salient [F(3, 617)=6.275, p<.0001]. The post-hoc tests gave [Task 1: md=.45, std.E=.178, p=.438; Task 2: md=.45, std.E=.178, p<.001; Task 3: md=.45, std.E=.178, p<.0001] (see Figure 4). For both Mandarin and English speakers, between-group differences of vowel and speaker were not significant. For Mandarin speakers, the average accuracy rates by vowel, in the order /i/, /ɑ/, and /u/, were 87.5%, 91.4% and 88.5%, and the accuracy rates by individual speaker were 81.55%, 82.2%, 74.4%, 78.8% and 81.5%.
For English speakers, accuracy rates by vowel were 51.8%, 42.8%, and 47.7% respectively, and accuracy rates by individual speaker were 65.67%, 68.6%, 59.4%, 68.6%, and 70.2%. None of these differences was significant.
Figure 4: Comparison of mean perceptual accuracy rates by tasks (with /tr/ condition excluded).
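The size of the accuracy drop relative to the original-production task can be read directly off the overall means reported in Section 3.2. The following sketch simply subtracts each task mean from the first-task mean for each listener group. The task labels are our reading of the text (task 1 original, task 2 duration-constant, task 3 tone-constant), and the helper name is hypothetical; it does not reproduce the ANOVA or the post-hoc tests.

```python
# Overall mean accuracy rates (%) as reported in Section 3.2, /tr/ included.
REPORTED_ACCURACY = {
    "Mandarin": (88.4, 81.6, 61.3),
    "English": (66.7, 67.5, 67.4),
}
# Task order is an assumption based on the description in the text.
TASKS = ("original", "duration-constant", "tone-constant")


def drop_from_original(accuracies):
    """Percentage-point drop of each task relative to the first task."""
    baseline = accuracies[0]
    return [round(baseline - value, 1) for value in accuracies]


if __name__ == "__main__":
    for group, accuracies in REPORTED_ACCURACY.items():
        drops = dict(zip(TASKS, drop_from_original(accuracies)))
        print(group, drops)
```

For the Mandarin group this gives drops of 6.8 and 27.1 percentage points in the duration-constant and tone-constant tasks, whereas the English group stays within one percentage point of its first-task mean, which is the pattern the statistical tests above describe.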
4. General Discussions In the production test, the epenthetic and full vowels showed significant differences in duration, but not in F1 and F2 values. We could interpret the non-significant difference in F1 and F2 as evidence that the epenthetic and full vowels do not differ spectrally. The longer durations of full vowels could be interpreted as a longer duration of glottal opening. More importantly, the production results showed evidence that tone can be a salient cue for the perceptual distinction of L2 segments. This echoes our hypothesis that L1 suprasegmental traits may also influence L2 perception. In the perception test, the substantial gap in overall accuracy between Mandarin and English speakers showed that Mandarin speakers could distinguish full from epenthetic vowels in a minimal context, while native English speakers could not. Notably, the high accuracy rate for Mandarin listeners’ perception of the original productions showed that they could distinguish the target pair (CrVt and CerVt) with ease. The contrast between the high accuracy of Mandarin listeners and the low accuracy of English listeners in the original task indicated that Mandarin speakers may have utilized some L2-specific cues in perceiving L2 speech. In the second and third tasks, however, where duration and tone were manipulated, accuracy rates dropped substantially for Mandarin speakers. This showed that tone and duration were the cues Mandarin speakers used to perceive epenthetic vowels. However, the two cues were not weighted equally. Duration may not be a heavily weighted cue, because perceptual accuracy remained fairly high in the second task as well (especially in the /p/ condition). The tonal cue was weighted more heavily than duration, because a larger drop in accuracy was found in the third task.
We could hence infer that Mandarin listeners allocate much of their attention to the pitch variations in their peers’ productions to decide whether there is a phonological vowel between the two consonants of a stop-/r/ cluster. For English speakers, perceptual accuracy was similarly low across the three tasks. This suggested that the durational and tonal cues did not significantly help English speakers in perceiving the difference. However, we did see a minor drop in accuracy in the second task, showing that duration may be of slight, non-significant help in English speakers’ successful perception of epenthetic vowels. This suggests that duration can be a universal cue for perceiving epenthetic vowels, which is predictable.
The unique behavior of the alveolar consonant cluster in the perception test was consistent across all three conditions and for both English and Mandarin speakers. As shown in the results, the alveolar cluster conditions had higher perceptual accuracy. Even with constant tone, Mandarin speakers had no problem distinguishing the trVt and terVt contrast. This was probably due to the gestural simplification of the tr- cluster. In tr- clusters, the tongue tip might undergo a gestural conflict: the degree of displacement of the tongue tip would be restricted by the same tongue-tip closure gesture in /t/ as well as /r/. As supporting evidence, the production pattern of trVt was also entirely different from that of prVt and krVt. Instead of vowel epenthesis, /tr/ was realized as /tʃ/, which is acoustically more familiar to the Mandarin ear. Under this specific gestural condition, cue weighting in perception shifted to the large spectral differences between /tVr/ and /tr/, where the latter was phonetically no longer clearly two separate phonemes but co-articulated. The results can be explained through both distance-based and attention-based theoretical frameworks. In the former framework, the perceptual accuracies of the two groups of speakers differed because of varying perceptual distances. For English speakers, because they already knew that these productions were English, an equivalent classification was bound to occur regardless of whether tonal information was provided. In this situation, the assimilation type was single-category (SC) with equivalent classification, resulting in a poor discrimination rate. For Mandarin speakers, however, since they were sensitive to the pitch variations, their perceptual mapping was clear-cut and the distance was greater because they perceived duration differently.
However, we could not simply conclude that distance accounts for how distinguishable English speakers’ perceptual map is because, as stated above, tone has not traditionally been deemed a factor influencing segmental perceptual distance in distance-based models. Such models cannot explicitly describe a quantitative distance for tone, which embodies both tone value and tone contour; tones are not purely numbers but also vectors. By contrast, an attentional model can explain the apparent transfer of L1 tone onto L2 segments, because L2 perceivers may direct their attention to the tonal cue when perceiving syllable structure, where segmental and suprasegmental information are presented in competition. Mandarin speakers’ overuse of pitch as a perceptual cue, as shown in the perception results, can be explained by greater attention being allocated to tonal differences than to temporal and spectral ones. The current results suggest that changes in different perceptual dimensions of the same category may have different impacts on perceptual category formation. The findings also provide empirical evidence against the idea that L2 speech is a fragmentary form of L1. Instead, the L2 interlanguage makes use of many cues that native speakers of the target language never use. In the current study, Mandarin L2 English speakers could successfully perceive L2 interlanguage minimal pairs with the aid of L1 tones, but native English speakers failed to do so. That L2 phonology has its own rules also provides new insight into L2 speech learning.
5. Conclusion The experiments in this study tested Mandarin speakers’ realizations of English initial consonant-/r/ clusters, and Mandarin and English speakers’ perceptual accuracy in three conditions that varied in the absence of certain possible perceptual cues. The results showed how L1 tones may influence the L2 segmental production and perception of English. In the perception test in particular, tone was found to serve as a heavily weighted perceptual cue. The accurate perception of the cluster vs. non-cluster conditions in onset position was heavily influenced by whether pitch information was provided. Since Mandarin makes extensive use of different pitch ranges and directions in contour pitches to distinguish lexical meaning, changes in pitch variation are essentially rooted in the attentional resources that L1 Mandarin learners tend to draw on in L2 English, even for advanced learners such as those tested in this study. Moreover, the different perceptual accuracy for alveolar and other clusters confirmed that the preferred dimension of attention differs not only by language but also by the salience of the acoustic stimuli themselves. Nevertheless, future studies should include more cognitive methods to examine the actual distribution of attention by L2 speakers.
6. References
[1] Lin, Y.-H., “Syllable simplification strategies: A stylistic perspective,” Language Learning, 51(4):681–718, 2001.
[2] Lee, S., “A comparison of cluster realizations in first and second language,” The Journal of Studies in Language, 19(2):341–357, 2003.
[3] Deterding, D., “The pronunciation of English by speakers from China,” English World-Wide, 27(2):175–198, 2006.
[4] Carlisle, R. S., “Syllable structure universals and second language acquisition,” International Journal of English Studies, 1(1):1–19, 2001.
[5] Itô, J., “A prosodic theory of epenthesis,” Natural Language & Linguistic Theory, 7(2):217–259, 1989.
[6] Hombert, J. M., Ohala, J. J. and Ewan, W. G., “Phonetic explanations for the development of tones,” Language, 55(1):37–58, 1979.
[7] Kingston, J., “Tonogenesis,” Blackwell Companion in Phonology, 4(97):2304–2333, 2011.
[8] Flege, J. E., “Assessing constraints on second-language segmental production and perception,” Phonetics and phonology in language comprehension and production: Differences and similarities, 319–355, 2003.
[9] Strange, W., “Automatic selective perception (ASP) of first and second language speech: A working model,” Journal of Phonetics, 39(4):456–466, 2011.
[10] So, C. K. and Best, C. T., “Cross-language perception of nonnative tonal contrasts: Effects of native phonological and phonetic influences,” Language and Speech, 53(2):273–293, 2010.
[11] Eckman, F. R., “The structural conformity hypothesis and the acquisition of consonant clusters in the inter-language of ESL learners,” Studies in Second Language Acquisition, 13(1):23–41, 1991.
[12] Marian, V., Blumenfeld, H. K. and Kaushanskaya, M., “The language experience and proficiency questionnaire (LEAP-Q): Assessing language profiles in bilinguals and multilinguals,” Journal of Speech, Language and Hearing Research, 50(4):940–957, 2007.
[13] Kent, R. D. and Read, C., “The acoustic analysis of speech,” Thomson Learning, Albany, NY, 2002.
[14] Browman, C. P. and Goldstein, L., “Tiers in articulatory phonology, with some implications for casual speech,” Papers in laboratory phonology I: Between the grammar and physics of speech, 341–376, 1990.
[15] Best, C. T., “A direct-realist view of cross-language speech perception,” Speech perception and linguistic experience: Issues in cross-language research, 171–204, 1995.
[16] Guion, S. G. and Pederson, E., “Investigating the role of attention in phonetic learning,” Language experience in second language speech learning, 57–77, 2007.