
MA Thesis

Automatic pronunciation error detection in Dutch as a second language: an acoustic-phonetic approach

Khiet Truong

First Supervisor: Helmer Strik (University of Nijmegen)
Second Supervisor: Gerrit Bloothooft (Utrecht University)

Submitted to the Faculty of Arts of Utrecht University, The Netherlands

MA thesis (doctoraalscriptie) by Khiet Truong
(Title in Dutch: "Automatische detectie van uitspraakfouten bij NT2-leerders: een akoestisch-fonetische aanpak")
Faculty of Arts, Utrecht University
Programme: General Linguistics, specialization: Computational Linguistics
First thesis supervisor: Helmer Strik (Katholieke Universiteit van Nijmegen)
Second thesis supervisor: Gerrit Bloothooft (Universiteit van Utrecht)
June 2004

Acknowledgements

The research for my MA thesis was carried out at the Department of Language and Speech at the University of Nijmegen. From September 2003 to June 2004, I participated in the PROO project. I would like to take this opportunity to thank everyone who has helped me carry out the research and write my MA thesis at this department. I would like to thank Lou Boves and Gerrit Bloothooft for making this traineeship possible. I would also like to thank my supervisors, who have guided me and helped me complete this thesis: Helmer Strik, Catia Cucchiarini, Ambra Neri (University of Nijmegen) and Gerrit Bloothooft (Utrecht University). Thank you, I have learned so much from you. The other members of the PROO group are also thanked for their help. Finally, the members of the Department of Language and Speech at the University of Nijmegen and the thesis group at Utrecht University are thanked for sharing their knowledge and giving feedback on my work and presentations.

Khiet Truong
Apeldoorn, June 2004

Contents

Acknowledgements 3
Contents 4
1 Introduction 7
1.1 Background: CAPT within CALL 7
1.2 The aim of the present study 9
1.3 Structure of the thesis 11
2 Automatic detection of pronunciation errors: a small literature study 13
2.1 Introduction 13
2.2 Why do L2 learners produce pronunciation errors? 13
2.3 What kind of pronunciation errors do L2-learners make? 16
2.4 Possible goals in pronunciation teaching 18
2.5 Overview of automatic pronunciation error detection techniques in the literature 19
2.5.1 Overview of ASR-based techniques 19
2.5.2 Adding extra knowledge to acoustic models and ASR-based techniques 22
2.6 Automatic pronunciation error detection techniques employed in real-life applications 24
2.6.1 Employing ASR-based techniques in real-life CALL applications 24
2.6.2 Using acoustic-phonetic information in real-life CALL applications 25
3 The approach adopted in the present study 28
3.1 Introduction 28
3.2 An acoustic-phonetic approach to automatic pronunciation error detection 28
3.3 Selecting pronunciation errors 30
3.3.1 Goal of pronunciation teaching adopted in this study 30
3.3.2 The pronunciation errors addressed in this study 30
4 Material & Method 33
4.1 Introduction 33
4.2 Material 33
4.3 Algorithms used in this study 36
4.3.1 Linear Discriminant Analysis 36
4.3.2 Decision tree-based 41
5 The pronunciation error detectors /A/-/a:/ and /Y/-/u,y/ 43
5.1 Introduction 43
5.2 Acoustic characteristics of /A/, /a:/, /Y/, /u/ and /y/ 43
5.2.1 General acoustic characteristics of vowels 43
5.2.2 Acoustic differences between /A/ and /a:/ 45
5.2.3 Acoustic differences between /Y/ and /u,y/ 46
5.2.4 Acoustic features for vowel classification: experiments in the literature 47
5.3 Method & acoustic measurements 51
5.4 Experiments and results for /A/-/a:/ and /Y/-/u,y/ 53
5.4.1 Organization of experiments 53
5.4.2 Experiments and results /A/-/a:/ 56
5.4.3 Experiments and results /Y/-/u,y/ 63
5.4.4 Experiments and results /Y/-/u/-/y/ 68
5.5 Discussion of results 71
5.5.1 Discussion of the results of /A/-/a:/ 71
5.5.2 Discussion of the results of /Y/-/u,y/ 75
6 The pronunciation error detector /x/-/k,g/ 78
6.1 Introduction 78
6.2 Acoustic characteristics of /x/, /k/ and /g/ 78
6.2.1 General acoustic characteristics of consonants 78
6.2.2 Acoustic differences between /x/ and /k,g/ 80
6.2.3 Acoustic features for fricatives versus plosives classification: experiments in the literature 81
6.3 Methods & acoustic measurements 83
6.3.1 Method I & acoustic measurements 83
6.3.2 Method II & acoustic measurements 87
6.4 Experiments and results for /x/-/k,g/ 88
6.4.1 Organization of experiments 88
6.4.2 Experiments and results method I 89
6.4.3 Experiments and results method II 90
6.5 Discussion of results 95
7 Conclusions and summary of results 101
7.1 Introduction 101
7.2 Summary of /A/-/a:/ 101
7.3 Summary of /Y/-/u,y/ 103
7.4 Summary of /x/-/k,g/ 104
7.5 Conclusions 106
References 113
A List of abbreviations 117
B List of phonetic symbols 118
C Scripts 121
D Sentences 127
E Amount of speech data 129
F Tables with classification scores 133
G How to read Whisker's Boxplot 147

Chapter 1
Introduction

1.1 Background: CAPT within CALL

Traditionally, pronunciation training has received less attention than writing and grammar in foreign language teaching. Many language teachers believed that pronunciation did not deserve as much attention as other linguistic aspects such as grammar, mainly because they considered accent-free pronunciation a myth (Scovel, 1988), and thus an impossible goal to achieve. This view has, among other factors, had a rather negative influence on the amount of available information on how pronunciation can best be taught. Nowadays, it is generally agreed that a reasonably intelligible pronunciation is more important than accent-free pronunciation. Unfortunately, pronunciation training is still often neglected in traditional classroom instruction, mainly because it is time-consuming: it requires a lot of time for practice from students and a lot of time from teachers for providing feedback. Computer Aided Language Learning (CALL) systems can offer a solution. More specifically, a Computer Aided Pronunciation Training (CAPT) module within such a CALL system can tackle the problems associated with training pronunciation in a classroom environment, and offers many other advantages. Technology is nowadays more and more integrated in teaching, and more specifically in foreign language teaching; there are many software applications available on the market that teach users foreign languages. CAPT and CALL applications provide a solution to the problems mentioned above. First of all, computers are more patient than human teachers, and are usually available without any time constraints. Secondly, computers provide a more individual way of learning

which allows students to practise their own pronunciation difficulties and work at their own pace, whereas in a traditional classroom environment it is difficult to focus on the needs of individual students. Moreover, student profiles can be logged by the system, so that improvement or problems can be monitored by the teacher or by the students themselves. Finally, a classroom environment can cause more anxiety or stress for students; a CALL environment, which offers more privacy, can reduce this phenomenon, known as foreign language anxiety (Young, 1990). These collective advantages have led to an increasing interest in CALL, and more specifically CAPT, from the language teaching community. Developing CALL and CAPT systems offers challenges and new interdisciplinary areas of interest in the field of language teaching: technology has to be integrated into a language teaching system in such a way that it meets pedagogical requirements. Neri et al. (2002) describe the relationship between pedagogy and technology in CAPT courseware more closely. CAPT can be integrated into a CALL system by using Automatic Speech Recognition (ASR) technology. An (ideal) ASR-based CAPT system can be described as a sequence of three phases:

1) Speech recognition: the first and most important phase, because the subsequent phases depend on its accuracy. In interactive dialogues with multiple-choice answers, the correct answer should be recognized by the system and all other answers should be discarded. Furthermore, the ASR time-aligns the spoken signal with phone labels; phases 2) and 3) are based on this time-alignment.

2) Scoring and error detection/diagnosis: the system evaluates the pronunciation quality and can give a global score. Pronunciation errors are located and the type of error is determined for phase 3).

3) Feedback: with the diagnosis of the pronunciation error, appropriate feedback can be given that meets the pedagogical requirements.

Ideally, such a CAPT system should mimic the tasks of a human teacher and give the same judgements about the student's pronunciation as a human teacher would. CALL and CAPT systems are therefore usually evaluated by how well judgements from the machine agree or correlate with human judgements (human-machine correlations) of the same speech material.
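To make this three-phase structure concrete, a minimal sketch is given below. The function names and data types are illustrative assumptions only and do not correspond to any particular existing CAPT system.

```python
# Minimal sketch of the three CAPT phases; names and types are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class AlignedPhone:
    label: str    # phone label in SAMPA notation
    start: float  # start time (s) from the ASR time-alignment
    end: float    # end time (s)

def recognize_and_align(audio_path: str, prompt: str) -> List[AlignedPhone]:
    """Phase 1: recognize the utterance and time-align it with phone labels."""
    raise NotImplementedError("wrap an ASR / forced-alignment toolkit here")

def score_phone(phone: AlignedPhone, audio_path: str) -> float:
    """Phase 2: score the pronunciation quality of one aligned phone (0 = bad, 1 = good)."""
    raise NotImplementedError("confidence-based or acoustic-phonetic scoring goes here")

def feedback(phones: List[AlignedPhone], scores: List[float], threshold: float = 0.5) -> List[str]:
    """Phase 3: turn low-scoring phones into feedback messages for the learner."""
    return [f"Listen again to /{p.label}/ ({p.start:.2f}-{p.end:.2f} s)"
            for p, score in zip(phones, scores) if score < threshold]
```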

1.2 The aim of the present study

The focus of this study is on automatic pronunciation error detection (phase 2 in the scheme above) in the speech of foreign language learners. In our case, the foreign language is Dutch, learned by second language (L2) learners. Automatic pronunciation error detection techniques usually involve measures that are obtained by means of automatic speech recognition technology, generalized under the term confidence scores (see section 2.5), which in some way represent how certain the system is that signal X belongs to a pattern Y: a low confidence score may indicate poor pronunciation quality. These measures have the advantage that they can be obtained fairly easily, and that they can be calculated in similar ways for all speech sounds. However, ASR confidence measures also have the disadvantage that they are not very accurate: the average human-machine correlations they yield are rather low, and, consequently, their predictive power for pronunciation quality is also rather low (see e.g. Kim et al., 1997). This lack of accuracy might be related to the fact that confidence scores are computed in the same way for all speech sounds, without focusing on the specific acoustic-phonetic features of individual sounds. These disadvantages of methods based on confidence measures have led to the present study, in which we investigate an alternative approach intended to yield higher detection accuracy. In this study, we present an acoustic-phonetic approach for the detection of pronunciation errors at phone level. The goal of this study is formulated as:

Goal: to develop automatic acoustic-phonetic-based classification techniques for automatic pronunciation error detection in speech of L2 learners of Dutch.

Related to this goal is the question of how well these automatic classification techniques perform in detecting pronunciation errors at phone level. This is the main question addressed in this study and is formulated as the thesis question:

Thesis question: How effective are automatic acoustic-phonetic-based classification techniques in detecting pronunciation errors of L2 learners of Dutch?

In this context, effective means that ideally, the techniques should be able to detect pronunciation errors just as humans do: machine judgements should resemble human judgements. For this purpose, a non-native speech database of Dutch was annotated by human listeners on pronunciation errors. The non-native speech used in this study was checked and annotated on pronunciation

errors so that these human annotations (judgements) could be compared to machine judgements. The acoustic-phonetic approach (chapter 3) enables us to be more specific in developing pronunciation error detection techniques. First, we selected three pronunciation errors by carrying out a survey of an annotated non-native speech database. We found that the following three speech sounds were often mispronounced by non-native speakers and decided to address these three pronunciation errors in this study:

/A/ mispronounced as /a:/
/Y/ mispronounced as /u/ or /y/
/x/ mispronounced as /k/ or /g/

For each pronunciation error, the acoustic differences between a correctly pronounced phone and an incorrectly pronounced phone are examined, and these acoustic differences, translated into acoustic-phonetic features, are used to develop a pronunciation error detector. Classification experiments and statistical analyses can show which specific features are most reliable for the detection of a particular pronunciation error (Q1). Another issue examined in this study is the use of native or non-native speech as training material: it is still not clear whether a detector should be trained on native or on non-native speech material to achieve the highest detection accuracy without degrading the performance for native speakers (Q2). For the pronunciation error of /x/, we will examine two different methods: one that uses Linear Discriminant Analysis and one that uses a decision tree to classify a sound as either correct or incorrect. Is there a preference for one method over the other (Q3)? Thus, in addition to the main question, three other questions are posed that are related to this acoustic-phonetic approach and the thesis question:

Q1. What are reliable discriminative acoustic-phonetic features of phonemes for pronunciation errors of /A/, /Y/ and /x/?

Q2. How do detectors trained under different conditions (on native or on non-native speech) cope with non-native speech?

Q3. What are the advantages of a Linear Discriminant Analysis (LDA) method as opposed to a decision tree-based method for automatic pronunciation error detection?
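As a preview of the kind of classifier developed in the following chapters, the sketch below trains a Linear Discriminant Analysis model on a few hand-picked acoustic-phonetic features (duration and the first two formants) and uses it to flag a realization of a target /A/ that is closer to /a:/. The feature values, the feature set and the use of scikit-learn are illustrative assumptions; the features that are actually used are motivated and selected in chapters 4 and 5.

```python
# Minimal sketch of an acoustic-phonetic detector for /A/ vs /a:/ using LDA.
# Feature values below are invented for illustration only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Each row: [duration (s), F1 (Hz), F2 (Hz)] measured for one vowel realization.
X_train = np.array([
    [0.08, 750, 1200],   # tokens labeled /A/ (short)
    [0.09, 740, 1150],
    [0.16, 800, 1300],   # tokens labeled /a:/ (long, different quality)
    [0.18, 820, 1350],
])
y_train = np.array(["A", "A", "a:", "a:"])

detector = LinearDiscriminantAnalysis()
detector.fit(X_train, y_train)

# A learner's realization of a target /A/: if it is classified as /a:/,
# the token is flagged as a pronunciation error.
token = np.array([[0.17, 810, 1340]])
predicted = detector.predict(token)[0]
print("error detected" if predicted != "A" else "accepted as /A/")
```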

The following chapters describe how the goal of this study is achieved and how we try to find the answers to the questions posed above.

1.3 Structure of the thesis

Chapter 2 reports on a small literature study on automatic pronunciation error detection. First, we examine why L2 learners produce pronunciation errors (section 2.2) and show some examples of these errors (section 2.3). A description of possible goals of pronunciation teaching is given in section 2.4. Finally, overviews of different kinds of automatic pronunciation error detection techniques are given in sections 2.5 and 2.6. Chapter 3 describes the approach adopted in this study (section 3.2). Part of this approach is the selection procedure for the pronunciation errors addressed in this study (section 3.3). Chapter 4 gives a description of the material and the different classification algorithms that were used in this study. The speech material is described in section 4.2 and two different classification algorithms are described in section 4.3. Chapter 5 reports on the development of the pronunciation error detectors for errors of /A/ and /Y/. First, the acoustic characteristics of the relevant sounds are examined (section 5.2) to determine potential discriminative features. A description of the procedure for acoustic feature extraction is given in section 5.3, and in section 5.4 the results of the classification experiments are shown. Finally, a discussion of the results is given in section 5.5. Chapter 6 reports on the development of the pronunciation error detector for errors of /x/. The acoustic properties of this pronunciation error are examined in section 6.2. Two classification methods for this error are introduced in section 6.3. Finally, in section 6.4 the results of the classification experiments are shown; they are discussed in section 6.5. Chapter 7 gives a summary of the results. Summaries of the classification experiments for /A/ vs /a:/, /Y/ vs /u,y/ and /x/ vs /k,g/ are given in sections 7.2, 7.3 and 7.4 respectively. Finally, in section 7.5, we try to answer the questions posed at the beginning of this thesis and give some suggestions for further research.

Some practical remarks: throughout this work, phonetic symbols will be used in SAMPA notation; for a list of phonetic

symbols in IPA and SAMPA notation, see appendix B. A list of abbreviations used throughout this work is given in appendix A.

Chapter 2
Automatic detection of pronunciation errors: a small literature study

2.1 Introduction

This chapter reports on a small literature study and consists of two parts: before we give an overview of automatic pronunciation error detection techniques (sections 2.5 and 2.6), we give some background information on pronunciation errors. Why L2 learners produce pronunciation errors and what kinds of errors they make is described in sections 2.2 and 2.3. Section 2.4 describes some possible goals in pronunciation teaching. The second part of this chapter then gives an overview of automatic pronunciation error detection techniques (sections 2.5 and 2.6).

2.2 Why do L2 learners produce pronunciation errors?

Pronunciation errors exist because L2 sounds are not correctly produced by the L2 learner. How, then, are L2 sounds learned by L2 learners, or, to put it differently, why are L2 sounds not properly learned? Various studies that have investigated this issue have also paid attention to the relationship between the production and the perception of L2 sounds. The main question seems to be whether production precedes perception or perception precedes production in the process of acquiring an L2 (Llisterri, 1995). In other words, is an L2 learner able to produce an L2 sound accurately if the same sound is not correctly perceived? This relationship between production and perception

of L2 sounds is related to factors such as the age of learning and knowledge of the L2. Some researchers have proposed that neurological maturation might lead to a diminished ability to add or modify the sensorimotor programs that configure the movements of the articulators for producing sounds in an L2 (e.g. McLaughlin, 1977). Many researchers believe that once a certain age has passed, new speech sounds can no longer be learned perfectly. For instance, it was found that the later non-native speakers began learning English, the more strongly foreign-accented their English sentences were judged to be (Flege, Munro and MacKay, 1995). The existence of this so-called critical period is often explained by neurological maturation. Knowledge of the L2 may also affect the relationship between production and perception. Bohn & Flege (1990) investigated this factor by examining the production and perception of the English vowels /e:/ and /{/ (IPA /æ/) in two groups of German learners of English: an experienced group and an inexperienced group. The results showed clear differences between the two groups of speakers: the inexperienced group did not produce the contrast between the two vowels, but was able to differentiate them in a labeling task and thus was able to perceive them correctly; the experienced group did produce the contrast and performed better in the labeling task. Furthermore, they found that the two groups relied on different acoustic cues in the labeling task. They concluded that perception may lead production in the early stages of L2 speech learning and that production might be improved by experience. Evidence supporting the view that production precedes perception can be found in Borrell (1990), Neufeld (1988) and Briere (1966), who pointed out that it is common, when learning an L2, that not all sounds that are correctly perceived will be correctly pronounced. Furthermore, an experiment carried out by Sheldon & Strange (1982) with Japanese speakers of English showed that the production of the English contrast between /r/ and /l/ was more accurate than the perception of it. The view that perception precedes production is supported by evidence from many more studies, which seems to imply that, in general, perception does precede production, at least for vowels. Already in 1939, Trubetzkoy proposed that bilinguals tend to perceive L2 sounds through their own L1 phonology, which may lead to wrong productions or accentedness of L2 sounds. Borden, Gerber & Milsark (1983) examined the relationship between the perception and production of English /l/ and /r/ in Korean learners of English. They found that perceptual judgments of /r/ and /l/ improved

before production and that self-perception develops earlier and may be a prerequisite for accurate production. Flege (1993) examined vowel duration as a cue to voicing in English words produced and perceived by Chinese speakers of English. The study revealed correlations between differences in perceived vowel duration and degree of foreign accent, and Flege (1993) concluded that [...] nonnatives will resemble native speakers more closely in perceiving than in producing vowel duration differences [...]. Numerous studies have shown that listeners have difficulty perceiving and making phonetic distinctions that do not exist in their native language. A common view in the 1970s was that interference from the L1 is the primary phonological cause of non-native productions: 1) an L2 sound that is identified with a sound in the L1 will be replaced by the L1 sound; 2) contrasts between sounds in the L2 that do not exist in the L1 will not be honored; 3) contrasts in the L1 that are not found in the L2 may nevertheless be produced in the L2 (e.g. Weinreich, 1953; Lehiste, 1988). Two more recent working models that focus on phonological contrasts in the L1 and L2 and support the perception-based view of L2 learning are Flege's Speech Learning Model (SLM) and Best's Perceptual Assimilation Model (PAM). SLM (Flege, 1995) claims that [...] without accurate perceptual targets to guide the sensorimotor learning of L2 sounds, production of the L2 sounds will be inaccurate [...]. The model assumes that the phonetic systems used in the production and perception of vowels and consonants remain adaptive over the life span and that new phonetic categories are added, or old ones are modified, in the phonetic systems when L2 sounds are encountered. It hypothesizes that many (but not all!) L2 production errors have a perceptual basis. Learners perceptually relate positional allophones in the L2 to the closest positionally defined allophone in the L1 in acoustic-phonetic terms, such as the F1/F2 formant space for vowels. L2 learners can establish a new phonetic category for an L2 sound that differs from the closest L1 sound: the greater the perceived distance of an L2 sound from the closest L1 sound, the more likely it is that a new phonetic category will be established. According to PAM (Best, 1995), non-native sounds are perceptually assimilated to native phonetic categories according to their articulatory-phonetic (gestural) similarity to native gestural constellations (Browman & Goldstein, 1989), where gestures are defined by the articulators, the place

of articulation and manner of articulation. The model states that non-native speech perception is strongly affected by the listener's linguistic experience with phonological contrasts and that listeners perceptually assimilate non-native phones to native phones whenever possible. In PAM, a given non-native phone may be perceptually assimilated to the native system of phonemes in one of three ways:

- as a Categorized exemplar of some native phoneme: if two contrasting phones are both assimilated as good exemplars of a single native phoneme, perceptual differentiation is difficult; if the contrasting phones differ in their goodness of fit to that single native phoneme, perceptual differentiation is somewhat easier;

- as an Uncategorized sound that falls somewhere in between native phonemes: the non-native phone is roughly similar to two or more phonemes, and perceptual differentiation is easy;

- as a Nonassimilable nonspeech sound that bears no detectable similarity to any native phoneme: the non-native phone will be perceptually differentiated on the basis of its auditory or phonetic characteristics.

The main difference between the two models is that SLM places the emphasis on an acoustic-phonetic specification of phonetic similarity whereas PAM assumes an articulatory specification of phonetic similarity.

2.3 What kind of pronunciation errors do L2-learners make?

A distinction can be drawn between pronunciation errors that are made on a segmental level and errors that are made on a suprasegmental level. On a segmental level, errors may concern vowel and consonant quality and may be explained by differences between language systems. An example of a segmental pronunciation error is the pronunciation of the Dutch /i/ in "vies" and /I/ in "vis": Japanese and Italian L2 learners of Dutch do not know the difference between /i/ and /I/ because this distinction does not exist in their L1. The same applies to the mispronunciation of /A/ as /a:/: length is a distinctive feature in Dutch, whereas in e.g. Italian this distinctive feature does not exist.

Another example of a segmental pronunciation error is the mispronunciation of /x/, a very common error in Dutch, which again might be due to the fact that /x/ is not encountered in many other languages. Pronunciation errors may be mispronunciations of L2 sounds, also called substitutions because an L2 sound is substituted with another sound, but deletions and insertions of sounds in the L2 also occur. The two latter pronunciation errors may be due to differences in syllable structure between L1 and L2. Japanese and Arabic do not allow branching onsets or codas, so an L2 word may be modified so that it fits the L1 syllable structure, which results in vowel epenthesis (see fig. 2.1 and fig. 2.2; both examples were taken from O'Grady et al., 1996). In Turkish, a word cannot begin with two consonants, and Spanish does not allow a word-initial /s/ followed by a sequence of consonants (O'Grady et al., 1996).

Figure 2.1: English target word with its syllable structure.

Figure 2.2: Non-native speaker's version of the English target word.

In addition to having deviant intonational contours and deviant lexical stress patterns, L2 learners tend to have lower speech rates and a higher number of disfluencies such as stops, repetitions and pauses, which result in lower fluency (suprasegmental errors). An example of a suprasegmental pronunciation error is incorrect stress placement. L2 learners have to acquire the stress patterns of the language they are trying to learn, which is difficult because the stress patterns of the L1 interfere. Consider Polish, in which word-level stress is assigned to the penultimate (next-to-last) syllable, regardless of syllable weight, whereas in English stress can also fall on the antepenultimate (third-from-last) syllable, depending on the weight of the syllable. The tendency of Polish speakers to place stress on the penultimate syllable regardless of syllable weight is a common pronunciation error in English (see table 2.1).

English target:   as'tonish   main'tain   'cabinet
Non-native form:  as'tonish   'maintain   ca'binet

Table 2.1: Example of a non-native stress pattern in which the next-to-last syllable is always stressed (this example was taken from O'Grady et al., 1996).

Researchers have examined the spectral differences between native and non-native speech and found that one of the largest differences between these two types of speech lies in the patterns of the second and higher formants (Arslan, 1996; Flege, 1987). This finding can be explained by Fant (1960), who showed that small changes in the configuration of the tongue position can lead to large shifts in the frequency location of F2 and F3, while the frequency location of F1 only changes if the overall shape of the vocal tract changes. To improve the intelligibility of L2 learners and the methods of pronunciation teaching, researchers have tried to establish pronunciation error gravity hierarchies, so that priority can be given to those errors that have the most damaging effect on the intelligibility of speech (e.g. Van Heuven et al., 1981; Anderson-Hsieh et al., 1992; Derwing & Munro, 1997). Although this issue has not yet been settled, it appears that both segmental and suprasegmental aspects play important roles. Both aspects can be measured separately, but they do influence each other, as the case of lexical stress illustrates. A stressed syllable is usually characterized by a clearer pronunciation (which may cause spectral differences, segmental), a higher amplitude (segmental), a higher pitch (suprasegmental) and a longer duration (suprasegmental).

2.4 Possible goals in pronunciation teaching

Studies have shown that foreign accents may have negative consequences for non-native speakers. Listeners detect divergences between the phonetic norms of their L1 and those of the non-native speaker, and may for instance misjudge the non-native speaker's affective state (e.g. Holden & Hogan, 1993). Although several studies have shown that a general bias against foreign accentedness in speech exists and that native listeners tend to downgrade non-native speakers because of their foreign accent, these observations do not directly mean that language teachers should aim at teaching accent-free speech. Abercrombie (1956) argued that most language learners need no more than a comfortably intelligible pronunciation. Witt (1999) agrees with Abercrombie and

defined comfortable intelligibility as [...] a level of pronunciation quality, where words are correctly pronounced to their phonetic transcription, but there are still subtle differences in how these phonemes sound in comparison with native speakers [...] the speech of comfortably-intelligible non-native speakers might differ from native speakers with regard to intonation and rhythm, but on overall their speech is understandable without requiring too much effort from a listener [...]. Comfortable intelligibility seems to be a widely accepted goal in pronunciation teaching. Munro & Derwing (1995) describe intelligibility as [...] the extent to which a speaker's message is actually understood by a listener, but there is no universally accepted way of assessing it [...]. The goal of Munro & Derwing's study was to examine the interrelationships among accentedness, perceived comprehensibility and intelligibility in the speech of second language learners. Foreign accent and intelligibility are related, but it is still not clear how foreign accent affects intelligibility. The most important finding of their research is that [...] although strength of foreign accent is indeed correlated with comprehensibility and intelligibility, a strong foreign accent does not necessarily cause second language speech to be low in comprehensibility or intelligibility [...]. Thus their study suggests that existing programs and second language instructors aiming at foreign accent reduction or accent-free speech do not necessarily improve the intelligibility of a second language learner's speech. In the present study, we aim at teaching intelligible speech rather than accent-free speech (see also section 3.3.1). We agree with Abercrombie's view that most language learners do not need more than comfortable intelligibility.

2.5 Overview of automatic pronunciation error detection techniques in the literature

2.5.1 Overview of ASR-based techniques

In this section, the focus is on different techniques for the automatic detection of pronunciation errors that have already been examined and described in the literature. These techniques should be built in such a way that they match the judgments of human listeners as closely as possible: in order to be valid, automatic pronunciation error detection techniques, or machine scores, should correlate with scores or judgments given by humans. Measures that seem to correlate well with

human judgments are temporal measures (which are acoustic measures); they are strongly correlated with human ratings of pronunciation and fluency (Cucchiarini et al., 2000; Neumeyer et al., 2000). Cucchiarini et al. (2000) showed that expert fluency ratings can be predicted on the basis of automatically calculated temporal measures such as rate of speech or articulation rate (timing scores). Another finding was that temporal measures for native and non-native speakers differed significantly, indicating that native speakers are more fluent than non-native speakers and that non-natives normally speak more slowly than natives. Fluency is often used in tests to evaluate non-native speakers' pronunciation. Consequently, other temporal measures that are related to rate of speech or articulation rate, such as duration scores (relative phone durations) and timing scores (rhythm), also correlate well with human listeners' judgments (see also Neumeyer et al., 2000). Thus the above-mentioned temporal (acoustic) measures all function as good predictors of pronunciation quality because they correlate strongly with human judgments. Therefore, in principle, machine scores based on temporal measures can suffice for good native and non-native pronunciation assessment, but not for pronunciation training in a CALL application. With temporal measures alone, feedback can only be given on temporal aspects of non-native pronunciation, and telling the student to speak faster or to make fewer pauses does not help the student much to improve his or her pronunciation. Therefore, temporal measures should be supplemented with other measures that are able to evaluate segmental or other suprasegmental aspects of non-native speech. Such measures and techniques have been developed by researchers to detect segmental pronunciation errors and to evaluate non-native pronunciation quality using parameters from the ASR system. Nowadays, many CALL applications use measures that are generalized under the term confidence measures: these ASR confidence measures represent in some way how confident the ASR system is that a given signal X belongs to pattern Y. ASR confidence measures have the advantage that they can be obtained fairly easily, and that they can be calculated in similar ways for all speech sounds. These measures, which are based on spectral match, can be combined with temporal measures to compute a combined score and thereby increase the human-machine correlation.
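To illustrate the temporal measures discussed above, the sketch below computes rate of speech and articulation rate from a time-aligned phone segmentation. It is a minimal example under stated assumptions (segmentation format, "sil" as the pause label); the exact definitions and pause handling used by Cucchiarini et al. (2000) may differ.

```python
# Minimal sketch of two temporal measures computed from a phone-level segmentation.
# A segmentation is a list of (label, start, end) tuples; "sil" marks pauses.
from typing import List, Tuple

Segment = Tuple[str, float, float]

def rate_of_speech(segments: List[Segment]) -> float:
    """Phones per second of total speaking time, pauses included."""
    phones = [s for s in segments if s[0] != "sil"]
    total_time = segments[-1][2] - segments[0][1]
    return len(phones) / total_time

def articulation_rate(segments: List[Segment]) -> float:
    """Phones per second of speech, pauses excluded."""
    phones = [s for s in segments if s[0] != "sil"]
    speech_time = sum(end - start for _, start, end in phones)
    return len(phones) / speech_time

segmentation = [("sil", 0.0, 0.3), ("h", 0.3, 0.38), ("A", 0.38, 0.46),
                ("l", 0.46, 0.55), ("o:", 0.55, 0.75), ("sil", 0.75, 1.2)]
print(rate_of_speech(segmentation), articulation_rate(segmentation))
```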

Because a good deal of statistics is involved in these scores and methods, I will first briefly explain some statistical terms. The recognition problem in ASR can be reduced to the following statistical problem: given a set of measurements, vector X, what is the probability that it belongs to a word sequence W? In other words, compute P(W|X). The posterior probability P(W|X) cannot be computed directly: it can only be estimated after the data has been seen (hence the term posterior). Therefore Bayes' rule is used to estimate the posterior probability:

P(W|X) = P(X|W) P(W) / P(X)

In the above formula, P(X|W) represents the probability density function: given a word sequence W, what is the probability of observing vector X for that word sequence? This is often called the data likelihood. P(W) is the probability that the word sequence W was uttered: this represents the language model, which is independent of the observation vectors and is based on prior knowledge. P(X) is a fixed probability: the average probability that the vector X was observed. These statistical measures, the likelihoods and posterior probabilities derived from the formula just presented, are used to supplement duration scores and timing scores in scoring the pronunciation quality of non-native speech. Log-likelihood is assumed to be a good measure of the similarity between native and non-native speech; therefore Neumeyer et al. (1996) compared log-likelihood scores to segment duration scores (relative phone duration normalized by rate of speech) and timing scores (speaking rate, rhythm) by computing correlations between machine and human scores at sentence and speaker level. The correlations in Neumeyer et al. (1996) showed that HMM-based log-likelihoods are poor predictors of pronunciation ratings. The timing scores resulted in acceptable speaker-level correlations, but normalized segment duration scores produced the best results. So the duration-based scores outperformed the HMM-based log-likelihood scores. This study was extended in Franco et al. (1997) by examining other HMM-based scores, namely average phone segment posterior probabilities, and comparing them to log-likelihood and duration scores. This time, the HMM-based posterior probabilities produced higher human-machine correlations than the log-likelihood and duration scores. The two previous approaches (Neumeyer et al., 1996; Franco et al., 1997) focused on rating an entire sentence rather than targeting specific phone segments. Kim et al. (1997) extended this work by assessing the pronunciation quality of individual phone segments within a sentence. The probabilistic measures given in Franco et al. (1997) were compared to each other and again the

score based on posterior probability was the best at phone and sentence level. Duration scores that previously showed high human-machine correlations (Neumeyer et al., 1996) now turned out to be poor measures at phone level. However, the results of the duration scores improved, and improved most strongly, when the amount of training data was increased. This is not surprising, since it is generally known that adding more training data can improve performance. Human-machine correlations at phone level were always lower than correlations at sentence level, so rating a single phone by machine is still problematic. The techniques presented in this study aim at evaluating a single phone; by adopting the approach presented in chapter 3 we hope to achieve higher human-machine agreement at segment level. Another ASR-based method that focuses on rating a phone rather than a word or sentence is Witt & Young's Goodness of Pronunciation (GOP) method (Witt & Young, 2000). Their GOP score is primarily based on the posterior probability of an uttered phoneme; the GOP score is compared to a predetermined threshold to decide whether the phoneme was pronounced correctly or not (a simplified sketch of such a score is given at the end of this section). Thus posterior probabilities and temporal measures individually produced good results at sentence level. Therefore, combining these scores might result in even higher human-machine correlations. A combination of such scores was examined in several studies (Franco et al., 1997; Franco et al., 2000), which indeed showed that a combination of scores in almost every case produced higher human-machine correlations than a single posterior probability score. Linear and nonlinear regression methods, which were used to predict the human grade from a set of machine scores, were investigated, and it appeared that a nonlinear combination of machine scores produced better results than a linear combination. In the best case, an increase of 11% in correlation was obtained by using nonlinear regression with a neural network combining posterior, duration and timing scores (Franco et al., 2000). Thus these studies have shown that optimal confidence scores can be combined to achieve higher human-machine correlations at sentence or speaker level.
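As an illustration of the phone-level scoring discussed in this section, the sketch below computes a simplified GOP-style score from per-phone acoustic log-likelihoods, assuming equal phone priors. It is a schematic rendering of the idea in Witt & Young (2000), not their exact formulation, and all numbers are hypothetical.

```python
# Simplified GOP-style score: the log-likelihood of the target phone model is compared
# with that of the best-fitting phone model and normalized by the segment length.
from typing import Dict

def gop_score(loglik: Dict[str, float], target: str, n_frames: int) -> float:
    """Approximates |log P(target | O)| per frame, assuming equal phone priors."""
    best_competitor = max(loglik.values())
    return abs(loglik[target] - best_competitor) / n_frames

def is_mispronounced(loglik: Dict[str, float], target: str, n_frames: int,
                     threshold: float = 1.5) -> bool:
    """The phone is flagged when its GOP score exceeds a predetermined threshold."""
    return gop_score(loglik, target, n_frames) > threshold

# Segment whose target phone is /x/, scored against three phone models:
logliks = {"x": -360.0, "k": -290.0, "g": -305.0}
print(is_mispronounced(logliks, "x", n_frames=40))  # True: /k/ fits much better than /x/
```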

2.5.2 Adding extra knowledge to acoustic models and ASR-based techniques

The measures described above were all obtained from HMM models trained on native speech only. Several methods have been introduced in which confidence scores are obtained from adapted acoustic models trained on both native and non-native speech. Furthermore, different methods have been proposed to integrate knowledge about the expected set of mispronunciations into the phone models or pronunciation networks. HMM models trained with native speech data only can be expanded to form a network with alternative pronunciations, in which models trained on native and on non-native speech are used. In the MisPronunciation (MP) network of Ronen et al. (1997), each phone can optionally be pronounced as a native or as a non-native sound. This network is then searched using the Viterbi algorithm. To evaluate the overall pronunciation quality, a mispronunciation score can be computed as the ratio of the number of non-native phones to the total number of phones in the sentence. The human-machine correlations obtained with the new MP models were almost equal to those of the previous native models. Similarly to Ronen et al. (1997), Franco et al. (1999) used two different acoustic models for each phone, one trained on acceptable, native speech and another trained on incorrect, strongly non-native speech, to detect mispronunciations at phone level. For each phone, a log-likelihood ratio score was computed using the correct and incorrect pronunciation models and compared to a posterior probability score (we have seen that posterior scores correlate well with human scores; Franco et al., 1997; Kim et al., 1997) computed from models based only on native speech. The results showed that the method using both native and non-native models, i.e. the log-likelihood ratio score, had higher human-machine correlations than the method using only native models, i.e. the posterior score. Deroo et al. (2000) also used correct (native-like) and incorrect (strongly non-native-like) speech to train the acoustic models, but this time using a hybrid system combining HMMs and ANNs (Artificial Neural Networks) to detect mispronunciations at phone level. Unfortunately, their phoneme models trained with native or non-native speech were very similar to each other, so the system was not able to discriminate between wrong and right pronunciations. A second approach produced better results. This time, knowledge about expected mispronunciations was used: phoneme graphs were built taking all wrong pronunciations of each phoneme into account. A disadvantage of this approach is that it requires knowing in advance all the mistakes that non-native speakers can make.
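The two-model idea of Ronen et al. (1997) and Franco et al. (1999) can be sketched as follows: each phone segment is scored with a log-likelihood ratio between a "correct" (native) model and an "incorrect" (strongly non-native) model, and the sentence-level mispronunciation score is the proportion of phones for which the non-native model fits better. The numbers are hypothetical and the sketch ignores the Viterbi search over the pronunciation network.

```python
# Minimal sketch of log-likelihood ratio scoring with native and non-native phone models.
from typing import List, Tuple

def llr(loglik_native: float, loglik_nonnative: float) -> float:
    """Log-likelihood ratio; negative values favour the non-native model."""
    return loglik_native - loglik_nonnative

def mispronunciation_score(phone_logliks: List[Tuple[float, float]]) -> float:
    """Fraction of phones for which the non-native model fits better."""
    flags = [llr(nat, non) < 0.0 for nat, non in phone_logliks]
    return sum(flags) / len(flags)

# (native log-likelihood, non-native log-likelihood) per phone in one sentence:
sentence = [(-210.0, -230.0), (-180.0, -150.0), (-305.0, -300.0), (-120.0, -140.0)]
print(mispronunciation_score(sentence))  # 0.5: two of the four phones are flagged
```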

2.6 Automatic pronunciation error detection techniques employed in real-life applications

2.6.1 Employing ASR-based techniques in real-life CALL applications

Some of the methods and scores discussed in the sections above are applied in real-life CALL systems, such as the SRI EduSpeak system (Franco et al., 2000), the ISLE system (Menzel et al., 2000) and the PLASER system (Mak et al., 2003). The EduSpeak toolkit uses acoustic models trained with Bayesian adaptation techniques that optimally combine native and non-native training data, so that both types of speakers can be handled with the same models with good recognition performance. In this way, an improvement in recognition for the non-native speakers was achieved without degrading the recognition performance for the native speakers. The score used in this system is a combination of previously discussed machine scores: the logarithm of the posterior probability, phone duration and speech rate. In the ISLE system, which focuses on Italian and German learners of English, the development of the pronunciation training is divided into two components: automatic localization of pronunciation errors and correction of pronunciation errors (Menzel et al., 2000). Localization of pronunciation errors is done by identifying the areas of an utterance that are likely to contain pronunciation errors. Only the most severe errors are selected by the error localization component, which assigns confidence scores to each speech segment. A speech segment with a low confidence score represents a mispronounced segment. These scores are based on probabilistic measures such as the acoustic likelihood of the recognized path. After localizing areas that are likely to contain errors, specific pronunciation errors are detected and diagnosed for correction. Pronunciation errors that a student might make are predicted by rules that describe how a pronunciation is altered. This results in a set of alternative pronunciations for each entry in the dictionary; the alternatives of course include the correct pronunciation. Again, all the mistakes that could be made by non-native speakers have to be known in advance. Unfortunately, the system performed poorly at finding and explaining pronunciation errors (Menzel et al., 2000). The PLASER system (Mak et al., 2003), designed to teach English pronunciation to speakers of Cantonese Chinese, computes a confidence-based score for each phoneme of a given word. An English corpus and a Cantonese corpus were both used to develop Cantonese-accented English

phoneme HMMs. To assess the pronunciation accuracy of a phoneme, the Goodness of Pronunciation (GOP) measure is used. Evaluation of the system showed that the pronunciation accuracy of about 75% of the students improved after using the system for a period of 2-3 months.

2.6.2 Using acoustic-phonetic information in real-life CALL applications

The acoustic-phonetic approach, which is the approach adopted in this study, is not frequently used as a technique to detect pronunciation errors. Most of the existing methods use scores such as those described above to evaluate non-native speech. Some projects or systems that adopt approaches resembling the acoustic-phonetic approach use raw acoustic data to provide feedback by displaying waveforms, spectrograms, energy or intonation contours. However, a substantial difference with our acoustic-phonetic approach is that, in those methods, no actual assessment based on acoustic-phonetic data is carried out. The VICK system (Nouza, 1998) displays user-friendly visual patterns formed from the student's speech (single words and short phrases) and compares them to reference utterances. Different types of parameters of the same signal are available for visualization, e.g. the time waveform, the spectrogram, the energy or F0 contours, vowel plots, diagrams or phonetic labels. Feedback on the student's pronunciation is given by showing and pointing out deviations in a difference panel that indicates the parts of speech with major differences between the trainee's attempt and the references. The VICK system uses two classifiers for the automatic evaluation of speech: primarily, a DTW (Dynamic Time Warping) classifier is used (Nouza, 1998). The distance between the utterance and the reference is evaluated for the whole set of features or for a specific feature subset such as log energy or F0. The evaluation is based on means and variances computed from the scores achieved with the reference speakers.

Figure 2.3: An example of the VICK screen (from Nouza, 1998).

In the SPELL project (Hiller et al., 1994), different modules teaching consonants, vowel quality, rhythm and intonation are characterized by an acoustic similarity metric used to evaluate the pronunciation of a student. For instance, for the rhythm module, duration and vowel quality are used as acoustic parameters. The vowel teaching module uses a set of acoustically based vowel targets which are derived from a set of vowel tokens produced by a group of native speakers. First, a student's vowel token is analyzed to produce estimates of the formants and pitch. After a normalization procedure, these acoustic parameters are then used to provide feedback in a graphical

display for the student. In the display, an elliptic vowel target for the vowel and the position of the user's attempt are shown. The vowel similarity metric decides whether the user's vowel token falls within this target vowel space. The consonant module uses a rather different analysis. A list of pronunciation errors in consonant production by non-native speakers of English was first made and ranked according to their expected effect on intelligibility. Substitutions were among the most frequent consonantal errors. These errors are detected in SPELL by using a simplified speech recognition technique. Each utterance has a specified phonetic sequence containing the desired sequence of segments and the likely substitutions (errors) which the student might make. The errors produced by the student are then detected from the choices the speech recognizer made in recognizing the utterance.

Figure 2.4: Examples of the SPELL screen (from Hiller et al., 1994).

WinPitch LTL (Germain-Rutherford & Martin, 2000) is another system that provides feedback by visualizing acoustic data. Learners can visualize the pitch curve, the intensity curve and the waveform of their own recorded speech. A useful feature of this system is speech synthesis: for instance, students can hear the correct prosodic contours produced with their own voice, and comparisons of prosodic patterns can be made between the student's recorded and synthesized segments. The system offers other user-friendly functions as well, such as many edit functions to facilitate the learning process. A major disadvantage of this system, however, is that it does not include ASR: no automatic check of the contents of the student's utterance is available. Therefore, a teacher is required to do this (e.g. produce the phonetic transcription of the utterance)

and to explain to the students the meaning of the various acoustic analyses. A general concern with CAPT systems that visualize acoustic data to give feedback to language learners is that some training in reading and understanding the displays is required beforehand and that, in some cases, a teacher is required. Furthermore, matching visual displays is not always recommended; for instance, it is known that matching acoustic waveforms is not very helpful. Consequently, visualizing acoustic data can be tricky, and this kind of data should therefore be used with care. Although these applications use acoustic information, actual assessment of pronunciation based on acoustic information is not carried out. The acoustic-phonetic approach adopted in this study, described in the next chapter (chapter 3), will use specific acoustic-phonetic information to evaluate non-native pronunciation.
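Finally, as an illustration of the kind of vowel-target check used in systems such as SPELL, the sketch below accepts or rejects a vowel token on the basis of its Mahalanobis distance to a cloud of native F1/F2 measurements. The native data, the two-dimensional feature space and the distance threshold are assumptions made for illustration only; they reproduce neither SPELL's actual metric nor the detectors developed later in this thesis.

```python
# Sketch of an elliptic vowel-target check in the F1/F2 plane.
import numpy as np

native_tokens = np.array([      # F1, F2 (Hz) of native realizations of one vowel
    [740, 1180], [760, 1220], [720, 1150], [780, 1250], [750, 1200],
])
mean = native_tokens.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(native_tokens, rowvar=False))

def inside_target(f1: float, f2: float, max_distance: float = 3.0) -> bool:
    """Mahalanobis distance to the native mean defines the elliptic target region."""
    diff = np.array([f1, f2]) - mean
    distance = float(np.sqrt(diff @ cov_inv @ diff))
    return distance <= max_distance

print(inside_target(755, 1210))   # close to the native cloud: accepted
print(inside_target(850, 1450))   # far outside the ellipse: rejected
```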