Improvements in Tone Pronunciation Scoring for Strongly Accented Mandarin Speech ¹

Fuping Pan, Qingwei Zhao, Yonghong Yan
ThinkIT Laboratory, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080
{fpan, qzhao, yyan}@hccl.ioa.ac.cn

Abstract. This paper discusses a tone pronunciation scoring system for Mandarin. It recognizes the tones of syllables with GMM models and uses the recognition results for tone assessment. Initially, the experimental results were poor on strongly accented speech, for two reasons: inaccurate force-alignment leads to incomplete F0 contours, and the F0 contours of the accented speech follow special patterns. We propose several measures to address these problems. The first is to make the extraction of the F0 contour independent of the force-alignment. The second is to base the scoring on GMM posterior probabilities. The third is to train the GMM models on speech with the same accent. The last is to train fractionized bi-tone GMM models to cover tone changes in multiple-character words. With these measures, the tone scoring correct rate improves from 60.2% to 83.3%.

Keywords: CALL, tone assessment, GMM, tone recognition, HMM, force-alignment, F0

1 Introduction

CALL (Computer Aided Language Learning) systems can automatically score the quality of human speech from many different points of view. In tonal languages such as Chinese, tone plays a very important role in discriminating characters and expressing meaning, so tone scoring is in special demand in CALL systems for these languages. This paper introduces a text-independent CALL system for Mandarin, in which tone assessment is one of the primary components. The system will be used to evaluate the pronunciation quality of speakers from Hong Kong. It scores the pronunciation quality of every syllable from three basic aspects: pronunciation quality of the consonant, pronunciation quality of the vowel, and accuracy of the tone. The three scores are then integrated to form the final score of the syllable.
The system initially uses HMMs and Viterbi decoding to obtain phone segmentation information and log-likelihood scores for the input speech [1]. Then the average phone posterior probabilities are computed, from which the scores of the pronunciation quality of the consonant and the vowel are obtained [2][3]. Simultaneously, the vowel segment of the syllable is used to evaluate the tone. Finally, a combination method simplified from [4] is used to integrate the three partial scores into one final score of the pronunciation quality of the syllable.

¹ This work is (partly) supported by the Chinese 973 program (2004CB318106), the National Natural Science Foundation of China (10574140, 60535030), and the Beijing Municipal Science & Technology Commission (Z0005189040391).

A GMM-based tone recognition system is designed to score the tone [5]. The system works well on a general Mandarin database, which is comprised of Mandarin speech with little accent. But when it is applied to a database with a very strong Southern China accent, such as the Hong Kong speech database that we use, the scoring performance drops greatly. This is caused by inaccurate force-alignment and by the special tone pronunciation, which generates very unusual patterns of F0 contours.

Several solutions to these problems are proposed in this paper. The first is to replace force-alignment with a post-processing procedure on the F0 contour of the entire syllable. The second is to base the scoring on tone posterior probabilities instead of directly on the tone recognition results. The third is to use data with the same accent to train the GMM models. In addition to these measures, we also attempt to use fractionized bi-tone GMM models to cover tone changes in multiple-character words or sentences. Experiments show that these solutions are very effective and greatly improve the system performance.

This paper is organized as follows: Section 2 introduces our CALL system; Section 3 describes our original GMM-based tone scoring system; Section 4 describes our modifications; Section 5 presents experiments and results; and Section 6 concludes.

2 CALL System Overview

Our CALL system evaluates the pronunciation quality of Mandarin speech, where the speech types include mono-syllables, phrases, and sentences. In all these cases, the Mandarin syllable is the fundamental assessment unit. The syllable is evaluated from three aspects: pronunciation quality of the consonant, pronunciation quality of the vowel, and accuracy of the tone.
The first two scores are computed with HMM-based speech recognition technology and Viterbi search, and the last is computed with GMM-based tone recognition technology. The block diagram of the system is shown in Fig. 1. Observation features are extracted from the input speech and fed into the HMM model net for one-pass Viterbi decoding. For pronunciation assessment, full-function speech recognition is not required: the HMM model net consists only of the models of the utterance text, and the Viterbi decoding is only a force-alignment between the speech frames and the HMM models in the net. The final result includes the frame indices of each HMM state and the output probability of each observation frame from its force-aligned HMM state [1][5]. Then the acoustic confidences of the phonemes of each syllable are computed by Equation 1.

$$P_{PH}(O) = \frac{1}{e - b + 1} \sum_{t=b}^{e} P(s_t \mid o_t) \qquad (1)$$
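Equation 1 averages frame-level state posteriors over the force-aligned frames b..e of a phone. The following is a minimal sketch, not the authors' implementation. It assumes each frame's state posterior is obtained by normalizing likelihood times prior over all states (Bayes' rule). The function names and the plain-list inputs (`frame_likelihoods`, `priors`, `aligned_states`) are hypothetical.

```python
def state_posterior(likelihoods, priors, state):
    """Posterior of `state` for one frame, by Bayes' rule over all states.

    likelihoods[s] is p(o | s) and priors[s] is p(s); both are assumed inputs.
    """
    evidence = sum(likelihoods[s] * priors[s] for s in range(len(priors)))
    return likelihoods[state] * priors[state] / evidence


def phone_posterior_score(frame_likelihoods, priors, aligned_states, b, e):
    """Average state posterior over frames b..e (inclusive), as in Equation 1."""
    total = 0.0
    for t in range(b, e + 1):
        total += state_posterior(frame_likelihoods[t], priors, aligned_states[t])
    return total / (e - b + 1)
```

For example, with two states, equal priors, and both frames aligned to state 0, `phone_posterior_score([[0.8, 0.2], [0.6, 0.4]], [0.5, 0.5], [0, 0], 0, 1)` averages the two frame posteriors 0.8 and 0.6.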
[Fig. 1. Architecture of the CALL system proposed in this paper. Blocks: input speech; learning text; pronunciation dictionary; feature extraction; one-pass Viterbi decoding (forced alignment) with the HMM models; scoring by posterior probability (consonant score, vowel score); F0 computation and feature extraction on the vowel part; model matching against the tone GMMs (tone score); score integration (final score).]

In Equation 1, O = [o_b, o_{b+1}, ..., o_e] is the force-aligned observation sequence of the phone PH (a consonant or a vowel), b is the begin frame of PH and e is the end frame of PH. S = [s_b, s_{b+1}, ..., s_e] is the state sequence corresponding to O. P(s_t | o_t) is the state posterior probability computed by Equation 2.

$$P(s_t \mid o_t) = \frac{p(o_t \mid s_t)\, p(s_t)}{p(o_t)} = \frac{p(o_t \mid s_t)\, p(s_t)}{\sum_{s \in \mathcal{S}} p(o_t \mid s)\, p(s)} \qquad (2)$$

In Equation 2, p(o | s) is the output probability of observation o in state s, and \mathcal{S} is the global state set of the system. The posterior probability P_{PH}(O) is an absolute measure of how close the pronunciation is to the acoustic model. The models are trained on a standard Mandarin speech corpus; consequently, P_{PH}(O) can be used directly for phonemic pronunciation assessment. We classify phonemic pronunciation quality into three classes: good, medium and bad, corresponding to scores 2, 1 and 0 respectively. Two thresholds are set to map the posterior probabilities to the three scores.

Evaluation of the tone runs in parallel with the phonemic pronunciation assessment. It is discussed in detail in the next section. The tone scores are also confined to 2, 1 and 0, meaning good, medium and bad respectively. Finally, the phonemic pronunciation scores and the tone score are integrated by Equation 3 to form the final score of the syllable.
$$Score_{Syllable} = \min(Score_{Consonant},\ Score_{Vowel},\ Score_{Tone}) \qquad (3)$$

The system is used to aid the Hong Kong PuTongHua level test (PSK) in assessing the pronunciation quality of Mandarin speech spoken by Hong Kong native students. The test includes 75 utterances: the first 50 utterances are isolated syllables and the last 25 utterances are two-syllable words. The score of every syllable is computed as above, and the scores of the total 100 syllables are summed to give the final score of the test.

3 Tone Scoring System

All Mandarin syllables can be considered a combination of initial and final parts. The phoneme structure of a Chinese syllable is defined in Fig. 2. The lexical tone is mainly specified by the pattern of the pitch contour of the syllable's final part, that is, the vowel portion of the syllable [6]. The tone scoring system is designed according to these principles, as shown in Fig. 1.

[Fig. 2. Structure of a Mandarin syllable: Initial = [Consonant]; Final = [Medial] Syllabic vowel [Ending].]

To evaluate the tone of a given syllable, we use the pitch contour of the syllable's vowel segment identified via the above force-alignment. The pitch contour is computed by the sub-harmonic summation method [7] and then transformed into classification features for tone recognition via GMM. Finally, the recognition result is compared with the reference tone (the tone specified in the learning text) to get the tone score by the following rules (score 1 is not used because the medium class is hard to discriminate):

Score_Tone = 2, if the recognized tone is the same as the reference tone.
Score_Tone = 0, if the recognized tone is not the same as the reference tone.

Obviously, the validity of this tone scoring process is closely related to the performance of tone recognition: the more accurately the tone is recognized, the more precise the tone scoring is. We train the GMM models on a general Mandarin database, which includes all the isolated tonal syllables of Mandarin. There are four GMM models, one for each tone. Tone-5 is not considered at present.
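The recognition-based tone rule and the integration of Equation 3 can be sketched as follows. This is an illustrative sketch only. It assumes the four tone GMMs have already produced one likelihood per tone for the syllable, passed in as a dictionary keyed by tone number. All function names are hypothetical.

```python
def recognize_tone(tone_likelihoods):
    """Return the tone (1-4) whose model gives the highest likelihood."""
    return max(tone_likelihoods, key=tone_likelihoods.get)


def tone_score_by_recognition(tone_likelihoods, reference_tone):
    """Score 2 if the recognized tone matches the reference tone, else 0."""
    return 2 if recognize_tone(tone_likelihoods) == reference_tone else 0


def syllable_score(consonant_score, vowel_score, tone_score):
    """Equation 3: the syllable score is the minimum of the three part scores."""
    return min(consonant_score, vowel_score, tone_score)


def total_score(syllable_scores):
    """Final test score: the sum of all syllable scores."""
    return sum(syllable_scores)
```

Taking the minimum means a syllable is rated no better than its weakest aspect, so a wrongly recognized tone (score 0) forces the whole syllable to 0 under this rule.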
According to [8], different kinds of tone classification features lead to different recognition performance. [8] showed that the equal-length subsection F0 and ΔF0 features are the best: the F0 curve of the entire vowel segment is divided into several equal-length subsections, and for each subsection the means of F0 and ΔF0 are computed to serve as features. We expand this feature with one more element, the ΔF0 of the entire vowel segment. We find that this expansion leads to better performance.
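The expanded feature can be sketched as below, under stated assumptions. The paper does not define ΔF0 precisely, so this sketch takes it to be the mean frame-to-frame F0 difference, both within a subsection and over the whole segment. It uses four subsections, which gives a 9-dimensional vector. When the contour length is not divisible by four, the subsections are only approximately equal in length. The function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def mean_delta(values):
    """Mean frame-to-frame difference; 0.0 if fewer than two frames."""
    if len(values) < 2:
        return 0.0
    # The sum of consecutive differences telescopes to last minus first.
    return (values[-1] - values[0]) / (len(values) - 1)


def tone_features(f0, n_sections=4):
    """Per-subsection (mean F0, mean delta F0), plus delta F0 of the whole segment."""
    n = len(f0)
    if n < n_sections:
        raise ValueError("F0 contour is shorter than the number of subsections")
    features = []
    for i in range(n_sections):
        start = i * n // n_sections
        end = (i + 1) * n // n_sections
        section = f0[start:end]
        features.append(mean(section))
        features.append(mean_delta(section))
    features.append(mean_delta(f0))
    return features
```

A steadily rising contour such as `[100, 110, 120, 130, 140, 150, 160, 170]` yields increasing subsection means and the same positive slope in every ΔF0 element, which is the kind of coarse level-and-direction information the feature is meant to capture.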
4 Improvements of the Tone Scoring System

4.1 Cope with the Strong Accent

Our tone scoring system achieves a high correct rate on general Mandarin speech. But when it is tested on the Hong Kong speech database, the performance drops greatly. The speech in the HK database is spoken by Hong Kong native students. They have a very strong Southern China accent when speaking Mandarin, and many of them cannot even speak Mandarin fluently. We analyzed many recognition mistakes and found that the performance deterioration is mainly due to the following reasons.

Firstly, the F0 contour segmented by force-alignment is not complete. The HMM models used for force-alignment are trained on the general Mandarin database, whose acoustic characteristics are somewhat different from those of the Hong Kong database. This leads to errors in the force-alignment. Often, the fore-part or the end-part of the vowel is cut off by the force-alignment. Additionally, the F0 contour of some voiced consonants also contributes to the tone, but it is cut off because we only preserve the vowel portion. All these cases damage the integrity of the F0 contour.

Secondly, the patterns of the F0 contours of the Hong Kong data are different from those of general Mandarin speech. They differ in the following ways. In tone-4, there is an F0 rise in the fore-part of the F0 contour, as shown in Fig. 3a. In tone-3, the F0 rise in the end-part is not sufficient, as shown in Fig. 3b. In tone-2, there is a fall in the fore-part of the F0 contour, as shown in Fig. 3c.

[Fig. 3. Special F0 contour patterns of HK data: (a) tone-4, (b) tone-3, (c) tone-2.]
These changes can be tolerated in a human expert's evaluation, but they cannot be ignored by the GMM models, which are trained on general Mandarin. Based on these analyses, we propose the following measures.

The first one: in order to avoid force-alignment mistakes, we no longer depend on force-alignment to segment the vowel portions. Force-alignment results are only used for segmenting syllables in multi-character words. We then directly compute the F0 contour for the entire syllable. Finally, the F0 contour of the syllable is post-processed to exclude the F0 points of unvoiced consonants, breathing noise or any other noise. The remaining F0 contour is a regular F0 sequence covering exactly the voiced portion of the syllable. An example of F0 contour post-processing, for the syllable si1, is shown in Fig. 4. As shown in Fig. 4, the sub-harmonic summation of the voiced segment, which is what we need, is comparatively high. We find the maximum value of the sub-harmonic summation sequence, and by setting a threshold below this maximum value we can approximately locate the voiced portion of the syllable, as the interval (a, b). Then, within (a, b), a more careful examination is performed on the F0 sequence to exclude singular points at the two ends. Finally, the portion (c, d) remains.

[Fig. 4. Post-processing method on the F0 contour, for syllable si1. The plot shows F0 and the sub-harmonic summation (vertical axis: frequency/energy) against the frame index (horizontal axis), with the threshold, the interval (a, b), and the F0 contour of the voiced portion (c, d).]

The second one is to use Hong Kong speech data to train the GMM models. By this means, the new GMM models will be able to cover the changed patterns of the F0 contours, and so the recognition performance is expected to improve.
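Returning to the first measure, the post-processing illustrated in Fig. 4 can be sketched as follows. This is a sketch under assumptions, since the paper gives neither the threshold value nor the exact test for singular points. Here the threshold is taken as a fixed fraction (`ratio`) of the peak sub-harmonic summation value. An end frame is treated as singular if its F0 is non-positive or differs from the next inner frame by more than a relative amount (`max_jump`). Both parameters and all function names are hypothetical.

```python
def voiced_interval(shs, ratio=0.5):
    """Return (a, b): the contiguous run around the SHS peak with shs >= ratio * peak."""
    peak = max(range(len(shs)), key=lambda i: shs[i])
    threshold = ratio * shs[peak]
    a = peak
    while a > 0 and shs[a - 1] >= threshold:
        a -= 1
    b = peak
    while b < len(shs) - 1 and shs[b + 1] >= threshold:
        b += 1
    return a, b


def trim_singular_ends(f0, a, b, max_jump=0.2):
    """Shrink (a, b) to (c, d) by dropping end frames that look singular."""
    c, d = a, b
    # Drop leading frames that are unvoiced or jump sharply to the next frame.
    while c < d and (f0[c] <= 0 or abs(f0[c] - f0[c + 1]) > max_jump * f0[c + 1]):
        c += 1
    # Drop trailing frames by the same criterion, looking inward.
    while d > c and (f0[d] <= 0 or abs(f0[d] - f0[d - 1]) > max_jump * f0[d - 1]):
        d -= 1
    return c, d


def voiced_f0(f0, shs, ratio=0.5, max_jump=0.2):
    """F0 contour of the voiced portion (c, d) of a syllable."""
    a, b = voiced_interval(shs, ratio)
    c, d = trim_singular_ends(f0, a, b, max_jump)
    return f0[c:d + 1]
```

Because the interval is grown outward from the summation peak, isolated high-energy noise frames away from the voiced core are not picked up, which matches the intent of excluding breathing noise and unvoiced consonants.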
And the last one is to improve the scoring method. The tone scoring method introduced in Section 3 depends too heavily on the accuracy of tone recognition. Considering the relatively low correct rate of tone recognition on the Hong Kong data, this method is too arbitrary. So we design another scoring mechanism that makes use of the observation probabilities computed by the GMM models. We compute the posterior probability of the reference tone (that is, the correct tone of the syllable in the learning text) by using Equation 4.

$$P(\mathrm{ref.Tone} \mid F0) = \frac{p(F0 \mid \mathrm{ref.Tone})\, p(\mathrm{ref.Tone})}{p(F0)} = \frac{p(F0 \mid \mathrm{ref.Tone})\, p(\mathrm{ref.Tone})}{\sum_{k=1}^{4} p(F0 \mid \mathrm{Tone}_k)\, p(\mathrm{Tone}_k)} = \frac{p(F0 \mid \mathrm{ref.Tone})}{\sum_{k=1}^{4} p(F0 \mid \mathrm{Tone}_k)} \qquad (4)$$

In Equation 4, p(F0 | ref.Tone) is the observation probability of the F0 feature under the GMM model of the reference tone; the sum over k = 1, ..., 4 of p(F0 | Tone_k) is the sum of the observation probabilities of the F0 feature under the GMM models of all four tones; and equal priors, p(ref.Tone) = p(Tone_k), are assumed. This posterior probability measures how close the pattern of the F0 feature is to the reference tone model. It can be used for tone scoring directly: by setting two thresholds, we map the posterior probability to a score of 2, 1 or 0. With this new method, even if a tone is wrongly recognized, its posterior probability from Equation 4 may still fall into the correct interval, so the correct score can still be obtained.

4.2 Cope with Tone Changes in Multi-character Words

In continuous speech, the tone of a syllable is often affected by the tones of its neighboring syllables, so the characteristics of its F0 contour are somewhat different from those of the isolated syllable, which is taken as the standard. This leads to tone changes in continuous speech. According to [9], the common tone variations in continuous speech can mainly be divided into the following categories:

When two tone-3 syllables are concatenated, the first tone-3 is changed to tone-2; when three tone-3 syllables are concatenated, the first two are changed to tone-2.
If tone-3 is not followed by tone-3, it is changed to half tone-3.
The tone of the character 一 is changed to tone-2 when it is followed by tone-4, and to tone-4 when it is followed by tone-1, tone-2 or tone-3.
The tone of the character 不 is changed to tone-2 when it is followed by tone-4.

For tone recognition in continuous speech, GMM models that can discriminate these changes are preferable. Because different tone contexts lead to different tone changes, we propose to train fractionized models for each tone, similar to the idea reported in [8]. But unlike [8], which associates the tone model with the mono-phone or tri-phone, we associate the tone model with its tone context. That is, for each tone, multiple models are trained for multiple tone contexts, so that each model covers the kind of tone change that results from the corresponding tone context. At present our system is mainly focused on evaluating multi-character words, so the bi-tone context is considered first. For each tone, eight models are trained according to its predecessor or successor, using a database of multi-character words. For all four tones (tone-5 is not included), thirty-two models are trained in total. These fractionized models are expected to give better tone recognition performance on multi-character words.

5 Experiments and Results

5.1 Database of the Experiments

There are two kinds of database in our experiments. One consists of general Mandarin speech data. We divide this database into two parts: one part contains 37 speakers and 48100 isolated tonal syllables and is used as training data, named GM1; the other part contains four speakers and 4800 isolated tonal syllables and is used as test data, named GM2.

The other kind of database consists of groups of PSK test samples and many multi-character words spoken by Hong Kong native residents. It has a very strong Southern China accent. We divide this database into three parts: part one includes 100 PSK test samples, named HK1; part two includes 102 PSK test samples, named HK2; part three consists of 8 speakers and 4000 multi-character words, named HKW. HK1 and HKW are used as training data, and HK2 is used as test data.

5.2 Feature Selection

Many kinds of features were tested in [8] on a continuous speech database. Feature selection for isolated syllables and words may differ, so we test several kinds of features on our system, as follows.

1. All the F0 and ΔF0 values of the entire vowel portion are used to form the feature: (F0, ΔF0).
2. F0 and ΔF0 of four equal-length subsections: (F0, ΔF0).
3. F0 and ΔF0 of four equal-length subsections, plus ΔF0 of the entire vowel segment.
We use the original tone recognition system described in Section 3 to run tests on GM2. The GMM models are trained on GM1. The test results are shown in Table 1.
Table 1. Tone recognition correct rate with different features

Feature     Feature Dimension   GMM Mixture Number   Correct Rate
Feature 1   32                  4                    87.1%
Feature 2   8                   128                  88.3%
Feature 3   9                   128                  91.3%

Our results confirm the conclusion of [8] on an isolated-syllable database. The main value and direction of the F0 contour are the most important characteristics for tone recognition; the detailed information is not important. The last kind of feature achieves the best performance.

5.3 Tone Recognition of Hong Kong Data

We use the best feature determined by the above experiments to recognize HK2 with the following four systems. The test results are shown in Table 2.

System 1: Original system; GM1 is used to train the GMM models.
System 2: The isolated syllables of HK1 are used to train the GMM models, which also have 128 Gaussian mixtures.
System 3: Same as System 2, except that the force-alignment procedure is replaced (the F0 contour of the entire syllable is computed and post-processed, as in Section 4.1).
System 4: The isolated syllables of HK1 are used to train the GMM models for recognizing the tones of the isolated syllables of HK2; the multi-character words of HK1 and HKW are used to train fractionized tone models for recognizing the tones of the words of HK2; and the F0 contour of the entire syllable is computed.

Table 2. Correct rate on test data HK2 with different systems

System     Correct Rate
System 1   67.9%
System 2   73.0%
System 3   75.3%
System 4   75.5%

The results demonstrate the effectiveness of the measures proposed in Section 4. All the measures lead to an improvement in recognition performance. But the correct rate is still not as good as that on the general Mandarin database. We think the reasons may be that the training data is insufficient and that the patterns of the F0 contours cannot be fully covered by only four GMM models.

5.4 Tone Scoring and Final Score Computation

The above tone recognition results are then used for scoring the tones of HK2. Five experiments are carried out. The first three experiments use the original tone scoring method to score the tone recognition results of Systems 1, 2 and 3 above. The last two experiments use the new tone scoring method introduced in Section 4.1 to score the results of Systems 3 and 4. The tone scoring results are compared with those of the human experts' evaluation, and the correct rates are shown in Table 3. We can see that the new scoring method greatly improves the correct rate.

Finally, the tone scores are integrated with the phonemic pronunciation scores to get the final scores of the syllable pronunciation of HK2. For each test sample, the final scores of all syllables are added together and compared with the total from the human experts' evaluation to calculate the score difference. The average score difference over the 102 samples is computed, and the results are also shown in Table 3. As expected, the score difference decreases as the tone scoring correct rate increases.

Table 3. Score differences with different tone scoring correct rates

System     Tone Scoring Method              Tone Scoring Correct Rate   Average Score Difference
System 1   Based on Rec. Results            60.2%                       16.77
System 2   Based on Rec. Results            69.4%                       14.32
System 3   Based on Rec. Results            72.2%                       11.10
System 3   Based on Posterior Probability   83.1%                       6.43
System 4   Based on Posterior Probability   83.3%                       6.34

6 Conclusions

Our tone scoring system was originally built for general Mandarin speech data. But when this system is applied to accented speech data (such as the HK-accented speech data), many problems arise and the scoring performance drops greatly. Analyses show that the characteristics of HK-accented speech are clearly different from those of general Mandarin speech. Based on this, several solutions to the problems are proposed. The main idea of our solutions is to adapt the scoring algorithm to the special test data. We compute the F0 contour for the entire syllable to avoid force-alignment errors; we design a new tone scoring method to tolerate tone recognition mistakes; and we use data with the same accent to train the GMM models to cover the changed F0 contour patterns of the Hong Kong data. Experimental results show that our solutions are effective.
The tone scoring correct rate is improved from 60.2% to 83.3%, and the final average score difference between our scoring system and the human experts' evaluation is decreased from 16.77 to 6.34. Another big problem in tone recognition is the tone changes in multi-character words. We attempt to train fractionized bi-tone models for the different tone contexts to cover those changes. Although the improvement in performance is still not very large, we believe this is mainly due to insufficient speech data and that the algorithm is promising. Further research is continuing.
References

1. J. Bernstein, M. Cohen, H. Murveit, D. Rtischev, and M. Weintraub, Automatic Evaluation and Training in English Pronunciation, ICSLP 1990, Kobe, Japan.
2. L. Neumeyer, H. Franco, M. Weintraub, and P. Price, Automatic Text-Independent Pronunciation Scoring of Foreign Language Student Speech, Proc. of ICSLP 96, pp. 1457-1460, Philadelphia, Pennsylvania, 1996.
3. L. Neumeyer, H. Franco, V. Digalakis, and M. Weintraub, Automatic Scoring of Pronunciation Quality, Speech Communication, vol. 30, issues 2-3, February 2000, pp. 83-93.
4. H. Franco, L. Neumeyer, V. Digalakis, and O. Ronen, Combination of machine scores for automatic grading of pronunciation quality, Speech Communication, vol. 30, 2000.
5. Jiang-Chun Chen, Jyh-Shing Roger Jang, Jun-Yi Li, and Ming-Jun Wu, Automatic Pronunciation Assessment for Mandarin Chinese, IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, June 2004.
6. Yang Wu-Ji, Jyh-Chyang Lee, Yueh-chin Chang, and Hsiao-Chuan Wang, Hidden Markov Model for Mandarin Lexical Tone Recognition, IEEE Transactions on Acoustics, Speech, & Signal Processing, vol. ASSP-36, no. 7, 1988, pp. 989-992.
7. Dik Hermes, Measurement of pitch by subharmonic summation, Journal of the Acoustical Society of America, 83(1), Jan. 1988, pp. 257-264.
8. Ye Tian, Jianlai Zhou, Min Chu, and Eric Chang, Tone Recognition with Fractionized Models and Outlined Features, Proc. of ICASSP 2004, Montreal, pp. I-105~I-108.
9. Wu Zongji, The tone variation in Mandarin, Chinese Grammar, 1982, no. 6, pp. 439-449.