Improvements in Tone Pronunciation Scoring for Strongly Accented Mandarin Speech 1


Fuping Pan, Qingwei Zhao, Yonghong Yan

ThinkIT Laboratory, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080
{fpan, qzhao, yyan}@hccl.ioa.ac.cn

Abstract. This paper discusses a tone pronunciation scoring system for Mandarin. It recognizes the tones of syllables with GMM models and uses the recognition results for tone assessment. Initially, the experimental results on strongly accented speech were poor, for two reasons: inaccurate forced alignment leads to incomplete F0 contours, and the accented speech shows special F0 contour patterns. We propose several measures to address these problems. The first is to make the extraction of the F0 contour independent of the forced alignment. The second is to base the scoring on GMM posterior probabilities. The third is to train the GMM models on speech with the same accent. The last is to train fractionized bi-tone GMM models to cover tone changes in multi-character words. With these measures, the tone scoring correct rate improves from 60.2% to 83.3%.

Keywords: CALL, tone assessment, GMM, tone recognition, HMM, forced alignment, F0

1 Introduction

CALL (Computer Aided Language Learning) systems can automatically score the quality of human speech from many different points of view. In tonal languages such as Chinese, tone plays a very important role in discriminating characters and expressing meaning, so tone scoring is in special demand in CALL systems for these languages. This paper introduces a text-independent CALL system for Mandarin in which tone assessment is one of the primary components. The system will be used to evaluate the pronunciation quality of speakers from Hong Kong. It scores the pronunciation quality of every syllable on three basic aspects: the pronunciation quality of the consonant, the pronunciation quality of the vowel, and the accuracy of the tone. The three scores are then integrated to form the final score of the syllable.
The system first uses HMMs and Viterbi decoding to obtain phone segmentation information and log-likelihood scores for the input speech [1]. The average phone posterior probabilities are then computed to obtain the pronunciation quality scores of the consonant and the vowel [2][3]. Simultaneously, the vowel segment of the syllable is used to evaluate the tone. Finally, a combination method simplified from [4] integrates the three partial scores into one final score for the pronunciation quality of the syllable.

A GMM-based tone recognition system is designed to score the tone [5]. The system works well on a general Mandarin database, which consists of Mandarin speech with little accent. When it is applied to a database with a very strong Southern China accent, such as the Hong Kong speech database that we use, the scoring performance drops sharply. The causes are inaccurate forced alignment and the special tone pronunciation, which generates very unusual F0 contour patterns.

Several solutions to these problems are proposed in this paper. The first is to replace forced alignment with a post-processing procedure on the F0 contour of the entire syllable. The second is to base the scoring on tone posterior probabilities instead of directly on tone recognition results. The third is to use data with the same accent to train the GMM models. In addition to these measures, we also attempt to use fractionized bi-tone GMM models to cover tone changes in multi-character words or sentences. Experiments show that these solutions are very effective and greatly improve the system performance.

This paper is organized as follows: Section 2 introduces our CALL system; Section 3 describes our original GMM-based tone scoring system; Section 4 presents the modifications; Section 5 presents experiments and results; and Section 6 concludes.

Footnote 1. This work is (partly) supported by Chinese 973 program (2004CB318106), National Natural Science Foundation of China (10574140, 60535030), and Beijing Municipal Science & Technology Commission (Z0005189040391).

2 CALL System Overview

Our CALL system evaluates the pronunciation quality of Mandarin speech, where the speech type includes mono-syllables, phrases, and sentences. In all these cases, the Mandarin syllable is the fundamental assessment unit. The syllable is evaluated on three aspects: the pronunciation quality of the consonant, the pronunciation quality of the vowel, and the accuracy of the tone.
The first two scores are computed with HMM-based speech recognition technology and Viterbi search, and the last one with GMM-based tone recognition technology. The block diagram of the system is shown in Fig. 1. Observation features are extracted from the input speech and fed into the HMM model net for one-pass Viterbi decoding. Pronunciation assessment does not require full-function speech recognition: the HMM model net consists only of the models of the utterance text, so the Viterbi decoding is only a forced alignment between the speech frames and the HMM models in the net. The final result includes the frame indices of each HMM state and the output probability of each observation frame from its force-aligned HMM state [1][5]. The acoustic confidence of each phoneme of a syllable is then computed by Equation 1.

$$P_{PH}(O) = \frac{1}{e - b + 1} \sum_{t=b}^{e} P(s_t \mid o_t) \tag{1}$$

[Figure 1: block diagram. Input speech, learning text, and pronunciation dictionary feed feature extraction and one-pass Viterbi decoding (forced alignment) with the HMM models. The consonant and vowel scores are computed by posterior probability. For the vowel part, F0 is computed, features are extracted, and model matching against the tone GMMs gives the tone score. The three scores are integrated into the final score.]

Fig. 1. Architecture of the CALL system proposed in this paper

In Equation 1, $O = [o_b, o_{b+1}, \ldots, o_e]$ is the force-aligned observation sequence of the phone PH (a consonant or a vowel), $b$ is the first frame of PH, and $e$ is the last frame of PH. $S = [s_b, s_{b+1}, \ldots, s_e]$ is the state sequence corresponding to $O$. $P(s_t \mid o_t)$ is the state posterior probability computed by Equation 2.

$$P(s_t \mid o_t) = \frac{p(o_t \mid s_t)\,p(s_t)}{p(o_t)} = \frac{p(o_t \mid s_t)\,p(s_t)}{\sum_{s \in \mathcal{S}} p(o_t \mid s)\,p(s)} \tag{2}$$

In Equation 2, $p(o_t \mid s_t)$ is the output probability of observation $o_t$ in state $s_t$, and $\mathcal{S}$ is the global state set of the system. The posterior probability $P_{PH}(O)$ is an absolute measure of how close the pronunciation is to the acoustic model. Because the models are trained on a standard Mandarin speech corpus, $P_{PH}(O)$ can be used directly for phonemic pronunciation assessment. We classify phonemic pronunciation quality into three classes, good, medium, and bad, corresponding to scores 2, 1, and 0 respectively. Two thresholds are set to map the posterior probabilities to the three scores.

The tone is evaluated in parallel with the phonemic pronunciation assessment, as discussed in detail in the next section. The tone scores are also confined to 2, 1, and 0, meaning good, medium, and bad respectively. Finally, the phonemic pronunciation scores and the tone score are integrated by Equation 3 to form the final score of the syllable.
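The phone-level scoring of Equations 1 and 2, the two-threshold mapping, and the minimum rule of Equation 3 can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary-based representation of the state set, and the threshold values `low` and `high` are assumptions, since the paper does not report its thresholds.

```python
def state_posterior(likelihoods, priors, aligned_state):
    """Equation 2: P(s_t | o_t) for one frame.

    likelihoods[s] is p(o_t | s) and priors[s] is p(s) for every state s
    in the global state set (the dictionary layout is an assumption).
    """
    evidence = sum(likelihoods[s] * priors[s] for s in likelihoods)
    return likelihoods[aligned_state] * priors[aligned_state] / evidence


def phone_posterior(frame_posteriors):
    """Equation 1: average of the frame posteriors over frames b..e."""
    return sum(frame_posteriors) / len(frame_posteriors)


def posterior_to_score(posterior, low=0.3, high=0.6):
    """Map a posterior to 0 (bad), 1 (medium) or 2 (good).

    The threshold values are hypothetical placeholders.
    """
    if posterior >= high:
        return 2
    if posterior >= low:
        return 1
    return 0


def syllable_score(consonant_score, vowel_score, tone_score):
    """Equation 3: the syllable score is the minimum of the three part scores."""
    return min(consonant_score, vowel_score, tone_score)


# Toy usage: three states with equal priors, frame aligned to state "b".
toy_likelihoods = {"a": 0.2, "b": 0.6, "c": 0.2}
toy_priors = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
frame_p = state_posterior(toy_likelihoods, toy_priors, "b")
vowel = posterior_to_score(phone_posterior([frame_p, 0.8, 0.7]))
final = syllable_score(consonant_score=1, vowel_score=vowel, tone_score=2)
```

Taking the minimum means that a single poor component (consonant, vowel, or tone) caps the score of the whole syllable.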

$$Score_{Syllable} = \min(Score_{Consonant}, Score_{Vowel}, Score_{Tone}) \tag{3}$$

The system is used to assist the Hong Kong PuTongHua level test (PSK) in assessing the pronunciation quality of Mandarin speech spoken by native Hong Kong students. The test includes 75 utterances: the first 50 are isolated syllables and the last 25 are two-syllable words. The score of every syllable is computed as above, and the scores of all 100 syllables are summed to give the final score of the test.

3 Tone Scoring System

All Mandarin syllables can be considered a combination of an initial part and a final part. The phoneme structure of a Chinese syllable is defined in Fig. 2. The lexical tone is mainly specified by the pitch contour pattern of the syllable's final part, that is, the vowel portion of the syllable [6]. The tone scoring system, shown in Fig. 1, is designed according to these principles.

    Initial:  [Consonant]
    Final:    [Medial] Syllabic vowel [Ending]

Fig. 2. Structure of Mandarin syllable

To evaluate the tone of a given syllable, we use the pitch contour of the syllable's vowel segment identified by the forced alignment described above. The pitch contour is computed by the sub-harmonic summation method [7] and then transformed into a classification feature for tone recognition with GMMs. Finally, the recognition result is compared with the reference tone (the tone specified in the learning text) to obtain the tone score by the following rules (score 1 is not used because the medium class is hard to discriminate):

$Score_{Tone} = 2$, if the recognized tone is the same as the reference tone.
$Score_{Tone} = 0$, if the recognized tone is not the same as the reference tone.

Obviously, the validity of this tone scoring process is closely related to the performance of tone recognition: the more accurately the tone is recognized, the more precise the tone scoring is. We train the GMM models on a general Mandarin database that includes all the isolated tonal syllables of Mandarin. There are four GMM models, one per tone. Tone-5 is not considered at present.
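The recognition-based scoring rule above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the paper does not say which covariance type its GMMs use, so diagonal covariances are assumed here, and the single-component toy models and their parameter values are invented purely for illustration.

```python
import math


def gmm_log_likelihood(x, gmm):
    """log p(x | GMM) for a diagonal-covariance mixture (an assumption).

    gmm is a list of (weight, means, variances) components.
    """
    log_terms = []
    for weight, means, variances in gmm:
        log_pdf = 0.0
        for xi, mu, var in zip(x, means, variances):
            log_pdf += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(math.log(weight) + log_pdf)
    # Log-sum-exp over the components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def recognize_tone(x, tone_gmms):
    """Pick the tone whose GMM gives the feature the highest likelihood."""
    return max(tone_gmms, key=lambda tone: gmm_log_likelihood(x, tone_gmms[tone]))


def score_by_recognition(x, reference_tone, tone_gmms):
    """Score 2 if the recognized tone matches the reference tone, else 0."""
    return 2 if recognize_tone(x, tone_gmms) == reference_tone else 0


# Hypothetical one-component models over a 2-D feature (mean F0, delta F0).
toy_gmms = {
    1: [(1.0, [220.0, 0.0], [100.0, 25.0])],
    2: [(1.0, [180.0, 30.0], [100.0, 25.0])],
    3: [(1.0, [140.0, -10.0], [100.0, 25.0])],
    4: [(1.0, [200.0, -40.0], [100.0, 25.0])],
}
toy_feature = [182.0, 28.0]
recognized = recognize_tone(toy_feature, toy_gmms)
```

Because the rule is all-or-nothing, any recognition error turns directly into a scoring error, which is the weakness Section 4 addresses.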
According to [8], different kinds of tone classification features lead to different recognition performance. The authors of [8] showed that F0 and ΔF0 over equal-length subsections perform best: the F0 curve of the entire vowel segment is divided into several equal-length subsections, and for each subsection the means of F0 and ΔF0 are computed to serve as the feature. We expand this feature with one more element, the ΔF0 of the entire vowel segment, and find that this expansion leads to better performance.
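The expanded feature can be sketched as follows for four subsections, which gives 4 x 2 + 1 = 9 dimensions. The paper does not define ΔF0 precisely, so this sketch assumes it is the average frame-to-frame change of F0 within the span considered; the function names and the integer splitting rule are likewise assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def mean_delta(values):
    """Average frame-to-frame F0 change over a span (assumed definition of delta F0)."""
    if len(values) < 2:
        return 0.0
    return (values[-1] - values[0]) / (len(values) - 1)


def split_equal(values, parts):
    """Split a sequence into `parts` contiguous chunks of near-equal length."""
    n = len(values)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [values[bounds[i]:bounds[i + 1]] for i in range(parts)]


def tone_feature(f0, parts=4):
    """Mean F0 and delta F0 per subsection, plus delta F0 of the whole contour."""
    if len(f0) < parts:
        raise ValueError("F0 contour is shorter than the number of subsections")
    feature = []
    for chunk in split_equal(f0, parts):
        feature.append(mean(chunk))
        feature.append(mean_delta(chunk))
    feature.append(mean_delta(f0))
    return feature


# A steadily rising toy contour: 100, 102, ..., 114 Hz over 8 frames.
rising = [100.0 + 2.0 * i for i in range(8)]
rising_feature = tone_feature(rising)
```

The feature length does not depend on the duration of the vowel, so syllables of different lengths can be scored by the same GMMs.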

4 Improvements of the Tone Scoring System

4.1 Cope with the Strong Accent

Our tone scoring system achieves a high correct rate on general Mandarin speech, but its performance drops sharply when it is tested on the Hong Kong speech database. The speech in the HK database is spoken by native Hong Kong students. They have a very strong Southern China accent when speaking Mandarin, and many of them cannot speak Mandarin fluently. We analyzed many recognition mistakes and found that the performance deterioration is mainly due to the following reasons.

First, the F0 contour segmented by forced alignment is not complete. The HMM models used for forced alignment are trained on the general Mandarin database, whose acoustic characteristics differ somewhat from those of the Hong Kong database. This distorts the forced alignment: often the beginning or the end of the vowel is cut off. In addition, the F0 contour of some voiced consonants also contributes to the tone, but it is cut off because we preserve only the vowel portion. All these cases damage the integrity of the F0 contour.

Second, the F0 contour patterns of the Hong Kong data differ from those of general Mandarin speech in the following ways. In tone-4, F0 rises at the beginning of the contour, as shown in Fig. 3a. In tone-3, the F0 rise at the end of the contour is insufficient, as shown in Fig. 3b. In tone-2, F0 falls at the beginning of the contour, as shown in Fig. 3c.

Fig. 3. Special F0 contour patterns of HK data

These changes can be tolerated by a human expert's evaluation, but they cannot be ignored by the GMM models, which are trained on general Mandarin. Based on these analyses, we propose the following measures.

The first measure avoids forced alignment mistakes: we no longer depend on forced alignment to segment the vowel portions. Forced alignment results are used only to segment syllables in multi-character words. We then compute the F0 contour directly for the entire syllable. Finally, the F0 contour of the syllable is post-processed to exclude the F0 points of unvoiced consonants, breathing noise, or any other noise. The remaining F0 contour is a regular F0 sequence covering exactly the voiced portion of the syllable. An F0 contour post-processing example for the syllable si1 is shown in Fig. 4. The sub-harmonic summation of the voiced segment, which is the part we need, is comparatively high. We find the maximum value of the sub-harmonic summation sequence and, by setting a threshold below this maximum, approximately locate the voiced portion of the syllable as the interval (a, b). Within (a, b), the F0 sequence is then examined more carefully to exclude singular points at the two ends. The portion (c, d) is what finally remains.

[Figure 4: plot for the syllable si1. Horizontal axis: Frame; vertical axis: Frequency/Energy. Curves: F0 and sub-harmonic summation, with the threshold, the interval (a, b), and the F0 contour of the voiced portion (c, d) marked.]

Fig. 4. Post-processing method on F0 contour

The second measure is to use Hong Kong speech data to train the GMM models. The new GMM models can then cover the changes in the F0 contour patterns, so the recognition performance is expected to improve.
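The post-processing illustrated in Fig. 4 can be sketched as follows. The paper gives neither the threshold value nor the exact test used to exclude singular points, so both are assumptions in this sketch: the threshold is taken as a fixed fraction `ratio` of the peak sub-harmonic summation value, and an end point is treated as singular when its F0 differs from its inner neighbour by more than a relative tolerance `max_jump`.

```python
def voiced_interval(shs, ratio=0.5):
    """Return the inclusive interval (a, b) around the peak of the
    sub-harmonic summation sequence where it stays above the threshold.

    ratio is a hypothetical threshold setting, not a value from the paper.
    """
    peak = max(range(len(shs)), key=lambda i: shs[i])
    threshold = ratio * shs[peak]
    a = peak
    while a > 0 and shs[a - 1] >= threshold:
        a -= 1
    b = peak
    while b < len(shs) - 1 and shs[b + 1] >= threshold:
        b += 1
    return a, b


def trim_singular_ends(f0, a, b, max_jump=0.2):
    """Shrink (a, b) to (c, d) by dropping end frames whose F0 jumps
    by more than max_jump relative to the neighbouring inner frame
    (an assumed criterion for singular points)."""
    c, d = a, b
    while c < d and abs(f0[c] - f0[c + 1]) > max_jump * f0[c + 1]:
        c += 1
    while d > c and abs(f0[d] - f0[d - 1]) > max_jump * f0[d - 1]:
        d -= 1
    return c, d


def voiced_f0(f0, shs, ratio=0.5, max_jump=0.2):
    """F0 contour of the voiced portion of one syllable."""
    a, b = voiced_interval(shs, ratio)
    c, d = trim_singular_ends(f0, a, b, max_jump)
    return f0[c:d + 1]


# Toy syllable: noisy F0 estimates at both ends, voiced in the middle.
toy_shs = [5, 8, 60, 90, 100, 95, 80, 55, 10, 4]
toy_f0 = [70, 300, 90, 200, 205, 210, 208, 204, 110, 60]
contour = voiced_f0(toy_f0, toy_shs)
```

Because the interval is grown outward from the peak of the summation sequence, the result does not depend on where the forced alignment placed the vowel boundaries.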

The last measure is to improve the scoring method. The tone scoring method introduced in Section 3 depends too heavily on the accuracy of tone recognition. Given the relatively low tone recognition correct rate on the Hong Kong data, this method is too arbitrary. We therefore adopt another scoring mechanism that uses the observation probabilities computed by the GMM models. We compute the posterior probability of the reference tone (that is, the correct tone of the syllable in the learning text) by Equation 4.

$$P(\mathrm{ref.Tone} \mid F0) = \frac{p(F0 \mid \mathrm{ref.Tone})\,p(\mathrm{ref.Tone})}{p(F0)} = \frac{p(F0 \mid \mathrm{ref.Tone})\,p(\mathrm{ref.Tone})}{\sum_{k=1}^{4} p(F0 \mid \mathrm{Tone}_k)\,p(\mathrm{Tone}_k)} = \frac{p(F0 \mid \mathrm{ref.Tone})}{\sum_{k=1}^{4} p(F0 \mid \mathrm{Tone}_k)} \tag{4}$$

In Equation 4, $p(F0 \mid \mathrm{ref.Tone})$ is the observation probability of the F0 feature under the GMM model of the reference tone; $\sum_{k=1}^{4} p(F0 \mid \mathrm{Tone}_k)$ is the sum of the observation probabilities of the F0 feature under the GMM models of all four tones; and equal priors, $p(\mathrm{ref.Tone}) = p(\mathrm{Tone}_k)$, are assumed. This posterior probability measures how close the pattern of the F0 feature is to the reference tone model, so it can be used for tone scoring directly. By setting two thresholds, we map the posterior probability to a score of 2, 1, or 0. With this new method, even if a tone is wrongly recognized, its posterior probability from Equation 4 may still fall into the correct interval, so a correct score can still be obtained.

4.2 Cope with Tone Changes in Multi-character Words

In continuous speech, the tone of a syllable is often affected by the tones of its neighboring syllables, so its F0 contour differs somewhat from that of the isolated syllable, which is taken as the standard. This leads to tone changes in continuous speech. According to [9], the common tone variations in continuous speech fall mainly into the following categories:

- When two tone-3 syllables are concatenated, the first tone-3 is changed to tone-2; when three tone-3 syllables are concatenated, the first two are changed to tone-2. If tone-3 is not followed by tone-3, it is changed to half tone-3.
- The tone of the character 一 is changed to tone-2 when it is followed by tone-4, and to tone-4 when it is followed by tone-1, tone-2, or tone-3.
- The tone of the character 不 is changed to tone-2 when it is followed by tone-4.

For tone recognition in continuous speech, GMM models that can discriminate these changes are preferable. Because different tone contexts lead to different tone changes, we propose to train fractionized models for each tone, following the idea reported in [8]. Unlike [8], which associates a tone model with a mono-phone or tri-phone, we associate a tone model with its tone context. That is, for each tone, multiple models are trained for multiple tone contexts, so that each model covers the kind of tone change that results from the corresponding tone context. At present our system focuses mainly on evaluating multi-character words, so the bi-tone context is considered first. For each tone, eight models are trained on a database of multi-character words, according to the tone of its predecessor or successor. For all four tones (tone-5 is not included), thirty-two models are trained in total. These fractionized models are expected to give better tone recognition performance on multi-character words.

5 Experiments and Results

5.1 Database of the experiments

There are two kinds of database in our experiments. The first is general Mandarin speech data. We divide this database into two parts: one part, named GM1, contains 37 speakers and 48100 isolated tonal syllables and is used as training data; the other part, named GM2, contains four speakers and 4800 isolated tonal syllables and is used as test data. The second kind of database consists of groups of PSK test samples and many multi-character words spoken by native Hong Kong residents, with a very strong Southern China accent. We divide this database into three parts: HK1 includes 100 PSK test samples; HK2 includes 102 PSK test samples; and HKW consists of 8 speakers and 4000 multi-character words. HK1 and HKW are used as training data, and HK2 is used as test data.

5.2 Feature Selection

Many kinds of features were tested in [8] on a continuous speech database. Feature selection for isolated syllables and words may differ, so we test the following kinds of features on our system.

1. All the F0 and ΔF0 values of the entire vowel portion are used to form the feature.
2. F0 and ΔF0 of 4 equal-length subsections.
3. F0 and ΔF0 of 4 equal-length subsections, plus ΔF0 of the entire vowel segment.
We use the original tone recognition system described in Section 3 to run tests on GM2. The GMM models are trained on GM1. The test results are shown in Table 1.

Table 1. Tone recognition correct rate with different features

| Feature | Feature Dimension | GMM Mixture Number | Correct Rate |
|---|---|---|---|
| Feature 1 | 32 | 4 | 87.1% |
| Feature 2 | 8 | 128 | 88.3% |
| Feature 3 | 9 | 128 | 91.3% |

Our results confirm the conclusion of [8] on an isolated syllable database. The main value and direction of the F0 contour are the most important characteristics for tone recognition; the detailed information is not important. The last kind of feature gives the best performance.

5.3 Tone Recognition of Hong Kong Data

We use the best feature determined by the above experiments to recognize HK2 with the following four systems. The test results are shown in Table 2.

System 1: The original system, with the GMM models trained on GM1.
System 2: The GMM models, which also have 128 Gaussian mixtures, are trained on the isolated syllables of HK1.
System 3: As System 2, but with the forced alignment procedure replaced.
System 4: The isolated syllables of HK1 are used to train the GMM models for recognizing the tones of the isolated syllables of HK2; the multi-character words of HK1 and HKW are used to train fractionized tone models for recognizing the tones of the words of HK2; and the F0 contour of the entire syllable is computed.

Table 2. Correct rate on test data HK2 by different systems

| System | Correct Rate |
|---|---|
| System 1 | 67.9% |
| System 2 | 73.0% |
| System 3 | 75.3% |
| System 4 | 75.5% |

The results demonstrate the effectiveness of the measures proposed in Section 4: all of them improve recognition performance. The correct rate is still not as good as on the general Mandarin database. We think the reasons may be that the training data is insufficient and that the F0 contour patterns cannot be fully covered by only four GMM models.

5.4 Tone scoring and final score computation

The above tone recognition results are then used to score the tones of HK2. Five experiments are carried out. The first three use the original tone scoring method to score the tone recognition results of Systems 1, 2, and 3. The last two use the new tone scoring method introduced in Section 4.1 to score the results of Systems 3 and 4. The tone scoring results are compared with the human expert's evaluation, and the correct rates are shown in Table 3. The new scoring method greatly improves the correct rate.

Finally, the tone scores are integrated with the phonemic pronunciation scores to obtain the final syllable pronunciation scores for HK2. For each test sample, the final scores of all syllables are added together and compared with the human expert's evaluation to calculate the score difference. The average score difference over the 102 samples is also shown in Table 3. As expected, the score difference decreases as the tone scoring correct rate increases.

Table 3. Score differences with different tone scoring correct rates

| System | Tone Scoring Method | Tone Scoring Correct Rate | Average Score Difference |
|---|---|---|---|
| System 1 | Based on Rec. Results | 60.2% | 16.77 |
| System 2 | Based on Rec. Results | 69.4% | 14.32 |
| System 3 | Based on Rec. Results | 72.2% | 11.10 |
| System 3 | Based on Posterior Probability | 83.1% | 6.43 |
| System 4 | Based on Posterior Probability | 83.3% | 6.34 |

6 Conclusions

Our tone scoring system was originally built for general Mandarin speech data. When it is applied to accented speech data (such as the HK-accented speech data), many problems arise and the scoring performance drops sharply. Analyses show that the characteristics of HK-accented speech are clearly different from those of general Mandarin speech. Based on this, several solutions to the problems are proposed. The main idea of our solutions is to adapt the scoring algorithm to the special test data. We compute the F0 contour for the entire syllable to avoid forced alignment errors; we design a new tone scoring method to tolerate tone recognition mistakes; and we use data with the same accent to train the GMM models to cover the changes in the F0 contour patterns of the Hong Kong data. The experimental results show that our solutions are effective.
The tone scoring correct rate improves from 60.2% to 83.3%, and the final average score difference between our scoring system and the human expert's evaluation decreases from 16.77 to 6.34. Another major problem in tone recognition is the tone changes in multi-character words. We attempt to train fractionized bi-tone models for the different tone contexts to cover those changes. Although the performance improvement is not yet very large, we believe this is mainly due to insufficient speech data and that the algorithm is promising. Further research is continuing.
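As an illustration of the posterior-based tone scoring summarized above (Equation 4 in Section 4.1), the following sketch computes the posterior of the reference tone from the four per-tone log-likelihoods under equal priors and maps it to a score with two thresholds. It is a sketch under stated assumptions, not the authors' code: working in the log domain is an implementation choice, and the threshold values and the toy likelihoods are hypothetical, since the paper does not report its thresholds.

```python
import math


def reference_tone_posterior(log_likelihoods, reference_tone):
    """Equation 4 with equal tone priors.

    log_likelihoods[k] is log p(F0 | Tone_k) from the GMM of tone k.
    """
    peak = max(log_likelihoods.values())
    total = sum(math.exp(v - peak) for v in log_likelihoods.values())
    return math.exp(log_likelihoods[reference_tone] - peak) / total


def score_by_posterior(posterior, low=0.2, high=0.45):
    """Map the posterior to 0, 1 or 2 (threshold values are hypothetical)."""
    if posterior >= high:
        return 2
    if posterior >= low:
        return 1
    return 0


# Toy case: tone 2 is the most likely tone, but the reference tone is 3.
toy_log_likelihoods = {
    1: math.log(0.02),
    2: math.log(0.05),
    3: math.log(0.04),
    4: math.log(0.01),
}
posterior_ref = reference_tone_posterior(toy_log_likelihoods, reference_tone=3)
tone_score = score_by_posterior(posterior_ref)
```

In this toy case the recognition-based rule of Section 3 would give 0 because tone 2 is recognized instead of tone 3, whereas the posterior of the reference tone is still about one third and can fall into a non-zero score interval.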

References

1. J. Bernstein, M. Cohen, H. Murveit, D. Rtischev, and M. Weintraub, Automatic Evaluation and Training in English Pronunciation, ICSLP 1990, Kobe, Japan.
2. L. Neumeyer, H. Franco, M. Weintraub, and P. Price, Automatic Text-Independent Pronunciation Scoring of Foreign Language Student Speech, Proc. of ICSLP 96, pp. 1457-1460, Philadelphia, Pennsylvania, 1996.
3. L. Neumeyer, H. Franco, V. Digalakis, and M. Weintraub, Automatic Scoring of Pronunciation Quality, Speech Communication, Volume 30, Issues 2-3, February 2000, pp. 83-93.
4. H. Franco, L. Neumeyer, V. Digalakis, and O. Ronen, Combination of machine scores for automatic grading of pronunciation quality, Speech Communication, volume 30, 2000.
5. Jiang-Chun Chen, Jyh-Shing Roger Jang, Jun-Yi Li, and Ming-Jun Wu, Automatic Pronunciation Assessment for Mandarin Chinese, IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, June 2004.
6. Yang Wu-Ji, Jyh-Chyang Lee, Yueh-chin Chang, and Hsiao-Chuan Wang (1988), Hidden Markov Model for Mandarin Lexical Tone Recognition, IEEE Transactions on Acoustics, Speech, & Signal Processing, vol. ASSP-36, no. 7, pp. 989-992.
7. Dik Hermes, Measurement of pitch by subharmonic summation, Journal of the Acoustical Society of America, AM 83(1), Jan. 1988, pp. 257-264.
8. Ye Tian, Jianlai Zhou, Min Chu, and Eric Chang, Tone Recognition with Fractionized Models and Outlined Features, Proc. of ICASSP 2004, Montreal, pp. I-105~I-108.
9. Wu Zongji, The tone variation in Mandarin, Chinese Grammar, 1982, No. 6, pp. 439-449.