An Effective Combination of Different Order N-grams

Sen Zhang, Na Dong
Speech Group, INRIA-LORIA, B.P. 101, 54602 Villers-lès-Nancy, France
zhangsen@yahoo.com

Abstract

In this paper an approach is proposed for combining N-grams of different orders based on a discriminative estimation criterion under which the n-gram parameters can be optimized. To increase the power of modeling language information, we propose several schemes for combining conventional n-gram language models of different orders. We employ the Newton gradient method to estimate the assumption probabilities and then test the optimally selected language model. We conduct experiments on the task of conversion from Chinese pinyin to Chinese characters. The experimental results show that the memory capacity of the language model can be lowered remarkably with little loss of accuracy.

1. Introduction

In the Chinese natural language processing domain, the parameters of an n-gram can be estimated by counting the frequencies of word pairs in a text corpus and then normalizing these frequencies. This conventional language model cannot satisfy our requirements, since its estimation is not directly related to discriminative capability. We propose a discriminative estimation approach, which directly relates the estimation of the n-gram parameters to their discriminative capability. We optimize the n-gram parameters under the discriminative estimation criterion using the Newton gradient method.

When we build an N-gram we artificially introduce an assumption about the relationship among adjacent words. The uni-gram is based on the assumption that all words in the corpus appear independently. The bi-gram assumes that only contiguous words correlate with each other, and the tri-gram constrains the language information so that one word can be predicted only from its two predecessor words. Some words are free of context and some depend on short or long history information, depending on the circumstances. In this sense a single N-gram can model the language phenomena only with some compromise. This paper addresses the impact of the different assumptions behind different order n-grams on the performance of the language model, and proposes the combination and optimal selection of different n-grams to cope with these artificial assumptions and the possible data sparsity problem.

In the following sections, we first introduce the discriminative estimation criterion. Next we describe the assumptions underlying N-grams. We then introduce the schemes for combining N-grams of different orders, followed by an approach to the optimal selection of different order language models. Finally, we report experimental results on the conversion from Chinese pinyin to Chinese characters.

2. Discriminative Estimation

In natural language processing, the statistical language model has proved to be a successful method, and great effort has been devoted to building n-gram language models. (R. Isotani, 1994) reports results of research on a stochastic language model with local and global language information. An N-gram can be viewed as a Markov chain model, so maximum likelihood estimation can be applied to n-grams in a way similar to the estimation of HMM parameters (L. Bahl, 1983). At the same time, such estimation of the n-gram parameters is not necessarily related to discriminative capability. Discriminative capability means the power of an n-gram to give correct results a higher score or probability than wrong results. In speech recognition, discriminative training of HMMs has been proposed from the viewpoint of pattern recognition (P. Chang, 1993; W. Chou, 1995). In this training approach, a high recognition rate on the training set is the objective. The complicated objective function results in complex formulas and an insupportable computational cost for parameter estimation. To simplify the estimation criterion, we introduce the following objective function:

F(\theta) = \frac{P(W_c \mid \theta)}{\sum_{W} P(W \mid \theta)}

where W_c and W denote the correct word string and all possible word strings, respectively. We can estimate the parameters of the n-gram using the Newton gradient method. For the parameters \theta_i appearing in the numerator,

\theta_i \leftarrow \theta_i + \alpha \, \frac{1}{\sum_{W} P(W \mid \theta)} \, \frac{\partial P(W_c \mid \theta)}{\partial \theta_i}        (1)

For the parameters \theta_k appearing in the denominator,

\theta_k \leftarrow \theta_k - \alpha \, \frac{P(W_c \mid \theta)}{\big(\sum_{W} P(W \mid \theta)\big)^2} \, \frac{\partial \sum_{W} P(W \mid \theta)}{\partial \theta_k}        (2)

where \alpha denotes the step length. The formulas above show that discriminative estimation increases the values of the parameters in the numerator, that is, those of the correct word string, and decreases the values of the parameters in the denominator.
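To make the update rule concrete, the following is a minimal sketch (not the paper's implementation) of one discriminative step on a toy bi-gram table: it computes the objective F = P(W_c)/Σ_W P(W) over a small candidate list and nudges each parameter along the quotient-rule gradient, which combines the numerator update (1) and the denominator update (2). The word lists, probability values, step length and flooring constant are all illustrative assumptions.

```python
# Minimal sketch of one discriminative update step on a toy bi-gram table.
# All word lists, probabilities and constants below are illustrative assumptions.
bigram = {
    ("he", "reads"): 0.6, ("he", "red"): 0.4,
    ("reads", "books"): 0.7, ("red", "books"): 0.3,
}

def string_prob(words, table):
    """P(W | omega_2): product of bi-gram parameters along the word string."""
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= table.get(pair, 1e-6)          # small floor for unseen pairs
    return p

def discriminative_step(correct, candidates, table, alpha=0.05):
    """One gradient step on F = P(W_c) / sum_W P(W): parameters used by the
    correct string are raised (formula 1), those used mainly by competing
    strings are lowered (formula 2)."""
    num = string_prob(correct, table)
    denom = sum(string_prob(c, table) for c in candidates)

    def dP(words, key):
        # Derivative of a product of parameters w.r.t. one parameter value.
        uses = sum(1 for pair in zip(words, words[1:]) if pair == key)
        return uses * string_prob(words, table) / table[key]

    grads = {key: dP(correct, key) / denom
                  - num * sum(dP(c, key) for c in candidates) / denom ** 2
             for key in table}
    for key, g in grads.items():            # apply updates after all gradients are computed
        table[key] = min(1.0, max(1e-6, table[key] + alpha * g))
    return num / denom                      # objective value before the update

correct = ["he", "reads", "books"]
candidates = [correct, ["he", "red", "books"]]
print(discriminative_step(correct, candidates, bigram))
```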

3. Assumption of N-Gram Model

As described above, an N-gram language model introduces an assumption about the mutual information carried by contiguous words. For the uni-gram we assume that adjacent words are independent of each other and that a word string is made up of words without any mutual information. Given the confidence in this assumption and the probability of words conditioned on it, the conventional n-gram language model can be evaluated by the following conditional probability:

P(w_1^n \mid \omega_1) = \prod_i p(w_i \mid \omega_1)

where \omega_1 denotes the assumption that words are independent. For the bi-gram it is assumed that two contiguous words carry some language information, which can be modeled by the conditional probability that one word follows a specific word:

P(w_1^n \mid \omega_2) = \prod_i p(w_{i+1} \mid w_i, \omega_2)

where \omega_2 denotes the assumption that only two adjacent words are dependent. The assumption behind the tri-gram is similar to that behind the bi-gram, but three adjacent words are taken into account in the conditional probability:

P(w_1^n \mid \omega_3) = \prod_i p(w_{i+1} \mid w_i, w_{i-1}, \omega_3)

where \omega_3 denotes the assumption that one word is related to its two predecessor words.

In speech dictation, we generally obtain N-best candidates using the bi-gram and then rescore them with the tri-gram. We notice that the tri-gram cannot reach beyond the N-best candidates, outside of which correct results may lie. To make full use of the power of the different order N-grams, it is important to combine them in a single pass instead of two-pass or multi-pass processing.

4. Combination of Different N-Gram Models

We can obtain the probabilities of a word string conditioned on the different assumptions using traditional n-gram language models. In order to calculate the probability that a word string is generated regardless of any particular assumption, we introduce the probability that each assumption holds and merge it with the probabilities of the word string conditioned on the different assumptions. We employ n-grams of different orders to analyze a sentence and then combine the analysis results. For each n-gram, a single assumption describes the relationship among the words of the whole sentence, so we refer to this as sentence-level analysis:

P(w_1^n) = \sum_i P(w_1^n \mid \omega_i) \, P(\omega_i)        (3)

where \omega_i, P(w_1^n \mid \omega_i) and P(\omega_i) denote assumption i, the conditional probability of the sentence under assumption i, and the probability of assumption i, respectively. Here P(\omega_i) means the probability that assumption \omega_i is true.
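As a concrete reading of formula (3), the sketch below scores one sentence under the uni-gram assumption ω_1 and the bi-gram assumption ω_2 and mixes the two scores with assumption probabilities P(ω_i); the probability tables and the assumption weights are invented for illustration, not estimated from the paper's corpus.

```python
# Sketch of sentence-level combination (formula 3); the probability tables and
# the assumption weights P(omega_i) are invented for illustration.
unigram = {"he": 0.3, "reads": 0.2, "books": 0.1}
bigram = {("he", "reads"): 0.6, ("reads", "books"): 0.7}
assumption_prob = {"omega1": 0.3, "omega2": 0.7}     # P(omega_i)

def p_given_omega1(words):
    """P(w_1^n | omega_1): all words are treated as mutually independent."""
    p = 1.0
    for w in words:
        p *= unigram.get(w, 1e-6)
    return p

def p_given_omega2(words):
    """P(w_1^n | omega_2): only adjacent words are dependent (bi-gram factors)."""
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= bigram.get(pair, 1e-6)
    return p

def sentence_prob(words):
    """Formula 3: P(w_1^n) = sum_i P(w_1^n | omega_i) * P(omega_i)."""
    return (p_given_omega1(words) * assumption_prob["omega1"]
            + p_given_omega2(words) * assumption_prob["omega2"])

print(sentence_prob(["he", "reads", "books"]))        # mixed sentence score
```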

In practice, we can apply the assumptions of the different n-grams at the word level. When we process the next word following a sub-string, we can view this word as the production of the different n-grams and merge their results. More precisely,

P(w_1^n, w_{n+1}) = \sum_i P(w_1^n) \, P(w_{n+1} \mid w_n \ldots w_{n-m}, \omega_i) \, P(\omega_i)        (4)

where w_1^n, w_{n+1}, \omega_i, P(w_1^n), P(w_{n+1} \mid w_n \ldots w_{n-m}, \omega_i) and P(\omega_i) denote the word string containing n words, the (n+1)-th word, assumption i, the probability of the word string, the conventional n-gram, and the probability of assumption i, respectively.

In formula (4) we take the probability of an assumption into account independently of the specific words. In fact, whether an assumption holds or not strongly depends on the context. We therefore introduce word-specific assumption probabilities into formula (4) and calculate the probability of the word sub-string with formula (5):

P(w_1^n, w_{n+1}) = \sum_i P(w_1^n) \, P(w_{n+1} \mid w_n \ldots w_{n-m}, \omega_i) \, P(\omega_i \mid w_n \ldots w_{n-m})        (5)

where P(\omega_i \mid w_n \ldots w_{n-m}) is the probability of assumption \omega_i given the word context. In order to reduce the computational complexity of the sentence probability, we propose that the probability of the sub-string be merged regardless of the history information and of the assumptions used, as indicated in the following expression:

P(w_1^n, w_{n+1}) \approx P(w_1^n) \sum_i P(w_{n+1} \mid w_n \ldots w_{n-m}, \omega_i) \, P(\omega_i \mid w_{n+1}, w_n \ldots w_{n-m})        (6)
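To illustrate the difference between formulas (4) and (5), the sketch below extends a sub-string by one word, once with word-independent assumption probabilities P(ω_i) and once with context-dependent probabilities P(ω_i | w_n ... w_{n-m}); all tables, the context length and the numerical values are assumptions made for the example.

```python
# Sketch of word-level combination: formula 4 uses global assumption probabilities,
# formula 5 conditions them on the word context. All tables and values are
# illustrative assumptions, not estimates from the paper's corpus.
unigram = {"books": 0.10, "box": 0.05}
bigram = {("reads", "books"): 0.7, ("reads", "box"): 0.1}
global_assump = {"omega1": 0.3, "omega2": 0.7}                     # P(omega_i)
context_assump = {("reads",): {"omega1": 0.1, "omega2": 0.9}}      # P(omega_i | context)

def extend_word_independent(prefix_prob, context, word):
    """Formula 4: P(w_1^n, w_{n+1}) = sum_i P(w_1^n) P(w_{n+1} | context, omega_i) P(omega_i)."""
    mix = (unigram.get(word, 1e-6) * global_assump["omega1"]
           + bigram.get((context[-1], word), 1e-6) * global_assump["omega2"])
    return prefix_prob * mix

def extend_word_specific(prefix_prob, context, word):
    """Formula 5: the assumption probability depends on the preceding words."""
    weights = context_assump.get(tuple(context[-1:]), global_assump)
    mix = (unigram.get(word, 1e-6) * weights["omega1"]
           + bigram.get((context[-1], word), 1e-6) * weights["omega2"])
    return prefix_prob * mix

prefix_prob, context = 0.18, ["he", "reads"]   # P(w_1^n) of the sub-string so far
print(extend_word_independent(prefix_prob, context, "books"))
print(extend_word_specific(prefix_prob, context, "books"))
```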

5. Optimal Selection of N-grams

To reduce the memory occupancy, we design some schemes to optimally select elements from the different order N-grams:

1. If the probability of one assumption for a word pair is close to 1.0, select that element.
2. If the probabilities of several assumptions for a word pair are comparable, choose the element with the greatest conditional probability. If the conditional probabilities are also very close, choose the simplest element.
3. If the difference between the assumption probabilities is remarkable but none of them is close to 1.0, choose the elements with larger conditional probability and assumption probability.

6. Experiments

We conduct several experiments using the tagged text corpus of Peking University, which contains one million characters and covers political materials, novels, technical papers, grammar papers and so on. The N-grams are built on this corpus, and the sparse data problem is serious due to the limited size of the corpus. We select 200 sentences for evaluation, which can be successfully processed by the N-grams. The total number of characters in the test set is 3,335. The Chinese pinyin of a sentence is a character stream without segmentation or tone information. We use dynamic programming (DP) to perform the conversion from Chinese pinyin to Chinese characters. We use the uni-gram and the bi-gram to test the effectiveness of the proposed approaches. The following experimental results show that the combination of different N-grams can raise the conversion rate remarkably.

Table 1. Conversion result (D = deletion, S = substitution, I = insertion)

  N-gram      D    S    I
  Uni-gram    2  388    0
  Bi-gram     0   29    0
  Formula 3   0   27    0
  Formula 4   0   24    0
  Formula 5   0   13    0
  Hybrid      0   15    0

[Fig. 1. Correctness rate and error reduction rate]

From Fig. 1 and Table 1, we can see that the error rate is reduced by 6.9% when we adopt the assumption probabilities at the sentence level. The word-specific assumption probabilities give an error rate reduction of 55.2%, while the word-independent assumption probabilities decrease the error rate by 17.2%. If we build the hybrid language model by optimally selecting elements from the different N-grams, the accuracy is lowered from 99.6% to 99.5%, but the memory capacity of the new language model is decreased remarkably, by 30%.
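The quoted reductions can be re-derived directly from the substitution counts in Table 1 (errors = D + S + I, 3,335 test characters); the short computation below reproduces the 6.9%, 17.2% and 55.2% figures and the drop from roughly 99.6% to 99.5% correctness for the hybrid model.

```python
# Re-derive the figures quoted above from Table 1 (errors = D + S + I, 3,335 characters).
total = 3335
errors = {"Uni-gram": 2 + 388, "Bi-gram": 29, "Formula 3": 27,
          "Formula 4": 24, "Formula 5": 13, "Hybrid": 15}
baseline = errors["Bi-gram"]
for name, e in errors.items():
    correctness = 100.0 * (total - e) / total
    reduction = 100.0 * (baseline - e) / baseline
    print(f"{name:10s}  correctness {correctness:5.1f}%   error reduction vs bi-gram {reduction:7.1f}%")
# Formula 3 -> 6.9%, Formula 4 -> 17.2%, Formula 5 -> 55.2%; the hybrid model's
# correctness is about 99.5% against about 99.6% for Formula 5, matching the text.
```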

7. Conclusion

This paper reports the results of research on the combination of different order N-grams for the conversion from Chinese pinyin to Chinese characters. We propose three schemes to achieve the combination by introducing assumption probabilities, which can be applied at the sentence level, at the word level, and in a word-specific manner, respectively. The experimental results show that the error rate of the conversion can be decreased remarkably.

8. References

L. Bahl, F. Jelinek, R. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Trans. on PAMI, PAMI-5(2), pp. 179-190, 1983.
R. Isotani, S. Matsunaga. A stochastic language model for speech recognition integrating local and global constraints. ICASSP'94, pp. 5-8, 1994.
P. Chang, B. Juang. Discriminative training of dynamic programming based speech recognizers. IEEE Trans. on Speech and Audio Processing, Vol. 1(2), pp. 135, 1993.
W. Chou, B. Juang. Signal conditioned minimum error rate training. Eurospeech'95, pp. 495, 1995.