MODIFIED WEIGHTED LEVENSHTEIN DISTANCE IN AUTOMATIC SPEECH RECOGNITION


Krynica, 14th-18th September 2010

Bartosz Ziółko, Jakub Gałka, Dawid Skurzok, Tomasz Jadczyk
Department of Electronics, AGH University of Science and Technology
al. Mickiewicza 30, 30-059 Kraków
{bziolko,jgalka,jadczyk}@agh.edu.pl, skurzok@gmail.com

ABSTRACT

The paper presents modifications of the well-known Levenshtein metric. The suggested improvements yield better automatic speech recognition when the Levenshtein metric is applied to compare dictionary words with speech recognition hypotheses: they allow the hypotheses to be evaluated and the word that was actually spoken to be chosen.

INTRODUCTION

An automatic speech recognition (ASR) system needs several layers to work efficiently. One of them is responsible for choosing a word from phoneme hypotheses. Our acoustic recognition is based on non-uniform phoneme segmentation and the Levenshtein distance [1] (also known as the edit metric) from a sequence of phoneme hypotheses to the phonetic transcription of a word stored in a selected dictionary. In our solution this component replaces the word decoder based on hidden Markov models (HMMs) [2] commonly used in standard speech recognition systems [3]. The phoneme segmentation methods are already established and described [4, 5]. The acoustic classifier provides a list of likelihood-ranked phoneme hypotheses with probabilities for each frame. This ranking is used by the algorithm described in this paper to calculate a modified weighted Levenshtein distance, which compares a phoneme sequence hypothesis with a word from a dictionary. Finally, the N-best list of word hypotheses is chosen.

The paper is organised as follows. The second section describes the standard weighted Levenshtein distance as it is commonly used in ASR. The third section presents our modification and how it can be applied in a word decoder.
The fourth section describes the experiment and results on speech recognition. The paper is summed up with conclusions.

LEVENSHTEIN DISTANCE

The Levenshtein distance is frequently used in ASR [6, 7]. It measures the number of differences between two sequences of well-defined symbols or characters (letters, for example). The Levenshtein distance between two sequences is the minimum number of operations needed to transform one sequence into the other, where the allowed operations are insertion, deletion, or substitution of a single symbol. In our case different operations have different weights, derived from the likelihoods of the characters being modified. Our modification of this measure is described in the next section.

Figure 1. Phonetic hypotheses for the words Józef and Jerzy. The correct transcriptions are /j u z e f/ and /j e Z y/ respectively. There are 3 phoneme hypotheses in each column, with the most probable one on top. Errors are easily corrected if secondary and tertiary hypotheses are applied.

The weighted Levenshtein distance (WLD) between words A and B is

WLD(A, B) = \min_w \{ \alpha r(w) + \beta i(w) + \chi d(w) \},   (1)

where \alpha, \beta and \chi are fixed weights (operation costs), w is a sequence of operations which changes A into B, r is the number of replacements, i the number of insertions and d the number of deletions. In ASR, typically A is a hypothesis and B is a phonetic transcription from a dictionary.

SPEECH MODELLING WITH MODIFIED WEIGHTED LEVENSHTEIN DISTANCE

The WLD is calculated on phonetic transcriptions (see Fig. 1). A hypothesised sequence from the phoneme classifier is compared with words taken from a dictionary, which can be of any finite size. The values of the weights of the different operations are a very important detail from an application point of view. In our case they were optimised to maximise the percentage of correct recognitions. First, the Hooke-Jeeves optimisation method [8] was applied; however, it ran into too many local minima. For this reason a much slower but more accurate method was used, based on choosing a grid in the space of possible parameters: each point of the grid was checked and the best one was chosen as the parameter set. This does not guarantee finding the global minimum, but it yields a set of parameters very close to it.

We assumed that the modification cost depends on the observed data: the weights depend on features and outputs of the classification algorithm. The acoustic classifier provides a list of the best N phoneme hypotheses in each time frame k = 1..K, with probabilities p_{nk}, n = 1..N. A substitution cost l_{nk} is higher if the substituting phoneme is positioned further down the list of hypotheses. It is calculated as

l_{nk} = \delta [\ln(p_{1k}) - \ln(p_{nk})],   (2)

where \delta is a parameter. All probability values in the system are stored as natural logarithms to simplify computations. The insertion cost h_k = -\ln(p_{ins}) = const can be described as the cost of performing a p_{ins}-probable insertion operation at any k-th position. This probability can be derived either empirically or from the speech-frame versus phoneme rate (the undersegmentation rate). Each deletion cost g_k can be described as the cost of deleting the k-th segment, classified as a particular phoneme with maximal probability p_{1k}:

g_k = \ln(p_{1k}) - \ln(p_{del}),   (3)

where p_{del} is an empirically optimised deletion probability. It was found experimentally (by checking results for various N) that N = 5 gives the best performance for the given acoustic classifier. The other parameter, \delta = 10, was found using the grid optimisation method. If the substituting phoneme is not on the list of N hypotheses (i.e. n > N), l_{nk} = 1 is taken arbitrarily. The modified weighted Levenshtein distance (MWLD) is then

MWLD(A, B) = \min_w \{ \alpha \sum_{k=1}^{K} l_{nk} r_k(w) + \beta \sum_{k=1}^{K} h_k i_k(w) + \chi \sum_{k=1}^{K} g_k d_k(w) \},   (4)

where r_k(w) = 1 if there is a substitution at the k-th position in the sequence w of operations changing A into B and r_k(w) = 0 otherwise, and i_k(w), d_k(w) likewise mark insertions and deletions. The parameters \alpha = 3, \beta = 3 and \chi = 1.9 weight the overall replacement, insertion and deletion costs.
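Under the definitions above, the MWLD can be computed with the usual edit-distance dynamic programme, substituting the per-frame costs l_{nk}, h_k and g_k for the unit costs. The following Python sketch is illustrative only: the function name, the data layout for the N-best lists, and the p_ins/p_del values are assumptions, not taken from the paper.

```python
import math

def mwld(frames, word, delta=10.0, alpha=3.0, beta=3.0, chi=1.9,
         p_ins=0.1, p_del=0.1):
    """Modified weighted Levenshtein distance (MWLD).

    frames -- list of K per-frame N-best lists; each list holds
              (phoneme, probability) pairs, most probable first.
    word   -- target phonetic transcription (list of phonemes).
    delta, alpha, beta, chi follow the paper; p_ins and p_del are
    illustrative placeholder probabilities.
    """
    K, L = len(frames), len(word)
    h = -math.log(p_ins)  # constant insertion cost h_k, eq. for p_ins

    def sub_cost(k, phone):
        # Substitution cost l_nk, eq. (2); 1.0 if phone is not in the N-best
        p1 = frames[k][0][1]
        for ph, p in frames[k]:
            if ph == phone:
                return delta * (math.log(p1) - math.log(p))
        return 1.0

    def del_cost(k):
        # Deletion cost g_k, eq. (3)
        return math.log(frames[k][0][1]) - math.log(p_del)

    # D[k][l]: minimal cost of aligning the first k frames with the
    # first l phonemes of the dictionary word.
    D = [[0.0] * (L + 1) for _ in range(K + 1)]
    for k in range(1, K + 1):
        D[k][0] = D[k - 1][0] + chi * del_cost(k - 1)
    for l in range(1, L + 1):
        D[0][l] = D[0][l - 1] + beta * h
    for k in range(1, K + 1):
        for l in range(1, L + 1):
            D[k][l] = min(
                D[k - 1][l - 1] + alpha * sub_cost(k - 1, word[l - 1]),
                D[k][l - 1] + beta * h,               # insertion
                D[k - 1][l] + chi * del_cost(k - 1),  # deletion
            )
    return D[K][L]
```

Note that a frame whose top hypothesis matches the target phoneme contributes zero substitution cost (eq. (2) with n = 1), so a perfectly recognised sequence has distance 0 to its transcription.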
It is important to maintain a proper ratio of these costs,

\beta h_k + \chi g_k > \alpha l_{nk},   (5)

so that each substitution is more likely than a deletion-insertion sequence with the same output, and

\chi g_k > -\alpha \ln(p_{1k}).   (6)

EXPERIMENT AND RESULTS

Several experiments were conducted to check various sets of parameters and options. The tests were executed on 100 recordings different from those used for acoustic classifier training. The speaker was also different; however, speaker adaptation was performed. Each recording was a complete sentence, and the phonetic transcriptions of these sentences were used as dictionary entries. For each recording the described algorithm was applied and the evaluations described below were calculated. Apart from MWLD, the dynamic time warping (DTW) method [2] was applied for comparison.

The results were evaluated in four different ways to compare several parameters and strategies. The first evaluation criterion is the percentage of correctly recognised sentences. The second is the percentage of recordings for which the correct sentence is on the list of the five strongest hypotheses. The third is the average ranking of the correct sentence on the list of all hypotheses. The fourth is a distinction factor

d_f = \frac{\frac{1}{M} \sum_{m=1}^{M} \ln(p(A = B_m))}{\ln(p(A = B_c))},   (7)

where M is the size of the dictionary, B_m is the m-th word of the dictionary, B_c is the correct recognition, and p(A = B_m) is the probability that sequence A is word B_m.

Figure 2. Percentage of correct sentences in the 5-best list of hypotheses depending on the value of \delta.

Table 1. Recognition results

method | perfect recognition | in 5-best | average ranking | distinction factor
DTW    | 70%                 | 87%       | 6.7             | 1.03
MWLD   | 85%                 | 94%       | 3.5             | 1.4

Table 1 shows that the MWLD method outperformed DTW on all four evaluation criteria for the test corpus. MWLD is tuned specifically to ASR tasks, so the results are not surprising, even though DTW is a well-balanced method that is also used in applications. Fig. 2 shows the influence of the parameter \delta on recognition (the percentage of test examples for which the correct sentence is in the 5-best list of hypotheses); \delta = 10 leads to the best recognition rate. This parameter sets how strongly the substitution cost depends on the position of the correct phoneme hypothesis in the list of phoneme hypotheses for the particular time frame, and is one of our major modifications of the standard WLD.

CONCLUSIONS

The presented MWLD is a very good method for comparing acoustic hypotheses of a speech recognition system with words from a dictionary. It allows distances between words to be calculated so as to maximise the number of correct recognitions. In this way, speech decoding reaches 85% accuracy on an average-dictionary ASR task.

ACKNOWLEDGEMENTS

This work was supported by MNiSW grant OR00001905.
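The distinction factor (7) is straightforward to compute once per-word log-probabilities are available (the system stores probabilities as natural logarithms throughout). A small illustrative sketch; the function name and interface are hypothetical:

```python
def distinction_factor(logprobs, correct_idx):
    """Distinction factor d_f, eq. (7): the mean log-probability over
    the whole dictionary divided by the log-probability of the correct
    entry.  logprobs[m] = ln p(A = B_m); correct_idx indexes B_c.
    Since log-probabilities are negative, values above 1 mean the
    correct word stands out from the dictionary average."""
    mean_lp = sum(logprobs) / len(logprobs)
    return mean_lp / logprobs[correct_idx]
```

A larger d_f indicates a sharper separation between the correct word and the rest of the dictionary, which is how the 1.03 (DTW) versus 1.4 (MWLD) figures in Table 1 should be read.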

REFERENCES

[1] V. I. Levenshtein: Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10 (1966), 707-710.
[2] L. Rabiner and B.-H. Juang: Fundamentals of Speech Recognition, PTR Prentice-Hall, Inc., 1993.
[3] S. Young, G. Evermann, M. Gales, Th. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland: The HTK Book, Cambridge University Engineering Department, 2005.
[4] J. Gałka and M. Ziółko: Wavelets in Speech Segmentation, Proceedings of the 14th IEEE Mediterranean Electrotechnical Conference MELECON 2008, Ajaccio (2008).
[5] B. Ziółko, S. Manandhar and R. C. Wilson: Phoneme Segmentation of Speech, Proceedings of the 18th International Conference on Pattern Recognition (2006).
[6] J. Wu and S. Khudanpur: Efficient training methods for maximum entropy language modelling, Proceedings of the 6th International Conference on Spoken Language Technologies (ICSLP-00) (2000).
[7] J.-T. Chien, C.-H. Huang, K. Shinoda and S. Furui: Towards Optimal Bayes Decision for Speech Recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2006).
[8] C. T. Kelley: Iterative Methods for Optimization, SIAM, 1999.