Automatic Phonetic Alignment and Its Confidence Measures

Sérgio Paulo and Luís C. Oliveira

L2F Spoken Language Systems Lab., INESC-ID/IST
Rua Alves Redol 9, 1000-029 Lisbon, Portugal
{spaulo,lco}@l2f.inesc-id.pt
http://www.l2f.inesc-id.pt

Abstract. In this paper we propose the use of an HMM-based phonetic aligner together with a speech-synthesis-based one to improve the accuracy of the global alignment system. We also present a phone-duration-independent measure to evaluate the accuracy of automatic annotation tools. In the second part of the paper we propose and evaluate some new confidence measures for phonetic annotation.

1 Introduction

The flourishing number of spoken language repositories has pushed speech research forward in multiple ways. Many of the best speech recognition systems rely on models trained on very large speech databases. Natural prosody generation for speech synthesis is another important research topic that nowadays draws on large amounts of speech data. These repositories have enabled the development of many corpus-based speech synthesizers in recent years, but they must be phonetically annotated with a high level of precision. Manual phonetic annotation, however, is a very time-consuming task, and several approaches have been taken to automate the process. Although state-of-the-art segmentation tools can achieve very accurate results, there are always some uncommon acoustic realizations, or some kind of noise, that can badly damage the segmentation performance for a particular file. With the increasing size of speech databases, manual verification of every utterance is becoming unfeasible; thus, confidence scores must be computed to detect possible bad segmentations within each utterance. The goals of this work are the development of a robust phonetic annotation system, with the best possible accuracy, and the development and evaluation of confidence measures for the phonetic annotation process.

This paper is divided into four sections. Section 2 describes the development of the proposed phonetic aligner, Section 3 describes and evaluates the proposed confidence measures, and the last section presents the conclusions.

2 Automatic segmentation approaches

Automatic phonetic annotation consists of two major steps: the determination of the utterance's phone sequence (the sequence actually produced by the speaker during the recording), and the temporal location of the segment boundaries (phonetic alignment). Several phonetic alignment methods have been proposed, but the most widely explored techniques are based either on Hidden Markov Models (HMMs) used in forced alignment mode [1] or on dynamic time alignment with synthesized speech [2]. These two techniques owe their popularity to their robustness and accuracy, respectively.

An HMM-based aligner consists of a finite state machine with a set of state occupancy probabilities at each time instant and a set of inter-state transition probabilities. These probabilities are estimated from manually or automatically segmented data (the training data). Speech-synthesis-based aligners, on the other hand, rely on a technique used in the early days of speech recognition. A synthetic speech signal is generated with the expected phonetic sequence and known segment boundaries. Spectral features are then computed from both the recorded and the generated speech signals, and finally the Dynamic Time Warping (DTW) algorithm [3] is applied to find the alignment path along which the spectral features of the two signals best match. The segment boundaries of the reference signal are mapped onto the recorded signal through this alignment path.

A comparison between HMM-based and speech-synthesis-based segmentation [4] has shown that the speech-synthesis-based segmentation is in general (about 70% of the time) more accurate than the HMM-based one, but it tends to produce a few large boundary errors: when it fails, it fails badly. In other words, HMM-based phonetic aligners are more reliable. The lack of robustness of speech-synthesis-based aligners, together with their better boundary location accuracy, suggested the development of a hybrid system: one as accurate as the speech-synthesis-based aligner and as robust as the HMM-based ones.

2.1 Speech-synthesis-based phonetic aligners

The first conclusion drawn from the use of some common speech-synthesis-based aligners is that no single acoustic feature proves to be equally good at locating the boundaries of every kind of phonetic segment. For instance, although energy is in general a good feature for locating the boundary between a vowel and a stop consonant, it performs poorly at locating the boundary between two vowels. Thus, experiments were performed with multiple acoustic features and multiple segment transitions, to find the best acoustic features for locating the boundaries between each different pair of phonetic segments. This acoustic feature selection considerably increased the robustness of the resulting aligner. The reference speech signal was generated by the Festival Speech Synthesis System [5] using a Portuguese voice recorded at our lab. A detailed description of this work can be found in [6].
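As an illustration of the DTW step, the following is a minimal sketch of boundary mapping, assuming ref_feats and rec_feats are (frames × dims) spectral feature matrices for the synthetic and recorded signals and ref_bounds holds the known boundary frame indices of the synthetic signal. It is a plain single-distance DTW, not the per-transition acoustic feature selection of [6] described above.

```python
import numpy as np

def dtw_path(ref_feats, rec_feats):
    """Plain DTW between two (frames x dims) feature matrices.
    Returns the warping path as (ref_frame, rec_frame) pairs."""
    n, m = len(ref_feats), len(rec_feats)
    # Pairwise Euclidean distances between all frame pairs.
    dist = np.linalg.norm(ref_feats[:, None, :] - rec_feats[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack the minimum-cost path from the end to the origin.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_boundaries(ref_bounds, path):
    """Project the synthetic signal's boundary frames onto the
    recorded signal through the alignment path."""
    first_match = {}
    for ref_frame, rec_frame in path:
        first_match.setdefault(ref_frame, rec_frame)
    return [first_match[b] for b in ref_bounds]
```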

2.2 HMM-based phonetic aligners

Once the speech-synthesis-based aligner was robust enough, it was used to generate the training data for the HMM-based aligner. Given the amount of available training data, context-independent models were chosen for the task. Figure 1 shows the different phone topologies. The upper one is used for all phonetic segments except silence, the semi-vowels and schwa. The central topology represents segments with short durations, like the semi-vowels and schwa, by allowing a skip between the first and last states. The silence model is the lower one; here, a transition from the first state to the last, as well as one from the last state back to the first, can be observed, which makes it possible to model very large variations in the duration of the silences in the speech database. Each model state consists of a mixture of eight Gaussians. The adopted features were the Mel-Frequency Cepstral Coefficients, their first- and second-order differences, and the energy with its first and second differences. Frames are spaced 5 ms apart, with a 20-ms-long analysis window. The models were trained using the HTK toolkit.

Fig. 1. Three HMM topologies were used for the different kinds of phonetic segments: the upper one is the general model, the central one is used for semi-vowels and schwa, and the lower one for silence.
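To make the three topologies concrete, they can be written as state transition matrices. The sketch below assumes three emitting states per model (the paper does not state the exact count) and uses placeholder probabilities; the real values are estimated during training.

```python
import numpy as np

# Rows/columns: state 1, state 2, state 3, exit.
# Row i holds the transition probabilities out of state i.

# General left-to-right model: 1 -> 2 -> 3 -> exit.
general = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.6, 0.4],
])

# Short-segment model (semi-vowels, schwa): adds a skip from the
# first to the last state so very short phones can be matched.
short = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.6, 0.4],
])

# Silence model: first-to-last skip plus a backward transition from
# the last state to the first, accommodating silences of widely
# varying duration.
silence = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.2, 0.0, 0.5, 0.3],
])
```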

2.3 Segment boundary refinement

As expected, the HMM-based aligner produced a more robust segmentation. The next step was to use our speech-synthesis-based aligner to refine the location of the segment boundaries.

2.4 Segmentation results

The two most common approaches to evaluating segmentation accuracy are to compute the percentage of phonetic segments whose boundary location errors are smaller than a given tolerance (often 20 ms), or the root mean square error of the boundary locations. Although these can be good predictors of an aligner's accuracy, an error of about 20 ms in a 25-ms-long segment is clearly much more serious than the same error in a 150-ms-long segment: in the first case, the segment's frames are almost always badly assigned. We therefore propose a phone-based, duration-independent measure of aligner accuracy, namely the percentage of correctly assigned frames within each segment. We call it the Overlap Rate (OvR); Fig. 2 illustrates its computation. Given a segment, a reference segmentation (RefSeg) and the segmentation to be evaluated (AutoSeg), OvR is the ratio between the number of frames assigned to that segment in both segmentations (Common_Dur in Fig. 2) and the number of frames assigned to it in at least one of them (Dur_max in Fig. 2):

    OvR = Common_Dur / Dur_max = Common_Dur / (Dur_ref + Dur_auto − Common_Dur)    (1)

Fig. 2. Graphical representation of the quantities involved in the computation of the Overlap Rate.

From equation 1, one can see that if a phone's duration in the reference segmentation differs considerably from its duration in the other segmentation, OvR takes a very small value. Let X be Dur_ref, Y be Dur_auto and z be Common_Dur of Fig. 2, and suppose X ≤ Y. Since the number of common frames (z) is at most the minimum of the segment's durations in the two annotations, 0 ≤ z ≤ X, and thus:

    OvR = z / (X + Y − z) ≤ X / Y    (2)

One can therefore conclude that this measure is duration-independent and able to produce a more reliable evaluation of annotation accuracy.
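A direct transcription of equation 1, assuming each segmentation provides a phone's interval as (start, end) frame indices. The example uses the 5-ms frame spacing of Section 2.2 to show how the same 20-ms boundary shift yields very different OvR values for a short and a long phone:

```python
def overlap_rate(ref_seg, auto_seg):
    """Overlap Rate (OvR) of equation 1.

    ref_seg, auto_seg: (start, end) frame indices of the same phone
    in the reference and automatic segmentations (end exclusive)."""
    common = max(0, min(ref_seg[1], auto_seg[1]) - max(ref_seg[0], auto_seg[0]))
    dur_ref = ref_seg[1] - ref_seg[0]
    dur_auto = auto_seg[1] - auto_seg[0]
    dur_max = dur_ref + dur_auto - common  # frames covered by either segmentation
    return common / dur_max

# With 5-ms frames, a 20-ms (4-frame) shift hurts a short phone far more:
print(overlap_rate((100, 105), (104, 109)))  # 25-ms phone  -> OvR = 0.11
print(overlap_rate((100, 130), (104, 134)))  # 150-ms phone -> OvR = 0.76
```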

Figure 3 shows the accuracy of the three annotation tools developed. The x-axis is the percentage of incorrectly assigned frames ((1 − OvR) × 100%) and the y-axis is the percentage of phones whose fraction of incorrectly assigned frames is lower than the value on the x-axis. The solid line represents the accuracy of the HMM-based aligner, the dashed line the accuracy of the speech-synthesis-based aligner when used to refine the results of the HMM-based aligner, and the dotted line the accuracy of the speech-synthesis-based aligner when no other alignments were available. Admittedly, these results are not a fair comparison among the annotation tools, because the HMM-based aligner is adapted to the speaker while the speech-synthesis-based aligners are not; on the other hand, the phone models used in the HMM-based aligner were trained on data aligned by the speech-synthesis-based aligner. The results nonetheless suggest that using HMM-based together with speech-synthesis-based annotation tools is worthwhile, as the former are more robust and the latter more accurate.

Fig. 3. Annotation accuracy for the three tested annotation techniques (HMM-based, hybrid and speech-synthesis-based aligners).

3 Confidence scores

In this section we propose some phone-based confidence scores for detecting misalignments within an utterance. The goal is to locate regions of the speech signal where the alignment method may have failed and that could benefit from human intervention.

3.1 The chosen features

The alignment process provides a set of features that can be used as indicators of annotation mismatch:

- DTW mean distance: mean distance between the features of the recorded signal frames and the synthesized speech signal over the alignment path, for a given phone;
- DTW variance: variance of that distance over the alignment path;
- DTW minimal distance: minimal distance over the alignment path;
- DTW maximal distance: maximal distance over the alignment path;
- HMM mean distance: mean distance between the features of the recorded signal frames and the phone model;
- HMM variance: variance of the distance to the phone model;
- HMM minimal distance: minimal distance to the phone model;
- HMM maximal distance: maximal distance to the phone model.

Each segment of the database is associated with a vector of these features, which is used to predict a confidence score for the alignment of that phone. To provide some context, we decided to include not only the feature vector of the current phone but also the feature vectors of the previous and following segments; a sketch of this context stacking follows.
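This is a minimal sketch of that stacking, assuming feats is a (num_segments × 8) array holding the eight scores above for each segment in utterance order; edge segments reuse their own features as padding (the paper does not specify how utterance edges are handled).

```python
import numpy as np

def with_context(feats):
    """Stack each segment's 8 features with those of its neighbours,
    giving a 24-dimensional predictor input per segment."""
    prev = np.vstack([feats[:1], feats[:-1]])  # previous segment (edge-padded)
    nxt = np.vstack([feats[1:], feats[-1:]])   # following segment (edge-padded)
    return np.hstack([prev, feats, nxt])
```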

We were then in a position to evaluate the reliability of the different techniques we propose for detecting annotation problems. Three approaches are evaluated: Classification and Regression Trees (CART), Artificial Neural Networks (ANN) and Hidden Markov Models (HMM).

3.2 Definition of Bad Alignment

The boundary between a good and a bad alignment is hard to define. Some researchers consider boundary errors larger than 10 milliseconds to be misalignments, while others are more tolerant. As explained above, the effect of a boundary location error differs from segment to segment, depending on the segment's duration. We therefore use the duration-independent measure proposed above to define the accuracy of the annotation tools: we assume that a misalignment occurs when OvR ≤ 0.75.

3.3 Classification and Regression Trees

To train a regression tree we used the Wagon program, which is part of the Edinburgh Speech Tools [7]. This program can build both classification and regression trees; here it was used as a regression tool to predict the value of OvR from the features above. We used a training set of 28000 segments and a test set of 10000 segments.

Since each leaf of the tree holds the average value of OvR and its variance, and assuming a Gaussian distribution at the leaves, we can compute the probability of OvR being lower than the threshold defined in the previous subsection. Let µ and σ be the average value of OvR and its standard deviation, respectively, in a given leaf of the tree. Then the probability of misalignment is given by:

    P(OvR ≤ 0.75 | µ, σ) = (1 / √(2πσ²)) ∫₀^0.75 e^(−(x−µ)² / (2σ²)) dx    (3)

A threshold is then applied to the resulting probability. By varying this threshold we obtained the precision/recall curve represented as a dotted line in Fig. 4.

3.4 Artificial Neural Networks

Using a neural network simulator developed at our lab, and the same feature vectors used in the previous experiment, we trained a binary classifier that computes the probability of misalignment for each segment. As in the regression tree experiment, a threshold was applied to the outputs of the neural network; varying this threshold produced the lower dashed line of Fig. 4.

3.5 Hidden Markov Models

Two one-state models were created for each phonetic segment: a model for aligned segments and a model for misaligned ones. Since the amount of training data was not large enough to build context-dependent models, we chose a context-independent approach; however, the influence of the different contexts was taken into account to some extent by using four Gaussian mixtures in each state. Each model was trained on the feature vectors described in Section 3.1. After model training, we performed a forced alignment between the feature vector sequences and the model pairs trained for each phonetic segment. This experiment yielded precision and recall values for each phonetic segment. We report the results by phone group (Vowels, Liquids, Nasals, Plosives, Fricatives, Semi-Vowels and Silence), which suffices to show that the precision and recall values can vary largely with the phone type under analysis (Table 1).

Based on the previously trained models, we computed a score (HmmScore) for each segment in order to build precision/recall curves, as we did for CART and ANN. This score was calculated using equation 4:

    HmmScore = P(x = Al | Model_Al) / (P(x = Al | Model_Al) + P(x = Misal | Model_Misal))    (4)

where P(x = Al | Model_Al) is the probability that segment x is aligned given the aligned-phone model for that segment, and P(x = Misal | Model_Misal) is the probability that segment x is misaligned given its misaligned-phone model.
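In code, the leaf probability of equation 3 reduces to a difference of Gaussian CDF values, and the score of equation 4 is a simple ratio. A minimal sketch (note that, as in equation 3, the integral runs over [0, 0.75] and is not renormalized for the truncation at 0):

```python
from math import erf, sqrt

def misalignment_prob(mu, sigma, threshold=0.75):
    """Equation 3: probability mass of a Gaussian leaf N(mu, sigma^2)
    falling in [0, threshold]."""
    cdf = lambda x: 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))
    return cdf(threshold) - cdf(0.0)

def hmm_score(p_aligned, p_misaligned):
    """Equation 4: normalized score of the aligned-segment model
    against the misaligned-segment model for the same phone."""
    return p_aligned / (p_aligned + p_misaligned)
```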

Table 1. Precision and recall of the HMM-based confidence measure for each phonetic segment class.

    Class         Precision (%)   Recall (%)
    Vowels             73.2          69.8
    Liquids            48.6          64.0
    Nasals             82.0          67.7
    Plosives           78.7          72.4
    Fricatives         88.0          69.0
    Semi-Vowels        44.9          67.5
    Silence            97.3          87.8

The score values lie between 0 and 1. We computed the upper curve of Fig. 4 by imposing different thresholds on the score, as we had already done for the two other approaches. It is important to point out that in this case we are detecting the aligned segments rather than the misaligned ones.

3.6 Results

The results depicted in Fig. 4 suggest that the HMM approach outperforms the others by far. The two remaining approaches perform very similarly: for some applications one may prefer CARTs, for others ANNs.

Fig. 4. Plot of precision and recall of the proposed confidence measures (HMM, CART and ANN).
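The curves of Fig. 4 are produced by sweeping a decision threshold over the per-segment scores. A generic sketch, assuming scores where higher means more likely misaligned and binary labels marking segments with OvR ≤ 0.75 (for HmmScore, which scores aligned segments, one would sweep over 1 − score):

```python
import numpy as np

def precision_recall_points(scores, labels, num_thresholds=100):
    """Trace a precision/recall curve by sweeping a threshold over
    per-segment misalignment scores (labels: 1 = truly misaligned)."""
    positives = (labels == 1)
    points = []
    for t in np.linspace(scores.min(), scores.max(), num_thresholds):
        flagged = scores >= t
        if not flagged.any():
            continue  # no segment flagged at this threshold
        true_pos = (flagged & positives).sum()
        points.append((true_pos / positives.sum(),  # recall
                       true_pos / flagged.sum()))   # precision
    return points
```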

4 Conclusions

In the first part of the paper we explored the advantages of using an HMM-based aligner together with a speech-synthesis-based one, showed the increase in accuracy of the combined system, and proposed a new measure of alignment accuracy. In the second part of the paper we proposed and evaluated three new approaches to computing confidence measures for phonetic annotation, and found that the HMM-based approach is by far the best one.

5 Acknowledgements

The authors would like to thank M. Céu Viana and H. Moniz for providing the manually aligned reference corpus. This work is part of Sérgio Paulo's PhD thesis, sponsored by a Portuguese Foundation for Science and Technology (FCT) scholarship. INESC-ID Lisboa had support from the POSI Program.

References

1. D. Caseiro, H. Meinedo, A. Serralheiro, I. Trancoso and J. Neto, "Spoken Book Alignment Using WFSTs". In HLT 2002, Human Language Technology Conference, 2002.
2. F. Malfrère and T. Dutoit, "High-Quality Speech Synthesis for Phonetic Speech Segmentation". In Proceedings of Eurospeech '97, Rhodes, Greece, 1997.
3. H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition". IEEE Transactions on ASSP, 26(1):43-49, 1978.
4. J. Kominek and A. Black, "Evaluating and Correcting Phoneme Segmentation for Unit Selection Synthesis". In Proceedings of Eurospeech 2003, Geneva, Switzerland, 2003.
5. A. Black, P. Taylor and R. Caley, "The Festival Speech Synthesis System". System documentation, Edition 1.4, for Festival Version 1.4.0, June 1999.
6. S. Paulo and L. C. Oliveira, "DTW-Based Phonetic Alignment Using Multiple Acoustic Features". In Proceedings of Eurospeech 2003, Geneva, Switzerland, 2003.
7. P. Taylor, R. Caley, A. Black and S. King, "Edinburgh Speech Tools Library". System documentation, Edition 1.2, June 1999.