
A SENONE BASED CONFIDENCE MEASURE FOR SPEECH RECOGNITION

Z. Bergen
Berdy Medical Systems, 499 Pearl East Circle, Suite 2, Boulder, Colorado, USA
E-mail: zbergen@berdy.com

W. Ward
Carnegie Mellon University, Pittsburgh, PA, USA
E-mail: whw@cs.cmu.edu

ABSTRACT

This paper describes three experiments in using frame-level observation probabilities as the basis for word confidence annotation in an HMM speech recognition system. One experiment operates at the word level, one uses word classes, and the third uses phone classes. In each experiment we categorize hypotheses as correct or incorrect by aligning the best recognition hypothesis with the known transcript. The error-prediction confidence for each class is a measure of the resolvability of the correct and incorrect histograms.

1. INTRODUCTION

Speech recognition systems generally rank-order utterance hypotheses by computing a score for each. These scores are useful for preference-ordering the hypotheses, but they do not give a good indication of the quality of the recognition or of how confident the system is that the decoding is correct. For applications to act on speech input, they must be able to assess the confidence that the input has been decoded correctly.

This work combines and extends the work described in [1] and [2], and is related to one feature of [3] for providing confidence annotation of speech recognizer output. The idea is to normalize the acoustic scores of decoded word strings and phones by scores produced by a less constrained search. [1] used an all-phone recognition to normalize the scores of the hypotheses, followed by Bayesian updating. Among other things, [3] also used the best-matching observation (senone) for each frame to normalize the acoustic score of the hypothesis. This paper describes further experiments with this measure.

For our acoustic measure we use frame-level (10 ms) observation scores as the basis for the normalization. We use the Sphinx-II system [4] as our speech recognizer; it is a semi-continuous HMM recognizer using a trigram language model. Acoustic observations are modeled in this system by senones [5]: tied, HMM-state-specific mixture weights for the Gaussian distributions used by the semi-continuous HMM system. For each 10 ms frame of input, the recognizer compares the input feature vector to all senones in the system and records the best-scoring senone for that frame; this is the unconstrained match. After the recognizer has produced the best word string (using a Viterbi search), these scores are used to normalize the scores of the words and phones in the hypothesis. For each frame, the score of the senone used by the hypothesis is subtracted from the score of the best senone for that frame. The average of this normalized score is then computed for each word and for each phone of each word.

Chase [3] used this measure as one predictor feature in a decision tree for confidence annotation; the acoustic scores of both words and phones were normalized by the best senone path. Used directly as a predictor feature, the measure seemed to have relatively little predictive power. We investigate its further classification into word classes and phone classes, respectively, in hopes of improving its discriminative power.
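To make the measure concrete, the following Python sketch computes the per-word normalized score. It is a minimal illustration under stated assumptions (log-domain senone scores supplied as per-frame arrays, and a frame alignment attached to each word hypothesis), not the authors' implementation.

```python
# Sketch of the senone-based normalization described above (assumed data
# layout, not the paper's code): for each frame of a word, subtract the
# score of the senone the Viterbi path used from the best senone score
# for that frame, then average over the word's frames.
from dataclasses import dataclass

@dataclass
class WordHyp:
    word: str
    start_frame: int                # inclusive
    end_frame: int                  # exclusive
    hyp_senone_scores: list[float]  # log score of the hypothesis senone, one per frame

def normalized_word_score(hyp: WordHyp, best_senone_scores: list[float]) -> float:
    """Average of (best senone score - hypothesis senone score) over the
    word's frames; values near zero mean the constrained path matched the
    unconstrained best-senone path closely."""
    diffs = [best_senone_scores[t] - hyp.hyp_senone_scores[t - hyp.start_frame]
             for t in range(hyp.start_frame, hyp.end_frame)]
    return sum(diffs) / len(diffs)
```

The same per-frame differences, averaged over a phone's frames instead of a word's, give the phone-level variant used in Experiment 3.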
2. EXPERIMENTS

Three experiments were performed to determine the utility of these normalized acoustic scores as word- and phone-level confidence measures. The categories for the three cases are (1) all words, (2) word classes, and (3) phone classes. Each class is divided into a correct set and an incorrect set so that the two distributions can be compared.

2.1 Experiment 1

We tested the measure by computing histograms of correct and incorrect words from a development corpus. The recognizer was run on utterances from the Wall Street Journal corpus, and the confidence measure was computed for each word. The distributions of the confidence scores were then computed separately for correct and incorrect words, as determined from the reference transcripts. An alignment program was used to flag incorrect words where the hypothesis decoding differed from the transcript; a sketch of this labeling step follows.
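The paper does not specify the alignment program, so the following sketch uses difflib's matching-block alignment as a rough stand-in to label each hypothesis word correct or incorrect against the reference transcript.

```python
# Sketch: flag hypothesis words as correct/incorrect by word-level alignment
# with the reference transcript. difflib stands in for the (unspecified)
# alignment program used in the paper.
from difflib import SequenceMatcher

def label_words(hyp_words: list[str], ref_words: list[str]) -> list[bool]:
    """One flag per hypothesis word: True if it aligns to an identical
    reference word, False for substitutions and insertions."""
    labels = [False] * len(hyp_words)
    matcher = SequenceMatcher(a=hyp_words, b=ref_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = True
    return labels

# Example: label_words("the cat sad".split(), "the cat sat".split())
# -> [True, True, False]; histograms of the normalized scores are then
# accumulated separately for the True and False words.
```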

[Figure 1. Correct and incorrect distributions for all words.]

Figure 1 shows the two distributions and illustrates their high degree of overlap. These results are consistent with results for the similar measure described in [3]. Much accuracy is probably lost in our confidence measure by averaging across all words.

2.2 Experiment 2

The results of the first experiment led us to cluster words into classes and to evaluate each class using our acoustic measure. We hoped that clustering words would uncover variation hidden by averaging across all words. We formed the following groups of phones:

Vowels: AE, EH, IH, IX, IY, UW, OW, UH, AH, AX, AA, AO, ER, AXR
Diphthongs: AW, AY, EY, OY
Orals: B, D, G, DX, BD, DD, GD, P, T, K, PD, TD, KD
Fricatives: DH, Z, ZH, V, S, TH, SH, F
Affricates: CH, TS, JH
Nasals: M, N, NG
Aspirates: HH
Approximants: W, R, L, Y

Using these phone groups, we formed word classes from each word's beginning phone and total number of phones (a sketch of this assignment follows). While not optimal, this classification yields distributions that exhibit regions of error prediction. Phone-level differences are averaged out over the length of a word, so their effects may not appear as prominently as in the single-phone case described in the next section. Normalization was performed independently for each class. Figure 2 shows the distributions for words starting with a diphthong. There is some variation in the incorrect distribution, while the correct distribution remains similar in shape to the all-word case of Figure 1. In general, separation of the correct and incorrect distributions improved slightly with the more specific statistics.

[Figure 2. Correct and incorrect distributions for words beginning with diphthongs (Series 1: correct, Series 2: incorrect).]
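The word-class assignment might look like the following sketch; the phone-group table mirrors the list above, while the class naming (group plus phone count) and the lexicon lookup supplying each word's phone sequence are illustrative assumptions.

```python
# Sketch of the Experiment 2 word classes: a word is classed by the group
# of its first phone plus its total phone count. The pronunciation lexicon
# (word -> phone sequence) is an assumed input.
PHONE_GROUPS = {
    "Vowel":       {"AE", "EH", "IH", "IX", "IY", "UW", "OW", "UH",
                    "AH", "AX", "AA", "AO", "ER", "AXR"},
    "Diphthong":   {"AW", "AY", "EY", "OY"},
    "Oral":        {"B", "D", "G", "DX", "BD", "DD", "GD",
                    "P", "T", "K", "PD", "TD", "KD"},
    "Fricative":   {"DH", "Z", "ZH", "V", "S", "TH", "SH", "F"},
    "Affricate":   {"CH", "TS", "JH"},
    "Nasal":       {"M", "N", "NG"},
    "Aspirate":    {"HH"},
    "Approximant": {"W", "R", "L", "Y"},
}

def word_class(phones: list[str]) -> str:
    """Class label from the first phone's group and the phone count."""
    group = next((g for g, members in PHONE_GROUPS.items()
                  if phones[0] in members), "Other")
    return f"{group}-{len(phones)}"

# Example: word_class(["EY", "B", "AH", "L"]) -> "Diphthong-4" ("able")
```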

2.3 Experiment 3

In this experiment we investigated the phone level in search of a more specific model of the behavior of the acoustic scores. To prepare the data, we used the Sphinx-II decoder to produce phone-level segmentations and scores for the best-path hypothesis. We formed classes comprising the individual phones of the phone groups in Section 2.2. Normalization is done by averaging the difference between the constrained and unconstrained paths over each phone.

For some phones, the measure shows a significant degree of separation between the correct and incorrect distributions. Figure 3 shows the distributions for one phone class, UW. Comparing Figure 3 with Figures 1 and 2, note that the distribution of correct scores remains fairly constant, while the distribution of incorrect scores spreads over the range of scores, providing a distinct region of separation between the distributions. For the more general classes of the earlier experiments, the overlap between the distributions is due to large localized differences in the single-phone classes that are averaged out in the word-level classes. While hidden in the general statistics, in the single-phone case a misrecognized phone may cause the recognizer to traverse the lexical tree along the wrong word path and thus produce a word-level error.

3. CONCLUSION

Senone-based acoustic normalization seems to provide only very slight confidence information when averaged across all words. However, performance begins to improve as statistics are computed over finer categories, whether word classes or phones. We intend to investigate better clustering of word classes and the estimation of phone-class reliability, similar to the updating technique of [1]. We believe this will further improve the predictive capability of senone normalization.

4. ACKNOWLEDGMENT

This project is supported in part by an ATP Cooperative Agreement, Number 7NANBH1184, from the National Institute of Standards and Technology.

5. REFERENCES

[1] Young, S. and Ward, W., "Recognition Confidence Measures for Spontaneous Spoken Dialog," EUROSPEECH 93, September 1993.

[2] Chase, L., Rosenfeld, R., and Ward, W., "Error-Responsive Modifications to Speech Recognizers: Negative N-grams," ICSLP 1994.

[3] Chase, L., "Error-Responsive Feedback Mechanisms for Speech Recognizers," Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, April 1997.

[4] Ravishankar, M. K., "Efficient Algorithms for Speech Recognition," Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 1996.

[5] Hwang, M. Y. and Huang, X. D., "Subphonetic Modeling with Markov States - Senone," ICASSP 92, March 1992.

[Figure 3. Correct and incorrect distributions for the phone UW; x-axis: acoustic score.]