The SRI Spine 2000 Evaluation System

Venkata Ramana Rao Gadde, Andreas Stolcke
Speech Technology and Research Laboratory, SRI International

Organization of the Talk
• The Spine 2000 task
• SRI's evaluation system
• Post-evaluation improvements
• Future work

The Spine 2000 Task
• Evaluation of current speech recognition technology in noisy military environments.
• Differences from the Hub-5 task:
  - Evaluation data is not segmented.
  - All sites must use a common language model.
  - Sites are not funded.

SRI Evaluation System
0. Segmentation of speech.
1. Cluster segments and estimate front-end normalizations (VTL, cepstral mean and variance).
2. First-pass recognition using SI acoustic models and a 3-gram multiword language model.
3. Adapt the SI acoustic models to the clusters.
4. Dump N-best lists (N = 2000) using the cluster-adapted acoustic models and the 3-gram multiword language model.
5. Rescore the N-best lists using a 3-gram language model (non-multiword).
6. Format the results for submission.
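The staged pipeline above can be sketched as a simple chain of functions. Everything here is a hypothetical stub standing in for SRI's actual components; the sketch only makes the order and data flow of the stages explicit.

```python
# Minimal sketch of the evaluation pipeline's data flow; each lambda is a
# placeholder for the real component (segmenter, decoder, adapter, ...).
STAGES = [
    ("segment",       lambda d: d),  # 0. speech/nonspeech segmentation
    ("cluster+norm",  lambda d: d),  # 1. clustering, VTL + cepstral mean/var
    ("first_pass",    lambda d: d),  # 2. SI models + 3-gram multiword LM
    ("adapt",         lambda d: d),  # 3. adapt SI models to clusters
    ("dump_nbest",    lambda d: d),  # 4. 2000-best with adapted models
    ("rescore",       lambda d: d),  # 5. non-multiword 3-gram rescoring
    ("format_output", lambda d: d),  # 6. format for submission
]

def run_pipeline(data, stages=STAGES):
    """Thread the data through every stage in order, recording the order."""
    trace = []
    for name, fn in stages:
        data = fn(data)
        trace.append(name)
    return data, trace
```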

Models
• Acoustic models
  - Trained from the Spine training data (11,970 waveforms). No DRT data was used.
  - Clustering was used to identify pseudo-speakers.
• Language models
  - Two language models were used: the CMU 3-gram LM and a 3-gram multiword LM derived from the CMU LM.

Language Model
Need to convert the CMU language model to contain the multiword units used in the dictionary.
• Insert all multiword N-grams that are triggered by original N-grams.
  Example:
    Old N-grams:   "i'm going", "to do"
    Multiword:     "going_to"
    Added N-grams: "i'm going_to", "going_to do"
    Removed N-gram: "going to"
• Assign probabilities so that word sequences retain their combined probabilities:
    P(going_to | i'm) = P(going | i'm) × P(to | i'm going)
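The probability assignment can be illustrated with a tiny sketch: a multiword's conditional probability is the chain product of its component words' conditional probabilities. The context tuples and probability values below are made up for illustration, and a real trigram LM would also truncate the growing context to two words and handle backoff.

```python
def multiword_prob(lm, context, components):
    """P(w1_w2 | h) = P(w1 | h) * P(w2 | h, w1), extending the context
    with each component word in turn.
    `lm` maps (context_tuple, word) -> conditional probability."""
    prob = 1.0
    ctx = tuple(context)
    for word in components:
        prob *= lm[(ctx, word)]
        ctx = ctx + (word,)
    return prob

# Toy probabilities (illustrative only):
lm = {
    (("i'm",), "going"): 0.2,       # P(going | i'm)
    (("i'm", "going"), "to"): 0.5,  # P(to | i'm going)
}
# P(going_to | i'm) = 0.2 * 0.5 = 0.1
```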

Front-end Processing
• Segmentation
  0. Split the two-channel waveform by conversation side.
  1. Remove digital zeros from the waveforms.
  2. Recognize the waveforms using gender- and speaker-independent acoustic models and a multiword bigram language model. Use the recognition hypotheses to further segment the waveforms into speech/nonspeech.
  3. Perform foreground/background speech classification using energy. Use it to obtain foreground speech segments.
• Clustering
  0. Cluster the foreground speech segments using a bottom-up agglomerative clustering scheme (from SRI's 1997 Hub4 evaluation system).
  1. Compute cluster-level normalizations (VTL, cepstral mean and variance).
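A bottom-up agglomerative scheme of the kind used for clustering can be sketched as follows. The per-segment feature vectors, the Euclidean distance between cluster means, and the stopping threshold are all simplifications for illustration; the real system clusters on acoustic statistics rather than raw feature means.

```python
def cluster_segments(features, threshold):
    """Greedy bottom-up clustering: repeatedly merge the two closest
    clusters (Euclidean distance between mean vectors) until the
    closest remaining pair is farther apart than `threshold`."""
    clusters = [([i], list(f)) for i, f in enumerate(features)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]][1], clusters[p[1]][1]),
        )
        if dist(clusters[i][1], clusters[j][1]) > threshold:
            break
        (mi, vi), (mj, vj) = clusters[i], clusters[j]
        merged_members = mi + mj
        n = len(merged_members)
        # Weighted mean of the two cluster means.
        merged_mean = [(len(mi) * x + len(mj) * y) / n for x, y in zip(vi, vj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged_members, merged_mean))
    return [sorted(members) for members, _ in clusters]
```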

N-best Rescoring
• Replace multiwords with their component words.
• Recompute language model probabilities using the CMU LM.
  Note: This yields better results because step 1 of the multiword LM construction gives only an approximation to the full multiword N-gram probabilities.
• Align all N-best hypotheses and extract the words with the highest posterior probabilities (explicit word error minimization; Stolcke, Konig & Weintraub, 1997).
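The first two rescoring steps (multiword expansion, then LM-score recomputation and re-ranking) can be sketched like this. The hypothesis format and the toy scoring function are invented for illustration, and the posterior-based word alignment step is omitted.

```python
def rescore_nbest(nbest, multiword_map, lm_logprob):
    """Expand multiwords into component words, recompute the LM score,
    and re-rank by combined acoustic + LM log score.
    `nbest` is a list of (word_list, acoustic_logprob) pairs."""
    rescored = []
    for words, ac_logprob in nbest:
        expanded = []
        for w in words:
            # Replace a multiword with its components; pass real words through.
            expanded.extend(multiword_map.get(w, [w]))
        rescored.append((expanded, ac_logprob + lm_logprob(expanded)))
    rescored.sort(key=lambda h: h[1], reverse=True)
    return rescored

# Toy example: a made-up LM that strongly prefers the expanded hypothesis.
multiword_map = {"going_to": ["going", "to"]}
toy_lm = lambda ws: 0.0 if "going" in ws else -5.0
nbest = [(["i'm", "going_to", "do"], -10.0), (["i'm", "gonna", "do"], -9.0)]
best_words, best_score = rescore_nbest(nbest, multiword_map, toy_lm)[0]
```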

Results on the Development Set
• Two dev sets were taken out of the training set: one containing speech from selected speakers and a second containing all the nv data.
• Acoustic models were trained from the remaining data.
• The LM for each set was trained from the transcripts, excluding the transcripts for that set.

  Model                         Set 1   Set 2   Both
  Step 2: Rec. with SI          35.0%   41.1%   36.8%
  Step 4: Rec. with adapted     32.8%   39.1%   34.7%
  Step 5: Rescore N-best        32.1%   37.1%   33.9%

• The improvements are similar to those in our Hub-5 system.
• Using a larger-bandwidth front end gave a small reduction in WER (not used in our eval system).
• Clustering the training data was better than using speaker/noise labels.
• Multiwords gave a 1-2% improvement in WER.
• Using probabilities for pronunciations gave a small reduction in WER.

Evaluation Results
• SRI's evaluation system had a WER of 46.3%. The best system had a WER around 26%.
• Reasons for the poor performance:
  - Incorrect segmentation
    • Large number of insertions and deletions.
    • Loss of 12.5% absolute due to incorrect segmentation.
    • Could not tune segmentation thresholds:
      - lack of representative dev data
      - missed early clarification on what to do with background speech
  - Simpler system compared to our Hub-5 system:
    • No crossword models
    • No duration models
    • No rate-of-speech models

Post-evaluation Improvements
• Improvements in segmentation
  - The segmentation algorithm was simplified, using only energy.
  - Thresholds for segmentation were optimized on the dev set.
  - WER reduced by 7.5%.
• Using crossword models
  - Crossword acoustic models were used to rescore the lattices.
  - WER reduced by 6.0% absolute on the dev set.

Post-evaluation System
0. Segmentation of speech.
1. Cluster segments and estimate front-end normalizations (VTL, cepstral mean and variance).
2. First-pass recognition using SI acoustic models and a 3-gram multiword language model.
3. Adapt the SI acoustic models to the clusters.
4. Adapt the crossword SI acoustic models to the clusters.
5. Generate lattices using the adapted models and a bigram multiword LM. Expand using the 3-gram LM.
6. Dump N-best lists (N = 2000) from the lattices using the cluster-adapted crossword acoustic models.
7. Rescore the N-best lists using a 3-gram language model (non-multiword).
8. Format the results for submission.

Comparison of the Eval and Post-eval Systems

  Step                              Eval     Post-eval
  2. Rec. with SI                   52.2%    42.9%
  3. Rec. with adapted              49.5%    -
  6. Rec. with adapted CW           -        36.3%
  7. Rescore N-best                 48.5%    35.8%
  NIST scoring                      46.3%    33.1%*

  * projected value

Future Work
• Segmentation
  - Foreground/background speaker classification
  - Prosody-based segmentation
• Acoustic modeling
  - Utilize noise characteristics
  - Spectral subtraction did not help
• Duration modeling

Suggestions for Future Evaluations
• Clearly defined training and dev sets are needed. In the absence of a dev set, we were unaware of the background-speech issue until a week before the evaluation.
• The common LM was an unnecessary constraint. Sites should be allowed to build their own LMs using common training data. We could try to model dialog instead of sentences.
• The noise recordings (used in preparing the data) could be made available.