Speech Recognition at ICSI: Broadcast News and beyond

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

Outline
1. The DARPA Broadcast News task
2. Aspects of ICSI's BN system
3. Future directions for speech recognition

ICSI, Speech, Broadcast News - Dan Ellis 1998sep21-1

1 DARPA Broadcast News

DARPA standard speech tasks
- Resource Management (1980s)
- Wall Street Journal (early 1990s)
- Broadcast News (1996 on)
- Switchboard (1996 on)
- Call Home (1997 on)

Distinguishing features
- vocabulary size, grammar perplexity
- speaking style: read, spontaneous, familiar
- acoustic conditions, variability
- accent, dialect, language

Annual evaluation bakeoffs
- unseen common evaluation set
- key result is overall Word Error Rate
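Since Word Error Rate is the key result of the evaluations, a minimal sketch of how it is conventionally computed may help: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; official scoring tools also apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER computed this way can exceed 100% on very poor output.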

Broadcast News details

Training material recorded off-air
- ABC, CNN, CSPAN, NPR
- 50 hours for 1996, 1997 +50h, 1998 +100h
- word transcriptions + speaker time boundaries
- excluding commercials: 74 h training set

7-way acoustic condition classification
- F0: prepared studio speech (~40%)
- F1: spontaneous studio speech (20%)
- F2: telephone-bandwidth (20%)
- F3: background music (5%)
- F4: degraded acoustics (5%)
- F5: foreign accents (5%)
- Fx: combinations/other (5%)

Broadcast News history

Best WER results:
- 1996: HTK: 27%
- 1997: HTK: 16% (but: easier; 22% on 1996 eval)
- 1998: November

Some clear conclusions
- one classifier for all conditions (or male/female)
- feature adaptation (VTLN, MLLR, SAT)
- importance of segmentation
- hard to improve grammar
- more data is useful

Applications for BN systems

Live transcription
- subtitles
- transcripts
- but: more than words?

Video editing
- precision word-time alignments
- commercial systems by IBM, Virage, etc.

Information Retrieval (IR)
- TREC/MUC spoken documents
- tolerant of word error rate, e.g.:

F0: THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW

F4: AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION

F5: THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS STEPHEN BEARD FOR MARKETPLACE

Thematic Indexing of Spoken Language (Thisl)

EC collaboration, BBC providing data

More than 500 hr of archive data

IR is key factor
- stop lists
- weighting schemes
- query expansion

[Figure: Thisl system block diagram. Labels: Archive, Query, Database, Segmentation, Control, IR, NLP, ASR, Receiver, Text, Audio, Video, http]
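To make the IR ingredients concrete, here is a small sketch of a stop list combined with a tf-idf weighting scheme for ranking transcripts against a query. The stop-word set, function names, and the particular tf-idf variant are assumptions chosen for illustration; the slides do not say which weighting Thisl uses, and query expansion is not shown.

```python
import math
from collections import Counter

# Illustrative stop list only; a real system would use a much larger one.
STOP_WORDS = {"the", "of", "a", "in", "to", "is", "on", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(documents):
    """Return per-document term counts and inverse document frequencies."""
    counts = [Counter(tokenize(d)) for d in documents]
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log(len(documents) / df) for t, df in doc_freq.items()}
    return counts, idf

def rank(query, documents):
    """Return document indices ordered by decreasing sum of tf * idf over query terms."""
    counts, idf = build_index(documents)
    terms = tokenize(query)
    scores = [sum(c[t] * idf.get(t, 0.0) for t in terms) for c in counts]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

Rare terms carry most of the score under this weighting, which is one plausible reason retrieval tolerates recognition errors on common function words.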

Outline
1. The DARPA Broadcast News task
2. Aspects of ICSI's BN system
   - the standard speech recognition architecture
   - front-end, classifier & HMM decoder issues
   - adaptation & segmentation
   - lessons: size matters
3. Future directions for speech recognition

Standard speech recognition

Speech as a sequence of discrete symbols q_i

[Figure: recognizer block diagram. Sound -> Front end -> Feature vectors -> Phone classifier (Acoustic models) -> Label probabilities -> HMM decoder (Word models, Grammar) -> Phone & word labelling]

Front-end issues

Spectrogram reading paradigm
- short-time spectral features
- (perceptual) frequency-warping helps
- normalization e.g. RASTA

Goal = classifier accuracy
- objective measure, but quite opaque
- the right space for generalization
- tension between detail & blurring

Best solution depends on task
- RASTA plus delta-features good for small vocab
- plain normalized PLP best for BN
- modulation spectrum features best for combo...

Normalizing...
- ... in training
- ... unseen speech
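As an illustration of two of the front-end operations named above, the sketch below computes regression-based delta features and a per-utterance mean/variance normalization over a list of feature frames. Note the substitution: this is plain mean/variance normalization, not RASTA filtering, and the function names, window size, and edge handling (replicating the first and last frame) are assumptions, not details taken from the ICSI front end.

```python
def mean_variance_normalize(frames):
    """Scale each feature dimension to zero mean and unit variance over the utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(var ** 0.5 or 1.0)  # constant dimensions are left unscaled
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

def delta_features(frames, window=2):
    """Deltas by linear regression: sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2)."""
    n_frames = len(frames)
    dims = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(n_frames):
        row = []
        for d in range(dims):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]  # replicate last frame
                behind = frames[max(t - n, 0)][d]            # replicate first frame
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas
```

The same normalization has to be applied to training data and to unseen speech, which is the point of the last two bullets: statistics estimated per utterance (as here) are available in both cases.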

Classifier issues

Find p(q_i | X)
- directly by (discriminant) neural-net estimation
- by likelihood i.e. model p(X | q_i) with Gaussians
- more data permits finer detail in q_i

Combining classifiers helps (figure not preserved)
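The two estimation routes are linked by Bayes' rule, p(X | q_i) / p(X) = p(q_i | X) / p(q_i), so network posteriors divided by class priors give scaled likelihoods that an HMM decoder can consume. The sketch below shows that conversion, plus one possible way of combining several classifiers by averaging their log posteriors. The combination rule, the probability floor, and all names are assumptions for illustration; the slide does not state how the ICSI system combines its classifiers.

```python
import math

def scaled_likelihoods(posteriors, priors):
    """Bayes' rule per class: p(X|q_i) / p(X) = p(q_i|X) / p(q_i)."""
    return [p / prior for p, prior in zip(posteriors, priors)]

def combine_posteriors(estimates):
    """Average the log posteriors of several classifiers, then renormalize to sum to 1."""
    n_classes = len(estimates[0])
    floor = 1e-10  # hypothetical floor to avoid log(0)
    log_avg = [
        sum(math.log(max(est[i], floor)) for est in estimates) / len(estimates)
        for i in range(n_classes)
    ]
    peak = max(log_avg)  # subtract the peak for numerical stability
    unnorm = [math.exp(v - peak) for v in log_avg]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Averaging in the log domain is a normalized geometric mean, so a class must be supported by every classifier to keep a high combined score.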

HMM decoder issues

Define all allowable output q_i sequences
- phone models
- word pronunciations (lexicon)
- word sequences (grammar)

Search for best matching sequence
- dominates processing time in large-vocab systems
- variation of pronunciation with speaking rate
- data-derived pronunciations
- handling poor acoustics
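The search for the best matching sequence is typically a Viterbi dynamic-programming pass over the states that the phone models, lexicon and grammar allow. The following is a minimal sketch over a small, fully enumerated state space; a large-vocabulary decoder adds pruning and a compiled word network, neither of which is shown, and the argument layout is an assumption made for this example.

```python
def viterbi(log_emissions, log_transitions, log_initial):
    """Return the most likely state sequence and its log score.

    log_emissions[t][s]   : log score of state s at frame t
    log_transitions[r][s] : log probability of moving from state r to state s
    log_initial[s]        : log probability of starting in state s
    Forbidden moves are encoded as float("-inf").
    """
    n_states = len(log_initial)
    score = [log_initial[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_transitions[r][s])
            new_score.append(score[best_prev] + log_transitions[best_prev][s] + frame[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

The cost is proportional to frames times states squared, which is why search dominates processing time once the lexicon and grammar expand the state space.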

Adaptation, segmentation & confidence

Big gains from adaptation & normalization
- e.g. VTLN, MLLR
- typ. 10-20% relative WER improvement

Requires marking of homogeneous segments
- hand-labelled
- rate of change metric for automatic boundaries
- clustering models for segments

Confidence metrics
- typically elusive
- help indicate errors
- help to segment material
- conserve decoding effort

p(q_i | X) should correlate with confidence
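One way to turn the idea that p(q_i | X) should correlate with confidence into a number is to measure how peaked the per-frame posteriors are over a segment. The sketch below uses mean normalized entropy; this particular measure, its scaling to [0, 1], and the function names are assumptions for illustration rather than the metric used in the ICSI system.

```python
import math

def frame_entropy(posteriors):
    """Entropy (in nats) of one frame's posterior distribution over phone classes."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)

def segment_confidence(frames):
    """Return 1 - (mean entropy / maximum entropy): 1 = peaked, 0 = uniform posteriors.

    Assumes at least two classes per frame.
    """
    n_classes = len(frames[0])
    max_entropy = math.log(n_classes)
    mean_entropy = sum(frame_entropy(f) for f in frames) / len(frames)
    return 1.0 - mean_entropy / max_entropy
```

A segment whose frames all score near zero could be flagged as a likely error region or skipped to conserve decoding effort.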

Status of the ICSI BN project

WER:
- started out (April) ~ 50%
- best single net ~ 33%
- best combination ~ 26%

Size matters
- biggest gain from large classifiers & lots of data
- e.g. 200k parameters, 4M patterns = 40%
  800k parameters, 16M patterns = 33%
- training time = 11 days (special hardware)
- (other approaches reach similar conclusion)

Innovations
- combinations
- multiband?
- segmental features?
- time windows?
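For comparison with the 10-20% relative gains usually quoted for adaptation, the absolute WER figures on this slide can be converted to relative improvements. This is only arithmetic on the approximate numbers above; the step labels are descriptive names introduced here.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer

# (label, WER before, WER after), all in percent, taken from the slide
steps = [
    ("April start -> best single net", 50.0, 33.0),
    ("best single net -> best combination", 33.0, 26.0),
    ("200k params / 4M patterns -> 800k params / 16M patterns", 40.0, 33.0),
]
for label, before, after in steps:
    print(f"{label}: {100 * relative_improvement(before, after):.1f}% relative")
```

With these inputs the three steps come out at roughly 34%, 21% and 17.5% relative, so quadrupling both classifier size and training data is worth about as much as a typical adaptation gain.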

Outline
1. The DARPA Broadcast News task
2. Aspects of ICSI's BN system
3. Future directions for speech recognition
   - removing the grammar crutch
   - the signal model & what is thrown away
   - a research agenda

The crutch of grammar

The downside of objective evaluation
- research priority has been pragmatic goal of reducing WER
- human speech recognition results from many constraints
- grammatic/semantic constraints implicit in word sequence statistics (grammar)
- automatic analysis of large corpora is possible & helpful

The problems with a grammar
- unexpected (unseen) phrases are discounted
- highly brittle alternatives
- masks underlying performance

A more scientific approach
- first work on the underlying phoneme classifier
- follow nonsense syllable performance (Fletcher)
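The way a statistical grammar discounts unseen phrases can be seen in a toy bigram model. The sketch below uses add-one smoothing purely because it is the simplest scheme to show; it is an assumption for illustration, and evaluation systems of this period generally used more refined back-off n-gram models.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and word pairs, with <s> and </s> as sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        histories.update(words[:-1])               # every word except </s> acts as a history
        bigrams.update(zip(words[:-1], words[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab

def bigram_prob(prev, word, model):
    """Add-one smoothed estimate of P(word | prev)."""
    histories, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))
```

For example, after training on "the election returns" and "the election results", the seen pair ("the", "election") gets probability 3/7 while the unseen pair ("the", "results") gets 1/7, so an acoustically correct but unexpected phrase starts at a disadvantage in the search.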

The signal model in speech recognition

Systems & approach have been optimized for speech-alone situation
- minimize classifier parameters, maximize use of feature space
- e.g. cepstra [example]

Possibly non-lexical data thrown away
- pitch
- timing/rhythm
- speaker identification

Dire consequences
- ... dealing with nonspeech sounds
- ... distinguishing success & failure

Popular focus of research
- e.g. segmental models, pitch features
- fail to obtain robust improvements

The prediction-driven approach

Originally for non-speech auditory scene analysis

Analysis-by-synthesis model
- representation is generative parameters
- analysis is search & tracking of models

[Figure: prediction-driven analysis block diagram. Labels: input mixture, Front end, signal features, Compare & reconcile, prediction errors, Hypothesis management, hypotheses (Noise components, Periodic components), Predict & combine, predicted features]

Prediction-driven analysis of speech/nonspeech mixtures

Speech just another class of models...

Account for all (speech) perceptual features
- phoneme identity
- speaker identity
- speaking rate & style

Informed by speech coding & synthesis

Problem: efficiency of analysis
- currently: direct evaluation of label likelihoods, search over discrete lexical space
- proposed: implies search of continuous speech-quality space

Conclusions

Broadcast News: interesting task

ICSI's BN system: useful framework
- significant infrastructure investment
- large, well-known, interesting, real problem
- carries implicit research priorities

Sore thumbs in current speech recognition & some research directions
- separating the effects of different constraints (acoustic model & language model)
- signal models that can incorporate nonspeech
- track all perceptual attributes, don't just discard them