APPLICATIONS 5: SPEECH RECOGNITION

Theme

Speech is produced by the passage of air through various obstructions and routings of the human larynx, throat, mouth, tongue, lips, nose, etc. It is emitted as a series of pressure waves. To automatically convert these pressure waves into written words, a series of operations is performed: capturing and representing the pressure waves in an appropriate notation, creating feature vectors to represent time-slices of the converted input, clustering and purifying those vectors, matching the results against a library of known vectorized sound waveforms, choosing the most likely series of letter-sounds, and then selecting the most likely sequence of words.

Summary of contents

1. Speech Recognition Systems

Introduction

Early speech recognition systems tried to model the human articulatory channel. They didn't work. Since the 1970s, these systems have been trained on example data rather than defined using rules; the transition was driven by the success of the HEARSAY and HARPY systems at CMU.

Step 1: Speech

Speech is pressure waves travelling through the air, created by vibrations of the larynx followed by openings or blockages en route to the outside (vowels and consonants).

[Figure: an input sentence shown as a speech wave, with its Fourier transform and intensity levels displayed as a spectrogram]
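As a concrete preview of the Fourier analysis described in Step 2 below, here is a minimal Python sketch that turns a pressure wave into spectrogram columns, one magnitude spectrum per time-slice. It is only illustrative: utterance.wav is a hypothetical input file, and the 25 ms window with a 10 ms hop is a common textbook choice, not a value prescribed by these notes.

```python
# Minimal sketch: short-time Fourier analysis of a speech recording.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, samples = wavfile.read("utterance.wav")   # hypothetical 16 kHz mono file
samples = samples.astype(np.float64)

# 25 ms Hann windows, 10 ms hop: each column of Zxx is the spectrum
# of one time-slice, i.e. one column of the spectrogram.
freqs, times, Zxx = stft(samples, fs=rate, window="hann",
                         nperseg=int(0.025 * rate),
                         noverlap=int(0.015 * rate))
spectrogram_db = 20 * np.log10(np.abs(Zxx) + 1e-10)   # log magnitude, in dB
print(spectrogram_db.shape)   # (num_frequencies, num_time_slices)
```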

Step 2: Internal representation

1. The basic pressure wave is full of noise and very context-sensitive, so it is very difficult to work with. First, perform a Fourier transform to represent the wave, within a certain window, as a sum of waves at a range of frequencies; this is what the speech spectrogram above shows. Now work in the frequency domain (the y-axis), not the waveform domain. Try various window shapes to minimize edge effects, etc.

2. Decompose (deconvolve) the Fourier-transformed waves into a set of vectors by cepstral analysis. Chop up the timeline (x-axis) and the frequency space (y-axis) to obtain little squares, from which you obtain the quantized vectors. Certain operations now become simpler (working in log space, you can add instead of multiply), though some new steps become necessary. Move a window over the Fourier transform and measure the strengths of the voice's natural frequencies f0, f1, f2.

Step 3: Purify and match: Acoustic Model

1. After quantizing the frequency vectors, represent them in an abstract vector space (axes: time and MFCC cepstral coefficients). The resulting vectors, one per time-slice, provide a picture of the incoming sound wave. Depending on window size, speech inconsistencies, noise, etc., these vectors are not pure reflections of the speaker's sounds. So purify the vector series by clustering, using various algorithms, to find the major sound bundles. Here it's possible to merge vectors across time as well, to obtain durations.

2. Then match the bundles against a library of standard sound-bundle shapes, represented as durations vs. cepstral coefficients. To save space, these standard sound bundles are represented as mixtures of Gaussian curves (you then need to save only two or three parameters per curve); plotted for one MFCC coefficient, they form contour lines. Try to fit them over the clustered points (like umbrellas). To find the best match (which vector corresponds with which portion of the curve?), use the EM algorithm. A toy version of this pipeline is sketched below.
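The following sketch strings together a simplified version of Steps 2 and 3, assuming numpy, scipy, and scikit-learn: cepstral-style feature vectors per time-slice, then Gaussian mixtures fitted with EM over those vectors. The random "frames" merely stand in for real windowed speech, and a real front end would insert a mel filterbank before the log and DCT to obtain true MFCCs.

```python
# Sketch of Steps 2-3: cepstral-style feature vectors, then EM-fitted Gaussians.
import numpy as np
from scipy.fft import dct
from sklearn.mixture import GaussianMixture

def cepstral_features(frames, n_coeffs=13):
    """frames: (num_frames, frame_len) array of windowed time-slices."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = np.log(power + 1e-10)       # log space: products become sums
    # DCT of the log spectrum = cepstrum; keep the first few coefficients
    return dct(log_power, type=2, axis=1, norm="ortho")[:, :n_coeffs]

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 400))    # toy stand-in for real speech frames
feats = cepstral_features(frames)

# EM finds the mixture weights, means, and covariances (the "umbrellas"
# fitted over the clustered feature points).
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
labels = gmm.fit(feats).predict(feats)
print(labels[:10])                          # sound-bundle index per time-slice
```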

Step 4: Identify sound sequences and words: Lexical Model

Now you have a series of sounds, and you want a series of letters. Unfortunately, sounds and letters do not line up one-to-one. So first represent typical sound sequences in a Hidden Markov Model (similar to a finite-state network): for each sound, create all possible links to all other sounds, and arrange these sounds into the HMM. Initialize everything with equal transition probabilities, then train the transition probabilities on the links using training data for which you know both the sounds and the correct letters.

Given a new input (= sound sequence), use the Viterbi algorithm to match the incoming series of sounds to the best path through the HMM, taking into account likely sound shifts, etc., as given by the probabilistic sound transitions on the HMM arcs; a toy decoder is sketched below.

[Figure: a three-state HMM (states 1, 2, 3) whose arcs carry sound probabilities such as /b/ 0.7, /p/ 0.3, /a/ 0.5, /t/ 0.6, /ε/ 0.4]

A typical large-vocabulary system takes context dependency of the phonemes into account (phonemes with different left and right contexts have different realizations as HMM states). It also uses cepstral normalization to compensate for different speaker and recording conditions. One can do additional speaker normalization using vocal tract length normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR) for more general speaker adaptation.
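Here is a toy Viterbi decoder, in the spirit of the HMM fragment in the figure above. All states, transition probabilities, and emission probabilities are invented for illustration; a real system trains them on data, as described in Step 4.

```python
# Toy Viterbi decoding (Step 4): most likely state path for a sound sequence.
# All probabilities are illustrative, not trained values.
import numpy as np

states = ["s1", "s2", "s3"]
sounds = {"b": 0, "a": 1, "t": 2}

start = np.array([0.8, 0.1, 0.1])        # P(first state)
trans = np.array([[0.1, 0.8, 0.1],       # trans[i, j] = P(state j | state i)
                  [0.1, 0.2, 0.7],
                  [0.1, 0.1, 0.8]])
emit = np.array([[0.7, 0.2, 0.1],        # emit[i, k] = P(sound k | state i)
                 [0.1, 0.8, 0.1],
                 [0.2, 0.1, 0.7]])

def viterbi(observed):
    idx = [sounds[o] for o in observed]
    T, n = len(idx), len(states)
    delta = np.zeros((T, n))              # best log-prob of a path ending here
    back = np.zeros((T, n), dtype=int)    # backpointers to recover that path
    delta[0] = np.log(start) + np.log(emit[:, idx[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans)   # every transition
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit[:, idx[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["b", "a", "t"]))           # -> ['s1', 's2', 's3']
```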

Step 5: Sentences: Language Model

Finally, you have a series of words. But do they form a sentence? At this point you could use a parser and a grammar to check. The trouble is that the system's top word selections often include errors; perhaps the second- or third-best words are the correct ones. So create an n-gram language model (in practice, a trigram model) for the domain you are working in; it provides a probabilistic model of the word sequences you are likely to encounter. Now match the incoming word sequence (and all its most likely alternatives) against the n-gram language model, using the Viterbi algorithm, and find the most likely sensible sentence. Output the resulting sequence of words.

Overall System Architecture

[Figure: the typical components/architecture of an ASR system]

2. Evaluation Measures

The principal measure is word error rate (WER), which measures how many words were recognized incorrectly in a known test sample:

    WER = (S + I + D) * 100 / N

where N is the total number of words in the test set, and S, I, and D are the total numbers of substitutions, insertions, and deletions needed to convert the system's output string into the test string. (A direct implementation appears after the tables below.)

WER tends to drop by a factor of 2 every 2 years. In nicely controlled lab settings, with limited vocabularies, systems do quite well (WER in the low single digits). But in real life, which is noisy and unpredictable, and where people use made-up words and odd word mixtures, it's a different story.

In dialogue systems, people use Command Success Rate (CSR), in which the dialogue engine and task help guide speech recognition; one then measures the success of each individual command, and of each task as a whole.

Performance:

Corpus              | Speech type  | Lexicon size | Word error rate (%) | Human error rate (%)
connected digits    | spontaneous  | 10           | 0.3                 | 0.009
resource management | read         | 1,000        | 3.6                 | 0.1
air travel agent    | spontaneous  | 2,000        | 2                   | -
Wall Street Journal | read         | 64,000       | 7                   | 1
radio news          | mixed        | 64,000       | 27                  | -
tel. switchboard    | conversation | 10,000       | 38                  | 4
telephone call home | conversation | 10,000       | 50                  | -

History of approaches:

                         | pre-1975                                   | 1975-1985                                     | 1985-1997                                | present
Unit recognized          | sub-word; single word                      | sub-word                                      | sub-word                                 | sub-word
Unit of analysis         | single word                                | fixed phrases                                 | bigrams, trigrams                        | dialogue turns
Approach to modeling     | heuristic, ad hoc; rule-based, declarative | template matching; data-driven; deterministic | mathematical; data-driven; probabilistic | mathematical; data-driven; probabilistic; simple task analysis
Knowledge representation | heterogeneous                              | homogeneous                                   | homogeneous                              | unclear
Knowledge acquisition    | intense manual effort                      | embedded in simple structures                 | automatic learning                       | learning + manual effort for dialogues
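The WER formula above is simply a minimum edit distance at the word level. A direct Python implementation, assuming whitespace tokenization:

```python
# WER as defined above: minimum word-level edits, scaled by reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # match / substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("recognize speech", "wreck a nice beach"))   # 200.0
```

Note that WER can exceed 100% when the hypothesis inserts more words than the reference contains, as in this classic example.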

3. Speech translation

The goal: a translating telephone. Research projects at CMU, Karlsruhe, ISI, ARL, etc. The Verbmobil project in Germany translated between German and French using English as an interlingua; a large multi-site consortium, it kept NLP funded in Germany for almost a decade, starting in the mid-1990s.

One commercial product (PC-based): NEC, for $300, in 2002. They now sell a PDA-based version: 50,000 words, bigrams, some parsing, and a little semantic transfer.

The US Army Phrasalator: English-Arabic-English in a very robust box (able to withstand desert fighting conditions), combining speech recognition with phrasal (table-lookup) translation and output.

4. Prosody

An increasingly interesting topic today is the recognition of emotion and other pragmatic signals in addition to the words. Human-human speech is fundamentally mediated by prosody (the rhythm, intonation, etc. of speech). Speech sounds natural only when it is not "flat": we infer a great deal about a speaker's inner state and goals from prosody.

Prosody is characterized by two attributes:
- Prominence: intonation, rhythm, and lexical stress patterns, which signal emphasis, intent, and emotion
- Phrasing: the chunking of utterances into prosodic phrases, which assists with correct interpretation

[Figure: the sentence "he leaves tomorrow", said four ways: as a statement, a question, a command, and sarcastically]

To handle prosody, you need to develop:
- a suitable representation of prosody
- algorithms to automatically detect prosody
- methods to integrate these detectors into speech applications

To represent prosody, you extract features from the pitch contours of the last 200 msec of utterances and then convert the parameters into a discretized (categorical) notation. Shri Narayanan and students in the EE Department at USC, and others elsewhere, are detecting three features:
- pitch (the "height" of the voice)
- intensity (loudness)
- breaks (inter-word spaces)

They use the ToBI (TOnes and Break Indices) representation. A toy extractor for the first two features is sketched below.
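A hedged sketch of frame-level prosodic feature extraction: RMS energy plus a crude autocorrelation pitch estimate, discretized into categories. The 40 ms window, the 50-400 Hz search range, the voicing threshold, and the high/mid/low pitch boundaries are all illustrative assumptions; a real system would use a proper pitch tracker and ToBI-trained models, as described next.

```python
# Toy prosodic features: per-frame pitch estimate (f0) and RMS energy.
import numpy as np

def prosodic_features(samples, rate, win_ms=40, hop_ms=10):
    """Return a list of (f0, energy) pairs, one per frame."""
    win, hop = int(rate * win_ms / 1000), int(rate * hop_ms / 1000)
    lo, hi = int(rate / 400), int(rate / 50)      # search pitch in 50-400 Hz
    feats = []
    for begin in range(0, len(samples) - win, hop):
        frame = np.asarray(samples[begin:begin + win], dtype=float)
        frame -= frame.mean()
        energy = float(np.sqrt(np.mean(frame ** 2)))            # RMS loudness
        ac = np.correlate(frame, frame, mode="full")[win - 1:]  # autocorrelation
        lag = lo + int(np.argmax(ac[lo:hi]))
        voiced = ac[lag] > 0.3 * ac[0]            # crude voicing threshold
        feats.append((rate / lag if voiced else 0.0, energy))
    return feats

def discretize_pitch(f0, low=120.0, high=200.0):
    """Crude categorical stand-in for ToBI-style labels."""
    if f0 <= 0:
        return "unvoiced"
    return "high" if f0 > high else ("low" if f0 < low else "mid")

rate = 16000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 150 * t)                # synthetic 150 Hz "voice"
f0, energy = prosodic_features(tone, rate)[5]
print(round(f0), discretize_pitch(f0))            # ~150 mid
```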

Procedure: to find the best sequence of prosody labels L,
- they assign a prosodic label to each word, conditioned on contextual features;
- they train continuous-density Hidden Markov Models (HMMs), with 3 states, to represent pitch accent and boundary tone;
- they use the following kinds of features:
  - lexical features: orthographic word identity
  - syntactic features: POS tags and supertags (similar to a shallow syntactic parse)
  - acoustic features: f0 and energy extracted over 10-msec frames

5. Current status

Applications:

1. General-purpose dictation: several commercial systems for $100:
- DragonDictate (used to be Dragon Systems; by Jim Baker); now at Nuance (www.nuance.com)
- IBM ViaVoice (from Jim Baker at IBM)
- whatever was left when Lernout and Hauspie (formerly Kurzweil) went bankrupt
- Kai-Fu Lee took SPHINX from CMU to Apple (PlainTalk) and then to Microsoft Beijing
- Windows Speech Recognition in Windows Vista
2. Military: handheld devices for speech-to-speech translation in Iraq and elsewhere; also used in fighter planes, where the pilot's hands are too busy to type.
3. Healthcare: ASR for doctors, to create patient records automatically.
4. Autos: speech devices take driver input and display routes, maps, etc.
5. Help for the disabled (especially to access the web and control the computer).

Some research projects:
- DARPA: ATIS travel agent (early 1990s); GALE program (mid-2000s)
- MIT: GALAXY (global weather, restaurants, etc.)
- Dutch Railways: train information by phone
- DARPA: COMMUNICATOR (travel dialogues); BABYLON (handheld translation devices)

Current topics: interfaces; speech systems in human-computer interfaces.

Problems for ASR:
- Voices differ (men, women, children)
- Accents
- Speaking speed (overall, and specific cadences)
- Pitch variation (high, low)
- Word and sentence boundaries
- Background noise: a BIG PROBLEM
- Genuine ambiguity: "recognize speech" vs. "wreck a nice beach"

6. Speech Synthesis

Traditional model: a lexicon of sounds for letters. Problem: the result sounds flat. Enhancement: add a sentence prosody contour; this also requires the speech act and focus/stress as input.

Concatenative synthesis: record a speaker many times and create a lexicon of sounds for letters in various forms (word start/middle/end, sentence start/middle/end, stressed/unstressed, etc.). At run time, choose the most fitting variant, depending on the neighboring options (intensity/loudness, speed, etc.). Problems: obtaining clean sound units for letters; matching disfluencies.

Optional Readings

Victor Zue's course at MIT.
Jelinek, F. 1998. Statistical Methods for Speech Recognition.
Rabiner, L. 1993. Fundamentals of Speech Recognition.
Schroeder, M.R. 2004. Computer Speech (2nd ed.).
Karat, C.-M., J. Vergo, and D. Nahamoo. 2007. Conversational Interface Technologies. In A. Sears and J.A. Jacko (eds.), The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications (Human Factors and Ergonomics). Lawrence Erlbaum Associates Inc.