Automatic Speech Recognition: Introduction


Automatic Speech Recognition: Introduction
Steve Renals & Hiroshi Shimodaira
ASR Lecture 1, 14 January 2019

Automatic Speech Recognition: Course details

- Lectures: about 18 lectures
- Labs: weekly lab sessions using Kaldi (kaldi-asr.org)
  - Lab sessions in AT-4.12: Tuesdays 10:00, Wednesdays 10:00, Wednesdays 15:10, starting week 2 (22/23 January)
  - Select one lab session on Learn
- Assessment:
  - Exam in April or May (worth 70%)
  - Coursework (worth 30%, building on the lab sessions; out on Thursday 14 February, due by Wednesday 20 March)
- People:
  - Lecturers: Steve Renals and Hiroshi Shimodaira
  - TAs: Joachim Fainberg and Ondrej Klejch

http://www.inf.ed.ac.uk/teaching/courses/asr/

Your background

- If you have taken Speech Processing and either of (MLPR or MLP): perfect!
- If you have taken either of (MLPR or MLP) but not Speech Processing (probably you are from Informatics), you'll require some speech background:
  - A couple of the lectures will cover material that was in Speech Processing
  - Some additional background study (including material from Speech Processing)
- If you have taken Speech Processing but neither of (MLPR or MLP) (probably you are from SLP), you'll require some machine learning background (especially neural networks):
  - A couple of introductory lectures on neural networks provided for SLP students
  - Some additional background study

Labs

- Series of weekly labs using Kaldi; sign up for one lab session on Learn
- Labs start week 2 (next week)
- Note: training speech recognisers can take time, and ASR training in some labs will not finish in an hour...
- Give yourself plenty of time to complete the coursework; don't leave it until the last couple of days

What is speech recognition?

- Speech-to-text transcription: transform recorded audio into a sequence of words
- Just the words, no meaning... but we do need to deal with acoustic ambiguity: "Recognise speech?" or "Wreck a nice beach?"
- Related tasks:
  - Speaker diarization: who spoke when?
  - Speech recognition: what did they say?
  - Paralinguistic aspects: how did they say it? (timing, intonation, voice quality)
  - Speech understanding: what does it mean?

Why is speech recognition difficult?

Variability in speech recognition

Several sources of variation:
- Size: number of word types in the vocabulary, perplexity
- Speaker: tuned for a particular speaker, or speaker-independent? Adaptation to speaker characteristics
- Acoustic environment: noise, competing speakers, channel conditions (microphone, phone line, room acoustics)
- Style: continuously spoken or isolated? Planned monologue or spontaneous conversation?
- Accent/dialect: recognise the speech of all speakers who speak a particular language
- Language spoken: there are many languages beyond English, Mandarin Chinese, Spanish, ... What is the difference between a dialect and a language?

Hierarchical modelling of speech

A generative model, organised in levels:
- Utterance W: "No right"
- Words: NO RIGHT
- Subwords (phones, each modelled by an HMM): n oh r ai t
- Acoustics: X

Fundamental Equation of Statistical Speech Recognition

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

    W* = argmax_W P(W | X)

Applying Bayes' theorem:

    P(W | X) = p(X | W) P(W) / p(X)
             ∝ p(X | W) P(W)

so

    W* = argmax_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.
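The decision rule above can be sketched with a toy decoder that scores a handful of candidate word sequences in the log domain. The candidate set and all scores here are invented for illustration; a real recogniser searches a vastly larger space using trained acoustic and language models.

```python
# Toy log-probabilities for two candidate transcriptions of the same audio.
# In a real system these come from the acoustic model p(X|W) and the
# language model P(W); the numbers here are purely illustrative.
candidates = {
    ("recognise", "speech"): {"log_p_acoustic": -12.0, "log_p_lm": -3.0},
    ("wreck", "a", "nice", "beach"): {"log_p_acoustic": -11.5, "log_p_lm": -9.0},
}

def decode(candidates):
    """Return argmax_W p(X|W) P(W), computed as a sum in the log domain."""
    return max(candidates,
               key=lambda w: (candidates[w]["log_p_acoustic"]
                              + candidates[w]["log_p_lm"]))

best = decode(candidates)
```

Here the acoustically slightly better hypothesis loses because the language model strongly prefers the other word sequence: exactly the trade-off the product p(X | W) P(W) encodes.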

Speech Recognition Components

    W* = argmax_W p(X | W) P(W)

Use an acoustic model, language model, and lexicon to obtain the most probable word sequence W* given the observed acoustics X.

[Figure: recorded speech X passes through signal analysis into the search space, which combines the acoustic model p(X | W), the lexicon, and the language model P(W) (each estimated from training data) to produce the decoded text W* (transcription)]

Alternative approach: End-to-end systems

- Directly model transforming an input acoustic sequence into an output word or character sequence: a direct mapping from acoustics to transcription, with no separate subword/HMM level
- The acoustic sequence may be mapped to a character sequence ("N o _ R i g h t") or directly to a word sequence ("No Right")

Representing recorded speech (X)

- Represent a recorded utterance as a sequence of feature vectors
- Reading: Jurafsky & Martin, section 9.3
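As a minimal sketch of what "a sequence of feature vectors" means, the following splits a waveform into overlapping 25 ms frames with a 10 ms hop (conventional values at a 16 kHz sampling rate) and computes a one-dimensional log-energy value per frame. This is a stand-in: real front ends compute richer per-frame vectors such as MFCCs.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split waveform x into overlapping frames: 25 ms windows every 10 ms
    at a 16 kHz sampling rate."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def log_energy(frames):
    """A one-dimensional stand-in for a real per-frame feature vector."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

x = np.zeros(16000)        # one second of (silent) fake audio at 16 kHz
x[8000:8400] = 1.0         # a short burst in the middle
frames = frame_signal(x)   # one row per 25 ms frame
feats = log_energy(frames) # one feature value per frame
```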

Labelling speech (W)

- Labels may be at different levels: words, phones, etc.
- Labels may be time-aligned, i.e. the start and end times of the acoustic segment corresponding to a label are known
- Reading: Jurafsky & Martin, chapter 7 (especially sections 7.4 and 7.5)

Phones and Phonemes

- Phonemes: abstract units defined by linguists, based on their contrastive role in word meanings (e.g. "cat" vs "bat")
  - 40-50 phonemes in English
- Phones: speech sounds defined by the acoustics
  - many allophones of the same phoneme (e.g. /p/ in "pit" and "spit")
  - limitless in number
- Phones are usually used in speech recognition, but there is no conclusive evidence that they are the basic units in speech recognition
- Possible alternatives: syllables, automatically derived units, ...

(Slide taken from Martin Cooke, from long ago)

Example: TIMIT Corpus

- TIMIT corpus (1986): the first widely used corpus, still in use
- Utterances from 630 North American speakers
- Phonetically transcribed, time-aligned
- Standard training and test sets, agreed evaluation metric (phone error rate)

TIMIT phone recognition: label the audio of a recorded utterance using a sequence of phone symbols
- Frame classification: attach a phone label to each frame of data
- Phone classification: given a segmentation of the audio, attach a phone label to each (multi-frame) segment
- Phone recognition: supply the sequence of labels corresponding to the recorded utterance
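A time-aligned phonetic transcription of the TIMIT kind can be sketched as follows. The parser assumes a simple "start end phone" per-line format with times in samples (as in TIMIT's .phn files), and the example alignment itself is invented; expanding segments to per-frame labels gives the target sequence for the frame classification task above.

```python
def parse_phn(text):
    """Parse a TIMIT-style alignment: one 'start end phone' triple per line,
    with start/end given in samples (16 kHz)."""
    segments = []
    for line in text.strip().splitlines():
        start, end, phone = line.split()
        segments.append((int(start), int(end), phone))
    return segments

def frame_labels(segments, hop=160):
    """Expand time-aligned segments into one label per 10 ms frame,
    i.e. the target sequence for frame classification."""
    labels = []
    for start, end, phone in segments:
        labels.extend([phone] * ((end - start) // hop))
    return labels

example = """0 1600 sil
1600 3200 n
3200 4800 ow"""                    # a made-up alignment: silence, then "no"

segments = parse_phn(example)
labels = frame_labels(segments)    # 10 frames per 100 ms segment
```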

Basic speech recognition on TIMIT

Train a classifier of some sort to associate each feature vector with its corresponding label. The classifier could be:
- a neural network
- a Gaussian mixture model
- ...

Then at run time, a label is assigned to each frame.

Questions:
- What's good about this approach?
- What are the limitations?
- How might we address them?
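A minimal instance of this approach, on fabricated data: fit one mean per phone class (a stripped-down relative of the Gaussian mixture model option above) and assign each frame independently to the nearest class mean. The classes, dimensions, and data are invented for illustration, and the sketch's obvious weakness (it ignores all temporal structure) is one answer to the questions above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated training data: 2-D "feature vectors" for three phone classes,
# each scattered around a different class mean (illustrative only).
true_means = {"aa": np.array([0.0, 0.0]),
              "iy": np.array([4.0, 0.0]),
              "s":  np.array([0.0, 4.0])}
train = {p: m + 0.3 * rng.standard_normal((50, 2))
         for p, m in true_means.items()}

# "Training": estimate one mean per class from the training frames.
fitted = {p: x.mean(axis=0) for p, x in train.items()}

def classify_frame(feat):
    """Label a single frame with the class whose fitted mean is nearest."""
    return min(fitted, key=lambda p: np.linalg.norm(feat - fitted[p]))
```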

Evaluation

How accurate is a speech recognizer?
- String edit distance: use dynamic programming to align the ASR output with a reference transcription
- Three types of error: insertions, deletions, substitutions
- Word error rate (WER) sums the three types of error. If there are N words in the reference transcript, and the ASR output has S substitutions, D deletions and I insertions, then:

      WER = 100 (S + D + I) / N %        Accuracy = 100 - WER %

- For TIMIT, the phone error rate is defined analogously to the word error rate
- Speech recognition evaluations: common training and development data, with release of new test sets on which different systems may be evaluated using word error rate
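The WER computation above can be sketched directly: a standard dynamic-programming edit distance over words, then normalisation by the reference length. The example word sequences are invented.

```python
def wer(ref, hyp):
    """Word error rate: minimum substitutions + deletions + insertions
    needed to turn hyp into ref, as a percentage of len(ref)."""
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,             # substitution (or match)
                          d[i - 1][j] + 1, # deletion
                          d[i][j - 1] + 1) # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

error = wer("no right".split(), "no no right".split())  # one insertion
```

One inserted word against a two-word reference gives WER = 100 × 1/2 = 50%; note that WER can exceed 100% when the hypothesis contains many insertions, which is why accuracy = 100 - WER% can be negative.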

Next Lecture

[Figure: system overview, as before: recorded speech, signal analysis, and a search space combining the acoustic model, lexicon, and language model (estimated from training data) to produce the decoded text (transcription)]

Reading

- Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): chapter 7 (especially 7.4, 7.5) and section 9.3.

General interest:
- The Economist Technology Quarterly, "Language: Finding a Voice", Jan 2017. http://www.economist.com/technology-quarterly/2017-05-01/language
- "The State of Automatic Speech Recognition: Q&A with Kaldi's Dan Povey", Jul 2018. https://medium.com/descript/the-state-of-automatic-speech-recognition-q-a-with-kaldis-dan-povey-c860aada9b85