Artificial Intelligence 2004


74.419 Artificial Intelligence 2004
Speech & Natural Language Processing

- Natural Language Processing: written text as input; sentences (well-formed)
- Speech Recognition: acoustic signal as input; conversion into written words
- Spoken Language Understanding: analysis of spoken language (transcribed speech)

Speech & Natural Language Processing

Areas in Natural Language Processing:
- Morphology
- Grammar & Parsing (syntactic analysis)
- Semantics
- Pragmatics
- Discourse / Dialogue
- Spoken Language Understanding

Areas in Speech Recognition:
- Signal Processing
- Phonetics
- Word Recognition

Speech Production & Reception

Sound and Hearing: a change in air pressure produces a sound wave.
- Reception through the inner ear membrane / a microphone
- Break-up into frequency components: receptors in the cochlea / mathematical frequency analysis (e.g. Fast Fourier Transform, FFT), yielding a frequency spectrum
- Perception/recognition of phonemes and subsequently words (e.g. Neural Networks, Hidden Markov Models)

Speech Recognition Phases

Speech Recognition: acoustic signal as input
- signal analysis: spectrogram
- feature extraction
- phoneme recognition
- word recognition
- conversion into written words

Speech Signal

The speech signal is composed of different sine waves with different frequencies and amplitudes, plus noise (which is not a sine wave).
- Frequency: waves per second; corresponds to pitch
- Amplitude: height of the wave; corresponds to loudness

The speech signal is thus a composite signal comprising different frequency components.
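As a small illustration (not from the lecture itself), such a composite signal can be built up in Python; the frequencies and amplitudes below are invented for demonstration, not real speech values:

```python
import numpy as np

# A toy "speech-like" signal: a sum of sine waves with different
# frequencies (Hz) and amplitudes, plus a little random noise.
sample_rate = 8000                         # samples per second
t = np.arange(0, 0.1, 1.0 / sample_rate)   # 100 ms of time points

components = [(120, 1.0), (240, 0.5), (720, 0.25)]  # (frequency, amplitude)
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in components)
signal += 0.05 * np.random.randn(len(t))   # noise: not a sine wave

print(signal.shape)  # one sample per time point
```

Each `(frequency, amplitude)` pair is one sinusoidal component; their sum plus noise is the composite signal discussed above.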

Waveform (fig. 7.20): amplitude/pressure over time for the utterance "She just had a baby."

Waveform for the vowel [ae] (fig. 7.21): amplitude/pressure over time.

Speech Signal Analysis

Analog-to-digital conversion of the acoustic signal:
- sampling in time frames ("windows")
- frequency = zero-crossings per time frame; e.g. 2 crossings/second is 1 Hz (1 wave)
- e.g. a 10 kHz signal needs a 20 kHz sampling rate
- measure the amplitudes of the signal in each time frame, yielding a digitized waveform

Separate the different frequency components:
- FFT (Fast Fourier Transform), yielding a spectrogram
- other frequency-based representations: LPC (linear predictive coding), cepstrum
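The windowing-plus-FFT step can be sketched as follows (a minimal illustration, assuming a synthetic two-tone signal rather than real speech; frame length and window choice are arbitrary here):

```python
import numpy as np

sample_rate = 8000
t = np.arange(0, 0.1, 1.0 / sample_rate)
# Synthetic input: 440 Hz plus a weaker 880 Hz component.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# Short-time analysis: split the digitized signal into time frames
# ("windows") and run an FFT on each frame; stacking the resulting
# magnitude spectra over time gives a spectrogram.
frame_len = 256
frames = [signal[i:i + frame_len]
          for i in range(0, len(signal) - frame_len + 1, frame_len)]
spectra = [np.abs(np.fft.rfft(f * np.hanning(frame_len))) for f in frames]
spectrogram = np.array(spectra)   # shape: (num_frames, frame_len // 2 + 1)

freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
peak = freqs[np.argmax(spectrogram[0])]
print(peak)  # strongest component near 440 Hz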

Waveform and Spectrogram (figs. 7.20, 7.23)

Waveform and LPC spectrum for the vowel [ae] (figs. 7.21, 7.22): amplitude/pressure over time; energy over frequency, with formants visible as peaks.

Speech Signal Characteristics

From the signal representation derive, e.g.:
- Formants: dark stripes in the spectrum (strong frequency components); characterize particular vowels and the gender of the speaker
- Pitch: fundamental frequency; baseline for higher-frequency harmonics like formants; gender characteristic
- Change in frequency distribution: characteristic for e.g. plosives (form of articulation)
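Pitch can be estimated in several ways; one simple sketch (not the lecture's method) searches for the first strong autocorrelation peak, which sits at the pitch period. The signal and the 200 Hz fundamental below are invented for illustration:

```python
import numpy as np

sample_rate = 8000
t = np.arange(0, 0.05, 1.0 / sample_rate)
f0 = 200  # fundamental frequency, chosen for illustration
# A periodic "voiced" signal: fundamental plus one harmonic.
signal = np.sin(2 * np.pi * f0 * t) + 0.4 * np.sin(2 * np.pi * 2 * f0 * t)

# Autocorrelation of the signal with itself, for non-negative lags.
corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]

# The first strong peak after lag 0 corresponds to the pitch period.
min_lag = int(sample_rate / 400)          # only search below 400 Hz
lag = min_lag + np.argmax(corr[min_lag:])
pitch = sample_rate / lag
print(pitch)  # close to the 200 Hz fundamental
```

Real pitch trackers are considerably more robust (they handle noise, octave errors, and unvoiced regions), but the idea of a periodicity peak at the fundamental period is the same.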

Video of glottis and speech signal in lingwaves (from http://www.lingcom.de)

Phoneme Recognition

Recognition process based on:
- features extracted from spectral analysis
- phonological rules
- statistical properties of language / pronunciation

Recognition methods:
- Hidden Markov Models
- Neural Networks
- pattern classification in general

Pronunciation Networks / Word Models as Probabilistic FAs (fig 5.12)

Pronunciation Network for 'about' (fig 5.13)

Word Recognition with Probabilistic FA / Markov Chain (fig 5.14)
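A pronunciation network like the ones in these figures can be represented directly as a Markov chain over phone states. The toy model below is for the word 'about'; the transition probabilities and the flapped variant are invented for illustration, not taken from the figures:

```python
# A toy pronunciation network for the word "about" as a Markov chain:
# states are phones, edges carry transition probabilities.
transitions = {
    ("start", "ax"): 1.0,
    ("ax", "b"): 1.0,
    ("b", "aw"): 1.0,
    ("aw", "t"): 0.7,     # careful pronunciation
    ("aw", "dx"): 0.3,    # flapped variant
    ("t", "end"): 1.0,
    ("dx", "end"): 1.0,
}

def sequence_probability(phones):
    """Probability that the word model generates this phone sequence."""
    prob = 1.0
    path = ["start"] + phones + ["end"]
    for s, s_next in zip(path, path[1:]):
        prob *= transitions.get((s, s_next), 0.0)  # 0 if edge is missing
    return prob

print(sequence_probability(["ax", "b", "aw", "t"]))   # 0.7
print(sequence_probability(["ax", "b", "aw", "dx"]))  # 0.3
```

A phone sequence the network cannot produce gets probability 0; competing pronunciations split the probability mass along the branching edges.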

Viterbi Algorithm: Overview (cf. Jurafsky, Ch. 5)

The Viterbi algorithm finds an optimal sequence of states in continuous speech recognition, given an observation sequence of phones and a probabilistic (weighted) FA (state graph). The algorithm returns the path through the automaton which has maximum probability and accepts the observation sequence.

a[s, s'] is the transition probability (in the phonetic word model) from the current state s to the next state s', and b[s', o_t] is the observation likelihood of s' given o_t. Here b[s', o_t] is 1 if the observation symbol matches the state, and 0 otherwise.

Viterbi Algorithm (fig. 5.19)

function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score ← viterbi[s, t] * a[s, s'] * b[s', o_t]
        if ((viterbi[s', t+1] = 0) or (new-score > viterbi[s', t+1])) then
          viterbi[s', t+1] ← new-score
          back-pointer[s', t+1] ← s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path.

Here a[s, s'] comes from the word model and b[s', o_t] from the observation (speech recognizer).

Viterbi Algorithm: Explanation (cf. Jurafsky, Ch. 5)

The Viterbi algorithm sets up a probability matrix, with one column for each time index t and one row for each state in the state graph. Each column has a cell for each state q_i in the single combined automaton for the competing words (in the recognition process).

The algorithm first creates T+2 columns, where T is the number of observations. The first column is an initial pseudo-observation, the second corresponds to the first observation phone, the third to the second observation, and so on; the final column again represents a pseudo-observation.

In the first column, the probability of the start state is initially set to 1.0; the other probabilities are 0. Then we move to the next column: for every state in column 0, we compute the probability of moving into each state in column 1.

The value viterbi[t, j] is computed by taking the maximum over the extensions of all the paths that lead to the current cell. An extension of a path from state i at time t-1 is computed by multiplying three factors:
- the path probability from the previous cell, viterbi[t-1, i]
- the transition probability a_{i,j} from previous state i to current state j
- the observation likelihood b_{j,t} that current state j matches observation symbol t; b_{j,t} is 1 if the observation symbol matches the state, and 0 otherwise.
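The pseudocode and explanation above can be turned into a small runnable sketch. The word model below is a toy version of the 'about' network with invented probabilities, and b(s', o) is the 1-or-0 match likelihood described above:

```python
# Word model for "about" as transition probabilities a[s][s'].
# The numbers are invented for illustration.
a = {
    "start": {"ax": 1.0},
    "ax": {"b": 1.0},
    "b": {"aw": 1.0},
    "aw": {"t": 0.7, "dx": 0.3},
    "t": {"end": 1.0},
    "dx": {"end": 1.0},
    "end": {},
}

def b(state, obs):
    """Observation likelihood: 1 if the observed phone matches the state."""
    return 1.0 if state == obs else 0.0

def viterbi(observations):
    """Return (best probability, best state path) for the observations."""
    # vit[t][s]: probability of the best path ending in state s after
    # consuming t observations; back[t][s] remembers the predecessor.
    vit = [{"start": 1.0}]
    back = [{}]
    for t, obs in enumerate(observations):
        vit.append({})
        back.append({})
        for s, prob in vit[t].items():
            for s_next, trans in a[s].items():
                score = prob * trans * b(s_next, obs)
                if score > vit[t + 1].get(s_next, 0.0):
                    vit[t + 1][s_next] = score
                    back[t + 1][s_next] = s
    if not vit[-1]:
        return 0.0, []          # no path accepts the observations
    best = max(vit[-1], key=vit[-1].get)
    path = [best]               # backtrace from the best final state
    for t in range(len(observations), 0, -1):
        path.append(back[t][path[-1]])
    return vit[-1][best], list(reversed(path))

print(viterbi(["ax", "b", "aw", "t"]))  # (0.7, ['start', 'ax', 'b', 'aw', 't'])
```

Because b is 1 or 0, extensions that mismatch the observed phone are pruned immediately; the dictionaries play the role of the viterbi[] and back-pointer[] matrices in the pseudocode.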

Speech Recognition: Processing Chain

- acoustic signal / sound wave → frequency spectrum (filtering, sampling; spectral analysis, FFT: signal processing / analysis)
- frequency spectrum → features (phonemes; context)
- features → phonemes (phoneme recognition: HMM, Neural Networks)
- phonemes → phoneme sequences / words (grammar or statistics)
- phoneme sequences / words → word sequence / sentence (grammar or statistics for likely word sequences)

Speech Recognizer Architecture (fig. 7.2)

Speech Processing: Important Types and Characteristics

- single word vs. continuous speech
- unlimited vs. large vs. small vocabulary
- speaker-dependent vs. speaker-independent training
- Speech Recognition vs. Speaker Identification

Additional References

- Huang, X., A. Acero & H. Hon: Spoken Language Processing. A Guide to Theory, Algorithms, and System Development. Prentice-Hall, NJ, 2001.
- Figures taken from: Jurafsky, D. & J. H. Martin: Speech and Language Processing. Prentice-Hall, 2000, Chapters 5 and 7.
- lingwaves (from http://www.lingcom.de)