Introduction to Speech Technology

13/Nov/2008
Presented by Andriy Temko
Department of Electrical and Electronic Engineering

Page 2 of 30 Outline
- Introduction & Applications
- Analysis of Speech
- Speech Recognition Problem

Page 3 of 30 Speech Signal
- A microphone converts the speech signal into an electrical waveform
- The possibility of converting the acoustic waveform into an electrical waveform and back again is the basis of Bell's telephone invention

Page 4 of 30 Speech Chain

Page 5 of 30 Applications: Speech Coding
[Figure: speech coding block diagram, encoder and decoder]

Page 6 of 30 Applications: Text-to-Speech Synthesis
- Simulation of the entire upper part of the Speech Chain
- A set of linguistic rules determines the appropriate set of sounds
- Not simply looking up the words in a pronouncing dictionary: abbreviations, ambiguous words, acronyms, proper names, special terms, intonation, etc.
- Most popular method: Unit Selection & Concatenation

Page 7 of 30 Applications: Speech Recognition
- Feature Analysis: converts a digital speech signal into a set of feature vectors
- Pattern Matching: finds the closest match between the dynamically time-aligned set of feature vectors and a set of stored patterns
- Speech Recognition: extracting a message from a signal
- Command and control of computer software
- Voice dictation
- Dialog with machines: help desks and call centers
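The pattern-matching step above can be sketched with Dynamic Time Warping. This is a minimal illustration, not the presenter's implementation: the features are assumed to be single numbers rather than feature vectors, the local distance is assumed to be the absolute difference, and the template names `up_down` and `down_up` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats
                                     cost[i][j - 1],      # b[j-1] repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def closest_pattern(query, templates):
    """Return the name of the stored pattern with the smallest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(query, templates[name]))


# Hypothetical stored patterns and a slower rendition of the first one.
templates = {"up_down": [1, 2, 3, 2, 1], "down_up": [3, 2, 1, 2, 3]}
query = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
print(closest_pattern(query, templates))
```

The time alignment is what lets the slow query match the short template at zero cost, even though the two sequences have different lengths.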

Page 8 of 30 Applications: Others
- Speaker Recognition: who is speaking
- Speaker Verification: verify the claimed identity
- Speaker Diarization: who spoke when
- Word Spotting: monitoring the signal for a special word
- Speech/Audio Indexing: identifying audio class (Broadcast news transcription)
- Audio Recognition: identifying acoustic events (Audio-based surveillance/smart-rooms)
- Speech Enhancement: make speech more intelligible

Page 9 of 30 Interesting Facts: Perception of Loudness
- Greatest sensitivity at around 3 to 4 kHz
- Almost precisely the range of frequencies occupied by most of the sounds of speech!
- Non-uniform filter-bank analysis

Page 10 of 30 Interesting Facts: Auditory Masking
- Critical-band phenomena
- Widely used in speech coding (perceptually lossless coding)

Page 11 of 30 Outline
- Introduction & Applications
- Analysis of Speech
- Speech Recognition Problem

Page 12 of 30 Short-Time Analysis of Speech. Windowing
- Windowing: small portions of the signal are assumed to be pseudostationary
- Windowing yields a set of speech samples x(n) weighted by the shape of the window w(n)
- Successive windows generally overlap, because w(n) tends to have a shape that de-emphasises samples near its edges
- This breaks the speech down into a sequence of frames
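A minimal sketch of the framing step described above. The slide does not name a window shape, so the Hamming window, the 8 kHz sampling rate, the 25 ms frame length and the 10 ms hop are all assumptions chosen for illustration.

```python
import math


def hamming(length):
    """Hamming window w(n); an assumed choice, the slide names no window."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(x, frame_len, hop):
    """Split x(n) into overlapping frames, each weighted by w(n)."""
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * w[n] for n in range(frame_len)])
    return frames


# One second of a 100 Hz sine at an assumed 8 kHz sampling rate.
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]

# 25 ms frames (200 samples) taken every 10 ms (80 samples), so frames overlap.
frames = frame_signal(x, frame_len=200, hop=80)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Because the hop is shorter than the frame, the samples that one window de-emphasises near its edges fall near the middle of a neighbouring window.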

Page 13 of 30 Short-Time Analysis of Speech. FFT
[Figure panels: Wide band, Narrow band]

Page 14 of 30 Short-Time Analysis of Speech. Spectral Envelope
[Figure panels: Wide band, Narrow band]
- You cannot get good time resolution and good frequency resolution from the same spectrogram (Uncertainty Principle)
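The trade-off can be made concrete with a little arithmetic. In this sketch the sampling rate (16 kHz) and the two window lengths (5 ms and 40 ms) are assumed values, and the DFT bin spacing fs/N without zero padding is used as a rough stand-in for frequency resolution; the true resolution also depends on the window shape.

```python
def stft_resolution(window_ms, fs):
    """Return (samples, duration in s, DFT bin spacing in Hz) for one window."""
    n = round(window_ms * 1e-3 * fs)
    return n, n / fs, fs / n


fs = 16000  # assumed sampling rate in Hz

# A short window gives the wide-band view, a long window the narrow-band view.
for label, window_ms in [("wide band", 5), ("narrow band", 40)]:
    n, duration, spacing = stft_resolution(window_ms, fs)
    print(f"{label}: {n} samples, {duration * 1e3:.0f} ms, "
          f"{spacing:.0f} Hz between bins")
```

The duration multiplied by the bin spacing is always 1, so shortening the window to localise events in time necessarily coarsens the frequency grid, and vice versa.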

Page 15 of 30 Phoneme
- Speakers and listeners divide words into component sounds called phonemes
- Native speakers agree on the phonemes that make up a particular word
- There are about 42 phonemes in English
- The actual sound that corresponds to a particular phoneme depends on:
  - The adjacent phonemes in the word or sentence
  - The accent of the speaker
  - The talking speed
  - Whether it is a formal or informal occasion

Page 16 of 30 Voiced / Unvoiced Phoneme
- Vowel/consonant discrimination with Zero Crossing Rate and Short-Time Energy
- Determination of pitch (fundamental frequency) with autocorrelation
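The three measurements named above can be sketched per frame as follows. This is an illustration under assumptions, not a production detector: the test signals are synthetic (a 125 Hz sine standing in for a voiced frame, low-level uniform noise for an unvoiced one), the sampling rate is assumed to be 8 kHz, and the 60 to 400 Hz pitch search range is an assumed default.

```python
import math
import random


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def pitch_autocorr(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate F0 as fs / lag, where lag maximises the autocorrelation."""
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return fs / best_lag


fs = 8000  # assumed sampling rate in Hz
voiced = [math.sin(2 * math.pi * 125 * n / fs) for n in range(400)]
rng = random.Random(0)
unvoiced = [0.1 * rng.uniform(-1.0, 1.0) for _ in range(400)]

for name, frame in [("voiced", voiced), ("unvoiced", unvoiced)]:
    print(name, round(short_time_energy(frame), 4),
          round(zero_crossing_rate(frame), 3))
print("estimated pitch of voiced frame:", pitch_autocorr(voiced, fs), "Hz")
```

Voiced frames tend to show high energy and a low zero crossing rate, while noise-like unvoiced frames show the opposite; the autocorrelation peak is only meaningful for the voiced case.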

Page 17 of 30 Outline
- Introduction & Applications
- Analysis of Speech
- Speech Recognition Problem

Page 18 of 30 Speech Recognition
[Block diagram: utterance -> Acoustic front-end -> Recognition algorithm -> Understanding algorithm -> meaning]
[Knowledge sources: Phonologic rules, Phonetic models, Dictionary and grammar, Task model]

Page 19 of 30 Speech Recognition
[Block diagram as on page 18, with a plot whose axis is labelled Hz]

Page 21 of 30 Speech Recognition
[Block diagram as on page 18, with a Markov Model over Phoneme k-1, Phoneme k, Phoneme k+1]

Page 23 of 30 Speech Recognition
[Block diagram as on page 18]
Trigram:
Pr{the door was not opened}
  = Pr{the} Pr{door | the} Pr{was | the door} Pr{not | the door was} Pr{opened | the door was not}
  ≈ Pr{the} Pr{door | the} Pr{was | the door} Pr{not | door was} Pr{opened | was not}
The first line is the exact chain rule; the second is the trigram approximation, which conditions each word on only the two words before it.
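A minimal sketch of how such trigram probabilities could be estimated from counts. The three-sentence corpus is invented for illustration, the estimates are plain relative frequencies with no smoothing, and the `<s>` and `</s>` padding symbols are an assumption of this sketch (the slide writes the first two factors as Pr{the} and Pr{door | the} instead).

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories over padded sentences."""
    trigrams, histories = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            histories[(padded[i - 2], padded[i - 1])] += 1
            trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1
    return trigrams, histories


def sentence_prob(words, trigrams, histories):
    """Product of Pr{w_i | w_(i-2) w_(i-1)} using relative frequencies."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        history = (padded[i - 2], padded[i - 1])
        if histories[history] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= trigrams[history + (padded[i],)] / histories[history]
    return prob


# Invented toy corpus, for illustration only.
corpus = [
    "the door was not opened".split(),
    "the door was opened".split(),
    "the window was not opened".split(),
]
trigrams, histories = train_trigram(corpus)
print(sentence_prob("the door was not opened".split(), trigrams, histories))
print(sentence_prob("the window was opened".split(), trigrams, histories))
```

The second sentence gets probability zero because the trigram "window was opened" never occurs in the toy corpus, which is the reason practical language models add smoothing or back-off.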

Page 24 of 30 Speech Recognition
[Block diagram as on page 18, with an added TRAINING part: DATABASE (voice, text), Acoustic front-end, Phonetic modeling, Language modeling]

Page 25 of 30 A Snapshot of Acoustic Front-End No standard set of features for speech recognition: acoustic/articulatory/auditory

Page 26 of 30 A Snapshot of Recognition Algorithm (I)
- Viterbi/Baum-Welch alignment
- Dynamic Time Warping
- Weighted Finite State Transducers (WFST)
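A minimal sketch of Viterbi alignment on a small hidden Markov model. Everything concrete here is an assumption made for illustration: three left-to-right states named `ph1`, `ph2`, `ph3`, discrete observation symbols `a`, `b`, `c`, and hand-picked probabilities. Real recognisers use continuous feature vectors and trained models.

```python
import math


def ln(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, states, start, trans, emit):
    """Return the most likely state path and its log probability."""
    delta = {s: ln(start[s]) + ln(emit[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointer = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda r: delta[r] + ln(trans[r][s]))
            new_delta[s] = delta[prev] + ln(trans[prev][s]) + ln(emit[s][o])
            pointer[s] = prev
        delta = new_delta
        backpointers.append(pointer)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical left-to-right model: stay in a state or move to the next one.
states = ["ph1", "ph2", "ph3"]
start = {"ph1": 1.0, "ph2": 0.0, "ph3": 0.0}
trans = {
    "ph1": {"ph1": 0.6, "ph2": 0.4, "ph3": 0.0},
    "ph2": {"ph1": 0.0, "ph2": 0.6, "ph3": 0.4},
    "ph3": {"ph1": 0.0, "ph2": 0.0, "ph3": 1.0},
}
emit = {
    "ph1": {"a": 0.8, "b": 0.1, "c": 0.1},
    "ph2": {"a": 0.1, "b": 0.8, "c": 0.1},
    "ph3": {"a": 0.1, "b": 0.1, "c": 0.8},
}

path, log_prob = viterbi(["a", "a", "b", "b", "c"], states, start, trans, emit)
print(path, math.exp(log_prob))
```

Working in the log domain avoids numerical underflow on long utterances, and the recovered path is the alignment of observation frames to model states.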

Page 27 of 30 A Snapshot of Recognition (II) A simple example of the whole decoding network

Page 28 of 30 A Snapshot of Recognition (III)

Page 29 of 30 State of the Art
CORPUS                                    STYLE                      VOCABULARY SIZE  % WORD ERRORS
Digit strings                             spontaneous                11               2.0
Digit strings                             conversational             11               5.0
Resource Management                       read                       1,000            2.0
Airline Travel Information System (ATIS)  spontaneous                2,500            2.5
North American Business News (NAB)        read                       64,000           6.6
Call Home                                 conversational telephonic  28,000           40.0
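A sketch of how a word error percentage can be computed, assuming the table uses the usual definition: substitutions, deletions and insertions in the best word-level alignment, divided by the number of reference words. The example sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = fewest edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)


# Invented example: two of the five reference words are misrecognised.
print(word_error_rate("the door was not opened", "the door is not open"))
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.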

Page 30 of 30 Literature
- L. R. Rabiner, R. W. Schafer, Introduction to Digital Speech Processing, Foundations and Trends in Signal Processing, Vol. 1, Nos. 1-2, 2007
- X. Huang, A. Acero, H. Hon, R. Reddy, Spoken Language Processing: A Guide to Theory, Algorithm and System, Prentice Hall, 2001
- D. Jurafsky, J. H. Martin, Speech and Language Processing, Prentice Hall, 2001