Digital Speech Processing, Lecture 1: Introduction to Digital Speech Processing

Speech Processing
- Speech is the most natural form of human-to-human communication.
- Speech is related to language; linguistics is a branch of social science.
- Speech is related to human physiological capability; physiology is a branch of medical science.
- Speech is also related to sound and acoustics, a branch of physical science.
- Therefore, speech is one of the most intriguing signals that humans work with every day.
Purpose of speech processing:
- to understand speech as a means of communication;
- to represent speech for transmission and reproduction;
- to analyze speech for automatic recognition and extraction of information;
- to discover some physiological characteristics of the talker.

Why Digital Processing of Speech?
- digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years
- much research has been done since 1965 on the use of digital signal processing in speech communication problems
- highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS
- there are abundant applications that are in widespread commercial use

The Speech Stack
- Speech Applications: coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down
- Speech Algorithms: speech-silence (background), voiced-unvoiced decision, pitch detection, formant estimation
- Speech Representations: temporal, spectral, homomorphic, LPC
- Fundamentals: acoustics, linguistics, pragmatics, speech perception

Speech Applications
We look first at the top of the speech processing stack, namely applications:
- speech coding
- speech synthesis
- speech recognition and understanding
- other speech applications

Speech Coding (block diagram)
Encoding: speech x_c(t) (continuous-time signal) -> A-to-D Converter -> x[n] (sampled signal) -> Analysis/Coding -> y[n] (transformed representation) -> Compression -> ŷ[n] (bit sequence, data) -> Channel or Medium
Decoding: Channel or Medium -> data -> Decompression -> Decoding/Synthesis -> D-to-A Converter -> speech
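
As a rough illustration of the encode/decode chain in this diagram, the sketch below assumes mu-law companding as the coding step. The slide does not name a specific coder; mu-law is simply one of the waveform coding methods listed in the course outline. The constant MU = 255, the 8-bit code width, the 8 kHz sampling rate, and all function names are illustrative assumptions.

```python
import math

MU = 255            # assumed companding constant
BITS = 8            # assumed code width per sample
SAMPLE_RATE = 8000  # assumed telephone-band sampling rate in Hz


def mu_law_encode(x, bits=BITS):
    """Compress a sample in [-1, 1] and quantize it to an integer code."""
    x = max(-1.0, min(1.0, x))
    # Logarithmic compression; the result stays in [-1, 1].
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** bits
    # Uniform quantization of the compressed value.
    return min(levels - 1, int((y + 1.0) / 2.0 * levels))


def mu_law_decode(code, bits=BITS):
    """Map an integer code back to an approximate sample in [-1, 1]."""
    levels = 2 ** bits
    # Use the centre of the quantization cell, then expand.
    y = (code + 0.5) / levels * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# Bit rate of the coded stream: samples per second times bits per sample.
bit_rate = SAMPLE_RATE * BITS

# Encode and decode a few samples of a 440 Hz tone (hypothetical test signal).
samples = [0.8 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(8)]
codes = [mu_law_encode(s) for s in samples]
decoded = [mu_law_decode(c) for c in codes]
print(bit_rate, codes)
```

With these assumed settings the bit rate works out to 8000 x 8 = 64,000 bps. The decoded samples differ from the originals only by the quantization error, which is smaller for low-amplitude samples than for high-amplitude ones.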

Speech Coding
Speech coding is the process of transforming a speech signal into a representation for efficient transmission and storage of speech:
- narrowband and broadband wired telephony
- cellular communications
- Voice over IP (VoIP), to utilize the Internet as a real-time communications medium
- secure voice for privacy and encryption for national security applications
- extremely narrowband communications channels, e.g., battlefield applications using HF radio
- storage of speech for telephone answering machines, IVR systems, prerecorded messages

Demo of Speech Coding
Narrowband speech coding:
- 64 kbps PCM
- 32 kbps ADPCM
- 16 kbps LD-CELP
- 8 kbps CELP
- 4.8 kbps FS1016
- 2.4 kbps LPC10E
Wideband speech coding (male talker / female talker):
- 3.2 kHz uncoded
- 7 kHz uncoded
- 7 kHz, 64 kbps
- 7 kHz, 32 kbps
- 7 kHz, 16 kbps

Demo of Audio Coding
CD original (1.4 Mbps) versus MP3-coded at 128 kbps:
- female vocal
- trumpet selection
- orchestra
- baroque
- guitar
Can you determine which is the uncoded and which is the coded audio for each selection?

Audio Coding
- Female vocal: MP3-128 kbps coded, CD original
- Trumpet selection: CD original, MP3-128 kbps coded
- Orchestral selection: MP3-128 kbps coded
- Baroque: CD original, MP3-128 kbps coded
- Guitar: MP3-128 kbps coded, CD original

Speech Synthesis (block diagram)
text -> Linguistic Rules -> DSP Computer -> D-to-A Converter -> speech

Speech Synthesis
Synthesis of speech is the process of generating a speech signal using computational means for effective human-machine interactions:
- machine reading of text or email messages
- telematics feedback in automobiles
- talking agents for automatic transactions
- automatic agent in customer care call center
- handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
- announcement machines that provide information such as stock quotes, airline schedules, weather reports, etc.

Speech Synthesis Examples
Soliloquy from Hamlet; Gettysburg Address; Third Grade Story; 1964-lrr; 2002-tts

Pattern Matching Problems
Block diagram: speech -> A-to-D Converter -> Feature Analysis -> Pattern Matching (against Reference Patterns) -> symbols
- speech recognition
- speaker recognition
- speaker verification
- word spotting
- automatic indexing of speech recordings
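
The pattern-matching stage can be sketched as a nearest-reference search. This is only a minimal illustration: the slide does not specify a distance measure or a feature set, so the Euclidean distance, the three-element feature vectors, the labels, and the function names below are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match(features, reference_patterns):
    """Return the label of the reference pattern closest to the input features."""
    return min(
        reference_patterns,
        key=lambda label: euclidean(features, reference_patterns[label]),
    )


# Hypothetical reference patterns: one stored feature vector per symbol.
references = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
}

# A hypothetical feature vector produced by the feature-analysis stage.
observed = [0.85, 0.2, 0.25]
print(match(observed, references))
```

The same structure applies to each problem in the list; what changes is what the reference patterns stand for (words, speakers, keywords) and how the features and the distance are defined.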

Speech Recognition and Understanding
Recognition and understanding of speech is the process of extracting usable linguistic information from a speech signal in support of human-machine communication by voice:
- command and control (C&C) applications, e.g., simple commands for spreadsheets, presentation graphics, appliances
- voice dictation to create letters, memos, and other documents
- natural language voice dialogues with machines to enable help desks, call centers
- voice dialing for cellphones and from PDAs and other small devices
- agent services such as calendar entry and update, address list modification and entry, etc.

Speech Recognition Demos

Other Speech Applications
- Speaker Verification: for secure access to premises, information, virtual spaces
- Speaker Recognition: for legal and forensic purposes and national security; also for personalized services
- Speech Enhancement: for use in noisy environments, to eliminate echo, to align voices with video segments, to change voice qualities, to speed up or slow down prerecorded speech (e.g., talking books, rapid review of material, careful scrutinizing of spoken material, etc.) => potentially to improve intelligibility and naturalness of speech
- Language Translation: to convert spoken words in one language to another to facilitate natural language dialogues between people speaking different languages, e.g., tourists, business people

DSP/Speech Enabled Devices
- Internet Audio
- Digital Cameras
- PDAs & Streaming Audio/Video
- Hearing Aids
- Cell Phones

Apple iPod
- stores music in MP3, AAC, MP4, WMA, WAV audio formats
- compression of 11-to-1 for 128 kbps MP3
- can store on the order of 20,000 songs with a 30 GB disk
- can use flash memory to eliminate all moving memory access
- can load songs from the iTunes store (more than 1.5 billion downloads)
- tens of millions sold
Block diagram: Memory -> x[n] -> Computer -> y[n] -> D-to-A -> y_c(t)
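
The 11-to-1 compression figure can be checked with a quick calculation. The sketch assumes the standard CD audio parameters (44.1 kHz sampling, 16 bits per sample, two channels), which the slides do not state explicitly; the variable names are illustrative.

```python
# Assumed CD audio parameters (not stated on the slide).
CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2

# MP3 bit rate quoted on the slide.
MP3_BPS = 128_000

# Uncompressed CD bit rate: samples/sec x bits/sample x channels.
cd_bps = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE * CD_CHANNELS

# Compression ratio of 128 kbps MP3 relative to the CD original.
compression_ratio = cd_bps / MP3_BPS

print(cd_bps, compression_ratio)
```

Under these assumptions the CD rate is about 1.4 Mbps, matching the figure quoted in the audio coding demo, and the ratio comes out close to 11.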

One of the Top DSP Applications: the Cellular Phone

Digital Speech Processing
- need to understand the nature of the speech signal, and how DSP techniques, communication technologies, and information theory methods can be applied to help solve the various application scenarios described above
- most of the course will concern itself with speech signal processing, i.e., converting one type of speech signal representation to another so as to uncover various mathematical or practical properties of the speech signal, and doing appropriate processing to aid in solving both fundamental and deep problems of interest

Speech Signal Production
Block diagram: Message Source -> (M) -> Linguistic Construction -> (W) -> Articulatory Production -> (S) -> Acoustic Propagation -> (A) -> Electronic Transduction -> (X) -> Speech Waveform
- Idea encapsulated in a message, M
- Message, M, realized as a word sequence, W
- Words realized as a sequence of (phonemic) sounds, S
- Sounds received at the transducer through acoustic ambient, A
- Signals converted from acoustic to electric, transmitted, distorted and received as X
Conventional studies of speech science use speech signals recorded in a sound booth with little interference or distortion.
Practical applications require use of realistic or "real world" speech with noise and distortions.

Speech Production/Generation Model
Message Formulation: desire to communicate an idea, a wish, a request, ... => express the message as a sequence of words
Block diagram: Desire to Communicate -> Message Formulation -> Text String (Discrete Symbols)
Example text strings: "I need some string"; "Please get me some string"; "Where can I buy some string"
Language Code: need to convert chosen text string to a sequence of sounds in the language that can be understood by others; need to give some form of emphasis, prosody (tune, melody) to the spoken sounds so as to impart non-speech information such as sense of urgency, importance, psychological state of talker, environmental factors (noise, echo)
Block diagram: Text String -> Language Code Generator (with Pronunciation Vocabulary, in the brain) -> Phoneme string with prosody (Discrete Symbols)

Speech Production/Generation Model
Neuro-Muscular Controls: need to direct the neuro-muscular system to move the articulators (tongue, lips, teeth, jaws, velum) so as to produce the desired spoken message in the desired manner
Block diagram: Phoneme String with Prosody -> Neuro-Muscular Controls -> Articulatory Motions (Continuous control)
Vocal Tract System: need to shape the human vocal tract system and provide the appropriate sound sources to create an acoustic waveform (speech) that is understandable in the environment in which it is spoken
Block diagram: Articulatory Motions, together with Source control (lungs, diaphragm, chest muscles) -> Vocal Tract System -> Acoustic Waveform (Speech) (Continuous control)

The Speech Signal
(Waveform figure with labeled regions: background signal, pitch period, unvoiced signal (noise-like sound).)

Speech Perception Model
- The acoustic waveform impinges on the ear (the basilar membrane) and is spectrally analyzed by an equivalent filter bank of the ear.
  Acoustic Waveform -> Basilar Membrane Motion -> Spectral Representation (Continuous Control)
- The signal from the basilar membrane is neurally transduced and coded into features that can be decoded by the brain.
  Spectral Features -> Neural Transduction -> Sound Features (Distinctive Features) (Continuous/Discrete Control)
- The brain decodes the feature stream into sounds, words and sentences.
  Sound Features -> Language Translation -> Phonemes, Words, and Sentences (Discrete Message)
- The brain determines the meaning of the words via a message understanding mechanism.
  Phonemes, Words and Sentences -> Message Understanding -> Basic Message (Discrete Message)

The Speech Chain
Production (discrete input to continuous input): Message Formulation -> Text -> Language Code -> Phonemes, Prosody -> Neuro-Muscular Controls -> Articulatory Motions -> Vocal Tract System -> Acoustic Waveform
Information rate along the production path: 50 bps, 200 bps, 2000 bps, 30-50 kbps
Transmission Channel: Acoustic Waveform
Perception (continuous output to discrete output): Acoustic Waveform -> Basilar Membrane Motion (Spectrum Analysis) -> Neural Transduction (Feature Extraction, Coding) -> Language Translation (Phonemes, Words, Sentences) -> Message Understanding (Semantics)

Speech Sciences
- Linguistics: science of language, including phonetics, phonology, morphology, and syntax
- Phonemes: smallest set of units considered to be the basic set of distinctive sounds of a language (20-60 units for most languages)
- Phonemics: study of phonemes and phonemic systems
- Phonetics: study of speech sounds and their production, transmission, and reception, and their analysis, classification, and transcription
- Phonology: phonetics and phonemics together
- Syntax: grammatical structure of an utterance (the meaning of an utterance is the domain of semantics)

The Speech Circle
- Customer voice request: "I dialed a wrong number"
- ASR (Automatic Speech Recognition): produces the words spoken
- SLU (Spoken Language Understanding): produces the meaning, e.g., "Billing credit"
- DM & SLG (Dialog Management (Actions) and Spoken Language Generation (Words)): uses data to decide what's next, e.g., "Determine correct number"
- TTS (Text-to-Speech Synthesis): voice reply to customer, "What number did you want to call?"

Information Rate of Speech
From a Shannon view of information:
- message content/information: 2^6 symbols (phonemes) in the language; 10 symbols/sec for normal speaking rate => 60 bps is the equivalent information rate for speech (issues of phoneme probabilities, phoneme correlations)
From a communications point of view:
- speech bandwidth is between 4 kHz (telephone quality) and 8 kHz (wideband hi-fi speech)
- need to sample speech at between 8 and 16 kHz, with about 8 (log encoded) bits per sample for high quality encoding => 8000 x 8 = 64,000 bps (telephone) to 16,000 x 8 = 128,000 bps (wideband)
1000-2000 times change in information rate from discrete message symbols to waveform encoding => can we achieve this three orders of magnitude reduction in information rate on real speech waveforms?
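
The arithmetic on this slide can be reproduced directly. The sketch below uses only the numbers given in the slide; the variable names are illustrative, and the symbol-rate estimate treats all phonemes as equally likely and independent, which is the simplification the slide flags in parentheses.

```python
import math

# Message view: 2**6 phonemes, 10 symbols per second.
num_phonemes = 2 ** 6
symbols_per_sec = 10
bits_per_symbol = math.log2(num_phonemes)        # equiprobable symbols
message_bps = bits_per_symbol * symbols_per_sec

# Waveform view: sampling rate x bits per sample.
bits_per_sample = 8
telephone_bps = 8_000 * bits_per_sample
wideband_bps = 16_000 * bits_per_sample

# Ratio between waveform coding rate and message information rate.
telephone_ratio = telephone_bps / message_bps
wideband_ratio = wideband_bps / message_bps

print(message_bps, telephone_bps, wideband_bps)
print(round(telephone_ratio), round(wideband_ratio))
```

The two ratios land at roughly one thousand and two thousand, which is where the "1000-2000 times" figure comes from.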

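Purpose of Course
Block diagram: Information Source -> Measurement or Observation -> Signal Representation -> Signal Transformation -> Extraction and Utilization of Information
- Information Source: human speaker, lots of variability
- Measurement or Observation: acoustic waveform / articulatory positions / neural control signals
- Signal Processing: signal representation and signal transformation
- Extraction and Utilization of Information: human listeners, machines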

Digital Speech Processing
DSP:
- obtaining discrete representations of the speech signal
- theory, design and implementation of numerical procedures (algorithms) for processing the discrete representation in order to achieve a goal (recognizing the signal, modifying the time scale of the signal, removing background noise from the signal, etc.)
Why DSP:
- reliability
- flexibility
- accuracy
- real-time implementations on inexpensive DSP chips
- ability to integrate with multimedia and data
- encryptability/security of the data and the data representations via suitable techniques

Hierarchy of Digital Speech Processing
Representation of Speech Signals:
- Waveform Representations: preserve wave shape through sampling and quantization
- Parametric Representations: represent signal as output of a speech production model
  - Excitation Parameters: pitch, voiced/unvoiced, noise, transients
  - Vocal Tract Parameters: spectral, articulatory

Information Rate of Speech
(Figure: data rate axis in bits per second, marked at 200,000; 60,000; 20,000; 10,000; 500; 75.)
- Waveform Representations (no source coding): LDM, PCM, DPCM, ADM
- Parametric Representations (source coding): analysis-synthesis methods, synthesis from printed text

Speech Processing Applications
- Cellphones, VoIP, vocoder: conserve bandwidth, encryption, secrecy, seamless voice and data
- Messages, IVR, call centers, telematics
- Secure access, forensics
- Dictation, command-and-control, agents, NL voice dialogues, call centers, help desks
- Readings for the blind, speed-up and slow-down of speech rates
- Noise and echo removal, alignment of speech and text

What We Will Be Learning
- review some basic DSP concepts
- speech production model: acoustics, articulatory concepts, speech production models
- speech perception model: ear models, auditory signal processing, equivalent acoustic processing models
- time domain processing concepts: speech properties, pitch, voiced-unvoiced, energy, autocorrelation, zero-crossing rates
- short time Fourier analysis methods: digital filter banks, spectrograms, analysis-synthesis systems, vocoders
- homomorphic speech processing: cepstrum, cepstrum pitch detection, formant estimation, homomorphic vocoder
- linear predictive coding methods: autocorrelation method, covariance method, lattice methods, relation to vocal tract models
- speech waveform coding and source models: delta modulation, PCM, mu-law, ADPCM, vector quantization, multipulse coding, CELP coding
- methods for speech synthesis and text-to-speech systems: physical models, formant models, articulatory models, concatenative models
- methods for speech recognition: the Hidden Markov Model (HMM)
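
As a small preview of the time domain processing topics in this outline, the sketch below computes short-time energy and zero-crossing rate for a single frame. It is only an illustration under stated assumptions: the 8 kHz sampling rate, the 30 ms frame, and the two synthetic test signals (a strong low-frequency tone standing in for voiced speech, a weak high-frequency tone standing in for unvoiced speech) are hypothetical choices, not material from the lecture.

```python
import math


def short_time_energy(frame):
    """Sum of squared sample values over one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


FS = 8000        # assumed sampling rate in Hz
FRAME_LEN = 240  # assumed frame length: 30 ms at 8 kHz

# Hypothetical stand-ins: strong 100 Hz tone vs. weak 3 kHz tone.
voiced_like = [math.sin(2 * math.pi * 100 * n / FS) for n in range(FRAME_LEN)]
unvoiced_like = [0.1 * math.sin(2 * math.pi * 3000 * n / FS) for n in range(FRAME_LEN)]

for name, frame in [("voiced-like", voiced_like), ("unvoiced-like", unvoiced_like)]:
    print(name, short_time_energy(frame), zero_crossing_rate(frame))
```

In this toy setup the voiced-like frame has high energy and a low zero-crossing rate, and the unvoiced-like frame shows the opposite pattern, which is the kind of contrast a voiced-unvoiced decision can exploit.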