Speech Communication, Spring 2006


Lecture 3: Speech Coding and Synthesis
Zheng-Hua Tan
Department of Communication Technology, Aalborg University, Denmark
zt@kom.aau.dk
Speech Communication, III, Zheng-Hua Tan, 2006

Human speech communication process
[Diagram, after Rabiner & Levinson, IEEE Trans. Communications, 1981: the human speech communication process, relating speech coding (waveform coding, vocoder coding), speech synthesis, speech recognition and speech understanding. Lecture 1 covered the communication process; Lecture 2 covered recognition; this lecture covers coding and synthesis.]

Part I: Speech coding
- Speech coding: waveform coding, parametric coding (vocoder), analysis-by-synthesis
- Speech synthesis: articulatory synthesis, formant synthesis, concatenative synthesis

Speech coding
- Definition: converting the analogue speech waveform into digital form.
- Objectives (for transmission and storage): high compression (reduction in bit rate) and low distortion (high quality of reconstructed speech). But the lower the bit rate, the lower the quality.
- Theoretical foundation: redundancies in the speech signal, and properties of speech production and perception.
- Applications: VoIP, digital cellular telephony, audio conferencing, voice mail.

Speech coders
- Waveform coders: directly encode waveforms by exploiting the characteristics of speech signals, mostly sample by sample (scalar coders). High bit rates and high quality. Examples: 64 kb/s PCM (G.711), 32 kb/s ADPCM (G.726).
- Parametric coders (voice coders, i.e., vocoders): represent the speech signal by a set of model parameters; estimate and encode the parameters from frames of speech. Low bit rates, good quality. Examples: 2.4 kb/s LPC, 2.4 kb/s MELP.
- Analysis-by-synthesis coders: a combination of waveform and parametric coding; medium bit rates. Examples: 16 kb/s CELP (G.728), 8 kb/s CELP (G.729).

Time-domain waveform coding
- Waveform coders directly encode waveforms by exploiting the temporal (time-domain) or spectral (frequency-domain) characteristics of speech signals.
- They treat speech as an ordinary signal waveform and aim to make the reconstructed (decoded) signal as similar as possible to the original, so SNR is always a useful performance measure.
- Time-domain methods: pulse code modulation (PCM), i.e., linear PCM, µ-law PCM, A-law PCM; adaptive PCM (APCM); differential PCM (DPCM); adaptive DPCM (ADPCM).

Linear PCM
- Analog-to-digital converters perform both sampling and quantization. Here we analyse the effects of quantization: each sample is represented by a fixed number of bits, B.
- B bits represent 2^B separate quantization levels.
- Assumption: bounded input discrete signal, |x[n]| ≤ X_max.
- Uniform quantization: a constant quantization step size Δ = x_i − x_(i−1) for all levels.

Linear PCM (cont'd)
- Two common uniform quantization characteristics: the mid-riser quantizer and the mid-tread quantizer.
- Two parameters define a uniform quantizer: the number of levels N = 2^B and the step size Δ = 2·X_max / 2^B.
- Example: a three-bit (N = 8) mid-riser quantizer.
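The uniform quantizer above can be sketched in a few lines. This is an illustrative mid-riser quantizer with N = 2^B levels over [-x_max, x_max]; the function name and clipping convention are my own, not from the lecture.

```python
import math

def midriser_quantize(x, B, x_max):
    """Uniform mid-riser quantizer: 2**B levels over [-x_max, x_max]."""
    delta = 2.0 * x_max / (2 ** B)                 # step size
    idx = math.floor(x / delta)                    # interval the sample falls in
    idx = max(-(2 ** (B - 1)), min(2 ** (B - 1) - 1, idx))  # clip to B bits
    return (idx + 0.5) * delta                     # mid-riser reconstruction level
```

For B = 3 and x_max = 1, the output levels are ±0.125, ±0.375, ±0.625, ±0.875, and the quantization error never exceeds Δ/2 = 0.125 for in-range inputs.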

Quantization noise and SNR
- Quantization noise: e[n] = x[n] − x̂[n], with |e[n]| ≤ Δ/2 if Δ = 2·X_max / 2^B.
- e[n] is approximately uniformly distributed, so its variance is
  σ_e² = Δ²/12 = X_max² / (3·2^(2B)).
- SNR of the quantizer:
  SNR(dB) = 10·log10(σ_x²/σ_e²) = 6.02·B + 4.77 − 20·log10(X_max/σ_x),
  indicating that each bit contributes about 6 dB of SNR.
- 11~12-bit PCM achieves about 35 dB, since signal energy can vary by 40 dB.

Applications of PCM
- 16-bit linear PCM: digital audio stored in computers (Windows WAV, Apple AIF, Sun AU); Compact Disc Digital Audio.
- A CD can store up to 74 minutes of music. Total amount of data = 44,100 samples/(channel·second) × 2 bytes/sample × 2 channels × 60 seconds/minute × 74 minutes = 783,216,000 bytes.
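Both numbers on this slide are easy to check. The sketch below evaluates the SNR formula for 12-bit PCM with 40 dB of signal-level variation (i.e., X_max/σ_x = 100, an assumed working point) and recomputes the CD capacity.

```python
import math

def pcm_snr_db(B, xmax_over_sigma):
    """SNR of a B-bit uniform quantizer:
    SNR(dB) = 6.02*B + 4.77 - 20*log10(X_max / sigma_x)."""
    return 6.02 * B + 4.77 - 20.0 * math.log10(xmax_over_sigma)

# 12 bits with 40 dB of headroom -> about 37 dB, consistent with the ~35 dB claim
snr = pcm_snr_db(12, 100.0)

# CD capacity: 74 minutes of 16-bit stereo audio at 44.1 kHz
cd_bytes = 44_100 * 2 * 2 * 60 * 74   # = 783,216,000 bytes
```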

µ-law and A-law PCM
- Human perception is affected by SNR, so we want a constant SNR for all quantization levels. This requires the step size to be proportional to the signal value rather than uniform.
- It is achieved with a logarithmic compander y[n] = ln|x[n]|, followed by a uniform quantizer on y[n], so that ŷ[n] = y[n] + ε[n] and
  x̂[n] = |x[n]|·exp{ε[n]} ≈ |x[n]|(1 + ε[n]) = |x[n]| + |x[n]|·ε[n];
  thus the SNR is constant for all levels: SNR = 1/σ_ε².

µ-law and A-law PCM (cont'd)
- µ-law approximation:
  y[n] = X_max · (log[1 + µ·|x[n]|/X_max] / log[1 + µ]) · sign{x[n]}
- A-law approximation: similar in form.
- G.711 standardized telephone speech coding: 64 kbps = 8 kHz sampling rate × 8 bits per sample. It achieves approximately 35 dB SNR, equivalent to a 12-bit uniform quantizer. Its quality is considered "toll", with an MOS of about 4.3, and it is a widely used baseline.
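The µ-law formula above translates directly into code. A minimal sketch, using µ = 255 (the value used in North American G.711; the function name is my own):

```python
import math

def mu_law_compress(x, mu=255.0, x_max=1.0):
    """Mu-law compander: y = X_max * log(1 + mu*|x|/X_max) / log(1 + mu) * sign(x)."""
    return (x_max * math.log(1.0 + mu * abs(x) / x_max)
            / math.log(1.0 + mu)) * math.copysign(1.0, x)
```

Note how small amplitudes are expanded before uniform quantization: an input of 0.01 maps to about 0.23, so it receives far finer effective quantization than under linear PCM.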

Parametric coding (vocoder)
- Based on the all-pole model of the vocal system.
- Estimate the model parameters from frames of speech (speech analysis) and encode the parameters on a frame-by-frame basis.
- Reconstruct the speech signal from the model (speech synthesis).

Parametric coding (vocoder) (cont'd)
- Does not require or guarantee similarity in the waveform.
- Lower bit rate, but the quality of the synthesized speech is not as good, in both clearness and naturalness.
- Example: the LPC vocoder, built on the source-filter model, in which the vocal tract filter is obtained by linear predictive coding.
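The analysis step of an LPC vocoder, estimating the all-pole filter from a frame of speech, can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a textbook sketch, not the specific implementation of any standardized vocoder:

```python
def lpc_coefficients(frame, order):
    """Estimate predictor coefficients a[1..p] of the all-pole model via the
    autocorrelation method and the Levinson-Durbin recursion."""
    n = len(frame)
    # autocorrelation r[0..order] of the (windowed) frame
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]                                   # prediction error energy
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k                             # new reflection coefficient
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]       # update lower-order coefficients
        a = new_a
        err *= (1.0 - k * k)                     # error shrinks at every order
    return a[1:], err
```

On a synthetic first-order signal x[n] = 0.9^n, an order-1 analysis recovers the predictor coefficient 0.9, which is the basic sanity check for this recursion.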

Analysis-by-synthesis: CELP
- CELP (code-excited linear prediction) is a family of techniques that quantize the LPC residual using VQ (hence the term "code-excited"), in addition to encoding the LPC parameters.
- CELP-based standards:

  Standard   kbps      MOS  Delay
  G.728      16        4.0  low
  G.729      8         3.9  10 ms
  G.723.1    5.3/6.3   3.9  30 ms
  GSM EFR    12.2      4.5

Speech coder attributes
- Factors: bandwidth (sampling rate), bit rate, quality of reconstructed speech, noise robustness, computational complexity, delay, channel-error sensitivity. In practice, coding strategies are a trade-off among them.
- Telephone speech: bandwidth 300~3400 Hz, sampled at 8 kHz.
- Wideband speech: bandwidth 50-7000 Hz, sampled at 16 kHz.
- Audio coding deals with high-fidelity audio signals, with a sampling rate of 44.1 kHz.

Mean Opinion Score (MOS)
- The most widely used measure of quality is the Mean Opinion Score (MOS), the result of averaging opinion scores over a set of subjects.
- Each numeric value maps to a subjective quality: excellent = 5, good = 4, fair = 3, poor = 2, bad = 1.

Organisations and standards
- The International Telecommunications Union (ITU):

  Standard   Method        Bit rate (kb/s)  MOS  Complexity (MIPS)  Release
  ITU G.711  µ/A-law PCM   64               4.3  0.01               1972
  ITU G.729  CS-ACELP      8                4.0  20                 1996

- The European Telecommunications Standards Institute (ETSI):

  Standard   Method    Bit rate (kb/s)  Release
  GSM FR     RPE-LTP   13               1987
  GSM AMR    ACELP     4.75-12.2        1998
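The MOS computation itself is just an average over the mapped scale. A trivial sketch (names are illustrative):

```python
SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mean_opinion_score(ratings):
    """Average the numeric scores of a list of subjective quality labels."""
    return sum(SCALE[r] for r in ratings) / len(ratings)
```

For instance, four listeners rating a coder excellent, good, good, fair yield an MOS of 4.0.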

Part II: Speech synthesis
- Speech coding: waveform coding, parametric coding (vocoder), analysis-by-synthesis
- Speech synthesis: articulatory synthesis, formant synthesis, concatenative synthesis

Text-to-speech (TTS)
- TTS converts arbitrary text to intelligible and natural-sounding speech.
- TTS can be viewed as a speech coding system with an extremely high compression ratio: the text file that is input to a speech synthesizer is a form of coded speech. What is the bit rate?
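A back-of-the-envelope answer to the bit-rate question, under assumed round numbers (about 150 words per minute of speech, about 6 characters per word including the space, 8 bits per ASCII character):

```python
# Text viewed as coded speech: rough bit rate of the "code"
words_per_minute = 150     # assumed average speaking/reading rate
chars_per_word = 6         # assumed, trailing space included
bits_per_char = 8          # plain ASCII text

text_bit_rate = words_per_minute * chars_per_word * bits_per_char / 60
```

This gives on the order of 100 bit/s, versus 64,000 bit/s for G.711 PCM: a compression ratio of several hundred, which is why TTS counts as extreme coding.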

Overview of TTS
- Text analysis (using a lexicon), including text normalization: numerical expansion, abbreviations, acronyms, proper names.
- Phonetic analysis: letter-to-sound conversion, producing a phonetic transcription.
- Prosody generation: pitch, duration, loudness.
- Synthesizer. Units: words, phones, diphones, syllables. Parameters: LPC, formants, waveform templates, articulatory. Algorithms: rules, concatenation.

Text analysis
- Document structure detection provides context for later processes; e.g. sentence breaking and paragraph segmentation affect prosody, and e-mail needs special care ("This is easy :-) ZT").
- Text normalization converts symbols and numbers into an orthographic transcription suitable for phonetic conversion, e.g. "Dr.", "9 am", "10:25", "16/02/2006" (Europe), "DK", "OPEC".
- Linguistic analysis recovers the syntactic and semantic features of words, phrases and sentences for both pronunciation and prosodic choices, e.g. word type (noun or verb), word sense (river bank or money bank).
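A toy sketch of the text-normalization step described above. The mini-lexicon and the digit-by-digit number expansion are hypothetical simplifications; real systems expand numbers into full cardinals, dates, times, etc.

```python
ABBREV = {"Dr.": "Doctor", "etc.": "et cetera"}   # hypothetical mini-lexicon
ONES = "zero one two three four five six seven eight nine".split()

def expand_number(token):
    """Toy expansion: spell out each digit (so '10:25' -> 'one zero two five')."""
    return " ".join(ONES[int(c)] for c in token if c.isdigit())

def normalize(text):
    """Replace abbreviations and spell out digit-bearing tokens."""
    out = []
    for token in text.split():
        if token in ABBREV:
            out.append(ABBREV[token])
        elif any(c.isdigit() for c in token):
            out.append(expand_number(token))
        else:
            out.append(token)
    return " ".join(out)
```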

Letter-to-sound
- LTS conversion provides a phonetic pronunciation for any sequence of letters.
- Approaches: dictionary lookup; if lookup fails, use rules, e.g.
    knight: k -> /sil/ % _n   (k is silent before n)
    kitten: k -> /k/
- Classification and regression trees (CART) are commonly used: a set of yes-no questions, with a procedure to select the best question at each node to grow the tree from the root.

Prosody
- Pause: indicating phrases and breaks.
- Pitch: accent, tone, intonation.
- Duration.
- Loudness.
- A prosody generation system takes the parsed text and phone string, performs pause insertion and prosodic phrasing, and then generates duration, the F0 contour and volume, conditioned on the speaking style.
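The dictionary-plus-rules fallback can be sketched as follows. The toy lexicon, the rule table, and the greedy longest-match strategy are illustrative stand-ins for a real LTS module (which would use CART or a full rule set):

```python
LEXICON = {"knight": ["n", "ay", "t"]}            # toy dictionary
RULES = {"kn": ["n"], "gh": [],                   # context rules: silent letters
         "k": ["k"], "i": ["ay"], "t": ["t"],
         "c": ["k"], "a": ["ae"], "e": ["eh"], "n": ["n"]}

def letter_to_sound(word):
    """Dictionary lookup first; fall back to greedy longest-match letter rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):
        for length in (2, 1):                     # try digraphs before single letters
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            i += 1                                # no rule: skip the letter
    return phones
```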

Speech synthesis
- The module of a TTS system that generates the waveform from the phonetic transcription and its associated prosody.
- Approaches:
  - Limited-domain waveform concatenation, e.g. IVR.
  - Concatenative synthesis with no waveform modification, from arbitrary text.
  - Concatenative synthesis with waveform modification, for prosody considerations.
  - Rule-based systems, as opposed to the data-driven synthesis above; for example, a formant synthesizer normally uses synthesis by rule.

Types according to the model
- Articulatory synthesis uses a physical model of speech production, including all the articulators.
- Formant synthesis uses a source-filter model, in which the filter is determined by slowly varying formant frequencies.
- Concatenative synthesis concatenates speech segments, where prosody modification plays a key role.

Formant speech synthesis
- A type of synthesis-by-rule in which a set of rules decides how to modify the pitch, formant frequencies, and other parameters from one sound to the next.
- Block diagram: phonemes + prosodic tags -> rule-based system -> pitch contour and formant tracks -> formant synthesizer -> waveform.

Concatenative speech synthesis
- Synthesis-by-rule generates unnatural speech. In concatenative synthesis, a speech segment is generated by playing back a waveform with a matching phoneme string: cut and paste, no rules required, completely natural segments.
- An utterance is synthesized by concatenating several speech segments. Discontinuities exist: spectral discontinuities due to formant mismatch at the concatenation point, and prosodic discontinuities due to pitch mismatch at the concatenation point.
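One common way to soften the discontinuity at a concatenation point is a short cross-fade over the join. This is a generic smoothing sketch, not a technique the lecture specifically prescribes (the slides only name the discontinuity problem):

```python
def crossfade_concat(seg1, seg2, overlap):
    """Join two waveform segments with a linear cross-fade over `overlap`
    samples to smooth the discontinuity at the concatenation point."""
    out = list(seg1[:-overlap])                   # seg1 minus its tail
    for i in range(overlap):
        w = i / overlap                           # fade-in weight for seg2
        out.append(seg1[len(seg1) - overlap + i] * (1.0 - w) + seg2[i] * w)
    out.extend(seg2[overlap:])                    # rest of seg2
    return out
```

Cross-fading only hides the waveform jump; the spectral and pitch mismatches listed above still require segment selection or prosody modification to fix properly.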

Key issues in concatenative synthesis
- Choice of unit: phoneme, diphone, word, sentence?
- Design of the set of speech segments: which segments, and how many?
- Choice of speech segments: how to select the best string of speech segments from a given library of segments, given a phonetic string and its prosody?
- Modification of the prosody of a speech segment, to best match the desired output prosody.

Choice of unit
Unit types in English (after Huang et al., 2001):

  Unit length  Unit type     # units    Quality
  Short        Phoneme       42         Low
               Diphone       ~1500
               Triphone      ~30K
               Semisyllable  ~2000
               Syllable      ~15K
               Word          100K-1.5M
  Long         Phrase
               Sentence                 High

Attributes of a speech synthesis system
- Delay: for interactive applications, < 200 ms.
- Memory resources: rule-based, < 200 KB; concatenative systems, up to 100 MB.
- CPU resources: for concatenative systems, searching may be a problem.
- Variable speed, e.g. fast speech: difficult for concatenative systems.
- Pitch control, e.g. a specific pitch requirement: difficult for concatenative systems.
- Voice characteristics, e.g. specific voices like a robot: difficult for concatenative systems.

Difference between synthesis and coding
[Diagram, after Rabiner & Levinson, IEEE Trans. Communications, 1981, relating speech synthesis, speech understanding, speech coding and speech recognition.]

Summary
- Speech coding
- Speech synthesis
- Next lectures: speech recognition