SPEECH PROCESSING Overview


Patrick A. Naylor
Spring Term 2008/9

Voice Communication

Speech is the way of choice for humans to communicate:
- no special equipment required
- no physical contact required
- no visibility required
- can communicate while doing something else

What type of Processing?

Speech Processing:
- Coding
- Synthesis
- Recognition
- Identity Verification
- Enhancement

The Human Speech Production Apparatus

- Tongue: used to alter the vocal tract shape
- Velum: closes off the nose cavity for all sounds except m, n and ng
- Epiglottis: closes off the larynx during eating
- Larynx: the vocal folds vibrate during voiced sounds

[Diagram of the vocal apparatus; remaining labels: "To lungs", "To stomach"]

Speech Production Physical Model

[Diagram with labels: Lungs, Vocal Folds, Pharynx Cavity, Velum, Nose Cavity, Mouth Cavity]

Sources of Sound Energy

- Turbulence: air moving quickly through a small hole (e.g. /s/ in "size")
- Explosion: pressure built up behind a blockage is suddenly released (e.g. /p/ in "pop")
- Vocal Fold Vibration: like the neck of a balloon (e.g. /a/ in "hard")
  - airflow through the vocal folds (vocal cords) reduces the pressure and they snap shut (Bernoulli effect)
  - muscle tension and the build-up of air pressure force the folds open again, and the process repeats
  - the frequency of vibration (fx) is determined by the tension in the vocal folds and the pressure from the lungs
  - for normal breathing and voiceless sounds (e.g. /s/) the vocal folds are held wide open and don't vibrate

Speech Sound Categories

- Voiced: speech sounds where the vocal folds vibrate
- Vowels: no blockage of the vocal tract and no turbulence
- Consonants: non-vowels
- Plosives: consonants involving an explosion of air

Vocal Tract Filter

- The sound spectrum is modified by the shape of the vocal tract. This is determined by movements of the jaw, tongue and lips.
- The resonant frequencies of the vocal tract cause peaks in the spectrum called formants.
- The first two formant frequencies are roughly determined by the distances from the tongue hump to the larynx and to the lips respectively.
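The link between resonances and spectral peaks can be illustrated with a small sketch. It assumes each resonance is modelled as a two-pole digital resonator, which is a common textbook simplification and not something the slides specify. The formant values (500 Hz and 1500 Hz), the pole radius and the 8 kHz sampling rate are illustrative assumptions, and all function names are hypothetical.

```python
import cmath
import math

FS_HZ = 8000.0   # assumed sampling rate
RADIUS = 0.97    # assumed pole radius; closer to 1 gives a sharper peak

def resonator_gain(freq_hz, centre_hz):
    """Gain at freq_hz of a two-pole resonator with pole angle set by centre_hz."""
    theta = 2 * math.pi * centre_hz / FS_HZ
    z_inv = cmath.exp(-2j * math.pi * freq_hz / FS_HZ)
    denom = 1 - 2 * RADIUS * math.cos(theta) * z_inv + RADIUS ** 2 * z_inv ** 2
    return 1.0 / abs(denom)

def tract_gain(freq_hz, formants_hz):
    """Gain of a cascade of resonators, one per assumed formant."""
    gain = 1.0
    for centre_hz in formants_hz:
        gain *= resonator_gain(freq_hz, centre_hz)
    return gain

def spectral_peaks(formants_hz, step_hz=10):
    """Frequencies on a coarse grid where the cascade gain has a local maximum."""
    freqs = list(range(step_hz, int(FS_HZ / 2), step_hz))
    gains = [tract_gain(f, formants_hz) for f in freqs]
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if gains[i - 1] < gains[i] > gains[i + 1]]

# Illustrative formant values only; they are not taken from the slides.
print(spectral_peaks([500, 1500]))
```

The printed peak frequencies should lie close to the chosen resonances, which is the sense in which formants appear as peaks in the spectrum.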

Vocal Tract Examples

[Figures only]

Speech Waveform Examples

Extracts from "my speech":
- (a) start of "y" vowel: intervals of 8.8 ms (114 Hz) and 0.8 ms (1.25 kHz)
- (b) "ee" vowel: intervals of 8.8 ms (114 Hz) and 4.3 ms (233 Hz)
- (c) "s" consonant
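The frequencies quoted for the waveform examples follow from the measured intervals through f = 1/T. A minimal sketch of that arithmetic (the helper name `period_ms_to_hz` is hypothetical):

```python
def period_ms_to_hz(period_ms: float) -> float:
    """Convert a period in milliseconds to a frequency in hertz."""
    return 1000.0 / period_ms

# Intervals quoted on the slide, in milliseconds.
for period_ms in (8.8, 0.8, 4.3):
    print(f"{period_ms} ms -> {period_ms_to_hz(period_ms):.0f} Hz")
```

Rounded to the nearest hertz this gives 114 Hz, 1250 Hz and 233 Hz, matching the slide.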

Spectrogram

- Dark areas of the spectrogram show high intensity
- Voiced segments are much louder than unvoiced
- Horizontal dark bands are the formant peaks
- "s" has a very high frequency, around 4.5 kHz (compare with telephone bandwidth: 0.5-3.4 kHz)
- "sh" is lower in frequency because the tongue is further back
- Vertical bands in "my" are individual larynx closures
- The "y" of "my" is a diphthong: two successive vowels

[Spectrogram of "my speech" with phoneme labels: m a i s p i t ʃ]

Phonemes

- Speakers and listeners divide words into component sounds called phonemes.
- Native speakers agree on the phonemes that make up a particular word
- There are about 42 phonemes in English
- The phonemes in a particular word may vary with dialect
- The actual sound that corresponds to a particular phoneme depends on:
  - the adjacent phonemes in the word or sentence
  - the accent of the speaker
  - the talking speed
  - whether it is a formal or informal occasion

Speech Coding

- To transmit/store a speech waveform using as few bits as possible while retaining sufficiently high quality
- Required quality depends on the application
- Motivation is to save bandwidth in telecoms applications and to reduce memory storage requirements
- Everyone uses speech coders when talking on the phone

Speech Coding - approach

- Correlation, predictability, redundancy:
  - Predict waveform samples from previous samples and transmit only the prediction error
  - The autocorrelation is the Fourier transform of the power spectrum, so a peaky spectrum implies strong short-term correlations (~0.5 ms)
  - Voiced speech is almost periodic, so there are strong long-term correlations (~10 ms)
- Devote few bits to the aspects of speech where errors are least noticeable
  - High amplitude speech will mask noise at the same frequency
- Ignore aspects of the speech that are inaudible
  - Power spectrum is much more important than precise waveform
  - For aperiodic sounds, the fine detail of the spectrum does not matter
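The idea of predicting each sample from previous samples and keeping only the prediction error can be sketched in a few lines. This is a minimal illustration that assumes the autocorrelation method with the Levinson-Durbin recursion, one standard way of fitting a linear predictor; the slides do not say which method the course uses. The 400 Hz test tone sampled at 8 kHz is a synthetic stand-in for a strongly correlated signal, not real speech, and all function names are hypothetical.

```python
import math

def autocorr(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order] and the final error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

def prediction_error(x, coeffs):
    """e[n] = x[n] - sum_k coeffs[k] * x[n-1-k]; samples before the start count as 0."""
    residual = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return residual

# Synthetic test signal (assumption): 400 Hz tone, 8 kHz sampling, 200 samples.
signal = [math.sin(2 * math.pi * 400 * n / 8000) for n in range(200)]
coeffs, _ = levinson_durbin(autocorr(signal, 2), 2)
residual = prediction_error(signal, coeffs)
ratio = sum(e * e for e in residual) / sum(s * s for s in signal)
print(f"predictor coefficients: {coeffs}")
print(f"residual energy / signal energy: {ratio:.4f}")
```

For a signal this predictable the residual carries only a small fraction of the original energy, which is why transmitting the prediction error can take fewer bits than transmitting the waveform itself.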

Speech Synthesis

- To convert a text string into a speech waveform
- Useful for technology to communicate when a display would be inconvenient because: (a) Too big, (b) Eyes busy, (c) Via phone, (d) In the dark, (e) Moving around

Speech Synthesis - issues

- The spelling of words doesn't match their sound
  - Pronunciation rules + an exceptions dictionary
- Some words have multiple meanings + sounds
  - Must guess which is the correct sound
- Simplistic speech models sound mechanical
  - Can use extracts from real speech
- Speech sounds are influenced by adjacent phonemes
  - Use phoneme pairs from real speech
- Important words must be slightly louder
  - Must try to understand the text
- Voice pitch and talking speed must vary smoothly throughout a sentence
  - Must be able to change pitch and speed without affecting formant frequencies
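The combination of pronunciation rules and an exceptions dictionary can be sketched as a lookup with a rule-based fallback. Everything below is a toy assumption for illustration: the two dictionary entries, the four spelling rules and the informal phoneme labels are invented here and are far smaller than anything a real synthesiser would use.

```python
# Hypothetical exceptions dictionary: words whose spelling misleads the rules.
EXCEPTIONS = {
    "one": ["w", "uh", "n"],
    "two": ["t", "oo"],
}

# Hypothetical letter-to-sound rules, tried in order at each position.
RULES = [("sh", "sh"), ("ch", "ch"), ("th", "th"), ("ee", "ee")]

def to_phonemes(word):
    """Look the word up in the exceptions first, then fall back to the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes, i = [], 0
    while i < len(word):
        for graphemes, phoneme in RULES:
            if word.startswith(graphemes, i):
                phonemes.append(phoneme)
                i += len(graphemes)
                break
        else:
            # No rule matched: let the letter stand for its own sound.
            phonemes.append(word[i])
            i += 1
    return phonemes

print(to_phonemes("sheep"))  # handled by the rules
print(to_phonemes("one"))    # handled by the exceptions dictionary
```

Checking the dictionary before the rules means an irregular word never reaches the rules, so the rules only need to cover regular spellings.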

Speech Recognition

- To convert a speech waveform into text
- Useful to communicate and control technology when a keyboard would be inconvenient because: (a) Too big, (b) Hands busy, (c) Via phone, (d) In the dark, (e) Moving around

Speech Recognition - issues

- The spelling of words doesn't match their sound
  - Have a big phonetic dictionary
- The waveform of a word varies a lot between different speakers (or even the same speaker)
  - Extract features from the speech waveform that are more consistent than the waveform
- The extracted features won't be exactly repeatable
  - Characterize them with a probability distribution
- Speech sounds are influenced by adjacent phonemes
  - Use context-dependent probability distributions
- Speaking speed varies enormously
  - Try all possible speaking speeds
- No clear boundary between words or phonemes
  - Try all possible boundaries
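Trying "all possible speaking speeds" does not have to mean enumerating them one by one. A minimal sketch, assuming dynamic time warping (DTW) as the alignment method, which is one classical technique and not necessarily the one this course uses: dynamic programming finds the cheapest alignment between two feature sequences over all monotonic time stretchings. Real features would be vectors per frame; the scalar sequences here are invented purely for illustration.

```python
def dtw_distance(a, b):
    """Cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [1, 2, 3, 2, 1]                    # hypothetical stored word
slow_copy = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]    # same shape at half speed
other_word = [3, 1, 3, 1, 3]                  # different shape

print(dtw_distance(template, slow_copy))
print(dtw_distance(template, other_word))
```

The half-speed copy aligns with the template at zero cost because every repeated value can be matched to the same template element, while the differently shaped sequence cannot be aligned without error.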

Supporting Materials

Books
- Discrete-Time Processing of Speech Signals, JR Deller, Jr, JG Proakis & JHL Hansen, Macmillan 1993, ISBN 0-02-328301-7. Comprehensive and quite good but has a few errors.
- Digital Processing of Speech Signals, LR Rabiner & RW Schafer, Prentice-Hall 1978, ISBN 0-13-213603-1. Excellent treatment of linear prediction, too old for coding, synthesis and recognition.
- Statistical Methods for Speech Recognition, F Jelinek, MIT Press 1998, ISBN 0-262-10066-5. Excellent treatment of theory underlying recognition.

Website
- www.ee.ic.ac.uk/hp/staff/pnaylor/speechprocessing.html

Syllabus

Lectures
- 2: Modelling Speech Production Acoustics
- 3: Time/Frequency Representation
- 4: Properties of Digital Filters
- 5-7: Linear Predictive Modelling
- 8-10: Speech Coding
- 11: Phonetics
- 12-13: Speech Synthesis
- 14-19: Speech Recognition