A new method to distinguish non-voice and voice in speech recognition

LI CHANGCHUN
Centre for Signal Processing
NANYANG TECHNOLOGICAL UNIVERSITY
SINGAPORE 639798

Abstract: We address the problem of removing non-voice disturbances in speech recognition. It is a persistent problem that a speech recognition system will wrongly take natural sounds such as a cough, a breath, or lip and nose noises as speech input and produce recognized words as output. Such non-voice sounds are unavoidable in natural speaking, and without effective control the performance often drops to an unacceptable level [1]. This paper puts forward a new method to detect the fundamental frequency and uses it to distinguish real speech input from non-voice sounds such as breath, lip noise, or the noise of people walking by. Applying this method in our command recognition system, we obtained good results and made the system robust enough for real-life use.

Key-words: distinction, autocorrelation, fundamental frequency, endpoint detection

1 Introduction

In speech recognition, most popular systems work well as long as the user speaks only what the system can recognize, with no additional noise or sound [2]. But when we pause (without also telling the system to pause), our breath and sounds from the throat or nose can cause false speech input and produce recognized words. In a text input system such errors can perhaps be corrected, but in a command recognition system, especially one that controls equipment by voice, they are unbearable. Some systems use filler models to absorb such noise, but since there are so many different non-voice sounds, it is almost impossible to exclude them completely by training filler models. Analysis of the non-voice sounds shows that they have a special characteristic compared with normal speech: these noises seldom have a fixed (or nearly fixed) Fundamental Frequency (FF).
We can therefore use this property to distinguish them. This paper sets out the difference between non-voice sounds and real voice and selects the fundamental frequency as the feature. Section 2 introduces the modified FF extraction algorithm. Section 3 describes its use in voice distinction and supplies a comprehensive application of the method, combined with energy and duration features, to construct a robust system. Conclusions and remarks are given in the last section, Section 4.

2 Fundamental Frequency Detection

The fundamental frequency reflects the vibration of the vocal cords. According to the mode of excitation, speech sounds can be divided into three types [3]:

1. Vowels and semivowels. Vowels are probably the most frequently used sounds in English speech recognition systems. When we speak, the vocal cords vibrate and produce quasi-periodic air pulses that excite a fixed vocal tract shape, giving vowels such as /a/, /o/, /i/, /æ/ and /u/. Since /w/, /l/, /r/ and /y/ have acoustic properties similar to vowels, they are called semivowels.

2. Nasal consonants, such as /m/ and /n/. These are produced with glottal excitation, with the oral tract closed at some point so that the sound is radiated through the nasal tract.

3. Fricatives and stops. Fricatives divide into unvoiced ones such as /f/ and /s/ and voiced ones such as /v/ and /z/. Stops likewise include voiced (/b/, /d/, /g/) and unvoiced (/p/, /t/, /k/) members; they are produced by building up pressure behind a closure somewhere in the oral tract and releasing it suddenly, without vocal cord vibration.

From the definitions above, the major difference between the first type and the others is whether or not the vocal cords vibrate. Almost every word includes some vowels or semivowels (excluding a few exceptions), so if we can determine the fundamental frequency, we can tell whether the input is voice. To extract the FF we select the autocorrelation algorithm [4] and make some modifications to improve it. Each frame is windowed,

    x_n(m) = s(n + m) w(m)                                    (1)

and its short-time autocorrelation is computed:

    R_n(k) = sum_{m=0..N-1-k} x_n(m) x_n(m + k)               (2)

Traditional algorithm:
1. Center clipping: find the maximum absolute value over the first 1/3 and over the last 1/3 of the frame, and use the smaller of the two (V0) to set the clipping threshold:

    y(n) = C[x(n)],  with CL = a * V0,  a = 0.6 ~ 0.8         (3)

2. Observe the autocorrelation function of the clipped signal and decide whether a fundamental frequency is present.

Fig. 1 The center-clipping function C[x]; note that the same threshold CL is used for the upper and lower parts.

The traditional autocorrelation algorithm does not consider the asymmetry of the waveform above and below the axis. From Fig. 2 we can see that if we use the same threshold to clip both halves of such a waveform, the periodic information, which is the basis of FF detection, is lost.
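As a concrete illustration of Eqs. (1)-(3), here is a minimal Python/NumPy sketch of a center-clipped autocorrelation pitch estimator. The function names, the 60-400 Hz search range, and the simple peak picking are our assumptions for the example, not part of the paper:

```python
import numpy as np

def center_clip(x, a=0.7):
    """Traditional center clipping (Eq. 3): the threshold CL is a fraction
    a of V0, the smaller of the peak magnitudes found in the first and
    last thirds of the frame."""
    n = len(x)
    v0 = min(np.max(np.abs(x[: n // 3])), np.max(np.abs(x[-(n // 3):])))
    cl = a * v0
    y = np.zeros_like(x, dtype=float)
    y[x > cl] = x[x > cl] - cl      # same threshold for the upper part...
    y[x < -cl] = x[x < -cl] + cl    # ...and the lower part
    return y

def autocorr_pitch(frame, fs, f_lo=60.0, f_hi=400.0):
    """Estimate the FF of one frame (e.g. 400 samples = 50 ms at 8 kHz)
    from the peak of the autocorrelation (Eq. 2) of the clipped signal,
    searched over a plausible pitch range."""
    y = center_clip(frame)
    r = np.correlate(y, y, mode="full")[len(y) - 1:]  # lags 0..N-1
    k_lo, k_hi = int(fs / f_hi), int(fs / f_lo)
    k = k_lo + int(np.argmax(r[k_lo:k_hi]))           # best pitch lag
    return fs / k
```

For a 50 ms frame at 8 kHz containing a pure 125 Hz tone (period exactly 64 samples), the estimate comes out at the true pitch; a real frame would normally be windowed first, as in Eq. (1).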
Therefore, we modify the algorithm to use different thresholds for the upper and lower parts of the waveform. As Fig. 3 shows, the essential periodic information is then preserved.

Fig. 2 Waveform of /a/

From Fig. 2 we can clearly see the periodic waveform, and that the upper and lower parts are not symmetrical.
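A minimal sketch of the modified clipper with separate upper and lower thresholds. The per-side threshold rule (applying the traditional V0 rule to each polarity separately) is our assumption; the paper only states that different thresholds are used:

```python
import numpy as np

def center_clip_asymmetric(x, a=0.7):
    """Center clipping with separate thresholds for the upper and lower
    halves of the waveform, so an asymmetric signal (e.g. from a
    microphone with unequal positive/negative response) keeps its
    periodic structure instead of having one half clipped away."""
    n = len(x)
    pos = np.maximum(x, 0.0)
    neg = np.minimum(x, 0.0)
    # Per-side V0: smaller of the peak values in the first and last thirds.
    v0_pos = min(pos[: n // 3].max(), pos[-(n // 3):].max())
    v0_neg = min((-neg[: n // 3]).max(), (-neg[-(n // 3):]).max())
    cl_pos, cl_neg = a * v0_pos, a * v0_neg
    y = np.zeros_like(x, dtype=float)
    y[x > cl_pos] = x[x > cl_pos] - cl_pos
    y[x < -cl_neg] = x[x < -cl_neg] + cl_neg
    return y
```

For a waveform whose negative peaks reach only 0.2 while the positive peaks reach 1.0, a single threshold of 0.7 would zero the entire lower half; the asymmetric clipper keeps structure on both sides.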

Each frame is cut with a 50 ms window and a 10 ms step (40 ms overlap at an 8000 Hz sample rate), and the center-clipping filter is applied to it.

Fig. 3 (a) Original waveform; (b) after center clipping with the traditional method; (c) autocorrelation function computed from the result of (d); (d) after center clipping with the improved method.

Fig. 3 shows the result of the autocorrelation. Fig. 3(d) cuts away the small disturbances and keeps the periodic information. This simplifies the autocorrelation function greatly and makes it easy to extract the fundamental frequency.

3 Distinction & Its Application

Using the FF extraction algorithm, we obtain a robust voice distinction method. Fig. 6, Fig. 7 and Fig. 8 show waveforms and their FF detection results. The waveform format is 8 kHz, 8-bit mono. Observing these figures carefully, we find that the FF results of non-voice sounds differ from real voice (vowels or semivowels) in two main ways:

1) The FF values are very irregular and distributed almost randomly.
2) Even where some continuous FF values occur, they lie below about 100 Hz.

We therefore set up the following check, shown as a flow chart in Fig. 4: scan the frames in order, counting consecutive frames whose FF exceeds 100 Hz and whose frame-to-frame change |FF[i] - FF[i-1]| stays below a threshold Th0 = 10 Hz; if a run of 5 such frames is found, classify the input as voice, otherwise as non-voice.

Fig. 4 Flow chart of FF extraction

Fig. 5 Diagram of the FF extraction applied in a command recognition system: the speech input passes through detection and the voice/non-voice judgement before the command recognition system outputs the recognized command.

Table 1 gives the experimental results. From it, we can see that this algorithm can clearly distinguish voice from other

non-voice sounds such as breath, cough, lip or throat sounds, and other noises. Combined with a fast algorithm for computing the autocorrelation and a voice endpoint detection algorithm [5][6], it can be used successfully in a speech recognition system. The detection part uses frame energy and word duration features [7][8].

Fig. 6 Real voice and its FF values

Fig. 7 Breath and its FF values

Fig. 8 Cough and its FF values

Table 1 Experimental results for voice/non-voice distinction

                 Times   Correctly recognized*
    Cough         20      19 #
    Breath        20      20
    Lip/Throat    20      20
    Other noise   20      20
    Real voice    50      50

* Note: "recognized" means correctly classified as either voice or non-voice.
# Note: the one error was a sound from a male speaker who coughed on purpose, very much like speaking.

4 Conclusions

This paper focused on a problem in speech recognition and set up a new method, based on fundamental frequency extraction, to distinguish real voice from non-voice noise. It should find application in real speech recognition systems. The paper also supplied an improved algorithm for extracting the fundamental frequency: the old method does not consider the asymmetry of the waveform, which can occur when the audio input device responds differently to the positive and negative parts of the waveform. In fact, according to our analysis this is a common case; we tested more than 10 microphones. With a reliable FF extraction algorithm, we analyzed the FF results for different sounds, real voice and noise (breath, cough, lip or throat sound, nose vibration, etc.). Finally, based on the differences between them, the paper put forward a distinction method. Experiments verified our analysis. When the method was used in a real system (constructed earlier to test speaker-independent command recognition), we obtained promising improvements over the baseline system without it.

References:
[1] C.H. Lee, "Some techniques for creating robust stochastic models for speech recognition", J. Acoust. Soc. America, Suppl. 1, Vol. 82, Fall 1987.
[2] L.R. Rabiner and S.E. Levinson, "Isolated and connected word recognition - theory and selected applications", IEEE Trans. Commun., Vol. COM-29, No. 5, pp. 621-659, May 1981.
[3] G.E. Peterson and H.L. Barney, "Control methods used in a study of the vowels", J. Acoust. Soc. America, Vol. 24, No. 2, pp. 175-194, March 1952.
[4] M.M. Sondhi, "New methods of pitch extraction", IEEE Trans. Audio and Electroacoustics, Vol. AU-16, No. 2, pp. 262-266, June 1968.
[5] M.H. Savoji, "A robust algorithm for accurate end-pointing of speech", Speech Communication, Vol. 8, pp. 45-60, 1989.
[6] H. Ney, "An optimization algorithm for determining the endpoints of isolated utterances", Proc. ICASSP 81, 1981, pp. 720-723.
[7] B. Reaves, "Comments on 'An improved endpoint detector for isolated word recognition'", IEEE Trans. Acoust., Speech, Signal Processing, Vol. 39, pp. 526-527, Feb. 1991.
[8] L.R. Rabiner and M.R. Sambur, "Voiced-unvoiced-silence detection using the Itakura distance measure", Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1977, pp. 323-326.
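To make the voice/non-voice check of Section 3 concrete (declare voice only on a run of 5 consecutive frames with FF above 100 Hz and frame-to-frame change under Th0 = 10 Hz), here is a minimal sketch; the function name and the plain-list FF track are our assumptions:

```python
def is_voice(ff_track, min_run=5, f_min=100.0, th0=10.0):
    """Voice/non-voice decision over a per-frame FF track, following the
    rule in Section 3: the input counts as voice if some run of min_run
    consecutive frames all have FF > f_min Hz and successive values
    differ by less than th0 Hz; otherwise it is non-voice."""
    run = 1  # length of the current run of stable voiced frames
    for prev, cur in zip(ff_track, ff_track[1:]):
        if prev > f_min and cur > f_min and abs(cur - prev) < th0:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1  # irregular or low FF breaks the run
    return False
```

A stable track like [120, 122, 121, 125, 124, 126] is accepted as voice, while an erratic one such as [50, 300, 80, 200, 90, 40], typical of a cough or breath, is rejected.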