A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications

G. Saha, Sandipan Chakroborty, Suman Senapati
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, Kharagpur-721 302, India
Email: gsaha@ece.iitkgp.ernet.in, sandipan@ece.iitkgp.ernet.in, speech_ece@rediffmail.com

Abstract

Pre-processing of the speech signal serves various purposes in any speech processing application: noise removal, endpoint detection, pre-emphasis, framing, windowing, echo cancelling, and so on. Of these, removal of the silence/unvoiced portion together with endpoint detection is the fundamental step for applications such as speech and speaker recognition. The proposed method uses the probability density function (PDF) of the background noise and a linear pattern classifier to separate the voiced part of a speech signal from the silence/unvoiced part. The method achieves better endpoint detection and silence removal than the conventional Zero Crossing Rate (ZCR) and Short Time Energy (STE) methods.

1. Introduction

Pre-processing of the speech signal is crucial in applications where silence or background noise is undesirable. Applications such as speech and speaker recognition [1] need efficient feature extraction from the speech signal, and most of the speech- or speaker-specific attributes reside in the voiced part. Endpoint detection [2],[3] and silence removal are techniques long adopted for this purpose; they also reduce the dimensionality of the speech data, making the system computationally more efficient. Classification of speech into voiced and silence/unvoiced segments [4] finds further application in fundamental frequency estimation, formant extraction, syllable marking, stop consonant identification, and endpoint detection for isolated utterances.

There are several ways of classifying (labeling) events in speech. It is an accepted convention to use a three-state representation in which the states are (i) silence (S), where no speech is produced; (ii) unvoiced (U), in which the vocal cords [5] are not vibrating, so the resulting waveform is aperiodic or random in nature; and (iii) voiced (V), in which the vocal cords are tensed and vibrate periodically as air flows from the lungs, so the resulting waveform is quasi-periodic [6]. The segmentation of the waveform into well-defined regions of silence, unvoiced, and voiced speech is not exact: it is often difficult to distinguish a weak unvoiced sound (like /f/ or /th/) from silence, or a weak voiced sound (like /v/ or /m/) from an unvoiced sound or even silence. However, it is usually not critical to segment the signal to a precision much finer than several milliseconds, so small errors in boundary locations have no consequence for most applications. Since in most practical cases the unvoiced part has low energy content, silence (background noise) and unvoiced speech are classified together as silence/unvoiced and distinguished from the voiced part.

Two widely accepted methods, Short Time Energy (STE) [6],[7] and Zero Crossing Rate (ZCR) [6],[7], have been used for silence removal for a long time, but both have the limitation that their thresholds are set on an ad hoc basis. STE uses the fact that the energy of voiced samples is greater than that of silence/unvoiced samples.
However, it does not specify how much greater the energy must be for proper classification, and this varies from case to case. ZCR, on the other hand, has a demarcation rule: if the ZCR of a portion of speech exceeds about 50 crossings per 10 ms window, that portion is labeled unvoiced or background noise, whereas a segment showing a ZCR of about 12 is considered voiced. One attempt [8] combined the two methods and reported only 65% accuracy with respect to manually labeled speech samples.

In this paper, we detect the silence/unvoiced part of a speech sample using the one-dimensional Mahalanobis distance [9], which acts as a linear pattern classifier [9],[10]. Our algorithm exploits the statistical properties of the background noise as well as physiological aspects of speech production, and it does not assume any ad hoc threshold. We evaluate the algorithm's performance with a measure of correctness that takes manually labeled speech as the reference. The experiments are performed on two kinds of speech: a running text read from a paragraph, and a combination lock number. In both cases the results show better classification for the proposed method than for the conventional silence/unvoiced detection methods. We assume that the background noise present in the utterances is Gaussian [11] in nature; a speech signal may, however, be contaminated with other types of noise [12], in which case the corresponding properties of that noise distribution should be used for detection.

This paper is organized as follows. Section 2 describes the theoretical background. Section 3 presents the algorithm together with a short discussion of its computational complexity and the definition of the measure of correctness. The results are presented in Section 4, and Section 5 states the principal conclusions.
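For concreteness, the following is a minimal sketch (not the authors' code) of the two conventional frame-level measures discussed above, assuming 8 kHz audio and non-overlapping 10 ms (80-sample) frames. The thresholds are illustrative placeholders only, since the point made above is precisely that such thresholds are ad hoc.

```python
import numpy as np

def frame_signal(x, frame_len=80):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def short_time_energy(x, frame_len=80):
    """STE: sum of squared sample values in each frame."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len)
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(x, frame_len=80):
    """ZCR: number of sign reversals in each frame."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len)
    return np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

# Ad hoc decision rules, as criticized above: STE needs a hand-tuned
# energy threshold; ZCR separates voiced (~12 crossings/10 ms) from
# unvoiced (~50 crossings/10 ms) with an arbitrary cut in between.
def classify_ste(x, energy_threshold):
    return short_time_energy(x) > energy_threshold      # True = voiced

def classify_zcr(x, zcr_threshold=30):
    return zero_crossing_rate(x) < zcr_threshold        # True = voiced
```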

2. Theoretical Background

2.1 Speech Signal and its Basic Properties

The speech signal [13] is a slowly time-varying signal [14] in the sense that, when examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary; over longer periods (on the order of 1/5 of a second or more), however, the signal characteristics change to reflect the different speech sounds being spoken. Usually the first 200 ms or more of a speech recording (1600 samples at a sampling rate of 8000 samples/sec) correspond to silence (or background noise), because the speaker takes some time to begin reading after the recording starts. Figure 1 illustrates this.

Fig. 1. Diagram of a typical speech signal.

2.2 Gaussian or Normal Distribution

One of the most important results of probability theory is the Central Limit Theorem [9], which states that, under various conditions, the distribution of the sum of d independent random variables approaches a particular limiting form known as the normal distribution. As such, the normal or Gaussian probability density function is very important for both theoretical and practical reasons. In one dimension it is defined by

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad (1)$$

The normal density is traditionally described as a "bell-shaped curve"; it is completely determined by the numerical values of two parameters, the mean $\mu$ and the variance $\sigma^2$. This is often emphasized by writing $p(x) \sim N(\mu, \sigma^2)$, which is read as "x is distributed normally with mean $\mu$ and variance $\sigma^2$". The distribution is symmetrical about the mean, the peak occurs at $x = \mu$, and the width of the bell is proportional to the standard deviation $\sigma$. Normally distributed data points tend to cluster about the mean; numerically, the probabilities obey

$$\Pr\left[\,|x-\mu| \le \sigma\,\right] \approx 0.68 \qquad (2)$$

$$\Pr\left[\,|x-\mu| \le 2\sigma\,\right] \approx 0.95 \qquad (3)$$

$$\Pr\left[\,|x-\mu| \le 3\sigma\,\right] \approx 0.997 \qquad (4)$$

as shown in Fig. 2.

Fig. 2. A one-dimensional Gaussian distribution, $p(u) \sim N(0,1)$, has 68% of its probability mass in the range $|u| \le 1$, 95% in the range $|u| \le 2$, and 99.7% in the range $|u| \le 3$.

A natural measure of the distance from x to the mean is the distance $|x - \mu|$ measured in units of the standard deviation,

$$r = \frac{|x-\mu|}{\sigma} \qquad (5)$$

which is defined as the Mahalanobis distance from x to $\mu$ (in the one-dimensional case this is sometimes called the z-score). Thus, for instance, the probability is 0.95 that the Mahalanobis distance from x to $\mu$ is less than 2. If a random variable x is modified by (a) subtracting its mean and (b) dividing by its standard deviation, it is said to be standardized. Clearly, the standardized normal random variable $u = (x-\mu)/\sigma$ has zero mean and unit standard deviation, that is,

$$p(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}} \qquad (6)$$

which can be written as $p(u) \sim N(0,1)$.
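As a concrete illustration of how equations (4) and (5) are applied in Section 3, the following short example (with assumed, purely illustrative values of $\mu$ and $\sigma$) standardizes a few samples against the noise statistics and applies the 3-sigma rule:

```python
import numpy as np

# Illustrative values only: suppose the background noise of a recording
# has estimated mean mu = 0.002 and standard deviation sigma = 0.01.
mu, sigma = 0.002, 0.01

samples = np.array([0.001, -0.004, 0.08, -0.12, 0.005])

# One-dimensional Mahalanobis distance (z-score), eq. (5).
r = np.abs(samples - mu) / sigma

# By eq. (4), noise samples lie within 3 sigma of the mean 99.7% of the
# time, so r > 3 marks a sample as very unlikely to be noise (voiced).
voiced = r > 3
print(r.round(2))   # [ 0.1   0.6   7.8  12.2   0.3]
print(voiced)       # [False False  True  True False]
```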

3. Method

3.1 The Algorithm

The algorithm described below is divided into two parts. The first part assigns a label to each sample using the statistical properties of the background noise, while the second part smooths the labeling using physiological aspects of the speech production process. The algorithm makes two passes over the speech samples. In Pass I (Steps 1 to 3) we use the statistical properties of the background noise to mark each sample as voiced or silence/unvoiced. In Pass II (Steps 4 and 5) we use physiological aspects of speech production to smooth the labels and reduce the probabilistic errors of the statistical marking of Pass I.

Step 1: Calculate the mean and standard deviation of the first 1600 samples of the given utterance. If $\mu$ and $\sigma$ are the mean and the standard deviation respectively, then

$$\mu = \frac{1}{1600} \sum_{i=1}^{1600} x(i) \qquad (7)$$

$$\sigma^2 = \frac{1}{1600} \sum_{i=1}^{1600} \left(x(i) - \mu\right)^2 \qquad (8)$$

Note that the background noise is characterized by this $\mu$ and $\sigma$.

Step 2: Go from the first sample to the last sample of the speech recording. For each sample, check whether the one-dimensional Mahalanobis distance exceeds 3, i.e. whether

$$\frac{|x-\mu|}{\sigma} > 3 \qquad (9)$$

If it does, the sample is treated as voiced; otherwise it is silence/unvoiced. By equation (4), this threshold rejects up to 99.7% of the Gaussian-distributed noise samples, so that only voiced samples are accepted.

Step 3: Mark each voiced sample as 1 and each silence/unvoiced sample as 0, and divide the whole speech signal into 10 ms non-overlapping windows, so that the complete speech is represented by only zeros and ones.

Step 4: Suppose a window contains M zeros and N ones. If M ≥ N, convert each 1 in the window to 0; otherwise convert each 0 to 1. This rule reflects the fact that the speech production system, consisting of the vocal cords, tongue, vocal tract, etc., cannot change its state abruptly within a time window as short as 10 ms.

Step 5: Collect the samples labeled 1 from the windowed array and dump them into a new array. This retrieves the voiced part of the original speech signal.

The algorithm is illustrated in the flow chart of Fig. 3.

Fig. 3. Flow chart of the algorithm.

Note that, after the calculation of $\mu$ and $\sigma$, the proposed method requires one division and one condition check per sample in Pass I, and one condition check per 80 samples (10 ms) in Pass II. In the ZCR method, the sign of each sample is checked and the number of sign reversals in each window (80 samples, 10 ms) is counted; the count is then checked against a range to classify the window as voiced. In the STE method, the energy of each sample is calculated and summed over each 80-sample (10 ms) window, and a condition check on this sum classifies the window as voiced or not. The proposed method is therefore computationally comparable to the conventional STE- and ZCR-based silence/unvoiced detection methods and can be used for real-time analysis. As the results section shows, however, the proposed method is superior in silence/voiced classification performance. Note also that in the STE and ZCR methods the threshold is found after a few trials or on an ad hoc basis, whereas the proposed method defines the threshold uniquely at the first instance.

3.2 Percentage of Correctness

The percentage of correctness of the extraction of voiced samples from a speech signal is defined as

$$\%\ \text{of correctness} = 100 - \frac{|N_{manual} - N_{algorithm}|}{N_{manual}} \times 100 \qquad (11)$$

where $N_{manual}$ is the number of voiced samples in the manually labeled speech and $N_{algorithm}$ is the number of voiced samples found by the specified algorithm.
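A minimal end-to-end sketch of the two-pass procedure described above (an illustrative reimplementation, not the authors' code) is given below. It assumes a 1-D signal sampled at 8 kHz, so that the first 1600 samples cover roughly 200 ms of leading background noise and 80 samples form one 10 ms window:

```python
import numpy as np

def remove_silence(x, noise_len=1600, win=80):
    """Two-pass silence/unvoiced removal following Section 3.1.

    x         : 1-D array of speech samples (8 kHz assumed)
    noise_len : leading samples assumed to be background noise (~200 ms)
    win       : smoothing window length in samples (80 samples = 10 ms)
    """
    x = np.asarray(x, dtype=float)

    # Step 1: noise statistics from the leading samples, eqs. (7)-(8).
    mu = np.mean(x[:noise_len])
    sigma = np.std(x[:noise_len])

    # Steps 2-3: per-sample Mahalanobis distance test, eq. (9);
    # voiced samples are marked 1, silence/unvoiced samples 0.
    labels = (np.abs(x - mu) / sigma > 3).astype(int)

    # Step 4: majority smoothing over non-overlapping 10 ms windows
    # (a trailing partial window, if any, is left unsmoothed).
    for start in range(0, len(labels) - win + 1, win):
        w = labels[start:start + win]
        w[:] = 1 if np.sum(w) > win // 2 else 0

    # Step 5: keep only the samples labeled voiced.
    return x[labels == 1], labels

# Percentage of correctness, eq. (11), against a manual labeling.
def percent_correctness(n_manual, n_algorithm):
    return 100 - abs(n_manual - n_algorithm) / n_manual * 100
```

Because Pass I is a single linear scan and Pass II touches each window once, the sketch mirrors the complexity argument above: constant work per sample in Pass I and one decision per 80 samples in Pass II.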

4. Results

Two experiments are conducted. In the first, a combination lock number ("26-81-57-29-94-52-35-79-89") from the YOHO database is taken; in the second, a running text read from a paragraph for about 20 seconds is used as the speech sample. The second speech sample was recorded with a fan, an air conditioner, and computers running. For both utterances, three algorithms are applied: (1) STE, (2) ZCR together with STE (because ZCR alone showed poor performance), and (3) the proposed method; the output waveforms are shown in the figures below. Figures 4 and 8 show the original speech samples for the two utterances. Figures 5, 6, and 7 show the results for the combination lock number, and Figures 9, 10, and 11 the results for the running text, for the STE, ZCR-STE, and proposed methods respectively.

Fig. 4. Original speech signal for combination lock number.
Fig. 5. Output of STE method for combination lock number.
Fig. 6. Output of STE-ZCR method for combination lock number.
Fig. 7. Output of Proposed Method for combination lock number.
Fig. 8. Original speech signal for running text.
Fig. 9. Output of STE method for running text.
Fig. 10. Output of STE-ZCR method for running text.
Fig. 11. Output of Proposed Method for running text.

Table 1 summarizes the percentage of correctness in detection for all three algorithms on both phrases. Note that all the algorithms perform better on the YOHO data (combination lock number), which is relatively noise-free, than on the running text collected in a noisy environment in the second experiment. The proposed method performs better than the conventional ZCR and STE methods in both experiments.

Table 1: Performance index of the algorithms using the percentage-of-correctness criterion

Phrase                     STE        ZCR-STE    Proposed Method
Combination lock number    77.9531%   70.3720%   83.5565%
Running text               50.8391%   50.1231%   59.7181%
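As a hedged usage sketch (the file name and the manual voiced-sample count below are hypothetical, since the paper's recordings are not distributed), figures like those in Table 1 would be produced from the functions sketched in Section 3 as follows:

```python
import numpy as np
from scipy.io import wavfile

# Hypothetical input file; any 8 kHz mono recording would do.
rate, speech = wavfile.read("combination_lock.wav")

voiced, labels = remove_silence(speech)

# n_manual comes from a hand-labeled reference for the same utterance;
# the value below is a placeholder, not a figure from the paper.
n_manual = 52000
n_algorithm = int(np.sum(labels))
print(percent_correctness(n_manual, n_algorithm))
```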

5. Conclusion

A new silence removal and endpoint detection technique for speech/speaker recognition has been presented. The method uses the statistical properties of the background noise as well as physiological aspects of the speech production process. It assumes the noise to be white Gaussian; for other types of noise [12], a similar approach of characterizing the noise through a probabilistic model can be used. The threshold used in the method is uniquely specified and requires no trial and error or ad hoc tuning. The method is shown to be computationally efficient enough for real-time applications, and it performs better than the conventional methods on speech samples collected from both noisy and noise-free environments.

6. Acknowledgement

This work was partly supported by the Indian Space Research Organization (ISRO), Government of India.

References

[1] J. P. Campbell, Jr., "Speaker Recognition: A Tutorial," Proceedings of the IEEE, Vol. 85, No. 9, pp. 1437-1462, Sept. 1997.
[2] Koji Kitayama, Masataka Goto, Katunobu Itou, and Tetsunori Kobayashi, "Speech Starter: Noise-Robust Endpoint Detection by Using Filled Pauses," in Proc. Eurospeech 2003, Geneva, pp. 1237-1240.
[3] S. E. Bou-Ghazale and K. Assaleh, "A robust endpoint detection of speech for noisy environments with application to automatic speech recognition," in Proc. ICASSP 2002, Vol. 4, 2002, pp. 3808-3811.
[4] A. Martin, D. Charlet, and L. Mauuary, "Robust speech/non-speech detection using LDA applied to MFCC," in Proc. ICASSP 2001, Vol. 1, 2001, pp. 237-240.
[5] K. Ishizaka and J. L. Flanagan, "Synthesis of Voiced Sounds from a Two-Mass Model of the Vocal Cords," Bell System Technical Journal, Vol. 50, No. 6, pp. 1233-1268, July-Aug. 1972.
[6] B. Atal and L. Rabiner, "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, No. 3, pp. 201-212, June 1976.
[7] D. G. Childers, M. Hand, and J. M. Larar, "Silent and Voiced/Unvoiced/Mixed Excitation (Four-Way) Classification of Speech," IEEE Transactions on ASSP, Vol. 37, No. 11, pp. 1771-1774, Nov. 1989.
[8] Mark Greenwood and Andrew Kinghorn, "SUVing: Automatic Silence/Unvoiced/Voiced Classification of Speech," presented at the University of Sheffield.
[9] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, John Wiley & Sons, Inc., 2001.
[10] V. Sarma and D. Venugopal, "Studies on pattern recognition approach to voiced-unvoiced-silence classification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '78), Vol. 3, Apr. 1978, pp. 1-4.
[11] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, 1st ed., Chapter 4, Pearson Education / Prentice-Hall.
[12] http://cslu.ece.ogi.edu/nsel/data/spear_technical.html
[13] J. L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed., Springer-Verlag, New York, 1972.
[14] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, 1st Indian reprint, Pearson Education.