Combining Finite State Machines and LDA for Voice Activity Detection


Elias Rentzeperis, Christos Boukis, Aristodemos Pnevmatikakis, and Lazaros C. Polymenakos

Athens Information Technology, 19.5 Km Markopoulo Ave., Peania/Athens 192, Greece
{eren,cbou,apne,lcp}@ait.edu.gr

Abstract. A robust speech activity detection system is presented in this paper. The proposed approach combines the well-known linear discriminant analysis with a finite state machine in order to identify speech patterns within a recorded audio signal. The derived method is compared with existing ones to demonstrate its superiority, especially on noisy audio signals obtained with far-field microphones.

1 Introduction

Voice activity detection (VAD) is a fundamental component of several modern speech processing systems, such as automatic speech recognition (ASR), voice commanding and teleconferencing. Providing such systems with accurate information about the presence of speech within an audio signal can reduce their computational and energy requirements and improve the performance of the overlying system.

Most VAD systems monitor a quantity and compare it to a threshold in order to decide whether the observed signal is speech or not [1]. This quantity is usually the energy of the observed signal, which has shown remarkable performance with close-talking (CT) microphones. The threshold can be chosen either heuristically or adaptively [2], so as to cope with non-stationary environments. Another approach is to use classification techniques, like the well-documented linear discriminant analysis [3], in order to distinguish speech from non-speech patterns. These techniques give noticeable results for both CT and far-field (FF) microphones. The same holds for VAD systems that rely on Hidden Markov Models (HMMs). The use of finite state machines (FSMs) in VAD has been proposed as well [4].
These models impose lower bounds on the duration of silence and speech intervals. Hence a more accurate separation is performed, since segments of very small duration characterised as speech within a silent interval are neglected, and vice versa. In this paper we propose the use of a five-state automaton, as presented in [4, 5], which uses LDA applied to Mel Frequency Cepstral Coefficients (MFCC) as the primary criterion for transitions between states, contrary to the approaches of [4, 5], which use the energy instead.
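The duration bounds described above can be sketched as a simple post-processing step. The following is an illustrative two-pass simplification, not the authors' implementation or the five-state automaton itself (which is detailed in Section 2.3); the frame thresholds used here are assumptions chosen for the example.

```python
import numpy as np

def _runs(mask):
    """Yield (start, length) for each run of True values in a boolean array."""
    mask = np.asarray(mask, bool)
    edges = np.flatnonzero(np.diff(np.r_[0, mask.astype(int), 0]))
    for s, e in zip(edges[::2], edges[1::2]):
        yield int(s), int(e - s)

def apply_duration_bounds(raw, min_speech=5, max_gap=16):
    """Enforce lower bounds on speech/silence durations, frame by frame.

    raw: boolean array of per-frame decisions (True = speech).
    Silence gaps shorter than `max_gap` frames inside speech are bridged
    (cf. plosives), and speech bursts shorter than `min_speech` frames are
    discarded. This is a simplified two-pass sketch, not a literal
    transcription of the five-state machine; the thresholds are illustrative.
    """
    out = np.asarray(raw, bool).copy()
    for start, length in _runs(~out):            # bridge short silence gaps
        if 0 < start and start + length < len(out) and length < max_gap:
            out[start:start + length] = True
    for start, length in _runs(out):             # drop short speech bursts
        if length < min_speech:
            out[start:start + length] = False
    return out

# A 3-frame blip is removed; a 4-frame gap inside speech is bridged.
raw = np.array([False]*20 + [True]*3 + [False]*20 + [True]*30
               + [False]*4 + [True]*10 + [False]*20)
out = apply_duration_bounds(raw)
print(not out[:43].any(), out[43:87].all(), not out[87:].any())  # True True True
```

After the two passes, the only surviving speech region is the long burst with its short internal gap filled in, which is exactly the behaviour the duration bounds are meant to produce.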

Our approach was found to have improved performance. The energy was neglected entirely, since it can vary depending on the relative position of the microphone and the speaker.

This paper is organised as follows: Section 2 provides the basic background and summarises the previous VAD methods that employ FSMs. In Section 3 the proposed system is presented. The performance results of the introduced approach are provided in Section 4 and compared to those of other methods. Finally, Section 5 concludes the paper.

2 Background

2.1 Mel Frequency Cepstral Coefficients

Mel Frequency Cepstral Coefficients (MFCC) are the dominant features used in speech applications. They are obtained by taking the inverse Fourier transform of the log spectrum after it has been warped according to a nonlinear scale that matches the properties of human hearing, the Mel scale. Our experiments showed that adding the first and second derivatives of the MFCC, as well as the energy of each preprocessed frame, enhances the performance of the algorithm.

Fig. 1. Finite State Machine

2.2 Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a method that efficiently separates data into classes [3]. In the case of VAD there are two classes to be discriminated, speech and non-speech. The optimal discriminating direction w is derived by maximising the criterion function

    J(w) = (w^T S_B w) / (w^T S_W w)    (1)

where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix. S_B is a measure of the separation of the means of the clusters, while S_W is a measure of the spread of the clusters. The maximisation problem reduces to a generalised eigenvalue problem,

    S_W^{-1} S_B w = λ w    (2)

The eigenvector that corresponds to the greatest eigenvalue is chosen as the projection vector for the test vectors.

2.3 Finite State Model

In [4] the use of a five-state automaton was proposed for VAD. Its five states were silence, speech presumption, speech, plosive or silence, and possible speech continuation. The transitions between states were controlled by comparing the derived short- and long-term energy estimates with an energy threshold. From Fig. 1 and Tab. 1, where the introduced FSM and the associated conditions and actions are presented, it is observed that a segment is characterised as speech if its duration is longer than 64 ms and its energy is above the employed threshold. Similarly, a silent interval shorter than 24 ms is classified as a plosive, and thus as speech.

Table 1. Conditions and actions of the energy-controlled five-state automaton for VAD

    Conditions
    C1  Energy < Energy Threshold
    C2  Speech Duration (SD) >= 64 ms
    C3  Silence Duration (SiD) >= 24 ms
    Actions
    A1  SiD = SiD + 1
    A2  SD = 1
    A3  SiD = SiD + SD
    A4  SD = SD + 1
    A5  SiD = 1
    A6  SiD = SD = 0

In order to improve the performance of this system, the introduction of an extra criterion was proposed in [5]. That system characterised as speech only the segments that satisfied not only the energy but also the LDA criterion. It does not clarify, though, what happens when the results of the energy and the LDA criteria do not match. The LDA was trained on two learning databases in which the speech and non-speech intervals had been manually segmented; the LDA threshold was derived from these databases as well.
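For the two-class case of equations (1) and (2), the generalised eigenproblem has a well-known closed-form solution, w ∝ S_W^{-1}(m1 − m2). The sketch below illustrates this on synthetic data; it is not the paper's trained system, and the two Gaussian clouds are assumptions for the example.

```python
import numpy as np

def lda_direction(X_speech, X_nonspeech):
    """Two-class LDA: maximise J(w) = (w^T S_B w) / (w^T S_W w).

    For two classes the generalised eigenproblem S_W^{-1} S_B w = lambda w
    reduces to w ∝ S_W^{-1} (m1 - m2), which is used directly here.
    """
    m1 = X_speech.mean(axis=0)
    m2 = X_nonspeech.mean(axis=0)
    # Within-class scatter, up to a scale factor that does not change the
    # direction: sum of the per-class covariance matrices.
    Sw = np.cov(X_speech, rowvar=False) + np.cov(X_nonspeech, rowvar=False)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Synthetic 2-D example: two Gaussian clouds separated along the first axis.
rng = np.random.default_rng(1)
speech = rng.standard_normal((500, 2)) + [3.0, 0.0]
nonspeech = rng.standard_normal((500, 2))
w = lda_direction(speech, nonspeech)
print(abs(w[0]) > abs(w[1]))  # True: the direction follows the separating axis
```

Projecting test frames onto w and comparing to a threshold then gives the scalar decision criterion used throughout the paper.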

Fig. 2. Histograms of the energy and the LDA-projected values of the speech/non-speech segments of the training data.

3 Proposed System

Embarking upon the observation that LDA provides more accurate discrimination between the speech and non-speech classes than simply comparing an energy estimate with a threshold, and adopting the FSM of [4], a robust VAD system was developed. The choice to use the LDA projection instead of the energy is justified by Fig. 2, which illustrates that the speech and silent segments have similar energy values but different LDA projections of their MFCC.

The proposed architecture used the five-state automaton of Fig. 1, but the primary criterion that controlled the transitions between states was derived by comparing the linear combination of the MFCC provided by the LDA with a threshold. The LDA classifier was trained with manually segmented speech/non-speech signals; the threshold was obtained from the provided training data as well. Moreover, median filtering was applied to the output of the FSM in order to remove spiky decision regions and obtain improved error rates.

The audio signal was processed in frames. For each frame the corresponding MFCC were computed, and subsequently their linear combination, derived by LDA, was compared to the LDA threshold to decide whether the frame is speech or not. Notice that the duration bounds and the time counters (SD, SiD)

are expressed in frames instead of milliseconds. The proposed approach is summarised in Tab. 2.

Table 2. Conditions and actions of the proposed LDA-controlled five-state automaton for VAD

    Conditions
    C1  Linear MFCC Combination < LDA Threshold
    C2  Speech Duration (SD) >= 5 frames
    C3  Silence Duration (SiD) >= 16 frames
    Actions
    A1  SiD = SiD + 1
    A2  SD = 1
    A3  SiD = SiD + SD
    A4  SD = SD + 1
    A5  SiD = 1
    A6  SiD = SD = 0

4 Experiments

To evaluate its performance, the introduced VAD system was compared to:

- the approach of [4], which uses the same five-state automaton, but whose state transitions are controlled by comparing energy estimates with an energy threshold;
- the stand-alone LDA applied to MFCCs for the discrimination of the speech from the non-speech class;
- the energy-based adaptive algorithm presented in [1], which relies on an estimate of the instantaneous SNR for the distinction of speech and non-speech segments.

The VAD systems were evaluated on a database collected by the University of Karlsruhe (ISL-UKA). The database comprises seven seminars. Each seminar contains four segments of audio data that are approximately five minutes long. The audio segments are sampled at a rate of 16 kHz. All the data were obtained from FF microphones, resulting in comparable energy values of speech and non-speech segments (Fig. 2). Segments three and four were used for training of the algorithm, while segments one and two were used for testing. Manual human transcriptions were provided for the separation of the training segments and the evaluation of the testing recordings.

The following metrics were used for the evaluation of the algorithms:

- Mismatch Rate (MR): the ratio of incorrect decisions over the total time of the tested segment.

- Speech Detection Error Rate (SDER): the ratio of incorrect decisions at speech segments over the total time of speech segments.
- Non-Speech Detection Error Rate (NDER): the ratio of incorrect decisions at non-speech segments over the total time of non-speech segments.
- Average Detection Error Rate (ADER): the average of SDER and NDER.
- Working Point Epsilon (WPeps): an indicator of the balance between SDER and NDER. It is the absolute value of the difference between SDER and NDER over their sum.

Considering that SDER and NDER should be relatively balanced in order to draw any conclusions about the value of the algorithms, we required WPeps to be between 0 and 0.1 for the results to be valid. Under this constraint, the parameter that we seek to optimise is the ADER.

Each frame consisted of 1024 samples, and the overlap between neighbouring frames was 75%. The LDA method was trained with manually segmented speech and non-speech data. The SD threshold was 5 frames and the SiD threshold 16 frames, which correspond to 128 ms and 304 ms respectively, since the sampling rate was 16 kHz. The window in the median filtering step was 29 frames long.

The performance of the compared VAD systems is presented in Tab. 3.

Table 3. Comparison of the proposed VAD with existing approaches

    Method                        LDA Thr.  Energy Thr.  MR      SDER    NDER    ADER    WPeps
    LDA                           4.9       -            10.9%   10.4%   8.62%   9.51%   0.09
    Adaptive Energy Thresholding  -         -            18.1%   18.4%   15.6%   17.0%   0.08
    FSM+LDA                       4.9       -            9.94%   10.19%  8.65%   9.42%   0.08
    FSM+Energy                    -         0.43         17.28%  17.69%  14.63%  16.16%  0.08

From this table it is observed that the proposed method presents improved performance compared to the other approaches.

5 Conclusions

A robust voice activity detection system has been proposed in this paper, which combines a finite state machine with linear discriminant analysis in order to perform accurate segmentation of audio signals into speech/non-speech segments.
This approach was found to outperform both the stand-alone LDA and the existing approaches that combine FSMs with the energy criterion for VAD. Its performance was evaluated on noisy far-field microphone recordings.

Acknowledgments: This work is sponsored by the European Union under the integrated project CHIL, contract number 5699.
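For reference, the frame-level metrics defined in Section 4 can be computed directly from a decision sequence and a ground-truth reference. The sketch below is illustrative, with toy decision arrays; it is not the evaluation code used in the paper.

```python
import numpy as np

def vad_metrics(decision, reference):
    """Frame-level VAD scores: MR, SDER, NDER, ADER and WPeps.

    decision, reference: equal-length boolean arrays (True = speech).
    """
    decision = np.asarray(decision, bool)
    reference = np.asarray(reference, bool)
    err = decision != reference
    mr = err.mean()                # errors over all frames
    sder = err[reference].mean()   # errors within speech frames only
    nder = err[~reference].mean()  # errors within non-speech frames only
    ader = (sder + nder) / 2
    wpeps = abs(sder - nder) / (sder + nder)
    return mr, sder, nder, ader, wpeps

# Toy check: 10 speech frames with 1 error, 10 non-speech frames with 2 errors.
ref = np.array([True] * 10 + [False] * 10)
dec = ref.copy()
dec[0] = False            # one missed speech frame
dec[10] = dec[11] = True  # two false alarms
mr, sder, nder, ader, wpeps = vad_metrics(dec, ref)
print(float(mr), float(sder), float(nder))  # 0.15 0.1 0.2
```

Here ADER = (0.1 + 0.2)/2 = 0.15 and WPeps = |0.1 − 0.2|/(0.1 + 0.2) = 1/3, which would fail the WPeps < 0.1 validity constraint used in the experiments.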

References

1. D. A. Reynolds, R. C. Rose and M. J. T. Smith, "PC-Based TMS320C30 Implementation of the Gaussian Mixture Model Text-Independent Speaker Recognition System," in International Conference on Signal Processing Applications and Technology, Hyatt Regency, Cambridge, Massachusetts, pp. 967-973, November 1992.
2. S. Gökhun Tanyer and Hamza Özer, "Voice Activity Detection in Nonstationary Noise," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 479-482, July 2000.
3. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, 2001.
4. L. Mauuary and J. Monné, "Speech/Non-speech Detection for Voice Response Systems," in Eurospeech 93, Berlin, Germany, 1993, pp. 1097-1100.
5. A. Martin, D. Charlet and L. Mauuary, "Robust Speech/Non-Speech Detection Using LDA Applied to MFCC," in ICASSP, 2001.