A Hybrid System for Audio Segmentation and Speech Endpoint Detection of Broadcast News

Maria Markaki 1, Alexey Karpov 2, Elias Apostolopoulos 1, Maria Astrinaki 1, Yannis Stylianou 1, Andrey Ronzhin 2

1 Multimedia Informatics Lab, Computer Science Department, University of Crete (UoC), Greece
{mmarkaki,ilapost,astrin,yannis}@csd.uoc.gr
2 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
{karpov,ronzhin}@iias.spb.su

Abstract

A hybrid speech/non-speech detector is proposed for the pre-processing of broadcast news. During the first stage, speech/non-speech classification of uniform overlapping segments is performed. The accuracy of boundary detection is determined by the degree of overlap of the audio segments, 250 ms in our case. Extracted speech segments are further processed on a frame basis using the entropy of the signal spectrum; speech endpoint detection is accomplished with an accuracy of 10 ms. The combination of the two methods in one speech/non-speech detection system exhibits the robustness and accuracy required by subsequent processing stages such as broadcast speech transcription and speaker diarization.

1. Introduction

Automatic audio classification and segmentation is a research area of great interest in multimedia processing for automatic labeling and extraction of semantic information. In the case of broadcast audio recordings, pre-processing for speech/non-speech segmentation greatly improves subsequent tasks such as speaker change detection and clustering as well as speech transcription. For speaker diarization systems the elimination of non-speech frames is the more critical requirement, whereas for speech transcription accurate detection of speech is equally important. In broadcast news, silence is usually reduced to a minimum; what mostly appears instead is other noise and music. Moreover, methods that work well for speech/music discrimination usually do not handle efficiently other non-speech classes commonly present in broadcast data, such as environmental noises, moving cars, claps, crowd babble, etc.

Speech/non-speech segmentation can be formulated as a pattern recognition problem in which the optimal features and the classifier built on them are application dependent. Many approaches in the literature have examined various features and classifiers; MFCCs and SVMs have been extensively evaluated and appear to be among the most promising [1,2]. Furthermore, it has been shown that for successful audio segmentation and classification the classification unit has to be a segment, i.e. a sequence of frames rather than a single frame [1,2].

In this work we present a hybrid approach which combines a segment-based classifier with a frame-based speech endpoint detector [3]. We use uniformly spaced overlapping audio segments of 500 ms length during the first classification stage. The mean and standard deviation of the MFCCs are used to parameterize every segment. We have also evaluated two different methods of spectrogram computation before MFCC extraction. Classification is performed using SVMs [4]. During the next stage, only segments characterized as speech are processed on a frame basis (10 ms). Spectral entropy is the feature we use for the detection of silent frames within speech segments.

The organization of the paper is as follows: in section 2 we review the segment-based speech/non-speech classification algorithm and the speech endpoint detection method. In section 3 we describe the experimental setup, the database and the experimental results. Finally, in section 4 we present our conclusions.

2. Description of the method

2.1. Segment parameterization and classification

Mel frequency cepstral coefficients (MFCCs) are the most commonly used features in speech and speaker recognition systems, and they have also been successfully applied in audio indexing tasks [1,2]. Here we extract 13th-order MFCCs from audio frames of 25 ms with a frame rate of 10 ms, i.e. every 10 ms the signal is multiplied by a Hamming window of 25 ms duration. We perform critical band analysis of the power spectrum with a set of triangular band-pass filters, as usual. For comparison purposes, we also derive an auditory-like spectrogram by applying equal loudness pre-emphasis and cube root intensity-loudness compression according to Hermansky [5]. In each case, Mel scale cepstral coefficients are computed every 10 ms from the filterbank outputs.

We define each segment as a sequence of 50 frames of 10 ms each. We estimate the mean and standard deviation of the MFCCs over the 50 frames, resulting in a 26-element feature vector per segment. We extract evenly spaced overlapping segments every 25 frames (250 ms overlap) for the test dataset, whereas for the training dataset segments are extracted every 5 frames (to maximize the amount of training data).

Support vector machines (SVMs) are used for the classification of segments. We have used SVMlight [4] with a radial basis function kernel; all other parameters have been set to their default values. We also define a hierarchy of classes similar to [2] for resolving conflicts that arise due to the overlap of segments: frames are classified as non-speech if they are part of any segment that was classified as non-speech; otherwise, they are classified as speech.
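For illustration, a minimal Python sketch of this first stage is given below. It uses the librosa and scikit-learn libraries rather than the SVMlight implementation employed here, so the MFCC computation details and classifier parameters are approximations of the description above, not the exact configuration; the function and variable names are ours.

```python
# Sketch of the segment-based speech/non-speech classifier of Sec. 2.1
# (assumptions: librosa for MFCCs, scikit-learn SVC in place of SVMlight).
import numpy as np
import librosa
from sklearn.svm import SVC

SR = 16000               # 16 kHz audio, as in the database used here
WIN, HOP = 0.025, 0.010  # 25 ms Hamming window, 10 ms frame rate
SEG_FRAMES = 50          # 50 frames of 10 ms -> 500 ms segments

def mfcc_frames(y, sr=SR):
    """13th-order MFCCs, one column per 10 ms frame."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(WIN * sr), win_length=int(WIN * sr),
                                hop_length=int(HOP * sr), window="hamming")

def segment_features(mfcc, step):
    """Mean and std of MFCCs over 50-frame segments -> 26-dim vectors.
    step = 5 frames for training data, 25 frames (250 ms overlap) for testing."""
    feats = []
    for start in range(0, mfcc.shape[1] - SEG_FRAMES + 1, step):
        seg = mfcc[:, start:start + SEG_FRAMES]
        feats.append(np.concatenate([seg.mean(axis=1), seg.std(axis=1)]))
    return np.array(feats)

def frame_labels(seg_labels, step, n_frames):
    """Class hierarchy for overlap conflicts: a frame is non-speech (0) if any
    segment covering it was classified as non-speech, otherwise speech (1)."""
    labels = np.ones(n_frames, dtype=int)
    for i, lab in enumerate(seg_labels):
        if lab == 0:
            labels[i * step:i * step + SEG_FRAMES] = 0
    return labels

# Usage sketch: train_y / test_y are waveforms, train_lab are 0/1 segment labels.
# clf = SVC(kernel="rbf").fit(segment_features(mfcc_frames(train_y), step=5), train_lab)
# seg_pred = clf.predict(segment_features(mfcc_frames(test_y), step=25))
# frame_pred = frame_labels(seg_pred, step=25, n_frames=mfcc_frames(test_y).shape[1])
```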

2.2. Spectral entropy based speech detector

The speech detection method is based on the calculation of the information entropy of the signal spectrum as a measure of uncertainty or disorder in a given distribution [6]. The difference between the entropy of speech segments and the entropy of background noise is used for speech endpoint detection. Such a criterion is less sensitive to variations of the signal amplitude than energy-based methods. The method is a modification of the speech detection approach proposed by J. L. Shen [7] and introduces additional levels into the analysis of the speech signal (Figure 1).

Fig. 1. The algorithm for speech detection based on analysis of the entropy of the signal spectrum: fast Fourier transform, speech spectrum normalization, calculation of the spectral entropy, median smoothing, and logical temporal processing yielding noise/speech/noise labels.

The audio signal is divided into short segments of 11.6 ms duration with 25% overlap. The short-time signal spectrum is computed using the FFT, and the spectrum is normalized over all frequency components, giving the probability density function p_i. Acceptable values of the probability density function are upper and lower bounded. This restriction allows us to exclude noises concentrated in a narrow band as well as noises distributed approximately equally among the frequency components (for instance, white noise). Thus:

p_i = 0, if p_i < δ_2 or p_i > δ_1    (1)

where δ_1 and δ_2 are the upper and lower bounds of the probability density, respectively. They have been experimentally determined to be δ_1 = 0.3 and δ_2 = 0.01.

At the next stage the spectral entropy h is estimated, and median smoothing in a window of 5-9 segments is applied. Finally, a logical temporal processing of h (Figure 2) takes into account the possible durations of speech and non-speech fragments.

Fig. 2. Logical temporal processing of the spectral entropy function h against the threshold r over time, yielding alternating speech and non-speech regions.

An adaptive threshold r for the detection of speech endpoints is calculated as follows:

r = (max(h) - min(h)) / 2 + min(h) * μ    (2)

where μ is a coefficient chosen empirically depending on the recording conditions. Employing the adaptive threshold we obtain alternating speech and non-speech regions from the function h and apply two criteria to process them: (1) R, the minimal duration of a speech fragment in a phrase; (2) S, the maximal duration of a non-speech fragment in a phrase. These criteria values were determined experimentally, taking into account that a human cannot produce very short speech fragments and that there are always some pauses in speech (for instance, before plosive consonants). So if the number of consecutive speech segments is greater than R and the non-speech interval between them is shorter than S, then all these segments are considered to belong to the speech class. Such logical temporal processing is applied iteratively to the whole spectral entropy function, automatically segmenting it into speech and non-speech portions.
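A minimal Python sketch of this detector is given below. The frame length, overlap, the bounds δ_1 = 0.3 and δ_2 = 0.01, and the threshold of Eq. (2) follow the description above; the values of μ, R and S, the polarity of the threshold comparison, and the single-pass temporal processing are illustrative assumptions, since these are tuned empirically and applied iteratively in the actual system.

```python
# Sketch of the spectral-entropy endpoint detector of Sec. 2.2.
# mu, R, S are illustrative placeholders; whether speech lies above or below the
# threshold r is also an assumption (the paper's Fig. 2 defines the polarity).
import numpy as np

def spectral_entropy(frame, delta1=0.3, delta2=0.01):
    """Entropy of the bounded, normalized magnitude spectrum of one frame (Eq. 1)."""
    spectrum = np.abs(np.fft.rfft(frame))
    p = spectrum / (spectrum.sum() + 1e-12)             # normalize over bins
    p = np.where((p < delta2) | (p > delta1), 0.0, p)   # bound the density (Eq. 1)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def entropy_track(signal, sr=16000, frame_dur=0.0116, overlap=0.25, smooth=7):
    """Per-frame spectral entropy, median-smoothed over a 5-9 frame window."""
    frame_len = int(frame_dur * sr)
    hop = int(frame_len * (1.0 - overlap))
    h = np.array([spectral_entropy(signal[i:i + frame_len])
                  for i in range(0, len(signal) - frame_len, hop)])
    half = smooth // 2
    return np.array([np.median(h[max(0, i - half):i + half + 1])
                     for i in range(len(h))]), hop

def detect_speech(h, mu=1.0, R=10, S=5):
    """Adaptive threshold (Eq. 2) plus temporal processing with the minimal
    speech duration (R) and maximal pause duration (S) criteria, in frames."""
    r = (h.max() - h.min()) / 2.0 + h.min() * mu
    speech = h > r                                # raw frame decisions (assumed polarity)
    gap = 0                                       # bridge non-speech gaps shorter than S
    for i, s in enumerate(speech):
        if s:
            if 0 < gap <= S:
                speech[i - gap:i] = True
            gap = 0
        else:
            gap += 1
    run_start = None                              # discard speech runs not longer than R
    for i in range(len(speech) + 1):
        if i < len(speech) and speech[i]:
            run_start = i if run_start is None else run_start
        elif run_start is not None:
            if i - run_start <= R:
                speech[run_start:i] = False
            run_start = None
    return speech
```

In a usage sketch, h, hop = entropy_track(segment) followed by detect_speech(h) gives a per-frame speech/silence mask whose boundaries can then be rounded to 10 ms.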

3. Experiments and Results

We tested the algorithms described in section 2 on audio data collected from Greek TV programs (TV++) and music CDs. The speech data consist of broadcast news and TV shows recorded under different conditions, such as in studios or outdoors; some of the speech data have also been transmitted over telephone channels. The non-speech data consist of music (25%), outdoor noise (moving cars, crowd noise, etc.), claps, and very noisy unintelligible speech due to many speakers talking simultaneously (speech babble). The music content consists of the audio signals at the beginning and end of TV shows as well as songs from music CDs. All audio data are mono channel, 16 bits per sample, with a 16 kHz sampling frequency. The database has been manually segmented and labeled at the Computer Science Department, UoC. The speech signals have been partitioned into 30 minutes for training and 90 minutes for testing.

3.1. Speech/non-speech classification results

We evaluate system performance using the detection error trade-off (DET) curve [8]. The DET plot presents the trade-off between the false rejection rate (speech miss probability) and the false acceptance rate (false alarm probability). Detection error probabilities are plotted on a nonlinear scale which maps them to their corresponding Gaussian deviates; DET curves are therefore straight lines when the underlying distributions are Gaussian [8]. We also report the minimum value of the detection cost function for each DET curve according to [8].

For the speech/non-speech segment-based classification the target is the speech class, which has prior probability P_target = 50% in our data set. Here the costs of miss and false alarm probabilities are considered equally important (C_miss = C_false = 1), although they actually depend on the task. For speaker and language recognition C_false > C_miss, i.e. we should accurately reject non-speech audio (low false alarm probability) whereas the speech miss probability is less important. For speech transcription, on the other hand, C_false < C_miss, i.e. accurate detection of speech is more important. The minimum value of the detection cost function (DCF) over the DET curve [8] is then:

DCF_opt = min( C_miss * P_miss * P_target + C_false * P_false * (1 - P_target) )    (3)

In the case of the common MFCC features, DCF_opt = 9.54%, corresponding to P_miss,opt = 6.24% and P_false,opt = 12.84%. For the MFCC features extracted after loudness equalization and cube root compression, a remarkable improvement in all aspects is observed: DCF_opt = 4.96%, P_miss,opt = 4.07% and P_false,opt = 5.84%.

Another commonly used measure of accuracy is the equal error rate (EER), which corresponds to the decision threshold θ_EER at which the false rejection rate (P_miss) equals the false acceptance rate (P_false). Since P_miss and P_false are discrete, we set:

θ_EER = argmin_θ | P_miss(θ) - P_false(θ) |    (4)

EER = ( P_miss(θ_EER) + P_false(θ_EER) ) / 2    (5)
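A brief sketch of how DCF_opt and the EER of Eqs. (3)-(5) can be computed from per-segment classifier scores is given below; it assumes a real-valued score per segment (e.g. an SVM decision value, larger meaning more speech-like) that is swept over thresholds, and the names are illustrative.

```python
# Sketch of the DET-curve summary measures of Eqs. (3)-(5): minimum detection
# cost (DCF_opt) and equal error rate (EER). Score convention is an assumption.
import numpy as np

def det_points(scores, labels):
    """P_miss and P_false at every decision threshold (labels: 1 = speech target)."""
    thresholds = np.unique(scores)
    p_miss, p_false = [], []
    for t in thresholds:
        decide_speech = scores >= t
        p_miss.append(np.mean(~decide_speech[labels == 1]))   # missed speech
        p_false.append(np.mean(decide_speech[labels == 0]))   # accepted non-speech
    return np.array(p_miss), np.array(p_false)

def dcf_opt(p_miss, p_false, p_target=0.5, c_miss=1.0, c_false=1.0):
    """Minimum of Eq. (3) over all operating points of the DET curve."""
    return np.min(c_miss * p_miss * p_target + c_false * p_false * (1.0 - p_target))

def eer(p_miss, p_false):
    """Eqs. (4)-(5): average of P_miss and P_false at the closest operating point."""
    i = np.argmin(np.abs(p_miss - p_false))
    return (p_miss[i] + p_false[i]) / 2.0

# With P_miss = 6.24%, P_false = 12.84% at the optimal threshold and
# P_target = 50%, Eq. (3) gives 0.5*0.0624 + 0.5*0.1284 = 0.0954, i.e. the
# 9.54% DCF_opt reported for the baseline MFCC system.
```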

Figure 3: DET curves for speech/non-speech segment-based classification. Mean and variance of the MFCCs are computed over each segment, with (solid line) or without (dashed line) equal loudness pre-emphasis and cube root intensity-loudness compression [5]. The minimal values of the corresponding detection cost functions (DCF) are also shown (circles).

We report in Table 1 the results for the speech/non-speech segment-based classification and present in Figure 3 the corresponding DET curves. Since in this case P_target = 50% and C_miss = C_false = 1, the values of EER and DCF_opt are quite close. The MFCC features extracted after loudness equalization and compression are clearly superior according to the EER as well.

Table 1: Speech/non-speech segment-based classification results

System                       | DCF_opt | P_miss | P_false | EER
MFCCs baseline               | 9.54%   | 6.24%  | 12.84%  | 9.91%
equal loudness + compression | 4.96%   | 4.07%  | 5.84%   | 5.01%

3.2. Speech endpoint detection results

Audio segments classified as speech at the first detection stage are further processed using the entropy-based method for speech endpoint detection with 10 ms accuracy (after rounding). This is a pre-processing step required for subsequent broadcast speech transcription. In this case the total number of silence frames is much lower than the total number of speech frames: the prior probability of the speech class is P_target = 88.96% for our dataset, where speech is the target. If the costs of miss and false alarm probabilities are considered of equal importance, the minimum value of the detection cost function over the DET curve is DCF_opt = 6.47%, corresponding to P_miss = 4.48% and P_false = 22.52%. We report in Table 2 the results for speech/silence classification and present in Figure 4 the corresponding DET curve. We can see that the EER is significantly higher than DCF_opt in this case, since it does not take into account the highly unequal prior probabilities of speech and silence.
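As a check, substituting the reported operating point into Eq. (3) with P_target = 88.96% reproduces the stated cost: DCF_opt = 1 * 0.0448 * 0.8896 + 1 * 0.2252 * (1 - 0.8896) ≈ 0.0399 + 0.0249 ≈ 0.0647, i.e. 6.47%. This also illustrates why the heavily weighted, low speech miss rate keeps DCF_opt well below the 10.83% EER.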

Figure 4: DET curve for speech endpoint detection with 10 ms accuracy, applied to the extracted speech segments. The minimal value of the corresponding detection cost function (DCF) is shown as a circle.

Table 2: Speech/silence classification results based on spectral entropy

DCF_opt | P_miss | P_false | EER
6.47%   | 4.48%  | 22.52%  | 10.83%

4. Conclusions

In this paper we have presented a two-stage speech detection system. During the first stage, segment-based speech/non-speech classification is performed based on MFCC features and support vector machines, with 250 ms accuracy. An improvement is reported when loudness equalization and cube root compression are applied to the power spectrogram after critical band analysis. Extracted speech segments are further processed by an entropy-based method for speech endpoint detection with 10 ms accuracy. The proposed system can successfully address the twofold requirement for robustness and accuracy in the pre-processing stages preceding broadcast speech transcription or speaker diarization.

Acknowledgements

This work has been supported by the General Secretariat of Research and Technology (GGET) in Greece and the Russian Foundation for Basic Research in the framework of project # 07-07-00073-а. The collaborative research was part of the PhD exchange program of the SIMILAR Network of Excellence, project # FP6-507609.

References

1. L. Lu, H.-J. Zhang, Stan Li. Content-based audio classification and segmentation by using support vector machines. Multimedia Systems, 8:482-492, 2003.
2. H. Aronowitz. Segmental modeling for audio segmentation. Proc. ICASSP 2007, Hawaii, USA, 2007.
3. A. Karpov. A robust method for determination of boundaries of speech on the basis of spectral entropy. Artificial Intelligence Journal, Donetsk, Vol. 4, pp. 607-613, 2004.
4. T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
5. H. Hermansky, B. Hanson, H. Wakita. Perceptually based linear predictive analysis of speech. Proc. ICASSP 1985, pp. 509-512, 1985.
6. J. Ajmera, I. McCowan, H. Bourlard. Speech/music segmentation using entropy and dynamism features in an HMM classification framework. Speech Communication, 40:351-363, 2003.
7. J.-L. Shen, J.-W. Hung, L.-S. Lee. Robust entropy-based endpoint detection for speech recognition in noisy environments. Proc. ICSLP 1998, Sydney, Australia, paper 0232, 1998.
8. The NIST Year 2004 Speaker Recognition Evaluation Plan, http://www.nist.gov/speech/tests/spk/2004/