Speaker Change Detection using Support Vector Machines


ISCA Archive (http://www.isca-speech.org/archive)
ITRW on Nonlinear Speech Processing (NOLISP 05), Barcelona, Spain, April 19-22, 2005

V. Kartik, D. Srikrishna Satish and C. Chandra Sekhar
Speech and Vision Laboratory, Department of Computer Science and Engineering
Indian Institute of Technology Madras, Chennai - 600 036, India
Email: {kartik,satish,chandra}@cs.iitm.ernet.in

Abstract. Speaker change detection is important for the automatic segmentation of multispeaker speech data into homogeneous segments, each containing the data of one speaker only. Existing approaches to speaker change detection are based on the dissimilarity of the distributions of the data before and after a candidate change point. In this paper, we propose a classification-based technique for speaker change detection. Patterns extracted from the data around the speaker change points are used as positive examples, and patterns extracted from the data between the speaker change points are used as negative examples. These examples are used to train a support vector machine (SVM) for speaker change detection. The trained SVM scans the continuous speech signal of multispeaker data and hypothesizes the points of speaker change. We consider two methods for extracting the fixed-length patterns given as input to the SVM. In the first method, the spectral feature vectors of a fixed number of frames are concatenated to form a pattern vector. In the second method, the sequence of feature vectors is treated as a trajectory, and the outerproduct matrix of the trajectory matrix is vectorized to form a pattern vector. The performance of the proposed approach, with both pattern extraction methods, is studied on the extended data of the NIST 2003 speaker recognition evaluation database.
1 Introduction

The task of speaker change detection involves determining the points at which there is a speaker turn in multispeaker speech data, as in audio recordings of conversations, broadcast news and movies. Speaker change detection is the first step in speaker-based segmentation of multispeaker speech data into homogeneous segments such that each segment contains the data of one speaker only. Speaker segmentation is important for tasks such as audio indexing [1], speaker tracking [2] and speaker adaptation in the automatic transcription of conversational speech. Speaker change detection should be done without knowledge of the number of speakers or their identities [3]; a speaker change detection system should therefore be speaker independent.

The existing approaches to speaker change detection are based on the dissimilarity in the distributions of data before and after the points of speaker change. Dissimilarity is commonly measured by comparing parametric statistical models of the distributions, using measures such as the Mahalanobis distance, the weighted Euclidean distance [4] or the Bayesian information criterion [1]. In these approaches, the dissimilarity is measured between the data in two adjacent windows of fixed length, and the points at which the dissimilarity exceeds a threshold are hypothesized as speaker change points. We propose an approach in which a classification model is trained to detect the speaker change points; the proposed approach does not use any threshold.

In Section 2, we present the classification-based approach for speaker change detection. In Section 3, we describe the speaker change detection system using support vector machines. In Section 4, we present experimental studies on speaker change detection using the proposed approach.

2 Classification based approach for speaker change detection

Speaker change can be considered as an event in multispeaker speech data. We develop a classification model for the detection of speaker change events in continuous speech. A speaker change event is characterized by the end of speaking by the current speaker and the start of speaking by a different speaker; the speech data around a speaker change point therefore includes the data of two speakers. The pattern extracted from the speech data around a speaker change point is considered a positive example. The speech data between two consecutive speaker change points includes the data of one speaker only, so the patterns extracted from it are considered negative examples. The positive and negative examples can then be used to train a classification model for the detection of speaker change points. The main issue in the classification based approach for speaker change detection is the duration of the speech signal to be considered for pattern extraction.
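The positive/negative labelling described above can be sketched as follows. This is a minimal illustration assuming the change points are given as frame indices into a precomputed matrix of frame feature vectors; the function and variable names are ours, not the paper's:

```python
import numpy as np

def extract_examples(features, change_points, win=20):
    """Label fixed-length windows of frame features as positive
    (centred on a marked speaker change, so they straddle two speakers)
    or negative (taken from well inside a single speaker turn)."""
    half = win // 2
    positives, negatives = [], []
    for t in change_points:
        if half <= t <= len(features) - half:
            # window straddling the change point: data of two speakers
            positives.append(features[t - half:t + half])
    for a, b in zip(change_points[:-1], change_points[1:]):
        mid = (a + b) // 2  # midpoint of one speaker's turn
        if mid - half >= a and mid + half <= b:
            negatives.append(features[mid - half:mid + half])
    return positives, negatives

# toy data: 200 frames of 39-dim features, changes at frames 60 and 140
feats = np.random.randn(200, 39)
pos, neg = extract_examples(feats, [60, 140])
```

Each returned window is a (win, 39) block; flattening it gives the concatenated pattern vector that is fed to the SVM.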
Let t_{i-1}, t_i and t_{i+1} be the (i-1)th, ith and (i+1)th speaker change points in a multispeaker conversation. The duration of the ith speaker turn is given by d_i = t_{i+1} - t_i. The information necessary for identifying t_i as a speaker change point is present in the speech data between t_{i-1} and t_{i+1}, as this segment contains the data of two speakers. As speaker turn durations vary, it is difficult to determine a suitable length for the window of the speech signal to be processed for detecting speaker turns of different durations. A short window may not have enough data to capture the speaker change information of long speaker turns. A very long window may include the data of more than one speaker turn, and therefore may not be suitable for detecting short speaker turns. We study the effect of window size on the performance of the classification based approach to speaker change detection.

A window of a chosen size includes a number of short-time analysis frames extracted from the speech signal in the window. The dimension of the pattern vector derived for a window by concatenating the feature vectors of its frames is therefore very high, typically in the range 150-800. We consider the support vector machine (SVM) model for binary classification of the high-dimensional pattern vectors extracted from windows of the multispeaker speech signal. For training the model, the positive examples are obtained by processing fixed-length windows around the manually marked speaker change points, and the negative examples by processing fixed-length windows of the signal between the manually marked speaker change points.

The sliding window method is used for detecting speaker change points with the trained SVM: a window of fixed length is processed to obtain a test pattern, and the window is then slid by one frame. The test patterns obtained in this way are classified by the trained SVM to give the speaker change hypotheses.

In another method for extracting patterns to train a classification model for speaker change detection, we consider the sequence of frames in a speech segment as a trajectory in the multidimensional space of the frame feature vectors [5]. Let l be the dimension of the feature vectors and m the number of frames in a given segment. The l-by-m trajectory matrix X for the segment has the m frame vectors x_1, x_2, ..., x_m as its columns:

    X = [x_1 x_2 ... x_m]                                    (1)

For speech segments of different durations, the value of m will differ. The outerproduct matrix Z of a trajectory matrix X is given by:

    Z = X X^T                                                (2)

The outerproduct matrix Z is an l-by-l matrix; its dimension is independent of the number of frames in the trajectory. The outerproduct matrix is vectorized to obtain a fixed-dimension pattern that can be used as input to a support vector machine.

The outerproduct-based method is used to obtain the positive and negative examples of speaker change points. Let t_{i-1}, t_i and t_{i+1} be the (i-1)th, ith and (i+1)th speaker change points. Let X_b be the trajectory matrix for the segment before the ith speaker change point, i.e., the segment between t_{i-1} and t_i, and let X_a be the trajectory matrix for the segment after it, i.e., the segment between t_i and t_{i+1}. The outerproduct matrices Z_b and Z_a are computed for X_b and X_a respectively.
The two outerproduct matrices are vectorized and concatenated to derive a fixed-dimension pattern vector that represents the data between t_{i-1} and t_{i+1}, and therefore represents the ith speaker change point. This method is used to obtain positive examples of speaker change points; each positive example represents the complete data of the two speakers around the change point. For the negative examples, which must contain the data of one speaker only, the speech segment from t_i to t_{i+1} is split into two subsegments. An outerproduct matrix is computed for the trajectory matrix of each subsegment, and the two matrices are vectorized and concatenated to derive a fixed-dimension pattern vector. Thus the complete data of one speaker between two consecutive speaker change points is represented in the pattern vector of a negative example.

A support vector machine is trained with the patterns extracted using the outerproduct matrix method. This method can be used for classifying manually marked segments in the test data; however, it cannot be used for on-line detection of speaker change points. In Section 4, we compare the classification performance of the SVM trained with patterns extracted using a fixed-length window and the SVM trained with patterns extracted using the outerproduct matrix method. In the next section, we describe the speaker change detection system using the SVM trained with the fixed-length window based patterns.
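The outerproduct-based pattern extraction for one change point can be sketched as below, assuming the two trajectory matrices are available; the shapes follow the text (l feature dimensions, variable numbers of frames), and the function name is ours:

```python
import numpy as np

def outerproduct_pattern(X_before, X_after):
    """Map two variable-length trajectories (l x m matrices) to a
    single fixed-dimension pattern vector of size 2*l*l, per
    equations (1) and (2): vectorize Z = X X^T for each segment
    and concatenate."""
    Z_b = X_before @ X_before.T   # l x l, independent of frame count
    Z_a = X_after @ X_after.T
    return np.concatenate([Z_b.ravel(), Z_a.ravel()])

l = 39
X_b = np.random.randn(l, 120)  # segment before the change point
X_a = np.random.randn(l, 75)   # segment after it, different duration
p = outerproduct_pattern(X_b, X_a)
```

The point of the construction is visible in the shapes: however many frames each segment has, the pattern dimension stays 2 * l * l, so segments of unequal duration map to comparable SVM inputs.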

3 Speaker change detection system

The input to the speaker change detection system is a continuous speech signal of multispeaker speech data, as in an audio recording of a conversation or broadcast news. Multispeaker speech data typically contains many silence regions due to pauses while speaking. These pauses must be removed from the speech signal so that the fixed-length windows around the speaker change points include the data of two speakers. We train a support vector machine for the detection of silence regions in the continuous speech signal: manually marked silence regions are processed to extract positive examples of silence, and manually marked speech regions to extract negative examples. A pattern vector is obtained by concatenating three frames in a silence region or in a speech region. The sliding window method, with a window width of three frames, is used to detect the silence regions in the continuous speech signal using the SVM trained with these examples [6].

After detection and removal of the silence regions, the continuous speech signal is given as input to the speaker change detection SVM. For a chosen window length of n frames, the sliding window method is used to derive the test patterns, and the test patterns with a positive SVM output are hypothesized as speaker change points. As a single window length is not suitable for all speaker turn durations, several hypotheses may be spurious. We consider two methods for reducing the number of false alarms. In the first method, a threshold of five frames is imposed on the duration of speaker turns: when there are multiple hypotheses within a window of five frames, the hypothesis with the maximum SVM output is retained and the others are removed. The SVM output is thus smoothed to eliminate redundant hypotheses corresponding to very short speaker turns.
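The sliding-window hypothesis generation and the five-frame smoothing can be sketched as follows. The trained SVM is replaced here by a stand-in scoring function, so this only illustrates the control flow, not the paper's model; in the toy run, the frame index is stored in the feature vector so the stand-in scorer can peak near a known change point at frame 50:

```python
import numpy as np

def hypothesize_changes(features, score_fn, win=20):
    """Slide a win-frame window one frame at a time; windows with a
    positive classifier output yield change-point hypotheses, recorded
    as (centre frame, score) pairs."""
    hyps = []
    for t in range(len(features) - win + 1):
        s = score_fn(features[t:t + win])
        if s > 0:
            hyps.append((t + win // 2, s))
    return hyps

def smooth(hyps, min_gap=5):
    """Greedy approximation of the paper's smoothing: among hypotheses
    closer than min_gap frames, keep only the one with the maximum
    classifier output (removes very short speaker turns)."""
    kept = []
    for t, s in sorted(hyps):
        if kept and t - kept[-1][0] < min_gap:
            if s > kept[-1][1]:
                kept[-1] = (t, s)
        else:
            kept.append((t, s))
    return [t for t, _ in kept]

# toy run: scorer is positive only when the window centre is near frame 50
feats = np.zeros((100, 1))
feats[:, 0] = np.arange(100)
score_fn = lambda w: 3.0 - abs(w[len(w) // 2, 0] - 50)
hyps = hypothesize_changes(feats, score_fn)   # five raw hypotheses, 48..52
peaks = smooth(hyps)                          # single smoothed hypothesis
```

In the real system, `score_fn` would be the decision value of the trained speaker change detection SVM applied to the flattened window pattern.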
For a further reduction in the number of false alarms, we evaluate the performance of the speaker change detection SVM on a validation dataset. The false hypotheses for the negative examples in the validation dataset are identified and used as negative examples in training an SVM for reducing the number of false alarms. The positive examples used in training the speaker change detection SVM are reused as positive examples for this false alarm reduction SVM, which provides further discrimination between the correct and false hypotheses given by the speaker change detection SVM. The block diagram of the proposed speaker change detection system is given in Figure 1.

4 Studies on speaker change detection

For our experiments on speaker change detection, we use the extended data of the NIST 2003 speaker recognition evaluation database. The extended data consists of two-speaker conversations, each of about five minutes' duration. A total of 9 conversations is used in our studies: 3 conversations for each of male-male, male-female and female-female speaker pairings. The speaker change points in all 9 conversations are manually marked. The total dataset is divided into a training dataset, a validation dataset and a test dataset, each including one male-male, one male-female and one female-female conversation.
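The construction of the training set for the false alarm reduction SVM amounts to mining the first-stage detector's mistakes on the validation data. A sketch under stated assumptions (the change points and first-stage hypotheses are frame indices, a hypothesis within a small tolerance of a true change point counts as correct; the tolerance value and all names are ours):

```python
def fa_reduction_training_set(patterns, true_change_frames,
                              first_stage_hyps, tolerance=5):
    """Collect negatives for the false alarm reduction SVM: validation
    patterns that the first-stage SVM wrongly hypothesized as speaker
    changes. Positives are reused from the first stage by the caller."""
    negatives = []
    for t in first_stage_hyps:
        if all(abs(t - c) > tolerance for c in true_change_frames):
            negatives.append(patterns[t])  # a false alarm
    return negatives

# toy run: one true change at frame 50; hypotheses at 12 and 80 are
# false alarms, the one at 49 is a correct detection
negs = fa_reduction_training_set(list(range(100)), [50], [12, 49, 80])
```

Training a second SVM on exactly these hard negatives is what lets the cascade discriminate the first stage's correct hypotheses from its false ones.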

Fig. 1. Block diagram of the speaker change detection system

The speech data is processed using a frame size of 20 ms. Each frame is represented by a 39-dimensional feature vector consisting of 12 cepstral coefficients, energy, and their first-order and second-order derivatives. The speech data of the conversations in the training dataset is processed to obtain positive and negative examples for training the speaker change detection SVM. The speech data of the conversations in the validation dataset is used to obtain the negative examples for training the false alarm reduction SVM. The speech data of the conversations in the test dataset is used for evaluating the performance of the speaker change detection system; the test dataset includes a total of about 16000 frames and 282 speaker change points. The Gaussian kernel is used for building the SVMs.

The speech data of a conversation is given as input to the speaker change detection system. The sliding window method is used to obtain the hypotheses from the speaker change detection SVM. The output of the SVM is smoothed to eliminate the short-duration speaker turns, and the remaining hypotheses are processed by the false alarm reduction SVM to give the speaker change detection points.

The speaker change detection performance is measured by the missed detection rate (MDR) and the false alarm rate (FAR). The missed detection rate is defined as the ratio of the number of speaker change points missed (M) to the number of actual speaker change points (A):

    MDR = (M / A) * 100                                      (3)

The false alarm rate is defined as:

    FAR = (F / (T - A)) * 100                                (4)

where F is the number of false hypotheses and T is the number of test patterns.

The MDR and FAR are determined at different stages of the speaker change detection system; the performance for different window lengths is given in Table 1. The window length of 20 frames (i.e., 400 ms duration) gives the lowest missed detection rate of 3.19%, i.e., 9 speaker change points are not detected. The number of false alarms is 5571, giving a false alarm rate of 32.76%. Smoothing the SVM output reduces the false alarm rate significantly, to 8.26%, and the false alarm reduction SVM reduces it further, to 6.78%. It may be noted that as the false alarm rate is reduced, the missed detection rate increases. However, in a given conversation the number of speaker change points is far smaller than the total number of test patterns (frames): the number of missed detections increases from 9 to 33, whereas the number of false alarms is reduced from 5571 to 1092. These results show the effectiveness of the methods for false alarm reduction.

Table 1. Performance of the speaker change detection system at various stages

    Window size   After speaker change   After smoothing   After false alarm
    (frames)      hypothesization                          reduction
                  MDR      FAR           MDR      FAR      MDR      FAR
    5             4.96     15.45         9.22     5.94     15.25    5.19
    10            9.57     14.09         13.12    5.63     20.21    4.18
    15            10.64    20.35         13.83    6.79     16.31    5.49
    20            3.19     32.76         8.51     8.26     11.70    6.78

Finally, we compare the classification performance of the different pattern extraction methods. The outerproduct matrix based method, which uses the complete data around a speaker change, gives a missed detection rate of 11.51%; the fixed-length window of 20 frames gives approximately the same MDR of 11.7%.
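The two evaluation measures can be written out directly; here we read the denominator of the FAR equation as the number of test patterns that are not actual change points (T - A), which is our interpretation of the garbled original formula, and check the computation against the numbers quoted in the text (M = 9 then 33 missed points out of A = 282):

```python
def mdr(missed, actual):
    """Missed detection rate, Eq. (3): missed over actual change points."""
    return 100.0 * missed / actual

def far(false_hyps, total_patterns, actual):
    """False alarm rate, Eq. (4): false hypotheses over the number of
    test patterns that are not actual change points."""
    return 100.0 * false_hyps / (total_patterns - actual)

print(round(mdr(9, 282), 2))   # lowest MDR before false alarm reduction
print(round(mdr(33, 282), 2))  # MDR after false alarm reduction
```

These reproduce the 3.19% and 11.70% figures for the 20-frame window; the FAR values additionally depend on the exact test-pattern count T, which the text gives only approximately (about 16000 frames).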
5 Summary and Conclusions

In this paper, we have proposed a classification-based approach for speaker change detection in multispeaker speech data. We have considered different methods for extracting the fixed-dimension patterns representing the positive and negative examples used to train a support vector machine for speaker change detection, and we have proposed two methods for reducing the large number of false alarms. The effectiveness of the proposed methods is demonstrated on the NIST 2003 speaker recognition evaluation data.

References

1. P. Delacourt and C. J. Wellekens, "DISTBIC: A speaker-based segmentation for audio data indexing," Speech Communication, vol. 32, pp. 111-126, 2000.
2. L. Lu and H.-J. Zhang, "Speaker change detection and tracking in real-time news broadcasting analysis," in Proc. 10th ACM International Conference on Multimedia, Dec. 2002, pp. 602-610.
3. A. Adami, S. Kajarekar, and H. Hermansky, "A new speaker change detection method for two-speaker segmentation," in Proc. International Conference on Acoustics, Speech and Signal Processing, May 2002.
4. S. Kwon and S. Narayanan, "Speaker change detection using a new weighted distance measure," in Proc. International Conference on Spoken Language Processing, 2002, vol. 4, pp. 2537-2540.
5. C. Chandra Sekhar and M. Palaniswami, "Classification of multidimensional trajectories for acoustic modeling using support vector machines," in Proc. International Conference on Intelligent Sensing and Information Processing, Jan. 2004, pp. 153-158.
6. V. Kartik, "Speaker turn detection system using support vector machines," M.Tech. Thesis, Indian Institute of Technology Madras, 2004.