Celebrity Voices

Paulo Eduardo dos Santos Veloso Braga
Instituto Superior Técnico
Av. Rovisco Pais, 1049-001 Lisboa, Portugal
paulobraga@ist.utl.pt

Abstract

This paper describes a text-independent speaker verification system applied to finding the voices of well-known persons in broadcast news shows. Two different classifiers were trained and tested with test segments manually defined by annotators. Under these conditions, the SVM-GSV classifier performed better than the GMM-UBM one, particularly for longer segments. The performance difference between short and long segments led us to create a GMM-UBM+SVM-GSV system, which classifies a test segment using one of the two classifiers, depending on the duration of the test segment. This system was used to identify target speakers in recent news shows, for which segments were automatically defined by an Audio Pre-Processing module. The performance of the GMM-UBM+SVM-GSV system was lower on this data, which led to new, successful experiments with further tuning. The results are integrated into the existing media monitoring system and presented in a web page, where it is possible to view a video of each segment assigned to each speaker identified by the system.

Keywords

Speaker Recognition, Text-independent Verification, Celebrity Voices, Gaussian Mixture Models, Support Vector Machines, Supervector, News Shows

I. INTRODUCTION

As more applications take advantage of recent developments in telecommunications and the Internet, there is a growing need to recognize people by their physical characteristics as a way of uniquely identifying them. This interest spans different areas, such as access control systems [1], authentication for long distance calls or banking access [2], personalized answers by answering machines [3], automatic recognition of speakers in large numbers of recorded files [4], forensics [5], etc.
The voice is one of the features that can be used to identify a human being, since each person's voice has unique characteristics. Depending on the application, speaker recognition is usually divided into two areas: identification and verification. The goal of speaker identification is to determine which speaker among a group of known speakers matches the segment analyzed. This is also referred to as closed-set speaker identification. In the second area, speaker verification, the goal is to determine whether a segment belongs to a given speaker or not. This is sometimes referred to as the open-set problem.

A speaker recognition system may also be text-dependent or text-independent, depending on the type of data used for training and testing the system. In a text-independent system, different speech materials can be used for training and testing. A text-dependent system, on the other hand, assumes that the speech materials used in the test are the same as the ones used during the enrollment or training phase (e.g. keywords, digits, predetermined sentences, etc.). Text-independent speaker verification is the basis of most speaker recognition applications, with the additional difficulty of not controlling what the speakers say.

The goal of this work was to develop a text-independent speaker verification system applied to finding famous speakers in broadcast news shows. Figure 1 shows the block diagram of a generic speaker verification system. The first module segments the incoming signal into segments that are homogeneous in terms of acoustic background conditions and speaker. The second module extracts relevant features from the segments classified as speech. These features are used for training and testing classifiers, typically based on statistical machine learning methods. The last modules perform score normalization and provide an accept/reject decision.

Fig. 1. Basic block diagram of a speaker verification system (blocks: Input Speech, Audio Pre-Processing, Feature Extraction, Classification System, Normalization, Accept or Reject).

Collecting a large corpus for training and testing the celebrity voices recognizer is of primary importance, and is described in Section II. Section III covers the main modules mentioned above. Section IV describes the experimental results. The paper also includes a short summary of the web interface before concluding.

II. CORPORA

The corpus used in this work consists mainly of broadcast news shows collected from the Portuguese public television (RTP) [6, 7, 8]. It is divided into eight sets: one for training (Train), two for development (Devel and Pilot), and five for testing (Eval, Jeval, 11march, Rtp07 and Rtp08). Table I provides an overview of the subsets and their purpose.

TABLE I
Corpus Sets

Set       Year   News Shows   Audio (h)   Purpose
Train     2000   99           46.48       Training
Devel     2000   13           6.60        Development
Pilot     2000   11           4.79        Development
Eval      2001   12           4.53        Evaluation
Jeval     2001   14           13.52       Evaluation
11march   2004   7            5.33        Evaluation
Rtp07     2007   6            4.79        Evaluation
Rtp08     2008   5            3.69        Evaluation

TABLE II
Training time of speakers identified by the system

Speaker               Training Time (s)
Almeida Santos        86.04
António Guterres      1415.79
Durão Barroso         1070.33
Ferreira do Amaral    514.08
Freitas do Amaral     107.65
Jaime Gama            151.31
João Vale e Azevedo   974.93
Jorge Coelho          371.63
Jorge Sampaio         783.83
José Mourinho         267.51
José Saramago         51.01
José Sócrates         295.07
Mário Soares          88.38
Paulo Portas          1565.93
Santana Lopes         114.51
Xanana Gusmão         168.48

A final test set of news shows was used to test the system in real conditions.
This set was collected at the Spoken Language Systems Laboratory, INESC-ID, through a cable television service, during 2011. It consists of 4 news shows, listed in Table III, which include at least 3 different target speakers.

TABLE III
Final test set

News Show                 Segments   Average Duration (s)
2011_04_04-Telejornal-1   194        15.90
2011_05_06-Telejornal-1   153        20.24
2011_05_12-Telejornal-1   211        13.79
2011_05_25-Telejornal-1   256        11.27

To illustrate the potential of the system, 16 celebrity speakers were selected. The speakers and their corresponding training times are listed in Table II. The sets in Table I were manually annotated and provide word-level transcriptions and speaker characterization (gender, identity). Using this information, a set of target speakers and a set of training segments were defined, as well as a set of test segments. An additional 200 speakers were chosen in order to normalize the scores obtained by the speaker models. These speakers were chosen based on their training time, which averages about 8 minutes. The test set consists of 6407 imposter segments, with an average duration of 17.43 seconds, and 180 authentic segments, with an average duration of 20.34 seconds. For the final test set, the number and average duration of the target-speaker segments in each news show are given in Table IV.

TABLE IV
Segments by target speakers in the final test set

News Show                 Segments   Average Duration (s)
2011_04_04-Telejornal-1   15         30.02
2011_05_06-Telejornal-1   6          20.81
2011_05_12-Telejornal-1   6          24.50
2011_05_25-Telejornal-1   19         13.84

III. SPEAKER VERIFICATION BLOCKS

A. Audio Pre-Processing

The audio pre-processing module [9] is part of the Audimus framework, and involves blocks for acoustic change detection, speech/non-speech classification, gender classification, and speaker clustering.

B. Feature extraction

For feature extraction, 19 PLP coefficients [10] plus energy are extracted from the speech signal every 10 ms using a 20 ms Hamming window. First-order delta and delta-delta coefficients are then appended, resulting in a 60-dimensional feature vector for every frame.

C. GMM-UBM

The Gaussian Mixture Model (GMM) [11] is one of the reference methods in speaker recognition, and hence had to be included in the set of classifiers integrated in our system. The Universal Background Model (UBM) was trained with the Expectation-Maximization (EM) algorithm on features extracted from the entire training set, a total of about 47 hours. The 16 speaker models and the 200 imposter models, necessary for score normalization, were derived by adapting only the mean parameters of the UBM mixtures, using each speaker's training speech and maximum a posteriori (MAP) estimation. The speaker and imposter models have 1024 mixtures. Only diagonal covariance matrices were used.

D. SVM-GSV

The second type of classifier is based on Gaussian supervectors (GSV) [12]. The linear kernel is derived by bounding the Kullback-Leibler (KL) divergence between GMMs. The supervector of each speaker model is the concatenation of all the means of the Gaussian mixture model obtained in the GMM-UBM classifier, resulting in a supervector with 61440 dimensions (1024 x 60). The kernel is linear in the supervector, i.e., the mapping from the supervector to the SVM expansion space is a diagonal linear transform. For SVM training, every supervector was used: the data from the target speaker serve as positive examples, and the data from the other 15 speakers and the 200 imposters serve as negative examples. Given a test segment, a GMM is trained by MAP adaptation of the means (as in the GMM-UBM classifier).
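As an illustration of the mean-only MAP adaptation and of the supervector construction described above, the following Python sketch adapts the means of a small diagonal-covariance UBM to a set of frames and concatenates the adapted means. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, and the relevance factor of 16 is an assumed value that the paper does not report. With 1024 mixtures and 60-dimensional frames, `supervector` would return the 61440 values mentioned in the text.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM mixture given frame x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation; weights and variances stay those of the UBM.

    relevance is an assumed relevance factor, not a value from the paper.
    """
    n_mix, dim = len(means), len(means[0])
    counts = [0.0] * n_mix                      # soft frame count per mixture
    sums = [[0.0] * dim for _ in range(n_mix)]  # posterior-weighted frame sums
    for x in frames:
        for k, gamma in enumerate(posteriors(x, weights, means, variances)):
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    # Equivalent to alpha * data_mean + (1 - alpha) * ubm_mean,
    # with alpha = count / (count + relevance).
    return [
        [(sums[k][d] + relevance * means[k][d]) / (counts[k] + relevance)
         for d in range(dim)]
        for k in range(n_mix)
    ]


def supervector(adapted_means):
    """Concatenate all adapted means into a single vector."""
    return [value for mean in adapted_means for value in mean]
```

Mixtures that receive little data stay close to the UBM means, which is why short segments yield supervectors that differ only slightly from the UBM supervector.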
A supervector is formed from this adapted model. The output score is obtained by computing a single inner product between the speaker model and the supervector of the test segment.

E. GMM-UBM+SVM-GSV

The third classifier was implemented to exploit the strengths of the two previous classifiers, depending on the duration of the test segment. The result of the GMM-UBM+SVM-GSV system is a combination of the results obtained by the individual GMM-UBM and SVM-GSV systems: the GMM-UBM classifier is used for test segments shorter than 8 seconds, and the SVM-GSV classifier is used for longer segments. This duration threshold was optimized on the development set. The GMM-UBM and SVM-GSV modules keep all the features previously described and are used without changing the way their scores are calculated.

F. Score normalization

Score normalization is used to enhance the performance of the system and relies on the 200 imposter models to perform ZT score normalization [13].

IV. RESULTS

A. Evaluation metrics

The evaluation metrics are the detection error tradeoff (DET) curve, the equal error rate (EER), and the minimum decision cost function (DCFmin) [14]. The decision cost function is defined as

$$\mathrm{DCF} = C_{miss}\,P_{miss}\,P_{target} + C_{fa}\,P_{fa}\,(1 - P_{target}) \qquad (1)$$

where $P_{miss}$ and $P_{fa}$ are the probabilities of miss and false alarm, respectively, at the decision threshold, $C_{miss}$ and $C_{fa}$ are the corresponding costs, and $P_{target}$ is the prior probability of a target speaker. DCFmin is the minimum of this function over all thresholds.

B. Results with manually segmented corpora

The first set of results was obtained by first concatenating all adjacent segments that were manually labeled as belonging to the same speaker, whenever the gap between segments was less than 1 second. This was done in order to increase the average duration of each test segment. The time boundaries were manually defined. The systems were tested using 6407 imposter segments, with an average duration of 17.43 seconds, and 180 target speaker segments, with an average duration of 20.34 seconds. Results for the three systems are shown in Figure 2.
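The concatenation of adjacent same-speaker segments described above can be sketched as follows. This is a minimal illustration, assuming segments are given as speaker label plus start and end times in seconds; the `Segment` type and the `merge_adjacent` function are hypothetical names, not part of the system described here.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def merge_adjacent(segments: List[Segment], max_gap: float = 1.0) -> List[Segment]:
    """Merge consecutive segments of the same speaker separated by less than max_gap."""
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start - last.end < max_gap:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = last._replace(end=max(last.end, seg.end))
        else:
            merged.append(seg)
    return merged
```

The gap is exposed as a parameter so that the same routine can be reused with a different tolerance.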
The SVM-GSV classifier outperforms the GMM-UBM classifier, and the best performance is obtained by the GMM-UBM+SVM-GSV classifier. The results in terms of EER and DCFmin for the individual systems (GMM-UBM, SVM-GSV) and for the combination of both (GMM-UBM+SVM-GSV) are shown in Figure 3.

Fig. 2. Comparison between DET curves for GMM-UBM, SVM-GSV and GMM-UBM+SVM-GSV systems.

Fig. 3. Performance in terms of DCFmin and EER for GMM-UBM, SVM-GSV and GMM-UBM+SVM-GSV systems. Bar values:

System            EER     DCFmin
GMM-UBM           0.095   0.034
SVM-GSV           0.083   0.033
GMM-UBM+SVM-GSV   0.067   0.030

C. Results with automatically segmented corpora

The second set of results corresponds to automatically segmented corpora, in which the start and end times are obtained by the Audio Pre-Processing (APP) module, and adjacent segments by the same speaker are concatenated whenever the gap between them is less than 2.5 seconds. The results involve only the GMM-UBM+SVM-GSV classifier. The decision threshold that defines whether a segment belongs to a speaker or not was calculated based on the value of the EER. Thus, a segment is assigned to a speaker if the score obtained by that speaker's model is greater than 1.5. If a segment is accepted by two or more speaker models, it is assigned to the speaker with the highest score.

Figure 4 presents the DET curve of the GMM-UBM+SVM-GSV system when tested with the segments obtained by the Audio Pre-Processing module for the four news shows. The EER and DCFmin obtained for each news show are shown in Table V. Overall, the GMM-UBM+SVM-GSV system presents an EER of 8.7% and a DCFmin of 0.046. For a decision threshold of 1.5, the system has a false positive rate of 7.0% and a false negative rate of 15.2%. The 2011_05_06-Telejornal-1 and 2011_05_12-Telejornal-1 news shows present a higher EER because they contain too few authentic segments.

TABLE V
Performance in terms of DCFmin and EER

News Show                 EER     DCFmin
2011_04_04-Telejornal-1   0.067   0.028
2011_05_06-Telejornal-1   0.167   0.023
2011_05_12-Telejornal-1   0.139   0.038
2011_05_25-Telejornal-1   0.105   0.052

Fig. 4. DET curve of the GMM-UBM+SVM-GSV system.

The performance of the GMM-UBM+SVM-GSV system on the 2011 test set was lower than the one obtained with the previous test sets, as expected. The greater difficulty in detecting the segments of the target speakers was mainly attributed to the following reasons:

- Speaker models were trained with data from the year 2000 and tested with recent segments, collected during 2011. It is well known that voices change with age, and so do recording conditions.
- The decision threshold was chosen on the basis of results obtained by the speaker models when tested with segments collected during 2001, 2004, 2007 and 2008, representing respectively 56%, 17%, 15% and 12% of the useful duration of the test set.
- The segments of the 2011 test set were obtained using the Audio Pre-Processing module and are therefore not perfectly segmented, i.e., the target speaker segments occasionally contain other speakers as well.

This analysis led us to tune the threshold of the GMM-UBM+SVM-GSV system using results obtained with segments from the APP module. After this tuning, the false negative rate was reduced from 15.2% to 8.7%, at the cost of increasing the false positive rate by 1.7%.
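The duration-based choice of classifier (Section III.E) and the decision rule of Section IV.C can be summarized in a short Python sketch. It assumes that the normalized scores of a segment are already available as a mapping from speaker name to score; the function names are hypothetical, and the defaults simply restate the 8-second switch and the 1.5 threshold quoted in the text.

```python
from typing import Dict, Optional


def choose_system(duration_s: float, switch_s: float = 8.0) -> str:
    """GMM-UBM for segments shorter than switch_s seconds, SVM-GSV otherwise."""
    return "GMM-UBM" if duration_s < switch_s else "SVM-GSV"


def assign_speaker(scores: Dict[str, float], threshold: float = 1.5) -> Optional[str]:
    """Return the highest-scoring speaker whose score exceeds the threshold.

    Returns None when no speaker model accepts the segment.
    """
    accepted = {spk: s for spk, s in scores.items() if s > threshold}
    if not accepted:
        return None
    return max(accepted, key=accepted.get)
```

Keeping the threshold as a parameter reflects the tuning step: only this value changes when the operating point is moved.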

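The EER and DCFmin of Section IV.A can be estimated from lists of target and imposter scores by sweeping the decision threshold, as in the following Python sketch. The cost and prior values used as defaults (`c_miss`, `c_fa`, `p_target`) are illustrative assumptions only; the values used in this work are not stated here. The function names are hypothetical.

```python
from typing import List, Tuple


def error_rates(targets: List[float], impostors: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, P_miss, P_fa) for each candidate threshold.

    A segment is accepted when its score is strictly greater than the threshold.
    """
    thresholds = [float("-inf")] + sorted(set(targets) | set(impostors))
    points = []
    for t in thresholds:
        p_miss = sum(s <= t for s in targets) / len(targets)
        p_fa = sum(s > t for s in impostors) / len(impostors)
        points.append((t, p_miss, p_fa))
    return points


def eer(targets: List[float], impostors: List[float]) -> float:
    """Equal error rate: operating point where P_miss and P_fa are closest."""
    _, p_miss, p_fa = min(error_rates(targets, impostors),
                          key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2.0


def min_dcf(targets: List[float], impostors: List[float],
            c_miss: float = 10.0, c_fa: float = 1.0, p_target: float = 0.01) -> float:
    """Minimum over thresholds of C_miss*P_miss*P_target + C_fa*P_fa*(1 - P_target).

    The default costs and prior are assumed, illustrative values.
    """
    return min(c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
               for _, p_miss, p_fa in error_rates(targets, impostors))
```

This direct sweep is quadratic in the number of scores, which is acceptable for a test set of a few thousand segments; a sorted cumulative count would be preferable for larger sets.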
V. WEB INTERFACE

The results obtained by the GMM-UBM+SVM-GSV system for each news show analyzed are displayed in a web page, in the same style as the web interface designed for selective dissemination of multimedia information [6, 9]. This interface, illustrated in Figure 5, makes it possible to view all segments assigned to a given speaker, as well as the duration of each segment and the score obtained by the speaker model. The correct attribution of a segment to a speaker can be confirmed visually, since it is possible to watch the video of each segment or of the entire television news show.

Fig. 5. HTML page with the segments belonging to a target speaker.

VI. CONCLUSIONS

This paper described a speaker verification system applied to finding the voices of well-known persons in broadcast news shows. Two different classifiers were tested with test segments defined by annotators. Under these conditions, the SVM-GSV classifier performed better than the GMM-UBM one, particularly for longer segments. The performance difference between short and long segments led us to create a GMM-UBM+SVM-GSV system, which classifies a test segment depending on its duration. This system was used to identify target speakers in recent news shows, for which segments were automatically defined by an Audio Pre-Processing module. The performance of the GMM-UBM+SVM-GSV system was lower on this data, which led to new, successful experiments with further tuning.

VII. REFERENCES

[1] J. Naik and G. Doddington, Evaluation of a high performance speaker verification system for access control, Proc. ICASSP 1987, pp. 2392-2395.
[2] J. Naik, L. Netsch and G. Doddington, Speaker verification over long distance telephone lines, Proc. ICASSP 1989, pp. 524-527.
[3] C. Schmandt and B. Arons, A conversational telephone messaging system, IEEE Trans. Consumer Electron. 30(3), xxi-xxiv, 1984.
[4] L. Wilcox, F. Chen, D. Kimber and V. Balasubramanian, Segmentation of speech using speaker identification, Proc. ICASSP 1994, pp. I.161-I.164.
[5] W. Campbell, D. Reynolds, J. Campbell and K. Brady, Estimating and evaluating confidence for forensic speaker recognition, Proc. ICASSP 2005, pp. 717-720.
[6] J. Neto, H. Meinedo, R. Amaral and I. Trancoso, A System for Selective Dissemination of Multimedia Information, Proc. ISCA MSDR 2003.
[7] H. Meinedo, D. Caseiro, J. P. Neto and I. Trancoso, AUDIMUS.media: A Broadcast News Speech Recognition System for the European Portuguese Language, PROPOR 2003.
[8] H. Meinedo, A. Abad, T. Pellegrini, I. Trancoso and J. P. Neto, The L2F Broadcast News Speech Recognition System, Fala2010.
[9] H. Meinedo, Audio Pre-processing and Speech Recognition for Broadcast News. Lisboa, 2008.
[10] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America 87, pp. 1738-1752, 1990.
[11] D. Reynolds, T. Quatieri and R. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, pp. 19-41, 2000.
[12] W. Campbell, D. Sturim and D. Reynolds, Support vector machines using GMM supervectors for speaker verification, IEEE Signal Processing Letters, pp. 308-311, 2006.
[13] R. Zheng, S. Zhang and B. Xu, A Comparative Study of Feature and Score Normalization for Speaker Verification. Springer, 2005.
[14] The NIST Year 2010 Speaker Recognition Evaluation Plan, http://www.itl.nist.gov/iad/mig/tests/sre/2010/index.html, 2010.