Gender-Dependent Acoustic Models Fusion Developed for Automatic Subtitling of Parliament Meetings Broadcasted by the Czech TV

Jan Vaněk and Josef V. Psutka

Department of Cybernetics, West Bohemia University, Pilsen, Czech Republic
{vanekyj,psutka_j}@kky.zcu.cz
http://www.kky.zcu.cz

Abstract. Gender-dependent (male/female) acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model. This paper deals with the problem of how to use these gender-based acoustic models in a real-time LVCSR (Large Vocabulary Continuous Speech Recognition) system that has been used by the Czech TV for more than a year for automatic subtitling of Parliament meetings broadcast on the channel ČT24. Frequent changes of speakers and the direct connection of the LVCSR system to the TV audio stream require switching or fusing the models automatically and as quickly as possible. The paper presents various techniques that use the output probabilities either for quick selection of the better model or for combining the models. The best proposed method achieved over 11% relative WER reduction in comparison with the gender-independent (GI) model.

1 Introduction

In recent years, several projects have appeared that help hearing-impaired people access the information contained in the acoustic signal, especially that of the mass media. One of these projects is automatic subtitling of live broadcasts. Recently, we introduced a system for automatic subtitling of Parliament meetings broadcast by the Czech Television (ČT). This system has now been used by the ČT on the channel ČT24 for more than a year (see details in [1]). Frequent changes of speakers and the direct connection of the LVCSR system to the TV audio stream bring interesting challenges. This paper describes our effort to build and use gender-dependent acoustic models. Gender-dependent acoustic modeling is a very efficient way to increase accuracy over gender-independent modeling in LVCSR and has been considered previously in the literature [2]. The most typical applications work in two passes: in the first pass a gender-detection method is applied (based on GMMs or on multilayer perceptrons, MLPs), and in the second pass the speech is recognized with the corresponding gender-specific acoustic model [3]. In this paper we propose new combination methods for fusion of the acoustic models. These methods are applied at the level of the acoustic models' output probabilities. In recent years, the large amount of computation related to the acoustic model has become negligible thanks to the increasing speed of computers and the capacity of computer memory.

From that point of view, it is possible to evaluate several acoustic models simultaneously and to switch or even combine their output probabilities in real-time applications. We would like to discuss such methods and compare them with the commonly used ones.

2 Methods

Various techniques for acoustic model switching/fusion were proposed. All techniques were designed for real-time applications; therefore, only a short history preceding the currently processed frame is needed. The first two methods are based on pure switching between the individual acoustic models. The third method switches the output probabilities for each time/state independently across all acoustic models. The remaining methods are based on the evaluated total probability of the actual frame for all acoustic models. Some of the proposed methods use exponential forgetting to smooth the probability volatility. A detailed description of the methods follows.

2.1 Frame Arg Max

This method, marked as Frame_argmax, chooses for the actual frame the acoustic model that maximizes a given criterion. This criterion can be defined in several ways; the commonly used criterion is the output probability of a GMM or an MLP. Because the output probabilities of all states in all acoustic models had to be computed for the other switching and fusion methods anyway, we used the total probability of all states of the acoustic model for the actual frame as our criterion:

P(\lambda_k \mid o_t) = \sum_{i=1}^{I} P_k(s_i \mid o_t),   (1)

where the total probability is the sum over all I states s_i of the acoustic model k, and P_k(s_i \mid o_t) is the output probability of the state s_i of the k-th acoustic model given the feature vector o_t at time t. According to our experiments, this criterion gives results similar to the commonly used GMM-based criterion. The Frame_argmax method chooses for the actual frame the model with the highest total probability. This means that first k_max is evaluated as

k_{\max} = \arg\max_{k \in 1 \ldots M} P(\lambda_k \mid o_t)   (2)

and then the new probabilities are

\hat{P}(s_i \mid o_t) = P_{k_{\max}}(s_i \mid o_t),   (3)

where M is the number of acoustic models and \hat{P}(s_i \mid o_t) is the newly evaluated probability of the state s_i.
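To make the selection rule concrete, the following NumPy sketch implements the Frame_argmax switching of Eqs. (1)-(3). It assumes the per-state output probabilities of all M models for the current frame are already available as a matrix; the function name, the array layout, and the example numbers are ours, not from the paper.

```python
import numpy as np

def frame_argmax(state_probs):
    """Per-frame model switching (Frame_argmax, Eqs. 1-3).

    state_probs: array of shape (M, I) with P_k(s_i | o_t) for all
    M acoustic models and I tied HMM states at the current frame t.
    Returns the new per-state probabilities P_hat(s_i | o_t).
    """
    totals = state_probs.sum(axis=1)   # Eq. (1): total probability per model
    k_max = int(np.argmax(totals))     # Eq. (2): best model for this frame
    return state_probs[k_max]          # Eq. (3): take its states unchanged

# Illustrative usage, with random numbers standing in for real HMM outputs:
rng = np.random.default_rng(0)
probs = rng.random((2, 4922))          # 2 gender models, 4,922 tied states
fused = frame_argmax(probs)
```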

2.2 Frame Arg Max with Exponential Forgetting

Because the time behavior of the total probability is volatile, some kind of smoothing should be used. Exponential forgetting is a good choice for real-time applications. The total probabilities for all models are computed as

P_t(\lambda_k) = \alpha P_{t-1}(\lambda_k) + (1 - \alpha) P(\lambda_k \mid o_t),   (4)

where the parameter α was set to 0.95; this value lay in the center of the optimal region in preliminary experiments. The relation between the α value and the word error rate was examined, and the results are shown in Section 5. This method, marked as Frame_argmax_exp, is practically the same as the previous one except that the smoothed total probability P_t(\lambda_k) is used instead of P(\lambda_k \mid o_t).

2.3 Independent Maximum

The method marked as Maximum takes as the new probability of the state s_i the highest value over all M acoustic models:

\hat{P}(s_i \mid o_t) = \max_{k \in 1 \ldots M} P_k(s_i \mid o_t).   (5)

This means that the highest output probability is sought for each state s_i across all M acoustic models at every time t.

2.4 Independent Multiplication

The following methods, contrary to the previous ones, fuse the output probabilities of the states across all available acoustic models. The first method, marked as Multiply, is a simple multiplication of the M acoustic models' likelihoods for each individual state:

\hat{P}(s_i \mid o_t) = \sqrt[M]{\prod_{k=1}^{M} P_k(s_i \mid o_t)},   (6)

where P_k(s_i \mid o_t) is the output probability of the state s_i of the k-th acoustic model. The M-th root is used to normalize the probability back into the original range. This approach is implemented internally as an average in the log-likelihood domain.

2.5 Independent Average

The second fusion method, marked as Average, is a simple average of the M acoustic models' likelihoods for each individual state:

\hat{P}(s_i \mid o_t) = \frac{1}{M} \sum_{k=1}^{M} P_k(s_i \mid o_t).   (7)
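The three frame-independent rules above differ only in the reduction applied across the model axis. Here is a minimal sketch under the same assumed (M, I) probability layout as before; computing Eq. (6) as an average in the log domain follows the implementation note in Section 2.4, while the small epsilon guard against log(0) is our addition:

```python
import numpy as np

def fuse_maximum(state_probs):
    # Eq. (5): per-state maximum over the M models
    return state_probs.max(axis=0)

def fuse_multiply(state_probs, eps=1e-300):
    # Eq. (6): M-th root of the product, i.e. the geometric mean,
    # computed as an average in the log-likelihood domain
    return np.exp(np.log(state_probs + eps).mean(axis=0))

def fuse_average(state_probs):
    # Eq. (7): per-state arithmetic mean over the M models
    return state_probs.mean(axis=0)
```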

2.6 Weighted Multiplication with Exponential Forgetting

As with the switching methods, some kind of smoothing should be used. The last two methods smooth via a weighted multiplication or sum of all probabilities. The weights at time t are computed as

w_t^k = \frac{P_t(\lambda_k)}{\sum_{l=1}^{M} P_t(\lambda_l)}.   (8)

The method marked as W_mult_exp evaluates the new probabilities as

\hat{P}(s_i \mid o_t) = \prod_{k=1}^{M} P_k(s_i \mid o_t)^{w_t^k}.   (9)

In the log-likelihood domain this approach can be implemented more simply as a weighted sum of the log-likelihoods with the precomputed weights w_t^k.

2.7 Weighted Sum with Exponential Forgetting

This is the last fusion method proposed in this paper. It is marked as W_sum_exp and evaluates the new probabilities as the weighted sum

\hat{P}(s_i \mid o_t) = \sum_{k=1}^{M} w_t^k P_k(s_i \mid o_t).   (10)

In summary, three switching and four fusion methods were proposed. All of them are suited to real-time processing and place no restriction on the number of acoustic models being used. For the first two switching methods there is no need to compute all probabilities of all acoustic models: only one model needs to be computed at the actual time if some estimate of the total probability of the individual models is available. This estimate can be obtained via a much smaller GMM or via an algorithm using Gaussian pruning of the evaluated HMM model. For the fusion methods, all state probabilities of all models need to be evaluated, but pruning or another fast HMM evaluation technique can be used. In addition, a single acoustic model can be evaluated in the first stage and only a small number of relevant states of the other acoustic models in the second stage. Under this scenario the computational burden increases only slightly over a single model.
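Putting Eqs. (4), (8), and (10) together, the weighted-sum fusion amounts to a small stateful filter over the frame stream. Below is a sketch under the same assumptions as the earlier snippets; the class name, the uniform initialization of the smoothed totals, and the W_mult_exp variant mentioned afterwards are ours:

```python
import numpy as np

class WeightedSumFusion:
    """W_sum_exp: per-state weighted sum with exponentially
    smoothed model weights (Eqs. 4, 8, and 10)."""

    def __init__(self, num_models, alpha=0.95):
        self.alpha = alpha  # forgetting factor; 0.95 as in the paper
        # Starting point for the smoothed totals (our choice: uniform)
        self.p_smooth = np.full(num_models, 1.0 / num_models)

    def step(self, state_probs):
        totals = state_probs.sum(axis=1)                 # Eq. (1)
        self.p_smooth = (self.alpha * self.p_smooth
                         + (1.0 - self.alpha) * totals)  # Eq. (4)
        w = self.p_smooth / self.p_smooth.sum()          # Eq. (8)
        return w @ state_probs                           # Eq. (10)
```

W_mult_exp (Eq. 9) differs only in the last line, which becomes a weighted sum in the log domain, e.g. `np.exp(w @ np.log(state_probs + 1e-300))`.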

3 Training Data Description

A high-quality microphone-based speech corpus was used for acoustic model training. This read-speech corpus consists of the speech of 800 speakers (384 males and 416 females); each speaker read 170 sentences. The database of text prompts from which the sentences were selected was obtained in electronic form from the web pages of Czech newspaper publishers [4]. Special consideration was given to the sentence selection so that the sentences provide a representative distribution of the more frequent triphone sequences (reflecting their relative occurrence in natural speech). The corpus was recorded in an office where only the speaker was present. The recording sessions yielded about 220 hours of speech in total.

4 Experimental Setup

4.1 Acoustic Processing

The analogue signal was digitized at a 22.05 kHz sample rate with 16-bit resolution. The aim of the front-end processor was to convert continuous speech into a sequence of feature vectors. Several tests were performed to determine the best parameterization settings of the acoustic data (see [5] for the methodology). The best results were achieved using PLP parameterization [6] with 27 filters and 12 PLP cepstral coefficients with both delta and delta-delta sub-features (see [7] for details); therefore, one feature vector contained 36 coefficients. Feature vectors were computed every 10 milliseconds (100 frames per second). A small illustrative sketch of this 36-dimensional feature layout is given at the end of this section.

4.2 Acoustic Modeling

Each basic speech unit in all our experiments was represented by a three-state HMM with a continuous output probability density function assigned to each state. As the number of Czech triphones is too large, phonetic decision trees were used to tie the states of Czech triphones. Several experiments were performed to determine the best recognition results according to the number of clustered states and also the number of mixtures. In all presented experiments, we used 16 mixtures of multivariate Gaussians for each of the 4,922 states. The initial single-mixture triphone acoustic model, trained by the Maximum Likelihood (ML) criterion, was made using the HTK Toolkit v3.4 [8]. Further, three 16-mixture models were trained from this initial model: gender-independent, male, and female. The training procedure had two stages: first, the 16-mixture models were trained with HTK using the ML criterion; second, the final models were obtained via two iterations of MMI-FD discriminative training [9,10].

4.3 Gender-Based Splitting

As was presented in [9], splitting via manual male/female markers need not be optimal, due to several masculine female and feminine male voices occurring in the training corpora and also because of possible errors in the manual annotations. Therefore, the initial splitting (obtained via the manual markers) was realigned via an automatic clustering algorithm. After this process, two more acoustically homogeneous classes were available for the gender-dependent acoustic modeling described in the previous subsection.

4.4 Test Conditions

The test set consists of 100 utterances from 100 different speakers (64 male and 36 female speakers) who were not included in the training data. There was no cross-talk and no speaker change within any utterance. The utterances were randomly split into 10 sets such that each set contains at least one male and one female speaker; these multi-speaker sets were created in order to simulate real-time speaker changes. All recognition experiments were performed with a bigram back-off language model with Good-Turing discounting. The language model was trained on about 10M tokens of normalized Czech Parliament transcriptions; the SRI Language Modeling Toolkit (SRILM) [11] was used for training. The model contains 186k words, the perplexity of the recognition task was 12,362, and the OOV rate was 2.4% (see [12] and [13] for details).
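As the sketch promised in Section 4.1, the snippet below stacks 12 PLP cepstral coefficients with delta and delta-delta estimates into the 36-dimensional frame vectors. The simple gradient-based delta used here is an assumption for brevity (HTK computes deltas with a windowed regression formula), and `plp_cepstra` is a hypothetical input array:

```python
import numpy as np

def add_deltas(plp_cepstra):
    """Stack static, delta, and delta-delta features.

    plp_cepstra: array of shape (T, 12), one 12-dim PLP cepstral
    vector per 10 ms frame (100 frames per second).
    Returns an array of shape (T, 36).
    """
    # First differences along time as the delta estimate; an
    # HTK-style regression delta would use a sliding window instead.
    delta = np.gradient(plp_cepstra, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([plp_cepstra, delta, delta2], axis=1)
```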

5 Results

To follow up on our last year's paper [9], the same three acoustic models were used: gender-independent (GI), male, and female. First, each of these models was tested stand-alone; then all the switching and fusion methods were evaluated. All results are given in Table 1.

Table 1. The results of the recognition experiments

  Stand-alone models    WER [%]
  Gender-independent    16.92
  Male                  22.08
  Female                30.07

  Switching or fusion   WER [%]
  Multiply              17.50
  Average               15.54
  Maximum               15.47
  Frame_argmax          17.36
  Frame_argmax_exp      16.41
  W_mult_exp            16.83
  W_sum_exp             14.96

Table 1 shows that the Multiply and Frame_argmax methods gave even higher WER than the GI model. On the other hand, some methods gave significantly lower WER than GI. The lowest WER was obtained with the W_sum_exp method: a reduction of about 2% absolute (16.92% to 14.96%) and more than 11% relative.

[Fig. 1. Relation between the α value and WER; the curve spans WER values of roughly 14.5-16.5% for α from 0 to 1.]

A proper setting of the α parameter is needed for the methods with exponential forgetting. For all these methods the optimal value range was very similar; the advisable α region is between 0.9 and 0.99. The relation between the α value and the word error rate is depicted in Figure 1.

6 Conclusion

Various methods of employing gender-dependent acoustic models in an LVCSR system were tested in this paper. The methods had to be designed for the real-time automatic subtitling task, which is connected to the live TV audio stream. Three switching and four fusion methods were proposed, described, and tested. Some of them gave significantly better results than gender-independent modeling. The lowest WER was obtained with the weighted sum of the HMM state probabilities of all acoustic models (the method marked as W_sum_exp), with a WER reduction of about 2% absolute and more than 11% relative. All the proposed methods are able to combine even more acoustic models than the number they were tested with.

Acknowledgements. This research was supported by the Grant Agency of the Czech Republic, No. 102/08/0707, and by the Ministry of Education of the Czech Republic, project No. 2C06020.

References

1. Pražák, A., Psutka, J., Hoidekr, J., Kanis, J., Müller, L., Psutka, J.: Automatic Online Subtitling of the Czech Parliament Meetings. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 501–508. Springer, Heidelberg (2006)
2. Olsen, P.A., Dharanipragada, S.: An Efficient Integrated Gender Detection Scheme and Time Mediated Averaging of Gender Dependent Acoustic Models. In: 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003), Geneva, Switzerland (2003)
3. Neto, J., Meinedo, H., Viveiros, M., Cassaca, R., Martins, C., Caseiro, D.: Broadcast News Subtitling System in Portuguese. In: Proceedings of the ICASSP, Las Vegas, USA (2008)
4. Radová, V., Psutka, J.: UWB-S01 Corpus: A Czech Read-Speech Corpus. In: Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China (2000)
5. Psutka, J., Müller, L., Psutka, J.V.: Comparison of MFCC and PLP Parameterization in the Speaker Independent Continuous Speech Recognition Task. In: 7th European Conference on Speech Communication and Technology (EUROSPEECH 2001), Aalborg, Denmark (2001)
6. Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. J. Acoust. Soc. Am. 87 (1990)
7. Psutka, J.: Robust PLP-Based Parameterization for ASR Systems. In: SPECOM 2007 Proceedings. Moscow State Linguistic University, Moscow (2007)
8. Young, S., et al.: The HTK Book (for HTK Version 3.4). Cambridge (2006)
9. Vaněk, J., Psutka, J.V., Zelinka, J., Pražák, A., Psutka, J.: Discriminative Training of Gender-Dependent Acoustic Models. In: Matoušek, V., Mautner, P. (eds.) TSD 2009. LNCS (LNAI), vol. 5729, pp. 331–338. Springer, Heidelberg (2009)

10. Vaněk, J.: Discriminative Training of Acoustic Models. Ph.D. Thesis, West Bohemia University, Department of Cybernetics (2009) (in Czech)
11. Stolcke, A.: SRILM: An Extensible Language Modeling Toolkit. In: International Conference on Spoken Language Processing (ICSLP 2002), Denver, USA (2002)
12. Pražák, A., Ircing, P., Švec, J., et al.: Efficient Combination of N-gram Language Models and Recognition Grammars in Real-Time LVCSR Decoder. In: 9th International Conference on Signal Processing, Beijing, China, pp. 587–591 (2008)
13. Pražák, A., Müller, L., Šmídl, L.: Real-Time Decoder for LVCSR System. In: 8th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, FL, USA (2004)