BENEFIT OF MUMBLE MODEL TO THE CZECH TELEPHONE DIALOGUE SYSTEM


Luděk Müller, Luboš Šmídl, Filip Jurčíček, and Josef V. Psutka
University of West Bohemia, Department of Cybernetics, Univerzitní 22, 306 14 Pilsen, Czech Republic
muller@kky.zcu.cz, smidl5@kky.zcu.cz, filip@kky.zcu.cz, psutka_j@kky.zcu.cz

Abstract

This paper discusses the usage of a mumble model in a Czech telephone dialogue system designed and constructed at the Department of Cybernetics, University of West Bohemia, and describes the benefits of the mumble model to speech recognition, namely to a rejection method. Firstly, an overview of the Czech telephone dialogue system and its recognition engine is given. The recognition is based on a statistical approach. Triphones are used and modeled by three-state left-to-right HMMs with an output probability density function expressed as a multivariate Gaussian mixture. Stochastic regular grammars are used as a language model to reduce the task perplexity. Secondly, the mumble model is introduced as a recursive network of Czech phone HMMs connected in parallel, and an implementation of a rejection method and a keyword spotting method, both based on the mumble model, is explained. Finally, the experimental results, yielding a 19.4% equal error rate (EER) for the rejection method and a 16.7% EER for the keyword spotting method, are discussed.

1. Introduction

Current speech recognition systems usually work with a vocabulary of limited size. A finite state grammar is also often used as a language model, which efficiently restricts the number of acceptable utterances. Thus a problem arises when the incoming utterance does not respect the recognition grammar rules. The simplest example is when the speaker says an out-of-vocabulary word.
It is required that in this case the speech recognition engine does not select any sentence from the sentence set defined by the grammar; instead, it should inform the application that no sentence matches the input utterance and that the recognition result is rejected. In this paper a new rejection method based on a time-local distance between the mumble score and the word score is presented. In the next part of the article an implementation of a keyword spotting method using a mumble model and a finite state grammar is described. The mumble model is used here to capture and absorb the non-keyword parts of an utterance.

2. Dialogue System

Our dialogue system consists of three main parts: a speech engine, a dialogue manager, and a dialogue application. The dialogue application is a task-oriented module keeping knowledge of the lexicon, dialogue structure, etc., and the dialogue manager controls the communication between the user and the system. The speech engine currently contains only a speech recognition module; a speech synthesis module will be added to the engine in the near future. Figure 1 illustrates the dialogue system architecture.

Figure 1: Telephone dialogue system

3. Speech Recognition Engine

The core of the speech engine is implemented in C++ and designed to be platform independent. A platform-specific implementation layer was built for MS Windows NT/95/98/2000. Our goal was to design a fast recognition module without decreasing recognition accuracy. Several instances of the speech engine can operate on one PC in real time. Furthermore, each engine module can be implemented as a set of several tasks, each of them generally running as an individual process. Figure 2 shows how the tasks of the speech recognition engine cooperate.

3.1. Recognition Module

The recognition module incorporates a front end, an acoustic model, a language model (represented by a stochastic regular grammar), and a decoding block that searches for the best word sequence matching the incoming acoustic signal with respect to the grammar. As mentioned above, the recognition module is split into three tasks: the front end, the labeler, and the decoder.

The front end is responsible for converting a continuous acoustic speech signal into a sequence of feature vectors. The digitization of the input analog telephone signal and/or the generation of a synthesized speech signal is provided by a telephone interface board. The front end is currently equipped with a DIALOGIC D/21D board that supports two telephone lines. This enables us to run two speech recognition engines at the same time on one computer. The speech signal is first digitized (by the DIALOGIC board) at a sampling rate of 8 kHz. A 25 ms Hamming window shifted in 10 ms steps and a pre-emphasis factor of 0.97 are then used to calculate 13 mel-frequency cepstral coefficients (MFCCs) (including the c(0) coefficient) and their first-order and second-order derivatives. To make the telephone speech recognition more robust we use a RASTA-like band-pass filter [7] that suppresses slowly varying channel distortions.
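The framing stage described above (8 kHz sampling, 25 ms Hamming window, 10 ms shift, pre-emphasis 0.97) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is ours, and the subsequent mel filterbank and DCT steps of the MFCC computation are omitted.

```python
import numpy as np

def frame_signal(signal, fs=8000, win_ms=25, shift_ms=10, preemph=0.97):
    """Pre-emphasize, split into overlapping frames, apply a Hamming window."""
    # Pre-emphasis: s'(n) = s(n) - 0.97 * s(n - 1)
    s = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    win = int(fs * win_ms / 1000)      # 200 samples per 25 ms frame at 8 kHz
    shift = int(fs * shift_ms / 1000)  # 80-sample (10 ms) frame shift
    n_frames = 1 + max(0, (len(s) - win) // shift)
    frames = np.stack([s[i * shift : i * shift + win] for i in range(n_frames)])
    return frames * np.hamming(win)    # one windowed frame per row

# Example: one second of 8 kHz speech yields 98 frames of 200 samples each
frames = frame_signal(np.random.randn(8000))
```

With these parameters, consecutive frames overlap by 15 ms, so every speech sample (except near the signal ends) contributes to two or three feature vectors.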
The RASTA filter is applied to the first 13 MFCCs and can be described by the equation

    y(n) = G * Σ_{i=0}^{4} (i − 2) x(n + i) + μ y(n − 1),

where x(n) and y(n) are the input and output signals respectively, n = 0, 1, ..., T − 1, and T is the number of frames. The parameters μ and G are set to μ = 0.94 and G = 0.1.

The front end also contains a silence/speech detector. If an observation feature vector is marked as silence, then the labeler task and the decoder task wait for speech data.

Figure 2: Speech recognition engine

The labeler is responsible for computing a large number of log-likelihood scores (LLSs) of the observation feature vector. There are 2510 distinct tied states. Each state is represented in a 39-dimensional space (13 MFCCs + 13 first-order derivatives + 13 second-order derivatives). To implement a real-time recognizer it is important to reduce the number of LLS calculations, which leads to an approximate computation of the LLSs. We solve this problem by applying a new method which determines relatively exactly (in the original 39-dimensional space) the first 50 or 150 best (most probable) LLSs. The proposed method efficiently exploits the relevant statistical properties of the Gaussian mixture densities, combining them with an a priori hit technique and the k-NN method. This approach allows more than a 90% reduction of the computation cost without a substantial decrease in recognition accuracy.

The decoder is responsible for finding the best word sequence that matches the incoming acoustic signal. The decoder uses a cross-word context-dependent HMM state network. The whole network consists of one or more (generally, at run time) connected grammars. A considerable part of the network is generated before the decoder starts, but every part of the network can be generated on demand at run time. The decoder uses a Viterbi search technique with efficient beam pruning.

3.2. The Mumble Model

The mumble model is constructed as a set of HMM models connected in parallel. Each HMM is a three-state left-to-right model representing one context-independent phone. The structure of the mumble model is depicted in Figure 3. The probability of emitting an observation vector in a given state is evaluated as the maximal emission probability over all corresponding states of the context-dependent triphones. Thus neither additional HMM models nor additional training is required.
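As a concrete instance of the front-end filtering above, the RASTA-like recursion of Section 3.1 can be sketched as below, applied to one cepstral-coefficient trajectory at a time. Treating the out-of-range look-ahead samples x(n + i), n + i ≥ T, as zero is a boundary assumption of this sketch, not a statement from the paper.

```python
import numpy as np

def rasta_filter(x, G=0.1, mu=0.94):
    """y(n) = G * sum_{i=0}^{4} (i - 2) * x(n + i) + mu * y(n - 1).

    x: one cepstral trajectory of length T; look-ahead samples past the
    end are taken as zero (boundary assumption of this sketch).
    """
    T = len(x)
    xp = np.concatenate([x, np.zeros(4)])  # zero-pad the 4-frame look-ahead
    y = np.zeros(T)
    prev = 0.0
    for n in range(T):
        # FIR part: weights (i - 2) for i = 0..4, i.e. -2, -1, 0, 1, 2
        fir = sum((i - 2) * xp[n + i] for i in range(5))
        y[n] = G * fir + mu * prev  # IIR part: leaky integration of y
        prev = y[n]
    return y

# A constant input is cancelled by the FIR weights (-2 - 1 + 0 + 1 + 2 = 0)
# until the zero-padded tail is reached, illustrating how the filter
# suppresses a fixed additive offset in the cepstral (channel) domain.
y = rasta_filter(np.ones(20))
```

The zero-sum FIR weights are what make the filter a band-pass: a constant channel bias contributes nothing, while frame-to-frame changes pass through and are smoothed by the μ-weighted feedback term.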
The value of the backward loop probability BPr controls the length of the phone sequence recognized by the network in Figure 3: a higher value of BPr produces more insertions, while a smaller value induces more deletions in the resulting phone sequence.

Figure 3: Mumble model

3.3. The Rejection Method

At each time frame, the log-likelihood score of the mumble model is evaluated as the maximum log-likelihood score over all mumble model states. Similarly, the log-likelihood score of the recognition network is taken as the maximum log-likelihood score over all network model states. The difference between these two maximal values is then computed and saved into a buffer keeping the score differences of the last N time frames. Let M be a suitably chosen value, 0 < M < N. At each time of recognition the difference between the last buffer element B[t] and the buffer element B[t − M] is evaluated and compared to a predefined threshold. If the threshold is exceeded, the recognition result is rejected.

3.4. The Keyword Spotting Method

The decoder can also run in a keyword spotting mode. The recognition network then contains a set of keywords in a parallel connection and a mumble model, also connected to the keyword network in parallel. An example of a keyword recognition network with a mumble model is depicted in Figure 4.

Figure 4: Keyword recognition network

During recognition a Viterbi search finds the best path through the recognition network. For any non-keyword part of an utterance the mumble model should have a better acoustic match score than any keyword model; thus a mumble word will be assigned to the non-keyword parts of the utterance. In this way the mumble model catches the non-keyword parts. Finally, the mumble words are omitted from the resulting recognized word sequence and only the keywords remain in the output.

4. Experimental Results

The proposed methods have been tested using two Czech telephone databases [5]. A finite state grammar was used as the language model in all experiments.

The rejection method was tested on a telephone yellow pages database. The speech corpus comprised 357 speakers and 357 utterances (each utterance was spoken by a different speaker). We used a vocabulary of 577 words and a grammar that accepted 716 different two-word sentences (persons' names). In order to test the case when an out-of-vocabulary phrase is spoken, a test was also performed with a grammar which does not accept any utterance from the test speech corpus. Both grammars were tested with the test speech corpus, and false acceptance and false rejection error rates were obtained for different values of the rejection threshold. The results are shown in Figure 5. The word error rate is 5.4%. The intersection point between the false acceptance curve and the false rejection curve denotes the EER (equal error rate).

Figure 5: Rejection result (716 sentences); false rejection and false acceptance error rates plotted against the rejection threshold

This rather high value is caused by a relatively small beam width used for pruning during recognition. The EER is 19.4%. Furthermore, we performed tests with the same test corpus on a persons' names grammar with a larger vocabulary and found that the EER increases with the number of sentences accepted by the recognizer and with the vocabulary size. An EER of 25.1% was achieved for a vocabulary containing 2864 sentences.

The keyword spotting method was tested using the Czech telephone database from the economic domain. 50 words were chosen as keywords, and 97 utterances from different speakers were used as the test speech corpus. The average utterance length is 14 words, and all utterances together contain 450 different words. Results for different values of the backward loop probability are shown in Figure 6 (EER = 16.7%).

Figure 6: Keyword spotting result; key-phrase deletion and insertion error rates plotted against the backward loop transition cost (−log likelihood of BPr)

5. Conclusions

This paper describes a mumble model method incorporated into the Czech telephone dialogue system and discusses its benefits to speech recognition. Test results for both the rejection method for out-of-grammar utterances and the keyword spotting method are given. In our experiments with the rejection technique the EERs for a small vocabulary (716 sentences) and for a large vocabulary (2864 sentences) were 19.4% and 25.1%, respectively. The results for the keyword spotting method showed an EER of 16.7%.

6. Acknowledgments

This work was supported by the Ministry of Education of the Czech Republic, projects no. LN00B096 and MSM235200004.

7. References

[1] Lin, Q., Das, S., Lubensky, D., Picheny, M.: A New Confidence Measure Based on Rank Ordering Subphone Scores. In: ICSLP 1998, Sydney.
[2] Neti, C., Roukos, S., Eide, E.: Confidence Measure as a Guide for Stack Search in Speech Recognition. In: ICASSP 96, pp. 883-887, Germany.
[3] Lin, Q., Lubensky, D., Roukos, S.: Use of Recursive Mumble Models for Confidence Measuring. In: Eurospeech 99, pp. 53-56, Budapest.
[4] Young, S.J., Russell, N.H., Thornton, J.H.S.: Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems. Cambridge University Engineering Department, July 31, 1989.
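The per-frame rejection test of Section 3.3 can be sketched as a sliding buffer of score differences. The class name and the default values of N, M, and the threshold are illustrative only; the paper does not report the values it used.

```python
from collections import deque

class MumbleRejector:
    """Keep the last N per-frame score differences (mumble LLS minus
    recognition-network LLS) in a buffer B, and reject when the buffered
    difference grows by more than a threshold over the last M frames,
    i.e. when B[t] - B[t - M] exceeds the threshold."""

    def __init__(self, N=50, M=10, threshold=25.0):
        assert 0 < M < N
        self.M = M
        self.threshold = threshold
        self.buf = deque(maxlen=N)  # holds B[t - N + 1] .. B[t]

    def update(self, mumble_lls, network_lls):
        """Push one frame's score difference; return True to reject."""
        self.buf.append(mumble_lls - network_lls)
        if len(self.buf) <= self.M:
            return False  # not enough history yet
        return (self.buf[-1] - self.buf[-1 - self.M]) > self.threshold

# Example: the mumble-vs-network difference jumps by 5 within M=2 frames,
# exceeding the threshold of 1.0, so the hypothesis is rejected.
rej = MumbleRejector(N=5, M=2, threshold=1.0)
rej.update(0.0, 0.0)
rej.update(0.0, 0.0)
print(rej.update(5.0, 0.0))  # True
```

The intuition is that, for in-grammar speech, the recognition network tracks the mumble model closely, so B[t] stays roughly flat; for out-of-grammar speech the network score falls behind the mumble score and B[t] climbs quickly over the M-frame window.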
[5] Radová, V., Psutka, J., Šmídl, L., Vopálka, P., Jurčíček, F.: Czech Speech Corpus for Development of Speech Recognition Systems. In: Proceedings of the Workshop on Developing Language Resources for Minority Languages, Athens, 2000.
[6] Müller, L., Psutka, J., Šmídl, L.: Design of Speech Recognition Engine. In: Text, Speech and Dialogue 2000 (TSD 2000), 3rd International Workshop, Brno, Czech Republic, 2000.
[7] Han, J., Han, M., Park, G.-B., Park, J., Gao, W.: Relative Mel-Frequency Cepstral Coefficients Compensation for Robust Telephone Speech Recognition. In: Eurospeech 97, pp. 1531-1534.