HTK vs. SPHINX for Speech Recognition


Juraj Kačur
Department of Telecommunication, FEI STU, Ilkovičová 3, Bratislava, Slovakia
Email: kacur@ktl.elf.stuba.sk

Abstract

The submitted article gives a general user-level overview of the two most widely used HMM-based systems for automatic speech recognition, HTK and SPHINX. Apart from describing the basic functionality provided by these systems, it depicts the main differences between them, and the advantages and disadvantages of each over the other are discussed. Finally, extension platforms aimed at real-time application, such as ATK and the SPHINX 3 and 4 decoders, are mentioned.

1. Introduction

The area of automatic speech recognition has been studied intensively for several decades. However, the major advances have been made since the introduction of the statistical modeling of speech using HMMs [1]. The HMM approach operates on a probabilistic basis, unlike the DTW method, which relies on an acoustic distance measure. As the true acoustic distance measure is still unknown, DTW does not work well in the task of speaker-independent recognition; furthermore, it is very ineffective for connected-word or continuous speech recognition [1]. The introduction of statistical modeling of speech using HMMs suppressed most of the above-mentioned drawbacks; however, new challenges emerged, such as the need for huge sets of training samples and for robust estimation methods. Unfortunately, there is no analytical solution for either the ML or the MAP estimation criterion [2]. The most commonly used algorithm based on the ML criterion is Baum-Welch. It is an iterative process which unfortunately finds only a local maximum and, if applied repeatedly to limited data, can cause overtraining. Its main advantages are that it guarantees an improvement of the likelihood in each iteration, maintains the probability consistency of all features, preserves the structure of the models during the iteration process, etc.
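To make the statistical view concrete: the quantity that ML training with Baum-Welch iteratively increases is the likelihood P(O|λ) of the observations given the model, which is computed by the forward algorithm. A minimal stdlib-only Python sketch for a toy discrete HMM follows; all model numbers are illustrative and are not taken from either toolkit.

```python
# Forward algorithm for a discrete-output HMM: computes P(O | model).
# Toy illustrative model, not tied to HTK or SPHINX.

def forward(pi, A, B, obs):
    """pi: initial state probs, A[i][j]: transition probs, B[i][k]: emission probs."""
    n = len(pi)
    # Initialization: probability of starting in state i and emitting obs[0].
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: sum over all predecessor states at each time step.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: total probability over all final states.
    return sum(alpha)

# Two hidden states, binary observation symbols.
pi = [0.6, 0.4]
A  = [[0.7, 0.3],
      [0.4, 0.6]]
B  = [[0.9, 0.1],
      [0.2, 0.8]]
obs = [0, 1, 0]

print(forward(pi, A, B, obs))
```

Each Baum-Welch iteration re-estimates pi, A and B so that this value never decreases, which is exactly the guarantee mentioned above.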
As there is no analytical solution and only a local maximum can be found, it is vital to properly initialize all models, preprocess the data, and add to or change the structure of the models during the course of training; thus many separate actions are needed during the training phase. A mature system should also provide adaptation utilities, a wide spectrum of tools for signal preprocessing, language modeling, grammar construction, dictionary management, and simple handling of

description files, editing facilities for HMM models, context-independent and context-dependent models with mapping and parameter-sharing (tied states) options, and many more. It is clear that to build and train robust models, a complex system has to be used. So far there are two widespread systems that provide many of the mentioned facilities: HTK and SPHINX.

2. HTK system

HTK is the most advanced and widely used system for modeling non-stationary data using HMM models. It is free for educational or academic purposes and can be downloaded after registration. The system was specially designed to cope with the task of automatic speech recognition; however, it can be applied to other areas as well. Its current version is 3.3, but there is already an alpha version of 3.4. The main advantages of HTK are: it is a complex system that covers all development phases of a recognition system, it is regularly updated to catch up with the latest advances in the field of recognition, and it is well documented both theoretically and practically. The remainder of this section lists the main features of the HTK system.

In the signal-processing phase HTK provides these kinds of features: filter bank (FBANK, MELSPEC), LPC, reflection coefficients (LPREFC, IREFC), LPC cepstra (LPCEPSTRA), MFCC and PLP. Beyond these basic kinds, additional features and options are available, such as first-, second- and third-order differences, energy normalization, cepstral mean normalization, cepstral weighting, zero-mean subtraction, pre-emphasis, vocal tract length normalization, etc. HTK supports many input formats, ranging from raw data to formats like WAV, TIMIT, NIST, OGI and AIFF. All these facilities are available through the HCopy command.

In the data (transcription and lexicon) preparation process HTK provides two flexible and quite useful tools: HDMan (dictionary processing) and HLEd (transcription file management). The dictionary can contain both speech models and models of the background.
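A hypothetical dictionary fragment of the kind HDMan manipulates might look as follows (the words and phone symbols are illustrative; the first pronunciation listed is the default, and output symbols appear in square brackets):

```
HELLO       hh ax l ow
HELLO       hh eh l ow
SENT-START  []  sil
SENT-END    []  sil
```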
Each phrase can be expressed as a sequence of context-dependent or context-independent phonemes, optionally extended by non-verbal models. Each word can have multiple pronunciation alternatives; the first one listed is the default. HDMan makes it possible to use and merge several different dictionaries. HTK supports two kinds of description files: LAB and MLF. A LAB file describes a single speech file, whereas an MLF file concentrates multiple LAB files into a single file. The basic (lexical) description of recorded speech files can be extended with time labels at various levels (phrase, word or phoneme).

In the initialization stage HTK offers two possibilities, depending on the existence of time labels. If time information is available, an effective Viterbi training can be applied to bootstrap the models, which is implemented in the HInit tool. If such data do not exist (the most usual case), a flat-start method must be applied, which is provided by HCompV; it can also calculate a variance floor. The training phase implements the Baum-Welch algorithm for multiple recordings. Again, there are two versions depending on the existence of time labels. If they exist,

single-model training can be used, provided by the HRest tool; otherwise embedded training must be executed by invoking the HERest tool. Both tools are applicable to any kind of model; however, a single so-called tee model cannot be treated by the HInit and HRest tools.

In the decoding process the HVite tool is used, which calculates the best path and its probability across concatenated models (no full probability is calculated). It can return multiple hypotheses when more than one token per state is allowed. It has several uses: recognition, time alignment of models (speech units) that can be used in speech synthesis [8], and selection among alternative pronunciations if there are any in the dictionary; it also supports a limited kind of real-time speech recognition where the input is direct audio. Even though this is quite a useful option, it is usually not adequate for real dialog or continuous speech recognition systems.

Furthermore, HTK provides evaluation tools represented by HResults, which can perform several statistical evaluation methods, among them SER (sentence error rate), WER (word error rate) and confusion matrices. It is also possible to mark synonyms and to disregard background models. In addition, HTK offers a huge number of tools for various auxiliary tasks, such as HMM model management, grammar construction, language model construction and model adaptation. Very important is the model-editing tool HHEd, which performs basic and automatic editing of models (splitting, merging, adding, etc.). Beyond that, it provides powerful functions for tying models (usually context-dependent ones), where it is possible to tie any feature. It provides two methods to do so: one is data-driven clustering, where the possible groups of similar phonemes are selected manually but the final grouping is done based on the training data; the second is called decision-tree clustering, where questions are built about the left and right context of each phoneme.
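The criterion behind such tree-based clustering can be sketched under a common simplification: model the data reaching a node with a single Gaussian and pick the question whose yes/no split most increases the total log-likelihood. The Python below is an illustrative one-dimensional sketch under that assumption; the question names and data are hypothetical, not HTK's actual question sets.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under their own ML Gaussian fit.
    For the ML fit this reduces to -n/2 * (log(2*pi*var) + 1)."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-8)  # floor avoids log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def split_gain(xs, mask):
    """Log-likelihood gain from splitting xs into yes/no groups by mask."""
    yes = [x for x, m in zip(xs, mask) if m]
    no = [x for x, m in zip(xs, mask) if not m]
    if not yes or not no:
        return 0.0
    return gauss_loglik(yes) + gauss_loglik(no) - gauss_loglik(xs)

# Toy occupation statistics for states of one central phoneme.
data = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9]
# Hypothetical context questions, e.g. "is the left context a nasal?"
questions = {
    "L_nasal": [True, True, True, False, False, False],
    "R_vowel": [True, False, True, False, True, False],
}
best = max(questions, key=lambda q: split_gain(data, questions[q]))
print(best)  # the question that separates the two clusters wins
```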
These questions are used to split the group of models for a central phoneme, and the split that causes the highest increase in modeling probability is chosen. The process continues until stopping limits are met. The main advantage of this process is the construction of decision trees that can further be used to synthesize unseen triphones.

Quite an important part is devoted to language models. HTK supports finite-state grammars represented in BNF or SLF formats. However, it accepts statistical language models as well, currently bigrams and trigrams. It provides extensive statistical language tools for the construction of statistical language models and for their maintenance, smoothing and updating. Supported word language model formats are ARPA-MIT, binary language format and modified ARPA-MIT; for a detailed description please see [3]. Finally, for the adaptation task HTK supports the maximum likelihood linear regression (MLLR) method and MAP-based adaptation, both invoked via the HEAdapt tool.

At the end of this short description it should be emphasized that HTK allows different model structures to coexist (left-right, ergodic, with different numbers of states, minimum 3). Models have two non-emitting states, at their beginnings and ends, which makes it possible to

construct the so-called tee model, which is used to model short pauses between words. Furthermore, HTK supports discrete HMM models and performs the vector quantization process with HQuant. As it is possible to tie any model feature, it is implicitly possible to construct and train semi-continuous HMM models as well.

3. SPHINX

The SPHINX system has been developed at Carnegie Mellon University. Currently there are SPHINX 2, 3, 3.5 and 4 decoder versions and SphinxTrain (used for training purposes). Unfortunately, the documentation of these systems is relatively poor compared with the HTK system, so the features mentioned next are only those listed in the manuals or those that have already been tested, which may not be a complete set.

SphinxTrain (version 3) can be used to train continuous or semi-continuous models for SPHINX decoder versions 3 or 4 (a conversion is needed for version 2). SphinxTrain supports MFCC coefficients with delta or delta-delta features. The transcription file is a plain text file that contains words from the dictionaries (neither multilevel description nor time labels are supported). There are two dictionaries: the main dictionary, where words are listed and translated into sequences of phonemes (alternative pronunciations are allowed, but in the training process they are ignored and the first listing is accepted), and an additional so-called filler dictionary, where non-verbal models are listed; these will not be included as context for triphones. Silence models at the beginning and end of an utterance (<s>, </s>) and a general background model SIL are obligatory.

The main drawback is the prescribed structure of the HMM models, which is defined in the model definition file. Only one structure is allowed for all models (verbal and non-verbal alike); there can be only 3- or 5-state models, either strictly left-right or with one-state skips. At the end of each model there is only one non-emitting state, thus no tee models are supported. The components of the models are stored separately (variances, means, weights, transition matrices).
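The two dictionaries described above might look like the following hypothetical fragments (words and phone symbols are illustrative). A main dictionary maps words to phoneme sequences, with "(2)" marking an alternative pronunciation:

```
HELLO      HH AH L OW
HELLO(2)   HH EH L OW
WORLD      W ER L D
```

A filler dictionary maps non-verbal events to filler models:

```
<s>        SIL
</s>       SIL
<sil>      SIL
```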
Only embedded training (bw) and flat-start initialization (mk_flat, init_gau, init_mixw) are supported (no time information). The initialization and training cycle is not invoked by a single command; rather, several steps must be taken (bw, norm). Multi-mixture Gaussian models are supported, and the process of increasing the number of mixtures is invoked by inc_comp (doubling each time). SphinxTrain performs triphone tying by constructing decision trees; however, no phone classification file is needed. The questions are formed automatically, but this particular mechanism is somewhat obscure. Instead of setting a stopping condition for further splitting, the number of tied states is set manually by the designer. Furthermore, only states can be tied (not means, variances, etc.). Apart from SphinxTrain there is the CMU statistical language modeling toolkit, which can be used to construct word counts, bigram and trigram counts, various backoff bigram and trigram language models, perplexity figures, out-of-vocabulary ratios, etc. SphinxTrain can further perform semi-continuous model training with vector quantization.
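Semi-continuous models share one quantized codebook of densities across all states, so their training starts with vector quantization of the feature space. The codebook construction is essentially k-means (Lloyd's algorithm), sketched below in stdlib Python on toy one-dimensional data; real toolkits quantize whole feature vectors, and the data here are illustrative only.

```python
# Minimal k-means (Lloyd's algorithm) as used for VQ codebook training.

def kmeans(xs, k, iters=20):
    # Spread the initial codewords evenly over the data range.
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Assignment step: each sample goes to its nearest codeword.
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[j].append(x)
        # Update step: move each codeword to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
codebook = sorted(kmeans(data, 2))
print(codebook)  # approximately [1.0, 5.0]
```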

4. ATK and SPHINX decoder versions 3 and 4

For real-life, real-time HTK-based applications, the ATK system has recently been introduced [4]. It is based on C++ objects such as a source, a coder, a Viterbi decoder, buffers, etc., which allows many possibilities and flexible use for the final designer. It defines several states and modes of operation that can be chosen to fit the current application. It accepts input from a file or a microphone and, besides the probability, it supports the calculation of a confidence measure, a background model and adaptation of the cepstral mean, and it has an improved version of the VAD algorithm. It can use either the BNF or SLF grammar format or statistical bigram models.

In the case of SPHINX there are currently four decoders: 2, 3, 3.5 and 4. All decoders except version 4 supported only statistical language models, which is rather troublesome for dialog-like applications. Version 2 uses only semi-continuous models, which is nowadays not very common. Instead of a tee model for short pauses or grammatically modeled fillers, the SPHINX decoders use a silence-insertion probability. They accept input either from files or from microphones (and likewise for outputs). The latest SPHINX 4 [7] is written in Java and eliminates some of the main drawbacks mentioned earlier. Its main theoretical improvements are support for a finite grammar called the Java Speech API grammar, and the fact that it does not impose the restriction of using the same structure for all models.

5. Conclusion

The HTK system is more complex and very flexible, provides up-to-date functionality, and is regularly updated and well documented. The introduction of ATK has greatly facilitated its real-time application. However, the SPHINX system, despite its limitations in model structure, its limited functionality in the training process and its less well documented features, is very competitive, especially as it is free to use (even commercially). With the platform-independent, well-structured SPHINX 4 decoder for real applications, its importance has greatly increased.
Acknowledgement

This article has been supported by the project VEGA 1/3110/06.

References

[1] L. Rabiner, B.-H. Juang: Fundamentals of Speech Recognition, Prentice Hall PTR, 1993.
[2] X. D. Huang, Y. Ariki, M. A. Jack: Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990.
[3] S. Young, G. Evermann, T. Hain et al.: The HTK Book V3.2.1, Cambridge University Engineering Department, Dec. 2002.
[4] S. Young: ATK Version 1.3, Cambridge University Engineering Department, January 2004, http://mi.eng.cam.ac.uk/~sjy/software.htm
[5] http://www.speech.cs.cmu.edu/sphinx/tutorial.html
[6] http://www.speech.cs.cmu.edu/sphinxman/fr4.html
[7] http://cmusphinx.sourceforge.net/sphinx4
[8] M. Turi Nagy, J. Cepko, G. Rozinaj: Concatenation of Speech Units in TTS Synthesis with Utilization of SN Model, Proceedings of the 5th EURASIP Conference, EC-SIP-M 2005, Smolenice, Slovak Republic, 29 June - 2 July 2005, pp. 376-381, ISBN 80-227-2257-X.