HTK vs. SPHINX for Speech Recognition
Juraj Kačur
Department of Telecommunications, FEI STU, Ilkovičová 3, Bratislava, Slovakia
Email: kacur@ktl.elf.stuba.sk

Abstract
The submitted article gives a general user overview of the two most widely used HMM-based systems for automatic speech recognition, HTK and SPHINX. Apart from a description of the basic functionality provided by these systems, the main differences between them are depicted as well. Both the advantages and the main disadvantages of each over the other are discussed. Finally, extension platforms aimed at real-time applications, like ATK or the SPHINX decoders 3 and 4, are mentioned.

1. Introduction
The area of automatic speech recognition has been intensively studied for several decades. However, the major advances have been made since the introduction of the statistical modeling of speech using HMM [1]. The HMM approach operates on a probabilistic basis, unlike the DTW method, which relies on an acoustic distance measure. As the true acoustic distance measure is still unknown, the DTW approach does not work well in the task of speaker-independent recognition. Furthermore, this method is very ineffective for connected-word or continuous speech recognition [1]. The introduction of statistical modeling of speech using HMM suppressed most of the abovementioned drawbacks; however, new challenges emerged, like the need for huge sets of training samples and robust estimation methods. Unfortunately, there is no analytical solution for either the ML or the MAP estimation criterion [2]. The most commonly used algorithm based on the ML criterion is Baum-Welch. It is an iterative process which unfortunately finds only a local maximum and, if applied repeatedly to limited data, can cause overtraining. However, its main advantages are: it secures an improvement in each iteration, the probability consistency of all features is maintained, the structure of the models is preserved during the iteration process, etc.
As there is no analytical solution and only a local maximum can be found, it is vital to properly initialize all models, preprocess the data, and add to and change the structure of the models during the course of training. Thus many separate actions are needed during the training phase. A mature system should also provide: adaptation utilities, a wide spectrum of tools for signal preprocessing, language modeling, grammar construction, dictionary management, simple work with
description files, editing facilities for HMM models, context-independent and context-dependent models with mapping and parameter-sharing (tied states) options, and many more. Now it is clear that to build and train robust models a complex system has to be used. So far there are two widely spread systems that enable many of the mentioned facilities, and these are HTK and SPHINX.

2. HTK system
HTK is the most advanced and widely used system for modeling non-stationary data using HMM models. It is free for educational or academic purposes and can be downloaded after a registration. The system was specially designed to cope with the task of automatic speech recognition; however, it can be applied to other areas as well. Its current version is 3.3, but there is already an alpha version of 3.4. The main advantages of HTK are: it is a complex system that covers all development phases of a recognition system, it is regularly updated to catch up with the latest advances in the field of recognition, and it is well documented both theoretically and practically. In the remaining part of this section the main features of the HTK system are listed. In the signal processing phase HTK provides these kinds of features: filter bank (FBANK), LPC, LPC reflection coefficients (LPREFC), LPC cepstra (LPCEPSTRA), IREFC, MFCC, MELSPEC and PLP. Beyond these basic kinds, additional features and options are available, like: first, second and third order differences, energy normalization, cepstral mean normalization, cepstral weighting, zero mean subtraction, pre-emphasis, vocal tract length normalization, etc. It supports many input formats ranging from raw data to formats like WAV, TIMIT, NIST, OGI, AIFF, etc. All these facilities are available through the HCopy command. In the data (transcription and lexicon) preparation process HTK provides two flexible and quite useful tools, HDMan (dictionary processing) and HLEd (transcription file management). A dictionary can contain both speech models as well as models of the background.
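The HCopy step above is driven by a plain-text configuration file. Below is a minimal sketch, assuming a 12-MFCC front end with c0, deltas, accelerations and cepstral mean normalization (TARGETKIND MFCC_0_D_A_Z); the filenames and parameter values are illustrative, not taken from the article:

```shell
# Write an illustrative HCopy configuration (values are assumptions,
# typical of common recipes, not prescribed by the article).
cat > hcopy.conf <<'EOF'
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0_D_A_Z
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26
NUMCEPS      = 12
EOF

# codetrain.scp pairs each source wave with its target feature file, e.g.
#   data/s0001.wav  mfc/s0001.mfc
# Run the conversion only if HTK is actually installed:
if command -v HCopy >/dev/null 2>&1; then
  HCopy -C hcopy.conf -S codetrain.scp
fi
```

Note that times in HTK configuration files are given in 100 ns units, so TARGETRATE 100000.0 is a 10 ms frame shift and WINDOWSIZE 250000.0 a 25 ms analysis window.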
Each phrase can be expressed as a sequence of context-dependent or context-independent phonemes, optionally extended by non-verbal models. Each word can have multiple pronunciation alternatives; the first one listed is the default. HDMan enables the use and merging of several different dictionaries. HTK supports two kinds of description files: LAB and MLF. A LAB file describes a single speech file, whereas an MLF file concentrates multiple LAB files in a single file. The basic (lexical) description of recorded speech files can be extended with time labels at various levels (phrase, word or phoneme). In the initialization stage HTK offers two possibilities, depending on the existence of time labels. If there is time information for the models, an effective Viterbi training can be applied to bootstrap the data, which is implemented in the HInit tool. If such data do not exist (which is the most usual case), a flat start method must be applied, which is provided by HCompV; it may also calculate the variance floor. The training phase implements the Baum-Welch algorithm for multiple recordings. Again there are two versions, depending on the existence of time labels. If there are,
single-model training can be used, which is provided by the HRest tool; otherwise embedded training must be executed by invoking the HERest tool. Both tools are applicable to any kind of model; however, the single so-called tee model cannot be treated by the HInit and HRest tools. In the decoding process the HVite tool is used, which calculates the best path and its probability across concatenated models (no full probability is calculated). It can return multiple hypotheses where more tokens per state are allowed. It has several uses, like: recognition; time alignment of models (speech units), which can be used in speech synthesis [8]; selection among alternative pronunciations if there are any in the dictionary; and it supports some kind of real-time speech recognition where the input is direct audio. Even if this is quite a useful option, it is usually not eligible for real dialog or continuous speech recognition systems. Furthermore, HTK provides evaluation tools represented by HResults. It can perform several statistical and evaluation methods. Some of them are: SER (sentence error rate), WER (word error rate), confusion matrix, etc. It is also possible to mark synonyms and to neglect background models. In addition HTK offers a huge amount of tools for various auxiliary tasks like: HMM model management, grammar construction, language model construction, and model adaptation. Very important is the model editing tool HHEd, which performs basic and automatic editing of models like: splitting, merging, adding, etc. Beyond those it provides powerful functions for tying models (usually context-dependent ones), where it is possible to tie any feature. It provides two methods to do so. One is data-driven clustering, where the possible groups of similar phonemes are manually selected, but the final grouping is done based on the training data. The second one is called decision tree clustering, where questions are built for the left and right context of each phoneme.
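Such context questions are supplied to HHEd in an edit script. A hedged sketch of what a tree-clustering script can look like follows; the question sets, the 350.0 threshold and all file names are invented for illustration, not a recipe from the article:

```shell
# Write a toy HHEd tree-clustering edit script.
cat > tree.hed <<'EOF'
RO 100.0 stats
TR 0
QS "L_Nasal"  { m-*,n-* }
QS "R_Nasal"  { *+m,*+n }
QS "L_Vowel"  { a-*,e-*,i-*,o-*,u-* }
QS "R_Vowel"  { *+a,*+e,*+i,*+o,*+u }
TB 350.0 "ST_a_s2" {("a","*-a+*").state[2]}
TB 350.0 "ST_a_s3" {("a","*-a+*").state[3]}
TB 350.0 "ST_a_s4" {("a","*-a+*").state[4]}
AU "fulllist"
CO "tiedlist"
ST "trees"
EOF

# Apply it only if HTK is installed (model directories are illustrative):
if command -v HHEd >/dev/null 2>&1; then
  HHEd -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones
fi
```

Here RO loads the state occupation statistics, each TB command grows one tree for the chosen state of a phoneme, AU synthesizes models for unseen triphones from the saved trees, CO compacts the model set into a tied list, and ST stores the trees for later reuse.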
These questions are used to split the group of models for a central phoneme, and the split that causes the highest increase of modeling probability is chosen. The process goes on until some limits are met. The main advantage of this process is the construction of decision trees that can be further used to synthesize unseen triphones. Quite an important part is devoted to language models. HTK supports finite state grammars represented by the BNF or SLF formats. However, it accepts statistical language models as well, currently bigrams and trigrams. It provides extensive statistical language tools for the construction of statistical language models, and for their maintenance, smoothing and updating. Supported word language model formats are: ARPA MIT, binary language format, and modified ARPA MIT. For a detailed description please see [3]. Finally, in the adaptation task HTK supports the maximum likelihood linear regression method and MAP-based adaptation; both are invoked by the HEAdapt tool. At the end of this short description it should be emphasized that HTK allows different model structures to coexist together (left-right, ergodic, with different numbers of states, minimum 3). Models have two non-emitting states at their beginnings and ends, which enables one to
construct the so-called tee model that is used for modeling short pauses between words. Furthermore, HTK supports discrete HMM models and performs the vector quantization process by HQuant. As it is possible to tie any model features, it is implicitly possible to construct and train semi-continuous HMM models as well.

3. SPHINX
The SPHINX system has been developed at Carnegie Mellon University. Currently there are SPHINX 2, 3, 3.5 and 4 decoder versions and SphinxTrain (used for training purposes). Unfortunately, the documentation of these systems is relatively poor compared with the HTK system, so the features mentioned next are only those listed in the manuals or those that were already tested, which may not be a complete set. SphinxTrain (version 3) can be used to train continuous or semi-continuous models for the SPHINX decoder versions 3 or 4 (a conversion is needed for version 2). SphinxTrain supports MFCC coefficients with delta or delta-delta features. The transcription file is a plain text file that contains words from the dictionaries (neither multilevel description nor time labels are supported). There are two dictionaries. The main one lists words and translates them into sequences of phonemes (alternative pronunciations are allowed, but in the training process these are ignored; the first listing is accepted). There is an additional so-called filler dictionary where non-verbal models are listed; these will not be included as a context for triphones. Silence models at the beginning and end of an utterance (<s>, </s>) and the general background model SIL are obligatory. The main drawback is the prescribed structure of the HMM models. These are defined in the model definition file. Only one structure is allowed for all models (verbal and non-verbal as well). There can be only 3- or 5-state models, either strictly left-right or with one-state skips. At the end of each model there is only one non-emitting state, thus no tee models are supported. The components of the models are stored separately (variances, means, weights, transition matrices).
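As an illustration of the two dictionaries just described, here is an assumed toy example in the SphinxTrain plain-text format; the words and phone sets are invented, not taken from the article:

```shell
# Main dictionary: words mapped to phone sequences. Alternative
# pronunciations are marked (2), (3), ... but, as noted above, only
# the first variant is used in training.
cat > demo.dic <<'EOF'
HELLO    HH AH L OW
HELLO(2) HH EH L OW
WORLD    W ER L D
EOF

# Filler dictionary: non-verbal models, which are not used as triphone
# context. The <s>, </s> and SIL entries are obligatory.
cat > demo.filler <<'EOF'
<s>   SIL
</s>  SIL
<sil> SIL
EOF
```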
Only embedded training (bw) and flat start initialization (mk_flat, init_gau, init_mixw) are supported (no time information). The initialization and training cycle is not invoked by a single command; rather, several steps must be taken (bw, norm). Multi-mixture Gaussian models are supported, and the process of increasing the number of mixtures is invoked by inc_comp (doubling each time). SphinxTrain performs triphone tying by constructing decision trees; however, no phone classification file is needed. The questions are formed automatically, but this particular mechanism is somewhat obscure. Instead of setting a stopping condition for further splitting, the number of tied states is set manually by the designer. Furthermore, only states can be tied (not means, variances, etc.). Apart from SphinxTrain there is the CMU statistical language modeling toolkit, which can be used to construct word counts, bigram and trigram counts, various backoff bigram and trigram language models, perplexities, out-of-vocabulary ratios, etc. SphinxTrain can further perform semi-continuous model training with vector quantization.
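The multi-step bw/norm cycle might be sketched as follows. This only writes an illustrative driver script; the exact command-line flags of bw, norm and inc_comp are assumptions to be checked against the SphinxTrain manuals, and in practice the supplied perl scripts drive these binaries:

```shell
# Write an illustrative (not executed) SphinxTrain re-estimation script.
cat > train_cycle.sh <<'EOF'
#!/bin/sh
# One Baum-Welch pass: bw accumulates statistics over the whole corpus,
# norm then re-estimates the separately stored parameter files
# (means, variances, mixture weights, transition matrices).
for pass in 1 2 3; do
  bw   -moddeffn ci.mdef -ts2cbfn .cont. \
       -dictfn demo.dic -fdictfn demo.filler \
       -ctlfn train.fileids -lsnfn train.transcription \
       -cepdir feat -accumdir bwaccum
  norm -accumdir bwaccum \
       -meanfn means -varfn variances \
       -mixwfn mixture_weights -tmatfn transition_matrices
done
# inc_comp then doubles the Gaussians per state (e.g. 1 -> 2 -> 4),
# after which the bw/norm passes above are repeated.
EOF
```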
4. ATK and SPHINX decoder versions 3 and 4
For real-life as well as real-time HTK-based applications, the ATK system has recently been introduced [4]. It is based on objects (C++) like: source, coding, Viterbi decoding, buffers, etc. This allows many possibilities and flexible use for the final designer. It defines several states and modes of operation that can be chosen to fit the current application. It accepts input from a file or a microphone and, besides the probability, it supports the calculation of a confidence interval, a background model, and adaptation of the cepstral mean; it has an improved version of the VAD algorithm, etc. It can use either the BNF or SLF grammar formats or statistical bigram models. In the case of SPHINX there are currently the decoders 2, 3, 3.5 and 4. All decoders except version 4 supported only statistical language models, which is rather troublesome for dialog-like applications. Version 2 uses only semi-continuous models, which is nowadays not very common. Instead of a tee model for short pauses or grammatically modeled fillers, the SPHINX decoders use a probability of silence insertion. They accept input either from files or from a microphone (outputs can go to files as well). The latest SPHINX 4 [7] is written in JAVA and eliminated some of the main drawbacks mentioned above. The main improvements are: support for finite grammars in the Java Speech API grammar format, it does not impose the restriction of using the same structure for all models, etc.

5. Conclusion
The HTK system is more complex, very flexible, provides up-to-date functionality, is regularly updated and is well documented. The introduction of ATK has greatly enabled its real-time application. However, the SPHINX system, despite its limitations in model structure, its limited functionality in the training process and its not well documented features, is very competitive, especially as it is free for use (even commercially). With the platform-independent, well-structured decoder SPHINX 4 for real applications, its importance has greatly increased.
Acknowledgement
This article has been supported by the project VEGA 1/3110/06.

References:
[1] L. Rabiner, B.-H. Juang: Fundamentals of Speech Recognition, Prentice Hall PTR, 1993
[2] X. D. Huang, Y. Ariki, M. A. Jack: Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990
[3] S. Young, G. Evermann, T. Hain: The HTK Book V.3.2.1, Cambridge University Engineering Department, Dec. 2002
[4] S. Young: ATK Version 1.3, Cambridge University Engineering Department, January 2004, http://mi.eng.cam.ac.uk/~sjy/software.htm
[5] http://www.speech.cs.cmu.edu/sphinx/tutorial.html
[6] http://www.speech.cs.cmu.edu/sphinxman/fr4.html
[7] http://cmusphinx.sourceforge.net/sphinx4
[8] Turi Nagy, M.; Cepko, J.; Rozinaj, G.: Concatenation of Speech Units in TTS Synthesis with Utilization of SN Model, Proceedings of the 5th EURASIP Conference, EC-SIP-M 2005, Smolenice, Slovak Republic, 29 June - 02 July 2005, pp. 376-381, ISBN 80-227-2257-X