SCOPE CARE II Innovative


RESEARCH & TECHNOLOGY FRANCE

WP1 R1a ASR Software Evaluation

Thibaut EHRETTE & Olivier GRISVARD
THALES R&T France

European Organisation for the Safety of Air Navigation () June 2004

This document is published by in the interests of the exchange of information. It may be copied in whole or in modified without prior written permission from. makes no warranty, either implied or express, for the information contained in this document, neither does it

Versions

Number, Date, Writing, Review, Correction
1.0 (Draft); T. EHRETTE & O. GRISVARD (THALES R&T France); T. EHRETTE & O. GRISVARD (THALES R&T France); M. BROCHARD (); O. GRISVARD

Circulation

Addressee, Version, Date
Marc BROCHARD (); Célestin SEDOGBO (THALES R&T France); Marc BROCHARD ()

Validation

Name, Date, Signing
Written by: Agreed by: T. EHRETTE & O. GRISVARD

Table of Contents

1. Introduction
2. ASR software presentation
   2.1. Nuance 8.0
   2.2. Sphinx 3
3. ASR software theoretical comparison
   3.1. Criteria presentation
   3.2. Comparison results
4. ASR software practical evaluation
   4.1. Nuance 8.0
   4.2. Sphinx 3
   4.3. Summary
5. Conclusion
6. References

1. Introduction

This report is the first report (R1a) of the first work package (WP1) of the CARE II Innovative project SCOPE (Safety of COntroller-Pilot dialogue). The work presented here was carried out by THALES Research & Technology (THALES R&T) with input from IntuiLab and IRIT (Institut de Recherche en Informatique de Toulouse).

In this report, we evaluate two Automatic Speech Recognition (ASR) tools in the context of Air Traffic Control (ATC) communications. The purpose of this evaluation is to determine which software is best suited to tracking controller-pilot dialogue. We therefore also present our conclusions regarding the selection of the software.

2. ASR software presentation

As a preamble to the presentation of the ASR software, it must be noted that the two recognition tools retained for evaluation were chosen on the basis of the background knowledge and experiments of THALES R&T, IntuiLab and IRIT. The two tools come from two application domains that have distinct requirements but offer the best ASR software available today. The first is the domain of interactive vocal servers, historically the first domain in which ASR was used as a commercial solution for the general public. Software from this domain is therefore generally of very good quality and very robust for a specific application. The second is the domain of broadcast news transcription, which is the principal test bed for ASR software. ASR solutions from this domain currently offer the best performance on common speech. The selected recognition tools are among the best candidates in each of the two domains. Moreover, the preliminary selection took into account the requirements of the targeted ATC application. These requirements are described below.
The two ASR tools retained are Nuance 8.0 by Nuance Communications and Sphinx 3 by Carnegie Mellon University (for other software, see the WP1 R1b ASR Software Recommendations report). In this section we present these tools.

2.1. Nuance 8.0

Nuance Communications is a company originally created as a spin-off of SRI International (Stanford Research Institute). The Nuance ASR software is a commercial solution intended for interactive vocal services such as vocal commerce, vocal information services, etc. This implies situations where the language and dialogue are constrained and the vocabulary is limited.

Nuance 8.0 is more than a simple vocal recognition tool. It offers a large number of functionalities dedicated to vocal services:

- Vocal signature (speaker verification for high security and secure access);
- VoiceXML integration (a standard from the W3C for vocal dialogue description);
- Grammar construction and evaluation tools;
- Language models for English and 26 other languages.

Moreover, like any commercial product, Nuance 8.0 is well packaged and documented, frequently updated, and comes with on-line support for use and integration. Nuance performs best with a constrained vocabulary. The use of grammars is made easier by an integrated grammar editor and compiler. Recognition time is short enough for use in real-time applications. Nuance 8.0 and earlier versions are already in use at THALES R&T, IntuiLab and IRIT for experimentation and development of speech-based systems.

2.2. Sphinx 3

Sphinx originated at CMU (Carnegie Mellon University) and has recently been released as open source software. Sphinx is a fairly large program that offers many tools and a lot of information. It is still in development but already includes trainers, recognizers, acoustic models, language models and some limited documentation. Sphinx 3 works best on continuous speech with a large vocabulary and has been tested during the NIST (National Institute of Standards and Technology) evaluation campaigns. Sphinx's recognition time is between five and ten times real time. Documentation is not as well supplied as for Nuance, but the availability of the source code generally makes the functionalities easier to understand. Unlike Nuance 8.0, Sphinx 3 does not provide any interface to ease the integration of all components. In addition to the recognition core and the acoustic model for American English, a dictionary and a language model are available for the same language. CMU supplies some useful software tools for building new Sphinx-formatted language models. Sphinx 3 is already in use at THALES R&T for experimentation and development of speech-based systems.

3. ASR software theoretical comparison

Before presenting the results of the comparison of the two recognition tools, two remarks must be made. First, it is not the purpose of this study to present an exhaustive evaluation and comparison of the two ASR solutions in general. The time and resources allocated to this study do not allow such work. The objective here is to evaluate and compare the software in the specific context of ATC communications. Second, the purpose is not to demonstrate that one of the two ASR tools is better than the other in general. This would not make sense given the differences in nature between the two tools. Once again, the objective is rather to determine which tool is best in the context of the SCOPE project, that is, for the transcription of ATC communications.

In this section we present a paper comparison of the two recognition tools on the basis of a list of criteria relevant to ASR evaluation. We first introduce the criteria retained and then present and comment on the results of the comparison.

3.1. Criteria presentation

When selecting a vocal recognition tool, the comparison cannot be limited to the overall performance of the ASR, that is, the quality of the transcription (transformation from speech to text), the error rate and the processing time. As the wide range of recognition software and the variety of application domains show, the tool must be adapted to the purpose and requirements of the application. For controller-pilot communications, the most important criteria are robustness to noise and capacity for adaptation, which includes multi-speaker capability and absence or simplicity of training. Indeed, the communications to be transcribed come from many distinct speakers, who are too numerous to allow training phases. Communications may also be interrupted or may overlap, and are polluted by significant noise (cockpit environment, poor-quality transmission channels, etc.). The selected ASR must therefore be robust enough to distinguish noise from speech, accept large variability of pronunciation and accent, and cope with overlapping utterances.

In order to tackle these issues, the most important criteria when evaluating the ASR software for ATC are:

1. The acoustic model component for:
   a. Robustness to noise through speech band and noise processing;
   b. Multi-speaker requirements through speaker independence and speaker training;
2. The language model and dialogue component, given the nature of the dialogue structure.

Concerning the acoustic model, both retained systems use HMM-based (Hidden Markov Model) recognition cores. This approach is today the most appropriate way to build robust speech recognition systems. The phonetic units used during the learning step are phones, mostly diphones or triphones for better performance. Senones may also be used to constitute the phonetic learning base. A senone is a context-dependent sub-phonetic unit equivalent to an HMM state in a triphone (which can easily be extended to more detailed context-dependent phones, such as quinphones). A senone could represent an entire triphone if a 1-state HMM is used to model each phoneme [Donovan, 1995].

Another important characteristic is the nature of the outputs of the ASR. Four kinds of outputs are possible:

- Best solution: only the best recognition solution is proposed;
- N-best solutions: the n best solutions are proposed, sorted by recognition score or confidence;
- Word lattice: a lattice of solutions with words as nodes is proposed;
- Phone lattice: same as word lattice but with phones rather than words.
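To make the difference between the "best solution" and "n-best" output types concrete, here is a minimal Python sketch. The `Hypothesis` type, the confidence values and the example utterances are all invented for illustration; neither Nuance 8.0 nor Sphinx 3 is claimed to expose its results under these names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    """One recognition candidate (hypothetical structure, not a vendor API)."""
    text: str
    confidence: float  # assumed to lie in [0, 1]; higher is better


def n_best(hypotheses: List[Hypothesis], n: int) -> List[Hypothesis]:
    """Return the n most confident hypotheses, best first."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)[:n]


def best(hypotheses: List[Hypothesis]) -> Hypothesis:
    """The 'best solution' output is simply the head of the n-best list."""
    return n_best(hypotheses, 1)[0]


# Invented candidates for a single utterance.
candidates = [
    Hypothesis("climb flight level two one zero", 0.48),
    Hypothesis("climb flight level three one zero", 0.81),
    Hypothesis("climb flight level three nine zero", 0.33),
]
```

Keeping the whole ranked list, rather than only its head, is what leaves room for later processing to fall back on a lower-ranked candidate.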

Outputs of type n-best or lattice offer wider processing possibilities than a single solution, in particular for error repair. The configuration possibilities, such as the tuning of the ASR (in particular the rejection threshold), are also important for processing the recognition results.

Building a complete application that integrates an ASR system can be made easier by an API (Application Programming Interface), which for example allows the functionalities of the recognition agent to be accessed directly from a programming language. Moreover, the availability and ergonomics of tools that help the designer manage resources such as grammars (editing, compilation and testing) and lexicons (transformation of unknown words into phone sequences) do not affect recognition performance but are nonetheless an important criterion for development. In the SCOPE project in particular, where the objective is to use a specific grammar of the ATC language in order to increase the performance of the ASR, the support for grammar definition and editing offered by the software cannot be overlooked.

Finally, technical features are not the only way to assess a recognition system. Other aspects can be taken into account, such as documentation, legal status (commercial product, shareware, freeware), or whether the software is open source, for adaptation or extension purposes. The criteria for the theoretical comparison of the ASR software are summarized in Table 1 below.
Miscellaneous
    Main Application Domain
    Legal Status
    Open Source
    Scientific Publications
Technical Characteristics
    Input/Output: Input Signal, Outputs
    Acoustic Model: Format, Speech Band, Noise Processing, Speaker Independence, Speaker Training
    Language Model: Multilingualism, Vocabulary, Format
    Dialogue Component: Formats
    Dictionary: Type
    Other: Programming Language, API, Architecture, Platform Availability
Evaluation
    Integration, Configuration, Tools, Ergonomics, Documentation, Mastering, Installation
Performances
    Decoder: X*Real-Time
    Public Evaluation Results: Campaign name

Table 1: Theoretical comparison criteria summary

3.2. Comparison results

The results of the theoretical comparison of Nuance 8.0 and Sphinx 3 are summarized in Table 2 below. Where applicable, the criterion is given a mark out of 5. These marks come from our various experiments with the software. Comments on this table are given below.

Nuance 8.0 is HMM-based with bi-gram, tri-gram and quadri-gram language models. Recognition time increases with the complexity of the model. Sphinx 3 is composed of an acoustic analysis module and a decoder. Within Sphinx 3, several types of decoders are available. We chose the S3.3 decoder, which optimizes the signal processing time, since the ATC application requires fast processing. The other decoders might give better results but take much longer and are therefore not applicable to live ATC communications. The S3.3 decoder uses HMMs and search techniques such as the Viterbi algorithm. It produces three kinds of outputs: best transcription, word lattice and phone lattice.

Nuance 8.0 and Sphinx 3 both include numerous parameters used to control the behavior of the various components of the recognition system and of the whole ASR application. We will not describe them here in detail, but they allow very fine tuning of both systems. It must be noted that the tuning process can be very long and demanding. Refer to the software documentation for more details.

Nuance 8.0 is a well-packaged product that comes with more than fifty tools, many more than Sphinx 3. Both provide developers with tools for the creation of statistical language models. Moreover, some Nuance tools are dedicated to the automatic creation and learning of Specific Language Models (SLM), i.e. formal grammars.
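The Viterbi search mentioned above for the S3.3 decoder can be illustrated on a toy HMM. The sketch below is the textbook algorithm in the log domain, not the Sphinx implementation; the two states, the "low"/"high" energy observations and all probabilities are invented for the example.

```python
import math
from typing import Dict, List, Tuple


def viterbi(
    observations: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state sequence and its log-probability."""

    def log(p: float) -> float:
        return math.log(p) if p > 0.0 else float("-inf")

    # score[s] = best log-probability of any path ending in state s.
    score = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    backpointers: List[Dict[str, str]] = []

    for obs in observations[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            # Best predecessor of s, using the scores of the previous frame.
            prev = max(states, key=lambda q: score[q] + log(trans_p[q][s]))
            new_score[s] = score[prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Invented toy model: silence vs. speech, observed through frame energy.
STATES = ["sil", "speech"]
START = {"sil": 0.8, "speech": 0.2}
TRANS = {"sil": {"sil": 0.7, "speech": 0.3}, "speech": {"sil": 0.2, "speech": 0.8}}
EMIT = {"sil": {"low": 0.9, "high": 0.1}, "speech": {"low": 0.2, "high": 0.8}}

path, log_prob = viterbi(["low", "high", "high", "low"], STATES, START, TRANS, EMIT)
```

For this toy input the best path is silence, speech, speech, silence. A real decoder applies the same recursion to acoustic models with thousands of states, which is why the choice of decoder weighs so heavily on processing time.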
In order to relax the constraints of purely formal grammars, these are associated with a robust natural speech recognizer that takes into account language particularities such as hesitations or reformulations. Nuance supports just-in-time grammars that can be modified or added at runtime. Sphinx 3 only allows the creation of statistical acoustic and language models. Nuance 8.0 provides Java and C++ classes defining methods that can be implemented to access the recognition engine on a specific platform, including, for example, access to dynamic grammars and the speaker verification functionality.

Concerning ergonomics, both systems present a text-command interface to access and use the different tools. Each command comes with specific help covering its general purpose and detailed syntax. Nuance 8.0 provides very precise guides that can be used by novices as well as more experienced users at any step of the overall application design. These guides cover the installation of the Nuance system, grammar development, and system verification and integration on a specific platform. Sphinx 3 also provides some help on its different tools, but it is not as didactic as that of Nuance.

Finally, considering the installation process, Nuance 8.0 provides a single setup program that automatically installs all the necessary components, which makes it an off-the-shelf product. The language model installed by default is an American English one. Other models can easily be added simply by copying folders, without additional compilation. Sphinx is also relatively easy to install, although it requires some basic knowledge of software configuration. Sphinx only comes with an American English language model.
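Since both tools rely on statistical language models, a small sketch may help show what such a model estimates. The following Python code trains a bi-gram model with add-one smoothing on an invented three-sentence corpus. It is a deliberate simplification, not the method used by the Nuance or CMU tools, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

BOS, EOS = "<s>", "</s>"  # sentence start and end markers


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter, Set[str]]:
    """Count histories and bi-grams over sentences padded with <s> and </s>."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    vocab = {EOS}  # words that can be predicted; <s> is never predicted
    for words in sentences:
        padded = [BOS] + words + [EOS]
        vocab.update(words)
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, vocab


def bigram_prob(prev: str, cur: str, history_counts: Counter,
                bigram_counts: Counter, vocab: Set[str]) -> float:
    """P(cur | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + len(vocab))


# Invented corpus for illustration only.
CORPUS = [
    "climb flight level three one zero".split(),
    "descend flight level two one zero".split(),
    "maintain flight level three one zero".split(),
]
history_counts, bigram_counts, vocab = train_bigram(CORPUS)
p_level = bigram_prob("flight", "level", history_counts, bigram_counts, vocab)
```

Tri-gram and quadri-gram models extend the history to two and three preceding words, which increases the number of parameters to estimate and, as noted above, the recognition time.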
                                 Nuance 8.0                   Sphinx 3
Miscellaneous
  Main Application Domain        interactive vocal servers    keyword spotting, broadcast news transcription
  Legal Status                   commercial product           freeware
  Open Source                    no                           yes
  Scientific Publications        [Murveit, 1993]              [Seymore, 1997]
                                 [Stolcke, 2000]              [Ravishankar, 1999]
Technical Characteristics
  Input/Output
    Input Signal                 continuous                   continuous
    Outputs                      best transcription           best transcription
                                 n-best transcriptions        word lattice
                                                              phone lattice
  Acoustic Model
    Format                       HMM / 25,000 triphones       HMM / 6,000 senones
    Speech Band                  8 kHz / 16 kHz               8 kHz / 16 kHz
    Noise Processing             yes                          yes
    Speaker Independence         yes                          yes
    Speaker Training             no                           no
  Language Model
    Multilingualism              27 languages                 English
    Vocabulary                   limited (but extensible)     large vocabulary
    Format                       native, compiled ARPA        ARPA (1)
  Dialogue Component
    Formats                      GSL, VoiceXML                N/A
  Dictionary
    Type                         compiled                     editable
  Other
    Programming Language         C, C++, Java                 C
    API                          yes                          no
    Architecture                 distributed                  distributed
    Platform Availability        Solaris, Win2000             Linux, Unix, Windows
Evaluation
  Integration                    5/5                          2/5
  Configuration                  4/5                          5/5
  Tools                          5/5                          4/5
  Ergonomics                     3/5                          3/5
  Documentation                  4/5                          3/5
  Mastering                      4/5                          4/5
  Installation                   5/5                          5/5
Performances
  Decoder
    X*Real-Time                  x1 (command vocabulary)      x[5-10] (continuous speech)
  Public Evaluation Results
    Campaign name                NIST                         NIST: broadcast news recognition, topic detection & tracking

(1) ARPA: standard defined by the U.S. Defense Advanced Research Projects Agency (DARPA).

Table 2: Nuance 8.0 vs. Sphinx 3 theoretical comparison

4. ASR software practical evaluation

In order to evaluate the performance of the ASR software, two series of tests were run, with two different corpora. The first corpus consists of news excerpts from the news channel CNN. Such a corpus contains unconstrained speech but with clear and precise pronunciation and, often, good-quality audio. It is a good way to evaluate an ASR tool in general, but it differs somewhat from the requirements of the ATC context, since ATC communications consist of constrained speech with potentially poor audio quality as well as very rapid delivery, confusing pronunciation and strong accents. Therefore, the second series of tests was run on a corpus of military orders uttered by officers in training, which is closer to the characteristics of ATC communications (noisy environment, low-quality transmission channels and stressed speech). Both corpora consist of English utterances.

All tests were performed on the same standard up-to-date PC with an Intel Pentium IV processor (2 GHz) and 512 MB of RAM. Both ASR tools were run under Microsoft Windows 2000. The sound card was a standard up-to-date Creative SoundBlaster. Neither Nuance 8.0 nor Sphinx 3 had any other special requirement concerning the runtime environment.

In this section, we present the results of the tests for both ASR tools. All the results are summarized in Table 3 below.

4.1. Nuance 8.0

Nuance 8.0 comes with default acoustic and language models but also offers the possibility of building specific grammars in order to constrain the recognition and improve its performance. Therefore, each series of tests could be performed twice, once with the default models and once with a grammar adapted to the language of the corpus.

The first test concerned the CNN corpus with default models. Results were very unsatisfactory, with an error rate of about 80%, even after tuning the parameters.

The second test also concerned the CNN corpus but used a specific grammar for the news domain, the language remaining relatively unconstrained given the nature of the domain. Three language models were tested: bi-gram, tri-gram and quadri-gram. Performance improved with the complexity of the model. Using the quadri-gram with refined parameters, we obtained an error rate of 50% considering the best solution proposed by the ASR software.

The third test used the military corpus with a specific grammar. A test with the default models would have been irrelevant here because the language is very specific to the domain. The same holds for ATC communications. For example, callsigns are only used in ATC and nowhere else, and it is very unlikely that a default model would be able to recognize them. The results for this test were much better, with an error rate of less than 5%. This is because Nuance 8.0 works better with small vocabularies and formal grammars (see Nuance's characteristics in Table 2 above). Moreover, Nuance performed the second series of tests in real time.
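The benefit of a formal grammar for domain-specific items such as callsigns can be illustrated with a small sketch. The Python code below is not GSL or any other Nuance syntax. The airline designators, the digit vocabulary and the callsign pattern (one designator followed by one to four spoken digits) are assumptions chosen purely for illustration, and real ATC phraseology is much richer.

```python
from typing import List, Optional

# Hypothetical vocabulary, invented for this example.
AIRLINES = {"speedbird", "lufthansa", "air france"}
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "niner": "9",
}


def parse_callsign(words: List[str]) -> Optional[str]:
    """Return a normalized callsign if the words match the toy grammar
    AIRLINE DIGIT{1,4}, otherwise None."""
    for split in (2, 1):  # try two-word airline names first
        airline = " ".join(words[:split])
        digits = words[split:]
        if (airline in AIRLINES and 1 <= len(digits) <= 4
                and all(d in DIGITS for d in digits)):
            return airline + " " + "".join(DIGITS[d] for d in digits)
    return None


def first_in_grammar(hypotheses: List[str]) -> Optional[str]:
    """Pick the first hypothesis (best first) that the grammar accepts."""
    for text in hypotheses:
        callsign = parse_callsign(text.split())
        if callsign is not None:
            return callsign
    return None


# An unconstrained recognizer might rank an out-of-domain string first.
ranked = ["speed bird to niner", "speedbird two niner"]
accepted = first_in_grammar(ranked)
```

Here the first candidate is rejected because it does not fit the grammar, and the second is accepted and normalized. Constraining recognition itself, as a grammar-based recognizer does, goes further than this kind of post-filtering, since word sequences outside the grammar are never proposed.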

4.2. Sphinx 3

Sphinx 3 comes with default acoustic and language models for large-vocabulary speech recognition. Using the additional CMU toolkit, one can build language models in order to constrain the recognized language.

For the first series of tests, Sphinx 3 gives much better results than Nuance 8.0 using the default models (error rate of about 20%). With a domain-adapted language model, the error rate is about 15%. This is because Sphinx 3 was specifically designed for large-vocabulary recognition in the broadcast news domain. Its performance is therefore intrinsically good in this domain but cannot be improved much even with a refined language model.

On the second corpus, using a specific language model, Sphinx 3 never performed as well as Nuance (error rate of about 15%). This can be explained by the nature of the two tools, one explicitly dedicated to constrained languages and small vocabularies (Nuance 8.0) and the other designed for large vocabularies and unconstrained languages (Sphinx 3). Moreover, the definition of a specific ATC language for Sphinx would require a corpus, since Sphinx's language models are purely statistical. In addition, Sphinx 3 performed the tests in 5 to 10 times real time on the same computer as Nuance, which is a problem for the processing of ATC communications, in particular for overlapping utterances.

4.3. Summary

This section summarizes the practical comparison results described above.

                                                      Nuance 8.0    Sphinx 3
Unconstrained Language w/ Default Models
    Error Rate                                        80%           20%
    Time (X*Real-Time)                                1             5
Unconstrained Language w/ Domain-Specific Models
    Error Rate                                        50%           15%
    Time (X*Real-Time)                                1             10
Constrained Language w/ Domain-Specific Models
    Error Rate                                        3%            15%
    Time (X*Real-Time)                                1             7

Table 3: Nuance 8.0 vs. Sphinx 3 practical comparison
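The report does not state how the error rates in Table 3 were computed. A common choice is the word error rate (WER), the word-level edit distance between the reference transcription and the recognizer output divided by the number of reference words. The sketch below assumes that definition, together with the usual real-time factor (processing time divided by audio duration); the example sentences and timings are invented.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(reference)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time expressed as a multiple of the audio duration."""
    return processing_seconds / audio_seconds


# Invented example: one word of a six-word instruction is dropped.
reference = "descend flight level one two zero".split()
hypothesis = "descend flight level one zero".split()
wer = word_error_rate(reference, hypothesis)
rtf = real_time_factor(processing_seconds=35.0, audio_seconds=5.0)
```

Under these assumptions a real-time factor of 1 means the recognizer keeps pace with the speech, while a factor of 7 means a 5-second transmission takes 35 seconds to process.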

5. Conclusion

The results of the theoretical and practical comparisons of the two ASR tools can be summarized as follows:

- Nuance 8.0 is better than Sphinx 3 regarding:
  o The error rate when used with a constrained language;
  o The overall recognition time;
  o The API and integration facilities;
  o The ease of definition and tuning of acoustic and language models;
- Nuance 8.0 and Sphinx 3 are equivalent regarding:
  o The type of outputs offered (n-best transcription, word lattice);
  o The noise processing capabilities;
  o The speaker independence and the absence of training;
  o The stand-alone installation and the ergonomics;
- Nuance 8.0 is not as good as Sphinx 3 regarding:
  o The freedom of configuration;
  o The cost (commercial vs. open source).

Therefore, given the nature of the targeted application, that is, ATC communications with limited vocabulary and constrained language, and the associated strong constraints (noisy environment, low-quality transmission channels, overlapping utterances and stressed speech), Nuance 8.0 appears to be the appropriate choice. Nuance is designed for limited vocabulary, and its grammar construction, editing and integration facilities make it much more readily usable for a constrained language such as the one used by controllers and pilots. The results obtained with Nuance 8.0 on such a language are much better than those obtained with Sphinx 3, and Nuance is able to perform recognition for this kind of application in real time, whereas Sphinx takes 5 to 10 times real time. Since both systems are speaker independent, do not require training, have a noise processor and apply to the same speech band, Sphinx 3 has no key advantage over Nuance 8.0 given the use that will be made of the ASR software in the SCOPE project. We therefore argue that Nuance 8.0 will be much more appropriate than Sphinx 3 for the objectives of the project.

6. References

Murveit H., Butzberger J., Digalakis V. & Weintraub M.M. Large-Vocabulary Dictation Using SRI's Decipher Speech Recognition System: Progressive-Search Techniques. In Proc. of ICASSP93, Volume II, IEEE, 1993.

Stolcke A., Bratt H., Butzberger J., Franco H., Rao Gadde V.R., Plauche M., Richey C., Shriberg E., Sonmez K., Weng F. & Zheng J. The SRI March 2000 Hub-5 Conversational Speech Transcription System. In Proc. of NIST Speech Transcription Workshop, College Park, Maryland, 2000.

Ravishankar M., Singh R., Raj B. & Stern R.M. The 1999 CMU 10X Real Time Broadcast News Transcription System, 1999.

Seymore K., Chen S., Doh S., Eskenazi M., Gouvea E., Raj B. & Ravishankar M. The 1997 CMU Sphinx-3 English Broadcast News Transcription System, 1997.

Donovan R.E. & Woodland P.C. Improvements in a HMM-Based Speech Synthesizer. Proceedings of EuroSpeech Conference, Madrid, Spain, 1995.
