PERFORMANCE IMPROVEMENT OF AUTOMATIC SPEECH RECOGNITION SYSTEMS VIA MULTIPLE LANGUAGE MODELS PRODUCED BY SENTENCE-BASED CLUSTERING

Sushil Kumar Podder, Khaled Shaban, Jiping Sun, Fakhri Karray, Otman Basir, and Mohamed Kamel
[spodder, kshaban, jsun, karray, obasir, mkamel]@uwaterloo.ca
PAMI Lab, University of Waterloo, Waterloo, CANADA

ABSTRACT

Grammar-based speech recognition systems exhibit performance degradation as their vocabulary sizes increase. Data clustering can mitigate this problem. We introduce an approach to data clustering for automatic speech recognition systems using the Kohonen Self-Organizing Map. The resulting clusters are then used to build a language model for each cluster with the CMU-Cambridge toolkit. The approach was implemented as a prototype of a large-vocabulary, continuous speech recognition system, and about 8% performance improvement was achieved compared with the language model and dictionary provided by Sphinx3. In this paper we present the experimental results along with discussion, analysis, and potential future directions.

Keywords: Automatic Speech Recognition, Self-Organizing Map, Language Model

1. INTRODUCTION

With the continuing advances in speech technology, more information and services are becoming readily available to users; a simple cell phone is enough to hook into the information age. Automatic Speech Recognition (ASR) systems are already being deployed to help people find flight information, trade stocks, access email, and check weather conditions. Over the past few years a number of significant commercial and research engines and toolkits have been developed for building ASR systems, e.g., Nuance [6], the Microsoft Speech Engine [7], and the CMU-Cambridge Toolkit [2]. Despite the success of grammar-based engines in command-and-control applications, they suffer acute performance degradation once their vocabularies grow beyond a certain size. It is therefore very difficult to make this approach work for unconstrained, domain-independent ASR applications, e.g., dictation systems. A practical solution is to break the large language model (LM) down into multiple LMs of acceptable vocabulary size. These models should group together words and phrases that are used coherently in the natural use of the language. In this paper we present an approach that improves speech recognition performance and moves one step closer to spontaneous, domain-independent dictation systems.

The paper is organized as follows. Section 1, the introduction, presents some basic background material, demarcates the subject, and discusses some existing problems and how this work can tackle them. Section 2 gives a literature survey and state-of-the-art review of the application of data clustering to the creation of language models. The next four sections constitute the main body of the work: Section 3 describes the proposed solution, including data modeling, the clustering technique, and language model building. Section 4 explains the experimental setup, its data set, and the tuning and settings of the algorithm parameters. Section 5 contains results collected from an implemented prototype of a dictation system that makes use of the proposed clustering technique. Section 6 analyses and discusses the results and highlights the key issues covered. Section 7 recommends further developments of the presented study.

2. OVERVIEW OF RELATED WORK

This section gives an overview, including the state of the art, of three related areas of study: language modeling, data clustering, and the use of clustering to improve speech recognition systems.

2.1 Language Modeling

The probabilistic n-gram language model is the most widely used language model in large-vocabulary ASR systems. Its job is to define which words may follow at each point in the model and the transition probability from one word to the next [10][13][14]. For large-vocabulary recognizers the word lattice is not constructed beforehand; it is built as the search progresses, and the language model is used to determine the overall probability of the search paths under consideration. However, as the size of the language model increases, the search space grows proportionally and the model perplexity increases as well, so reducing the size of the language model is important. In practice, the active vocabulary of a speaker in a given context is rather small: speakers tend to confine themselves to a specific domain. The generation of domain-specific (sub-) language models is therefore a challenging and demanding task in ASR.

2.2 Data Clustering

Clustering algorithms partition a set of objects into groups, or clusters [11][15]. Data representation modeling is an essential part of the clustering process: it is the means by which objects are described using a set of features and values. Multiple objects may or may not have the same representation in this model, and the goal is to place similar objects in the same group. There are many different clustering algorithms, and they can be classified into a few basic types. Clustering algorithms produce two types of structure: hierarchical clusterings and flat (non-hierarchical) clusterings. A flat clustering simply consists of a certain number of clusters, and the relations between clusters are often undetermined. Most algorithms that produce flat clusterings are iterative: they start with a set of initial clusters and improve them by iterating a reallocation operation that reassigns objects, as illustrated in the sketch below. A hierarchical clustering is a hierarchy with the usual interpretation that each node stands for a subclass of its mother node; the leaves of the tree are the single objects of the clustered set, and each internal node represents the cluster that contains all the objects of its descendants.
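As a concrete illustration of the flat, iterative-reallocation style of clustering just described, here is a minimal k-means-like sketch in Python (our own illustration only; the clustering actually used in this work is the Self-Organizing Map of Section 3.2, and the function name and parameters are assumptions made for the sketch):

```python
import numpy as np

def flat_clustering(vectors, k, iterations=20, seed=0):
    """Flat clustering by iterative reallocation: start from initial
    clusters and repeatedly reassign objects to improve them."""
    rng = np.random.default_rng(seed)
    # Initial clusters: k randomly chosen objects serve as centroids.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        # Reallocation: assign every object to its nearest centroid.
        distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Improve the clusters: recompute each centroid from its members.
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```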
Another important distinction between clustering algorithms is whether they perform soft or hard clustering. In a hard assignment, each object is assigned to one and only one cluster; soft assignments allow degrees of membership, and hence membership in multiple clusters.
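The difference between the two assignment styles can be made concrete with a small sketch (again our own illustration; the exponential weighting used for the soft case is an assumption for the example, not part of this work's method):

```python
import numpy as np

def hard_assignment(x, centroids):
    """Hard clustering: the object belongs to exactly one cluster."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(distances.argmin())

def soft_assignment(x, centroids, temperature=1.0):
    """Soft clustering: a degree of membership in every cluster (sums to 1)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    weights = np.exp(-distances / temperature)
    return weights / weights.sum()
```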

2.3 Clustering for Better Recognition

Brown et al. [8] took a clustering approach, believing that clustering can play an important role in improving language modeling. More recently, the speech community has begun to address the use of clustering in building better language models, and thus improving recognition accuracy. Florian and Yarowsky [3] utilized hierarchical and dynamic topic-based clustering to build language models. Although this work, and other topic-based clustering approaches [4], brought perplexity improvements and word error rate reductions, such approaches have difficulty handling topic-insensitive words. As Mangu [5] observed, closed-class function words, such as "the", "of", and "with", show minimal probability variation across topics, while most open-class content words exhibit substantial topic variation. Moreover, the recognizer would incur the additional overhead of identifying the topic/domain from the user's first utterances in order to select the appropriate language model.

3. THE PROPOSED SOLUTION

For the purpose of language modeling, clustering can be performed on documents or segments of documents. Clusters can be obtained either manually (by topic-tagging the documents) or automatically, by using an unsupervised algorithm to group similar documents into topic-like clusters. We adopted the latter approach for its generality and extensibility, and because there is no reason to believe that manually assigned topics are optimal for language modeling. The main steps taken towards solving the problem are described below; Figure 1 presents them in a flow diagram. In brief, a large corpus is partitioned by a sentence-based clustering technique into smaller collections of related sentences. The clusters are then used to build multiple language models. A user's utterance is passed to all language models and the initial recognition results are collected. These results are aggregated to create a final language model that converts the spoken utterance into written text.

[Figure 1: System flow diagram. The processed BNC is clustered with SOM; the clusters feed a language model and dictionary generator that produces sub language models and dictionaries; the test speech is decoded by the CMU decoder, and the recognition results are combined into a final language model and dictionary used by the CMU decoder to produce the final result.]

3.1 Data Modeling

Clustering the sentences of a text corpus is expected to group the closest and most similar words and common phrases together. Extracting sentences and representing them was therefore the first step. Most data clustering algorithms expect the data set to be given as a set of vectors X = {x_1, x_2, ..., x_n}, where the vector x_i, i = 1, ..., n, corresponds to a single object in the data set and is called the feature vector. Extracting the proper features to represent objects in the feature vector is a crucial factor in the running time of the algorithm and hence in its scalability. When clustering sentences we addressed the problem of high dimensionality as follows: each sentence is represented by a vector x = {f_1, f_2, ..., f_m}, where f_i, i = 1, ..., m, is the index of word i in a global dictionary of all existing words. The vector size m is the average number of words found in a sentence.
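A minimal sketch of this sentence encoding is shown below (the fixed length of 60 and the normalized word indices follow the experiment setup of Section 4; the function and variable names, and the tiny example dictionary, are our own):

```python
def encode_sentence(sentence, word_index, length=60):
    """Map a tokenized sentence to a fixed-length vector of word indices."""
    vocab_size = float(len(word_index))
    # Normalized index of each known word in the global word list.
    vector = [word_index[w] / vocab_size for w in sentence if w in word_index]
    # Truncate long sentences, pad short ones with zeros.
    vector = vector[:length]
    vector += [0.0] * (length - len(vector))
    return vector

# Example: word_index maps every word of the corpus to its position in the
# global word list (1-based so that 0 can serve as padding).
word_index = {"market": 1, "prices": 2, "rose": 3, "sharply": 4}
print(encode_sentence(["prices", "rose", "sharply"], word_index, length=6))
# -> [0.5, 0.75, 1.0, 0.0, 0.0, 0.0]
```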

3.2 Data Clustering

There exists a multitude of clustering techniques in the literature, each adopting a certain strategy for detecting groupings in the data. The method chosen here is the Kohonen Self-Organizing Map (SOM) [1]. The SOM can be visualized as a sheet-like neural-network array whose cells (or nodes) become specifically tuned to various input signal patterns, or classes of patterns, in an orderly fashion. The learning process is competitive and unsupervised, meaning that no teacher is needed to define the correct output (or, in fact, the cell into which the input is mapped) for an input. In the basic version, only one map node (the winner) is activated at a time for each input. The locations of the responses in the array tend to become ordered during learning, as if some meaningful nonlinear coordinate system for the different input features were being created over the network [1]. All sentence vectors were fed to the network as one epoch, and over a number of iterations the network was trained to map them to a specific number of clusters.

3.3 Language Model Building

The whole training data set (sentence sequences) was clustered with the unsupervised SOM. The sentences belonging to each cluster were collected to generate an individual sub language model. Language models were generated using the statistical language modeling toolkit provided by CMU [2]. Each language model yields 1-gram, 2-gram, and 3-gram word sequences together with their probabilities. Using these probabilities, we computed the probability of each test sentence under each sub model as well as under the CMU language model.

4. EXPERIMENT SETUP

For our experiments we chose the BNC (British National Corpus). It comprises approximately 100 million words and about 6 million sentences. After some pre-processing, viz. filtering out all non-words, punctuation, abbreviations, and numerals, and removing stop and unimportant words, we retained 5,845,344 sentences containing 49,233 unique words for clustering. Each sentence was represented as a vector of fixed dimension 60, each component of which was the (normalized) index of the corresponding word in the word list. If a sentence had fewer than 60 words, the remaining components were filled with 0s; sentences longer than 60 words were truncated to 60. The dimension of 60 was selected by analyzing a large amount of BNC text. We used SOM_PAK [12] for sentence clustering with the following parameter values: the initial radius of the training area was 5, decreasing linearly to one during training; the number of clusters produced by the SOM was set to 5, 10, 20, 30, and 40; the map topology was hexagonal (hexa; the other choice being a rectangular lattice, rect); the neighborhood function type was bubble; and the running length (number of training iterations) was chosen as 30,000 after a number of trials. For testing, we randomly selected 100 VOA (Voice of America) broadcast utterances and ran the CMU engine with and without clustering. The results are presented in the following section.
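For illustration, the following is a much-simplified SOM training loop in the spirit of the setup above. The actual experiments used SOM_PAK [12] with a hexagonal lattice; this sketch uses a small rectangular grid and a bubble neighborhood only to show the winner selection and the update with a linearly shrinking radius. The grid size, learning rate, and all names are our own assumptions, not SOM_PAK's.

```python
import numpy as np

def train_som(vectors, grid=(5, 8), iterations=30000,
              radius0=5.0, alpha0=0.05, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = vectors.shape[1]
    weights = rng.random((rows, cols, dim))
    # Coordinates of every map node, used by the bubble neighborhood.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for t in range(iterations):
        x = vectors[rng.integers(len(vectors))]
        # Winner: the node whose weight vector is closest to the input.
        dists = np.linalg.norm(weights - x, axis=2)
        wr, wc = np.unravel_index(dists.argmin(), dists.shape)
        # Radius shrinks linearly to one; learning rate shrinks to zero.
        frac = 1.0 - t / iterations
        radius = 1.0 + (radius0 - 1.0) * frac
        alpha = alpha0 * frac
        # Bubble neighborhood: every node within the radius is updated equally.
        in_bubble = np.linalg.norm(coords - np.array([wr, wc]), axis=2) <= radius
        weights[in_bubble] += alpha * (x - weights[in_bubble])
    return weights

def cluster_of(x, weights):
    """Map node (cluster) that a sentence vector is assigned to."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(dists.argmin(), dists.shape)
```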
5. RESULTS

We computed the language model perplexity for the reference sentences using the CMU toolkit. For the best sub language model the average perplexity is 8.6 bits, while for the big language model it is 8.8 bits. Figure 2 shows the perplexity of generating each of the test sentences from the sub language model and from the big language model.

[Figure 2: Language model perplexity (in bits) for each of the 100 test sentences, comparing the best sub language model with the big language model.]
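For reference, perplexity expressed in bits, as reported above, is the per-word cross-entropy, i.e., the base-2 logarithm of the perplexity. A small sketch of the computation from per-word probabilities, such as those produced by the toolkit of Section 3.3, is given below (the function names and example probabilities are our own; this is not the CMU toolkit's evaluation code):

```python
import math

def perplexity_bits(word_probs):
    """word_probs: P(w_i | history) for each word of a sentence.
    Returns the average negative log2 probability per word (bits)."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Perplexity is 2 raised to the per-word cross-entropy in bits."""
    return 2.0 ** perplexity_bits(word_probs)

# Example: a 4-word sentence with made-up conditional probabilities.
probs = [0.2, 0.05, 0.1, 0.25]
print(perplexity_bits(probs))   # ~2.99 bits
print(perplexity(probs))        # ~7.95
```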

The recognition results without clustering and with clustering, for different numbers of clusters, are presented in Table 1.

Table 1: Recognition performance

No. of clusters | Recognition (without clustering) | Recognition (with clustering) | Performance improvement
5  | 70.8% | 75.3% | 4.5%
10 | 70.8% | 75.4% | 4.6%
20 | 70.8% | 78.0% | 7.2%
30 | 70.8% | 78.7% | 7.9%
40 | 70.8% | 78.0% | 7.2%

6. DISCUSSION

To improve the performance of ASR, and to move towards free and unlimited dictation systems, we proposed an approach that incorporates a clustering technique, the SOM. Across a set of experiments we achieved consistently better recognition performance for most of the test utterances; in some cases the improvement was around 8% compared with the single CMU language model. The improvement in recognition performance is attributed to the SOM's ability to organize sentences according to textual similarity; this similarity reduces the perplexity of the model and consequently enhances the recognition results. The results show a gradual improvement in performance as the number of clusters increases, but beyond 20 clusters the performance does not change significantly and then starts to degrade. Since the language model generator relies on uni-gram, bi-gram, and tri-gram counts, a reduced cluster size leaves the model with too few bi-gram and tri-gram tokens from the training data, and consequently the decoder is biased towards picking mostly uni-grams. Hence, there is always a trade-off between cluster size and performance. Although the sub language models show a significant improvement in recognition performance as well as in perplexity, the problem of automatically identifying the suitable sub language model has not yet been solved. The following section describes our ongoing effort.

7. FUTURE WORK

Although the sub language models generated from the clustered data improved recognition, there is still room for further improvement in language modeling by modifying the encoding scheme to use semantics. We have found that the conventional representation (encoding scheme) of the sentences used for clustering is not sufficient to capture the inherent semantics of a sentence. Because the conventional representation treats a sentence as a bag of words, without any notion of conceptual similarity between terms such as that defined in terminological resources like WordNet [9], the clustering solution only relates sentences that use identical terminology. In our ongoing effort, we are trying to find the appropriate sub language model through semantic clustering. Using the big language model and dictionary, the CMU decoder usually provides a single best hypothesis for each utterance. In our present experiment, the ASR engine with the CMU language model predicts about 70% of the words correctly. Since the SOM algorithm clusters the data based on the inherent redundancy or similarity in the data, we postulate that the cluster(s) closest to the best hypothesis might contain more relevant words. Based on this assumption, we follow the steps below (depicted in Figure 3):

- Pass the utterance to the CMU decoder using the CMU language model.
- Represent the best hypothesis as a feature vector.
- Select the cluster(s) most similar to it using a distance measure (usually Euclidean).

- Accumulate all the sentences in the selected cluster(s) and generate the final language model.
- Pass the utterance once again to the CMU decoder using the final language model.

[Figure 3: Two-stage ASR system. The processed BNC is encoded and clustered with SOM; the test speech is first decoded with the CMU language model and dictionary; the best hypothesis, together with WordNet and the clusters, drives a decision-making step that selects the final language model and dictionary for a second decoding pass that produces the final result.]

REFERENCES

[1] Kohonen, T. Self-Organizing Maps. Springer, Berlin, Heidelberg, 1995.
[2] Clarkson, P. R., and Rosenfeld, R. Statistical Language Modeling Using the CMU-Cambridge Toolkit. In Proceedings of ESCA Eurospeech, 1997.
[3] Florian, R., and Yarowsky, D. Dynamic Nonlocal Language Modeling via Hierarchical Topic-Based Adaptation. In 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[4] Lowe, S. An Attempt at Improving Recognition Accuracy on Switchboard by Using Topic Identification. Johns Hopkins Speech Workshop, Language Modeling Group, Final Report, 1995.
[5] Mangu, L. Hierarchical Topic-Sensitive Language Models for Automatic Speech Recognition. Technical report, Computer Science Department, Johns Hopkins University, 1999.
[6] http://www.nuance.com
[7] http://www.microsoft.com/speech
[8] Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C., and Mercer, R. L. Class-Based n-gram Models of Natural Language. Computational Linguistics 18:467-479, 1992.
[9] Miller, G. A. WordNet: A Lexical Database for English. Communications of the ACM 38(11), 1995.
[10] Manning, C. D., and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[11] Jain, A. K., and Dubes, R. C. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
[12] Kohonen, T., Hynninen, J., Kangas, J., and Laaksonen, J. SOM_PAK: The Self-Organizing Map Program Package. Technical Report A21, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.
[13] Rosenfeld, R. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, Computer Science Department, Carnegie Mellon University, 1994.
[14] Stolcke, A., et al. Structure and Performance of a Dependency Language Model. In Proceedings of Eurospeech, 1997.
[15] Kaufman, L., and Rousseeuw, P. J. Finding Groups in Data. Wiley, New York, 1990.