Investigation of Indian English Speech Recognition using CMU Sphinx

Disha Kaur Phull
School of Computing Science & Engineering, VIT University Chennai Campus, Tamil Nadu, India.

G. Bharadwaja Kumar
School of Computing Science & Engineering, VIT University Chennai Campus, Tamil Nadu, India.

Abstract

In recent years, research on speech recognition has devoted much attention to the automatic transcription of speech data such as broadcast news (BN), medical transcription, etc. Large Vocabulary Continuous Speech Recognition (LVCSR) systems have been developed successfully for Englishes (American English (AE), British English (BE), etc.) and for other languages, but for Indian English (IE) the work is still at an early stage. IE is one of the varieties of English spoken in the Indian subcontinent and differs considerably from the English spoken in other parts of the world. In this paper, we present our work on LVCSR of IE video lectures. The speech data comprises 23 hours of video lectures on various engineering subjects given by experts from all over India as part of the NPTEL project. We have used CMU Sphinx for training and decoding in our large vocabulary continuous speech recognition experiments. The analysis of the results shows that building an IE acoustic model for IE speech recognition is essential: it yields a 34% lower average word error rate (WER) than the HUB-4 acoustic models. The average WER of the IE acoustic model before and after adaptation is 38% and 31%, respectively. Even though our IE acoustic model is trained with limited training data and the corpora used for building the language models do not mimic the spoken language, the results are promising and comparable to the results reported for AE lecture recognition in the literature.

Keywords: CMU Sphinx, Indian English, Lecture Recognition.

Introduction

Automatic Speech Recognition (ASR) is the technology that converts speech into text, making it easier both to create and to use information. The ultimate goal of ASR research is to allow a computer to recognize, in real time and with 100% accuracy, all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics or accent. During the past few decades, substantial progress has been reported in ASR for many languages such as English, Finnish and German. Recently, there has been growing interest in large vocabulary continuous speech recognition (LVCSR) research for Indian Languages (IL). Several works have been carried out for Indian languages such as Tamil, Telugu, Bengali and Hindi. However, speech recognition work on Indian English (IE) has not received as much attention as these languages.

The languages spoken in India belong to four major language families: Indo-Aryan, Dravidian, Austro-Asiatic and Sino-Tibetan. In accordance with India's vast population, the figures relating to languages are also very impressive. The Indian constitution has given official status to 22 Indian languages as well as English. Apart from these, there are many other languages spoken in India. Linguists believe that there are nearly 150 different languages and about 2000 dialects in India [1]. Here, dialect refers to variation at all linguistic levels, i.e., vocabulary, idiom, grammar and pronunciation.
Differences among dialects are mainly due to regional and social factors, and these differences show up in pronunciation, vocabulary and grammar [2]. Accent refers to the variety in pronunciations of a certain language and to the sounds that exist in a person's language [3]. The term IE is commonly used to refer to English spoken as a second language in India [4]. IE plays the role of a lingua franca [5]. IE has many distinctive pronunciations, some distinctive syntax and quite a bit of lexical variation. Any linguistic description seeking to characterize IE must take cognizance of its highly variable nature, as it comes in a range of varieties, both regional and social [6]. Indian English accents vary greatly. The pronunciation is strongly influenced by the speaker's native language and educational background. Another major reason for variation is that IE rhythm follows the rhythm of Indian languages [7], i.e., it is syllable-timed (roughly equal time is taken to utter each syllable). English, in contrast, is known to be a stress-timed language (at both syllable and word level) where only certain words in a sentence or phrase are stressed, and this is an important feature of Received Pronunciation (RP). Stressing syllables and words correctly is often an area of great difficulty for speakers of IE. The extent to which Indian features of pronunciation occur in the speech of an individual varies from person to person. In [8], Peri Bhaskararao compared Indian English with British English (BE) pronunciation. Diphthongs in BE correspond to pure long vowels in Indian pronunciation (e.g., cake and poor are pronounced as /ke:k/ and /pu:r/, respectively); the alveolar sounds /t/ and /d/ of British Received Pronunciation (BRP) are pronounced as retroflexes (harsher sounds); the dental fricatives θ and ð are replaced by a soft th and a soft d (e.g., thick is pronounced as /thik/ rather than /θik/); and /v/ and /w/ of BRP are both pronounced somewhat like /w/ in many parts of India, and are usually merged with /b/ in Bengali, Assamese and Oriya pronunciations of English.

Some words that are not found in other Englishes are used in IE. These are either innovations or translations of native words or phrases. Examples include cousin brother (for male cousin), prepone (advance or bring forward in time), and foreign-returned (returned from abroad). There are Indianisms in grammar, such as the pluralization of non-count nouns (e.g., breads, foods, advices) and the use of the present progressive for the simple present (I am knowing). In IE, there is a lack of aspiration in the word-initial position: cat is pronounced with /k/ rather than /kh/, despite the phonemic contrast between the unvoiced unaspirated velar /k/ and the unvoiced aspirated velar /kh/. Some fricatives are also changed into bilabials, and there is a lack of interdentals in IE.

In this paper, we present our experiments on large vocabulary continuous speech recognition for Indian English. The Indian English speech data is extracted from the videos of NPTEL. NPTEL is a government funded project that provides e-learning through online web and video courses in engineering, science and humanities streams [9]. The vision of this project is to provide lectures from experts at prominent educational institutions for the benefit of students in various educational institutions in different parts of India. Currently, there are lectures by 130 speakers on various subjects.

The organization of the paper is as follows: Section 2 briefly summarizes ASR work on Indian English as well as other Indian languages, and also accent-based ASR work for some other languages. Section 3 describes the experimental setup and methodology for IE speech recognition. Section 4 describes our experiments and results.

A Brief Survey on Indian Language Speech Recognition

In India, early work on large vocabulary speech recognition started with the Hindi language around the late 90s. Samudravijaya et al. [10] proposed a speech recognition system for Hindi which follows a hierarchical approach to speech recognition. Kumar et al. [11] proposed a large vocabulary continuous speech recognition system for Hindi based on the IBM ViaVoice speech recognizer; for a vocabulary size of 65000 words, the system gives a word accuracy of 75% to 95%. Gopalakrishna et al. [12] carried out medium vocabulary speech recognition using Sphinx for three languages, Marathi, Telugu and Tamil, in different environments such as landline and cellphone. They obtained word error rates (WER) of around 20.7%, 19.4% and 15.4% on landline data and 23.6%, 17.6% and 18.3% on cellphone data for Marathi, Tamil and Telugu, respectively. Pratyush Banerjee et al. [13] used the Hidden Markov Model (HMM) toolkit for Bengali continuous speech recognition and obtained an average recognition rate of 76.33% for male speakers and 52.34% for female speakers. Thangarajan et al. [14] carried out experiments using triphone-based models for Tamil speech recognition and achieved 88% accuracy on limited data. They also tried context-independent syllable models [2] for Tamil speech recognition, which underperformed context-dependent phone models. Lakshmi Sarada et al. [15] used a group delay based algorithm to automatically segment and label continuous speech into syllable-like units for Indian languages, with a new feature extraction technique that uses features extracted from multiple frame sizes and frame rates. They achieved recognition rates of 48.7% and 45.36% for Tamil and Telugu, respectively.

Ma et al. [16] classified three accents of English recorded from the three main ethnicities in Malaysia, namely Malay, Chinese and Indian. They used only statistical descriptors of Mel-band spectral energy and a neural network as the recognizer engine. They ran these experiments on three independent datasets of 20%, 30% and 40% of the total samples, and on average a 95.59% classification rate was achieved. Huang et al. [17] carried out extensive experiments to evaluate the effect of accent on speech recognition using the Microsoft Mandarin speech engine for three Mandarin accents: Beijing, Shanghai and Guangdong. They found an increase of about 40-50% in character error rate for cross-accent speech recognition. Herman Kamper et al. [18] investigated a way to combine speech data from five South African accents of English in order to improve overall speech recognition performance. Three acoustic modeling approaches were considered in that work: separate accent-specific models, accent-independent models and multi-accent models. They found that multi-accent models, obtained by introducing accent-based questions in the decision tree clustering, outperformed the other modeling approaches in both phone and word recognition experiments.

Only a small amount of work has been carried out on Indian English speech recognition. Here, we describe a few ASR works that have been carried out for IE. Kulkarni et al. [19] studied the effect of accent variability on the performance of the Indian English ASR task. They carried out this work on the LILA Indian English database, which covers 10 different Indian accents, using the Siemens SpeechAdvance ASR server. They trained three different HMM systems as part of training the ASR system, namely accent-specific models, accent-pooled models (combining all the accent-specific training data) and models trained on a reduced set of the accent-pooled training data. They found that the accent-pooled training set performed well on a phonetically rich isolated speech recognition task. Deshpande et al. [20] distinguished between AE and IE using the second and third formant frequencies of specific accent markers. A simple Gaussian Mixture Model (GMM) was used for classification. The second and third formant frequencies were calculated from LPC roots, imposing constraints on the bandwidth and the range of each formant. Their results show that the formant frequencies of these accent markers alone are enough to achieve classification for those two accent groups. Olga et al. [21] performed an acoustic phonetic analysis of vowels produced by North Indians whose second language is English, and concluded that North Indian English is a separate variety of IE. Srikant Joshi et al. [22] observed that IE speech is better represented by Hindi speech models for vowels common to the two languages than by AE models. The study of Wiltshire et al. [23] revealed that both the phonemic and the phonological influence of the native language appears in the segmental and supra-segmental properties of proficient speakers' IE. The investigation of Hema et al. [24] into the sound structure of Indian English indicated that the L1 (native language) effect in IE may be reflected in the incomplete acquisition of the target phonology, the influence of sociolinguistic factors on usage, and the evolution of IE.

Experimental Setup

There are three basic steps in building our Indian English LVCSR system: phonetic dictionary creation, acoustic modeling and language modeling.

Creating the Phonetic Dictionary

Since most Indian languages are phonetic in nature, grapheme-to-phoneme (G2P) conversion for them needs only simple mapping tables and rules for the lexical representation. However, IE pronunciation differs considerably from American and other English pronunciations, and also varies with regional and educational background within India itself. Hence, phonetic dictionary creation is a non-trivial task for IE. Initially, we manually created a phonetic dictionary of around 20000 words, which includes words from the training corpus and other frequent English words. Our phonetic dictionary contains 41 phones that are specific to the Indian accent. We then built basic pronunciation models on these 20K words using the Sequitur G2P software, a data-driven grapheme-to-phoneme converter based on joint-sequence models [25]. Later, we applied these models to the Link Grammar Parser's dictionary [26] to obtain a larger pronunciation dictionary, which we then corrected manually. We rebuilt our G2P models using this larger pronunciation dictionary and, finally, used these G2P models to produce the pronunciation dictionary used in our speech recognition experiments. Currently, the pronunciation in this dictionary mostly matches the IE spoken in the Andhra Pradesh region, since the dictionary was created and corrected by Telugu speakers. In future, we plan to build G2P models for various accents of IE.

Acoustic Modeling

Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. In the present work, we have used SphinxTrain [27] for building the acoustic model. The overall process followed by SphinxTrain for creating the acoustic models is shown in Fig. 1.

Figure 1: The process involved in acoustic modeling.

NPTEL lecture videos have been used for building the Indian English acoustic models, as shown in Fig. 2. The video lectures cover various topics from science and engineering, delivered at IITs and other premier institutes. The speakers are from various regions of India and speak various accents of Indian English. We considered the lecture videos of 75 speakers for transcription in order to train the acoustic model. The audio was recorded at a 44 kHz sampling frequency; we converted it to WAV format and down-sampled it to 16 kHz, 16-bit mono. We then manually transcribed the audio files, considering a minimum of 15 minutes of speech per speaker. The total speech data amounts to 23 hours. Mel frequency cepstral coefficients (MFCCs) and their derivatives were used as features. We then built tri-state context-dependent HMMs for each phone. After several experiments, we set the number of Gaussians per mixture to 32 and the number of senones [28] for decision tree clustering to 1000, since our speech data comprises only 23 hours and the vocabulary is limited.

Figure 2: The process followed for creating the IE acoustic model.
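The format conversion described above (extracting the audio track from a lecture video and down-sampling it to 16 kHz, 16-bit mono WAV) can be done with any standard audio tool. The following is only a minimal sketch that drives ffmpeg from Python; the tool choice, file names and directory layout are our own assumptions, not details reported in the paper.

```python
import subprocess
from pathlib import Path

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract the audio track of a lecture video and convert it to
    16 kHz, 16-bit, mono PCM WAV (the format used for acoustic training)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,            # input lecture video (e.g. 44 kHz audio)
            "-vn",                       # drop the video stream
            "-ac", "1",                  # mono
            "-ar", "16000",              # down-sample to 16 kHz
            "-acodec", "pcm_s16le",      # 16-bit PCM samples
            wav_path,
        ],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical directory layout: one WAV per lecture video.
    for video in Path("lectures").glob("*.mp4"):
        extract_audio(str(video), str(video.with_suffix(".wav")))
```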
Adaptation

The goal of adaptation techniques is to move speaker-independent models towards speaker-dependent ones using far less data than would be needed for full speaker-dependent training. Many state-of-the-art LVCSR systems use speaker-adapted models to improve robustness with respect to speaker variability. The HMM models of our ASR system are adapted using Maximum Likelihood Linear Regression (MLLR) [29], which transforms the speaker-independent models towards the target speaker by capturing speaker-specific information. MLLR adapts the observation probabilities of an HMM in a parametric way by finding a transform that maximizes the likelihood of the adaptation data given the transformed Gaussian parameters.
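MLLR applies an affine transform to the Gaussian mean vectors of the speaker-independent model. The transform itself is estimated by the training toolkit from the adaptation data; the short numpy sketch below only illustrates what the adapted means look like once a transform W = [b A] is available. The feature dimension and values here are arbitrary placeholders, not parameters from the paper.

```python
import numpy as np

# Illustrative only: MLLR adapts each Gaussian mean mu as  mu_hat = A @ mu + b,
# where W = [b A] is estimated (e.g. by the training toolkit) to maximise the
# likelihood of the adaptation data.
dim = 39                                   # e.g. 13 MFCCs + deltas + delta-deltas
rng = np.random.default_rng(0)

means = rng.normal(size=(1000, dim))       # speaker-independent Gaussian means
A = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))   # rotation/scaling part
b = 0.1 * rng.normal(size=dim)             # bias part

# One transform shared by all Gaussians (or by a regression class of Gaussians).
adapted_means = means @ A.T + b            # speaker-adapted means
print(adapted_means.shape)                 # (1000, 39)
```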

Language Modeling

Language models help a speech recognizer estimate how likely a word sequence is, independent of the acoustics. Furthermore, language models play a vital role in resolving acoustic confusions that arise due to co-articulation, assimilation and homophones during decoding. In addition, continuous speech recognition suffers from difficulties such as variation due to sentence structure (prosody), interaction between adjacent words (cross-word co-articulation), and the absence of clear acoustic markers to delineate word boundaries. Hence, language models play a paramount role in guiding and constraining the search among the large number of alternative word hypotheses in continuous speech recognition. The N-gram language model is still the predominant choice in state-of-the-art speech recognizers. Typically, N-gram models for large vocabulary speech recognizers are trained on hundreds of millions or billions of words. In constructing such models, we usually face two problems. Firstly, the large amount of training data can lead to a large N-gram language model, which in turn leads to an excessively large hypothesis search space. Secondly, to train a domain-specific model we must deal with data sparseness, because large amounts of domain-specific data are not available.

Language modeling for speech extracted from lecture videos suffers from inadequate training data, since the main source of such text is audio transcriptions. In general, text downloaded from the web, which is often the primary source for collecting large amounts of training data, is not representative of the language encountered in lecture videos. Unfortunately, collecting large amounts of lecture videos and producing detailed transcriptions is very tedious. Also, lecture speech may contain dis-fluencies such as filled pauses, repetitions and false starts. In addition to dis-fluencies, there may be ungrammaticality and a language register different from that found in written texts. Some speakers may even use crutch words and foreign words within the lectures or during conversations.

In the present work, we have built language models (LMs) from text corpora obtained from the web. Text standardization is one of the difficult tasks in building language models for large vocabulary speech recognition. The text must be divided into sentences; we used a rule-based sentence segmentation system for this task. All punctuation marks and special symbols are removed, except symbols associated with numerals. All numerals are converted to orthographic form, including those occurring in alphanumeric words. Abbreviations are also handled, and all words are converted to lowercase.

For the generation of the language models, three varieties of corpora have been considered. Firstly, the training transcriptions are used as the base variety of the language model, to match the speech in the lecture videos. Secondly, the Wikipedia dump [30] is used as the generic variety, containing words from various domains; we downloaded the dump and converted it to plain text using an open source tool called WP2TXT [31]. Thirdly, a domain-specific corpus pertaining to the lectures was collected from the internet. Initially, we built separate tri-gram language models for the base and topic-specific corpora.
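The exact normalization rules used for these corpora are not listed in the paper; the sketch below only illustrates the kind of rule-based cleanup described above (sentence segmentation, symbol removal, lower-casing), with regular expressions of our own choosing. The numeral-to-word expansion step mentioned above is omitted here for brevity.

```python
import re

def normalize_for_lm(raw_text: str) -> list[str]:
    """Rough sketch of rule-based LM text normalization: naive sentence
    segmentation, punctuation/symbol removal, and lower-casing."""
    # Naive rule-based sentence segmentation on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text)

    cleaned = []
    for sent in sentences:
        sent = sent.lower()
        # Keep letters, digits and spaces; drop other punctuation and symbols.
        # (The paper additionally spells numerals out as words.)
        sent = re.sub(r"[^a-z0-9\s]", " ", sent)
        sent = re.sub(r"\s+", " ", sent).strip()
        if sent:
            cleaned.append(sent)
    return cleaned

print(normalize_for_lm("The cache size is 32 KB. Miss penalty, however, is high!"))
# ['the cache size is 32 kb', 'miss penalty however is high']
```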
Then, we built a bi-gram language model for the Wikipedia dump, considering the most frequent 64,000 words (words occurring more than 100 times in the corpus), using the varikn toolkit [32]. The varikn toolkit trains language models that yield a compact set of high-order n-grams, using state-of-the-art Kneser-Ney smoothing. In Kneser-Ney smoothing, a lower-order probability distribution is modified to take into account what is already modelled by the higher-order distributions; for this reason we used Kneser-Ney smoothing. These three language models are merged using the SRILM toolkit [33], as described in Fig. 3, and the merged language models are used for our speech recognition tasks.

Figure 3: The overall procedure for creating the language models.

In our experiments, we considered five domains, namely computer architecture (CA), computer networks (CN), computer organization (CO), databases (DB) and operating systems (OS). The language models for these five domains were generated individually for domain-wise recognition. Table 1 shows the language model perplexities and out-of-vocabulary (OOV) rates measured on the test transcripts.

Table 1: Perplexity of the created language models.

LM    Perplexity   Words   OOV rate (%)
CA    217.34       2529    0.12
CN    195.228      2933    0.17
CO    75.414       2284    0.04
DB    188.66       3350    0.12
OS    130.534      1231    0.32
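The interpolation weights and file names used for the SRILM merge are not reported in the paper, so the values below are placeholders. The sketch simply shows how two pairwise interpolations with SRILM's ngram tool could be driven from Python, followed by a perplexity check of the kind reported in Table 1.

```python
import subprocess

# Hypothetical file names; the interpolation weights (-lambda) are placeholders,
# since the paper does not report the values it used.
base_lm, topic_lm, wiki_lm = "base.lm", "topic.lm", "wiki.lm"

# Interpolate the base and topic-specific trigram LMs with equal weight
# (-lambda is the weight given to the model passed via -lm).
subprocess.run(["ngram", "-order", "3",
                "-lm", base_lm, "-mix-lm", topic_lm, "-lambda", "0.5",
                "-write-lm", "base_topic.lm"], check=True)

# Mix in the generic Wikipedia LM with a smaller weight (here 0.2).
subprocess.run(["ngram", "-order", "3",
                "-lm", "base_topic.lm", "-mix-lm", wiki_lm, "-lambda", "0.8",
                "-write-lm", "merged.lm"], check=True)

# Perplexity and OOV statistics of the merged model on held-out transcripts.
subprocess.run(["ngram", "-order", "3", "-lm", "merged.lm",
                "-ppl", "test_transcripts.txt"], check=True)
```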

Decoding

We have used Sphinx-4, a freely available and robust speech recognizer, as the decoder for our speech recognition tasks [34]. There are three primary modules in the Sphinx-4 framework [35]: the FrontEnd, the Decoder and the Linguist. The FrontEnd extracts features such as MFCC, PLP, LPCC, etc. The Linguist translates any standard language model, along with pronunciation information from the dictionary and structural information from one or more sets of acoustic models, into a search graph. The most important component of the Decoder block is the search manager, which may perform search algorithms such as frame-synchronous Viterbi, A*, bi-directional search, and so on. The search manager uses the features from the FrontEnd and the search graph from the Linguist to perform the actual decoding and generate results.

While setting the parameters in the decoder's configuration file, the absolute beam width, relative beam width and language weight were determined experimentally. The absolute beam width bounds the number of paths explored in every frame, and the relative beam width prunes paths whose score falls more than the beam factor below the best score. Even though a smaller beam width speeds up the search, restricting the search space risks missing potential solutions. After experimentation, we set the absolute beam width to 30000 and the relative beam width to 1E-80. Another important parameter that has to be tuned during decoding is the language weight, because it decides how much relative importance is given to the acoustic probabilities of the words in a hypothesis. A value between 6 and 13 is suggested for the language weight [36]. A low language weight gives more leeway for words with high acoustic probabilities to be hypothesized, at the risk of hypothesizing spurious words. In our experiments, we set the language weight to 10.

Experiments and Results

In the initial experimentation stage, we investigated the impact of the pronunciation differences between AE, BE and IE on speech recognition performance. This work is essential for understanding whether it is necessary to build separate acoustic models for decoding IE speech rather than using models already available for English, such as HUB-4 (AE) and WSJCAM0 (BE). For this reason, we observed the performance of the speech recognition system using HUB-4 [37], WSJCAM0 [38] and our IE acoustic model. The HUB-4 corpus contains 104 hours of broadcast news collected in 1996 and 97 hours of news broadcasts collected in 1997, made available by the Linguistic Data Consortium (LDC). In this task, we used HUB-4 acoustic models trained on 140 hours of the 1996 and 1997 HUB-4 training data [39]. The models are tri-state within-word and cross-word triphone HMMs with no skips permitted between states, with 6000 senonically tied states. The phone set for these models is that of the dictionary cmudict 0.6d [40] available on the CMU website. The British English acoustic models were trained on the WSJCAM0 corpus. The training corpus contains 90 sentences spoken by each of 92 speakers. All recorded sentences were taken from the Wall Street Journal (WSJ) text corpus, and all recordings were made in a quasi-soundproof room. The phone set used here is the same 40-phone set from the CMU dictionary. The total vocabulary of this corpus is around 5000 words.

Results and Performance Analysis of Various English Acoustic Models

We performed the analysis on test data containing speech from 14 different speakers. Even though there are many variants of Indian English, two broad varieties (North Indian and South Indian English) are considered in the present work. The test data includes 20 minutes of audio from each of 4 different NPTEL video lectures. In the results, the South Indian speakers of the NPTEL video lectures are denoted SI-6 and SI-7, and the North Indian speakers are denoted NI-6 and NI-7.
For the remaining 10 speakers, we manually recorded speech data for the operating systems domain. The recorded data consists of five North Indian (NI-1 to NI-5) and five South Indian (SI-1 to SI-5) speakers. The details of the test data set are shown in Table 2; the total test data comprises 3 hours.

Table 2: Details of the testing data set.

Speakers        Speech data (min)   No. of speakers
SI (NPTEL)      40                  2
NI (NPTEL)      40                  2
SI (recorded)   50                  5
NI (recorded)   50                  5

In the case of the British English acoustic models, the word error rates are very high (sometimes more than 100%) because they were trained on a very small speech corpus recorded in a noise-free environment and are therefore not suitable for LVCSR experiments on video lectures. Hence, we considered only the HUB-4 and IE acoustic models for the comparative analysis. The comparison of the HUB-4 and IE acoustic models in terms of WER is shown in Fig. 4.

Figure 4: The difference in WER between the HUB-4 and IE models.

From the test results, one can observe that the Indian English acoustic model performed much better than the HUB-4 model, as reflected in the large difference in WER shown in Fig. 4. The average WER of the IE acoustic model (38%) is around 34% lower than the average WER of the HUB-4 acoustic model (72%). This is because the HUB-4 acoustic model was trained entirely on American English speech, which does not match the Indian English accent.
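The WER values reported here are the standard word-level edit-distance measure. The actual numbers in the paper would have been produced by the toolkit's own alignment scripts; the following is only a minimal reference implementation of that measure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the operating system schedules processes",
                      "the operating systems schedule process"))   # 0.6
```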

Further, we adapted the HUB-4 acoustic model to Indian speakers to see whether there is any significant reduction in WER after adaptation. The average WER of the adapted HUB-4 acoustic model is 67%, while the WER of the IE acoustic model without adaptation is 38%. Even though we observed a decrease of around 5% in WER on average, the average WER of the adapted HUB-4 acoustic model is still not comparable to that of the IE acoustic model without adaptation. From these results, we concluded that building a separate IE acoustic model is essential for decoding IE speech, so we carried out our further experiments with the IE acoustic model alone.

Performance Analysis of the Adapted IE Acoustic Model

To improve the performance of the IE acoustic model, we carried out MLLR adaptation. The data set used for adapting the IE acoustic model is different from the data set used for testing; its details are shown in Table 3. The WER comparison of the IE acoustic model before and after adaptation for all speakers is shown in Fig. 5. It can be observed that after adaptation the average WER is 31%, i.e., 7% lower than the average WER before adaptation, as shown in Fig. 6. Hence, we can conclude that adaptation of the IE acoustic model helps in better recognition of IE lecture speech, as it reduces the mismatches caused by speaker characteristics.

Figure 5: The difference in WER before and after adaptation.

Figure 6: The difference in average WER before and after adaptation.

Table 3: Details of the adaptation data set.

Speakers        Speech data (min)   No. of speakers
SI (NPTEL)      20                  2
NI (NPTEL)      20                  2
SI (recorded)   25                  5
NI (recorded)   25                  5

In Fig. 7, an example of our speech recognition system output is given for a better understanding of the results. From the example, one can observe the difference in WER between the IE and HUB-4 acoustic models. It can also be seen that the lecture transcription does not match written language, and it is very difficult to obtain such corpora for building the language models.

Figure 7: An example of IE ASR system output.

Comparative Analysis of North and South IE Variants

Even though the model developed here is referred to as the IE acoustic model, from [22] it is clear that IE has many variations due to different L1 influences, which lead to distinct colorations that give rise to specific regional varieties of spoken IE.

From Fig. 8, it can be observed that on the test data set the average WER of all SI speakers is 14% lower than the average WER of all NI speakers without adaptation, and 9% lower after adaptation of the IE acoustic model. This could be because our pronunciation dictionary is more inclined towards the SI accent, as the pronunciation model is built from a dictionary that was manually created by South Indian speakers. The result highlights the dissimilarity between the North and South Indian accents, which indicates the need for multiple pronunciation dictionaries that would help build better speech recognition systems for IE varieties.

Figure 8: The difference between North Indian and South Indian WERs.

Conclusion

In the present work, we have carried out speech recognition experiments on IE video lectures. We investigated the need for building a separate acoustic model for IE rather than using existing English acoustic models such as HUB-4 (AE) or WSJCAM0 (BE) for the IE speech recognition task. From the results, it is evident that the IE acoustic model outperformed HUB-4, with a 34% lower average WER for IE speech recognition. Hence, we conclude that a separate IE acoustic model is required for IE LVCSR experiments, because IE pronunciation differs considerably from American and British English accents. Next, we investigated the performance of our IE acoustic model on the IE lecture recognition task. The average WER before and after adaptation is 38% and 31%, respectively. Even though our IE acoustic model is trained with limited training data (around 23 hours) and the corpora used for building the language models do not mimic the language spoken in the video lectures, the results are promising and comparable to the results reported for AE lecture recognition in the literature. Further, we observed that South Indian speech is recognized better than North Indian speech, since our pronunciation dictionary is inclined towards the South Indian accent. There are two possible future directions. One is to improve the performance of the IE acoustic model by adding large-vocabulary speech corpora for Indian English to the existing training set. The second is to deal with the discrepancies between variants of Indian English accents by building pronunciation models for different accents.

References

[1] Kavi Narayana Murthy and G. Bharadwaja Kumar. Language identification from small text samples. Journal of Quantitative Linguistics, 13(1):57-80, 2006.
[2] Adrian Akmajian. Linguistics: An Introduction to Language and Communication. MIT Press, 2001.
[3] Hamid Behravan. Dialect and accent recognition. PhD thesis, University of Eastern Finland, 2012.
[4] John C. Wells. Accents of English, volume 1. Cambridge University Press, 1982.
[5] Braj B. Kachru. The Indianization of English: The English Language in India. Oxford University Press, Oxford, 1983.
[6] Andreas Sedlatschek. Contemporary Indian English: Variation and Change. John Benjamins Publishing, 2009.
[7] Ravinder Gargesh. Indian English: Phonology. In Bernd Kortmann et al., Varieties of English: Africa, South and Southeast Asia, Mouton de Gruyter, pages 231-243, 2008.
[8] Peri Bhaskararao. English in contemporary India. ABD (Asian/Pacific Book Development), 33:5-7, 2002.
[9] NPTEL. http://nptel.ac.in/.
[10] K. Samudravijaya, R. Ahuja, N. Bondale, T. Jose, S. Krishnan, P. Poddar, P. V. S. Rao, and R. Raveendran. A feature-based hierarchical speech recognition system for Hindi. Sadhana (Academy Proceedings in Engineering Sciences), 23:313-340, 1998.
[11] Mohit Kumar, Nitendra Rajput, and Ashish Verma. A large-vocabulary continuous speech recognition system for Hindi. IBM Journal of Research and Development, 48(5.6):703-715, 2004.
[12] Rohit Kumar, S. Kishore, Anumanchipalli Gopalakrishna, Rahul Chitturi, Sachin Joshi, Satinder Singh, and R. Sitaram. Development of Indian language speech databases for large vocabulary speech recognition systems. Proceedings of SPECOM, 2005.
[13] Pratyush Banerjee, Gaurav Garg, Pabitra Mitra, and Anupam Basu. Application of triphone clustering in acoustic modeling for continuous speech recognition in Bengali. 19th International Conference on Pattern Recognition (ICPR 2008), pages 1-4, 2008.
[14] R. Thangarajan, A. M. Natarajan, and M. Selvam. Word and triphone based approaches in continuous speech recognition for Tamil language. WSEAS Transactions on Signal Processing, 4(3):76-86, 2008.
[15] G. Lakshmi Sarada, A. Lakshmi, Hema A. Murthy, and T. Nagarajan. Automatic transcription of continuous speech into syllable-like units for Indian languages. Sadhana, 34(2):221-233, 2009.
[16] Y. Ma, M. P. Paulraj, S. Yaacob, A. B. Shahriman, and S. K. Nataraj. Speaker accent recognition through statistical descriptors of mel-bands spectral energy and neural network model. IEEE Conference on Sustainable Utilization and Development in Engineering and Technology, pages 262-267, 2012.

[17] Chao Huang, Tao Chen, and Eric Chang. Accent issues in large vocabulary continuous speech recognition. International Journal of Speech Technology, 7(2-3):141-153, 2004.
[18] Herman Kamper, Félicien Jeje Muamba Mukanya, and Thomas Niesler. Multi-accent acoustic modelling of South African English. Speech Communication, 54(6):801-813, 2012.
[19] Kaustubh Kulkarni, Sohini Sengupta, V. Ramasubramanian, Josef G. Bauer, and Georg Stemmer. Accented Indian English ASR: Some early results. IEEE Spoken Language Technology Workshop, pages 225-228, 2008.
[20] Shamalee Deshpande, Sharat Chikkerur, and Venu Govindaraju. Accent classification in speech. Fourth IEEE Workshop on Automatic Identification Advanced Technologies, pages 139-143, 2005.
[21] Olga Kalasnhnik and Janet Fletcher. An acoustic study of vowel contrasts in North Indian English. Proceedings of the XVI International Congress of Phonetic Sciences, Germany, pages 953-956, 2007.
[22] Shrikant Joshi and Preeti Rao. Acoustic models for pronunciation assessment of vowels of Indian English. International Conference on O-COCOSDA/CASLRE, pages 1-6, 2013.
[23] Caroline R. Wiltshire and James D. Harnsberger. The influence of Gujarati and Tamil L1s on Indian English: A preliminary study. World Englishes, 25(1):91-104, 2006.
[24] Sirsa Hema and Redford Melissa A. The effects of native language on Indian English sounds and timing patterns. Journal of Phonetics, 41(6):393-406, 2013.
[25] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(8):434-451, 2008.
[26] Daniel D. K. Sleator and Davy Temperley. Parsing English with a link grammar. arXiv preprint cmp-lg/9508004, 1995.
[27] SphinxTrain. http://cmusphinx.sourceforge.net/wiki/tutorialam.
[28] Senones. http://cmusphinx.sourceforge.net/wiki/tutorialconcepts.
[29] Mark J. F. Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75-98, 1998.
[30] Wikipedia. http://en.wikipedia.org/wiki/Wikipedia:Database_download.
[31] WP2TXT. https://github.com/yohasebe/wp2txt.
[32] Vesa Siivola, Mathias Creutz, and Mikko Kurimo. Morfessor and VariKN machine learning tools for speech and language technology. INTERSPEECH, 2007.
[33] Andreas Stolcke et al. SRILM - an extensible language modeling toolkit. INTERSPEECH, 2002.
[34] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems, Inc., 2004.
[35] Paul Lamere, Philip Kwok, William Walker, Evandro B. Gouvea, Rita Singh, Bhiksha Raj, and Peter Wolf. Design of the CMU Sphinx-4 decoder. INTERSPEECH, 2003.
[36] Sphinx Tutorial. http://www.speech.cs.cmu.edu/sphinx/tutorial.html.
[37] Yonghong Yan, Xintian Wu, Johan Schalkwyk, and Ron Cole. Development of CSLU LVCSR: the 1997 DARPA HUB4 evaluation system. Complexity, 24(14):7-27, 1998.
[38] Jeroen Fransen, Dave Pye, Tony Robinson, Phil Woodland, and Steve Young. WSJCAM0 corpus and recording description. 1994.
[39] CMU. http://www.speech.cs.cmu.edu/sphinx/models/hub4opensrc_jan2002/info_ABOUT_MODELS.
[40] Sphinx Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.