Investigation of Indian English Speech Recognition using CMU Sphinx

Disha Kaur Phull
School of Computing Science & Engineering, VIT University Chennai Campus, Tamil Nadu, India.

G. Bharadwaja Kumar
School of Computing Science & Engineering, VIT University Chennai Campus, Tamil Nadu, India.

Abstract - In recent years, research on speech recognition has devoted much attention to the automatic transcription of speech data such as broadcast news (BN), medical transcription, etc. Large Vocabulary Continuous Speech Recognition (LVCSR) systems have been developed successfully for Englishes (American English (AE), British English (BE), etc.) and for other languages, but for Indian English (IE) the work is still at an infancy stage. IE is one of the varieties of English spoken in the Indian subcontinent and differs considerably from the English spoken in other parts of the world. In this paper, we present our work on LVCSR of IE video lectures. The speech data comprises 23 hours of video lectures on various engineering subjects given by experts from all over India as part of the NPTEL project. We have used CMU Sphinx for training and decoding in our large vocabulary continuous speech recognition experiments. The results demonstrate that building an IE acoustic model for IE speech recognition is essential: it yields a 34% lower average word error rate (WER) than the HUB-4 acoustic models. The average WER before and after adaptation of the IE acoustic model is 38% and 31% respectively. Even though our IE acoustic model is trained with limited training data and the corpora used for building the language models do not mimic the spoken language, the results are promising and comparable to the results reported for AE lecture recognition in the literature.

Keywords: CMU Sphinx, Indian English, Lecture Recognition.

Introduction

Automatic Speech Recognition (ASR) is the state-of-the-art technology that converts speech into text, making it easier both to create and to use information. The ultimate goal of ASR research is to allow a computer to recognize, in real time and with 100% accuracy, all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics or accent. During the past few decades, substantial progress has been reported in ASR for many languages such as English, Finnish and German. Recently, there has been growing interest in large vocabulary continuous speech recognition (LVCSR) research for Indian Languages (IL), and several works have been carried out for Indian languages such as Tamil, Telugu, Bengali and Hindi. However, speech recognition work on Indian English (IE) has not received as much attention as these languages. The languages spoken in India belong to four major language families: Indo-Aryan, Dravidian, Austro-Asiatic, and Sino-Tibetan. In keeping with India's vast population, the figures relating to languages are also very impressive. The Indian constitution has given official status to 22 Indian languages as well as English, and many other languages are spoken besides these. Linguists believe that there are nearly 150 different languages and about 2000 dialects in India [1]. Here, dialect refers to variation at all linguistic levels, i.e., vocabulary, idiom, grammar and pronunciation.
Differences among dialects are mainly due to regional and social factors, and these differences appear in pronunciation, vocabulary and grammar [2]. Accent refers to the variety in pronunciations of a certain language and to the sounds that exist in a person's language [3]. The term IE is commonly used to refer to English spoken as a second language in India [4]. IE plays the role of a lingua franca [5]. IE has many distinctive pronunciations, some distinctive syntax and quite a bit of lexical variation. Any linguistic description seeking to characterize IE must take cognizance of its highly variable nature, as it comes in a range of varieties, both regional and social [6]. Indian English accents vary greatly. Pronunciation is strongly influenced by the speaker's native language and educational background. Another major reason for variation is that IE rhythm follows the rhythm of Indian languages [7], i.e., it is syllable-timed (roughly equal time is taken to utter each syllable). English, by contrast, is known to be a stress-timed language where only certain syllables and words in a sentence or phrase are stressed, an important feature of Received Pronunciation (RP). Stressing syllables and words correctly is therefore often an area of great difficulty for speakers of IE. The extent to which Indian features of pronunciation occur in the speech of an individual varies from person to person. In [8], Peri Bhaskararao compared Indian English with British English (BE) pronunciation. Diphthongs in BE correspond to pure long vowels in Indian pronunciation (e.g. cake and poor pronounced as /ke:k/ and /pu:r/, respectively); the alveolar sounds /t/ and /d/ of British Received Pronunciation (BRP) are pronounced as retroflexes (harsher sounds); the dental fricatives /θ/ and /ð/ are replaced by a soft th and a soft d (e.g. thick is pronounced as /thik/ rather than /θik/); /v/ and /w/ of BRP are both pronounced somewhat like /w/ in many parts of India, and they are usually merged with /b/ in Bengali, Assamese and Oriya pronunciations of English.

Some words that are not found in Englishes elsewhere are used in IE. These are either innovations or translations of native words or phrases; examples include cousin brother (for male cousin), prepone (advance or bring forward in time), and foreign-returned (returned from abroad). There are Indianisms in grammar, such as the pluralization of non-count nouns (e.g. breads, foods, advices) and the use of the present progressive for the simple present (I am knowing). In IE, aspiration is often absent in word-initial position: cat is pronounced with /k/ rather than /kh/, despite the phonemic contrast between the unvoiced unaspirated velar /k/ and the unvoiced aspirated velar /kh/. Some fricatives also become bilabial, and interdentals are lacking in IE. In this paper, we present our experiments on large vocabulary continuous speech recognition for Indian English. The Indian English speech data is extracted from NPTEL videos. NPTEL is a government-funded project that provides e-learning through online web and video courses in engineering, science and humanities streams [9]. The vision of this project is to provide lectures by experts from prominent educational institutions for the benefit of students in various educational institutions across India. Currently, there are lectures by 130 speakers on various subjects. The organization of the paper is as follows: Section 2 briefly summarizes ASR work on Indian English as well as other Indian languages, and also accent-based ASR work for some other languages. Section 3 describes the experimental setup and methodology for IE speech recognition. Section 4 describes our experiments and results.

A Brief Survey on Indian Language Speech Recognition

In India, early work on large vocabulary speech recognition started with the Hindi language in the late 1990s. Samudravijaya et al. [10] proposed a speech recognition system for Hindi which follows a hierarchical approach to recognition. Kumar et al. [11] proposed a large vocabulary continuous speech recognition system for Hindi based on the IBM ViaVoice speech recognizer; the system gives a word accuracy of 75% to 95%. Gopalakrishna et al. [12] carried out medium-vocabulary speech recognition using Sphinx for three languages, Marathi, Telugu and Tamil, in different environments (landline and cellphone). They obtained word error rates (WER) of around 20.7%, 19.4% and 15.4% on landline data and 23.6%, 17.6% and 18.3% on cellphone data for Marathi, Tamil and Telugu respectively. Pratyush Banerjee et al. [13] used the Hidden Markov Model (HMM) toolkit for Bengali continuous speech recognition, obtaining an average recognition rate of 76.33% for male speakers and 52.34% for female speakers. Thangarajan et al. [14] carried out experiments using triphone-based models for Tamil speech recognition and achieved 88% accuracy over limited data. They also tried context-independent syllable models [2] for Tamil speech recognition, which underperformed context-dependent phone models. Lakshmi Sarada et al. [15] applied a group-delay based algorithm to automatically segment and label continuous speech into syllable-like units for Indian languages, with a feature extraction technique that uses features extracted at multiple frame sizes and frame rates. They achieved recognition rates of 48.7% and 45.36% for Tamil and Telugu respectively.
Ma et al. [16] classified three accents of English recorded from the three main ethnic groups in Malaysia, namely Malay, Chinese and Indian. They used only statistical descriptors of Mel-band spectral energies with a neural network as the recognition engine. They ran these experiments on three independent test sets of 20%, 30%, and 40% of the total samples and achieved an average classification rate of 95.59%. Huang et al. [17] carried out extensive experiments to evaluate the effect of accent on speech recognition using the Microsoft Mandarin speech engine for three different Mandarin accents: Beijing, Shanghai and Guangdong. They found an increase of about 40-50% in character error rate for cross-accent speech recognition. Herman Kamper et al. [18] investigated ways to combine speech data from five South African accents of English in order to improve overall speech recognition performance. Three acoustic modeling approaches were considered: separate accent-specific models, accent-independent models and multi-accent models. They found that multi-accent models, obtained by introducing accent-based questions into the decision tree clustering, outperformed the other approaches in both phone and word recognition experiments. Only a small amount of work has been carried out on Indian English speech recognition; we describe a few IE ASR works here. Kulkarni et al. [19] studied the effect of accent variability on the performance of an Indian English ASR task. They carried out this work on the LILA Indian English database, which covers 10 different Indian accents, using the Siemens SpeechAdvance ASR server. They trained three different HMM sets: accent-specific models, accent-pooled models (combining all the accent-specific training data), and models trained on a reduced set of the accent-pooled training data. They found that the accent-pooled training set performed well on a phonetically rich isolated-word recognition task. Deshpande et al. [20] distinguished between AE and IE using the second and third formant frequencies of specific accent markers. A simple Gaussian Mixture Model (GMM) was used for classification. The second and third formant frequencies were calculated using LPC roots, imposing constraints on the bandwidth and range of each formant. Their results show that the formant frequencies of these accent markers alone are enough to discriminate the two accent groups. Olga et al. [21] performed acoustic-phonetic analyses of vowels produced by North Indians whose second language is English, and concluded that North Indian English is a separate variety of IE. Srikant Joshi et al. [22] observed that, for vowels common to the two languages, IE speech is better represented by Hindi speech models than by AE models. The study of Wiltshire et al. [23] revealed that both phonemic and phonological influences of the native language appear in the segmental and supra-segmental properties of proficient IE speakers' accents.

The investigation of Hema et al. [24] into the sound structure of Indian English indicated that the L1 (native language) effect in IE may reflect incomplete acquisition of the target phonology as well as the influence of sociolinguistic factors on the use and evolution of IE.

Experimental Setup

There are three basic steps in building our Indian English LVCSR system: phonetic dictionary creation, acoustic modeling and language modeling.

Creating Phonetic Dictionary

Since most Indian languages are phonetic in nature, Grapheme-to-Phoneme (G2P) conversion for them needs only simple mapping tables and rules for the lexical representation. However, IE pronunciation differs considerably from American and other English pronunciations, and also varies with regional and educational background within India itself. Hence, phonetic dictionary creation is a non-trivial task for IE. Initially, we manually created a phonetic dictionary of around 20,000 words, covering the words in the training corpus and other frequent English words. Our phonetic dictionary contains 41 phones specific to the Indian accent. We then built basic pronunciation models on these 20K words using the Sequitur G2P software, a data-driven grapheme-to-phoneme converter based on joint-sequence models [25]. Next, we applied these models to the Link Grammar Parser's dictionary [26] to obtain a larger pronunciation dictionary, corrected it manually, and rebuilt our G2P models on this larger dictionary. Finally, we used these G2P models to produce the pronunciation dictionary used in our speech recognition experiments. Currently, the pronunciations in this dictionary mostly match the IE spoken in the Andhra Pradesh region, since the dictionary was created and corrected by Telugu speakers. In future, we plan to build G2P models for various accents of IE.

Acoustic Modeling

Acoustic modeling typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. In the present work, we used SphinxTrain [27] for building the acoustic model. The overall process followed by SphinxTrain for creating the acoustic models is shown in Fig. 1.

Figure 1: The process involved in acoustic modeling.

NPTEL lecture videos were used for building the Indian English acoustic models, as shown in Fig. 2. The video lectures cover various science and engineering topics taught at IITs and other premier institutes; the speakers come from various regions of India and speak various accents of Indian English. We transcribed the lecture videos of 75 speakers in order to train the acoustic model. The data was video-recorded at a 44 kHz sampling frequency; we converted it to WAV format and down-sampled it to 16 kHz, 16-bit mono. We then manually transcribed the audio files, taking a minimum of 15 minutes of speech per speaker; the total speech data comprises 23 hours. Mel-frequency cepstral coefficients and their derivatives were used as features. We then built three-state context-dependent HMMs for each phone. After several experiments, we set the number of Gaussians per mixture to 32 and chose the number of senones [28] for decision-tree clustering accordingly, since our speech data comprises only 23 hours and the vocabulary is limited.

Figure 2: The process followed for creating the IE acoustic model.
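
As a rough illustration of this data-preparation step, the sketch below converts one lecture's audio to 16 kHz, 16-bit mono and computes MFCCs with deltas. It assumes ffmpeg, SciPy and the python_speech_features package are available; the file names are hypothetical and this is not the exact pipeline used in the paper.

    import subprocess
    from scipy.io import wavfile
    from python_speech_features import mfcc, delta

    # Extract the audio track and down-sample to 16 kHz, 16-bit mono
    # (hypothetical file names; requires ffmpeg on the PATH).
    subprocess.run(["ffmpeg", "-y", "-i", "lecture01.mp4",
                    "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le",
                    "lecture01.wav"], check=True)

    rate, signal = wavfile.read("lecture01.wav")

    # 13 MFCCs per 25 ms frame with a 10 ms shift, plus delta and
    # delta-delta, giving the 39-dimensional vectors commonly used with Sphinx.
    feats = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
    d1 = delta(feats, 2)
    d2 = delta(d1, 2)
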
Adaptation

The goal of adaptation techniques is to flex speaker-independent models toward speaker-dependent ones using far less data than would be needed for full speaker-dependent training. Many state-of-the-art LVCSR systems use speaker-adapted models to improve robustness with respect to speaker variability. The HMM models of the ASR system are adapted using Maximum Likelihood Linear Regression (MLLR) [29], which transforms the speaker-independent models toward the speaker by capturing speaker-specific information. MLLR adapts the observation probabilities of an HMM in a parametric way by finding a transform that maximizes the likelihood of the adaptation data given the transformed Gaussian parameters.
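
Concretely, in the standard formulation of [29], MLLR re-estimates each Gaussian mean through an affine transform shared across a regression class:

    \hat{\mu} = A\mu + b = W\xi, \qquad \xi = [\,1,\ \mu^{\top}]^{\top}

where W = [b A] is estimated (via EM) to maximize the likelihood of the adaptation data under the transformed Gaussians N(o_t; W\xi, \Sigma). Because a single transform is shared by many Gaussians, a robust estimate can be obtained from limited adaptation data.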

Language Modeling

Language models help a speech recognizer judge how likely a word sequence is, independently of the acoustics. They play a vital role in resolving the acoustic confusions that arise from co-articulation, assimilation and homophones during decoding. In addition, continuous speech recognition suffers from difficulties such as variation due to sentence structure (prosody), interaction between adjacent words (cross-word co-articulation), and the absence of clear acoustic markers delineating word boundaries. Hence, language models are paramount in guiding and constraining the search among the large number of alternative word hypotheses in continuous speech recognition. The N-gram language model is still the predominant choice in state-of-the-art speech recognizers. Typically, N-gram models for large vocabulary speech recognizers are trained on hundreds of millions or billions of words. In constructing such models we usually face two problems. First, a large amount of training data can lead to a large N-gram language model, which in turn leads to an excessively large hypothesis search space. Second, to train a domain-specific model we must deal with data sparseness, because large amounts of domain-specific data are not available. Language modeling for speech extracted from lecture videos suffers from inadequate training data, since the main source of such text is audio transcriptions. In general, text downloaded from the web, often the primary source for collecting large amounts of training data, is not representative of the language encountered in lecture videos, and collecting large amounts of lecture videos and producing detailed transcriptions is very tedious. Moreover, lecture speech may contain disfluencies such as filled pauses, repetitions, and false starts, as well as ungrammaticality and a language register different from the one found in written texts. Some speakers may even use crutch words and foreign words within the lectures or during conversations. In the present work, we have engineered language models (LMs) from text corpora obtained from the web. Text standardization is one of the more difficult tasks in building language models for large vocabulary speech recognition. Text must first be divided into sentences; we used a rule-based sentence segmentation system for this task. All punctuation marks and special symbols are removed, except symbols associated with numerals. All numerals are converted to their orthographic form, including those in alphanumeric words. Abbreviations are also handled, and all words are converted to lowercase (a sketch of such a pipeline is given below). For the generation of the language models, three varieties of corpora were considered. First, the training transcriptions serve as the base component of the language model, to match the speech in the lecture videos. Second, the Wikipedia dump [30] serves as the generic component containing words from various domains; we downloaded the dump and converted it to plain text using an open source tool called WP2TXT [31]. Third, domain-specific corpora pertaining to the lectures were collected from the internet. Initially, we built separate tri-gram language models for the base and topic-specific corpora.
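
A minimal sketch of this kind of rule-based normalization, with illustrative rules that are our assumptions rather than the exact ones used in the paper (the num2words package is assumed for spelling out numerals):

    import re
    from num2words import num2words  # assumed third-party package

    def normalize(text):
        # Rule-based sentence segmentation on terminal punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        out = []
        for s in sentences:
            # Spell out numerals orthographically, e.g. "42" -> "forty-two".
            s = re.sub(r"\d+", lambda m: num2words(int(m.group())), s)
            # Drop punctuation and special symbols, then lowercase.
            s = re.sub(r"[^A-Za-z\s'-]", " ", s).lower()
            out.append(" ".join(s.split()))
        return out

    print(normalize("The CPU has 2 cores. It runs at 3 GHz!"))
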
We then built a bi-gram language model for the Wikipedia dump, using the 64,000 most frequent words (words occurring more than 100 times in the corpus), with the VariKN toolkit [32]. VariKN trains compact sets of high-order n-grams using state-of-the-art Kneser-Ney smoothing, in which a lower-order probability distribution is modified to take into account what is already modelled by the higher-order distributions; this is why we chose Kneser-Ney smoothing. These three language models were then merged using the SRILM toolkit [33], as described in Fig. 3, and the merged language models were used for our speech recognition tasks.

Figure 3: The overall procedure for creating the Language Models.

In our experiments, we considered five different domains, namely computer architecture (CA), computer networks (CN), computer organization (CO), databases (DB) and operating systems (OS). The language models for these five domains were generated individually for domain-wise recognition. Table 1 reports the perplexity and out-of-vocabulary (OOV) rates of these language models when evaluated on the test transcripts.

Table 1: Perplexity, vocabulary size and OOV rate (%) of the language models for the CA, CN, CO, DB and OS domains.
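
The merging and perplexity-evaluation steps can be scripted around SRILM's ngram tool; a rough sketch, with interpolation weights and file names as illustrative assumptions:

    import subprocess

    # Statically interpolate the base (transcription), Wikipedia and domain
    # LMs. -lambda weights the first LM, -mix-lambda2 the second mixture LM,
    # and the remainder goes to -mix-lm (weights here are hypothetical).
    subprocess.run([
        "ngram", "-lm", "base.trigram.lm", "-lambda", "0.5",
        "-mix-lm", "wiki.bigram.lm",
        "-mix-lm2", "domain.trigram.lm", "-mix-lambda2", "0.3",
        "-write-lm", "merged.lm",
    ], check=True)

    # Report perplexity and OOV counts on the test transcripts (as in Table 1).
    subprocess.run(["ngram", "-lm", "merged.lm", "-ppl", "test_transcripts.txt"],
                   check=True)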

Decoding

We used Sphinx-4, a freely available, robust speech recognizer, as the decoder for our speech recognition tasks [34]. There are three primary modules in the Sphinx-4 framework [35]: the FrontEnd, the Decoder, and the Linguist. The FrontEnd extracts features such as MFCC, PLP, LPCC, etc. The Linguist translates any standard language model, together with pronunciation information from the dictionary and structural information from one or more sets of acoustic models, into a search graph. The most important component of the Decoder block is the search manager, which may run search algorithms such as frame-synchronous Viterbi, A*, bi-directional search, and so on. The search manager uses the features from the FrontEnd and the search graph from the Linguist to perform the actual decoding and generate results. When setting the parameters in the decoder's configuration file, the absolute beam width, relative beam width and language weight were determined experimentally. The absolute beam width caps the number of paths explored in every frame, while the relative beam width prunes paths whose score falls below the best score multiplied by the beam. Although a smaller beam width speeds up the search, restricting the search space may discard potential solutions. After experimentation, we fixed the absolute beam width empirically and set the relative beam width to 1E-80. Another important factor to tune during decoding is the language weight, which decides how much relative importance is given to the acoustic probabilities of the words in a hypothesis. A value between 6 and 13 is suggested for the language weight [36]; a low language weight gives more leeway for words with high acoustic probabilities to be hypothesized, at the risk of hypothesizing spurious words. In our experiments, we set the language weight to 10.
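
Sphinx-4 itself is configured through an XML file; purely as an illustration, the analogous knobs in the classic CMU PocketSphinx Python bindings look as follows (model paths are hypothetical, and this is not the configuration used in the paper):

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string("-hmm", "ie_acoustic_model")  # hypothetical IE model dir
    config.set_string("-lm", "merged.lm")           # merged language model
    config.set_string("-dict", "ie_pron.dict")      # IE pronunciation dictionary
    config.set_float("-beam", 1e-80)                # relative pruning beam
    config.set_float("-lw", 10.0)                   # language weight
    decoder = Decoder(config)
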
Experiments and Results

In the initial experimentation stage, we investigated the impact of the pronunciation differences between AE, BE and IE on speech recognition performance. This is essential to establish whether a separate acoustic model should be built for decoding IE speech rather than reusing models already available for English, such as HUB-4 (AE) and WSJCAM0 (BE). To this end, we compared the performance of the speech recognition system using the HUB-4 [37] and WSJCAM0 [38] acoustic models against the IE acoustic model developed by us. The HUB-4 corpus contains 104 hours of broadcast news collected in 1996 and 97 hours collected in 1997, made available by the Linguistic Data Consortium (LDC). We used HUB-4 acoustic models trained on 140 hours of the 1996 and 1997 HUB-4 training data [39]. These models are three-state within-word and cross-word triphone HMMs with no skips permitted between states, comprising 6000 senonically tied states. Their phone set is that of the cmudict0.6d dictionary [40] available on the CMU website. The British English acoustic models were trained on the WSJCAM0 corpus, whose training set contains 92 speakers each reading 90 sentences taken from the Wall Street Journal (WSJ) text corpus. All recordings were made in a quasi-soundproof room. The phone set is the same 40-phone set from the CMU dictionary, and the total vocabulary of this corpus is around 5000 words.

Results and Performance Analysis of Various English Acoustic Models

We performed the analysis on test data containing speech from 14 different speakers. Although there are many variants of Indian English, two broad varieties (North Indian and South Indian English) are considered in the present work. The test data includes 20 minutes of audio from each of 4 different NPTEL video lectures; in the results, the South Indian NPTEL speakers are denoted SI-6 and SI-7 and the North Indian NPTEL speakers NI-6 and NI-7. For the remaining 10 speakers, we manually recorded speech data for the operating systems domain: five North Indian (NI-1 to NI-5) and five South Indian (SI-1 to SI-5) speakers. The details of the test data set are shown in Table 2; the total test data comprises 3 hours.

Table 2: Details of the testing data set.

  Speakers       Speech data (min)   No. of speakers
  SI (NPTEL)     40                  2
  NI (NPTEL)     40                  2
  SI (recorded)  50                  5
  NI (recorded)  50                  5

With the British English acoustic models, the word error rates were very high (sometimes above 100%), because these models were trained on a very small speech corpus recorded in a noise-free environment and are therefore unsuited to LVCSR experiments on video lectures. Hence, only the HUB-4 and IE acoustic models were considered for comparative analysis; their WERs are compared in Fig. 4.

Figure 4: The difference in WER between the HUB-4 and IE models.

From the test results, one can observe that the Indian English acoustic model performed much better than the HUB-4 model, as the large WER gap in Fig. 4 shows. The average WER of the IE acoustic model (38%) is around 34% (absolute) lower than the average WER of the HUB-4 acoustic model (72%). This is because the HUB-4 acoustic model was trained entirely on American-accented English, which does not match the Indian English accent.
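
For reference, WER is the word-level edit distance between the reference and hypothesis transcripts, normalized by the reference length; a small self-contained sketch:

    def wer(ref, hyp):
        """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
        r, h = ref.split(), hyp.split()
        # Dynamic-programming edit distance over words (Levenshtein).
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / len(r)

    print(wer("the process scheduler runs", "a process scheduler run"))  # 0.5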

Further, we adapted the HUB-4 acoustic model to Indian speakers to see whether adaptation yields a significant reduction in WER. The average WER of the adapted HUB-4 acoustic model is 67%, while the IE acoustic model without adaptation achieves 38%. Even though adaptation reduced the WER by around 5% (absolute) on average, the adapted HUB-4 model is still not comparable to the unadapted IE acoustic model. From these results, we concluded that building a separate IE acoustic model is essential for decoding IE speech, so we carried out all further experiments with the IE acoustic model alone.

Performance Analysis of the Adapted IE Acoustic Model

To improve the performance of the IE acoustic model, we carried out MLLR adaptation. The adaptation data set, which is disjoint from the test set, is described in Table 3.

Table 3: Details of the adaptation data set.

  Speakers       Speech data (min)   No. of speakers
  SI (NPTEL)     20                  2
  NI (NPTEL)     20                  2
  SI (recorded)  25                  5
  NI (recorded)  25                  5

The WER of the IE acoustic model before and after adaptation is compared for all speakers in Fig. 5. After adaptation, the average WER is 31%, i.e. 7% (absolute) below the average WER before adaptation, as shown in Fig. 6. Hence, we can conclude that adapting the IE acoustic model yields better recognition of IE lecture speech, as it reduces the mismatches caused by speaker characteristics.

Figure 5: The difference in WER before and after adaptation.

Figure 6: The difference in average WER before and after adaptation.

In Fig. 7, an example of our speech recognition system's output is given for a better understanding of the results. From the example, one can observe the difference in WER between the IE and HUB-4 acoustic models. It also illustrates that lecture transcriptions do not match written language, and that corpora of this kind are very difficult to obtain for building language models.

Figure 7: An example of IE ASR system output.
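
The MLLR step above is typically run with the SphinxTrain tools bw (accumulate statistics over the adaptation data) and mllr_solve (estimate the transform). A rough sketch follows; the paths, flags and file names are illustrative assumptions patterned on the CMU Sphinx adaptation tutorial, not the exact commands used here:

    import subprocess

    # Accumulate observation statistics over the adaptation utterances.
    subprocess.run([
        "bw", "-hmmdir", "ie_acoustic_model",
        "-moddeffn", "ie_acoustic_model/mdef",
        "-ts2cbfn", ".cont.", "-feat", "1s_c_d_dd",
        "-dictfn", "ie_pron.dict",
        "-ctlfn", "adapt.fileids",          # list of adaptation utterances
        "-lsnfn", "adapt.transcription",    # their transcripts
        "-accumdir", "accum",
    ], check=True)

    # Estimate a global MLLR transform of the Gaussian means.
    subprocess.run([
        "mllr_solve",
        "-meanfn", "ie_acoustic_model/means",
        "-varfn", "ie_acoustic_model/variances",
        "-accumdir", "accum",
        "-outmllrfn", "mllr_matrix",
    ], check=True)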

Comparative Analysis of North and South IE Variants

Even though the model developed here is referred to as the IE acoustic model, it is clear from [22] that IE has many variations due to different L1 influences, which lend distinct colorations and give rise to specific regional varieties of spoken English. From Fig. 8, it can be observed that on the test data set the average WER of the SI speakers is 14% lower than that of the NI speakers without adaptation, and 9% lower after adaptation of the IE acoustic model. This is probably because our pronunciation dictionary is inclined towards the SI accent, as the pronunciation model was built from a dictionary manually created by South Indian speakers. The result highlights the dissimilarity between the North and South Indian accents and points to the need for multiple pronunciation dictionaries, which would help build a better speech recognition system for the IE varieties.

Figure 8: The difference between North Indian and South Indian WERs.

Conclusion

In the present work, we carried out speech recognition experiments on IE video lectures. We investigated the need for building a separate acoustic model for IE rather than using existing English acoustic models such as HUB-4 (AE) or WSJCAM0 (BE) for the IE speech recognition task. From the results, it is evident that the IE acoustic model outperformed HUB-4, with a 34% lower average WER for IE speech recognition. Hence, we conclude that a separate IE acoustic model is required for IE LVCSR experiments, because IE pronunciation differs substantially from American and British English accents. Next, we investigated the performance of our IE acoustic model on the IE lecture recognition task; the average WER before and after adaptation is 38% and 31% respectively. Even though our IE acoustic model is trained with limited data (around 23 hours) and the corpora used for building the language models do not mimic the language spoken in the video lectures, the results are promising and comparable to the results reported for AE lecture recognition in the literature. Further, we observed that South Indian speech is better recognized than North Indian speech, which we attribute to our pronunciation dictionary being inclined towards the South Indian accent. There are two possible future directions: one is to improve the IE acoustic model by adding large-vocabulary speech corpora for Indian English to the existing training set; the other is to address the discrepancies between variants of Indian English by building pronunciation models for different accents.

References

[1] Kavi Narayana Murthy and G Bharadwaja Kumar. Language identification from small text samples. Journal of Quantitative Linguistics, 13(1):57-80.
[2] Adrian Akmajian. Linguistics: An Introduction to Language and Communication. MIT Press.
[3] Hamid Behravan. Dialect and Accent Recognition. PhD thesis, University of Eastern Finland.
[4] John C Wells. Accents of English, volume 1. Cambridge University Press.
[5] Braj B Kachru. The Indianization of English: The English Language in India. Oxford University Press, Oxford.
[6] Andreas Sedlatschek. Contemporary Indian English: Variation and Change. John Benjamins Publishing.
[7] Ravinder Gargesh. Indian English: Phonology. In Bernd Kortmann et al. (eds.), Varieties of English: Africa, South and Southeast Asia. Mouton de Gruyter.
[8] Peri Bhaskararao. English in contemporary India. ABD (Asian/Pacific Book Development), 33(2002):5-7.
[9] NPTEL.
[10] K Samudravijaya, R Ahuja, N Bondale, T Jose, S Krishnan, P Poddar, PVS Rao, and R Raveendran. A feature-based hierarchical speech recognition system for Hindi. Sadhana (Academy Proceedings in Engineering Sciences), 23.
[11] Mohit Kumar, Nitendra Rajput, and Ashish Verma. A large-vocabulary continuous speech recognition system for Hindi. IBM Journal of Research and Development, 48(5.6).
[12] Rohit Kumar, S Kishore, Anumanchipalli Gopalakrishna, Rahul Chitturi, Sachin Joshi, Satinder Singh, and R Sitaram. Development of Indian language speech databases for large vocabulary speech recognition systems. Proceedings of SPECOM.
[13] Pratyush Banerjee, Gaurav Garg, Pabitra Mitra, and Anupam Basu. Application of triphone clustering in acoustic modeling for continuous speech recognition in Bengali. 19th International Conference on Pattern Recognition (ICPR 2008), pages 1-4.
[14] R Thangarajan, AM Natarajan, and M Selvam. Word and triphone based approaches in continuous speech recognition for Tamil language. WSEAS Transactions on Signal Processing, 4(3):76-86.
[15] G Lakshmi Sarada, A Lakshmi, Hema A Murthy, and T Nagarajan. Automatic transcription of continuous speech into syllable-like units for Indian languages. Sadhana, 34(2).
[16] Y Ma, MP Paulraj, S Yaacob, AB Shahriman, and SK Nataraj. Speaker accent recognition through statistical descriptors of mel-bands spectral energy and neural network model. IEEE Conference on Sustainable Utilization and Development in Engineering and Technology.

[17] Chao Huang, Tao Chen, and Eric Chang. Accent issues in large vocabulary continuous speech recognition. International Journal of Speech Technology, 7(2-3).
[18] Herman Kamper, Félicien Jeje Muamba Mukanya, and Thomas Niesler. Multi-accent acoustic modelling of South African English. Speech Communication, 54(6).
[19] Kaustubh Kulkarni, Sohini Sengupta, V Ramasubramanian, Josef G Bauer, and Georg Stemmer. Accented Indian English ASR: Some early results. Spoken Language Technology Workshop, 2008.
[20] Shamalee Deshpande, Sharat Chikkerur, and Venu Govindaraju. Accent classification in speech. Fourth IEEE Workshop on Automatic Identification Advanced Technologies.
[21] Olga Kalasnhnik and Janet Fletcher. An acoustic study of vowel contrasts in North Indian English. Proceedings of the XVI International Congress of Phonetic Sciences, Germany.
[22] Shrikant Joshi and Preeti Rao. Acoustic models for pronunciation assessment of vowels of Indian English. International Conference on O-COCOSDA/CASLRE, pages 1-6.
[23] Caroline R Wiltshire and James D Harnsberger. The influence of Gujarati and Tamil L1s on Indian English: A preliminary study. World Englishes, 25(1):91-104.
[24] Sirsa Hema and Redford Melissa A. The effects of native language on Indian English sounds and timing patterns. Journal of Phonetics, 41(6).
[25] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(8).
[26] Daniel DK Sleator and Davy Temperley. Parsing English with a link grammar. arXiv preprint cmp-lg.
[27] SphinxTrain.
[28] Senones.
[29] Mark JF Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75-98.
[30] Wikipedia dump download.
[31] WP2TXT.
[32] Vesa Siivola, Mathias Creutz, and Mikko Kurimo. Morfessor and VariKN machine learning tools for speech and language technology. INTERSPEECH.
[33] Andreas Stolcke et al. SRILM - an extensible language modeling toolkit. INTERSPEECH.
[34] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems, Inc.
[35] Paul Lamere, Philip Kwok, William Walker, Evandro B Gouvea, Rita Singh, Bhiksha Raj, and Peter Wolf. Design of the CMU Sphinx-4 decoder. INTERSPEECH.
[36] Sphinx tutorial. www.speech.cs.cmu.edu/sphinx/tutorial.html
[37] Yonghong Yan, Xintian Wu, Johan Schalkwyk, and Ron Cole. Development of CSLU LVCSR: the 1997 DARPA HUB4 evaluation system. Complexity, 24(14):7-27.
[38] Jeroen Fransen, Dave Pye, Tony Robinson, Phil Woodland, and Steve Young. WSJCAM0 corpus and recording description.
[39] CMU HUB-4 open source acoustic models.
[40] Sphinx dictionary (cmudict).


More information

Progressive Aspect in Nigerian English

Progressive Aspect in Nigerian English ISLE 2011 17 June 2011 1 New Englishes Empirical Studies Aspect in Nigerian Languages 2 3 Nigerian English Other New Englishes Explanations Progressive Aspect in New Englishes New Englishes Empirical Studies

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,

More information

Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm

Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Prof. Ch.Srinivasa Kumar Prof. and Head of department. Electronics and communication Nalanda Institute

More information

Noisy SMS Machine Translation in Low-Density Languages

Noisy SMS Machine Translation in Low-Density Languages Noisy SMS Machine Translation in Low-Density Languages Vladimir Eidelman, Kristy Hollingshead, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department of

More information

Improved Hindi Broadcast ASR by Adapting the Language Model and Pronunciation Model Using A Priori Syntactic and Morphophonemic Knowledge

Improved Hindi Broadcast ASR by Adapting the Language Model and Pronunciation Model Using A Priori Syntactic and Morphophonemic Knowledge Improved Hindi Broadcast ASR by Adapting the Language Model and Pronunciation Model Using A Priori Syntactic and Morphophonemic Knowledge Preethi Jyothi 1, Mark Hasegawa-Johnson 1,2 1 Beckman Institute,

More information

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science

More information

Word Stress and Intonation: Introduction

Word Stress and Intonation: Introduction Word Stress and Intonation: Introduction WORD STRESS One or more syllables of a polysyllabic word have greater prominence than the others. Such syllables are said to be accented or stressed. Word stress

More information

Speaker recognition using universal background model on YOHO database

Speaker recognition using universal background model on YOHO database Aalborg University Master Thesis project Speaker recognition using universal background model on YOHO database Author: Alexandre Majetniak Supervisor: Zheng-Hua Tan May 31, 2011 The Faculties of Engineering,

More information

The IRISA Text-To-Speech System for the Blizzard Challenge 2017

The IRISA Text-To-Speech System for the Blizzard Challenge 2017 The IRISA Text-To-Speech System for the Blizzard Challenge 2017 Pierre Alain, Nelly Barbot, Jonathan Chevelu, Gwénolé Lecorvé, Damien Lolive, Claude Simon, Marie Tahon IRISA, University of Rennes 1 (ENSSAT),

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou, C. Skourlas, J. Varnas

P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou, C. Skourlas, J. Varnas Exploiting Distance Learning Methods and Multimediaenhanced instructional content to support IT Curricula in Greek Technological Educational Institutes P. Belsis, C. Sgouropoulou, K. Sfikas, G. Pantziou,

More information

CS 598 Natural Language Processing

CS 598 Natural Language Processing CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@

More information

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Dr. Amardeep Kaur Professor, Babe Ke College of Education, Mudki, Ferozepur, Punjab Abstract The present

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

Rachel E. Baker, Ann R. Bradlow. Northwestern University, Evanston, IL, USA

Rachel E. Baker, Ann R. Bradlow. Northwestern University, Evanston, IL, USA LANGUAGE AND SPEECH, 2009, 52 (4), 391 413 391 Variability in Word Duration as a Function of Probability, Speech Style, and Prosody Rachel E. Baker, Ann R. Bradlow Northwestern University, Evanston, IL,

More information

Houghton Mifflin Reading Correlation to the Common Core Standards for English Language Arts (Grade1)

Houghton Mifflin Reading Correlation to the Common Core Standards for English Language Arts (Grade1) Houghton Mifflin Reading Correlation to the Standards for English Language Arts (Grade1) 8.3 JOHNNY APPLESEED Biography TARGET SKILLS: 8.3 Johnny Appleseed Phonemic Awareness Phonics Comprehension Vocabulary

More information

Automatic Pronunciation Checker

Automatic Pronunciation Checker Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale

More information

Language Model and Grammar Extraction Variation in Machine Translation

Language Model and Grammar Extraction Variation in Machine Translation Language Model and Grammar Extraction Variation in Machine Translation Vladimir Eidelman, Chris Dyer, and Philip Resnik UMIACS Laboratory for Computational Linguistics and Information Processing Department

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Mining Association Rules in Student s Assessment Data

Mining Association Rules in Student s Assessment Data www.ijcsi.org 211 Mining Association Rules in Student s Assessment Data Dr. Varun Kumar 1, Anupama Chadha 2 1 Department of Computer Science and Engineering, MVN University Palwal, Haryana, India 2 Anupama

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access

The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access The Perception of Nasalized Vowels in American English: An Investigation of On-line Use of Vowel Nasalization in Lexical Access Joyce McDonough 1, Heike Lenhert-LeHouiller 1, Neil Bardhan 2 1 Linguistics

More information

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets

More information

Linguistics. Undergraduate. Departmental Honors. Graduate. Faculty. Linguistics 1

Linguistics. Undergraduate. Departmental Honors. Graduate. Faculty. Linguistics 1 Linguistics 1 Linguistics Matthew Gordon, Chair Interdepartmental Program in the College of Arts and Science 223 Tate Hall (573) 882-6421 gordonmj@missouri.edu Kibby Smith, Advisor Office of Multidisciplinary

More information

South Carolina English Language Arts

South Carolina English Language Arts South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

Segregation of Unvoiced Speech from Nonspeech Interference

Segregation of Unvoiced Speech from Nonspeech Interference Technical Report OSU-CISRC-8/7-TR63 Department of Computer Science and Engineering The Ohio State University Columbus, OH 4321-1277 FTP site: ftp.cse.ohio-state.edu Login: anonymous Directory: pub/tech-report/27

More information

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh The Effect of Discourse Markers on the Speaking Production of EFL Students Iman Moradimanesh Abstract The research aimed at investigating the relationship between discourse markers (DMs) and a special

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing

Procedia - Social and Behavioral Sciences 141 ( 2014 ) WCLTA Using Corpus Linguistics in the Development of Writing Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 141 ( 2014 ) 124 128 WCLTA 2013 Using Corpus Linguistics in the Development of Writing Blanka Frydrychova

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

Florida Reading Endorsement Alignment Matrix Competency 1

Florida Reading Endorsement Alignment Matrix Competency 1 Florida Reading Endorsement Alignment Matrix Competency 1 Reading Endorsement Guiding Principle: Teachers will understand and teach reading as an ongoing strategic process resulting in students comprehending

More information

Chapter 5: Language. Over 6,900 different languages worldwide

Chapter 5: Language. Over 6,900 different languages worldwide Chapter 5: Language Over 6,900 different languages worldwide Language is a system of communication through speech, a collection of sounds that a group of people understands to have the same meaning Key

More information

SIE: Speech Enabled Interface for E-Learning

SIE: Speech Enabled Interface for E-Learning SIE: Speech Enabled Interface for E-Learning Shikha M.Tech Student Lovely Professional University, Phagwara, Punjab INDIA ABSTRACT In today s world, e-learning is very important and popular. E- learning

More information