Resources for Indian languages


Resources for Indian languages
Arun Baby, Anju Leela Thomas, Nishanthi N L, and TTS Consortium
Indian Institute of Technology Madras, India
September 12, 2016

Roadmap (outline)
- The need for Indian language corpora
- Introduction
- Data collection
  - Text selection and correction
  - Speaker selection
  - Recording
  - Summary of the text corpus
- Voice building
  - Common Label Set
  - Parsing and unified parser
  - Hybrid segmentation
  - Pruning
  - HTS
- Android applications
- Conclusion and future work
- Acknowledgement
- References

The need for Indian language corpora
- The amount of work in the speech domain for Indian languages is comparatively lower than for other languages
- What is needed is a database of speech audio files and the corresponding text transcriptions
- Consortium effort

Introduction
- Creating a corpus for Indian languages is a time-consuming process, mainly because of the diversity of the languages and the lack of resources
- DeitY, Ministry of Information Technology, India, took the initiative to sponsor the development of TTS in regional languages
- Two voices (male and female) are recorded for each language
- 40 hours of data are collected per language

Data collection
- Text selection and correction
- Speaker selection
- Recording
- Summary of the text corpus

Text selection and correction
- Text in various Indian languages is collected from newspapers, websites, blogs, etc. with the help of web crawlers
- Text from different domains such as children's stories, literature, science, and tourism was also collected manually
- The text is corrected manually to remove transcription errors, if any
- The chosen text is easy to read, covers the most commonly used words and phrases in a language, and has maximum syllable coverage
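The slides do not say how "maximum syllable coverage" was achieved. One common way to approach it is a greedy selection, sketched below under that assumption. The function names and the toy `toy_units` splitter (which treats whitespace-separated tokens as syllables) are hypothetical; a real system would plug in a language-specific syllabifier.

```python
from typing import Callable, Iterable, List, Set


def greedy_select(
    sentences: List[str],
    units_of: Callable[[str], Iterable[str]],
    max_sentences: int,
) -> List[str]:
    """Pick sentences one at a time, each time taking the sentence that
    adds the most units (e.g. syllables) not yet covered."""
    covered: Set[str] = set()
    remaining = list(sentences)
    chosen: List[str] = []
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(set(units_of(s)) - covered))
        gain = set(units_of(best)) - covered
        if not gain:  # no remaining sentence adds anything new
            break
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen


def toy_units(sentence: str) -> List[str]:
    # Hypothetical stand-in for a syllabifier: one token = one syllable.
    return sentence.split()


corpus = ["ka ma la", "ka ma", "la ti", "ti ka pa"]
selected = greedy_select(corpus, toy_units, max_sentences=2)
print(selected)
```

Greedy selection does not guarantee the smallest possible sentence set, but it is simple and tends to give good coverage for a fixed recording budget.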

Speaker selection
- Two voice talents (1 male and 1 female) are selected
- Single-speaker data limits the variations and changes in voice quality
- A voice that is pleasant to listen to and amenable to signal processing is chosen

Recording
- Carried out in a special environment that is free from noise and echo
- Done by professional speakers (male and female) to maintain constant pitch and prevent stress phenomena
- To avoid speaker fatigue, a break is given every 45 minutes
- The recordings are split at the sentence level
- Recordings are mono, with a sampling rate of 48 kHz and 16 bits per sample
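The stated recording format (mono, 48 kHz, 16 bits per sample) can be checked on a file-by-file basis. The following is a minimal sketch using Python's standard `wave` module, assuming the recordings are stored as PCM WAV files, which the slides do not state. The helper names and the in-memory test file are illustrative only.

```python
import io
import wave

# Format stated on the slide: mono, 48 kHz, 16-bit (2 bytes per sample).
EXPECTED = {"channels": 1, "sample_rate": 48000, "sample_width_bytes": 2}


def wav_format(fileobj) -> dict:
    """Read the header fields of a PCM WAV file (path or file object)."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def matches_spec(fileobj) -> bool:
    return wav_format(fileobj) == EXPECTED


def make_silence(sample_rate: int, channels: int = 1, seconds: float = 0.01):
    """Build a short silent 16-bit WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds) * channels)
    buf.seek(0)
    return buf


print(matches_spec(make_silence(48000)))  # file in the stated format
print(matches_spec(make_silence(16000)))  # wrong sampling rate
```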

Summary of the text corpus

Table 1: Summary of the corpus

Language    Measure              Female English  Female Mono  Male English  Male Mono
Assamese    Duration in hours    12.05           14.45        11.30         12.95
            Number of words      17531           29510        18143         32136
            Number of sentences  8513            8713         8892          8941
Bengali     Duration in hours    5.2             5.01         10.03         10.05
            Number of words      8607            18599        12901         30493
            Number of sentences  3239            3253         5316          6187
Bodo        Duration in hours    -               4            -             -
            Number of words      -               3991         -             -
            Number of sentences  -               2715         -             -
Gujarati    Duration in hours    10              10.33        10.13         10.92
            Number of words      14309           20567        15192         23546
            Number of sentences  4671            2396         4826          3288
Hindi       Duration in hours    7.94            7.23         7.81          7.03
            Number of words      15153           13380        15189         13369
            Number of sentences  5240            2605         5243          2806
Kannada     Duration in hours    7.5             11.82        7.48          7.03
            Number of words      13738           11097        14446         11358
            Number of sentences  4448            5132         4778          5934

Summary of the text corpus

Table 2: Summary of the corpus

Language    Measure              Female English  Female Mono  Male English  Male Mono
Malayalam   Duration in hours    8.77            8.19         7.89          9.7
            Number of words      13738           29165        13738         28933
            Number of sentences  5132            5650         5131          5650
Manipuri    Duration in hours    10.35           10.14        10.22         10.61
            Number of words      21119           23555        18535         24531
            Number of sentences  10167           9487         9836          9745
Marathi     Duration in hours    -               4.8          -             3.27
            Number of words      -               18287        -             12201
            Number of sentences  -               2448         -             1992
Odia        Duration in hours    -               4.27         -             4.47
            Number of words      -               3936         -             4069
            Number of sentences  -               3578         -             3573
Rajasthani  Duration in hours    7.25            10.24        7.30          9.82
            Number of words      11929           20923        13114         22894
            Number of sentences  3830            4346         4809          4779
Tamil       Duration in hours    12.7            10.03        10.9          10.3
            Number of words      20911           28817        20220         32017
            Number of sentences  7914            3243         7547          3717
Telugu      Duration in hours    -               23.92        -             4.2
            Number of words      -               42063        -             12192
            Number of sentences  -               4043         -             2481

Voice building
- Common Label Set
- Parsing and unified parser
- Hybrid segmentation
- Pruning
- HTS

Common Label Set
- Capitalizes on the acoustic similarity of Indian languages [1]
- Standardized representation for phonemes across different Indian languages
- Devised using the Latin-1 script

[1] B Ramani, S Lilly Christina, G Anushiya Rachel, V Sherlin Solomi, Mahesh Kumar Nandwana, Anusha Prakash, S Aswin Shanmugam, Raghava Krishnan, S Kishore, K Samudravijaya, et al. A common attribute based unified HTS framework for speech synthesis in Indian languages. In 8th ISCA Workshop on Speech Synthesis, pages 311-316, 2013.

Parsing and unified parser
- The traditional parsing approach uses each language's own rules to parse a word into the corresponding phones [2]
- The unified approach uses the generic language structure of Indian languages
- The languages are unified on the basis of the Common Label Set
- The parser converts UTF-8 text to the Common Label Set, applies letter-to-sound rules, and generates the corresponding phoneme sequences

[2] Arun Baby, Nishanthi N L, Anju Leela Thomas, and Hema A Murthy. A unified parser for developing Indian language text to speech synthesizers. In International Conference on Text, Speech and Dialogue. Springer, 2016.
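The idea of mapping text in different scripts onto one shared label inventory can be illustrated with a toy example. The sketch below is not the unified parser and does not use the published Common Label Set: the label strings, the tiny character tables, and the function name `to_labels` are all assumptions made for illustration. It relies only on a general property of Indic scripts, namely that a consonant carries an inherent vowel unless a vowel sign or a virama follows it, and it skips language-specific letter-to-sound rules such as schwa deletion.

```python
# Illustrative tables only: these labels are NOT the published Common Label Set.
CONSONANTS = {
    "\u0915": "k", "\u092e": "m", "\u0932": "l",  # Devanagari ka, ma, la
    "\u0b95": "k", "\u0bae": "m", "\u0bb2": "l",  # Tamil ka, ma, la
}
VOWEL_SIGNS = {
    "\u093e": "aa", "\u093f": "i",  # Devanagari vowel signs aa, i
    "\u0bbe": "aa", "\u0bbf": "i",  # Tamil vowel signs aa, i
}
VIRAMAS = {"\u094d", "\u0bcd"}      # Devanagari virama, Tamil pulli
INHERENT_VOWEL = "a"


def to_labels(word: str) -> list:
    """Map a word to a script-independent label sequence (toy version)."""
    labels = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch not in CONSONANTS:
            raise ValueError(f"no label for {ch!r}")
        labels.append(CONSONANTS[ch])
        if nxt in VOWEL_SIGNS:      # explicit vowel replaces the inherent one
            labels.append(VOWEL_SIGNS[nxt])
            i += 1
        elif nxt in VIRAMAS:        # virama suppresses the inherent vowel
            i += 1
        else:
            labels.append(INHERENT_VOWEL)
        i += 1
    return labels


hindi_word = "\u092e\u093e\u0932\u093e"  # Devanagari: ma + aa, la + aa
tamil_word = "\u0bae\u0bbe\u0bb2\u0bbe"  # Tamil: ma + aa, la + aa
print(to_labels(hindi_word), to_labels(tamil_word))
```

Both words yield the same label sequence, which is the point of a shared label set: downstream components can be written once and reused across languages.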

Hybrid segmentation
- Manual correction is a monotonous task
- Flat-start initialization of monophone HMMs, embedded reestimation, and forced-Viterbi alignment are the three steps used in conventional segmentation
- This model does not indicate the boundary positions
- Short-term energy (STE) is used as a measure to determine the syllable boundaries [3]
- The syllable boundaries are corrected with group delay and spectral flux

[3] S Aswin Shanmugam and Hema Murthy. A hybrid approach to segmentation of speech using group delay processing and HMM based embedded reestimation. Presentation in INTERSPEECH, 2014.
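Short-term energy is the sum of squared samples within each analysis frame, and dips in that contour are natural candidates for syllable boundaries. The sketch below shows only this basic idea on a synthetic signal. It does not implement the group delay processing or the spectral flux correction mentioned on the slide, and the frame length, hop size, and function names are assumptions.

```python
from typing import List, Sequence


def short_term_energy(
    samples: Sequence[float], frame_len: int, hop: int
) -> List[float]:
    """Sum of squared samples in each frame of `frame_len` samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def local_minima(values: Sequence[float]) -> List[int]:
    """Indices of frames whose energy is lower than both neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


# Synthetic signal: two loud stretches separated by a short silence.
signal = [1.0] * 8 + [0.0] * 4 + [1.0] * 8
ste = short_term_energy(signal, frame_len=4, hop=4)
candidates = local_minima(ste)
print(ste, candidates)
```

On real speech the raw STE contour is noisy, which is one reason a smoothing or peak-sharpening step such as group delay processing is useful before picking boundaries.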

Pruning
- Process of discarding badly segmented units [4]
- Duration, average f0, and STE are the cues taken into consideration
- Helps in the correction of segmentation errors and in maintaining acoustic continuity in the voice

[4] K Raghava Krishnan. Prosodic analysis of Indian languages and its application to text to speech synthesis. http://lantana.tenet.res.in/thesis.php, M S Thesis, Department of Electrical Engineering, IIT Madras, India, July 2015.
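The slide names the cues (duration, average f0, STE) but not the decision rule. As one plausible reading, the sketch below discards units whose cue values lie far from the mean of their group. The two-standard-deviation threshold, the `Unit` record, and the `prune` function are assumptions for illustration, not the method from the cited thesis. The caller is expected to pass in units of the same type (for example, all instances of one syllable).

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Unit:
    label: str
    duration: float  # seconds
    avg_f0: float    # Hz
    ste: float       # short-term energy


def prune(units: List[Unit], max_dev: float = 2.0) -> List[Unit]:
    """Keep units whose duration, average f0 and STE all lie within
    `max_dev` standard deviations of the group mean (assumed rule)."""
    kept = list(units)
    for cue in ("duration", "avg_f0", "ste"):
        values = [getattr(u, cue) for u in units]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:  # all values identical: nothing to prune on this cue
            continue
        kept = [u for u in kept if abs(getattr(u, cue) - mu) <= max_dev * sigma]
    return kept


# Nine consistent instances of one syllable plus one badly segmented instance.
units = [Unit("ka", 0.20, 120.0, 1.0) for _ in range(9)]
units.append(Unit("ka", 1.50, 120.0, 1.0))
kept = prune(units)
print(len(units), len(kept))
```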

HTS
- A statistical parametric approach [5]
- Parametric representation of speech by extracting the spectral and excitation features from the database

[5] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on, volume 3, pages 1315-1318. IEEE, 2000.

Android applications
- Three Android applications were developed [6]
  - Tamil TTS app: Tamil text-to-speech synthesis
  - Hindi TTS app: Hindi text-to-speech synthesis
  - Indic TTS app: text-to-speech synthesis of 13 Indian languages
- The apps are available for download from the Indic TTS website

[6] IIT Madras. Indic TTS - Android apps. https://www.iitm.ac.in/donlab/tts/androidapp.php

Conclusion and future work
- The data is hosted on the web
- It is available to all groups working on corpus generation and research activities
- Data is still being collected

Download statistics (as of 12 September 2016)

Figure 1: Download statistics

Acknowledgement
- Funded by Department of Information Technology, Ministry of Communication and Technology, Government of India

Figure 2: Consortium members

References
- IIT Madras. Indic TTS. https://www.iitm.ac.in/donlab/tts/
- SS Agrawal, Sunita Arora, and Karunesh Arora. Towards design, development and standardization of speech corpora for developing Indian language TTS system. COCOSDA-2005, Dec, pages 6-8, 2005.
- Arun Baby, Nishanthi N L, Anju Leela Thomas, and Hema A Murthy. A unified parser for developing Indian language text to speech synthesizers. In International Conference on Text, Speech and Dialogue. Springer, 2016.
- S Aswin Shanmugam and Hema Murthy. A hybrid approach to segmentation of speech using group delay processing and HMM based embedded reestimation. Presentation in INTERSPEECH, 2014.

Questions?

Thank you