Automatic Speech Segmentation of French: Corpus Adaptation


LPL - Aix-en-Provence - France

This work has been carried out thanks to the support of the A*MIDEX project (n° ANR-11-IDEX-0001-02) funded by the «Investissements d'Avenir» French Government program, managed by the French National Research Agency (ANR).

What is Speech Segmentation?
The process of taking the phonetic transcription of an audio speech segment and determining where in time particular phonemes occur in the speech segment.
[Figure: audio + phonemes (s o r t i r l @ S a) -> time-aligned phonemes]
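As a rough illustration of the input and output of this task, the sketch below pairs the phoneme sequence from the slide with time stamps. The `Interval` class and every time value are invented for this example; they are neither SPPAS data structures nor measured boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One time-aligned phoneme: a label with start/end times in seconds."""
    label: str
    start: float
    end: float


# Input: the phonetic transcription of the utterance (symbols from the slide).
phonemes = ["s", "o", "r", "t", "i", "r", "l", "@", "S", "a"]

# Hypothetical boundary times in seconds (made up, one more than phonemes).
bounds = [0.00, 0.09, 0.15, 0.20, 0.27, 0.33, 0.38, 0.42, 0.46, 0.57, 0.70]

# Output: each phoneme with the time span it occupies in the audio.
aligned = [Interval(p, s, e) for p, s, e in zip(phonemes, bounds, bounds[1:])]

for interval in aligned:
    print(f"{interval.start:.2f} {interval.end:.2f} {interval.label}")
```

Speech segmentation is the step that produces `bounds`; everything else is known in advance.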

What is it for?
Determining the location of known phonemes is important to a number of speech applications:
  When developing an ASR system, good initial estimates are essential when training Gaussian Mixture Model (GMM) parameters (Rabiner and Juang, 1993, p. 370).
  Knowledge of phoneme boundaries is also necessary in some cases of health-related research on human speech processing.
  And other applications...

How to perform Speech Segmentation?
Manually:
  Manual alignment has been reported to take between 11 and 30 seconds per phoneme (Leung and Zue, 1984).
  Manual alignment is too time-consuming and expensive to be commonly employed for aligning large corpora.

How to perform Speech Segmentation?
Speech Recognition Engines that can perform Speech Segmentation:
  HTK - Hidden Markov Model Toolkit
  CMU Sphinx
  Julius - Open-Source Large Vocabulary CSR Engine
Wrappers:
  Prosodylab-Aligner: Python / HTK
  P2FA: Python / HTK
  and many others...

How to perform Speech Segmentation?
Graphical User Interface: SPPAS (Bigi, 2012)
Speech Segmentation is also called: Alignment

On which languages?
SPPAS can perform speech segmentation of: French, English, Italian, Spanish, Chinese, Taiwanese, Japanese.
Requirement: an acoustic model for each language.

An Acoustic Model???
~h "S"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 25
 3.865123e+00 -2.796230e+00 -2.741646e+00 -2.575907e+00 -2.209618e+00 -5.850142e+00 -3.059854e+00 2.294439e+00 6.802940e-01 -2.800637e+00 -1.763918e+00 3.845190e-01 1.286847e+00 -1.407083e+00 -1.252665e+00 -1.862736e+00 -3.524270e-01 4.247507e-01 -1.773855e-02 7.232670e-01 -3.501371e-01 -8.653453e-01 -1.168209e+00 -5.176944e-01 1.447603e+00
<VARIANCE> 25
 1.297570e+01 2.348404e+01 3.699827e+01 3.013035e+01 4.785572e+01 4.348248e+01 4.807753e+01 4.529767e+01 4.452133e+01 4.717181e+01 5.047903e+01 4.394471e+01 5.295042e+00 3.326635e+00 3.577229e+00 3.221893e+00 6.327312e+00 4.562069e+00 5.920639e+00 7.081470e+00 5.766568e+00 5.546420e+00 5.610922e+00 4.105053e+00 1.246813e+00
<GCONST> 1.085982e+02
<STATE> 3
<MEAN> 25
 4.182722e+00 -5.747316e+00 -5.573908e+00 -3.280269e+00 7.250799e-01 -1.220587e+00 7.397585e-02 4.036344e+00 5.651740e-01 -3.612718e+00 -3.532877e+00 -1.029424e+00 7.764320e-02 -1.490477e-01 -1.060979e-01 8.130542e-02 2.693116e-01 4.773618e-01 2.419368e-01 -1.171875e-01 -1.453947e-01 3.595677e-03 -1.755375e-01 -1.827260e-01 -9.910033e-02
<VARIANCE> 25
 1.229548e+01 1.833777e+01 3.330074e+01 3.391322e+01 4.468183e+01 4.548661e+01 5.034616e+01 4.177621e+01 4.829255e+01 4.718935e+01 4.383722e+01 3.838983e+01 5.534610e-01 9.874231e-01 1.471683e+00 1.390052e+00 2.534417e+00 2.351494e+00 2.433162e+00 2.457205e+00 2.317599e+00 2.229505e+00 2.289994e+00 2.051025e+00 4.103379e-01
<GCONST> 9.480565e+01
<STATE> 4
<MEAN> 25
 4.170075e+00 -3.602696e+00 -3.229792e+00 -2.666616e+00 -5.769264e-01 -2.755867e+00 -6.961405e-01 2.032978e+00 1.096958e-01 -2.195134e+00 -2.524131e+00 -9.696913e-01 7.723407e-02 1.414706e+00 1.097951e+00 8.257185e-01 -3.040556e-01 -2.347561e-02 -2.900199e-01 -1.342138e+00 -5.801741e-01 3.527923e-01 4.388814e-01 3.887816e-02 -1.326638e+00
<VARIANCE> 25
 1.412758e+01 2.168075e+01 4.145230e+01 3.500136e+01 6.340505e+01 5.574141e+01 5.442813e+01 4.434394e+01 4.613047e+01 4.639702e+01 4.196549e+01 4.127845e+01 1.312419e+00 1.832024e+00 2.573012e+00 2.434281e+00 3.214828e+00 3.160381e+00 3.389642e+00 3.730893e+00 3.638973e+00 3.536761e+00 3.276227e+00 2.968326e+00 1.121088e+00
<GCONST> 1.025482e+02
<TRANSP> 5
 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
 0.000000e+00 4.490560e-01 5.509440e-01 0.000000e+00 0.000000e+00
 0.000000e+00 0.000000e+00 6.871416e-01 3.128584e-01 0.000000e+00
 0.000000e+00 0.000000e+00 0.000000e+00 4.482542e-01 5.517458e-01
 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>

Yes, an Acoustic Model!
It's a probability distribution (a 5-state HMM, and so on). But it doesn't matter: it is not necessary to understand it.
The model is trained from data.
[Figure: audio files, each with the text corresponding to the audio -> Training -> Acoustic Model]

Measure: impact of the training data on Speech Segmentation
  the impact of quality vs quantity
  the impact of the speech style
How to measure the impact of the training set on speech segmentation?
[Figure: Training set -> Training -> Acoustic Model; Test set -> Automatically time-aligned set]

Evaluating Automatic Speech Segmentation?
Compare the automatic segmentation with a human segmentation.
What to compare:
  Duration
  Position of phoneme boundaries
  Middle of the phoneme
[Figure: the same phoneme p in the manual and in the automatic segmentation]

Evaluating Automatic Speech Segmentation?
Measure what percentage of the automatically aligned boundaries are within a given time threshold of the manually aligned boundaries.
Agreement between humans on the location of phoneme boundaries is, on average, 93.78% within 20 ms on a variety of English corpora (Hosom, 2009).
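The measure described above can be sketched in a few lines. This is a minimal illustration, assuming both segmentations contain the same boundaries in the same order; the function name `boundary_agreement` and the sample times are hypothetical.

```python
def boundary_agreement(manual, automatic, threshold=0.020):
    """Percentage of automatic boundaries within `threshold` seconds
    of the corresponding manual boundaries.

    Both arguments are equal-length lists of boundary times in seconds;
    position i refers to the same phoneme boundary in both lists.
    """
    if len(manual) != len(automatic):
        raise ValueError("boundary lists must have the same length")
    if not manual:
        raise ValueError("no boundaries to compare")
    hits = sum(1 for m, a in zip(manual, automatic) if abs(a - m) <= threshold)
    return 100.0 * hits / len(manual)


# Made-up boundary times (seconds) for four phoneme boundaries.
manual = [0.10, 0.25, 0.40, 0.62]
automatic = [0.11, 0.28, 0.39, 0.62]

# Three of the four automatic boundaries fall within 20 ms.
print(boundary_agreement(manual, automatic))
```

The threshold is a parameter because the figure reported for human agreement uses 20 ms, while other evaluations may choose a different tolerance.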

Manual vs Automatic
[Figure: manual and automatic segmentations of the same phoneme]
D = T(Automatic) - T(Manual) = -0.09 s
I preferred to evaluate the center of the phonemes.
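A small sketch of this delta, taking T to be the center of the phoneme. The helper names and the sample start/end times are assumptions chosen so that the result matches the -0.09 s of the slide.

```python
def center(start, end):
    """Middle of a phoneme, in seconds."""
    return (start + end) / 2.0


def center_delta(manual, automatic):
    """D = T(Automatic) - T(Manual), with T the center of the phoneme.

    Each argument is a (start, end) pair in seconds. A negative value
    means the automatic segmentation places the phoneme too early.
    """
    return center(*automatic) - center(*manual)


# Hypothetical segmentations of one phoneme (seconds).
manual_segment = (1.00, 1.20)     # center at 1.10 s
automatic_segment = (0.90, 1.12)  # center at 1.01 s

d = center_delta(manual_segment, automatic_segment)
print(f"D = {d:.2f} s")
```

Using the center makes a single number sensitive to errors on both boundaries, at the cost of hiding errors that cancel out (a start placed too early and an end placed too late by the same amount).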

French Phoneset
Vowels: a, a~, E, e, i, o (clusters /o/ and /O/), o~, EU (clusters /2/ and /@/), EU9 (is /9/), u, y, U~ (clusters /e~/ and /9~/)
Consonants: S, Z, f, s, v, z, m, n, l, r (clusters /r/ and /R/), p, t, k, b, d, g
Others: H, j, w, sil (silence), sp (short pause), fp (filled pause), gb (garbage), @@ (laughter), dummy
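The clusters listed on this slide amount to a many-to-one mapping applied to a phonetization before training or alignment. The sketch below shows one possible way to apply it; the dictionary covers only the four clusters stated on the slide, and the names `CLUSTERS` and `to_model_phoneset` are hypothetical.

```python
# Phonemes that are merged into a single model unit, as stated on the slide.
CLUSTERS = {
    "O": "o",    # /o/ and /O/ share the model "o"
    "2": "EU",   # /2/ and /@/ share the model "EU"
    "@": "EU",
    "e~": "U~",  # /e~/ and /9~/ share the model "U~"
    "9~": "U~",
    "R": "r",    # /r/ and /R/ share the model "r"
}


def to_model_phoneset(phonemes):
    """Replace each phoneme by its cluster label; others are kept as-is."""
    return [CLUSTERS.get(p, p) for p in phonemes]


# A phonetization with unclustered symbols (example input).
print(to_model_phoneset(["s", "O", "R", "t", "i", "R", "l", "@", "S", "a"]))
```

Merging phonemes that are hard to tell apart reduces the number of models to train, so each model sees more examples.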

Training corpus
The difficulties are that:
  1. corpora come in various file formats
  2. speech is segmented at various levels (phones, tokens, utterances)
  3. orthographic transcriptions are of various qualities
  4. corpora are of various speech styles
Points 1 and 2 are solved by scripting the data.
Points 3 and 4 are the purpose of this study.

Training corpus
Corpus name   Transcription                         Speech duration   Style
Europe        Manually phonetized                   40 min            Political debate
Eurom1        Ortho. standard, manually tokenized   26 min            Read paragraphs
Read-Speech   Ortho. standard                       98 min            Read sentences
AixOx         Ortho. standard                       122 min           Read paragraphs
CID           Enriched ortho.                       7h30min           Conversation
MapTaskAix    Standard ortho.                       2h48min           Task-oriented conversation

Test corpus
Read Speech: about 2 minutes of AixOx (1748 phonemes)
Spontaneous Speech: about 2 minutes of CID (1854 phonemes)
Manually phonetized and segmented by one expert, then revised by another one.
The test consists of:
  Automatic segmentation of the phonemes of each sentence;
  Comparison with the manual segmentation: the time threshold is fixed to 40 ms.

Training procedure
The training set is split into three data sets, each used in one training step:
  DataSet1 (manually time-aligned) -> Training Step 1 -> Acoustic Model
  DataSet2 (well phonetized) -> Training Step 2 -> Acoustic Model
  DataSet3 (automatically phonetized) -> Training Step 3 -> Acoustic Model

Question 1: quality vs quantity
Perform step 1 from DataSet1 (3 min). D < 40 ms: Read speech 82.61%, Conversation 81.44%
Perform step 2 from DataSet2 (42 min). D < 40 ms: Read speech 85.07%, Conversation 87.86%
Split DataSet3: perform as many step 3 runs as there are sub-sets.

Step 3. Compare sub-sets
[Figure: accuracy (D < 40 ms) after step 3 for each sub-set of DataSet3, compared with step 2]
Transcription types: standard ortho. transcription with automatic phonetization; enriched ortho. transcription with automatic phonetization; manual phonetization.
Sub-set labels: MapTaskAix; MapTaskAix (2h48min); Blue: 112min; AixOx (2h02min); ReadSpeech (98min); CID 8 spk (7h30); CID 2 spk (~60min); Europe (40min).
Plotted values (% on ReadSpeech): 82.78, 83.92, 84.04, 85.07 (Step 2), 86.04, 87.30, 87.01, 92.56
Plotted values (% on Conversation): 75.67, 82.09, 85.09, 87.86 (Step 2), 87.92, 87.16, 88.03, 91.69
The quality plays a decisive role.

The sooner the better
Introduce all manually annotated data as early as possible in the training procedure.
Re-perform steps 1 and 2. D < 40 ms: Read Speech 94.16%, Conversational Speech 92.77%
This model is (now) pretty stable.
DataSet3: perform as many step 3 runs as there are sub-sets.

Question 2: speech style
D < 40 ms                     Read Speech (%)   Conversational Speech (%)
Step 2                        94.16             92.77
Step 3. Read Speech           93.02             92.99
Step 3. Read Speech + AixOx   91.59             90.40
Step 3. MapTaskAix            89.93             89.21
Step 3. CID                   93.25             92.23
Step 3. Read Speech + CID     93.36             93.42

The Acoustic Model
The selected sub-sets of DataSet3 are then used in a 4th step to train a Triphone model.
D < 40 ms: Read Speech 95.08%, Conversational Speech 95.42%

Other measures: Duration
[Figures: read speech; spontaneous speech]

Other measures: start boundary
[Figures: read speech; spontaneous speech]

Other measures: end boundary
[Figures: read speech; spontaneous speech]

Conclusion
This work makes it possible to give advice to data producers.
Requirements for a Monophone Acoustic Model:
  at least 3 minutes of time-aligned data
  30-60 minutes of manually phonetized data
Requirements for a Triphone Acoustic Model:
  a pronunciation dictionary
  at least 8 hours of well-transcribed speech
From these data, I can train an acoustic model and add the new language to SPPAS!

Perspectives
Forced Alignment on Children Speech (FACS): FA = Phonetization + Speech Segmentation (Bigi, 2011). EVALITA 2014.
Multilingual model: speech segmentation of an un-trained language.

References
Hosom, J. P. (2009). Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 51(4), 352-368.
Rabiner, L. R., & Juang, B. H. (1993). Fundamentals of speech recognition (Vol. 14). Englewood Cliffs: PTR Prentice Hall.
Zue, V., Seneff, S., & Glass, J. (1990). Speech database development at MIT: TIMIT and beyond. Speech Communication, 9(4), 351-356.
Bigi, B. (2012). SPPAS: a tool for the phonetic segmentation of speech. In LREC (Vol. 8, pp. 1748-1754).
Bigi, B., Péri, P., & Bertrand, R. (2012). Orthographic Transcription: which Enrichment is required for phonetization? In LREC (Vol. 8, pp. 1756-1763).
Bigi, B. (2012). The SPPAS participation to Evalita 2011. In EVALITA 2011: Workshop on Evaluation of NLP and Speech Tools for Italian.