Munich AUtomatic Segmentation (MAUS)


Phonemic Segmentation and Labeling using the MAUS Technique

F. Schiel, Chr. Draxler, J. Harrington
Bavarian Archive for Speech Signals
Institute of Phonetics and Speech Processing
Ludwig-Maximilians-Universität München, Germany
www.bas.uni-muenchen.de | info@bas.uni-muenchen.de

Overview

- Statistical Segmentation and Labeling
- Super Pronunciation Model: Building the Automaton
- Pronunciation Model: From Automaton to Markov Model

Statistical Segmentation and Labeling

Let $\Psi$ be the set of all possible Segmentations & Labelings (S&L) for a given utterance. The search for the best S&L $\hat{K}$ is then:

$$\hat{K} = \operatorname*{argmax}_{K \in \Psi} P(K \mid o) = \operatorname*{argmax}_{K \in \Psi} \frac{P(K)\,p(o \mid K)}{p(o)}$$

with $o$ the acoustic observation of the signal. Since $p(o)$ is constant for all $K$, this simplifies to:

$$\hat{K} = \operatorname*{argmax}_{K \in \Psi} P(K)\,p(o \mid K)$$

with:
$P(K)$ = the a priori probability of a label sequence,
$p(o \mid K)$ = the acoustic probability of $o$ given $K$ (often modeled by a concatenation of HMMs).
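Once $\Psi$, $P(K)$ and $p(o \mid K)$ are in place, the maximization itself is simple. A minimal sketch in Python, assuming a finite candidate set and caller-supplied scoring helpers log_prior and log_acoustic (hypothetical names; the acoustic score would come from an HMM forward pass, not shown here):

```python
# Minimal sketch of the MAP search over a finite candidate set Psi.
# log_prior(K) stands for log P(K) and log_acoustic(o, K) for
# log p(o|K); both are assumed helpers, not part of MAUS itself.
def best_labeling(psi, o, log_prior, log_acoustic):
    # p(o) is constant over all K in Psi, so it drops out of the argmax
    return max(psi, key=lambda K: log_prior(K) + log_acoustic(o, K))
```

With a single candidate in psi and a prior of 1, this degenerates to forced alignment, as described on the next slide.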

Statistical Segmentation and Labeling

S&L approaches differ in how they create $\Psi$ and model $P(K)$. For example, in forced alignment $|\Psi| = 1$ and $P(K) = 1$, so only $p(o \mid K)$ is maximized. Other ways to model $\Psi$ and $P(K)$:
- phonological rules resulting in $M$ variants, with $P(K) = \frac{1}{M}$
- phonotactic n-grams
- a lexicon of pronunciation variants
- a Markov process (MAUS)

Building the Automaton

Start with the orthographic transcript: heute Abend. By applying lexicon lookup and/or a text-to-phoneme algorithm, produce a (more or less standardized) citation form in SAM-PA: hoyt@?a:b@nt. Then add word boundary symbols # and form a linear automaton $G_c$.
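A minimal sketch of this step in Python, assuming a greedy longest-match tokenizer over an illustrative and deliberately incomplete SAM-PA inventory (INVENTORY and tokenize are assumptions for illustration, not the MAUS implementation):

```python
# Minimal sketch: tokenize a SAM-PA citation form into phoneme symbols
# (longest match first, so 'a:' wins over 'a') and add word boundary
# symbols #. INVENTORY is illustrative, not the full German set.
INVENTORY = {"?", "a:", "b", "@", "n", "t"}

def tokenize(sampa):
    phones, i = [], 0
    while i < len(sampa):
        for length in (2, 1):          # try two-char symbols first
            if sampa[i:i + length] in INVENTORY:
                phones.append(sampa[i:i + length])
                i += length
                break
        else:
            raise ValueError("unknown symbol at: " + sampa[i:])
    return phones

# linear automaton G_c for the word 'Abend' (/?a:b@nt/):
nodes = ["#"] + tokenize("?a:b@nt") + ["#"]
print(nodes)  # ['#', '?', 'a:', 'b', '@', 'n', 't', '#']
```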

Building the Automaton

Extend the automaton $G_c$ by applying a set of substitution rules $q_k$, where each $q_k = (a, b, l, r)$ with
a : pattern string
b : replacement string
l : left context string
r : right context string
For example, the rules (/@n/, /m/, /b/, /t/) and (/b@n/, /m/, /a:/, /t/) generate the reduced/assimilated pronunciation forms /?a:bmt/ and /?a:mt/ from the canonical pronunciation /?a:b@nt/ (evening).
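To make the rule mechanism concrete, here is a minimal Python sketch, assuming rules operate on plain SAM-PA strings and that each rule is applied at its first matching position (apply_rule is an illustrative helper, not the MAUS implementation):

```python
# Minimal sketch: apply one substitution rule q = (a, b, l, r),
# i.e. "replace pattern a by b in left context l and right context r",
# at its first match in a plain SAM-PA string.
def apply_rule(canonical, a, b, l, r):
    i = canonical.find(l + a + r)
    if i < 0:
        return None  # rule does not apply to this pronunciation
    return canonical[:i + len(l)] + b + canonical[i + len(l) + len(a):]

# The two example rules applied to the canonical form of "Abend":
canonical = "?a:b@nt"
rules = [("@n", "m", "b", "t"), ("b@n", "m", "a:", "t")]
print([apply_rule(canonical, *q) for q in rules])  # ['?a:bmt', '?a:mt']
```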

Building the Automaton

Applying the two rules to $G_c$ results in the extended automaton (figure).

From Automaton to Markov Process

Add transition probabilities to the arcs of $G(N, A)$.

Case 1: all paths through $G(N, A)$ are of equal probability. This is not trivial, since paths can have different lengths! The transition probability from node $d_i$ to node $d_j$ is

$$P(d_j \mid d_i) = \frac{P(d_j)\,N(d_i)}{P(d_i)\,N(d_j)}$$

where $N(d_i)$ is the number of paths ending in node $d_i$, and $P(d_i)$ is the probability that node $d_i$ is part of a path. Both can be calculated recursively through $G(N, A)$ (see Kipp, 1998 for details).
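A small Python sketch of this Case 1 weighting, assuming the automaton is given as an adjacency dict of a DAG with unique start and end nodes (the graph and all names are hypothetical; the recursion follows the slide's definitions, not Kipp's original code):

```python
# Minimal sketch: equal-path-probability weighting on a DAG.
from functools import lru_cache

succ = {"start": ["a", "b"], "a": ["end"], "b": ["c"], "c": ["end"], "end": []}
pred = {n: [m for m in succ if n in succ[m]] for n in succ}

@lru_cache(maxsize=None)
def n_in(d):   # N(d): number of paths from 'start' ending in node d
    return 1 if d == "start" else sum(n_in(p) for p in pred[d])

@lru_cache(maxsize=None)
def n_out(d):  # number of paths from node d to 'end'
    return 1 if d == "end" else sum(n_out(s) for s in succ[d])

total = n_out("start")  # total number of complete paths through the DAG

def prob(d):   # P(d): probability that node d lies on a random path
    return n_in(d) * n_out(d) / total

def trans(di, dj):  # P(d_j | d_i) as defined on the slide
    return prob(dj) * n_in(di) / (prob(di) * n_in(dj))

print(trans("start", "a"), trans("start", "b"))  # 0.5 0.5
```

Algebraically, with $P(d) = N(d)\,F(d)/\text{total}$ and $F(d)$ the number of paths from $d$ to the end node, the formula reduces to $F(d_j)/F(d_i)$, which is why every complete path ends up with probability $1/\text{total}$.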

From Automaton to Markov Process

Example: a Markov process with 4 possible paths of different lengths (figure). The transition probabilities are chosen such that every path has the same total probability of $\frac{1}{4}$.

From Automaton to Markov Process

Case 2: paths through $G(N, A)$ are weighted according to the individual rule probabilities along the path. Again not trivial, since the contexts of different rule applications may overlap! This can cause total branching probabilities > 1. Please refer to Kipp, 1998 for details on calculating correct transition probabilities.

From Markov Process to Hidden Markov Model

To obtain a true HMM, add emission probabilities to the nodes $N$ of $G_c$: replace the phonemic symbols in $N$ by mono-phone HMMs. The search lattice for the previous example (figure).

From Markov Process to Hidden Markov Model

Word boundary nodes # are replaced by an optional silence model, so that possible silence intervals between words can be modeled.

Evaluation

How do we evaluate an S&L system? Required: a reference corpus with hand-crafted S&L (a gold standard). The evaluation usually has two steps:
1. Evaluate the accuracy of the label sequence (transcript)
2. Evaluate the accuracy of the segment boundaries

Evaluation of Label Sequence

Often used for label sequence evaluation: Cohen's $\kappa$, the amount of overlap between two transcripts (system vs. gold standard), independent of the symbol set size (Cohen 1960). We consider $\kappa$ inappropriate for S&L evaluation, since:
- no gold standard exists in phonemic S&L
- different symbol set sizes do not matter in S&L
- the task difficulty is not considered (e.g. read vs. spontaneous speech)

Evaluation of Label Sequence

Proposal: Relative Symmetric Accuracy (RSA), the ratio of the average symmetric system-to-labeler agreement $\hat{SA}_{hs}$ to the average inter-labeler agreement $\hat{SA}_{hh}$:

$$RSA = \frac{\hat{SA}_{hs}}{\hat{SA}_{hh}} \cdot 100\%$$
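A minimal sketch of the RSA computation, assuming a caller-supplied agreement(t1, t2) helper that returns the symmetric accuracy (in %) between two transcripts, e.g. computed from a Levenshtein alignment (not shown):

```python
# Minimal sketch of Relative Symmetric Accuracy (RSA).
# agreement(t1, t2) is an assumed helper; needs >= 2 human transcripts.
from itertools import combinations

def rsa(system_transcript, human_transcripts, agreement):
    sa_hs = sum(agreement(system_transcript, h)
                for h in human_transcripts) / len(human_transcripts)
    pairs = list(combinations(human_transcripts, 2))
    sa_hh = sum(agreement(a, b) for a, b in pairs) / len(pairs)
    return sa_hs / sa_hh * 100.0  # RSA in percent
```

With the numbers from the next slide, this returns 81.85 / 84.01 * 100 ≈ 97.43.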

Evaluation of Label Sequence

German MAUS evaluation: 3 human labelers, spontaneous speech (Verbmobil), 9587 phonemic segments.

Average system-labeler agreement: $\hat{SA}_{hs} = 81.85\%$
Average inter-labeler agreement: $\hat{SA}_{hh} = 84.01\%$
Relative symmetric accuracy: $RSA = 97.43\%$

Evaluation of Segmentation

There is no standardized methodology. Problem: insertions and deletions. Solution: compare only matching segments. Boundary deviations greater than a threshold (e.g. 20 msec) are often counted as errors; a better approach is a deviation histogram measured against all human segmenters.
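A minimal sketch of the threshold count and the histogram data, assuming the insertion/deletion problem has already been solved, i.e. the inputs are boundary times (in seconds) of matching segments only:

```python
# Minimal sketch of boundary evaluation between an automatic and a
# manual segmentation of the same matching segments.
def boundary_evaluation(auto_bounds, manual_bounds, threshold=0.020):
    deviations = [a - m for a, m in zip(auto_bounds, manual_bounds)]
    error_rate = sum(abs(d) > threshold for d in deviations) / len(deviations)
    return error_rate, deviations  # the deviations feed the histogram
```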

Evaluation of Segmentation

German MAUS: deviation histogram against the human segmenters (figure). Note: the center shift is typical for HMM alignment.

MAUS software package: ftp://ftp.bas.uni-muenchen.de/pub/bas/softw/maus

MAUS requires:
- UNIX System V or cygwin
- GNU C compiler
- HTK (University of Cambridge)

Current language support: German, English, Hungarian, Icelandic, Estonian, Portuguese, Spanish. A MAUS web service is currently in alpha; if you are interested in a demo, please contact me after the talk.

References

Kipp A (1998): Automatische Segmentierung und Etikettierung von Spontansprache [Automatic Segmentation and Labeling of Spontaneous Speech]. Doctoral thesis, Technical University Munich.
Wester M, Kessens J M, Strik H (1998): Improving the performance of a Dutch CSR by modeling pronunciation variation. Workshop on Modeling Pronunciation Variation, Rolduc, Netherlands, pp. 145-150.
Kipp A, Wesenick M B, Schiel F (1996): Automatic Detection and Segmentation of Pronunciation Variants in German Speech Corpora. Proceedings of the ICSLP, Philadelphia, pp. 106-109.
Schiel F (1999): Automatic Phonetic Transcription of Non-Prompted Speech. Proceedings of the ICPhS, San Francisco, August 1999, pp. 607-610.
MAUS: ftp://ftp.bas.uni-muenchen.de/pub/bas/softw/maus
Draxler Chr, Jänsch K (2008): WikiSpeech - A Content Management System for Speech Databases. Proceedings of Interspeech, Brisbane, Australia, pp. 1646-1649.
CLARIN: http://www.clarin.eu/
Cohen J (1960): A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1): 37-46.
Fleiss J L (1971): Measuring nominal scale agreement among many raters. Psychological Bulletin, Vol. 76, No. 5, pp. 378-382.
Burger S, Weilhammer K, Schiel F, Tillmann H G (2000): Verbmobil Data Collection and Annotation. In: Verbmobil: Foundations of Speech-to-Speech Translation (Ed. Wahlster W), Springer, Berlin, Heidelberg.
Schiel F, Heinrich Chr, Barfüßer S (2011): Alcohol Language Corpus. Language Resources and Evaluation, Springer, Berlin, New York, in print.

How to adapt MAUS to a new language?

Several possible ways, in ascending order of performance and effort:

1. Define a mapping from the phoneme set of the new language to the German set (or any other language available in MAUS) and constrain pronunciation to the canonical form; see the dict-based sketch after this list.
Effort: nil. Performance: for some languages surprisingly good.

2. Hand-craft pronunciation rules (depending on the language, not more than 10-20) and run MAUS in the manual rule set mode.
Effort: small. Performance: very much dependent on the language, the type of speech, the speakers etc.

3. Adapt the HMMs to a corpus of the new language using an iterative training scheme (script maus.iter). The corpus does not need to be annotated.
Effort: moderate (if a corpus is available). Performance: for most languages very good, depending on the adaptation corpus (size, quality, match to the target language etc.).

4. Retrieve statistically weighted pronunciation rules from a corpus. The corpus needs to be at least 1 hour long and segmented/labeled manually.
Effort: high. Performance: unknown.
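For strategy 1, the mapping itself can be as simple as a lookup table. A hypothetical Python sketch (the symbol pairs are illustrative SAM-PA choices only, not an official MAUS mapping):

```python
# Hypothetical sketch of strategy 1: map the phoneme set of a new
# language onto the German MAUS set. The pairs below are illustrative
# nearest-neighbor choices, not an official mapping.
NEW_LANG_TO_GERMAN = {
    "T": "s",   # e.g. English /T/ ("thin") -> nearest German /s/
    "D": "z",   # English /D/ ("this") -> /z/
    "w": "v",   # English /w/ -> /v/
}

def map_canonical(phoneme_seq):
    # phonemes without an entry pass through unchanged
    return [NEW_LANG_TO_GERMAN.get(p, p) for p in phoneme_seq]

print(map_canonical(["D", "I", "s"]))  # ['z', 'I', 's']
```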