Munich AUtomatic Segmentation (MAUS)


Munich AUtomatic Segmentation (MAUS)
Phonemic Segmentation and Labeling using the MAUS Technique
F. Schiel, with contributions by A. Kipp and Th. Kisler
Bavarian Archive for Speech Signals, Institute of Phonetics and Speech Processing, Ludwig-Maximilians-Universität München, Germany
www.bas.uni-muenchen.de schiel@bas.uni-muenchen.de

Overview
Super Pronunciation Model: Building the Automaton
Pronunciation Model: From Automaton to Markov Model
MAUS Web Application
MAUS Web Services

Let Ψ be the set of all possible Segmentations & Labelings (S&L) for a given utterance. The search for the best S&L K̂ is then:

  K̂ = argmax_{K ∈ Ψ} P(K|o) = argmax_{K ∈ Ψ} P(K) p(o|K) / p(o)

with o the acoustic observation of the signal. Since p(o) is constant for all K, this simplifies to:

  K̂ = argmax_{K ∈ Ψ} P(K) p(o|K)

with:
P(K) = the a-priori probability of the label sequence K
p(o|K) = the acoustic probability of o given K (often modeled by a concatenation of HMMs)
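The argmax above can be sketched in a few lines. The priors and likelihoods below are made-up toy numbers; the candidate forms are pronunciation variants of German "Abend" that appear later in these slides.

```python
def best_labeling(candidates):
    """Return the K maximizing P(K) * p(o|K).

    candidates: dict mapping a label sequence K to a pair
    (prior P(K), acoustic likelihood p(o|K)). p(o) is omitted:
    it is constant over K and does not affect the argmax.
    """
    return max(candidates, key=lambda K: candidates[K][0] * candidates[K][1])

# Toy numbers (illustrative only): three pronunciation variants
candidates = {
    "?a:b@nt": (0.6, 1e-5),   # canonical form, high prior
    "?a:bmt":  (0.3, 4e-5),   # reduced form, fits the signal better
    "?a:mt":   (0.1, 2e-5),
}
print(best_labeling(candidates))  # -> ?a:bmt (0.3 * 4e-5 is the largest score)
```

The acoustically better-fitting reduced form wins even though its prior is lower, which is exactly the trade-off the MAUS search exploits.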

S&L approaches differ in how they create Ψ and model P(K). For example, in forced alignment |Ψ| = 1 and P(K) = 1, hence only p(o|K) is maximized. Other ways to model Ψ and P(K):
- phonological rules resulting in M variants, with P(K) = 1/M
- phonotactic n-grams
- a lexicon of pronunciation variants
- a Markov process (MAUS)


Building the Automaton
Start with the orthographic transcript: "heute Abend". By applying lexicon lookup and/or a text-to-phoneme algorithm, produce a (more or less standardized) citation form in SAM-PA: hOYt@ ?a:b@nt. Add word boundary symbols #, forming a linear automaton G_c.
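Constructing G_c amounts to tokenizing the SAM-PA string into phoneme symbols (multi-character symbols such as OY and a: must be matched greedily) and inserting # nodes at word boundaries. A minimal sketch, with a toy inventory covering only this example (the full German SAM-PA set is assumed but not shown):

```python
def linear_automaton(words, inventory):
    """Split each SAM-PA word into phoneme symbols (longest match first)
    and join the words with # boundary nodes, yielding the node
    sequence of the linear automaton G_c."""
    symbols = sorted(inventory, key=len, reverse=True)  # try 'a:' before 'a'
    nodes = ["#"]
    for word in words:
        i = 0
        while i < len(word):
            for sym in symbols:
                if word.startswith(sym, i):
                    nodes.append(sym)
                    i += len(sym)
                    break
            else:
                raise ValueError(f"no symbol matches {word[i:]!r}")
        nodes.append("#")
    return nodes

inventory = {"h", "OY", "t", "@", "?", "a:", "b", "n"}
print(linear_automaton(["hOYt@", "?a:b@nt"], inventory))
# -> ['#', 'h', 'OY', 't', '@', '#', '?', 'a:', 'b', '@', 'n', 't', '#']
```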

Building the Automaton (cont.)
Extend automaton G_c by applying a set of substitution rules q_k, where each q_k = (a, b, l, r) with:
a : pattern string
b : replacement string
l : left context string
r : right context string
For example, the rules (/@n/, /m/, /b/, /t/) and (/b@n/, /m/, /a:/, /t/) generate the reduced/assimilated pronunciation forms /?a:bmt/ and /?a:mt/ from the canonical pronunciation /?a:b@nt/ ("evening").
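The rule mechanism can be sketched as plain string substitution in left/right context. This is a simplified reading of the q_k rules; the real MAUS machinery applies them to the automaton, not to flat strings.

```python
def apply_rule(form, rule):
    """Apply one substitution rule q = (a, b, l, r) to a canonical form:
    replace pattern a by b where it occurs between left context l and
    right context r. Returns None if the rule does not match."""
    a, b, l, r = rule
    idx = form.find(l + a + r)
    if idx < 0:
        return None
    return form[:idx] + l + b + r + form[idx + len(l + a + r):]

canonical = "?a:b@nt"                                  # "Abend" (evening)
print(apply_rule(canonical, ("@n", "m", "b", "t")))    # -> ?a:bmt
print(apply_rule(canonical, ("b@n", "m", "a:", "t")))  # -> ?a:mt
```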

Building the Automaton (cont.)
Applying the two rules to G_c results in the following automaton:

From Automaton to Markov Process
Add transition probabilities to the arcs of G(N, A).
Case 1: all paths through G(N, A) are equally probable. This is not trivial, since paths can have different lengths! The transition probability from node d_i to node d_j is:

  P(d_j | d_i) = ( P(d_j) · N(d_i) ) / ( P(d_i) · N(d_j) )

where:
N(d_i) : number of paths ending in node d_i
P(d_i) : probability that node d_i is part of a path
N(d_i) and P(d_i) can be calculated recursively through G(N, A) (see Kipp, 1998 for details).

Example: a Markov process with 4 possible paths of different lengths. With transition probabilities assigned as above, the arc probabilities along each path multiply to the same total probability of 1/4 for each of the four paths.
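One way to obtain such equal path probabilities is to weight each arc by the ratio of suffix path counts; every complete path then comes out at 1/(number of paths). The 5-node graph below is hypothetical; Kipp (1998) derives the general solution via the path counts N(d_i) and node probabilities P(d_i) above, which is assumed here to yield the same weights.

```python
from functools import lru_cache

# Toy automaton with 3 paths of different lengths: S-A-E, S-B-C-E, S-E
edges = {"S": ["A", "B", "E"], "A": ["E"], "B": ["C"], "C": ["E"], "E": []}

@lru_cache(maxsize=None)
def m(node):
    """Number of paths from node to the final node."""
    return 1 if not edges[node] else sum(m(s) for s in edges[node])

def trans_prob(i, j):
    """Arc weight m(j)/m(i); the weights of a node's successors sum to 1."""
    return m(j) / m(i)

def paths(node):
    """Enumerate all paths from node to the final node."""
    if not edges[node]:
        yield [node]
        return
    for s in edges[node]:
        for p in paths(s):
            yield [node] + p

for p in paths("S"):
    prob = 1.0
    for i, j in zip(p, p[1:]):
        prob *= trans_prob(i, j)
    print(p, prob)   # every path has total probability 1/3
```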

Case 2: each path through G(N, A) has a probability determined by the individual rule probabilities applied along that path. Again not trivial, since the contexts of different rule applications may overlap! This can cause total branching probabilities > 1. Please refer to Kipp, 1998 for details on calculating correct transition probabilities.

From Markov Process to Hidden Markov Model
For a true HMM, add emission probabilities to the nodes N of G_c: replace the phonemic symbols in N by mono-phone HMMs. The search lattice for the previous example:

Word boundary nodes # are replaced by an optional silence model, so that possible silence intervals between words can be modeled.

Evaluation
How to evaluate an S&L system? Required: a reference corpus with hand-crafted S&L (a "gold standard"). Usually two steps:
1. Evaluate the accuracy of the label sequence (transcript)
2. Evaluate the accuracy of the segment boundaries

Evaluation of the Label Sequence
Often used for label sequence evaluation: Cohen's κ, the amount of overlap between two transcripts (system vs. gold standard), independent of the symbol set size (Cohen 1960). We consider κ inappropriate for S&L evaluation, since:
- no gold standard exists in phonemic S&L
- different symbol set sizes do not matter in S&L
- the task difficulty is not considered (e.g. read vs. spontaneous speech)

Proposal: Relative Symmetric Accuracy (RSA) = the ratio of the average symmetric system-to-labeler agreement ŜA_hs to the average inter-labeler agreement ŜA_hh:

  RSA = (ŜA_hs / ŜA_hh) · 100%

German MAUS: 3 human labelers, spontaneous speech (Verbmobil), 9587 phonemic segments.
Average system-labeler agreement:  ŜA_hs = 81.85%
Average inter-labeler agreement:   ŜA_hh = 84.01%
Relative symmetric accuracy:       RSA = 97.43%
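The RSA computation can be sketched as follows. The slides do not spell out how the symmetric agreement ŜA is computed; here, as a labeled assumption, SequenceMatcher.ratio() (2·matches/(len(a)+len(b))) stands in for it, and the label sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def agreement(a, b):
    """Symmetric agreement between two label sequences:
    2*matches/(len(a)+len(b)). Assumed stand-in for ŜA."""
    return SequenceMatcher(None, a, b).ratio()

def rsa(system, humans):
    """Relative Symmetric Accuracy: average system-to-labeler agreement
    divided by average inter-labeler agreement, in percent."""
    sa_hs = mean(agreement(system, h) for h in humans)
    sa_hh = mean(agreement(x, y) for x, y in combinations(humans, 2))
    return 100.0 * sa_hs / sa_hh

# Hypothetical phoneme sequences from two human labelers and a system
humans = [["?", "a:", "b", "@", "n", "t"], ["?", "a:", "b", "m", "t"]]
system = ["?", "a:", "b", "m", "t"]
print(round(rsa(system, humans), 2))  # -> 118.75
```

Note that RSA can exceed 100% when the system agrees with the labelers better than they agree with each other, as in this toy example where the system output matches one labeler exactly.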

Evaluation of Segmentation
There is no standardized methodology.
Problem: insertions and deletions. Solution: compare only matching segments.
Often: count boundary deviations greater than a threshold (e.g. 20 msec) as errors.
Better: a deviation histogram measured against all human segmenters.
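The threshold-based count can be sketched as follows; the boundary times are hypothetical, and ref and sys contain only the boundaries of matching segments, as required above.

```python
def boundary_errors(ref, sys, threshold=0.020):
    """Compare boundary times (in seconds) of matching segments only.
    Returns the signed deviations (usable for a deviation histogram)
    and the fraction of boundaries deviating by more than the
    threshold (default 20 msec)."""
    deviations = [s - r for r, s in zip(ref, sys)]
    n_err = sum(1 for d in deviations if abs(d) > threshold)
    return deviations, n_err / len(deviations)

ref = [0.10, 0.25, 0.40, 0.61]   # hand-labeled boundaries
sys = [0.11, 0.24, 0.45, 0.61]   # automatic boundaries
devs, err_rate = boundary_errors(ref, sys)
print(err_rate)   # one of four boundaries is off by more than 20 msec -> 0.25
```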

German MAUS: deviation histogram (see figure). Note: the center shift is typical for HMM alignment.

Software Package MAUS
MAUS software package: ftp://ftp.bas.uni-muenchen.de/pub/bas/softw/maus
The MAUS package consists of:
- the base script maus
- the corpus processor maus.corpus
- the adaptive maus maus.iter
- the chunk segmentation processor maus.trn
- helper programs: visualization, graph generator etc.
- parameter sets for supported languages
- test benchmarks

MAUS installation requires:
- UNIX System V or cygwin
- GNU C compiler
- HTK (University of Cambridge)
- awk, sox
Current language support: deu, eng, ita, aus (with pronunciation modelling); hun, ekk, por, spa, nld, sampa (without modelling)
Example call:
maus BPF=file.par \
     SIGNAL=file.wav LANGUAGE=eng \
     OUT=file.TextGrid OUTFORMAT=TextGrid

How to adapt MAUS to a new language? Several possible ways (in ascending performance and effort):
1. Use the SAM-PA language (collective MAUS phoneme set). No pronunciation modelling possible.
   Effort: nil. Performance: for some languages surprisingly good.

2. Hand-craft pronunciation rules (depending on the language, not more than 10-20) and run MAUS in the manual rule set mode.
   Effort: small. Performance: very much dependent on the language, the type of speech, the speakers etc.
3. Adapt the HMMs to a corpus of the new language using an iterative training schema (script maus.iter). The corpus does not need to be annotated.
   Effort: moderate (if a corpus is available). Performance: for most languages very good, depending on the adaptation corpus (size, quality, match to the target language etc.)

4. Retrieve statistically weighted pronunciation rules from a corpus. The corpus needs to be at least 1 hour long and segmented/labeled manually.
   Effort: high. Performance: unknown.

MAUS Web Interface
http://clarin.phonetik.uni-muenchen.de/baswebservices/
WebMAUS: web interface to the latest version of MAUS
Pros:
- no local installation necessary
- runs on all platforms (even smartphones)
- text-normalization and text-to-phoneme conversion (partially)
Cons:
- no adaptation to new languages
- no application of proprietary rule sets
- no iterative adaptation mode

WebMAUS Basic: Signal + Text -> Segmentation
- simple, robust
- includes text-normalisation, tokenization and text-to-phoneme conversion
- no control of parameters or input (except language)
- supported languages: deu, hun, eng, nld, ita
- supported output: TextGrid (praat)
Pipeline: SIGNAL + TXT -> normalisation -> tokenization -> text-to-phoneme (Balloon) -> MAUS -> TextGrid

WebMAUS General: Signal + Phonology -> Segmentation
- full control of all MAUS options
- phonologic input allows fine tuning
- requires input in BAS Partitur Format (BPF)
- supported output: BPF, TextGrid, Emu
- supported languages: deu, eng, ita, aus, hun, ekk, por, spa, nld, sampa

WebMAUS Multiple: Signals + Texts -> Segmentations
- drag & drop of input files
- features like WebMAUS Basic
- batch processing of unlimited file pairs

MAUS Web Services
A web service is a direct call to a server. MAUS web services can be used within programming languages or scripts, or from the command line, e.g.:

curl -v -X POST -H 'content-type: multipart/form-data' \
     -F LANGUAGE=deu -F TEXT=@file.txt -F SIGNAL=@file.wav \
     http://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSBasic

To get started, call:

curl -X GET http://clarin.phonetik.uni-muenchen.de/baswebservices/services/help

The script maus.web (in the MAUS package) is a CSH wrapper for web service calls; it can be used like the original maus script, but issues web service calls instead:

maus.web BPF=file.par \
         SIGNAL=file.wav LANGUAGE=eng \
         OUT=file.TextGrid OUTFORMAT=TextGrid

References
- Kipp A (1998): Automatische Segmentierung und Etikettierung von Spontansprache [Automatic Segmentation and Labeling of Spontaneous Speech]. Doctoral thesis, Technical University Munich.
- Wester M, Kessens J M, Strik H (1998): Improving the performance of a Dutch CSR by modeling pronunciation variation. Workshop on Modeling Pronunciation Variation, Rolduc, Netherlands, pp. 145-150.
- Kipp A, Wesenick M B, Schiel F (1996): Automatic Detection and Segmentation of Pronunciation Variants in German Speech Corpora. Proceedings of the ICSLP, Philadelphia, pp. 106-109.
- Schiel F (1999): Automatic Phonetic Transcription of Non-Prompted Speech. Proceedings of the ICPhS, San Francisco, August 1999, pp. 607-610.
- MAUS: ftp://ftp.bas.uni-muenchen.de/pub/bas/softw/maus
- Draxler Chr, Jänsch K (2008): WikiSpeech: A Content Management System for Speech Databases. Proceedings of Interspeech, Brisbane, Australia, pp. 1646-1649.
- CLARIN: http://www.clarin.eu/
- Cohen J (1960): A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1): 37-46.
- Fleiss J L (1971): Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5): 378-382.
- Kisler T, Schiel F, Sloetjes H (2012): Signal processing via web services: the use case WebMAUS. Proceedings of Digital Humanities 2012, Hamburg, Germany, pp. 30-34.

Questions?