Text-to-Speech synthesis using OpenMARY


An introduction and practical tutorial
Marc Schröder, DFKI (marc.schroeder@dfki.de)
eNTERFACE, Amsterdam, 14 July 2010

Overview
- Some text-to-speech (TTS) basics: natural language processing; generating the sound (diphone synthesis, unit selection synthesis, HMM-based synthesis)
- OpenMARY: the existing system; MARY 4.0 as a toolkit for adding new languages and voices
- Tutorial overview: what you will learn to do in the tutorial

What is text-to-speech synthesis? A TTS system converts written input such as "You have one message from Dr Johnson." into spoken output.

Applications of TTS
- Text readers for the blind
- Eyes-free environments (e.g., while driving)
- Telephone-based voice portals
- Multi-modal interactive systems: talking heads, embodied conversational agents (ECAs)

A talking head: for an utterance such as "Hello, nice to meet you.", the TTS system provides not only the audio but also information on timing and mouth shapes to drive the animation.

Structure of a TTS system
- Input: TEXT or SSML, i.e. either plain text or a Speech Synthesis Markup Language document
- Text analysis (natural language processing techniques) produces ACOUSTPARAMS: the phonetic transcription plus prosodic parameters (intonation specification, pausing and speech timing)
- Audio generation (signal processing techniques) produces AUDIO: a wave file

Structure of a TTS system: MARY TTS
- Text analysis: input markup parser (TEXT or SSML -> RAWMARYXML); shallow NLP (RAWMARYXML -> PARTSOFSPEECH); phonemiser (PARTSOFSPEECH -> ALLOPHONES); symbolic prosody (ALLOPHONES -> INTONATION); acoustic parameters (INTONATION -> ACOUSTPARAMS)
- Audio generation: waveform synthesis (ACOUSTPARAMS -> AUDIO)
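To make this data flow concrete, here is a purely illustrative sketch (not MARY's actual API) of a pipeline of modules, each declaring the data type it consumes and produces; plain strings stand in for the XML documents that MARY actually passes between modules.

```java
import java.util.List;

// Illustrative only: a module converts one intermediate representation into the next.
interface Module {
    String inputType();          // e.g. "RAWMARYXML"
    String outputType();         // e.g. "PARTSOFSPEECH"
    String process(String data); // in MARY these are XML documents, not plain strings
}

final class Pipeline {
    private final List<Module> modules;

    Pipeline(List<Module> modules) { this.modules = modules; }

    String run(String input) {
        String data = input;
        for (Module m : modules) {
            // RAWMARYXML -> PARTSOFSPEECH -> ALLOPHONES -> INTONATION -> ACOUSTPARAMS -> AUDIO
            data = m.process(data);
        }
        return data;
    }
}
```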

System structure: input markup parser (TEXT or SSML -> RAWMARYXML)
- MaryXML is the system-internal XML representation, so parsing speech synthesis markup is a simple XML transformation
- Implemented with XSLT, and therefore easily adaptable to a new markup language
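Since the parsing step is described as a simple XML transformation using XSLT, the following sketch shows how such a transformation can be run with the standard Java XSLT API. The stylesheet and file names (ssml-to-maryxml.xsl, input.ssml, output.maryxml) are made up for illustration and are not MARY's actual files.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class MarkupParserSketch {
    public static void main(String[] args) throws Exception {
        // Load a hypothetical stylesheet that maps SSML elements onto MaryXML elements
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("ssml-to-maryxml.xsl")));
        // Apply it: the whole "parser" is one XSLT transformation
        t.transform(new StreamSource(new File("input.ssml")),
                    new StreamResult(new File("output.maryxml")));
    }
}
```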

System structure: shallow NLP
- Tokeniser (RAWMARYXML -> TOKENS): sentence boundaries; tokens = word-like units
- Text normalisation (TOKENS -> WORDS): expanded, pronounceable forms (see next slide)
- Part-of-speech tagger (WORDS -> PARTSOFSPEECH)
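As a rough idea of what the tokeniser and sentence-boundary step does, here is a toy sketch (not MARY's implementation): split on sentence-final punctuation followed by whitespace, then split each sentence into word-like tokens.

```java
import java.util.Arrays;

public class TokeniserSketch {
    public static void main(String[] args) {
        String text = "You have one message. Call back at 12:24!";
        // NOTE: a real tokeniser must handle abbreviations such as "Dr." -- this toy split does not.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        for (String sentence : sentences) {
            String[] tokens = sentence.split("\\s+");
            System.out.println(Arrays.toString(tokens));
        }
    }
}
```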

Preprocessing / text normalisation
- Net patterns (email and web addresses): info@dfki.de
- Date patterns: 23/07/2001
- Time patterns: 12:24 h, 12:24
- Duration patterns: 12:24 h, 12 h 24 min
- Currency patterns: 12.95
- Measure patterns: 123.09 km
- Telephone number patterns: +49-681-85775-5303
- Number patterns (cardinal, ordinal, Roman): 3, 3rd, III.
- Abbreviations: engl.
- Special characters: &
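A minimal sketch of how a few of these pattern types could be expanded with regular expressions. The rules and wordings are invented for illustration and are far simpler than MARY's actual preprocessing.

```java
import java.util.regex.Pattern;

public class NormaliserSketch {
    private static final Pattern TIME    = Pattern.compile("(\\d{1,2}):(\\d{2})\\b");
    private static final Pattern MEASURE = Pattern.compile("(\\d+(?:\\.\\d+)?) ?km\\b");

    static String normalise(String text) {
        // Time pattern: 12:24 -> "12 hours 24 minutes"
        text = TIME.matcher(text).replaceAll("$1 hours $2 minutes");
        // Measure pattern: 123.09 km -> "123.09 kilometres"
        text = MEASURE.matcher(text).replaceAll("$1 kilometres");
        // Special character: & -> "and"
        return text.replace("&", "and");
    }

    public static void main(String[] args) {
        System.out.println(normalise("Meet at 12:24 after 123.09 km & a coffee."));
    }
}
```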

System structure: phonemisation
- Phonemiser (PARTSOFSPEECH -> PHONEMES): lexicon lookup; letter-to-sound conversion for unknown words (morphological decomposition, letter-to-sound rules); syllabification; word stress assignment
- Custom pronunciation (PHONEMES -> ALLOPHONES): slurring, non-standard pronunciation; potentially trainable from annotated data of a given person
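A sketch of the lookup strategy described above: consult the pronunciation lexicon first and fall back to letter-to-sound rules for unknown words. The lexicon entry, the transcription notation and the fallback rules below are invented placeholders, not MARY's data.

```java
import java.util.HashMap;
import java.util.Map;

public class PhonemiserSketch {
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("winter", "'wIntr=");   // SAMPA-like transcription, illustrative only
    }

    static String phonemise(String word) {
        String fromLexicon = LEXICON.get(word.toLowerCase());
        if (fromLexicon != null) return fromLexicon;   // lexicon lookup succeeded
        return lettersToSounds(word);                  // unknown word: letter-to-sound rules
    }

    // Extremely naive letter-to-sound fallback, just to show where real rules
    // (or a trained letter-to-sound model) would plug in.
    static String lettersToSounds(String word) {
        return word.toLowerCase().replace("sh", "S").replace("ch", "tS");
    }

    public static void main(String[] args) {
        System.out.println(phonemise("winter"));
        System.out.println(phonemise("shenanigans"));
    }
}
```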

System structure: prosody
- What is prosody? Intonation (accented syllables; high or low phrase boundaries), rhythmic effects (pauses, syllable durations), loudness, voice quality
- Symbolic prosody prediction (ALLOPHONES -> INTONATION): assign prosody by rule, based on punctuation and part-of-speech; modelled using Tones and Break Indices (ToBI): tonal targets (accents, boundary tones) and phrase breaks
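An illustrative rule-based sketch in the spirit of "assign prosody by rule, based on punctuation and part-of-speech": content words receive a ToBI pitch accent and the final punctuation determines the boundary tone. The rules are deliberately crude and are not MARY's.

```java
public class ProsodySketch {
    // Content words (here crudely: nouns, verbs, adjectives) get an H* pitch accent
    static String accentFor(String posTag) {
        return (posTag.startsWith("NN") || posTag.startsWith("VB") || posTag.startsWith("JJ"))
                ? "H*" : "";
    }

    // Question mark -> rising boundary, otherwise falling boundary
    static String boundaryToneFor(String finalPunctuation) {
        return finalPunctuation.equals("?") ? "H-H%" : "L-L%";
    }

    public static void main(String[] args) {
        String[] words = {"you", "have", "one", "message"};
        String[] tags  = {"PRP", "VBP", "CD", "NN"};
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + "\t" + tags[i] + "\t" + accentFor(tags[i]));
        }
        System.out.println("boundary tone: " + boundaryToneFor("."));
    }
}
```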

System structure: calculation of acoustic parameters
- Duration prediction (INTONATION -> DURATIONS): segment durations predicted by rules or by decision trees
- Contour generation (DURATIONS -> ACOUSTPARAMS): fundamental frequency curve predicted by rules or by decision trees
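A sketch of what "predicted by rules" can look like for durations and the F0 contour. All numbers (base durations, declination line, accent peak) are invented for illustration and stand in for real rule sets or decision trees.

```java
public class AcousticParamsSketch {
    // Rule-based segment duration: vowels longer than consonants, lengthening before a pause
    static int durationMs(String phone, boolean phraseFinal) {
        int base = "aeiou".contains(phone) ? 90 : 60;
        return phraseFinal ? (int) (base * 1.5) : base;
    }

    // Rule-based F0: a declination line from 120 Hz to 90 Hz plus a peak on the accented syllable
    static double f0Hz(double t, double tAccent, double utteranceDuration) {
        double declination = 120 - 30 * (t / utteranceDuration);
        double accentPeak = 40 * Math.exp(-Math.pow((t - tAccent) / 0.05, 2));
        return declination + accentPeak;
    }

    public static void main(String[] args) {
        System.out.println(durationMs("a", true) + " ms");
        System.out.println(f0Hz(0.30, 0.30, 1.2) + " Hz");
    }
}
```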

System structure: waveform synthesis (ACOUSTPARAMS -> AUDIO), using one of several waveform generation technologies.

Creating sound: waveform synthesis technologies (1)
Formant synthesis: an acoustic model of speech generates the acoustic structure by rule; the result has a robotic sound.
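To illustrate "generate acoustic structure by rule", here is a minimal formant-synthesis sketch: an impulse train at 120 Hz is filtered through two second-order resonators set to rough formant values of an [a]-like vowel. The coefficient formula is the standard digital resonator; the specific values are illustrative and unrelated to any MARY voice.

```java
public class FormantSketch {
    // One digital resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
    static float[] resonate(float[] x, double freq, double bandwidth, double fs) {
        double C = -Math.exp(-2 * Math.PI * bandwidth / fs);
        double B = 2 * Math.exp(-Math.PI * bandwidth / fs) * Math.cos(2 * Math.PI * freq / fs);
        double A = 1 - B - C;
        float[] y = new float[x.length];
        double y1 = 0, y2 = 0;
        for (int n = 0; n < x.length; n++) {
            double v = A * x[n] + B * y1 + C * y2;
            y2 = y1;
            y1 = v;
            y[n] = (float) v;
        }
        return y;
    }

    public static void main(String[] args) {
        double fs = 16000;
        float[] source = new float[(int) fs];                 // 1 second of samples
        for (int n = 0; n < source.length; n += (int) (fs / 120)) source[n] = 1f; // 120 Hz pulses
        // Two formants with rough [a]-like values: F1 ~ 700 Hz, F2 ~ 1200 Hz
        float[] out = resonate(resonate(source, 700, 80, fs), 1200, 90, fs);
        System.out.println("synthesised " + out.length + " samples");
    }
}
```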

Creating sound: waveform synthesis technologies (2)
Concatenative synthesis:
- Diphone synthesis: glue pre-recorded diphones together; adapt the prosody through signal processing
- Unit selection synthesis: glue units from a large corpus of speech together; the prosody comes from the corpus, with (nearly) no signal processing

Creating sound: waveform synthesis technologies (3)
Statistical-parametric speech synthesis with hidden Markov models: the models are trained on speech corpora; no recorded speech data is needed at runtime (only the models), so the footprint is small.

Examples of speech synthesis technologies
- MARY TTS: unit selection, HMM-based, MBROLA diphones, expressive unit selection
- Commercial systems: unit selection (IVONA, Loquendo), formant synthesis (DecTalk)

Concatenative synthesis: isolated phones don't work. Target: w I n t r= d ei. The acoustic unit database contains units = phone segments recorded in isolation (a, I, w, ei, T, t, d, n, r=), and simply gluing these together does not produce natural-sounding speech.

Concatenative synthesis: diphones. Target: w I n t r= d ei, realised as the diphone sequence _-w, w-I, I-n, n-t, t-r=, r=-d, d-ei, ei-_. Diphones are sound segments from the middle of one phone to the middle of the next phone. The acoustic unit database contains diphone segments recorded in carrier words with flat intonation, e.g. _-w (wonder), w-I (will), I-n (spin), n-t (fountain), t-r= (water), r=-d (nerdy), d-ei (date), ei-_ (away).
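A sketch of the "glue pre-recorded diphones together" step: look up each diphone's waveform in a hypothetical unit database and join the segments, here with a short linear crossfade at each boundary. Real diphone synthesis additionally modifies the prosody, as shown on the next slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DiphoneConcatSketch {
    // diphones: e.g. ["_-w", "w-I", "I-n", ...]; db maps each diphone name to its samples
    static float[] concatenate(List<String> diphones, Map<String, float[]> db, int fade) {
        List<Float> out = new ArrayList<>();
        for (String name : diphones) {
            float[] unit = db.get(name);
            int start = 0;
            if (out.size() >= fade) {
                // Linear crossfade between the tail of the output and the onset of the new unit
                for (int i = 0; i < fade && i < unit.length; i++) {
                    int j = out.size() - fade + i;
                    float w = i / (float) fade;
                    out.set(j, out.get(j) * (1 - w) + unit[i] * w);
                }
                start = Math.min(fade, unit.length);
            }
            for (int i = start; i < unit.length; i++) out.add(unit[i]);
        }
        float[] y = new float[out.size()];
        for (int i = 0; i < y.length; i++) y[i] = out.get(i);
        return y;
    }
}
```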

Concatenative synthesis: diphones (2). The selected diphones for the target w I n t r= d ei (_-w, w-I, I-n, n-t, t-r=, r=-d, d-ei, ei-_) are concatenated, and their prosody is adapted through PSOLA pitch manipulation.
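The pitch manipulation named on the slide, PSOLA, works by cutting the signal into pitch-synchronous, windowed grains and overlap-adding them at a modified spacing. The following is a very reduced time-domain sketch of that idea; it assumes pitch marks are already known and ignores duration modification and many practical details.

```java
public class PsolaSketch {
    // x: mono samples; marks: sample indices of pitch marks; factor > 1 raises the pitch
    static float[] pitchScale(float[] x, int[] marks, double factor) {
        float[] y = new float[x.length];
        double synthPos = marks[0];
        int i = 0;
        while (i < marks.length - 1 && synthPos < x.length) {
            int period = marks[i + 1] - marks[i];
            // Two-period Hann-windowed grain centred on the analysis mark
            for (int n = -period; n <= period; n++) {
                int src = marks[i] + n;
                int dst = (int) Math.round(synthPos) + n;
                if (src < 0 || src >= x.length || dst < 0 || dst >= y.length) continue;
                double w = 0.5 * (1 + Math.cos(Math.PI * n / (double) period));
                y[dst] += (float) (w * x[src]);
            }
            synthPos += period / factor;   // closer synthesis marks => higher pitch
            // Advance to the analysis mark nearest the new synthesis position
            while (i < marks.length - 2 && marks[i + 1] < synthPos) i++;
        }
        return y;
    }
}
```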

Concatenative synthesis: unit selection. Target: w I n t r= d ei. The acoustic unit database contains (di-)phone segments recorded in natural sentences with natural intonation (e.g. "Let's discuss the question of interchanges another day."), so for each target segment there are many candidate units, and the system must decide which of these to use.
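The "which of these?" question is what the selection algorithm answers. In standard unit selection (not spelled out on the slide), each candidate gets a target cost (how well it matches the wanted segment and prosody) and a join cost (how smoothly it concatenates with its neighbour), and the cheapest path through the candidate lattice is found by dynamic programming. A generic sketch with made-up cost matrices:

```java
public class UnitSelectionSketch {
    // targetCost[t][c]: cost of candidate c at target position t
    // joinCost[t][p][c]: cost of joining candidate p at position t with candidate c at t+1
    static int[] select(double[][] targetCost, double[][][] joinCost) {
        int T = targetCost.length, C = targetCost[0].length;
        double[][] best = new double[T][C];
        int[][] back = new int[T][C];
        best[0] = targetCost[0].clone();
        for (int t = 1; t < T; t++) {
            for (int c = 0; c < C; c++) {
                best[t][c] = Double.POSITIVE_INFINITY;
                for (int p = 0; p < C; p++) {
                    double cost = best[t - 1][p] + joinCost[t - 1][p][c] + targetCost[t][c];
                    if (cost < best[t][c]) { best[t][c] = cost; back[t][c] = p; }
                }
            }
        }
        // Trace back the cheapest path through the candidate lattice
        int[] path = new int[T];
        int argmin = 0;
        for (int c = 1; c < C; c++) if (best[T - 1][c] < best[T - 1][argmin]) argmin = c;
        path[T - 1] = argmin;
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}
```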

AI Poker: the voices of Sam and Max
- Sam: unit selection synthesis; a voice specifically recorded for AI Poker; natural sound within the poker domain
- Max: HMM-based synthesis; sound quality is limited but constant with any text

Sam's voice: unit selection synthesis. Example: "Ich habe zwei Paare." ("I have two pairs.") The unit selection corpus consists of several hours of speech recordings, giving very good quality within the poker domain.

Sam's voice: unit selection synthesis (2). Example: "Ich kann auch ganz andere Sachen..." ("I can also say quite different things...") With arbitrary text outside the domain, the quality is reduced.

Max's voice: HMM-based synthesis. Example: "Ich habe zwei Paare." ("I have two pairs.") Hidden Markov models are trained as statistical models of acoustic feature vectors; at synthesis time a vocoder turns the generated parameters into sound.

Max's voice: HMM-based synthesis (2). Example: "Ich kann auch ganz andere Sachen..." ("I can also say quite different things...") The quality remains constant with arbitrary text.

MARY TTS 4.0
- Pure Java: runs on any platform with Java 5
- Client-server architecture with an HTTP interface: your browser is a MARY client
- Multilingual, with UTF-8 support: English (US and GB), German ("Willkommen"), Turkish ("Konuşma"), Telugu

Audio effects in MARY 4.0
- Some can be applied to any voice: vocal tract length (longer/shorter), robot effect, whisper effect, jet pilot
- More effects for HMM-based voices: pitch level (higher/lower), pitch range (wider/narrower), speaking rate (faster/slower)
- Effects can be parameterised and combined to create characteristic voices

MARY TTS: new-language support workflow
- Wikipedia text import: a Wikipedia XML dump is processed by the dump splitter and markup cleaner into clean text; the feature maker produces sentences with diphone and prosody features; after a manual check to exclude unsuitable sentences, script selection optimises coverage and yields the selected sentences / recording script
- Transcription GUI: the most frequent words in the language are transcribed, producing allophones.xml, a pronunciation lexicon, letter-to-sound rules for unknown words, and a list of function words
- Basic NLP components enable the conversion TEXT -> ALLOPHONES in the new locale: phonemiser and rudimentary POS tagger, plus generic implementations with basic functionality (tokeniser, symbolic prosody)
- Voice building: Redstart records the speech database (audio files); the Voice Import Tools build speaker-specific pronunciation, acoustic models for F0 and duration, unit selection voice files, or HMM-based voice files
- Synthesis components enable the conversion ALLOPHONES -> AUDIO in the new voice

What you will learn to do in the MARY tutorial
- Installing the MARY system, languages and voices
- Interacting with MARY using the web client: basic experimentation, interactive test of audio effects, interactive documentation of the HTTP interface
- Triggering TTS from your own software: HTTP interface, Java client code, selecting language, voice and effects in requests (see the sketch below)
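A sketch of triggering TTS over the HTTP interface from Java, assuming a MARY 4 server running locally on its default port 59125. The endpoint (/process) and parameter names (INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE, AUDIO, LOCALE, VOICE) are written here from memory of the interactive documentation mentioned above, so check them against your installation; the voice name is only an example of an installed voice.

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MaryHttpClientSketch {
    public static void main(String[] args) throws Exception {
        String text = URLEncoder.encode("Welcome to the world of speech synthesis!",
                StandardCharsets.UTF_8.name());
        // Build the request: plain text in, a WAVE file out, US English, an example voice name
        String query = "INPUT_TEXT=" + text
                + "&INPUT_TYPE=TEXT&OUTPUT_TYPE=AUDIO&AUDIO=WAVE_FILE"
                + "&LOCALE=en_US&VOICE=cmu-slt-hsmm";
        URL url = new URL("http://localhost:59125/process?" + query);
        try (InputStream in = url.openStream()) {
            Files.copy(in, Paths.get("output.wav"));   // save the synthesised audio
            System.out.println("wrote output.wav");
        }
    }
}
```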

What you will learn to do in the MARY tutorial (2)
- Using timing information: REALISED_ACOUSTPARAMS and REALISED_DURATIONS
- Performance: caching
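The same request pattern can ask for timing information instead of audio by setting OUTPUT_TYPE=REALISED_DURATIONS (or REALISED_ACOUSTPARAMS for the MaryXML with the realised prosody). The sketch below, under the same assumptions about the local server and parameter names as above, simply prints whatever the server returns.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RealisedDurationsSketch {
    public static void main(String[] args) throws Exception {
        String text = URLEncoder.encode("One message.", StandardCharsets.UTF_8.name());
        URL url = new URL("http://localhost:59125/process?INPUT_TEXT=" + text
                + "&INPUT_TYPE=TEXT&OUTPUT_TYPE=REALISED_DURATIONS&LOCALE=en_US");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);   // print the per-segment timing output
        }
    }
}
```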