Lecture 10: Generation and speech synthesis


Pierre Lison, Language Technology Group (LTG)
Department of Informatics
Fall 2012, October 12, 2012

Outline
- General architecture
- Natural language generation
- Speech synthesis
- Summary

A simple schema

[Architecture diagram: the input speech signal (user utterance) is converted by speech recognition into recognition hypotheses ũu, which language understanding turns into an interpreted utterance ãu. Dialogue management selects an intended response am, which generation turns into an utterance to synthesise um, and speech synthesis produces the output speech signal (machine utterance). The components interact with the user and the extra-linguistic environment.]
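To make the chain of components concrete, here is a minimal sketch of one dialogue turn through this architecture. Every function body below is a hypothetical stub standing in for a real component, and all names and return values are illustrative assumptions:

```python
# A toy walk through the dialogue system pipeline. Each component is a
# stub: real systems would use statistical models at every step.

def speech_recognition(signal):
    # Would return n-best recognition hypotheses (the u~u in the schema).
    return ["book a ticket to oslo"]

def language_understanding(hypotheses):
    # Would map the hypotheses to an interpreted utterance a~u.
    return {"intent": "book_ticket", "destination": "oslo"}

def dialogue_management(interpretation):
    # Would select the intended response am given the dialogue state.
    return ("Confirm", interpretation["destination"])

def generation(intended_response):
    # Would realise am as a concrete utterance um to synthesise.
    act, destination = intended_response
    return f"So you want to travel to {destination}?"

def speech_synthesis(utterance):
    # Would produce an output speech signal; here we just wrap the text.
    return f"<audio: {utterance}>"

def dialogue_turn(input_signal):
    hypotheses = speech_recognition(input_signal)
    interpretation = language_understanding(hypotheses)
    response = dialogue_management(interpretation)
    utterance = generation(response)
    return speech_synthesis(utterance)
```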

List of basic components (4)

Natural language generation (NLG) is the reverse task of NLU: given a high-level representation of the response, find the right words to express it.

How to express (or realise) the given intention might depend on various contextual factors. For example, for am = Confirm, the candidate utterances um could be scored as:

  2.0  «Yes, I agree!»
  1.3  «Yes, I already love this class!»
  0.8  «Sure!»

List of basic components (5)

Finally, speech synthesis (TTS, for «text-to-speech») is the task of generating a speech signal corresponding to the selected system reply, e.g. um = «Yes, I agree!». The output can be modulated in various ways (voice, intonation, accent, etc.).

Outline
- General architecture
- Natural language generation
  - Shallow generation
  - Deep generation
  - Statistical generation
  - Generation of referring expressions
- Speech synthesis
- Summary

Natural language generation

The goal of NLG is to convert a high-level communicative goal into a concrete utterance. As for natural language understanding (NLU), a wide range of methods exists for NLG, with varying degrees of complexity:
- Some are «shallow» approaches based on canned utterances.
- Others adopt a «deep» approach based on generic grammatical resources and reasoning patterns.
- And of course, we can also train statistical systems to generate optimal utterances based on data.

Shallow NLG

In shallow approaches to NLG, the system designer manually maps the communicative goals am to specific handcrafted utterances um. The utterances might contain slots to be filled:

  Goal am                        Utterance um
  AskRepeat                      «Sorry, could you please repeat?»
  Assert(cost(ticket, price))    «This ticket will cost you {price} USD»
  Ask(departure)                 «Please state where you are flying from»
                                 «Where are you departing from?»
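A shallow generator of this kind amounts to little more than a lookup table plus slot filling. A minimal sketch in Python, where the goal names and templates are the illustrative ones from the table above:

```python
import random

# Hypothetical handcrafted mapping from communicative goals am to
# utterance templates um; slots in braces are filled in at runtime.
# Several templates per goal allow some random variation.
TEMPLATES = {
    "AskRepeat": ["Sorry, could you please repeat?"],
    "AssertCost": ["This ticket will cost you {price} USD"],
    "AskDeparture": ["Please state where you are flying from",
                     "Where are you departing from?"],
}

def realise(goal, **slots):
    """Pick a template for the goal (randomly, for variation) and fill its slots."""
    template = random.choice(TEMPLATES[goal])
    return template.format(**slots)
```

For example, `realise("AssertCost", price=120)` yields «This ticket will cost you 120 USD», while `realise("AskDeparture")` picks one of the two departure prompts at random.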

Shallow NLG

Shallow approaches are by far the most popular in commercial systems:
- Limited effort: there are rarely more than a few hundred prompts for a given system.
- They give the designer full control over the system behaviour (important for quality assurance).
- One can introduce some variation by randomly selecting the utterance from a set of possible candidates.

Deep NLG

Shallow approaches rely on the detailed specification of every possible utterance. A good part of this process is domain-independent and could be automated.

Deep NLG pipeline: Communicative goal am → Sentence planner → Surface realiser → Prosody assigner → Utterance um

Deep NLG

The pipeline consists of three modules:
- Sentence planning: selection of the abstract linguistic items (lexemes, semantic structure) necessary to achieve the communicative goal.
- Surface realisation: construction of a surface utterance based on the abstract items and language-specific constraints (word order, morphology, function words, etc.).
- Prosody assignment: determination of the utterance's prosodic structure based on information structure (e.g. what is in focus, what is given vs. what is new).

Sentence planning

How do we perform sentence planning? Recall Grice's cooperative principle, and in particular the Maxim of Quantity: say exactly as much as is necessary for your contribution.

The goal is therefore to find the best way to convey the system's intention in the fewest possible words... but while remaining clear and unambiguous! The communicative goal must sometimes be split into several separate utterances.

Surface realisation

Given the high-level semantics of the utterance provided by the sentence planner, one can then realise it as a concrete utterance. This is the inverse operation of classical parsing!

Some grammatical formalisms are «bidirectional» or reversible, i.e. they can be used for both parsing and generation. HPSG or CCG grammars are reversible (or at least can be made reversible, given some grammar engineering).

Deep NLG

Sentence planning and surface realisation are intertwined operations, and some systems perform both together. Examples: the SPUD and CRISP systems, based on TAG grammars and classical planning algorithms.

[M. Stone et al. (2003). «Microplanning with communicative intentions: The SPUD system». Computational Intelligence]
[A. Koller and M. Stone (2007). «Sentence generation as planning». In Proceedings of ACL]

Prosodic assignment

Information structure:
- theme: the part of an utterance which is talked about (given)
- rheme: what is said about the theme (new)

This structure is linguistically realised in word order, syntax and intonation.

[S. Prevost (1996). «A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation». PhD thesis]

Statistical generation

Deep, logic-based approaches to generation can be «brittle»:
- They require fine-grained grammatical resources.
- They need to rank the large numbers of alternative utterances produced for a given semantic representation... and according to which quality measures?
- User adaptation is difficult.

Statistical generation

Statistical generation can help us produce more fluent, user-tailored utterances. Two strategies:
- Supervised learning: learning generation from annotated examples
- Reinforcement learning: learning via trial-and-error and feedback

There is also the possibility of jointly optimising DM and NLG.

[Verena Rieser and Oliver Lemon (2010). «Natural Language Generation as Planning under Uncertainty for Spoken Dialogue Systems». Empirical Methods in Natural Language Generation]
[O. Lemon (2011). «Learning what to say and how to say it: joint optimization of spoken dialogue management and Natural Language Generation». Computer Speech and Language]

Generation of referring expressions

Generating referring expressions (GRE) is an interesting subproblem of NLG. Objective: given a reference to an object/entity in the context, find the best referring expression for it.

Let's say we want to talk about a particular object. Should we call it «the object»? «The triangular object»? «The orange triangular object that is to the right of the pink pyramid and to the left of the white cylinder»?

Generation of referring expressions

GRE typically searches for the minimal distinguishing expression for the target. A distinguishing expression matches the target, but none of the distractors (the other salient objects in the context).

Dale and Reiter's incremental algorithm:
1. Order the properties P by preference.
2. Iterate through the ordered list of properties.
3. Add a property to the description being constructed if it rules out any remaining distractors.
4. Terminate when a distinguishing description has been constructed (or no more properties remain).

[Robert Dale and Ehud Reiter (1995). «Computational Interpretations of the Gricean Maxims in the Generation of Referring Expressions». Cognitive Science]
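The incremental algorithm is short enough to sketch directly. The scene below (objects with shape, colour and size) and the property names are illustrative assumptions, chosen to mirror the triangle example:

```python
# A minimal sketch of Dale & Reiter's incremental algorithm.
# The scene and the property inventory are toy assumptions.

def incremental_algorithm(target, distractors, preferred_properties):
    """Build a distinguishing description for `target`.

    target: dict of property -> value
    distractors: dict of object id -> property dict
    preferred_properties: list of properties ordered by preference
    """
    description = {}
    remaining = dict(distractors)
    for prop in preferred_properties:
        value = target[prop]
        # Distractors that this property value would rule out
        ruled_out = {oid for oid, obj in remaining.items()
                     if obj.get(prop) != value}
        if ruled_out:  # only add the property if it helps
            description[prop] = value
            for oid in ruled_out:
                del remaining[oid]
        if not remaining:
            break  # distinguishing description found
    return description, remaining

# Toy scene: objects 1-7, the target is object 4
objects = {
    1: {"shape": "circle",   "colour": "pink",   "size": "small"},
    2: {"shape": "square",   "colour": "white",  "size": "small"},
    3: {"shape": "circle",   "colour": "orange", "size": "large"},
    4: {"shape": "triangle", "colour": "orange", "size": "small"},
    5: {"shape": "triangle", "colour": "white",  "size": "small"},
    6: {"shape": "square",   "colour": "orange", "size": "large"},
    7: {"shape": "circle",   "colour": "white",  "size": "large"},
}
distractors = {oid: obj for oid, obj in objects.items() if oid != 4}
description, left = incremental_algorithm(objects[4], distractors,
                                          ["shape", "colour", "size"])
# description covers shape and colour only, i.e. «the orange triangular
# object»: size is never needed, since no distractors remain
```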

Incremental algorithm: example

Assume three properties, Shape, Colour and Size, with preference order Shape > Colour > Size. We want to talk about object 4 (among objects 1-7).

  Step 1: start with «The object»; remaining distractors {1,2,3,5,6,7}.
  Step 2: object 4 has Shape=triangular; adding this property removes distractors {1,2,3,6,7}, giving «The triangular object» with remaining distractor {5}.
  Step 3: object 4 has Colour=orange; adding this property removes distractor 5, giving «The orange triangular object» with no remaining distractors.
  We have found a distinguishing expression: «The orange triangular object».

Outline
- General architecture
- Natural language generation
- Speech synthesis
  - Text analysis
  - Waveform synthesis
- Summary

Speech synthesis

The last component of our architecture is the speech synthesiser (or «text-to-speech», TTS). The TTS module converts a concrete utterance into a speech waveform. This mapping is performed in two steps:
1. Conversion of the input utterance into a phonemic representation (text analysis)
2. Conversion of the phonemic representation into the waveform (waveform synthesis)

Example: the input utterance «Please take the box!» is first mapped by text analysis to the internal phonemic representation pliːz ˈteɪk ðə ˈbɑks, which waveform synthesis then converts into the final waveform.

Text analysis in TTS

How do we produce the phonemic representation?
1. Text normalisation (abbreviations, numbers, etc.)
2. Phonetic analysis, based on a pronunciation dictionary and a grapheme-to-phoneme (g2p) converter
3. Prosodic analysis, to determine e.g. prosodic phrases, pitch accents, and the overall tune

Prosodic analysis

Utterances can be structured into intonational phrases, which are correlated with, but not identical to, syntactic phrases. These phrases can be extracted based on features such as punctuation, the presence of function words, etc.

Words can be more or less prosodically prominent (e.g. emphatic accents, pitch accents, unaccented, reduced).

Finally, utterances are also characterised by their global tune (the rise and fall of F0 over time).
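As a small illustration of the first step, text normalisation rewrites abbreviations and digits into pronounceable words before phonetic analysis. The rules below are toy assumptions; real normalisers also handle dates, currencies, ordinals, phone numbers, and so on:

```python
import re

# Toy normalisation rules: a few abbreviations and the numbers 0-10.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

def spell_out_number(match):
    n = int(match.group())
    # Only 0-10 are handled in this sketch; larger numbers pass through.
    return NUMBER_WORDS[n] if 0 <= n <= 10 else match.group()

def normalise(text):
    """Expand abbreviations, then spell out digit sequences."""
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    return re.sub(r"\d+", spell_out_number, text)
```

For example, «Dr. Smith has 3 cats» normalises to «Doctor Smith has three cats».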

Phonemic representation

At the end of the text analysis (normalisation + phonemic and prosodic analysis), we end up with an internal phonemic representation of our utterance, containing the prosodic boundaries, the phonemes (in ARPA format), and values for the F0 contour.

Waveform synthesis

Once we have a phonemic representation, we need to convert it into a waveform. Two families of methods:
- Concatenative synthesis: glue together pre-recorded units of speech (taken from a speech corpus)
- Formant & articulatory synthesis: generate sounds using acoustic models of the vocal tract

Waveform synthesis

Concatenative synthesis:
- Pros: more natural-sounding & intelligible speech; easier modelling, limited signal processing
- Cons: requires a speech corpus; limited flexibility

Formant and articulatory synthesis:
- Pros: explicit model of speech production; many parameters can be tweaked
- Cons: robotic-sounding output; complex modelling and signal processing

Concatenative synthesis

In concatenative synthesis, we record and store various units of speech in a database. When synthesising a sound, we search for the appropriate segments in this database, and then «glue» them together to produce a fluent sound.

[Diagram: the target wintr=dei («winter day») is covered by units 1-7 selected from the database]

Concatenative synthesis

Concatenative methods differ by the kind of «units of speech» they use:
- Diphone synthesis: phone-like units going from the middle of one phone to the middle of the next one
- Unit selection: units of different sizes, which can be much larger than a diphone

Most commercial TTS systems deployed today are based on unit selection.

Diphone synthesis

[diagram borrowed from M. Schröder]

Diphone synthesis

For diphone synthesis, the acoustic database consists of recorded diphones, usually embedded in carrier phrases. The diphones must be carefully segmented, labelled, pitch-marked, etc.

After concatenation, the sound must be adjusted to meet the desired prosody. Such signal processing might distort the speech sound! There is also only a limited account of pronunciation variation (only the coarticulation due to the neighbouring phone).

Unit selection synthesis

In unit selection synthesis, the «units of speech» come from a segmented corpus of natural speech.

[diagram borrowed from M. Schröder]

Unit selection synthesis

How do we search for the best units matching our phonemic specifications? We search for units that match our requirements (F0, stress level, etc.) as closely as possible... and that concatenate smoothly with their neighbours.

Given a specification st, we search for the unit ut that minimises two costs:
- Target cost T(st, ut): how well the specification st matches the unit ut
- Join cost J(ut, ut+1): how well ut joins with its neighbour ut+1

Assume that we are given an internal phonemic representation S = {s1, s2, ..., sn}. We want to find the best sequence of speech units for S. In other words, we search for the unit sequence Û = {u1, u2, ..., un} such that:

  Û = argmin over U of [ Σ(t=1..n) T(st, ut) + Σ(t=1..n-1) J(ut, ut+1) ]

where T(st, ut) is the target cost between specification st and unit ut, and J(ut, ut+1) is the join cost between units ut and ut+1.

Unit selection can produce high-quality sounds (depending, of course, on the corpus size and quality). But it is rather inflexible: it is difficult to modulate the prosody of the speech sound. How can we e.g. change the sound's emotional content? One alternative is to annotate the speech corpus with fine-grained information and use it in the selection, but this requires a much larger corpus!
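The minimisation above can be solved with a Viterbi-style dynamic programming search over the candidate units for each specification. A minimal sketch, where the cost functions and candidate units are left abstract (real systems compute them from acoustic features):

```python
# A dynamic-programming sketch of unit selection: find the unit sequence
# minimising the summed target costs T(s_t, u_t) plus join costs
# J(u_t, u_{t+1}). The cost functions passed in are assumptions.

def unit_selection(specs, candidates, target_cost, join_cost):
    """specs: specifications s_1..s_n;
    candidates: candidates[t] is the list of units available for s_t."""
    n = len(specs)
    # best[t][u] = (cost of the cheapest sequence ending in u, backpointer)
    best = [{u: (target_cost(specs[0], u), None) for u in candidates[0]}]
    for t in range(1, n):
        layer = {}
        for u in candidates[t]:
            prev_cost, prev_unit = min(
                (best[t - 1][p][0] + join_cost(p, u), p)
                for p in candidates[t - 1])
            layer[u] = (target_cost(specs[t], u) + prev_cost, prev_unit)
        best.append(layer)
    # Backtrace from the cheapest final unit
    u = min(best[-1], key=lambda x: best[-1][x][0])
    sequence = [u]
    for t in range(n - 1, 0, -1):
        u = best[t][u][1]
        sequence.append(u)
    return list(reversed(sequence))
```

With toy costs that, say, prefer units whose name ends in "1" and make all joins free, `unit_selection` returns the per-position cheapest units; in practice the join cost is what forces the search to trade local fit against smooth concatenation.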

Unit selection synthesis Unit selection can produce high-quality sounds Depending on the corpus size and quality, of course But it s rather inflexible: difficult to modulate the prosody of the speech sound How can we e.g. change the sound s emotional content? Alternative: annotate the speech corpus with fine-grained informations, and use these in the selection But requires a much larger corpus! 39 Outline General architecture Natural language generation Speech synthesis Summary 40

Summary

We started by describing different methods for natural language generation (NLG):
- Shallow methods rely on canned utterances, possibly augmented with some slots to fill in.
- Deep NLG relies on grammatical resources and logical reasoning to plan & realise the utterance.
- Finally, statistical methods automatically learn the mapping between communicative goals and their corresponding utterances from data.

We also focused on the problem of generating referring expressions (GRE): given a reference to an object/entity, try to find the best linguistic expression for it. To achieve this, we need to find an expression which is both distinguishing (matches the target object, but no other object) and minimal.

Summary

We finally described the speech synthesis task:
- First step: convert the utterance into an internal phonemic representation, together with a prosodic structure.
- Second step: convert this representation into a waveform, via either concatenative synthesis (diphone, unit selection), which reuses pre-recorded units from an acoustic database, or formant & articulatory synthesis, which uses an explicit acoustic model of the vocal tract to generate the sound.

Incremental NLG + TTS?

There has been some recent work on incremental NLG and TTS. It allows the system to be much more reactive (to correct its own production, and to react to user feedback), and to change or rephrase the utterance «on the fly». Another advantage: the system can start playing the sound even before the full synthesis is complete.

[H. Buschmeier, T. Baumann et al. (2012). «Combining Incremental Language Generation and Incremental Speech Synthesis for Adaptive Information Presentation». In Proceedings of SIGDIAL]

Next Monday

For our last session, we'll describe how to evaluate spoken dialogue systems, and wrap up everything we have seen. If you have any questions or need help (with the 2nd assignment, or with the course in general), we can also talk about it!