High-Level and Low-Level Synthesis


1.1 Differentiating Between Low-Level and High-Level Synthesis

We need to differentiate between low- and high-level synthesis. Linguistics makes a broadly equivalent distinction in terms of human speech production: low-level synthesis corresponds roughly to phonetics and high-level synthesis corresponds roughly to phonology. There is some blurring between these two components, and we shall discuss the importance of this in due course (see Chapter 11 and Chapter 25).

In linguistics, phonology is essentially about planning. It is here that the plan is put together for speaking utterances formulated by other parts of linguistics, like semantics and syntax. These other areas are not concerned with speaking; their domain is how to arrive at phrases and sentences which appropriately reflect the meaning of what a person has to say. It is felt that it is only when these pre-speech phrases and sentences are arrived at that the business of planning how to speak them begins.

One reason why we feel that there is something of a break between sentences and planning how to speak them is that those same sentences might be channelled into a part of linguistics which would plan how to write them. Diagrammed, this looks like:

                                        +--> graphology --> writing plan
semantics/syntax --> phrase/sentence ---+
                                        +--> phonology  --> speaking plan

The task of phonology is to formulate a speaking plan, whereas that of graphology is to formulate a writing plan. Notice that in either case we end up simply with a plan, not with writing or speech: that comes later.

1.2 Two Types of Text

It is important to note the parallel between graphology and phonology that we can see in the above diagram.
The apparent equality between these two components conceals something very important, which is that the graphology pathway, leading eventually to a rendering of the sentence as writing, does not encode as much of the information available at the sentence level as the phonology does. Phonology, for example, encodes prosodic information, and it is this information in particular which graphology barely touches.

Developments in Speech Synthesis. Mark Tatham and Katherine Morton. 2005 John Wiley & Sons, Ltd. ISBN: 0-470-85538-X

Let us put this another way. In the human system graphology and phonology assume that the recipient of their processes will be a human being; it has been this way for as long as human beings have written and spoken. Speech and writing are the result of systematic renderings of sentences, and they are intended to be decoded by human beings. As such the processes of graphology and phonology (and their subsequent low-level rendering stages: graphetics and phonetics) make assumptions about the device (the human being) which is to input them for decoding.

With speech synthesis (designed to simulate phonology and phonetic rendering), text-to-speech synthesis (designed to simulate human beings reading aloud text produced by graphology/graphetics) and automatic speech recognition (designed to simulate human perception of speech) such assumptions cannot be made. There is a simple reason for this: our models of the human processes involved are not yet adequate to produce anything other than imperfect simulations. Text in particular falls short in what it encodes, and as we have said, the shortcomings lie in that part of a sentence which would be encoded by prosodic processing were the sentence to be spoken by a human being.

Text makes one of two assumptions:

(a) the text is not intended to be spoken, in which case any expressive content has to be text-based; that is, it must be expressed using the available words and their syntactic arrangement;
(b) the text is to be read out aloud, in which case it is assumed that the reader is able to supply an appropriate expression and prosody and bring this to the speech rendering process.
By and large, practised human readers are quite good at actively adding expressive content and general prosody to a text while they are actually reading it aloud. Occasionally mistakes are made, but these are surprisingly rare, given the look-ahead strategy that readers deploy. Because the process is an active one and depends on the speaker and the immediate environment, it is not surprising that different renderings arise on different occasions, even when the same speaker is involved.

1.3 The Context of High-Level Synthesis

Rendering processes within the overall text-to-speech system are carried out within a particular context: the prosodic context of the utterance. Whether a text-to-speech system is trying to read out text which was never intended for the purpose, or whether the text has been written with human speech rendering in mind, the task for a text-to-speech system is daunting. This is largely because we really do not have an adequate model of what it is that human readers bring to the task or how they do it.

There is an important point that we are developing throughout this book: it is not a question of adding prosody or expression, but a question of rendering a spoken version of the text within a prosodic or expressive framework. Let us term these the additive model and the wrapper model. We suggest that high-level synthesis, the development of an utterance plan, is conducted within the wrapper context.
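The contrast between the two models can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than part of any system described in this book: the `Segment` record, the `plan_segment` helper and the single string used to stand for a prosodic context are all hypothetical. The sketch shows only the structural point, namely that in the additive model segments are planned with no prosodic context and prosody is attached afterwards, whereas in the wrapper model the prosodic context exists first and is visible while each segment is planned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    """Hypothetical unit of an utterance plan."""
    word: str
    context_at_planning: Optional[str]  # prosodic context visible during planning
    prosody: Optional[str]              # prosodic label the segment ends up carrying


def plan_segment(word, context=None):
    """Hypothetical planner: records which context, if any, it could see."""
    return Segment(word=word, context_at_planning=context, prosody=context)


def additive_model(words, prosody):
    # Step 1: plan every segment with no prosodic context at all.
    neutral = [plan_segment(w) for w in words]
    # Step 2: attach prosody afterwards, as a separate layer.
    return [Segment(s.word, s.context_at_planning, prosody) for s in neutral]


def wrapper_model(words, prosody):
    # The prosodic context comes first; every segment is planned inside it.
    return [plan_segment(w, context=prosody) for w in words]


words = ["It", "was", "John"]
added = additive_model(words, "contrastive")
wrapped = wrapper_model(words, "contrastive")
```

Both lists end up carrying the same prosodic label, but only the segments in `wrapped` had that context available at planning time, which is the distinction the wrapper model is intended to capture.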

Conceptually these are very different approaches, and we feel that one of the problems encountered so far is that attempts to add prosody have failed because the model is too simplistic and insufficiently reflective of what the human strategy is. The human strategy seems to focus on rendering speech within an appropriate prosodic wrapper, and our proposals for modelling the situation assume a hierarchical structure which is dominated by this wrapper (see Chapter 34).

Prosody is a general term, and can be viewed as extending to an abstract characterisation of a vehicle for features such as expressive or, more extremely, emotive content. An abstract characterisation of this kind would enumerate all the possibilities for prosody, and part of the rendering task would be to highlight appropriate possibilities for particular aspects of expression. Notice that it would be absurd to attempt to render prosody in this model (since it is simultaneously everything of a prosodic nature), just as it is absurd to try to render syntax in linguistic theory (since it simultaneously characterises all possible sentences in a language). Unfortunately some text-to-speech systems have taken this abstract characterisation of prosody and, calling it neutral prosody, have attempted to add it to the segmental characterisation of particular utterances. The results are not satisfactory because human listeners do not know what to make of such a waveform: they know it cannot occur.

Let us summarise what is in effect a matter of principle in the development of a model within linguistic theory:

Linguistic theory is about the knowledge of language, and of a particular language, which is shared between speakers and listeners of that language. The model is static and, in its strictest form, does not include processes which draw on this knowledge to characterise particular sentences.
As we move through the grammar toward speech we find the linguistic component referred to as phonology: a characterisation, for a particular language, of everything necessary to build utterance plans to correspond with the sentences the grammar enumerates (all of them). Again there is no formal means within this type of theory for drawing on that knowledge for planning specific utterances.

Within the phonology there is a characterisation of prosody, a recognisable sub-component covering the intonational, rhythmic and prominence features of utterance planning. Again prosody enumerates all possibilities and, we say again, with no recipe for drawing on this knowledge. Although prosody is normally referred to as a sub-component of phonology, we prefer to regard phonological processes as taking place within a prosodic context: that is, prosodic processes are logically prior to phonological processes. Hence the wrapper model referred to above.

Pragmatics is a component of the theory which characterises expressive content along the same theoretical lines as the other components. It is the interaction between pragmatics and prosody which highlights those elements of prosody which associate with particular pragmatic concepts. So, for example, the pragmatic concept of anger (deriving from the bio-psychological concept of anger) is associated with features of prosody which, when combined, uniquely characterise expressive anger. Prosody is not, therefore, neutral expression; it is the characterisation of all possible prosodic features in the language.
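The relationship between prosody as a static enumeration and pragmatics as the component which highlights part of it can be illustrated with a small sketch. The feature names, their values and the association for anger are placeholders invented purely for illustration; they are not measurements and not claims about how anger is actually expressed. The names `PROSODY`, `PRAGMATICS`, `highlight` and `render` are likewise hypothetical.

```python
# Hypothetical inventory: prosody enumerates every available value of every feature.
PROSODY = {
    "pitch_range": {"narrow", "mid", "wide"},
    "tempo": {"slow", "mid", "fast"},
    "intensity": {"low", "mid", "high"},
}

# Placeholder associations between pragmatic concepts and prosodic features.
# These values are invented for illustration only.
PRAGMATICS = {
    "anger": {"pitch_range": "wide", "tempo": "fast", "intensity": "high"},
    "calm": {"pitch_range": "narrow", "tempo": "slow", "intensity": "low"},
}


def highlight(concept):
    """Pick out, from the full inventory, the features tied to one concept."""
    selection = PRAGMATICS[concept]
    for feature, value in selection.items():
        if value not in PROSODY[feature]:
            raise ValueError(f"{value!r} is not a possible value of {feature!r}")
    return selection


def render(selection):
    """Render only a highlighted selection: exactly one value per feature."""
    chosen = {}
    for feature in PROSODY:
        value = selection.get(feature)
        if not isinstance(value, str):
            raise ValueError("highlight one value per feature before rendering")
        chosen[feature] = value
    return ", ".join(f"{name}={chosen[name]}" for name in sorted(chosen))


angry = render(highlight("anger"))
```

Calling `render(PROSODY)` raises an error, mirroring the point that the enumeration of all possibilities is not itself something that can be rendered; only a selection highlighted by a pragmatic concept can.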

1.4 Textual Rendering

Text-to-speech systems, by definition, take written text which would have been derived from a writing plan devised by a graphology, and use it to generate a speaking plan which is then spoken. Notice from the diagram above, though, that human beings obviously do not do this: writing is an alternative to speaking and does not precede it. The exception, of course, is when human beings themselves take text and read it out loud. And it is this human behaviour which text-to-speech systems are simulating.

We shall see that an interesting problem arises here. The operation of graphology, the production of a plan for writing out phrases or sentences, constitutes a lossy encoding process: information is lost during the process. What eventually appears on paper does not encode all of a speaker's intentions. For example, the mood of the writer does not come across, except perhaps in the choice of particular words. The mood of a speaker, however, invariably does come across. Immediately, though, we can observe that mood (along with emotion or intention) could not have been encoded in the phrase or sentence except via actual words; much of what a third party detects of a person's mood is conveyed by tone-of-voice. It is not actually expressed in the sentence to begin with.

The human system gets away with the lossy encoding that we come across in written text because human readers in general find no difficulty in restoring what has been removed, or at least some acceptable substitute. For example, compare the following two written sentences:

    It was John.
    It wasn't Mary, it was John.

The way in which the words "It was John" are spoken differs, although the text remains the same. No native speaker of English who can also read fails to make this difference. But a text-to-speech system struggles to add the contrastive emphasis which a listener would expect.
This is an easy example; it is not hard to imagine variations in rendering an utterance which are much more subtle than this. Some researchers have tried to estimate what is needed to perform this task of restoring semantic or pragmatic information at the reading aloud stage, and to a certain extent restoration is possible. But most agree that there are subtleties which currently defeat the most sophisticated algorithms because they rest on unknown factors such as world knowledge: what a speaker knows about the world, way beyond the current linguistic context.

                                        +--> graphology --> writing plan
semantics/syntax --> phrase/sentence ---+
                                        +--> phonology  --> speaking plan

The above diagram is therefore too simple. There is a component missing: something which accounts for what a speaker, in planning and rendering an utterance, brings to the process which would not normally be encoded. A more appropriate diagram would look like this:

                                        +--> graphology --> writing plan
semantics/syntax --> phrase/sentence ---+
                                        +--> phonology  --> speaking plan
                                                 ^
                                                 |
                                             pragmatics
                                  [characterisation of expression]

In linguistics, much of a person's expression (mood, emotion, attitude, intention) is characterised by a component called pragmatics, and it is here that the part of language which has little to do with choice of actual words is formulated. Pragmatics has a direct input into phonology (and phonetics) and influences the way in which an utterance is actually spoken.

Pragmatics (Verschueren 2003) is maturing late. The semantic and syntactic areas matured earlier, as did phonology (without reference to expression: the phonology of what has been called the neutral or expressionless utterance plan). It was therefore not surprising that the earlier speech technology models, adopted in speech synthesis and automatic speech recognition, were not sensitive to expression; they omitted reference to pragmatics or its output. We reiterate many times in this book that a major reason for the persistent lack of convincing naturalness in speech synthesis is that systems are based on a pragmatics-free model of linguistics.

A pragmatics-free model of linguistics fails to accommodate the variability associated with what we might call, in very general terms, expression or tone-of-voice. The kinds of things reflected here are a speaker's emotional state, their feelings toward the person they are speaking to, their general attitude, the environmental constraints which contribute to the choice of style for the speech, etc. There are many facets to this particular problem which currently preoccupy many researchers (Tatham and Morton 2004). One of the tasks of this book will be to introduce the theoretical concepts necessary to enable an expression information channel to link up with speech planning, and to show how this translates into a better plan for rendering into an actual speech soundwave.
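The revised architecture, with graphology as a lossy pathway and pragmatics feeding phonology, can be sketched as follows. This is a minimal illustration under stated assumptions: the `Sentence` record, its `expression` field and the dictionary returned as a speaking plan are hypothetical simplifications, and the label used for contrastive emphasis is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sentence:
    """Hypothetical output of semantics/syntax plus an expression channel."""
    words: Tuple[str, ...]
    expression: Optional[str] = None  # supplied by pragmatics, not by the words


def graphology(sentence):
    """Writing plan: keeps the words and drops the expression (the lossy step)."""
    return " ".join(sentence.words) + "."


def phonology(sentence):
    """Speaking plan: keeps the words and the expression supplied by pragmatics."""
    return {"words": sentence.words, "expression": sentence.expression}


plain = Sentence(("It", "was", "John"))
contrastive = Sentence(("It", "was", "John"), expression="contrastive emphasis on John")

written_plain = graphology(plain)
written_contrastive = graphology(contrastive)
spoken_plain = phonology(plain)
spoken_contrastive = phonology(contrastive)
```

The two writing plans are identical while the two speaking plans differ, which is the situation a text-to-speech system faces: it starts from the written form and has to restore an expression channel that the text never carried.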
Although this book is not about automatic speech recognition, we suggest that consideration of expression, properly modelled in phonology and phonetics, points to a considerable improvement in the performance of automatic speech recognition systems.