Foot Structure and Pitch Contour Paper Review. Arthur R. Toth Language Technologies Institute Carnegie Mellon University 7/22/2004

Papers
- Esther Klabbers, Jan van Santen, and Johan Wouters, "Prosodic Factors for Predicting Local Pitch Shape," IEEE 2002 Workshop on Speech Synthesis.
- Esther Klabbers and Jan P. H. van Santen, "Control and prediction of the impact of pitch modification on synthetic speech quality," Eurospeech 2003.
- Esther Klabbers and Jan P. H. van Santen, "Clustering of foot-based pitch contours in expressive speech," SSW5, 2004.

1st Paper: IEEE 2002 Workshop
- Investigate the predictive power of different prosodic factorization schemes.
- Extend a diphone voice by making additional recordings under different prosodic contexts.
- Use foot structure to guide the choice of prosodic contexts.

Introduction
- Problem: corpora typically have one example per diphone, taken from a stressed context.
  - These examples are sometimes bad matches for the target prosodic context, so substantial signal modification (with potential quality degradation) can be necessary.
- Adding many examples to cover more possibilities could lead to a large database.
  - Difficult to use in embedded devices
  - Difficult to keep the speaker consistent across more examples
- Need to find good selection criteria.

Feet and Pitch
- Left-headed foot
  - Sequence of one or more syllables, the first of which is accented
  - Followed by an accented syllable or a phrase boundary
- Typical accent: up-down pitch movement
  - Monosyllabic: rise-fall on the single syllable
  - Polysyllabic: rise on the first syllable, fall on the rest
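
The foot definition above can be sketched in code. This is a minimal illustration, not the papers' implementation: the function name `split_into_feet` and the handling of unaccented syllables before the first accent (returned separately, since they belong to no left-headed foot) are assumptions.

```python
def split_into_feet(accents):
    """Group syllable indices into left-headed feet.

    `accents` holds one 0/1 flag per syllable of a phrase (1 = accented).
    A foot starts at an accented syllable and runs up to the next
    accented syllable or the phrase boundary.
    Assumption: unaccented syllables before the first accent are
    returned separately in `leading`.
    """
    leading = []
    feet = []
    current = None
    for idx, accented in enumerate(accents):
        if accented:
            # An accented syllable closes the previous foot and heads a new one.
            if current is not None:
                feet.append(current)
            current = [idx]
        elif current is None:
            leading.append(idx)
        else:
            current.append(idx)
    if current is not None:
        feet.append(current)  # the phrase boundary closes the last foot
    return leading, feet


leading, feet = split_into_feet([0, 1, 0, 0, 1, 1, 0])
print(leading, feet)
```

Here the syllable at index 4 forms a monosyllabic foot (rise-fall on one syllable), while the feet headed at indices 1 and 5 are polysyllabic.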

Factorization Schemes

Simple                     Foot                       Complex1               Complex2
Stress {0,1}               Last accent {0,1,2}        Accent {0,1}           Accent {0,1}
Accent {0,1}               Next accent {0,1,(2)}      Last accent {0,1,2}    Last accent {0,1,2,3}
Phrase-fin. Syll. {0,1,2}  Phrase-fin. Foot {0,1,2}   Next accent {0,1,2}    Next accent {0,1,2,3}
Levels: 12                 Levels: 19                 Levels: 54             Levels: 96

Experiments
- Corpus
  - 472 sentences spoken by a female speaker
  - Segmented and annotated by hand
  - 1493 of 8860 syllables were used (only those starting with a sonorant)
- Measures
  - RMSE between one contour and another contour estimated from the second
  - Delta distance

Results

Mean            Simple  Foot  Complex1  Complex2
Levels          12      19    30        48
RMSE            13.1    12.8  12.7      11.9
Delta Distance  11.9    10.9  11.3      10.4

Discussion
- The Foot scheme performs better than Simple and similar to or better than Complex1.
- Complex2 performs best but has too many factors.
- Hypothesis 1: The distinction between medial, phrase-final, and utterance-final feet is important for predicting pitch contour shapes.
- Hypothesis 2: The position of the previous accented syllable is irrelevant if the current syllable is the head of the foot.

Text Corpus Analysis
- Analyzed a large text corpus: 359,276 sentences from newspapers, novels, and the Bible.
- Used Festival to compute foot factor levels for each diphone: 16,926,727 tokens of 22,865 types.
- Simplified by disregarding consonant position and keeping only a single version of each consonant-consonant diphone: 9,367,407 tokens of 21,458 types.
- Starting from a standard database of 3353 diphones, only 6020 had to be added to cover 95% of diphone-foot tags.
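
A coverage figure like the 95% above can be computed from tag frequencies. The sketch below assumes a greedy, most-frequent-first selection; the slides do not state how the additional diphones were chosen, and the tag strings and counts are made-up toy values.

```python
from collections import Counter


def types_needed_for_coverage(tag_counts, already_covered, target=0.95):
    """Greedily add the most frequent uncovered tag types until the
    covered share of tokens reaches `target`.

    Assumption: greedy selection by frequency; the papers' actual
    selection procedure is not given in the slides.
    """
    total = sum(tag_counts.values())
    covered = sum(tag_counts[t] for t in already_covered if t in tag_counts)
    added = []
    # Sort by descending count, then by tag name for a deterministic order.
    for tag, count in sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target:
            break
        if tag in already_covered:
            continue
        added.append(tag)
        covered += count
    return added, covered / total


# Hypothetical diphone-foot tags with toy token counts.
counts = Counter({
    "a-b/head": 50, "a-b/tail": 30, "b-a/head": 15,
    "b-a/tail": 4, "a-a/head": 1,
})
added, coverage = types_needed_for_coverage(counts, already_covered={"a-b/head"})
print(added, coverage)
```

Because token counts are typically heavy-tailed, a small number of added types can cover most tokens, which is the point the slide makes.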

2nd Paper: Eurospeech 2003
- Continues in the vein of trying to reduce the amount of signal modification necessary by using foot structure to improve selection.
- Perceptual experiment to investigate the degradation caused by pitch modification
- Correlation of weighted perceptual score with different pitch and delta pitch distances

Speech Corpus Analysis
- Same prosodic factorization as the 1st paper
- Corpora
  - Duration corpus: the corpus from the 1st paper
  - Foot Corpus I
    - Recorded to test the effect of position on pitch contour
    - 285 sentences, spoken by a highly expressive female speaker
    - Each sentence target is an all-sonorant CVC syllable
  - Foot Corpus II: the speaker was instructed to be less expressive and was uncomfortable doing so

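Distance Measures
- Tried various distance measures:
  $D_p = \sum (\log_{10} F0_i - \log_{10} F0_j)^2$
  $D_{wp} = \frac{\sum E \, (\log_{10} F0_i - \log_{10} F0_j)^2}{\sum E}$
  $D_{\Delta p} = \sum (\Delta\log_{10} F0_i - \Delta\log_{10} F0_j)^2$
  $D_{w\Delta p} = \frac{\sum E \, (\Delta\log_{10} F0_i - \Delta\log_{10} F0_j)^2}{\sum E}$
  where $E = E_i E_j$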
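
A minimal sketch of the four pitch distances, under stated assumptions: each distance is a sum of squared differences of log10 F0 values, the weighted versions are normalized by the summed weights, the delta is a first difference between neighboring samples, and the per-sample weight is the product of the two contours' weights. The function names and the choice to weight each delta by the later sample of its pair are illustrative, not taken from the paper.

```python
import math


def log_f0(contour):
    return [math.log10(v) for v in contour]


def deltas(values):
    # Assumption: delta = first difference between neighboring samples.
    return [b - a for a, b in zip(values, values[1:])]


def sq_distance(x, y, weights=None):
    """Sum of squared differences; weight-normalized when weights are given."""
    if weights is None:
        return sum((a - b) ** 2 for a, b in zip(x, y))
    num = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return num / sum(weights)


def pitch_distances(f0_i, f0_j, e_i, e_j):
    li, lj = log_f0(f0_i), log_f0(f0_j)
    w = [a * b for a, b in zip(e_i, e_j)]  # assumed weight: E = E_i * E_j
    return {
        "D_p": sq_distance(li, lj),
        "D_wp": sq_distance(li, lj, w),
        "D_dp": sq_distance(deltas(li), deltas(lj)),
        # Assumption: each delta takes the weight of the later sample.
        "D_wdp": sq_distance(deltas(li), deltas(lj), w[1:]),
    }


# Toy contours: the second is the first shifted up by one octave.
f0_a = [100.0, 120.0, 150.0, 130.0]
f0_b = [2 * v for v in f0_a]
weights = [1.0, 1.0, 1.0, 1.0]
result = pitch_distances(f0_a, f0_b, weights, weights)
print(result)
```

The toy case shows why both kinds of measure are useful: a constant shift in log pitch gives a nonzero pitch distance but a delta distance of essentially zero, because the contour shape is unchanged.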

Results
- The Foot annotation scheme performed better than Simple for all 3 corpora and was generally better than the Complex schemes.
- It appeared that some levels in the Foot scheme could be collapsed further:
  - For the head, it doesn't matter whether unstressed syllables follow.
  - For unstressed syllables, it only matters whether they are immediately preceded by the head.
  - For all syllables, it is important whether the foot is phrase-medial, phrase-final with continuation rise, or utterance-final.
- New 12-level factorization scheme that is still better than Simple

Perceptual Experiment
- Used the OGIresLPC algorithm
- Used data from Foot Corpus I
- Sentences had a carrier phrase and a target word
  - Target word was a sonorant CVC syllable from the corpus
  - Two versions: one where the syllable is in the same prosodic context, another where it is in a different context
- Sentence parts were concatenated with Snack
  - 20 ms pause inserted between carrier phrase and target word
- Participants compared pairs on a 7-point scale

Results
- Computed a weighted score for each sentence, based on z-score normalization.
- Used linear regression with different distances to try to predict the weighted scores.
- At first, it appeared that pitch distance and delta distance accounted for the most variance.
- With more varied material, weighted distances might give better correlations.
- Direction of pitch change is important.
  - 2 new distance measures were created.
  - Decreasing pitch was worse than increasing pitch.
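
The two analysis steps named above, z-score normalization and linear regression, can be sketched as follows. This is an illustration only: the slides do not say how the weighted score was built from the z-scores, so the sketch stops at plain z-scores, and the function names and toy numbers are assumptions.

```python
from statistics import mean, pstdev


def z_scores(ratings):
    """Normalize one set of ratings to zero mean and unit variance.
    Assumption: normalization is done per participant."""
    mu, sigma = mean(ratings), pstdev(ratings)
    return [(r - mu) / sigma for r in ratings]


def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.
    Returns slope, intercept, and R^2 (share of variance explained)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Toy data: ratings on a 7-point scale and made-up pitch distances.
normalized = z_scores([1, 2, 3, 4, 5])
slope, intercept, r_squared = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(normalized, slope, intercept, r_squared)
```

Comparing R^2 across distance measures is one way to see which distance predicts the perceptual scores best.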

3rd Paper: 5th ISCA SSW
- Concerned with categorizing foot-based pitch contours in expressive speech
- Clustering instead of prediction
- Classifying emotions in speech is problematic, so the focus is on which pitch contours actually occur

Models
- The TTS system used the Generalized Linear Alignment Model.
  - Pitch contour consists of phrase curves, accent curves, and segmental perturbation curves.
  - Phrase curve has two linear components:
    - From the phrase start to the syllable with the nuclear pitch accent
    - From there to the end, with a steeper decline
- This paper uses the Simplified Linear Alignment Model.
  - Assumes the accent is realized by an up-down movement, whose location depends on the number of syllables in the foot.

Corpus
- 2 children's stories by Beatrix Potter
- Read by a semi-professional female speaker
- 10 minutes of speech, not counting pauses
- 2929 syllables
- 128 sentences

Annotation
- Automatic phoneme segmentation by CSLU's phonetic alignment system
- Phonetic transcription from Festival
- Phonemes checked and alignments hand-corrected with Wavesurfer
- Syllable transcription created by hand and aligned with phoneme labels
- ESPS get_f0 used to extract pitch every 5 ms
- Wrote a Wavesurfer plug-in to interpolate with straight lines

Pitch Normalization
- Pitch contours have different lengths and need to be normalized for comparison.
- Simple interpolation doesn't work because the peak location tends to differ between monosyllabic and polysyllabic feet.
- Predicted peak locations were used to split intervals, and 50 points were sampled on each side.
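
A minimal sketch of this peak-anchored normalization, assuming linear interpolation and assuming the peak sample is included in both halves. The slides do not specify either detail, and the function names are hypothetical.

```python
def resample(values, n):
    """Linearly interpolate `values` at n evenly spaced positions (n >= 2)."""
    if len(values) == 1:
        return [float(values[0])] * n
    out = []
    for k in range(n):
        pos = k * (len(values) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def normalize_contour(f0, peak_index, points_per_side=50):
    """Split a contour at its (predicted) peak and resample each side
    to a fixed number of points, so all contours share one length and
    their peaks line up.
    Assumption: the peak sample belongs to both the rising and falling part.
    """
    left = f0[: peak_index + 1]
    right = f0[peak_index:]
    return resample(left, points_per_side) + resample(right, points_per_side)


# Toy contour in Hz with the peak at index 2.
normalized = normalize_contour([100, 150, 200, 150, 100, 90], peak_index=2)
print(len(normalized), normalized[49], normalized[50])
```

Splitting at the peak means a monosyllabic foot with an early peak and a polysyllabic foot with a late peak still align point for point after normalization.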

Analysis
- Distances between pitch contours were calculated as: 1 - cor(F0_i, F0_j)
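
The correlation-based distance can be sketched with a plain Pearson correlation over two equal-length, normalized contours. The function names are illustrative and the contours are toy values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def contour_distance(f0_i, f0_j):
    """1 - cor(F0_i, F0_j): 0 for identical shapes, 2 for mirrored shapes."""
    return 1 - pearson(f0_i, f0_j)


rise_fall = [1, 2, 3, 2, 1]
scaled = [10, 20, 30, 20, 10]
fall_rise = [3, 2, 1, 2, 3]
print(contour_distance(rise_fall, scaled), contour_distance(rise_fall, fall_rise))
```

Because correlation ignores offset and scale, this distance compares contour shape only, which suits clustering contours from feet with different pitch ranges.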

Clustering
- Used the S-PLUS hclust method for clustering (non-metric hierarchical)
  - Each object starts in its own cluster, then clusters are joined until only 1 remains.
- Used the Ward method: a minimum-variance method that finds compact, spherical clusters.
- The final number of clusters was determined empirically by looking and listening.
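
The agglomerative procedure can be sketched in plain Python. This is not S-PLUS `hclust` itself: it is a small illustration that applies the Ward (Lance-Williams) update directly to a precomputed dissimilarity matrix and stops at a requested number of clusters, which corresponds to cutting the full tree. The function name and the toy matrix are assumptions.

```python
def ward_cluster(dist, n_clusters):
    """Agglomerative clustering on a symmetric dissimilarity matrix.

    Every object starts in its own cluster; the closest pair is merged
    repeatedly, and dissimilarities to the merged cluster follow the
    Ward Lance-Williams update.
    Assumption: the update is applied to the dissimilarities as given.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    # Pairwise dissimilarities keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > n_clusters:
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        for k in clusters:
            nk = len(clusters[k])
            d_ak = d.pop((min(a, k), max(a, k)))
            d_bk = d.pop((min(b, k), max(b, k)))
            # Ward update for the dissimilarity between k and (a merged with b).
            d[(k, next_id)] = (
                (na + nk) * d_ak + (nb + nk) * d_bk - nk * d_ab
            ) / (na + nb + nk)
        del d[(a, b)]
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Toy dissimilarities: objects 0 and 1 are close, as are 2 and 3.
toy = [
    [0.0, 0.1, 1.5, 1.6],
    [0.1, 0.0, 1.4, 1.5],
    [1.5, 1.4, 0.0, 0.2],
    [1.6, 1.5, 0.2, 0.0],
]
groups = ward_cluster(toy, n_clusters=2)
print(groups)
```

Running the merge down to a single cluster and then inspecting different cut levels mirrors the empirical choice of cluster count described on the slide.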

Results
- 6 clusters were selected.
- The paper has figures of the medians of z-normalized pitch contours for each cluster.
- There is also a table showing bigram relative frequencies, with some discussion.

Conclusion
- The authors feel this paper has shown that the assumptions made in the Generalized Linear Alignment Model are correct.
- Discoveries
  - Two feet (most frequently occurring at the end of a minor or major phrase) can be connected by what seems to be a different type of phrase curve, consisting of an increasing movement on the first foot and a decreasing movement on the last foot.
  - The continuation rise, which was always assumed to be present at minor phrase boundaries, was observed in fewer than 10% of feet occurring at a minor phrase boundary in this corpus.
- Need to confirm these discoveries for other speakers.