Feature Sets for the Automatic Detection of Prosodic Prominence

Tim Mahrt, Jui-Ting Huang, Yoonsook Mo, Jennifer Cole, Mark Hasegawa-Johnson, and Margaret Fleck

This work presents a series of experiments that explore the utility of various acoustic features for classifying words as prosodically prominent or nonprominent. For this set of experiments, a 35,009-word subset of the Buckeye Speech Corpus was used [12], divided across fifty-four segments of the corpus. In a previous study, the words were transcribed for prosodic prominence by several teams, each composed of sixteen naive native speakers of English, using the Rapid Prosody Transcription method developed in our prior work [10]. In the present study, we mapped the quasi-continuous prosody labels from the transcribed portion of the corpus to a binary prominence label: a word was labeled prominent if at least one rater deemed it prominent, and nonprominent otherwise. 15,955 words were labeled prominent, yielding a majority-class baseline accuracy of 54.4%. 90% of the words were used to train the learning algorithms and the remaining 10% were used for testing.

Several acoustic correlates are associated with prominence, including F0, duration, and intensity [1, 2, 4, 17, 3, 6, 15, 16, 11, 7, 14, 18]. The relative contributions these cues make to automatic recognition and to recognition by human listeners are well discussed in the literature [5, 18, 9, 13].

In the first set of experiments, Support Vector Machines (SVMs) were used. SVMs were chosen because the task maps a vector input to a class-label output, and SVMs do well at such tasks. A set of 36 features was used, including both features known to be correlated with prominence and features not known to be correlated, such as the length of the pause after a word.
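The label mapping and baseline described above can be sketched in a few lines. The per-word vote counts below are hypothetical stand-ins for the Buckeye annotations, not data from the corpus:

```python
# Map per-word prominence vote counts to binary labels, as described above:
# a word is "prominent" if at least one rater marked it prominent.
# These vote counts are made-up stand-ins for the Buckeye annotations.
rater_votes = {"water": 5, "the": 0, "really": 9, "of": 0, "never": 2}

labels = {word: int(votes >= 1) for word, votes in rater_votes.items()}

# Majority-class baseline: always guess the more frequent label.
n_prominent = sum(labels.values())
n_total = len(labels)
baseline = max(n_prominent, n_total - n_prominent) / n_total
print(labels)    # {'water': 1, 'the': 0, 'really': 1, 'of': 0, 'never': 1}
print(baseline)  # 0.6 with these toy counts (54.4% on the paper's data)
```

With the real corpus counts (15,955 prominent of 35,009 words), the majority class is nonprominent, which gives the 54.4% baseline quoted above.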
The ten best-performing features were, in order: the minimum energy of the final vowel, normalized by phone; the ratio of the energy of the current word to that of the next word; the post-word pause duration; the word duration, normalized by phone; the maximum energy of the last vowel, normalized by phone class; the minimum F0 of the following word; the maximum energy of the stressed vowel, normalized by phone class; the stressed vowel duration, normalized by phone class; the minimum energy of the next word; and the maximum energy of the word. Classification accuracy was also tested with related features clustered into four groups: pause, duration, intensity, and pitch. The results are reported in Table 1.

For the second set of experiments, Hidden Markov Models (HMMs) with three hidden states were used. HMMs can take advantage of temporal information in the sequencing of units. Mel-frequency cepstral coefficients (MFCCs), generated using HTK, were used as the encoding of temporal features. These data were concatenated with per-word durational measures taken from the phoneme-occurrence timestamps in the Buckeye corpus. (The post-word pause duration is the time between the end of the last phoneme in the current word and the beginning of the first phoneme in the next word.) The results for these experiments are also summarized in Table 1.

SVM Features                           SVM acc.   HMM Features                      HMM acc.
pause feats.                           61.1       Post-Word Pause Duration (PWPD)   57.7
duration feats.                        69.0       Stressed Vowel Duration (SVD)     65.1
intensity feats.                       71.4       -                                 -
pitch feats.                           72.1       -                                 -
intensity + pitch                      75.1       MFCC                              68.7
intensity + pitch + duration           75.8       MFCC + SVD                        65.82
intensity + pitch + duration + pause   76.1       MFCC + SVD + PWPD                 56.2

Table 1: Classification accuracy (percent) using SVMs and HMMs

Although the feature sets used for the HMM and the SVM are not identical, they correspond to each other, and the classification accuracy of the SVM is always higher. This, together with the fact that many of the top-performing features in the SVM experiment were normalized features, suggests that temporal information is less useful than changes in the acoustic signal. These findings are consistent with evidence from human perception of prosody [8].

If temporal information is useful, then some temporal regions may be more useful than others. Since prominence in English is expressed primarily on the stressed syllable [7], it might be expected that extracting features from only the stressed syllable would yield better prominence classification, with the other regions of the word contributing noise. However, prominence also has residual effects on the rest of the word; for example, F0 can peak in the post-stress syllable [11, 7]. To test prominence detection based on the stressed-unstressed distinction within the word, the words in Buckeye were split into three regions: pre-stress, stressed syllable, and post-stress. MFCC vectors were extracted from each of these regions and tested independently of each other.

Context region            pre-stress   stressed syllable   post-stress
Classification accuracy   67.4         66.1                67.4

Table 2: Classification accuracy for context regions using HMMs
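The per-class HMM scoring used in these experiments can be illustrated with a minimal sketch: one three-state left-to-right HMM per class scores a word's MFCC frame sequence with the forward algorithm, and the class whose model assigns the higher log-likelihood wins. Everything below is a toy stand-in (fixed 1-D Gaussian emissions, hand-set parameters, no EM training), not the paper's HTK setup:

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian; stands in for the MFCC emission model."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def forward_loglik(obs, log_pi, log_A, mus, var=1.0):
    """Forward algorithm in log space: log P(obs | HMM)."""
    alpha = log_pi + log_gauss(obs[0], mus, var)            # (n_states,)
    for x in obs[1:]:
        trans = alpha[:, None] + log_A                      # (from, to)
        alpha = np.logaddexp.reduce(trans, axis=0) + log_gauss(x, mus, var)
    return np.logaddexp.reduce(alpha)

# Three-state left-to-right topology, shared by both class models.
log_pi = np.log(np.array([1.0, 1e-12, 1e-12]))
A = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
log_A = np.log(A + 1e-12)

# Hypothetical emission means: the "prominent" model expects a mid-word peak.
mus_prom = np.array([0.0, 3.0, 0.0])
mus_nonprom = np.array([0.0, 0.5, 0.0])

# A toy frame sequence with a mid-word energy peak, like a prominent word.
obs = np.array([0.1, 2.8, 3.1, 2.9, 0.2])

ll_prom = forward_loglik(obs, log_pi, log_A, mus_prom)
ll_non = forward_loglik(obs, log_pi, log_A, mus_nonprom)
label = "prominent" if ll_prom > ll_non else "nonprominent"
print(label)  # the peaked sequence scores higher under the prominent model
```

In the actual experiments the observations are multi-dimensional MFCC vectors and the model parameters are estimated from the training portion of the corpus; the classification rule, however, is the same likelihood comparison.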
The results for the three regions, reported in Table 2, are fairly similar to each other and to the results of the earlier trials using MFCCs extracted from the entire word. Thus, useful information does exist throughout prominent words. To see whether making this contextual information more explicit could improve accuracy, a new feature was created from the sum of the log-likelihoods of each frame being prominent, given the model trained on MFCCs in the previous experiment. A new HMM trained on these values reached an accuracy of 56.2%, which suggests that the explicit contextual feature is not useful.

In our final experiment, we modified the classification task so that words were considered prominent when two or more raters labeled them as prominent (rather than one or more). Words not labeled prominent by any rater were still considered nonprominent, but words labeled by only a single rater were discarded. Agreement between labelers provides greater confidence that a word is indeed prominent, whereas words with only a single prominence judgment are more likely to be mistakes. The accuracy for this zero vs. two-or-more classification task using only MFCCs is 71.5%, compared to 68.7% for the zero vs. one-or-more task, suggesting that words with only a single judgment of prominence are indeed less reliable.

In this study we explored different strategies for improving learning performance. We found that normalized features are often more informative than raw features. The contribution of temporal regions was examined, and no single region was found to be the most informative. Finally, by removing labels with low rater agreement, we were able to boost performance.

This study is supported by NSF IIS-0703624 to Cole and Hasegawa-Johnson. For their varied contributions, we would like to thank the members of the Illinois Prosody-ASR research group.
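The stricter relabeling used in the final experiment amounts to filtering out single-vote words before binarizing. A minimal sketch, again with hypothetical vote counts rather than the Buckeye annotations:

```python
# Relabeling for the final experiment: words with >= 2 prominent votes are
# positive, words with 0 votes are negative, and single-vote words are
# discarded. The vote counts are hypothetical stand-ins for the corpus data.
votes = {"water": 5, "the": 0, "really": 9, "of": 0, "never": 1}

strict_labels = {
    word: int(n >= 2)
    for word, n in votes.items()
    if n == 0 or n >= 2   # drop the ambiguous single-vote words
}
print(strict_labels)  # {'water': 1, 'the': 0, 'really': 1, 'of': 0}
```

Here "never", with a single vote, is excluded from both classes, mirroring the zero vs. two-or-more task described above.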

References

[1] M. Beckman. Stress and Non-stress Accent. Foris Publications, 1986.
[2] M. Beckman and J. Edwards. Articulatory evidence for differentiating stress categories. In Phonological Structure and Phonetic Form, page 7, 1994.
[3] T. Cambier-Langeveld and A. Turk. A cross-linguistic study of accentual lengthening: Dutch vs. English. Journal of Phonetics, 27(3):255–280, 1999.
[4] J. Cole, H. Kim, H. Choi, and M. Hasegawa-Johnson. Prosodic effects on acoustic cues to stop voicing and place of articulation: Evidence from Radio News speech. Journal of Phonetics, 35(2):180–209, 2007.
[5] A. Cutler, D. Dahan, and W. van Donselaar. Prosody in the comprehension of spoken language: A literature review. Language and Speech, 40(2):141, 1997.
[6] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner. Loudness predicts prominence: Fundamental frequency lends little. The Journal of the Acoustical Society of America, 118:1038, 2005.
[7] D. R. Ladd. Intonational Phonology. Cambridge University Press, 2008.
[8] Y. Mo. Prosody Production and Perception with Conversational Speech. PhD thesis, University of Illinois at Urbana-Champaign, 2010.
[9] Y. Mo, J. Cole, and M. Hasegawa-Johnson. How do ordinary listeners perceive prosodic prominence? Syntagmatic vs. paradigmatic comparison. Poster presented at the 157th Meeting of the Acoustical Society of America, Portland, Oregon, 2009.
[10] Y. Mo, J. Cole, and E. Lee. Naive listeners' prominence and boundary perception. In Proc. Speech Prosody, Campinas, Brazil, pages 735–738, 2008.
[11] J. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, MIT, Cambridge, MA, 1980.
[12] M. A. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, et al. Buckeye Corpus of Conversational Speech (2nd release). Columbus, OH: Department of Psychology, Ohio State University, 2007. Retrieved March 15, 2006, from www.buckeyecorpus.osu.edu.
[13] A. Rosenberg. Automatic Detection and Classification of Prosodic Events. PhD thesis, Columbia University, 2009.
[14] S. Calhoun. Information Structure and the Prosodic Structure of English. PhD thesis, University of Edinburgh, 2006.
[15] A. Sluijter and V. van Heuven. Spectral balance as an acoustic correlate of linguistic stress. Journal of the Acoustical Society of America, 100(4):2471–2485, 1996.
[16] F. Tamburini and C. Caini. An automatic system for detecting prosodic prominence in American English continuous speech. International Journal of Speech Technology, 8(1):33–44, 2005.
[17] A. Turk and L. White. Structural influences on accentual lengthening in English. Journal of Phonetics, 27(2):171–206, 1999.
[18] K. Yoon. Imposing native speakers' prosody on non-native speakers' utterances: The technique of cloning prosody. Journal of the Modern British & American Language & Literature, 25(4):197–215, 2007.