Prosodic Event Recognition using Convolutional Neural Networks with Context Information

Sabrina Stehwien, Ngoc Thang Vu
University of Stuttgart, Institute for Natural Language Processing (IMS)
August 23, 2017

Prosodic Event Recognition (PER)
- labelling of segments: syllables or words
- e.g. pitch accents and phrase boundaries
- statistical learning task
  - frame-based or aggregated features
  - acoustic (speech signal) and lexico-syntactic (text) information
- useful for automatic language understanding
  - connection between prosody and phrasing, semantics, information structure, etc.

Example
[Figure: example slide; image not preserved]

Related Work
- comparability of methods is difficult
- most comparable work on pitch accent recognition:
  - 87% on speaker-dependent detection [Wang et al. 2015]
  - 83% for speaker-independent detection [Ren et al. 2004]
  - 64% for classification of ToBI types [Rosenberg et al. 2010]

CNN-based Prosodic Event Recognition
- a convolutional neural network (CNN) learns high-level feature representations from low-level acoustic descriptors
- relies only on acoustic features that are readily obtained from the speech signal
- the only segmental information is the time alignment at the word level (word-based recognition)
- addresses explicit context modelling in a simple and efficient way

Experimental Focus
- detection (binary) and classification (multi-class)
- ToBI pitch accents and intonational phrase boundaries [Silverman et al. 1992]
- American English data
- speaker-dependent and speaker-independent evaluation

Model
- supervised learning task: each word is labelled as carrying a prosodic event or not
- feature matrix: frame-based representation of the audio signal
- 2 convolution layers
- max pooling finds the most salient features
- resulting feature maps are concatenated into one feature vector
- softmax layer: 2 units for binary classification or several for multi-class
[Figure: input matrix spanning the words w(t-1), w(t), w(t+1) along the time axis and the acoustic features along the feature dimension, with a position indicator row (0 0 0 0 0 1 1 1 1 1 0 0 0 0 0); followed by 1st convolution, 2nd convolution (feature maps feat_map_1, feat_map_2, feat_map_3, ...), max pooling, and a softmax over the prosodic event classes]
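
The pooling, concatenation, and softmax steps listed on this slide can be sketched in plain Python. This is only an illustration: the non-overlapping pooling window and the function names `max_pool`, `pooled_vector`, and `softmax` are assumptions made here, not details taken from the slides.

```python
import math


def max_pool(feature_map, size):
    """Keep the largest activation in each non-overlapping window (window layout is an assumption)."""
    return [max(feature_map[i:i + size]) for i in range(0, len(feature_map), size)]


def pooled_vector(feature_maps, size):
    """Pool every feature map and concatenate the results into one feature vector."""
    vector = []
    for feature_map in feature_maps:
        vector.extend(max_pool(feature_map, size))
    return vector


def softmax(logits):
    """Turn class scores into probabilities; subtracting the maximum keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two toy feature maps, pooled with window size 2 and concatenated.
features = pooled_vector([[1.0, 3.0, 2.0, 5.0], [0.0, -1.0, 7.0, 2.0]], size=2)
# Two output units correspond to the binary detection setting.
probabilities = softmax([0.5, 1.5])
```

In a trained model the logits would come from a learned linear layer applied to `features`; that layer is omitted here.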

Acoustic Features
- extracted using the openSMILE toolkit [Eyben et al. 2013]
- two different feature sets:
  - prosody: smoothed f0, RMS energy, PCM loudness, voicing probability, Harmonics-to-Noise Ratio
  - Mel: 27 features extracted from the Mel-frequency spectrum
- features computed for each 20 ms frame with a 10 ms shift
- all frames are grouped into feature matrices that represent each word
- zero padding ensures that all matrices have the same size
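
As a rough illustration of the framing and zero-padding steps, the sketch below counts the 20 ms frames (10 ms shift) that fit into a word and pads word matrices to a common length. Dropping a trailing partial frame and padding at the end of the matrix are assumptions made here; the helper names are hypothetical.

```python
FRAME_MS = 20  # frame length stated on the slide
SHIFT_MS = 10  # frame shift stated on the slide


def num_frames(duration_ms):
    """Number of full frames in a segment (dropping a trailing partial frame is an assumption)."""
    if duration_ms < FRAME_MS:
        return 0
    return (duration_ms - FRAME_MS) // SHIFT_MS + 1


def zero_pad(word_frames, target_len, n_features):
    """Append all-zero frames until the word matrix has target_len rows (end padding is an assumption)."""
    missing = target_len - len(word_frames)
    return word_frames + [[0.0] * n_features for _ in range(missing)]


def pad_all(word_matrices, n_features):
    """Pad every word matrix to the length of the longest one."""
    target_len = max(len(m) for m in word_matrices)
    return [zero_pad(m, target_len, n_features) for m in word_matrices]


# Two toy words with 2 features per frame: one has 3 frames, the other 1.
padded = pad_all([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[0.7, 0.8]]], n_features=2)
```

For example, a 300 ms word yields `num_frames(300)` = 29 frames under these assumptions.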

Modelling Context
- most PER methods do context modelling
  - prosodic events span longer stretches of speech
  - e.g. right and left context words
- CNN looks for patterns in the whole input
- adding right and left context frames to the input matrix makes modelling the current word more difficult
  - max pooling may find more salient features in neighbouring segments

Position Indicator Feature
- 1st convolution layer: kernels span the entire feature dimension
- model is constantly informed if the current frames belong to the current word or not
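
One way to picture the position indicator is as an extra feature that is 1 for frames of the current word and 0 for frames of the context words, matching the 0/1 pattern in the model diagram. The sketch below assumes the indicator is appended as an additional column of the frame-by-feature matrix; `add_position_indicator` is a hypothetical helper, not the authors' code.

```python
def add_position_indicator(prev_frames, cur_frames, next_frames):
    """Stack a 3-word window and append a 0/1 flag to every frame.

    Appending the flag as the last feature is an assumption made for this sketch.
    """
    rows = []
    for frames, flag in ((prev_frames, 0.0), (cur_frames, 1.0), (next_frames, 0.0)):
        for frame in frames:
            rows.append(list(frame) + [flag])
    return rows


# Toy window: 1 frame of left context, 2 frames of the current word, 1 frame of right context.
window = add_position_indicator(
    [[0.1, 0.2]],
    [[0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
)
indicator = [row[-1] for row in window]
```

Because the first-layer kernels span the entire feature dimension, every kernel position sees this flag together with the acoustic features of the same frames.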

Hyperparameters
- 1st layer: 100 kernels of shape 6 × d, stride 4 × 1
- 2nd layer: 100 kernels of shape 4 × 1, stride 2 × 1
- max pooling size is set so that the output has the same shape
- dropout with p = 0.2 applied before the softmax layer
- models trained for 50 epochs with an adaptive learning rate (Adam) and L2 regularization
- all experiments are repeated 3 times and the results are averaged
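
To make the kernel and stride settings concrete, the sketch below works out the feature-map sizes after each convolution, reading the shapes as 6 × d with stride 4 × 1 for the first layer and 4 × 1 with stride 2 × 1 for the second. It assumes unpadded ("valid") convolutions, which the slides do not state, and the input size used in the example is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis (no padding is an assumption)."""
    return (size - kernel) // stride + 1


def layer_shapes(n_frames, d):
    """Return (time, feature) sizes of one feature map after each convolution layer."""
    # 1st layer: kernel 6 x d, stride 4 x 1; the kernel spans the whole feature dimension.
    t1 = conv_out(n_frames, 6, 4)
    f1 = conv_out(d, d, 1)
    # 2nd layer: kernel 4 x 1, stride 2 x 1.
    t2 = conv_out(t1, 4, 2)
    f2 = conv_out(f1, 1, 1)
    return (t1, f1), (t2, f2)


# Hypothetical input: 150 frames with d = 6 features per frame.
first, second = layer_shapes(n_frames=150, d=6)
```

With these numbers the first layer produces maps of size 37 × 1 and the second layer maps of size 17 × 1, so the feature dimension collapses after the first convolution.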

Data
- Boston University Radio News Corpus
- subset that is manually labelled with ToBI event types [Ostendorf et al. 1993]
- 3 female, 2 male speakers
- 2 hours and 45 minutes of speech
- largest speaker set f2b used for speaker-dependent experiments with 10-fold cross-validation
- speaker-independent: leave-one-speaker-out cross-validation

Speakers      f1a     f2b     f3a     m1a     m2b
PA # words    4375    12357   2736    3584    3607
PB # words    4362    12606   2736    5055    3607
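
The two evaluation setups can be sketched as follows. The speaker names and the f2b word count come from the table on this slide; how the folds were actually drawn is not stated, so the even split in `k_fold_sizes` is an assumption, and both function names are hypothetical.

```python
SPEAKERS = ["f1a", "f2b", "f3a", "m1a", "m2b"]


def leave_one_speaker_out(speakers):
    """Yield (training speakers, held-out speaker) pairs, one per speaker."""
    for held_out in speakers:
        yield [s for s in speakers if s != held_out], held_out


def k_fold_sizes(n_items, k=10):
    """Sizes of k folds that differ by at most one item (even split is an assumption)."""
    base, extra = divmod(n_items, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


# Speaker-independent setup: five train/test splits.
splits = list(leave_one_speaker_out(SPEAKERS))
# Speaker-dependent setup: 10 folds over the 12357 pitch-accent words of f2b.
fold_sizes = k_fold_sizes(12357, k=10)
```

Each of the five splits trains on four speakers and tests on the remaining one, so no test speaker is seen during training.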

Labels
- binary classification (detection): all labels grouped together as one class
- multi-class classification of 5 different ToBI types:
  - pitch accents: (1) H*; !H* (2) L* (3) L+H*; L+!H* (4) L*+H; L*+!H (5) H+!H*
  - boundary tones: (1) L-L% (2) L-H% (3) H-L% (4) !H-L% (5) H-H%
- uncertain events ignored for both detection and classification
- uncertain types ignored for classification
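
The label grouping above can be written as a lookup table. In this sketch, the class numbers follow the slide; treating words without an event as class 0 and treating any label outside the table as uncertain (and therefore skipped) are assumptions made for illustration.

```python
# Class numbers follow the grouping on the slide.
PITCH_ACCENT_CLASSES = {
    "H*": 1, "!H*": 1,
    "L*": 2,
    "L+H*": 3, "L+!H*": 3,
    "L*+H": 4, "L*+!H": 4,
    "H+!H*": 5,
}
BOUNDARY_TONE_CLASSES = {"L-L%": 1, "L-H%": 2, "H-L%": 3, "!H-L%": 4, "H-H%": 5}

NO_EVENT = 0  # assumption: words without an event form their own class


def detection_label(tobi_label):
    """Binary label: 1 if the word carries any event, 0 otherwise."""
    return 0 if tobi_label is None else 1


def classification_label(tobi_label, classes):
    """Map a ToBI label to its class; return None for labels outside the table.

    Treating unknown labels as uncertain types to be skipped is an assumption.
    """
    if tobi_label is None:
        return NO_EVENT
    return classes.get(tobi_label)


grouped = [classification_label(label, PITCH_ACCENT_CLASSES) for label in ["H*", "!H*", "L+!H*", None, "*?"]]
```

Here `"*?"` stands in for an uncertain annotation and maps to `None`, so a caller can drop that word from the classification data.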

Results: Pitch Accent Recognition
all results reported in accuracy (%); PF = position indicator feature

                  one speaker                   all speakers
Feature set       prosody   Mel       pros.+Mel prosody   Mel       pros.+Mel
Detection
  1 word          84.2      84.2      84.0      81.9      78.3      79.3
  3 words         58.3      53.1      53.6      58.2      54.3      55.3
  3 words + PF    86.3      83.3      83.9      83.6      80.3      81.1
Classification
  1 word          74.4      72.7      73.5      68.0      64.7      64.5
  3 words         52.4      47.8      47.8      50.5      48.4      48.4
  3 words + PF    76.3      72.3      72.9      69.0      65.9      65.3

Results: Phrase Boundary Recognition
all results reported in accuracy (%)

                  one speaker                   all speakers
Feature set       prosody   Mel       pros.+Mel prosody   Mel       pros.+Mel
Detection
  1 word          87.6      89.2      89.8      86.5      85.3      86.1
  3 words         80.3      75.4      75.4      82.7      81.0      80.8
  3 words + PF    90.2      90.4      90.5      89.8      88.3      88.8
Classification
  1 word          85.6      87.6      88.0      85.1      84.4      84.9
  3 words         79.7      74.5      74.6      82.5      81.4      81.5
  3 words + PF    87.8      88.7      88.8      87.3      86.2      86.7

Results: Overview
[Figure: results for pitch accents and phrase boundaries, using the best-performing feature set]

Observations
- large drop in performance when extending the input to include the right and left context words
- performance improves after adding position indicator features
- results for phrase boundaries show a similar pattern as for pitch accents
- prosody feature set performs best
- differences between feature sets are not as large for phrase boundaries

Effects of z-scoring

                      non-normalized   normalized
Pitch Accents
  Detection           83.6             77.0
  Classification      69.0             62.6
Phrase Boundaries
  Detection           89.8             83.0
  Classification      87.3             83.2

- speaker-independent experiments using prosody and position features
- the CNN looks for relative changes in speech, and normalizing may lead to a loss of fine differences
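
For reference, z-scoring rescales each feature to zero mean and unit standard deviation. The sketch below normalizes each column of a frame-by-feature matrix; the slides do not say over which unit (word, utterance, or speaker) the statistics were computed, so normalizing over the given matrix is an assumption, as is mapping constant columns to 0.

```python
import statistics


def z_score_columns(matrix):
    """Z-score every feature column; constant columns map to 0.0 (an assumption for this sketch)."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - mean) / std if std > 0 else 0.0 for value, mean, std in zip(row, means, stds)]
        for row in matrix
    ]


# Two frames, two features: the second feature is constant.
normalized = z_score_columns([[1.0, 5.0], [3.0, 5.0]])
```

After this step the absolute level of a feature such as f0 is no longer visible to the model, only its position relative to the mean in standard-deviation units.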

Conclusion
- position indicator feature is crucial for this method
- model generalizes well from a speaker-dependent setup to a speaker-independent setting
- presented method can be readily applied to other datasets
- strong and efficient modelling technique that will be used as a basis in future work
- further feature and results analysis necessary

Thank you!
sabrina.stehwien@ims.uni-stuttgart.de
thang.vu@ims.uni-stuttgart.de