Prosody-based automatic segmentation of speech into sentences and topics


As presented in the similarly titled paper by E. Shriberg, A. Stolcke, D. Hakkani-Tür and G. Tür.
Vesa Siivola (Vesa.Siivola@hut.fi), Audio Mining seminar, Oct 3 2002.

Why?
Segmentation into sentences and topics is needed for robust information extraction.
Sentence segmentation:
- Speech input has no typographic cues (punctuation, paragraphs, capitalization, etc.)
- First step for topic segmentation
Topic segmentation needed for:
- Topic detection and tracking
- Summarization
Example of unsegmented speech: "this is speech as you can see it can be hard to read without any punctuation capitalization would also help as well as paragraphing"

Part I: Models and features

Information sources for segmentation
Language models
Prosody: timing, pitch, stress and other voice qualities (e.g. creak)
- Relatively unaffected by word identity, so robust to speech recognition (SR) errors
- Prosody-based segmentation can be used on its own, e.g. for audio browsing
- Many prosodic features are invariant to channel changes, hence robust
- Minimal additional load when used together with traditional SR

Training the prosodic models
- Word boundaries found by forced alignment with the SR system (note: mismatch between training and test data)
- Features extracted from the words on both sides of each boundary (or, alternatively, from 200 ms windows around the pause)
  Example: "after an earthquake hit last night (pause) at eleven we bring", with 200 ms windows on each side of the boundary
- Started with about 100 features, reduced through decision-tree experiments
- Pause durations, phone durations, pitch information, voice quality
- Given (not estimated): speaker gender, speaker change
- No energy- or amplitude-based features: these were not robust enough across different channels

Pause features
Pauses give a strong hint about possible topic and sentence breaks.
Used features:
- Current pause duration / previous pause duration
- False pauses at stop closures are no problem; the model learns them
- Raw durations vs. speaker-normalized durations
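
A rough sketch (not the authors' code) of how the pause features above could be computed; the data layout and the normalization by the speaker's mean pause are my own assumptions.

```python
# Minimal sketch of the pause features (hypothetical data layout, not the original implementation).
from statistics import mean

def pause_features(pauses, i, speaker_mean_pause):
    """Features for the candidate boundary after word i.

    pauses[i] is the pause duration (seconds) following word i; 0.0 if there is none.
    """
    cur = pauses[i]
    prev = pauses[i - 1] if i > 0 else 0.0
    return {
        "pause_dur": cur,                                                   # raw duration
        "pause_dur_norm": cur / speaker_mean_pause if speaker_mean_pause else 0.0,
        "cur_over_prev": cur / prev if prev > 0 else float("inf"),          # simplification
    }

# toy usage: pauses after each word of one speaker's turn
pauses = [0.0, 0.05, 0.0, 0.82, 0.0, 0.12]
spk_mean = mean(p for p in pauses if p > 0)
print(pause_features(pauses, 3, spk_mean))
```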

Phone and syllable duration features
Speakers typically slow down toward the end of a unit.
- Last syllable length compared to the average syllable length
- Longest phone and longest vowel of the last word
- General features vs. speaker-normalized features
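
An illustrative sketch of speaker-normalized duration features; in practice the phone and syllable durations would come from the recognizer's forced alignment, and the exact normalization used in the paper may differ.

```python
# Sketch of speaker-normalized duration features (illustrative only).
import numpy as np

def normalized_duration_feats(last_word_phone_durs, last_syl_dur, speaker_syl_durs):
    """last_word_phone_durs: phone durations (s) of the word before the boundary.
    last_syl_dur: duration of the last syllable before the boundary.
    speaker_syl_durs: syllable durations seen so far for this speaker."""
    mu = np.mean(speaker_syl_durs)
    sigma = np.std(speaker_syl_durs) + 1e-6
    return {
        "last_syl_vs_avg": last_syl_dur / mu,             # lengthening relative to speaker average
        "last_syl_zscore": (last_syl_dur - mu) / sigma,   # speaker-normalized version
        "longest_phone": max(last_word_phone_durs),
    }

print(normalized_duration_feats([0.06, 0.11, 0.19], 0.28, [0.15, 0.18, 0.22, 0.20]))
```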

Pitch features
Processing pipeline: pitch tracker -> LTM filtering -> median filtering -> piecewise linear stylization -> feature computation
[Figure: histogram of log f0 with marks at µ - log 2, µ and µ + log 2]
- Pitch (f0) estimation is not very robust and needs postprocessing
- f0 doubling/halving corrected, estimated on a per-speaker basis
- Median filtering removes unstable estimates at the beginnings of voiced sounds
- Piecewise linearization
- The result is a stylized f0 contour
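
A generic sketch of the last two postprocessing steps (median filtering plus piecewise linear stylization); this is a simple greedy split on maximum deviation, not the specific LTM-based processing of the paper.

```python
# Rough sketch of f0 postprocessing: median filtering + piecewise linear stylization.
import numpy as np
from scipy.signal import medfilt

def stylize(f0, max_err=2.0):
    """Greedy piecewise-linear stylization of a voiced f0 track (Hz).
    Returns the frame indices of the breakpoints of the stylized contour."""
    def fit(lo, hi):
        # split recursively while the straight line deviates too much from the data
        x = np.arange(lo, hi + 1)
        line = np.interp(x, [lo, hi], [f0[lo], f0[hi]])
        err = np.abs(f0[lo:hi + 1] - line)
        k = int(np.argmax(err))
        if err[k] > max_err and hi - lo > 1:
            return fit(lo, lo + k)[:-1] + fit(lo + k, hi)
        return [lo, hi]
    return fit(0, len(f0) - 1)

# toy contour: declining pitch with measurement noise
f0 = 200 - 0.5 * np.arange(101) + np.random.randn(101)
f0 = medfilt(f0, kernel_size=5)          # remove unstable single-frame estimates
print("breakpoints at frames:", stylize(f0))
```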

Pitch features 2
f0 reset features:
- A speaker usually resets pitch at the start of a new block, typically preceded by a final fall
- Features: log ratio or log difference of the min, max, mean, start and end of the stylized f0 in the following and preceding word
f0 range features:
- Pitch range in the word before the boundary compared to the baseline f0
- f0 slope on each side of the boundary
- f0 continuity across the boundary
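
As an illustration, the reset features could be computed roughly as below; the feature names and the choice of statistics per word are my own, the paper only lists them at the level above.

```python
# Illustrative f0 reset features: log ratios of stylized-f0 statistics across the boundary.
import numpy as np

def f0_reset_feats(f0_prev_word, f0_next_word):
    """f0_prev_word / f0_next_word: stylized f0 samples (Hz) within the two words."""
    feats = {}
    for name, fn in [("min", np.min), ("max", np.max), ("mean", np.mean)]:
        feats[f"log_ratio_{name}"] = np.log(fn(f0_next_word) / fn(f0_prev_word))
    # reset: start of the following word relative to the end of the preceding word
    feats["log_ratio_start_end"] = np.log(f0_next_word[0] / f0_prev_word[-1])
    return feats

print(f0_reset_feats(np.array([180.0, 160.0, 140.0]), np.array([220.0, 210.0, 205.0])))
```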

Other features
- Pitch halving flagged by the f0 detector (usually a sign of creaky voice)
- Gender of the speaker (given, not estimated)
- Speaker change (given, not estimated)

Modeling: Decision trees
- CART-based; the IND package was used, which copes with missing values
- Decision trees (DT) make no assumptions about the shape of the feature distributions
- Categorical features also work
- Decision trees are interpretable by humans
[Figure: example decision tree splitting first on a > 5 / a <= 5, then on b = true / b = false, with class posteriors at each node]
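
A minimal stand-in for the CART boundary classifier, using scikit-learn instead of the IND package (unlike IND, plain scikit-learn trees do not handle missing values); the feature values are made up for illustration.

```python
# Toy CART-style boundary classifier; posteriors from predict_proba play the role of P_DT.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: pause duration (s), normalized last-syllable duration, speaker change (0/1)
X = np.array([[0.82, 1.9, 0], [0.05, 1.0, 0], [1.30, 1.4, 1],
              [0.00, 0.9, 0], [0.65, 1.7, 0], [0.10, 1.1, 0]])
y = np.array([1, 0, 1, 0, 1, 0])          # 1 = sentence boundary

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["pause", "last_syl", "spk_change"]))
print("P(boundary):", tree.predict_proba([[0.7, 1.8, 0]])[0, 1])
```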

Feature selection algorithm
- The initial feature set is highly redundant, which is not very good for a greedy algorithm like CART
- Iterative feature selection algorithm:
  1. Leave-one-out elimination for as long as performance does not decrease significantly
  2. Beam search over the subsets that contain a set of human-selected core features
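
A sketch of the first phase (greedy leave-one-out elimination); `evaluate` is a placeholder for training a tree on the given feature subset and measuring boundary error on held-out data, and the tolerance threshold is my own simplification of "does not decrease significantly".

```python
# Greedy leave-one-out feature elimination (sketch; `evaluate` is a stand-in).
def leave_one_out_selection(features, evaluate, tolerance=0.0):
    """Drop features one at a time as long as the error does not grow by more than `tolerance`."""
    current = list(features)
    best_err = evaluate(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for f in list(current):
            subset = [x for x in current if x != f]
            err = evaluate(subset)
            if err <= best_err + tolerance:        # no significant degradation: drop f
                current, best_err, improved = subset, min(err, best_err), True
                break
    return current, best_err

# toy usage: redundant features only add a small per-feature cost
gains = {"pause": 0.05, "dur": 0.03, "f0": 0.02, "noise": 0.0}
evaluate = lambda subset: 0.10 - sum(gains[f] for f in subset) + 0.01 * len(subset)
print(leave_one_out_selection(list(gains), evaluate))   # drops the useless "noise" feature
```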

Language modeling for sentence segmentation
Hidden Markov model: observations are words, states are words plus boundaries. The observations keep the model and the word stream in sync.
Can be described as:
  P_S  = P(<S> | w_{n-1}, w_{n-2}, <S>) * P(w_n | <S>)
  P_!S = P(w_n | w_{n-1}, w_{n-2}, <S>)
where <S> is a sentence boundary.
Trained on annotated, boundary-tagged training data with Katz back-off.
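
A toy sketch of the hidden-event idea: at each word gap compare the boundary and no-boundary probabilities above. The real system uses a Katz back-off n-gram over words and <S> tokens and decodes the whole HMM; `lm_logp` here is a hand-made stand-in and the decision is a greedy local one, not a full Viterbi/forward pass.

```python
# Toy hidden-event segmenter with a placeholder language model score.
import math

def lm_logp(token, prev):
    """Placeholder for P(token | prev); not a trained language model."""
    probs = {("tonight", "<s>"): 0.6,   # a boundary is plausible after "tonight" in this toy
             ("<s>", "an"): 0.5}        # and "an" plausibly starts the next sentence
    return math.log(probs.get((prev, token), 0.1))

def segment(words):
    boundaries = []
    for i in range(1, len(words)):
        logp_s = lm_logp("<s>", words[i - 1]) + lm_logp(words[i], "<s>")   # P_S
        logp_no = lm_logp(words[i], words[i - 1])                           # P_!S
        if logp_s > logp_no:
            boundaries.append(i)
    return boundaries

print(segment("we have breaking news tonight an earthquake hit the city".split()))  # -> [5]
```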

Language modeling for topic segmentation
- 100 unigram topic-cluster language models
- HMM: states are topic clusters, observations are sentences
- Complete graph with initial and end states
[Figure: HMM with start and end states fully connected to topic-cluster states such as news, politics, sports and culture]
- Data presegmented at pauses > 0.65 s

Model combination
Posterior probability interpolation:
  P(T_i | W, F) ≈ λ P_LM(T_i | W) + (1 - λ) P_DT(T_i | F_i, W)
  λ is optimized on held-out data
Integrated hidden Markov modeling:
- Similar to the hidden Markov model used in language modeling
- The model emits both words and prosodic observations
HMM posteriors as decision tree features (not used here)
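
A small sketch of the posterior interpolation, with λ tuned on held-out data by grid search; all the posteriors and labels below are invented for illustration, and minimizing classification error is one possible tuning criterion among several.

```python
# Posterior interpolation of LM and decision-tree boundary posteriors (sketch).
import numpy as np

def interpolate(p_lm, p_dt, lam):
    return lam * p_lm + (1 - lam) * p_dt

def tune_lambda(p_lm, p_dt, labels):
    """Pick the lambda that minimizes boundary classification error on held-out data."""
    grid = np.linspace(0, 1, 101)
    errors = [np.mean((interpolate(p_lm, p_dt, l) > 0.5) != labels) for l in grid]
    return grid[int(np.argmin(errors))]

# held-out boundary posteriors from the two models, plus reference labels
p_lm = np.array([0.9, 0.2, 0.6, 0.1, 0.7])
p_dt = np.array([0.8, 0.4, 0.3, 0.2, 0.9])
labels = np.array([1, 0, 1, 0, 1])
lam = tune_lambda(p_lm, p_dt, labels)
print("lambda =", lam, "combined =", interpolate(p_lm, p_dt, lam))
```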

Part II: Data and experiments

Data
Switchboard (SWB):
- Telephone conversations
- Hand-labeled subset of data from the Linguistic Data Consortium (LDC)
Broadcast News (BN): from LDC's 1997 Broadcast News corpus
- Sentence boundaries automatically marked by the MITRE tagger (from punctuation, capitalization, etc.)
- Additional Hub-4 data for the sentence-detection language models
- TDT and TDT2 data for the topic-detection language models

Data 2
For the experiments with recognized speech, the 1-best output of SRI's DECIPHER recognizer was used.
- Switchboard WER 46.7 %
- Broadcast News WER 30.5 %

Task                  Training (LM)   Training (prosody)   Tuning       Test
SWB sentence (real)   1.2M words      1.2M words           103K words   101K words
SWB sentence (recog)  1.2M words      1.2M words           6K words     8K words
BN sentence           130M words      700K words           24K words    21K words
BN topic              10.7M words     700K words           205K words   44K words

Results: BN sentence segmentation

Model                True words   SR words
Chance               6.2          13.3
Lower bound          0.0          7.9
With f0 feats:
  LM only            4.1          11.8
  Prosody only       3.6          10.9
  Interpolated       3.5          10.8
  Combined HMM       3.3          13.3
Without f0 feats:
  Prosody only       3.8          11.3
  Interpolated       3.2          -
  Combined HMM       -            11.1

Results: BN sentence segmentation
Features queried in the decision tree:
- 46% pause duration at boundary
- 42% speaker change
- 11% f0 difference
- 1% last syllable duration
The decision tree looks like what would be expected from the prosody literature.
Prosodic features outperform the word-based features and are also more robust to SR errors.
Dropping the f0-based features did not matter much.

Results: SWB sentence segmentation

Model          True words   SR words
Chance         11.0         25.8
Lower bound    0.0          17.6
LM only        4.3          22.8
Prosody only   6.7          22.9
Interpolated   4.1          22.2
Combined HMM   4.0          22.5

Results: SWB sentence segmentation
Features queried in the decision tree:
- 49% phone and syllable duration preceding the boundary
- 18% pause length at the boundary
- 17% speaker change
- 15% pause at the previous word boundary (e.g. "<S> Yeah <S> I know what you mean")
- 1% how long this speaker has been speaking
The prosodic model is not very good, or the material is very easy for language modeling: a few words ("I") appear very often at the starts of sentences.
The prosodic model is robust to ASR errors, while the LM degrades badly.

Results: BN topic segmentation

Model              True words   SR words
Chance             0.3          0.3
With f0 feats:
  LM only          0.190        0.190
  Prosody only     0.166        0.173
  Combined HMM     0.138        0.144
Without f0 feats:
  Combined HMM     0.151        -

Results: BN topic segmentation
Features queried in the decision tree:
- 43% pause duration at boundary
- 36% f0 range (preceding word vs. baseline f0)
- 9% speaker change
- 7% speaker gender: men use f0 differently than women (even after normalization)
- 5% how long this speaker has been speaking
Pause duration is underestimated, since the speech was presegmented by cutting at pauses longer than 0.65 s.
Prosody is more reliable than the LM; the combination is very good.
f0 features are important: serious degradation without them.

Summary
- The topic LM is much more robust than the sentence LM
- Feature usage is corpus- and task-dependent
Possible improvements:
- Model lexical stress and syllable structure
- Different combinations of features
- Remove the mismatch in training (true words vs. recognized words)
- Condition on show, speaking style, speaker

Home exercise
a) Which features were the most important?
b) Which of these were estimated, and which ones were given?
c) How hard would it be to estimate the given features, and what kind of error rates could be achieved?
d) How well would the sentence/topic finder work if it really had to estimate the given features as well?
e) What are the real phenomena that the most important features are sensitive to? That is, how do people separate sentences and topics in real life, and how do these effects show up in the features?
Without tests there can be no right answer to questions c) and d). State your own guesstimate and briefly explain the reasoning behind your answer.

Project work: sentence segmentation based on temporal features
Data: Syntymättömien sukupolvien Eurooppa
Features:
- pause duration
- previous pause duration
- last syllable length (vs. average syllable length?)
- average syllable length in the last sentence
- last word
Modeling: SOM or MLP? (see the sketch below for the MLP option)
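
If the MLP option is chosen, a minimal starting point could look like the following; the data here is synthetic and the feature columns merely mirror the list above, the real features would have to be extracted from the Syntymättömien sukupolvien Eurooppa recordings.

```python
# Minimal MLP baseline for sentence-boundary detection from temporal features (synthetic data).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 200
# columns: pause dur, prev pause dur, last syl len / avg syl len, avg syl len in last sentence
X = rng.random((n, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0.9).astype(int)   # synthetic "boundary" rule

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
print("train accuracy:", clf.score(X, y))
print("P(boundary) for a long pause:", clf.predict_proba([[0.9, 0.1, 0.8, 0.5]])[0, 1])
```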