Probabilistic Latent Semantic Analysis for Broadcast News Story Segmentation

Interspeech 2011, Florence, Italy

Probabilistic Latent Semantic Analysis for Broadcast News Story Segmentation

Mimi Lu (1,2), Cheung-Chi Leung (2), Lei Xie (1), Bin Ma (2) and Haizhou Li (2)
(1) Shaanxi Provincial Key Lab of Speech and Image Information Processing, Northwestern Polytechnical University, China
(2) Institute for Infocomm Research, A*STAR, Singapore

Broadcast news story segmentation
- The task of dividing broadcast news (BN) programs into homogeneous units, each addressing a main topic
- A key precursor to various tasks, such as spoken document retrieval and summarization
- Three categories of cues for story segmentation: lexical, acoustic and visual

Motivation
- Lexical cohesion based methods:
  - Words in a story hang together through semantic relations; different stories deploy different sets of words
  - Cohesion is usually measured by rigid word counts
- Literal matching on individual terms is unreliable:
  - Synonymy: "car" vs. "automobile"
  - Polysemy: "china" can refer to a nation or to porcelain; "apple" can refer to Apple Computer Inc. or to the fruit
- Conceptual matching is introduced instead, e.g. latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA)

Motivation
- Spoken document segmentation differs from text segmentation:
  - The task is performed on LVCSR output, where erroneous words break lexical cohesion
  - Many recognition errors come from out-of-vocabulary (OOV) words, which are typically named entities that are key to topics
- Phoneme n-grams enable partial matching: incorrectly recognized words may still contain correctly recognized subword units
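The partial-matching idea can be illustrated with a small sketch (the function name, example words and CMU-style phone symbols below are illustrative, not from the paper):

```python
def phoneme_ngrams(phones, n):
    """All overlapping phoneme n-grams of one pronunciation."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

# Hypothetical example: "Sharon" misrecognized as "share on".
# The word strings do not match, but phoneme 2-grams largely overlap.
ref = phoneme_ngrams(["SH", "EH", "R", "AH", "N"], 2)   # "Sharon"
hyp = phoneme_ngrams(["SH", "EH", "R", "AA", "N"], 2)   # "share" + "on"
overlap = set(ref) & set(hyp)                           # shared subword units
```

Comparing blocks at the phoneme n-gram level thus recovers cohesion that word-level matching loses to recognition errors.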

Contributions
- We use PLSA for story segmentation of broadcast news
- We use phoneme n-grams as the basic unit of the lexical cohesion measure, to handle erroneous LVCSR transcripts
- A cross-entropy based approach to lexical cohesion measurement is introduced and compared with cosine similarity
- We compare dynamic programming (DP) with TextTiling for story boundary identification

PLSA model
- Probabilistic latent semantic analysis; d: document, w: word, z: topic

  P(d, w) = P(d) P(w|d),    P(w|d) = \sum_{z \in Z} P(w|z) P(z|d)

- Maximum likelihood estimation: maximize the log-likelihood of co-occurrence pairs

  L = \sum_d \sum_w n(d, w) \log P(d, w)

- E-step:

  P(z|d, w) = P(w|z) P(z|d) / \sum_{z' \in Z} P(w|z') P(z'|d)

- M-step:

  P(w|z) = \sum_d n(d, w) P(z|d, w) / \sum_{w'} \sum_d n(d, w') P(z|d, w')
  P(z|d) = \sum_w n(d, w) P(z|d, w) / \sum_{z'} \sum_w n(d, w) P(z'|d, w)

- Folding-in process for unseen test data: keep P(w|z) fixed

System overview
- Training data (ASR transcripts): stemming and stopword removal, then a word count matrix (rows: vocabulary; columns: documents) for PLSA parameter estimation, giving P(w|z)
- Test data: PLSA parameter estimation with folding-in:

  P(w|b_test) = \sum_z P(w|z) P(z|b_test)

- The PLSA-smoothed statistics feed the lexical cohesion measure, followed by boundary identification

Sentence construction
- Sentence delimiters are not available in LVCSR transcripts
- Pseudo-sentences are formed: text blocks, each with a fixed number of consecutive words
- Pseudo-sentence boundaries serve as story boundary candidates
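Pseudo-sentence construction amounts to fixed-length chunking of the word stream (a sketch; the block length below is arbitrary, not the tuned value from the experiments):

```python
def pseudo_sentences(words, block_len):
    """Group a transcript's word stream into fixed-length blocks; each
    block boundary becomes a story-boundary candidate."""
    return [words[i:i + block_len] for i in range(0, len(words), block_len)]
```

The last block may be shorter than `block_len` when the transcript length is not a multiple of it.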

Lexical cohesion measure: cosine similarity
- Measures the closeness between two vectors, usually calculated on term frequencies
- Applied to PLSA statistics:

  Sim(i, j) = \sum_w P(w|b_i) P(w|b_j) / ( \sqrt{\sum_w P(w|b_i)^2} \sqrt{\sum_w P(w|b_j)^2} )
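On the PLSA-smoothed statistics this is ordinary cosine similarity between the two blocks' P(w|b) vectors (a minimal sketch):

```python
import math

def cosine_sim(p_i, p_j):
    """Cosine similarity between two blocks' P(w|b) vectors."""
    num = sum(a * b for a, b in zip(p_i, p_j))
    den = math.sqrt(sum(a * a for a in p_i)) * math.sqrt(sum(b * b for b in p_j))
    return num / den
```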

Lexical cohesion measure: cross entropy
- A divergence measure depicting how different two distributions are:

  H(p, q) = -\sum_x p(x) \log q(x)

- The minimum is obtained when p = q

Lexical cohesion measure: cross entropy
- Applied to PLSA statistics:

  CrossEnt(i, j) = -\sum_w P(w|b_i) \log P(w|b_j)

- Normalization:

  Dissim(i, j) = ( CrossEnt(i, j) - CrossEnt(i, i) ) / CrossEnt(i, j)
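A sketch of the cross-entropy dissimilarity; the normalized form used here, (CrossEnt(i,j) − CrossEnt(i,i)) / CrossEnt(i,j), is one reading of the slide's normalization (it vanishes when the two distributions are identical), and the smoothing constant `eps` is illustrative:

```python
import math

def cross_ent(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_x p(x) log q(x), smoothed by eps
    to guard against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def dissim(p_i, p_j):
    """Normalized cross-entropy dissimilarity between two P(w|b) vectors."""
    ce_ij = cross_ent(p_i, p_j)
    return (ce_ij - cross_ent(p_i, p_i)) / ce_ij
```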

Boundary identification: local comparison
- Compute lexical scores between adjacent blocks
- Locate valleys (similarity) or peaks (dissimilarity), e.g. TextTiling
- Suitable when salient topic changes occur

Boundary identification: global optimization
- Minimize the cost of a specific segmentation S = {s_1, ..., s_k, ..., s_K} of document D:

  C(S) = \sum_{k=1}^{K} Cost(s_k),    \hat{S} = \arg\min_S C(S)

  Cost(s_k) = \sum_{i, j \in s_k} Dissim(i, j) / N(len(s_k))

  where N(len(s_k)) is a normalization factor

- Implementation: dynamic programming (DP)
- Suitable when topic transitions are smooth

Boundary identification: normalization factor
- N(len(s_k)) makes long and short segments comparable, reflecting the inter-block disparity distribution:

  N(len) = len^\alpha,    \alpha > 1
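The global optimization can be sketched with a standard segmentation DP (illustrative; `dissim` here is any pairwise dissimilarity function, and the alpha value and synthetic example are not the paper's tuned settings):

```python
def segment_dp(dissim, n_blocks, alpha):
    """Choose boundaries minimizing the sum over segments of
    (sum of pairwise Dissim within the segment) / len(segment)**alpha.
    Returns the internal boundary positions. The per-segment cost is
    recomputed naively here for clarity."""
    INF = float("inf")

    def seg_cost(s, e):  # cost of one segment covering blocks s..e-1
        total = sum(dissim(i, j) for i in range(s, e) for j in range(i + 1, e))
        return total / ((e - s) ** alpha)

    best = [0.0] + [INF] * n_blocks   # best[e]: min cost over blocks 0..e-1
    back = [0] * (n_blocks + 1)       # back[e]: start of the last segment
    for e in range(1, n_blocks + 1):
        for s in range(e):
            c = best[s] + seg_cost(s, e)
            if c < best[e]:
                best[e], back[e] = c, s
    cuts, e = [], n_blocks            # trace back the boundary positions
    while e > 0:
        if back[e] > 0:
            cuts.append(back[e])
        e = back[e]
    return sorted(cuts)
```

For example, with six blocks where blocks 0-2 and 3-5 form two internally coherent topics, the DP places the single internal boundary at position 3.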

Experimental setup
- Corpus: LVCSR transcripts of TDT2 VOA English broadcast news
- Data used (number of programs): training = 56, development = 27, test = 28
- Tuning parameters:
  - TextTiling: block length, sliding window shift, lexical score threshold
  - DP: block length, alpha in the normalization factor
- Phoneme n-gram sequences generated from word transcripts using the CMU dictionary
- Evaluation criterion: F1-measure

  F1 = 2 * recall * precision / (recall + precision)
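The F1-measure on detected boundaries can be computed as follows (a sketch assuming exact-match boundary scoring; whether the evaluation uses a tolerance window is not stated on the slide):

```python
def f1_measure(ref_bounds, hyp_bounds):
    """F1 of hypothesized story boundaries against reference boundaries."""
    ref, hyp = set(ref_bounds), set(hyp_bounds)
    tp = len(ref & hyp)               # correctly detected boundaries
    if tp == 0:
        return 0.0
    precision, recall = tp / len(hyp), tp / len(ref)
    return 2 * recall * precision / (recall + precision)
```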

Experimental results
[Bar chart: F1-measure for six approaches, each evaluated with word 1-gram and phoneme 1-, 2-, 3- and 4-gram units. Labeled values, in approach order: PLSA-DP-CE 0.6985, PLSA-DP-CS 0.6759, PLSA-TT-CE 0.6379, PLSA-TT-CS 0.6207, LSA-TT-CS 0.5439, Classical TT 0.5349.]
DP: dynamic programming; TT: TextTiling; CE: cross entropy; CS: cosine similarity

Conclusions
- We investigated the use of PLSA for BN story segmentation
- Phoneme subwords were adopted to address problems caused by LVCSR errors
- Cross entropy vs. cosine similarity for lexical cohesion measurement, and DP vs. TextTiling for story boundary identification, were compared
- Experimental results suggest:
  - PLSA can effectively boost story segmentation performance
  - Cross entropy shows advantages in describing distributional variation
  - DP provides better story boundary identification performance
  - The performance gain from phoneme n-grams shows their ability to handle erroneous LVCSR transcripts

Thanks for your attention!