The ICSI RT-09 Speaker Diarization System. David Sun

Papers
The ICSI RT-09 Speaker Diarization System, Gerald Friedland, Adam Janin, David Imseng, Xavier Anguera, Luke Gottlieb, Marijn Huijbregts, Mary Tai Knox, and Oriol Vinyals, Transactions on Audio, Speech, and Language Processing (TASLP), July 2011.
Speaker Diarization: A Review of Recent Research, Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, Transactions on Audio, Speech, and Language Processing (TASLP), July 2011.
Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content, Marijn Huijbregts and Franciska de Jong, Speech Communication, Feb 2011.

Topics
Speaker diarization overview
ICSI RT-09
Signal pre-processing
Speech activity detection
Speaker segmentation and clustering
Joint audio-video diarization

Speaker Diarization
Question: who spoke when?
Unsupervised segmentation of audio into speaker-homogeneous regions.
Challenges: the number of speakers is unknown, and the amount of speech per speaker is unknown.

Applications
Annotating broadcast news, TV, radio.
Meeting content indexing, linking, summarization, navigation (multiple audio, video, and textual streams).
Behavior analysis: find dominant speakers, engagement, emotions.
Speech-to-text: speaker model adaptation.

Major Projects
European Union (EU) Multimodal Meeting Manager (M4) project
The Swiss Interactive Multimodal Information Management (IM2) project
The EU Augmented Multi-party Interaction (AMI) project / EU Augmented Multi-party Interaction with Distant Access (AMIDA) project
EU Computers in the Human Interaction Loop (CHIL) project
USA?

Evaluation
US National Institute of Standards and Technology (NIST) official speaker diarization (Rich Transcription) evaluation.
Standard protocol and databases:
Broadcast news (BN): recorded in studio (high S/N ratio), structured speech.
Meeting data: recorded using far-field mics (high variability, room artifacts), more spontaneous and overlapping.
Lecture meetings, coffee breaks.

General Architecture
Noise reduction
Multichannel processing*
Feature extraction
Voice activity detection

ICSI Speaker Diarization System (Single Distant Microphone)

ICSI Speaker Diarization System (Multiple Distant Microphone)

Signal Preprocessing
Dynamic range compression
Convert to linear 16-bit PCM (truncate high-order bits)
Downsample to 16 kHz (little impact on performance)
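
A minimal sketch of this preprocessing step, assuming the soundfile and scipy libraries; the function name, file paths, and the absence of an explicit dynamic-range-compression stage are illustrative choices, not the ICSI tooling.

```python
# Convert an input recording to mono, linear 16-bit PCM at 16 kHz.
from fractions import Fraction

import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

def preprocess(in_path: str, out_path: str, target_sr: int = 16000) -> None:
    audio, sr = sf.read(in_path, dtype="float32")    # read any supported source format
    if audio.ndim > 1:                               # mix down to a single channel
        audio = audio.mean(axis=1)
    if sr != target_sr:                              # rational-ratio downsampling
        ratio = Fraction(target_sr, sr)
        audio = resample_poly(audio, ratio.numerator, ratio.denominator)
    audio = np.clip(audio, -1.0, 1.0)                # avoid wrap-around on conversion
    sf.write(out_path, audio, target_sr, subtype="PCM_16")  # linear 16-bit PCM
```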

Multichannel Processing: Beamforming
S/N enhancement technique: combine recordings from multiple microphones into a single enhanced audio source.

Beamforming
Microphones are spatially distributed; each captures independent random noise.
Adding multiple channels together: the desired signal is enhanced, while the noise cancels out or is suppressed.
Delay-(filter)-and-sum: the output is a weighted sum of delayed inputs.
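
A minimal delay-and-sum sketch (not BeamformIt itself): it estimates each channel's delay against a reference channel via cross-correlation, aligns the channels, and averages them with equal weights. The TDOA method, equal weights, and the `max_delay` cap are simplifying assumptions.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, ref: int = 0, max_delay: int = 2000) -> np.ndarray:
    """channels: array of shape (n_mics, n_samples)."""
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # estimate the lag that best aligns channel m with the reference channel
        corr = np.correlate(channels[ref], channels[m], mode="full")
        lag = int(np.clip(np.argmax(corr) - (n_samples - 1), -max_delay, max_delay))
        out += np.roll(channels[m], lag)             # align (wraps at edges), then sum
    return out / n_mics                              # equal weight per channel
```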

Robust Beamformer (BeamformIt)

Feature Extraction
MFCCs computed with HTK: 19 features, 10 ms step size, 30 ms analysis window.

Speech Activity Detection (SAD)
Task: detect the fragments in an audio recording that contain speech.
Simultaneous classification and segmentation:
Classification task: given a fragment of audio, distinguish speech from non-speech. Non-speech can be silence or audible non-speech.
Segmentation task: determine the start time and end time of each fragment.

Why?
More practical to process small speech segments instead of an entire recording.
Performance of the ASR system can be enhanced: ASR will always produce hypotheses on input audio, even for audible non-speech => more insertion errors.
Speech segments can be clustered per speaker for automatic ASR tuning.
Any non-speech presented to a speaker clustering system will contaminate the speaker models and decrease the clustering quality.

Silence-Based Detection
Assume the audio contains only speech and silence.
Algorithm: calculate the energy of short (often overlapping) windows; the local minima of this energy series are considered silence.
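
A rough sketch of this energy-based idea: frame the signal, compute log energy per overlapping window, and mark the lowest-energy frames as silence. The 10% quantile threshold and the window/step sizes are illustrative assumptions, not values from the slides.

```python
import numpy as np

def silence_frames(audio: np.ndarray, sr: int, win: float = 0.03, step: float = 0.01,
                   quantile: float = 0.10) -> np.ndarray:
    win_n, step_n = int(win * sr), int(step * sr)
    n_frames = 1 + max(0, (len(audio) - win_n) // step_n)
    energy = np.array([
        np.log(np.sum(audio[i * step_n : i * step_n + win_n] ** 2) + 1e-10)
        for i in range(n_frames)
    ])
    threshold = np.quantile(energy, quantile)        # lowest-energy frames ~ local minima
    return energy <= threshold                       # True where the frame looks like silence
```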

Silence-Based Detection
Broadcast News (BN) recordings: the major part of the recording consists of speech, with small pauses between utterances or topics.
Meetings: more spontaneous speech.

Model-Based Detection
Train a GMM for each class.
HMMs with one state per class tend to produce short segments, even with low transition probabilities.
To enforce minimum-duration constraints on segments, HMMs are built with a string of states per class that all share the same GMM; the minimum duration of each segment is controlled by the number of states in the string (see the sketch below).
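
An illustrative sketch of such a minimum-duration topology: each class gets a left-to-right string of `min_frames` states that all share one emission model, so the decoder must spend at least `min_frames` frames in the class before it can leave. The self-loop and exit probabilities are placeholders, not trained values.

```python
import numpy as np

def min_duration_transitions(min_frames: int, self_loop: float = 0.9,
                             exit_prob: float = 0.1) -> np.ndarray:
    n = min_frames
    A = np.zeros((n, n + 1))                 # extra column = leaving the class
    for i in range(n - 1):
        A[i, i + 1] = 1.0                    # forced forward moves: no early exit
    A[n - 1, n - 1] = self_loop              # last state may repeat...
    A[n - 1, n] = exit_prob                  # ...or hand over to another class
    return A
```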

Model-Based Detection

Model-Based Detection
Performing a Viterbi decoding run with the HMM yields the segmentation and classification of an audio file.
Advantage: very easy to add classes.
Disadvantages: the GMMs need to be trained on some training set, and there may be an acoustic mismatch between the training and test data sets.

Model-Based Detection
How to do it without a training set? Can one use the data itself during classification?

SAD Step 1: Bootstrapping Speech/Silence
Perform an initial segmentation using a standard model-based algorithm.
Two parallel HMMs with initial models M_nonspeech and M_speech; diagonal covariances; a minimum of 30 states for silence and 75 states for speech.
Feature extraction: 12 MFCCs, zero-crossing rate, and their 1st and 2nd derivatives; a 39-D feature vector every 10 ms over a 32 ms window (see the sketch below).
The data is segmented into sets of speech / non-speech regions.
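
A hedged sketch of this 39-dimensional SAD feature vector (12 MFCCs plus zero-crossing rate, with first and second derivatives) computed over a 32 ms window every 10 ms. The use of librosa is an assumption; the original system's exact front end may differ.

```python
import librosa
import numpy as np

def sad_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    n_fft = int(0.032 * sr)                                   # 32 ms analysis window
    hop = int(0.010 * sr)                                     # 10 ms step
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=12, n_fft=n_fft, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(audio, frame_length=n_fft, hop_length=hop)
    n = min(mfcc.shape[1], zcr.shape[1])                      # align frame counts
    base = np.vstack([mfcc[:, :n], zcr[:, :n]])               # 13 x n_frames
    d1 = librosa.feature.delta(base, order=1)                 # first derivatives
    d2 = librosa.feature.delta(base, order=2)                 # second derivatives
    return np.vstack([base, d1, d2]).T                        # n_frames x 39
```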

SAD Step 2: Training Models for Non-Speech
The non-speech (silence + sound) models are trained from the data itself.
Evaluate a confidence score on the segments classified as non-speech; normalize all segments longer than 1 s to 1 s intervals.
Two models: M_silence and M_sound.

SAD

SAD Step 2: Training Models for Non-Speech
Use M_silence, M_sound, and M_speech to reclassify the data; the data assigned to the sound and silence models is merged.
M_silence and M_sound are trained on the input data, but M_speech is trained on outside data, so M_silence and M_sound will likely give a higher likelihood to all of the data than M_speech; samples pulled away from the speech model are dropped.
As we become more confident about the models, evaluate the confidence score on the remaining data with a lower threshold, and use additional Gaussians to train the GMMs.
Iterate the above process three times; no data from the speech model has been moved to the sound model.

SAD Step 3: Training All Three Models
M_silence and M_sound are now well trained, so all non-speech segments are likely to be correctly classified.
M_speech can be trained with all the remaining data.
Retrain M_silence, M_sound, and M_speech together, increasing the number of Gaussians at each step until a threshold is reached.

SAD

SAD Step 4: Training Speech and Silence Models
The algorithm is not well suited for data that contains only speech and silence, with no non-speech sound: M_sound will be trained on misclassified speech, and after a couple of iterations M_sound and M_speech become competing models.
Use the Bayesian Information Criterion (BIC) to check for model similarity: if S(i, j) > 0, replace the two with a single speech model.

SAD

Speaker Segmentation and Clustering
Agglomerative hierarchical clustering: start with a large number of clusters, then iterate cluster merging, model re-training, and re-alignment.

Initialization
We want to estimate k, the number of clusters, and g, the number of Gaussians per GMM.
Step 1: Pre-clustering
Extract long-term acoustic features with good speaker discrimination: 100 pitch values and 80 formant values per second, with a Hamming window of 1000 ms.
Speech/non-speech segments shorter than 2000 ms are left untouched; segments longer than 2000 ms are split into 1000 ms segments.
Trade-off between accurate estimation of the features and a large number of feature vectors.

Initialization, Step 1: Pre-clustering
Feature vectors are clustered with diagonal covariances.
Over-estimate the number of initial clusters; merging will only reduce the number of clusters.

Initialization
Adaptive seconds per Gaussian: the number of seconds of data available per Gaussian for training.
k should be chosen in relation to the number of different speakers; g is related to the total amount of speech.
Use linear regression to estimate g.

Core Algorithm
Model: an HMM captures the temporal structure of the acoustic observations, with GMMs as emission probabilities; states represent different speakers.
Agglomerative hierarchical clustering: an iterative algorithm that compares clusters via a metric and merges the ones that are similar.

Core Algorithm, Step 1: Model Retraining and Re-segmentation
Given the speech, the goal is to generate speaker models and segment the data without prior information.
Iterative procedure (like EM), as sketched below:
Training based on the current segmentation: each frame is assigned to a single state; using all the segments belonging to state k, update its GMM with standard EM.
Recompute the segmentation based on the updated models, using the Viterbi algorithm.
The HMM must remain in the same state for at least 2.5 seconds (a minimum speech duration of 250 frames), to ensure the clusters are not modeling small units such as phones; each speaker takes the floor for at least that amount of time.
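
A high-level sketch of this retrain/re-segment loop: fit one GMM per cluster on its current frames, then reassign the data in 2.5 s blocks as a crude stand-in for Viterbi decoding with a 2.5 s minimum duration. The use of sklearn GMMs, the block-wise reassignment, and the component count are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def retrain_and_resegment(features: np.ndarray, labels: np.ndarray, n_components: int = 5,
                          block: int = 250, n_iter: int = 3) -> np.ndarray:
    """features: (n_frames, dim); labels: (n_frames,) initial cluster ids."""
    for _ in range(n_iter):
        models = {}
        for k in np.unique(labels):
            frames_k = features[labels == k]
            if len(frames_k) < n_components:                  # skip clusters with too little data
                continue
            models[k] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(frames_k)
        if not models:
            break
        for start in range(0, len(features), block):          # ~2.5 s (250 frame) blocks
            seg = features[start:start + block]
            scores = {k: m.score(seg) for k, m in models.items()}
            labels[start:start + block] = max(scores, key=scores.get)
    return labels
```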

Core Algorithm, Step 2: Model Merging
Each cluster will correspond to one speaker, but a speaker may have many clusters.
We need a metric to determine whether two clusters should be merged.
Model selection problem: given two clusters, are two separate models better than a joint model?
Measure the change in BIC score; decision rule: if S(i, j) > 0 then merge, otherwise do not merge (see the sketch below).
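
A sketch of a ΔBIC-style merge score S(i, j): compare the log-likelihood of a joint GMM trained on both clusters against the two separate GMMs. Here the joint model uses as many Gaussians as the two cluster models combined, so the parameter-count penalty cancels; this matches the spirit of the criterion on the slide, but the exact ICSI formulation may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def merge_score(x_i: np.ndarray, x_j: np.ndarray, g_i: int = 5, g_j: int = 5) -> float:
    def loglik(x: np.ndarray, g: int) -> float:
        gmm = GaussianMixture(n_components=g, covariance_type="diag").fit(x)
        return gmm.score(x) * len(x)                        # total log-likelihood of x
    joint = loglik(np.vstack([x_i, x_j]), g_i + g_j)        # one model for both clusters
    return joint - loglik(x_i, g_i) - loglik(x_j, g_j)      # S(i, j) > 0 => merge
```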

Core Algorithm, Step 2: Model Merging
At each iteration, merge the pair with the largest S(i, j).
How to merge: the merged GMM is the sum of the two cluster GMMs; each mixture component is initialized with the same mean and variance as in the original models, and the mixture weights are re-scaled to sum to one.
Step 4, stopping criterion: no more merging is required once all ΔBIC scores are negative; the final segmentation is based on the current cluster models. A sketch of the outer merging loop follows.
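
A sketch of that outer agglomerative loop: repeatedly score every cluster pair (with a scorer such as the merge_score sketch above), merge the best-scoring pair while its score is positive, and stop once every pair scores negative. Keeping cluster data as pooled frame arrays (rather than summing GMMs) is a simplification.

```python
import itertools
import numpy as np

def agglomerate(clusters: list[np.ndarray], score) -> list[np.ndarray]:
    """clusters: list of (n_frames_k, dim) arrays; score: pairwise S(i, j) function."""
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = list(itertools.combinations(range(len(clusters)), 2))
        scores = [score(clusters[i], clusters[j]) for i, j in pairs]
        best = int(np.argmax(scores))
        if scores[best] <= 0:                                # stop when all scores are negative
            break
        i, j = pairs[best]
        merged = np.vstack([clusters[i], clusters[j]])       # pool the frames of the best pair
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```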

Audiovisual Diarization
One distant mic and a close-up camera.
Audio: MFCCs (same as before); prosodic: 10 features.
Video: motion-vector magnitudes over estimated skin blocks.
Alpha = 0.75, beta = 0.1.

Results