Automatic Speaker Recognition

Automatic Speaker Recognition. Qian Yang, 4 June 2013

Outline: Overview; Traditional Approaches; Speaker Diarization

State-of-the-art speaker recognition systems use a GMM-based framework or an SVM-based framework.

Outline: Overview; Traditional Approaches (Features and Models; Evaluation and Performance); Speaker Diarization

Traditional Approaches: Features for Speaker Recognition. The primary features used in speaker recognition systems are cepstral features (e.g. MFCC, PLP). Voice activity detection (VAD) removes non-speech frames. Some form of blind deconvolution is used to remove stationary channel effects (CMS), and feature warping is applied to compensate for channel variability. Time-differential cepstra (delta, or delta-delta features) are usually appended to the static cepstral features. Typically 24-40 dimensional feature vectors are used.
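
A minimal sketch of such a cepstral front end, assuming librosa is available; the energy-based VAD threshold and frame settings here are illustrative stand-ins for the trained speech/non-speech detection and feature warping used in real systems.

```python
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    """Cepstral front end: MFCCs + CMS + deltas, with a crude energy-based VAD."""
    y, sr = librosa.load(wav_path, sr=16000)

    # 13 static MFCCs per 25 ms frame with a 10 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

    # Crude VAD: keep frames whose first cepstral coefficient (a rough energy proxy)
    # is above a data-dependent threshold
    energy = mfcc[0]
    speech = energy > (energy.mean() - 0.5 * energy.std())
    mfcc = mfcc[:, speech]

    # Cepstral mean subtraction (CMS) to remove stationary channel effects
    mfcc -= mfcc.mean(axis=1, keepdims=True)

    # Append delta and delta-delta coefficients -> 39-dimensional vectors
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T   # shape: (frames, 39)
```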

Support Vector Machines: the latest discriminative approach for speaker verification.

Traditional Approaches: Support Vector Machines. An SVM is a supervised, binary discriminative classifier (target speaker vs. impostors). The data are transformed into a higher-dimensional space using kernel functions, and a hyperplane is constructed in that space to maximize the margin; the data points lying on the margin boundaries are the support vectors. Different kernels can be designed for the speaker recognition task. At classification time, the class decision is based on whether the value f(x) is above or below a threshold. (The slide illustrates the kernel, the support vectors, and the ideal 0/1 output.)
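
A minimal target-vs-impostor SVM in scikit-learn, as a hedged illustration of this setup; it assumes each utterance has already been mapped to a fixed-length vector (e.g. an averaged cepstral vector or a GMM supervector), and the decision threshold is a free parameter that would be tuned on development data.

```python
import numpy as np
from sklearn.svm import SVC

def train_speaker_svm(X_target, X_impostor, kernel="linear"):
    """Train a binary SVM: target-speaker utterances (label 1) vs. impostor utterances (label 0)."""
    X = np.vstack([X_target, X_impostor])
    y = np.concatenate([np.ones(len(X_target)), np.zeros(len(X_impostor))])
    return SVC(kernel=kernel).fit(X, y)

def verify(svm, x_test, threshold=0.0):
    """Accept the claimed identity if f(x) is above the threshold."""
    score = svm.decision_function(x_test.reshape(1, -1))[0]
    return score > threshold, score
```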

Traditional Approaches: Support Vector Machines. How can speaker utterances be represented in the higher-dimensional space? Example: a linear kernel for the speaker verification task. Train GMMs on the two utterances using MAP adaptation, derive a distance metric between the two utterances using a modified KL divergence, and define a linear kernel based on that distance.
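
A small numpy sketch of this idea, following the commonly used GMM-supervector approximation of the KL-divergence kernel: each utterance is represented by the MAP-adapted component means of its GMM, scaled by the UBM weights and (diagonal) covariances, so that a plain inner product acts as the linear kernel. The variable names and the diagonal-covariance assumption are mine, not from the slides.

```python
import numpy as np

def kl_linear_kernel(means_a, means_b, ubm_weights, ubm_vars):
    """
    Linear kernel between two utterances, each represented by the MAP-adapted
    means of its GMM (one mean vector per mixture component).

    means_a, means_b : (n_components, dim) adapted means of utterance a / b
    ubm_weights      : (n_components,)     UBM mixture weights
    ubm_vars         : (n_components, dim) UBM diagonal covariances
    """
    # Scale each adapted mean by sqrt(w_i) * Sigma_i^{-1/2} ...
    scale = np.sqrt(ubm_weights)[:, None] / np.sqrt(ubm_vars)
    sv_a = (scale * means_a).ravel()   # GMM "supervector" for utterance a
    sv_b = (scale * means_b).ravel()
    # ... so that the inner product approximates the KL-divergence-based distance
    return float(sv_a @ sv_b)
```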

Outline: Overview; Traditional Approaches; Speaker Diarization (Overview; Cross-show speaker diarization; Speaker Tracking)

Unlike speaker verification/identification, there is no enrollment data for training and no prior knowledge about the number of speakers.

Speaker Diarization: Generic Architecture. A 3-step diarization process: 1. Voice Activity Detection (VAD) detects different acoustic events (music, speech, noise, ...); 2. Speaker Change Detection detects speaker turn changes missed by VAD; 3. Speaker Clustering groups segments from the same speaker together.
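
A minimal sketch of this 3-step pipeline, with deliberately crude stand-ins: the speech mask plays the role of VAD, fixed-length cutting stands in for real speaker change detection, and the number of speakers is given explicitly even though a real diarization system has to estimate it (e.g. with a BIC-based stopping criterion).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def simple_diarization(features, speech_mask, seg_len=200, n_speakers=2):
    """
    Minimal 3-step diarization on a (frames, dim) feature matrix:
      1. VAD      : keep only frames flagged as speech (speech_mask)
      2. Change   : approximated here by cutting speech into fixed-length segments
      3. Cluster  : agglomerative clustering of per-segment mean vectors
    Returns one speaker label per segment.
    """
    speech = features[speech_mask]
    segments = [speech[i:i + seg_len] for i in range(0, len(speech), seg_len)]
    seg_means = np.array([seg.mean(axis=0) for seg in segments])
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(seg_means)
    return labels
```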

Cross-show Speaker Diarization. Given a set of shows from the same source, it provides global IDs for speakers who appear across shows. Why cross-show? In digital libraries or multimedia archives, some speakers may appear multiple times, such as journalists or politicians. Linkage to conventional speaker diarization: global IDs vs. local IDs.
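
One way to link local and global IDs, roughly corresponding to diarizing each show on its own and then clustering the resulting per-show speakers across shows. In this sketch the names and the distance threshold are mine; per-show speakers are assumed to be summarized as fixed-length vectors (e.g. mean feature vectors or adapted GMM supervectors).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def link_local_to_global(per_show_speakers, distance_threshold=1.0):
    """
    per_show_speakers: dict mapping (show_id, local_speaker_id) -> speaker vector.
    Returns a dict mapping (show_id, local_speaker_id) -> global speaker ID,
    obtained by clustering the vectors of all local speakers across shows.
    """
    keys = list(per_show_speakers)
    X = np.array([per_show_speakers[k] for k in keys])
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    global_ids = clustering.fit_predict(X)
    return {key: int(gid) for key, gid in zip(keys, global_ids)}
```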

Cross-show Speaker Diarization: Generic Architectures. (The slide shows three architecture diagrams: Scheme 1, Scheme 2, Scheme 3.)

Cross-show Speaker Diarization: some results on English podcast news data, with cross-show DER as the metric.
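
For reference, the diarization error rate sums the durations of missed speech, false-alarm speech, and speaker-confusion time over the total scored speech time; for cross-show DER the same computation is done with the global speaker IDs over all shows pooled together. A trivial helper, assuming the error durations have already been measured:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total scored speech time (all in seconds)."""
    return (missed + false_alarm + confusion) / total_speech
```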

Speaker Tracking aims to determine if and when a target speaker speaks in a multi-speaker recording; the target speaker has been enrolled into the system in a previous phase. Why speaker tracking? For example, tracking anchor speakers or politicians in broadcast news. Linkage to speaker diarization: this is a supervised approach (speaker models are required), and diarization results are used as inputs.

KIT's speaker tracking system consists of two components: speaker segmentation + open-set speaker identification.

System Architecture.
Speaker segmentation: splits the speech into segments such that each segment contains only one speaker.
Open-set speaker identification (SID):
- Training phase: train 4 UBMs for telephone/studio and male/female speech (gender- and channel-dependent) on ESTER2 data; speaker models are obtained by MAP adaptation of the corresponding UBMs.
- Detection phase: System 1 (baseline) uses gender/bandwidth classification (GMM-based classifier); a test segment is scored against the speaker and UBM models that match its gender/bandwidth conditions. System 2 (FSC) applies frame-based score competition (FSC) [JSW07]. 100 impostors (50 female + 50 male) from ESTER1 data are used for T-Norm.
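
A minimal sketch of the open-set identification step with T-Norm, assuming sklearn-style GaussianMixture objects (whose .score() returns the average per-frame log-likelihood) for the speaker models, the matching UBM, and the impostor cohort; gender/bandwidth pre-selection and frame-based score competition are omitted, and all names and the threshold are illustrative.

```python
import numpy as np

def llr_score(segment_feats, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio of a segment: speaker model vs. UBM."""
    return speaker_gmm.score(segment_feats) - ubm.score(segment_feats)

def open_set_sid(segment_feats, speaker_models, ubm, impostor_models, threshold=0.5):
    """
    Score a segment against all enrolled speakers, normalize each score with
    T-Norm (mean/std of the impostor-cohort scores on the same segment), and
    return the best speaker if its normalized score exceeds the threshold;
    otherwise label the segment as an unknown speaker (open-set decision).
    """
    imp_scores = np.array([llr_score(segment_feats, m, ubm) for m in impostor_models])
    mu, sigma = imp_scores.mean(), imp_scores.std() + 1e-9

    best_spk, best_score = None, -np.inf
    for spk, gmm in speaker_models.items():
        tnorm = (llr_score(segment_feats, gmm, ubm) - mu) / sigma   # T-Norm
        if tnorm > best_score:
            best_spk, best_score = spk, tnorm
    return (best_spk, best_score) if best_score > threshold else (None, best_score)
```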

Some results on French broadcast news (ESTER2 data, French BN): 114 target speakers, 105 h of training data, 6 hours of dev data. Metric: half total error rate (HTER). The second figure in each column is the relative improvement over the baseline.

System           HTER-time            HTER-speaker
Baseline         25.307%    --        31.943%    --
Baseline+TNorm   25.314%   -0.03%     31.953%   -0.03%
FSC              24.098%   +4.78%     31.319%   +1.95%
FSC+TNorm        27.830%   -9.97%     33.260%   -4.12%
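
HTER is commonly defined as the average of the false-alarm and miss rates; the table reports a time-based and a per-speaker variant. A one-line helper for clarity:

```python
def half_total_error_rate(false_alarm_rate, miss_rate):
    """HTER = (false-alarm rate + miss rate) / 2."""
    return 0.5 * (false_alarm_rate + miss_rate)
```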

Thanks for your attention! Questions?