Increasing Speaker Recognition Algorithm Agility and Effectiveness for Unseen Conditions
Fred Goodman, MITRE Corporation

Talk Outline
- Issues when using Speech as a Biometric
- Evaluating Speaker Recognition Systems
- Speaker Recognition Techniques
- Expanding Speaker Recognition Applications
- Dealing with Unseen Conditions
- Conclusions

Speech as a Biometric
- Speech is performed, while many other biometrics (fingerprint and iris) are not.
- Performance is affected by internal factors ("intrinsic") as well as external ones ("extrinsic").
- Modern speaker recognition is concerned with text-independent matching.
- Testing assumes the talker is not cooperative; i.e. the talker is unaware of the system.
- Most testing uses a verification paradigm (an identity is claimed; the system says yea or nay). This generalizes to predict closed-set or even open-set testing results.
- Note: human SID performance is generally worse than machine performance! (Exception: close friends, loved ones.)

Sources of Speaker Variability
[Figure-only slide.]

Generic SID Biometric Block Diagram
[Diagram: Enrollment (text-independent, unaware): feature extraction, then model training, yielding one model per speaker (Bob's model, Sally's model). Recognition: feature extraction, then scoring & decision against the stored models (output: "Sally!").]
N.B.: must permit "none-of-the-above".

What comes out of a SID verifier?
- A number representing the likelihood that the current speaker is the same as the model speaker.
- The figure shows actual score histograms (NIST 2008 eval.): target PDF (true-trial scores) with μ = 4.5, σ = 2.01; impostor PDF (non-target-trial scores) with μ = 0, σ = 1.00.
- Moving the decision threshold trades errors: more FA with fewer misses, or fewer FA with more misses. (MD: missed detection; FA: false accept.)

Characterizing Performance: The DET Curve
- The Detection Error Tradeoff curve shows performance at all threshold settings simultaneously, for individual SID systems (or subsystems); moving along the curve trades fewer FA / more misses against more FA / fewer misses.
- Actual experimental decision points ("calibration") are marked on the curve; the desired FA rate is specified (e.g. 1%). The EER (equal error rate) is the point where the two error rates match.
- Notice: if P(tgt) = .001 and EER = 1%, then for 1000 trials we get ~1 true hit and ~10 FAs.
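As an illustration of the tradeoff, here is a minimal sketch (Python, using synthetic Gaussian score distributions with the μ/σ values from the previous slide as stand-ins for real trial scores) that sweeps a decision threshold and reads off miss / false-accept pairs, i.e. points on a DET curve:

```python
# Sweep a decision threshold over synthetic target/impostor scores and
# compute (miss rate, false-accept rate) pairs: the points of a DET curve.
import numpy as np

rng = np.random.default_rng(0)
target_scores = rng.normal(4.5, 2.01, 10_000)     # true-trial scores
nontarget_scores = rng.normal(0.0, 1.00, 10_000)  # impostor-trial scores

thresholds = np.linspace(-5, 12, 500)
miss = np.array([(target_scores < t).mean() for t in thresholds])      # MD
fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])    # FA

# Equal Error Rate: the threshold where miss and false-accept rates cross.
eer_idx = int(np.argmin(np.abs(miss - fa)))
print(f"EER ~ {miss[eer_idx]:.3f} at threshold {thresholds[eer_idx]:.2f}")
```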

- Issues when using Speech as a Biometric
- Evaluating Speaker Recognition Systems
- Speaker Recognition Techniques
- Expanding Speaker Recognition Applications
- Dealing with Unseen Conditions
- Conclusions

Sources of Speaker Identity (Features)
- Low-level (10–30 msec): anatomical structure of the vocal tract (e.g. nasal passages); acoustical characteristics of the glottal source.
- Medium-level (100s of msec): prosodics (rhythm, speed, intonation, volume); idiosyncrasies (e.g. lip smacks, "uh-huh").
- High-level (100–1000 msec): word choices; grammatical usages; accent/dialect/language.

Speech Spectrograms
[Figure: spectrograms of "Greasy wash water all year"; analysis window ~100 samples (wideband) vs. ~400 samples (narrowband).]

Spectro-Temporal Receptive Fields (STRFs)
[Figure: STRF analysis of "Greasy wash water all year".] STRF features are extremely robust to wideband noise.

Prosodic Features in SID
- Pitch, energy & duration short-time values are converted into features.
- Those features are turned into even more sophisticated features using N-grams, rank normalization, etc.; ultimately a classifier is applied (e.g. a Support Vector Machine).
- Good performance requires several minutes of speech.
- Fuses very well with other methods.

MLLR: Deviation from the Average Speaker
- The MLLR (Maximum Likelihood Linear Regression) technique, originally used in speech recognition, has proven valuable for SID.
- Transformations are of the form μ_new = A·μ + b, where A is a matrix (39×39) and b is a vector (39×1).
- Up to 8 phone classes are used.
- MLLR relies on speech recognition to find phone boundaries.
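A minimal sketch of the mean transform above; the A and b here are random placeholders standing in for the per-phone-class estimates that would be obtained by maximum likelihood on adaptation data:

```python
# Apply an MLLR-style mean transform mu_new = A @ mu + b to a set of
# 39-dim Gaussian means. A (39x39) and b (39,) are placeholders; in
# practice one pair is estimated per phone class (up to 8 classes).
import numpy as np

rng = np.random.default_rng(0)
dim = 39
A = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))  # near-identity matrix
b = 0.1 * rng.normal(size=dim)                        # bias vector

means = rng.normal(size=(256, dim))   # e.g. 256 Gaussian means to adapt
adapted = means @ A.T + b             # mu_new = A*mu + b for every mean
```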

Gaussian Mixture Modeling (GMM)
- With a small number of parameters, complex shapes can be modeled (e.g. three 1-dim. Gaussians).
- 2-D example*: training uses the EM (iterative) algorithm to build a 3-element model; starting from random points, the final model (8 iterations later) has 3 μ's, 3 σ's, and 3 weights.
[* Actually 40-dim features, 1–2k mixtures]
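A minimal sketch of EM training of a GMM on synthetic 2-D data, mirroring the slide's 3-component example (scikit-learn's GaussianMixture supplies the EM loop; the data and iteration count are illustrative):

```python
# Fit a 3-component diagonal-covariance GMM with EM, from random starts,
# on synthetic 2-D data (real SID: ~40-dim features, 1-2k mixtures).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 0.5, size=(300, 2))
                  for m in ([0, 0], [3, 1], [1, 4])])

gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      init_params="random", max_iter=8,
                      random_state=0).fit(data)
print(gmm.means_, gmm.covariances_, gmm.weights_)  # 3 mus, 3 sigmas, 3 weights
```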

Supervectors & Dimension Reduction
- Concatenate GMM mixture means to make a supervector: up to a (2k × 40) = 80k-length vector.
- Reduce noise dimensions by applying Joint Factor Analysis or i-vector/PLDA (strips unwanted dimensions).
[Diagram, courtesy of IBM: UBM + subject model / unknown data → GMM generation → supervector creation → JFA or i-vector/PLDA → match process → score.]
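A minimal sketch of the supervector-creation step, assuming the adapted mixture means are already available:

```python
# Concatenate the mixture means of a (MAP-adapted) GMM into one long
# supervector: 2048 mixtures x 40-dim features -> ~80k-length vector.
import numpy as np

n_mixtures, feat_dim = 2048, 40
adapted_means = np.random.default_rng(0).normal(size=(n_mixtures, feat_dim))
supervector = adapted_means.reshape(-1)   # shape: (81920,)
```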

Expanding Speaker Recognition Applications
- Landline telephone: 1970
- Consistent "calibration": 1996
- Cellular telephone: 2001
- Language (multiple/cross): 2004
- Interview (cross) microphone: 2008
- Cross-channel (tel. vs. interview): 2008
- Aging: 2010
- Vocal effort/Lombard: 2010
- Additive noise: 2011
- Room reverberation: 2011
- Cross-room ("bright" vs. "dead"): 2011
- Minimal/no training data: 2011
- "Confidence": 2011

- Issues when using Speech as a Biometric
- Evaluating Speaker Recognition Systems
- Speaker Recognition Techniques
- Expanding Speaker Recognition Applications
- Dealing with Unseen Conditions
- Conclusions

Defining the Unseen Data Problem
- Traditional pattern recognition techniques require substantial training data from the same source. Without such training data, getting a valid log-likelihood ratio is problematic.
- But real-world applications may not cooperate with our needs: an infinite number of room sizes, microphone positions, wall materials, noise sources, etc., unlike the telephone, where standards limit variation.
- Algorithms historically never self-modified based on conditions. Even now, they do very little.
- What can be done to limit the damage when a new source of data appears? Solving this problem means getting close to clean performance.

Solving the Unseen Data Problem
- Use simulation to create extrinsic conditions (noise, reverb); feed simulated data to make the backend (JFA, i-vectors) better.
- Collect intrinsic conditions: whisper to shout (effort), fast to slow (rate); read vs. oration vs. telephony vs. interview (style); illness, drunkenness, sleepiness, aging. Understand the effects on speaker models.
- Automatically detect conditions (e.g. SNR, speech rate), as in the sketch below, and modify algorithms according to the differences between training and test conditions.
- For a brand-new condition: use unsupervised adaptation to improve performance over time; learn to detect data too bad to process effectively (no-decision); use supervised adaptation with a few known true cuts.
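A minimal sketch of one such condition detector: a crude energy-quartile SNR estimate. A real system would use a voice activity detector; the frame length and quartile split here are illustrative assumptions:

```python
# Estimate SNR by splitting frame energies into "noise" (quietest quarter
# of frames) and "speech" (loudest quarter). Crude but condition-aware.
import numpy as np

def estimate_snr_db(signal, frame_len=400):
    n = len(signal) // frame_len * frame_len
    frames = signal[:n].reshape(-1, frame_len)
    energies = np.sort((frames ** 2).mean(axis=1))
    noise = energies[: len(energies) // 4].mean()    # quietest 25% of frames
    speech = energies[-(len(energies) // 4):].mean() # loudest 25% of frames
    return 10 * np.log10(speech / max(noise, 1e-12))
```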

Example Condition-Driven Algorithm Mods
- Modify front-end feature extraction based on conditions, e.g. because a particular feature set is robust against reverb.
- Decide to weight certain speech sounds (phonemes) differently because noise is distorting them (fricatives, mixed-excitation sounds such as "zh").
- Change fusion weights based on SNR or reverb (RT) because, e.g., prosodic energy features degrade quickly in that condition (see the sketch below).
- Modify the decision threshold to reflect large differences in either extrinsic or intrinsic conditions (e.g. vocal effort) between the training and recognition samples.
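A minimal sketch combining the last two ideas: fade out a prosodic subsystem and raise the decision threshold as estimated SNR drops. The weight and threshold schedules are illustrative assumptions, not tuned values:

```python
# Condition-driven fusion: down-weight the prosodic score below 20 dB SNR
# and add a safety margin to the threshold under heavy noise.
def fuse_scores(cepstral_score, prosodic_score, snr_db, base_threshold=2.0):
    # Prosodic energy features degrade quickly in noise: linear fade
    # from full weight at 20 dB down to zero at 5 dB (assumed schedule).
    w_prosodic = min(1.0, max(0.0, (snr_db - 5.0) / 15.0))
    fused = cepstral_score + w_prosodic * prosodic_score
    # Be more conservative under mismatched (noisy) conditions.
    threshold = base_threshold + (0.5 if snr_db < 10.0 else 0.0)
    return fused, fused >= threshold
```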

Conclusions
- Speaker recognition is still a serious research issue 40 years after its birth.
- The expansion of application conditions since 2006 has been dramatic.
- But we are coming to a crossroads: collecting hundreds of speakers is expensive, and exposing them to many extrinsic/intrinsic conditions is time-consuming & difficult.
- Encourage algorithm developers to use simulated extrinsic data to become more robust.
- Must continue to collect intrinsic variations until better models of speech behavior can be built.
- Encourage algorithm developers to estimate extrinsics/intrinsics & modify algorithms accordingly.

Thanks for inviting me and listening!

Extra Slides

Mel-Warped Cepstrum Features
- mel = 2595 · log10((f/700) + 1)
- A triangular, mel-weighted filter bank is applied. The mel scale, based on human perception, is ~linear below 1000 Hz and logarithmic above 1000 Hz.
- 12 < N < 20 coefficients are kept, plus velocity and (perhaps) acceleration terms.
- Pipeline: window → DFT → mel-warp → log → DCT → take 1st N → time diff.
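A minimal sketch of the mel warp and the per-frame pipeline above; mel_fbank is an assumed precomputed triangular mel filter-bank matrix of shape (n_filters, n_fft/2 + 1):

```python
# Mel warp from the slide, plus one MFCC frame:
# window -> DFT -> mel filter bank -> log -> DCT -> keep first N.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(f / 700.0 + 1.0)   # slide's mel formula

def mfcc_frame(frame, mel_fbank, n_ceps=13):
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    log_mel = np.log(mel_fbank @ spectrum + 1e-10)  # triangular mel filters
    return dct(log_mel, norm="ortho")[:n_ceps]      # first N coefficients
```

Velocity (delta) and acceleration (delta-delta) terms would then be appended as frame-to-frame time differences of these coefficients.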

Frequency Domain Linear Prediction
- An alternative feature set; shows robustness to reverb.
- Pipeline: DCT → sub-band windowing (96 bands) → FDLP → gain norm. → mel-scale short-term integration (32 ms) → cepstral transform.

I-Vector Generation/PLDA
- M = m + Tw (m is the UBM supervector, M is the incoming supervector).
- Estimate the total variability matrix T, given training GMM supervectors (using the EM algorithm).
- The i-vectors (w) are the speaker/session factors of the T matrix (analogous to the factors in JFA). This results in a ~400-element vector w.
- PLDA breaks it down further, with the i-vectors as input: w = m + Vy + Ux + ε, where V = speaker subspace (y are the factors), U = channel subspace (x are the factors), m = mean vector over all training data, and ε = residual noise (covariance matrix Σ).
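A minimal sketch of the M = m + Tw relation: given T, recover w for an observed supervector by least squares. Real extraction uses Baum-Welch statistics and a posterior mean rather than plain least squares, and real sizes are ~80k × ~400; the small random T and m here are placeholders:

```python
# Toy i-vector recovery: generate M = m + T w, then estimate w from (M, T, m).
import numpy as np

rng = np.random.default_rng(0)
sv_dim, iv_dim = 4000, 50          # stand-ins for ~80k supervector, ~400 i-vector
T = rng.normal(size=(sv_dim, iv_dim)) / np.sqrt(sv_dim)  # total variability
m = rng.normal(size=sv_dim)                               # UBM supervector

w_true = rng.normal(size=iv_dim)
M = m + T @ w_true                                        # incoming supervector
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)         # estimated i-vector
print(np.allclose(w_hat, w_true, atol=1e-6))
```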

Shoebox Room Reverberation Simulation
- Allows the user to specify: materials for the 4 walls, ceiling & floor; dimensions (x, y, z); positions of the sound source & receiver; HRTF for the receiver.
- Results in a room impulse response (RIR), characterized by the RT60 metric, which can then be convolved with clean speech.
- Key limitation: can't put humans in the room, and bodies soak up sound; as a result the RIR is overly "bright". Much more sophisticated room simulations exist ($$$).
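A minimal sketch of the final step, convolving clean speech with an RIR. The RIR here is synthetic exponentially decaying noise matched to a target RT60, a crude stand-in for an image-method shoebox simulator; the clean signal is a noise placeholder:

```python
# Reverberate "clean speech" with a synthetic RIR whose decay reaches
# -60 dB at t = RT60 (amplitude factor exp(-6.908 t / RT60)).
import numpy as np
from scipy.signal import fftconvolve

def synthetic_rir(rt60=0.5, fs=16000, length_s=1.0):
    t = np.arange(int(length_s * fs)) / fs
    decay = np.exp(-6.908 * t / rt60)      # ln(1000) = 6.908 -> -60 dB at RT60
    return decay * np.random.default_rng(0).normal(size=t.size)

clean = np.random.default_rng(1).normal(size=16000)   # stand-in for speech
reverberant = fftconvolve(clean, synthetic_rir(), mode="full")
```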

Collecting Interview Room Data (NIST/LDC)
Each room (Room #1, Room #2) has ~16 microphones. In addition, telephone calls are made by the same speakers. Typical experiments:

  Train           Test
  Room 1, mic N   Room 1, mic N
  Room 1, mic N   Room 1, mic K
  Room 1, mic N   Room 2, mic N
  Room 1, mic N   Room 2, mic K
  Room 1, mic N   Tel.
  Room 2, mic N   Tel.

Vocal Effort Collections
[Figure: collection setup for the Lombard effect and vocal-effort/oration effect. Noise (white, pink, babble) at a fixed or variable dB level is fed through a mixer against the clear voice to produce the noisy (Lombard) output; oration-style vocal effort is collected at talker distances of 2.5, 5, and 10 meters.]

Score-Level Fusion
- Fusion weights (A) and offset (b) are developed using a small development data set.
[Diagram: subsystem scores (#1, #2, ..., #7, #8) are each multiplied by their fusion weight (A), summed, and offset by b to produce the fused score; a fusion DET curve shows the combined result.]
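A minimal sketch of learning the fusion weights (A) and offset (b) on a development set via logistic regression, then fusing with a weighted sum; the dev scores and labels here are synthetic placeholders:

```python
# Learn linear fusion parameters on a small dev set, then apply
# fused = A . subsystem_scores + b (the weighted-sum diagram above).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_subsystems = 500, 8
labels = rng.integers(0, 2, n_trials)                  # 1 = target trial
dev_scores = rng.normal(size=(n_trials, n_subsystems)) + 2.0 * labels[:, None]

fusion = LogisticRegression().fit(dev_scores, labels)
A, b = fusion.coef_[0], fusion.intercept_[0]           # weights and offset
fused = dev_scores @ A + b                             # fused scores
```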