L16: Speaker recognition

Outline
- Introduction
- Measurement of speaker characteristics
- Construction of speaker models
- Decision and performance
- Applications

[This lecture is based on Rosenberg et al., 2008, in Benesty et al. (Eds.)]
Introduction to Speech Processing, Ricardo Gutierrez-Osuna, CSE@TAMU

Introduction: speaker identification vs. verification

Speaker identification
- The goal is to match a voice sample from an unknown speaker to one of several labeled speaker models
- No identity is claimed by the user
- Open-set identification: it is possible that the unknown speaker is not in the set of speaker models; if no satisfactory match is found, a no-match decision is provided
- Closed-set identification: the unknown speaker is assumed to be one of the known speakers
- The speaker may be cooperative or uncooperative
- Performance degrades as the number of comparisons increases

Speaker verification
- The user makes a claim as to his/her identity, and the goal is to determine the authenticity of the claim
- In this case, the voice sample is compared only with the speaker model of the claimed identity
- Can be thought of as a special case of open-set identification (one vs. all)
- The speaker is generally assumed to be cooperative
- Because only one comparison is made, performance is independent of the size of the speaker population

[Figure: speaker identification vs. speaker verification, from http://www.ll.mit.edu/mission/communications/ist/publications/aaas00-dar-pres.pdf]

Components of a speaker verification system
[Figure from http://www.ll.mit.edu/mission/communications/ist/publications/aaas00-dar-pres.pdf]

Two distinct phases to any speaker verification system
[Figure from http://www.ll.mit.edu/mission/communications/ist/publications/aaas00-dar-pres.pdf]

Text-dependent vs. text-independent

Text-dependent recognition
- The recognition system knows the text spoken by the person, either fixed passwords or prompted phrases
- These systems assume that the speaker is cooperative
- Suited for security applications
- To prevent impostors from playing back recorded passwords from authorized speakers, randomly prompted phrases can be used

Text-independent recognition
- The recognition system does not know the text spoken by the person, which could be user-selected phrases or conversational speech
- Unsuited for security applications (e.g., an impostor could play back a recording from an authorized speaker)
- Suited for identification of uncooperative speakers
- A more flexible system, but also a more difficult problem

Measurement of speaker characteristics

Types of speaker characteristics

Low-level features
- Associated with the periphery of the brain's speech perception mechanism
- Segmental: formants are relatively hard to track reliably, so one generally uses short-term spectral measurements (e.g., LPC, filter-bank analysis)
- Supra-segmental: pitch periodicity is easy to extract, but requires a prior voiced/unvoiced detector
- Long-term averages of these measures may be used if one does not need to resolve detailed individual differences

High-level features
- Associated with more central locations in the perception mechanism
- Perception of words and their meaning
- Syntax and prosody
- Dialect and idiolect (the variety of a language unique to a person)
- These features are relatively harder to extract than low-level features

Low-level features

Short-time spectra, generally MFCCs
- Isn't this counterintuitive? Speech recognition should be speaker-independent, whereas speaker recognition should be speech-independent
- This would suggest that the optimal acoustic features should differ; however, the best speech representation turns out to also be a good speaker representation (!), perhaps because the optimal representation contains both speech and speaker information

Cepstral mean subtraction
- Subtracts the cepstral average over a sufficiently long speech recording
- Removes convolutional distortions introduced by slowly varying channels

Dynamic information
- Derivatives (Δ) and second derivatives (Δ²) of the above features are also useful, both for speech and for speaker recognition

Pitch and energy averages
- Robust pitch extraction is hard, and pitch has large intra-speaker variation
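As a rough illustration of this front end, here is a minimal sketch assuming the librosa library; the file path, sampling rate, and coefficient count are placeholders, not values from the lecture.

```python
# Minimal MFCC front end with cepstral mean subtraction and delta features.
# Assumes librosa; parameters are illustrative.
import numpy as np
import librosa

def front_end(wav_path, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=16000)                # load audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    mfcc -= mfcc.mean(axis=1, keepdims=True)                # cepstral mean subtraction
    d1 = librosa.feature.delta(mfcc)                        # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)               # second derivatives
    return np.vstack([mfcc, d1, d2]).T                      # (T, 3*n_mfcc) feature matrix
```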

Linguistic measurements
- Can only be used with long recordings (e.g., indexing broadcasts, passive surveillance), not with conventional text-dependent systems

Word usage
- Vocabulary choices, word frequencies, part-of-speech frequencies
- Spontaneous speech phenomena, such as fillers and hesitations
- Susceptible to errors introduced by LVCSR systems

Phone sequences and lattices
- Models of the phone sequences output by ASR using phonotactic grammars can be used to represent speaker characteristics
- However, the lexical constraints generally used to improve ASR may prevent extraction of phone sequences that are unique to a speaker

Other linguistic features
- Pronunciation modeling of carefully chosen words
- Pitch and energy contours, durations of phones and pauses

Construction of speaker models

Speaker recognition models can be divided into two classes

Non-parametric models
- These models make few structural assumptions about the data
- Effective when there is sufficient enrollment data to be matched to the test data
- Based on techniques such as template matching (DTW) and nearest-neighbor models

Parametric models
- Offer a parsimonious representation of structural constraints
- Can make effective use of enrollment data if the constraints are chosen properly
- Based on techniques such as vector quantization, Gaussian mixture models, hidden Markov models, and support vector machines (not discussed here)

Non-parametric models

Template matching
- The simplest form of speaker modeling; rarely used in real applications today
- Appropriate for fixed-password speaker verification systems
- The enrollment data consists of a small number of repetitions of the password
- The test data is compared against each of the enrollment utterances, and the identity claim is accepted if the distance is below a threshold
- Feature vectors for test and enrollment data are aligned with DTW

Nearest-neighbor modeling
- It can be shown that, given enrollment data X from a speaker, the local density (likelihood) for a test vector y is (see CSCE 666 lecture notes)

$$p_{nn}(y; X) = \frac{1}{V\left(d_{nn}(y, X)\right)}, \qquad d_{nn}(y, X) = \min_{x_j \in X} \lVert y - x_j \rVert$$

where $V(r) \sim r^D$ is the volume of a D-dimensional hypersphere of radius r

- Taking logs and removing constant terms, we can define a similarity measure between a test utterance Y and enrollment data X as

$$s_{nn}(Y; X) = -\sum_{y_j \in Y} \ln d_{nn}(y_j, X)$$

and the speaker with the greatest $s_{nn}(Y; X)$ is identified
- It has been shown that the following measure provides significantly better results than $s_{nn}(Y; X)$

$$s_{nn}'(Y; X) = \frac{1}{N_y} \sum_{y_j \in Y} \min_{x_i \in X} \lVert y_j - x_i \rVert^2 + \frac{1}{N_x} \sum_{x_j \in X} \min_{y_i \in Y} \lVert y_i - x_j \rVert^2 - \frac{1}{N_y} \sum_{y_j \in Y} \min_{y_i \in Y;\, i \neq j} \lVert y_i - y_j \rVert^2 - \frac{1}{N_x} \sum_{x_j \in X} \min_{x_i \in X;\, i \neq j} \lVert x_i - x_j \rVert^2$$
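To make the cross-nearest-neighbor measure concrete, here is a minimal numpy sketch; the function names are illustrative, and with the signs as reconstructed above, smaller values indicate a better match.

```python
# Numpy sketch of the cross-nearest-neighbor measure; names are illustrative.
# X: enrollment frames, shape (N_x, D); Y: test frames, shape (N_y, D).
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances, shape (len(A), len(B))."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

def s_nn_prime(Y, X):
    d_yx = sq_dists(Y, X)            # test-to-enrollment distances
    d_yy = sq_dists(Y, Y)            # test-to-test distances
    d_xx = sq_dists(X, X)            # enrollment-to-enrollment distances
    np.fill_diagonal(d_yy, np.inf)   # enforce i != j in the self terms
    np.fill_diagonal(d_xx, np.inf)
    return (d_yx.min(axis=1).mean() + d_yx.min(axis=0).mean()
            - d_yy.min(axis=1).mean() - d_xx.min(axis=1).mean())
```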

Parametric models

Vector quantization
- Generally based on k-means, which we presented in an earlier lecture
- Since k is unknown, an iterative technique based on the Linde-Buzo-Gray (LBG) algorithm is generally used
- LBG: start with k = 1; choose the cluster with the largest variance and partition it into two by adding a small perturbation to its mean (μ ± ε); repeat
- Once a VQ model is available for the target speaker, evaluate the sum-squared-error measure D to determine the authenticity of the claim

$$D = \sum_{j=1}^{J} \sum_{x_i \in C_j} (x_i - \mu_j)^T (x_i - \mu_j)$$

where $C_j$ is the set of test vectors assigned to the j-th cluster and $\mu_j$ is their sample mean
- VQ may be used for text-dependent and text-independent systems
- Temporal aspects may be included by clustering sequences of feature vectors
- While VQ is still useful, it has been superseded by more advanced models such as GMMs and HMMs
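A minimal sketch of LBG codebook training and the VQ distortion measure, assuming scikit-learn's KMeans for the refinement step; note this sketch splits every centroid at each iteration (a common LBG variant), whereas the lecture describes splitting only the highest-variance cluster.

```python
# LBG codebook growth by binary splitting, refined with k-means.
# Assumes scikit-learn; n_codewords is assumed to be a power of two.
import numpy as np
from sklearn.cluster import KMeans

def lbg(frames, n_codewords=64, eps=1e-3):
    """Grow a VQ codebook from 1 to n_codewords centroids by splitting."""
    codebook = frames.mean(axis=0, keepdims=True)               # start with k = 1
    while len(codebook) < n_codewords:
        codebook = np.vstack([codebook + eps, codebook - eps])  # mu +/- eps split
        km = KMeans(n_clusters=len(codebook), init=codebook, n_init=1)
        codebook = km.fit(frames).cluster_centers_              # k-means refinement
    return codebook

def vq_distortion(frames, codebook):
    """Sum-squared error D between test frames and their nearest codewords."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.min(axis=1).sum()
```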

Gaussian mixture models
- GMMs can be thought of as a generalization of k-means where each cluster is allowed to have its own covariance matrix
- As we saw in an earlier lecture, the model parameters (means, covariances, mixing coefficients) are learned with the EM algorithm
- Given a trained model λ, test utterance scores are obtained as the average log-likelihood

$$s(Y|\lambda) = \frac{1}{T} \sum_{t=1}^{T} \log p(y_t|\lambda)$$

- When used for speaker verification, the final decision is based on a likelihood ratio test of the form

$$\frac{p(Y|\lambda)}{p(Y|\lambda_{BG})}$$

where $\lambda_{BG}$ represents a background model trained on a large independent speech database
- As we will see, the target speaker model λ can also be obtained by adapting $\lambda_{BG}$, which tends to give more robust results
- GMMs are suitable for text-independent speaker recognition, but do not model the temporal aspects of speech
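A minimal sketch of GMM training and likelihood-ratio scoring, assuming scikit-learn; GaussianMixture.score() returns the mean per-sample log-likelihood, which matches s(Y|λ) above.

```python
# GMM likelihood-ratio scoring sketch; assumes scikit-learn.
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components=256):
    """Fit a diagonal-covariance GMM to feature frames via EM."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag').fit(frames)

def llr_score(test_frames, speaker_gmm, background_gmm):
    """Average per-frame log-likelihood ratio: s(Y|lam) - s(Y|lam_BG)."""
    return speaker_gmm.score(test_frames) - background_gmm.score(test_frames)
```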

Hidden Markov models

Text-dependent systems
- HMMs have been shown to be very effective for text-dependent systems
- HMMs may be trained at the phone, word, or sentence level, depending on the password vocabulary (e.g., digit sequences are commonly used)
- HMMs are generally trained using maximum likelihood (Baum-Welch)
- Discriminative training techniques may be used if examples from competing speakers are available (e.g., closed-set identification)

Text-independent systems
- Ergodic HMMs may be used; unlike the left-right HMMs generally used in ASR, ergodic HMMs allow all possible transitions between states
- In this way, emission probabilities tend to represent different spectral characteristics (associated with different phones), whereas transition probabilities allow some modeling of temporal information
- Experimental comparisons of GMMs and ergodic HMMs, however, show that the addition of transition probabilities in HMMs has little effect on performance
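A rough sketch of an ergodic HMM speaker model, assuming the hmmlearn library; hmmlearn estimates a full transition matrix by default, i.e., an ergodic rather than left-right topology.

```python
# Ergodic HMM speaker model sketch; assumes hmmlearn.
from hmmlearn.hmm import GaussianHMM

def train_ergodic_hmm(frames, n_states=8):
    """Baum-Welch (EM) training with a full (ergodic) transition matrix."""
    hmm = GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=20)
    return hmm.fit(frames)

def hmm_score(test_frames, hmm):
    """Per-frame average log-likelihood, comparable to the GMM score s(Y|lam)."""
    return hmm.score(test_frames) / len(test_frames)
```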

Adaptation
- In most speaker recognition scenarios, the speech data available for enrollment is too limited to train models from scratch
- In fixed-password speaker authentication systems, the enrollment data may be recorded in a single call
- As a result, enrollment and test conditions may be mismatched: different telephone handsets and networks (landline vs. cellular), different background noises
- In text-independent models, additional problems may result from mismatches in linguistic content
- For these reasons, adaptation techniques may be used to build models for specific target speakers
- When used in fixed-password systems, model adaptation can reduce error rates significantly

Adapting a hypothesized speaker model (for GMMs)*
[Figure from Reynolds & Campbell, 2008, in Benesty et al. (Eds.)]
*UBM: universal background model
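The figure outlines MAP adaptation of a UBM toward a target speaker; as a rough illustration, here is a sketch of standard relevance-MAP adaptation of the component means only (in the style of Reynolds' GMM-UBM work), assuming a fitted scikit-learn GaussianMixture as the UBM.

```python
# Relevance-MAP adaptation of UBM means; assumes scikit-learn.
import numpy as np

def map_adapt_means(ubm, enroll_frames, relevance=16.0):
    gamma = ubm.predict_proba(enroll_frames)       # (T, K) responsibilities
    n_k = gamma.sum(axis=0)                        # soft counts per component
    # First-order statistics: responsibility-weighted means of the data
    ex_k = gamma.T @ enroll_frames / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]     # data-dependent weights
    # Interpolate between the enrollment statistics and the UBM means
    return alpha * ex_k + (1 - alpha) * ubm.means_
```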

Decision and performance

Decision rules
- The previous models provide a score $s(Y|\lambda)$ that measures the match between a given test utterance Y and a speaker model λ
- Identification systems produce a set of scores, one for each target speaker; in this case, the decision is to choose the speaker S with maximum score

$$S = \arg\max_j \, s(Y|\lambda_j)$$

- Verification systems output only one score, that of the claimed speaker; here, a verification decision is obtained by comparing the score against a predetermined threshold θ: accept the claim if $s(Y|\lambda_i) \geq \theta$, and reject it otherwise
- Open-set identification relies on two steps: a closed-set identification to find the most likely speaker, and a verification step to test whether the match is good enough
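These decision rules are simple enough to state directly as code; a minimal sketch of the two-step open-set rule (function name is illustrative):

```python
# Open-set identification: closed-set argmax followed by verification.
import numpy as np

def open_set_identify(scores, theta):
    """scores: s(Y|lam_j) for each enrolled speaker; returns index or None."""
    best = int(np.argmax(scores))                   # closed-set identification
    return best if scores[best] >= theta else None  # verification of the match
```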

Threshold setting and score normalization
- When the score is obtained in a probabilistic framework, one may employ Bayesian decision theory to determine the threshold θ
- Given false acceptance and false rejection costs $c_{fa}$ and $c_{fr}$ and the prior probability of an impostor $p_{imp}$, the optimal threshold is

$$\theta = \frac{c_{fa}}{c_{fr}} \cdot \frac{p_{imp}}{1 - p_{imp}}$$

- In practice, however, the score $s(Y|\lambda)$ does not behave as theory predicts, due to modeling errors
- To address this issue, various forms of normalization have been proposed over the years, such as Z-norm, H-norm, T-norm, etc. [Reynolds & Campbell, 2008, in Benesty et al. (Eds.)]
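A short sketch of the Bayes threshold and of Z-norm, which standardizes a raw score using impostor-score statistics estimated against the target model (a common formulation; the exact recipe varies by system):

```python
# Bayes threshold and Z-norm score normalization; numpy only.
import numpy as np

def bayes_threshold(c_fa, c_fr, p_imp):
    """Threshold on the likelihood ratio p(Y|lam) / p(Y|lam_BG)."""
    return (c_fa / c_fr) * (p_imp / (1.0 - p_imp))

def z_norm(raw_score, impostor_scores):
    """Standardize a score by the impostor-score mean and standard deviation."""
    mu, sigma = np.mean(impostor_scores), np.std(impostor_scores)
    return (raw_score - mu) / sigma
```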

Errors and DET curves
- Speaker identification (SID) systems are evaluated based on the probability of misclassification
- Verification systems, in contrast, are evaluated based on two types of errors: false acceptances and false rejections
- The probabilities of these two errors ($p_{fa}$, $p_{fr}$) vary in opposite directions as the decision threshold θ is varied
- The tradeoff between the two types of errors is often displayed as a curve known in decision theory as the receiver operating characteristic (ROC)

Detection error tradeoff (DET)
- In speaker verification, the two error probabilities are converted to normal deviates (μ = 0, σ = 1) and plotted on a log scale; the resulting curve is known as a DET curve
- The DET highlights differences between systems more clearly than the ROC
- If the two error distributions are Gaussian with σ = 1, the curve is linear with slope -1, which helps rank systems based on how close their DET is to the ideal
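The probit (normal-deviate) transform that produces a DET plot is essentially a one-liner; a sketch assuming scipy and matplotlib:

```python
# DET plotting sketch: error probabilities -> standard-normal deviates.
# Assumes scipy and matplotlib.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_plot(p_fa, p_fr):
    plt.plot(norm.ppf(p_fa), norm.ppf(p_fr))   # probit transform of both axes
    plt.xlabel('False acceptance probability (normal deviate)')
    plt.ylabel('False rejection probability (normal deviate)')
    plt.title('DET curve')
    plt.show()
```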

Generating ROC curves
[Figures: example ROC curve from http://en.wikipedia.org/wiki/receiver_operating_characteristic; example ROC and DET curves from http://genome.cshlp.org/content/18/2/206/f4.expansion and http://www.limsi.fr/rs2003gb/chm2003gb/tlp2003/TLP9/modelechmgb.html]

Selecting a detection threshold
- The DET shows how the system behaves over a range of thresholds, but does not indicate which threshold should be used
- Two criteria are commonly used to select an operating point

Equal error rate (EER)
- The threshold at which the two errors are equal: $p_{fa} = p_{fr}$

Detection cost function (DCF)
- The threshold that minimizes the expected risk, based on the prior probability of impostors and the relative costs of the two types of errors

$$C = p_{imp} \, c_{fa} \, p_{fa} + (1 - p_{imp}) \, c_{fr} \, p_{fr}$$
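Both operating points can be computed by sweeping the threshold over the observed scores; a minimal numpy sketch, assuming labeled target and impostor trial scores are available:

```python
# EER and minimum DCF from labeled trial scores; numpy only.
import numpy as np

def error_rates(target_scores, impostor_scores, thresholds):
    p_fr = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_fa, p_fr

def eer(target_scores, impostor_scores):
    thr = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_fa, p_fr = error_rates(target_scores, impostor_scores, thr)
    i = np.argmin(np.abs(p_fa - p_fr))     # threshold where the errors cross
    return (p_fa[i] + p_fr[i]) / 2

def min_dcf(target_scores, impostor_scores, p_imp, c_fa, c_fr):
    thr = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_fa, p_fr = error_rates(target_scores, impostor_scores, thr)
    return (p_imp * c_fa * p_fa + (1 - p_imp) * c_fr * p_fr).min()
```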

Applications

Transaction authentication
- Toll fraud prevention, telephone credit card purchases, telephone brokerage (e.g., stock trading)

Access control
- Physical facilities, computers, and data networks

Monitoring
- Remote time and attendance logging, home parole verification, prison telephone usage

Information retrieval
- Customer information for call centers, audio indexing (speech skimming devices), speaker diarization

Forensics
- Voice sample matching

From http://www.ll.mit.edu/mission/communications/ist/publications/aaas00-dar-pres.pdf