Lecture 16 Speaker Recognition


Information College, Shandong University @ Weihai

Definition
- Speaker recognition is the method of recognizing a person from his/her voice; it depends on speaker-specific characteristics.
- The task: determine whether a specified speaker is speaking in a given segment of speech.
- This is the task closest to biometric identification using speech.

Voice is a popular Biometric
- Voice is a natural signal to produce and does not require a specialized input device; it can be used on site or remotely (telephone banking, voice mail browsing, ...).
- Compare with other security measures: keys and cards; passwords and PINs; fingerprints, voiceprints and iris prints.

Similar Tasks
- Speaker verification: extracts information from the stream of speech and verifies that a person is who he/she claims to be. A one-to-one comparison.
- Speaker identification: extracts information from the stream of speech and assigns an identity to the voice of an unknown person. A one-to-many comparison.
- Speech recognition: extracts information from the stream of speech and figures out what a person is saying.
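The difference between the one-to-one and one-to-many comparisons can be sketched in a few lines of Python. The speaker names and scores below are made-up placeholders; in a real system the scores would come from speaker models:

```python
# Hypothetical per-speaker scores, e.g. average log-likelihoods of a test
# utterance under each enrolled speaker's model (names are illustrative).
scores = {"alice": -41.2, "bob": -44.7, "carol": -39.8}

def verify(scores, claimed_id, threshold):
    """Speaker verification: one-to-one comparison of the claimed
    speaker's score against a decision threshold."""
    return scores[claimed_id] >= threshold

def identify(scores):
    """Speaker identification: one-to-many comparison, pick the
    best-matching enrolled speaker."""
    return max(scores, key=scores.get)

print(verify(scores, "bob", -42.0))   # False: bob's score is below threshold
print(identify(scores))               # carol: highest score
```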

Tasks of Today
- History of speech recognition
- Recognition scheme
- Speaker features
- Methods

Recognition Milestones
- 1920: first electromechanical toy recognizer, 'Radio Rex' (Elmwood Co.).
- Late 1940s: US Defense automatic translation machine project. The project failed, but sparked research at MIT, CMU and commercial institutions.
- 1950s: Bell Labs developed the first system capable of recognizing digits spoken over the telephone.
- 1962: 'Shoebox' from IBM.
- Early 1970s: the HARPY system from Carnegie Mellon University could recognize sentences with a limited grammar. HARPY required as much computing power as 50 contemporary computers, and it recognized only discrete speech, in which words are separated by longer pauses than usual.

Recognition Milestones
- 1980s: significant progress in speech recognition technology; word error rates continued to drop by a factor of 2 every two years.
  - IBM, 1985: real-time recognition of isolated words from a set of 20,000, after 20 minutes of training, with an error rate below 5%.
  - AT&T: call-routing system using speaker-independent word-spotting technology for a few key phrases.
  - Several very large vocabulary dictation systems appeared; they required speakers to pause between words and worked better for specific domains.
- 1990s:
  - 1996: VoiceBroker deployed by the stock brokerage Charles Schwab.
  - 1996: ViaVoice by IBM, first distributed with the now almost forgotten operating system OS/2.
  - 1997: Dragon introduced NaturallySpeaking, the first continuous speech recognition package.
- Today: airline reservations with British Airways, train reservations for Amtrak, weather forecasts and telephone directory information.

Terminology of Speech Recognition
- Speaker-dependent recognition: the system is designed to work with just one or a small number of individual speakers.
- Speaker-independent recognition: the system is designed to work with all the speakers from a given linguistic community.

Terminology of Speech Recognition
- Large vocabulary recognition: examples are domain-specific recognition systems, such as those used by medical consultants for dictating notes on their ward rounds. It is very difficult to make accurate large vocabulary, speaker-independent systems.
- Small vocabulary recognition: typically recognition of a few keywords, such as digits or a set of commands. Example: voice-operated telephone number dialing.

Terminology of Speech Recognition
- Isolated word recognition: systems that can only recognize individual words preceded and followed by a relatively long period of silence.
- Connected word recognition: systems that can recognize a limited sequence of words spoken in succession (e.g. 'ninety-eight thirty-five four thousand').
- Continuous speech recognition: systems that recognize speech as it occurs, in real time. Such systems usually work with a large vocabulary, but with moderate accuracy.

Speech Recognition Scheme
Three steps are performed in any recognition system:
1. Feature extraction
2. Measurement of similarity
3. Decision making

Recognition Systems
[Block diagram] Feature extraction derives a compact representation of the speech waveform: a test pattern of coefficients c_0(t), c_1(t), ..., c_M(t). Pattern matching compares the test pattern against stored reference patterns; it is constrained in many ways, e.g. by the rules of language (grammar), spelling and possible pronunciations. The decision rule finds the word with the greatest similarity to the input speech and accepts or rejects the hypothesis.
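A minimal sketch of the pattern-matching and decision-rule stages, assuming fixed-length feature vectors and Euclidean distance. Real systems align variable-length patterns, e.g. with dynamic time warping, and apply grammar constraints:

```python
import numpy as np

def best_match(test, references, reject_threshold):
    """Pick the reference pattern closest to the test pattern; reject
    (return None) if even the best match is too far away."""
    names = list(references)
    dists = [np.linalg.norm(test - references[n]) for n in names]
    i = int(np.argmin(dists))
    if dists[i] <= reject_threshold:
        return names[i], dists[i]   # accept: best-matching word
    return None, dists[i]           # reject: no reference is close enough
```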

Speech Model & Features

Speaker Recognition Features
- The features are low-level speech signal representation parameters that convey complete information about the signal.
- High-level characteristics like accent, intonation, etc. are encoded within the representation in a very complex and cryptic manner.
- The features contain speaker-dependent components.
- Uniqueness and permanence of the features are problematic.
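As a rough illustration of what "low-level representation parameters" means, here is a minimal frame-based feature extractor in NumPy. This is a sketch only: real systems typically use MFCCs, and the frame length, hop and band count below are arbitrary choices assuming 16 kHz speech:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160, n_bins=20):
    """Minimal low-level feature extraction: pre-emphasis, Hamming-windowed
    framing, then coarse log magnitude-spectrum bands per frame."""
    # Pre-emphasis boosts high frequencies: x[n] = s[n] - 0.97 * s[n-1]
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len] * win
        spec = np.abs(np.fft.rfft(frame))
        # Average the magnitude spectrum into coarse bands, then take logs
        bands = np.array_split(spec, n_bins)
        feats.append(np.log([b.mean() + 1e-10 for b in bands]))
    return np.array(feats)  # shape: (n_frames, n_bins)
```

Each row of the output is one feature vector; a speaker model is then trained on these frame-level vectors.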

Questions
- Do features that uniquely characterize people exist? The uniqueness and permanence of most features used in biometric systems have not been proven.
- Is the human ability to identify a person a limit that no automatic system can overcome? Automated systems might be able to identify people better than the average person can. In practice, however, expert systems do not perform the task better than the experts who built them.

Questions
- How important are the algorithms, versus the knowledge of the features and their relationships, for achieving high identification accuracy? Knowledge of the features and their relationships is fundamental for accurate biometric systems. The algorithms play an important, yet secondary, role in the process, as no algorithm can compensate for the lack of adequate features.

Speaker Models
Speaker models represent the speaker-specific information conveyed in the feature vectors. Several different modeling techniques have been applied:
- Template matching
- Nearest neighbor
- Neural networks
- Hidden Markov models
State-of-the-art speaker recognition algorithms are based on statistical models of short-term acoustic measurements on the input speech signal.

Speaker Models
- Long-term averages of acoustic features (spectrum, pitch, ...): the first and earliest approach. The idea is to average out the factors influencing intra-speaker variation, leaving only the speaker-dependent component. Drawback: requires a long speech utterance (> 20 s).
- Training a speaker-dependent (SD) model for each speaker:
  - Explicit segmentation: HMM
  - Implicit segmentation: VQ, GMM

Speaker Models
- HMM. Advantage: text-independent. Drawback: a significant increase in computational complexity.
- VQ. Advantage: unsupervised clustering. Drawback: text-dependent.
- GMM. Advantages: text-independent, probabilistic framework (robust), computationally efficient, easy to implement.
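To make the GMM option concrete, here is a minimal diagonal-covariance GMM trained with EM in plain NumPy. This is an illustrative sketch, not the lecture's implementation; real systems use far more mixture components and library implementations:

```python
import numpy as np

def fit_diag_gmm(X, k, iters=50, seed=0):
    """Fit a k-component diagonal-covariance GMM to frames X (n, d) via EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]          # initial means
    var = np.tile(X.var(axis=0), (k, 1)) + 1e-6      # initial variances
    w = np.full(k, 1.0 / k)                          # mixture weights p_i
    for _ in range(iters):
        # E-step: per-frame component responsibilities (log domain)
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * (((X[:, None, :] - mu) ** 2) / var).sum(axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def avg_loglik(X, w, mu, var):
    """Average per-frame log-likelihood of X under the GMM."""
    logp = (np.log(w)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
            - 0.5 * (((X[:, None, :] - mu) ** 2) / var).sum(axis=2))
    m = logp.max(axis=1, keepdims=True)
    return float((m.squeeze(1) + np.log(np.exp(logp - m).sum(axis=1))).mean())
```

In verification, the score is typically a log-likelihood ratio between the claimed speaker's GMM and a universal background model (UBM).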

Speaker Models
- Discriminative neural network: models the decision function that best discriminates between speakers. Advantage: fewer parameters and higher performance compared to the VQ model. Drawback: the network must be retrained whenever a new speaker is added to the system.

Progression of Methods
[Timeline] Around 1985: VQ and NN; around 1995: HMM; later: GMM alongside HMM, VQ and NN.
[Chart: State of the Art in Speech Recognition] Word error rate (%) versus year (1993-2003) for tasks of increasing difficulty: speaker-independent dictation and broadcast news (easy) down to telephone conversations (hard).

VQ Example
[Figure: two-dimensional acoustic space with codebooks for Speaker A and Speaker B] A test sample is attributed to the speaker whose codebook yields the smaller distortion; here the sample has less distortion for A than for B.
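The VQ idea above can be sketched with a plain k-means codebook. The data and sizes below are illustrative; real systems train codebooks on MFCC frames, e.g. with the LBG algorithm:

```python
import numpy as np

def train_codebook(X, k, iters=20, seed=0):
    """Train a k-entry VQ codebook on frames X (n, d) with plain k-means."""
    rng = np.random.default_rng(seed)
    cb = X[rng.choice(len(X), k, replace=False)]     # initial codewords
    for _ in range(iters):
        d = ((X[:, None, :] - cb) ** 2).sum(axis=2)  # distances to codewords
        lab = d.argmin(axis=1)                       # nearest codeword
        for j in range(k):
            if np.any(lab == j):
                cb[j] = X[lab == j].mean(axis=0)     # update centroid
    return cb

def distortion(X, cb):
    """Average squared distance from each frame to its nearest codeword."""
    return float(((X[:, None, :] - cb) ** 2).sum(axis=2).min(axis=1).mean())
```

Identification then attributes a test utterance to the speaker whose codebook gives the lowest distortion.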

HMM Example
Two models of 'tomato':
[t] -> [ow] (0.2) or [ah] (0.8) -> [m] -> [ey] (0.5) or [aa] (0.5) -> [t] -> [ow]
A word in the vocabulary is represented by its phonemes; each phoneme is viewed as an HMM, and a word model is constructed by combining the HMMs of the phonemes.
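Once a word model is built from phoneme HMMs, it is scored against an observation sequence with the forward algorithm. A minimal log-domain sketch (the uniform probabilities in the test are toy values, not the tomato model):

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis,
                                        keepdims=True)), axis=axis)

def forward_loglik(log_pi, log_A, log_B):
    """log P(observation sequence | HMM) via the forward algorithm.
    log_pi: (S,) log initial state probs; log_A: (S, S) log transition
    probs A[i, j] = P(j | i); log_B: (T, S) per-frame log emission probs."""
    alpha = log_pi + log_B[0]                  # initialize with frame 0
    for t in range(1, log_B.shape[0]):
        # alpha_j(t) = b_j(o_t) * sum_i alpha_i(t-1) * a_ij, in log domain
        alpha = log_B[t] + logsumexp(alpha[:, None] + log_A, axis=0)
    return float(logsumexp(alpha, axis=0))
```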

Gaussian Mixture Model (GMM): Speech Recognition
[Figure] In speech recognition, GMMs model the emission distributions at the HMM state level.

Gaussian Mixture Model (GMM): Speaker Recognition
[Figure] Speaker k is modeled by a mixture of Gaussians with component means µ_i, covariances Σ_i and weights p_i, i.e. p(x | speaker k) = Σ_i p_i N(x; µ_i, Σ_i).

Limits
- The best-performing algorithms for text-independent speaker verification use Gaussian mixture models (a GMM is a single-state HMM).
  - The linguistic structure of the speech signal is not taken into account: all sounds are represented by a single model.
  - The sequential information is ignored.
- There is a recent trend towards using high-level features with large vocabulary continuous speech recognition systems.
  - Good results for a small set of languages.
  - Needs huge amounts of annotated speech databases (an enormous amount of time and human effort).
  - Language- and task-dependent.