Speaker Discriminative Weighting Method for VQ-Based Speaker Identification

Tomi Kinnunen and Pasi Fränti
University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU, FINLAND
{tkinnu,franti}@cs.joensuu.fi

Abstract: We consider the matching function in a vector quantization based speaker identification system. The model of a speaker is a codebook generated from the set of feature vectors from the speaker's voice sample. The matching is performed by evaluating the similarity of the unknown speaker and the models in the database. In this paper, we propose to use a weighted matching method that takes into account the correlations between the known models in the database. Larger weights are assigned to vectors that have high discriminating power between the speakers and vice versa. Experiments show that the new method provides significantly higher identification accuracy and that it can detect the correct speaker from shorter speech samples more reliably than the unweighted matching method.

1. Introduction

Various phonetic studies have shown that different parts of the speech signal have unequal discrimination properties between speakers. That is, the inter-speaker variation of certain phonemes is clearly different from that of other phonemes. Therefore, it would be useful to take this knowledge into account when designing speaker recognition systems.

There are several alternative approaches to utilize the above phenomenon. One approach is to use a front-end pre-classifier that would automatically recognize the acoustic units and give a higher significance to units that have better discriminating power. Another approach is to use a weighting method in the front-end processing. This is usually realized by a method called cepstral liftering, which has been applied both in speaker [3, 9] and speech recognition [1]. However, all front-end weighting strategies depend on the parametrization (vectorization) of the speech and, therefore, do not provide a general solution to the speaker identification problem.

In this paper, we propose a new weighted matching method to be used in vector quantization (VQ) based speaker recognition. The matching takes into account the correlations between the known models and assigns larger weights to code vectors that have high discriminating power. The method does not require any a priori knowledge about the nature of the feature vectors, or any phonetic knowledge about the discrimination powers of the different phonemes. Instead, the method adapts to the statistical properties of the feature vectors in the given database.

2. Vector Quantization in Speaker Recognition

In a VQ-based recognition system [4, 5, 6, 8], a speaker is modeled as a set of feature vectors generated from his/her voice sample. The speaker models are constructed by clustering the feature vectors into K separate clusters. Each cluster is then represented by a code vector, which is the centroid (average vector) of the cluster. The resulting set of code vectors is called a codebook, and it is stored in the speaker database. In the codebook, each vector represents a single acoustic unit typical for the particular speaker. Thus, the distribution of the feature vectors is represented by a set of sample vectors that is smaller than the full set of feature vectors but has a similar distribution. The codebook size should be set reasonably high, since previous results indicate that the matching performance improves with the size of the codebook [5, 7, 8]. For the clustering we use the randomized local search (RLS) algorithm as described in [2].

The matching of an unknown speaker is then performed by measuring the similarity/dissimilarity between the feature vectors of the unknown speaker and the models (codebooks) of the known speakers in the database. Denote the sequence of feature vectors extracted from the unknown speaker as $X = \{x_1, \ldots, x_T\}$. The goal is to find the best matching codebook $C_{best}$ from the database of N codebooks $C = \{C_1, \ldots, C_N\}$. The matching is usually evaluated by a distortion measure, or dissimilarity measure, that calculates the average distance of the mapping $d: X \times C \to R$ [5, 8]. The best matching codebook can then be defined as the codebook that minimizes the dissimilarity measure.

Instead of the previous approaches, we use a similarity measure. In this way, we can define the weighted matching method more intuitively. Thus, the best matching codebook is now defined as the codebook that maximizes the similarity measure of the mapping $s: X \times C \to R$, i.e.:

$$C_{best} = \arg\max_{1 \le i \le N} \{ s(X, C_i) \}. \tag{2.1}$$

Here the similarity measure is defined as the average of the inverse distance values:

$$s(X, C_i) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{d(x_t, c_{\min}^{i})}, \tag{2.2}$$

where $c_{\min}^{i}$ denotes the nearest code vector to $x_t$ in the codebook $C_i$ and $d: R^P \times R^P \to R$ is a given distance function in the feature space, whose selection depends on the properties of the feature vectors. If the distance function d satisfies $0 < d < \infty$, then s is well-defined and $0 < s < \infty$. In the rest of the paper, we use Euclidean distance for simplicity. Note that in practice, we limit the distance values to the range $1 < d < \infty$ and, thus, the effective values of the similarity measure are $0 < s < 1$.
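The matching rule of formulas (2.1) and (2.2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and clamping the distance from below at `d_floor = 1.0` is only one assumed way of keeping the distances in the range used in practice, since the text does not say how the limit is enforced.

```python
import math


def euclidean(x, c):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def similarity(X, codebook, d_floor=1.0):
    """Formula (2.2): average inverse distance to the nearest code vector.

    d_floor is an assumption: distances below it are clamped so that
    each term 1/d stays at or below 1.
    """
    total = 0.0
    for x in X:
        d_min = min(euclidean(x, c) for c in codebook)
        total += 1.0 / max(d_min, d_floor)
    return total / len(X)


def identify(X, codebooks):
    """Formula (2.1): index of the codebook with the highest similarity."""
    scores = [similarity(X, C) for C in codebooks]
    return max(range(len(codebooks)), key=scores.__getitem__)


# Toy example: two speakers with two code vectors each.
codebooks = [
    [(0.0, 0.0), (10.0, 0.0)],    # speaker 0
    [(0.0, 10.0), (10.0, 10.0)],  # speaker 1
]
X = [(0.0, 9.0), (9.0, 10.0)]     # test vectors lying near speaker 1
best = identify(X, codebooks)
```

In this toy case both test vectors lie at distance 1 from a code vector of speaker 1, so `identify` selects index 1.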

3. Speaker Discriminative Matching

Consider the example shown in Fig. 1, in which the code vectors of three different speakers are marked by rectangles, circles and triangles. There is also a set of vectors from an unknown speaker marked by stars. The region at the top rightmost corner cannot distinguish the speakers from each other since it contains code vectors from all speakers. The region at the top leftmost corner is somewhat better in this sense because samples there indicate that the unknown speaker is not triangle. The rest of the code vectors, on the other hand, have much higher discrimination power because they are isolated from the other code vectors.

Let us consider the unknown speaker star, whose sample vectors are concentrated mainly around three clusters. One cluster is at the top rightmost corner, and it cannot tell which speaker the sample vectors originate from. The second cluster at the top leftmost corner can rule out the speaker triangle, but only the third cluster makes the difference. The cluster at the right middle points only to the speaker rectangle and, therefore, we can conclude that the sample vectors of the unknown speaker originate from the speaker rectangle.

The situation is not so evident if we use the unweighted similarity score of formula (2.2). It gives equal weight to all sample vectors despite the fact that they do not have the same significance in the matching. Instead, the similarity value should depend on two separate factors: the distance to the nearest code vector, and the discrimination power of the code vector. Outliers and noise vectors that do not match well to any code vector should have a small impact, but vectors that match to code vectors of many speakers should also have a smaller impact on the matching score.

Fig. 1: Illustration of code vectors having different discriminating power.

3.1 Weighted similarity measure

Our approach is to assign weights to the code vectors according to their discrimination power. In general, the weighting scheme can be formulated by modifying formula (2.2) as follows:

$$s_w(X, C_i) = \frac{1}{T} \sum_{t=1}^{T} w(c_{\min}^{i}) \, \frac{1}{d(x_t, c_{\min}^{i})}, \tag{3.1}$$

where w is the weighting function. When the local similarity score, $1/d(x_t, c_{\min}^{i})$, is multiplied by the weight associated with the nearest code vector, $c_{\min}^{i}$, the product can be thought of as a local operator that moves the decision surface towards more significant code vectors.

3.2 Computing the weights

Consider a database of speaker codebooks $C_1, \ldots, C_N$. The codebooks are post-processed to assign weights to the code vectors, and the result of the process is a set of weighted codebooks $(C_i, W_i)$, $i = 1, \ldots, N$, where $W_i = \{w(c_{i1}), \ldots, w(c_{iK})\}$ are the weights assigned to the ith codebook. In this way, the weighting approach does not increase the computational load of the matching process, as it can be done in the training phase when creating the speaker database. The weights are computed using the following algorithm:

[The pseudocode listing of the weight computation is not recoverable from this transcription.]

4. Experimental Results

For testing purposes, we collected a database of 25 speakers (14 males + 11 females) using a sampling rate of 8.0 kHz with 16 bits/sample. The average duration of the training samples was 66.5 seconds per speaker. For matching purposes we recorded another sentence of length 8.85 seconds, which was further divided into three different subsequences of lengths 8.85 s (100%), 1.77 s (20%) and 0.177 s (2%). The feature extraction was performed using the following steps:

High-emphasis filtering with the filter $H(z) = 1 - 0.97 z^{-1}$.

12th order mel-cepstral analysis with a 30 ms Hamming window, shifted by 10 ms.

The feature vectors were composed of the 12 lowest mel-cepstral coefficients (except the 0th coefficient, which corresponds to the total energy of the frame). We also concatenated the feature vectors with the $\Delta$- and $\Delta\Delta$-coefficients (1st and 2nd time derivatives of the cepstral coefficients) to capture the dynamic behavior of the vocal tract.
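The filtering and feature-stacking steps above can be sketched as follows. This is a rough illustration under stated assumptions: the paper does not say how the time derivatives are estimated, so the central-difference `deltas` function is a hypothetical choice, the function names are invented for the example, and the mel-cepstral analysis itself is assumed to have been done elsewhere.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def deltas(frames):
    """Assumed derivative estimate: central difference between the
    neighbouring frames, with the edge frames replicated."""
    T = len(frames)
    out = []
    for t in range(T):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, T - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_features(cepstra):
    """Concatenate each cepstral frame with its delta and delta-delta."""
    d1 = deltas(cepstra)
    d2 = deltas(d1)
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]


# Five synthetic frames of 12 coefficients each (placeholder values).
cepstra = [[float(t * k) for k in range(12)] for t in range(5)]
features = stack_features(cepstra)
emphasized = pre_emphasis([1.0, 1.0, 1.0])
```

Each stacked frame holds the 12 static coefficients followed by 12 delta and 12 delta-delta values.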
The dimension of the final feature vector is therefore $3 \times 12 = 36$. The identification rates are summarized in Figs. 2-4 for the three different subsequences by varying the codebook size from K = 1 to 256.
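The two methods compared in these figures differ only in the per-vector weight of formula (3.1). The sketch below shows that difference in Python. The weighting rule in `compute_weights` is an assumption made for illustration and should not be read as the authors' algorithm: each code vector gets the inverse of the summed inverse distances to its nearest code vectors in the other speakers' codebooks, so that isolated code vectors get large weights. All function names and the `d_floor` clamp are likewise hypothetical.

```python
import math


def euclidean(x, c):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def nearest(x, codebook):
    """Return (index, distance) of the code vector nearest to x."""
    dists = [euclidean(x, c) for c in codebook]
    j = min(range(len(dists)), key=dists.__getitem__)
    return j, dists[j]


def compute_weights(codebooks, d_floor=1.0):
    """ASSUMED rule: w(c) = 1 / sum over other codebooks of 1/d_nearest."""
    weights = []
    for i, Ci in enumerate(codebooks):
        wi = []
        for c in Ci:
            s = 0.0
            for k, Ck in enumerate(codebooks):
                if k != i:
                    _, d = nearest(c, Ck)
                    s += 1.0 / max(d, d_floor)
            # With a single codebook there is nothing to compare against.
            wi.append(1.0 / s if s > 0 else 1.0)
        weights.append(wi)
    return weights


def weighted_similarity(X, codebook, w, d_floor=1.0):
    """Formula (3.1): weight of the nearest code vector times 1/d, averaged."""
    total = 0.0
    for x in X:
        j, d = nearest(x, codebook)
        total += w[j] / max(d, d_floor)
    return total / len(X)


def identify_weighted(X, codebooks, weights):
    """Index of the codebook that maximizes the weighted similarity."""
    scores = [weighted_similarity(X, C, w) for C, w in zip(codebooks, weights)]
    return max(range(len(codebooks)), key=scores.__getitem__)


# Each speaker has one code vector in a shared region and one isolated one.
codebooks = [
    [(0.0, 0.0), (20.0, 0.0)],   # speaker 0
    [(0.0, 1.0), (0.0, 20.0)],   # speaker 1
]
W = compute_weights(codebooks)
X = [(0.0, 0.5), (19.0, 0.0)]
winner = identify_weighted(X, codebooks, W)
```

The weights are computed once from the codebooks alone, so the matching step only adds one multiplication per test vector, in line with the remark in Section 3.2 that weighting belongs to the training phase.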

The proposed method (weighted similarity) outperforms the reference method (unweighted similarity) in all cases. It reaches 100% identification rate with K ≥ 32 using only 1.7 seconds of speech (corresponding to 172 test vectors). Even with a very short test sequence of 0.177 seconds (17 test vectors) the proposed method can reach an identification rate of 84%, whereas the reference method is practically useless.

Fig. 2. Performance evaluation using the full test sequence (sample length 8.850 s): identification rate (%) versus codebook size (1 to 256) for the weighted and unweighted methods.

Fig. 3. Performance evaluation using 20% of the test sequence (sample length 1.770 s): identification rate (%) versus codebook size (1 to 256) for the weighted and unweighted methods.

Fig. 4. Performance evaluation using 2% of the test sequence (sample length 0.177 s): identification rate (%) versus codebook size (1 to 256) for the weighted and unweighted methods.

5. Conclusions

We have proposed and evaluated a weighted matching method for text-independent speaker recognition. Experiments show that the method gives a large improvement over the reference method and that it can detect the correct speaker from much shorter speech samples. It is therefore well suited to real-time systems. Furthermore, the method can be generalized to other pattern recognition tasks because it is not designed for any particular features or distance metric.

References

[1] Deller Jr. J.R., Hansen J.H.L., Proakis J.G.: Discrete-time Processing of Speech Signals. Macmillan Publishing Company, New York, 2000.
[2] Fränti P., Kivijärvi J.: Randomized local search algorithm for the clustering problem, Pattern Analysis and Applications, 3(4): 358-369, 2000.
[3] Furui S.: Cepstral analysis technique for automatic speaker verification, IEEE Transactions on Acoustics, Speech and Signal Processing, 29(2): 254-272, 1981.
[4] He J., Liu L., Palm G.: A discriminative training algorithm for VQ-based speaker identification, IEEE Transactions on Speech and Audio Processing, 7(3): 353-356, 1999.
[5] Kinnunen T., Kilpeläinen T., Fränti P.: Comparison of clustering algorithms in speaker identification, Proc. IASTED Int. Conf. Signal Processing and Communications (SPC): 222-227, Marbella, Spain, 2000.
[6] Kyung Y.J., Lee H.S.: Bootstrap and aggregating VQ classifier for speaker recognition, Electronics Letters, 35(12): 973-974, 1999.
[7] Pham T., Wagner M.: Information based speaker identification, Proc. Int. Conf. Pattern Recognition (ICPR), 3: 282-285, Barcelona, Spain, 2000.
[8] Soong F.K., Rosenberg A.E., Juang B-H., Rabiner L.R.: A vector quantization approach to speaker recognition, AT&T Technical Journal, 66: 14-26, 1987.
[9] Zhen B., Wu X., Liu Z., Chi H.: On the use of bandpass liftering in speaker recognition, Proc. 6th Int. Conf. of Spoken Lang. Processing (ICSLP), Beijing, China, 2000.