Speaker Discriminative Weighting Method for VQ-based Speaker Identification

Tomi Kinnunen and Pasi Fränti
University of Joensuu, Department of Computer Science,
P.O. Box 111, 80101 JOENSUU, FINLAND
{tkinnu,franti}@cs.joensuu.fi

Abstract: We consider the matching function in a vector quantization based speaker identification system. The model of a speaker is a codebook generated from the set of feature vectors of the speaker's voice sample. The matching is performed by evaluating the similarity of the unknown speaker and the models in the database. In this paper, we propose a weighted matching method that takes into account the correlations between the known models in the database. Larger weights are assigned to vectors that have high discriminating power between the speakers, and vice versa. Experiments show that the new method provides significantly higher identification accuracy and that it can detect the correct speaker from shorter speech samples more reliably than the unweighted matching method.

1. Introduction

Various phonetic studies have shown that different parts of the speech signal have unequal discrimination properties between speakers. That is, the inter-speaker variation of certain phonemes is clearly different from that of other phonemes. Therefore, it would be useful to take this knowledge into account when designing speaker recognition systems.

There are several alternative approaches to utilizing this phenomenon. One approach is to use a front-end pre-classifier that automatically recognizes the acoustic units and gives a higher significance to units that have better discriminating power. Another approach is to use a weighting method in the front-end processing. This is usually realized by a method called cepstral liftering, which has been applied both in speaker recognition [3, 9] and in speech recognition [1]. However, all front-end weighting strategies depend on the parametrization (vectorization) of the speech and, therefore, do not provide a general solution to the speaker identification problem.

In this paper, we propose a new weighted matching method to be used in vector quantization (VQ) based speaker recognition.
The matching takes into account the correlations between the known models and assigns larger weights to code vectors that have high discriminating power. The method does not require any a priori knowledge about the nature of the feature vectors, or any phonetic knowledge about the discrimination powers of the different phonemes. Instead, the method adapts to the statistical properties of the feature vectors in the given database.

2. Vector Quantization in Speaker Recognition

In a VQ-based recognition system [4, 5, 6, 8], a speaker is modeled as a set of feature vectors generated from his/her voice sample. The speaker models are constructed by clustering the feature vectors into K separate clusters. Each cluster is then represented by a code vector, which is the centroid (average vector) of the cluster. The resulting set of code vectors is called a codebook, and it is stored in the speaker database. In the codebook, each vector represents a single acoustic unit typical for the particular speaker. Thus, the distribution of the feature vectors is represented by a set of sample vectors that is smaller than the full set of feature vectors but has a similar distribution. The codebook size should be set reasonably high, since previous results indicate that the matching performance improves with the size of the codebook [5, 7, 8]. For the clustering we use the randomized local search (RLS) algorithm as described in [2].

The matching of an unknown speaker is then performed by measuring the similarity/dissimilarity between the feature vectors of the unknown speaker and the models (codebooks) of the known speakers in the database. Denote the sequence of feature vectors extracted from the unknown speaker as $X = \{x_1, \ldots, x_T\}$. The goal is to find the best matching codebook $C_{best}$ from the database of N codebooks $C = \{C_1, \ldots, C_N\}$. The matching is usually evaluated by a distortion measure, or dissimilarity measure, that calculates the average distance of the mapping $d: X \times C \to R$ [5, 8]. The best matching codebook can then be defined as the codebook that minimizes the dissimilarity measure.

Unlike the previous approaches, we use a similarity measure. In this way, the weighted matching method can be defined in a more intuitive manner. Thus, the best matching codebook is now defined as the codebook that maximizes the similarity measure of the mapping $s: X \times C \to R$, i.e.:

$$ C_{best} = \arg\max_{1 \le i \le N} \{ s(X, C_i) \}. \tag{2.1} $$

Here the similarity measure is defined as the average of the inverse distance values:

$$ s(X, C_i) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{d(x_t, c^{i}_{\min})}, \tag{2.2} $$

where $c^{i}_{\min}$ denotes the nearest code vector to $x_t$ in the codebook $C_i$, and $d: R^P \times R^P \to R$ is a given distance function in the feature space, whose selection depends on the properties of the feature vectors. If the distance function d satisfies $0 < d < \infty$, then s is well-defined and $0 < s < \infty$. In the rest of the paper, we use the Euclidean distance for simplicity. Note that in practice, we limit the distance values to the range $1 < d < \infty$ and, thus, the effective values of the similarity measure are $0 < s < 1$.
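
As an illustration of formulas (2.1) and (2.2), the following minimal Python sketch scores an unknown speaker against a list of codebooks using the Euclidean distance. The function names and the `d_floor` parameter are hypothetical and not part of the paper; clamping the distance from below at 1 is only one assumed way of limiting the distance range as described above.

```python
import numpy as np

def nearest_distances(X, codebook):
    """Euclidean distance from each row of X (T x P) to its nearest code vector (K x P)."""
    diff = X[:, None, :] - codebook[None, :, :]      # shape (T, K, P)
    dist = np.sqrt((diff ** 2).sum(axis=2))          # shape (T, K)
    return dist.min(axis=1)                          # shape (T,)

def similarity(X, codebook, d_floor=1.0):
    """Formula (2.2): average inverse distance to the nearest code vector.

    d_floor is an assumed lower clamp on the distances (hypothetical parameter).
    """
    d = np.maximum(nearest_distances(X, codebook), d_floor)
    return float(np.mean(1.0 / d))

def identify(X, codebooks):
    """Formula (2.1): index of the codebook with the highest similarity, plus all scores."""
    scores = [similarity(X, C) for C in codebooks]
    return int(np.argmax(scores)), scores
```

Here `identify` returns a zero-based index into the list of codebooks, so the speaker labels have to be kept in a parallel list.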

3. Speaker Discriminative Matching

Consider the example shown in Fig. 1, in which the code vectors of three different speakers are marked by rectangles, circles and triangles. There is also a set of vectors from an unknown speaker, marked by stars. The region at the top rightmost corner cannot distinguish the speakers from each other, since it contains code vectors from all speakers. The region at the top leftmost corner is somewhat better in this sense, because samples there indicate that the unknown speaker is not triangle. The rest of the code vectors, on the other hand, have much higher discrimination power because they are isolated from the other code vectors.

Let us consider the unknown speaker star, whose sample vectors are concentrated mainly around three clusters. One cluster is at the top rightmost corner, and it cannot distinguish which speaker the sample vectors originate from. The second cluster, at the top leftmost corner, can rule out the speaker triangle, but only the third cluster makes the difference. The cluster at the right middle points only to the speaker rectangle and, therefore, we can conclude that the sample vectors of the unknown speaker originate from the speaker rectangle.

The situation is not so evident if we use the unweighted similarity score of formula (2.2). It gives equal weight to all sample vectors despite the fact that they do not have the same significance in the matching. Instead, the similarity value should depend on two separate factors: the distance to the nearest code vector, and the discrimination power of the code vector. Outliers and noise vectors that do not match well to any code vector should have a small impact, but vectors that match code vectors of many speakers should also have a smaller impact on the matching score.

Fig. 1: Illustration of code vectors having different discriminating power.

3.1 Weighted similarity measure

Our approach is to assign weights to the code vectors according to their discrimination power. In general, the weighting scheme can be formulated by modifying formula (2.2) as follows:

$$ s_w(X, C_i) = \frac{1}{T} \sum_{t=1}^{T} \frac{w(c^{i}_{\min})}{d(x_t, c^{i}_{\min})}, \tag{3.1} $$
where w is the weighting function. When the local similarity score, $1/d(x_t, c^{i}_{\min})$, is multiplied by the weight associated with the nearest code vector, $c^{i}_{\min}$, the product can be thought of as a local operator that moves the decision surface towards more significant code vectors.

3.2 Computing the weights

Consider a database of speaker codebooks $C_1, \ldots, C_N$. The codebooks are post-processed to assign weights to the code vectors, and the result of the process is a set of weighted codebooks $(C_i, W_i)$, $i = 1, \ldots, N$, where $W_i = \{w(c_{i1}), \ldots, w(c_{iK})\}$ are the weights assigned to the ith codebook. In this way, the weighting approach does not increase the computational load of the matching process, as it can be done in the training phase when creating the speaker database. The weights are computed using the following algorithm:

[Algorithm listing: illegible in the source because of a font-encoding error.]

4. Experimental Results

For testing purposes, we collected a database of 25 speakers (14 males + 11 females) using a sampling rate of 8.0 kHz with 16 bits/sample. The average duration of the training samples was 66.5 seconds per speaker. For matching purposes we recorded another sentence of length 8.85 seconds, which was further divided into three different subsequences of lengths 8.85 s (100%), 1.77 s (20%) and 0.177 s (2%). The feature extraction was performed using the following steps:

- High-emphasis filtering with the filter $H(z) = 1 - 0.97 z^{-1}$.
- 12th order mel-cepstral analysis with a 30 ms Hamming window, shifted by 10 ms.

The feature vectors were composed of the 12 lowest mel-cepstral coefficients (excluding the 0th coefficient, which corresponds to the total energy of the frame). We also concatenated the feature vectors with the Δ- and ΔΔ-coefficients (1st and 2nd time derivatives of the cepstral coefficients) to capture the dynamic behavior of the vocal tract.
The dimension of the final feature vector is therefore 3 × 12 = 36. The identification rates are summarized in Figs. 2-4 for the three different subsequences, with the codebook size varying from K = 1 to 256.
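
The composition of the 36-dimensional feature vector can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the paper does not say how the time derivatives are estimated, so the use of `np.gradient` (finite differences along the frame axis) is an assumption, and the mel-cepstral analysis itself is taken as given. The function names are hypothetical.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Apply H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()
    out[1:] -= alpha * signal[:-1]
    return out

def add_deltas(cepstra):
    """Stack static, delta and delta-delta coefficients: (T, 12) -> (T, 36).

    Assumption: derivatives are approximated by finite differences over frames;
    at least two frames are needed.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    delta = np.gradient(cepstra, axis=0)     # 1st time derivative
    delta2 = np.gradient(delta, axis=0)      # 2nd time derivative
    return np.hstack([cepstra, delta, delta2])
```

With 12 static coefficients per frame, `add_deltas` yields the 3 × 12 = 36 dimensions stated above.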
The proposed method (weighted similarity) outperforms the reference method (unweighted similarity) in all cases. It reaches a 100% identification rate with K ≥ 32 using only 1.7 seconds of speech (corresponding to 172 test vectors). Even with a very short test sequence of 0.177 seconds (17 test vectors), the proposed method can reach an identification rate of 84%, whereas the reference method is practically useless.

[Figure: identification rate (0-100%) versus codebook size (1-256) for the weighted and unweighted methods; sample length 8.850 s.]
Fig. 2. Performance evaluation using the full test sequence.

[Figure: identification rate (0-100%) versus codebook size (1-256) for the weighted and unweighted methods; sample length 1.770 s.]
Fig. 3. Performance evaluation using 20% of the test sequence.

[Figure: identification rate (0-100%) versus codebook size (1-256) for the weighted and unweighted methods; sample length 0.177 s.]
Fig. 4. Performance evaluation using 2% of the test sequence.
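
To make the difference between the weighted and unweighted scores concrete, the following minimal Python sketch evaluates formula (3.1). The weighting rule in `compute_weights` is an assumption made for illustration only: each code vector receives the inverse of its summed inverse distances to the nearest code vector in every other codebook, so that isolated code vectors get large weights. It should not be read as the exact algorithm of Section 3.2. All function names and the `eps` and `d_floor` parameters are hypothetical.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between rows of A (M x P) and rows of B (K x P)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def compute_weights(codebooks, eps=1e-12):
    """Assumed weighting rule (not necessarily the paper's): code vectors that lie
    far from all other speakers' codebooks receive larger weights."""
    if len(codebooks) < 2:
        raise ValueError("at least two codebooks are needed")
    weights = []
    for i, Ci in enumerate(codebooks):
        accum = np.zeros(len(Ci))
        for k, Ck in enumerate(codebooks):
            if k == i:
                continue
            # Distance from each code vector of Ci to its nearest neighbor in Ck
            d_min = pairwise_dist(Ci, Ck).min(axis=1)
            accum += 1.0 / np.maximum(d_min, eps)
        weights.append(1.0 / accum)
    return weights

def weighted_similarity(X, codebook, w, d_floor=1.0):
    """Formula (3.1): average of w(nearest code vector) / distance to it."""
    dist = pairwise_dist(X, codebook)
    nearest = dist.argmin(axis=1)
    d = np.maximum(dist[np.arange(len(X)), nearest], d_floor)
    return float(np.mean(w[nearest] / d))
```

Because the weights depend only on the codebooks, they can be computed once when the speaker database is built; at matching time the only extra cost over formula (2.2) is one lookup and one multiplication per test vector.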

5. Conclusions

We have proposed and evaluated a weighted matching method for text-independent speaker recognition. Experiments show that the method gives a tremendous improvement over the reference method, and that it can detect the correct speaker from much shorter speech samples. It is therefore well suited to real-time systems. Furthermore, the method can be generalized to other pattern recognition tasks because it is not designed for any particular features or distance metric.

References

[1] Deller Jr. J.R., Hansen J.H.L., Proakis J.G.: Discrete-Time Processing of Speech Signals. Macmillan Publishing Company, New York, 2000.
[2] Fränti P., Kivijärvi J.: Randomized local search algorithm for the clustering problem. Pattern Analysis and Applications, 3(4): 358-369, 2000.
[3] Furui S.: Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(2): 254-272, 1981.
[4] He J., Liu L., Palm G.: A discriminative training algorithm for VQ-based speaker identification. IEEE Transactions on Speech and Audio Processing, 7(3): 353-356, 1999.
[5] Kinnunen T., Kilpeläinen T., Fränti P.: Comparison of clustering algorithms in speaker identification. Proc. IASTED Int. Conf. Signal Processing and Communications (SPC): 222-227, Marbella, Spain, 2000.
[6] Kyung Y.J., Lee H.S.: Bootstrap and aggregating VQ classifier for speaker recognition. Electronics Letters, 35(12): 973-974, 1999.
[7] Pham T., Wagner M.: Information based speaker identification. Proc. Int. Conf. Pattern Recognition (ICPR), 3: 282-285, Barcelona, Spain, 2000.
[8] Soong F.K., Rosenberg A.E., Juang B-H., Rabiner L.R.: A vector quantization approach to speaker recognition. AT&T Technical Journal, 66: 14-26, 1987.
[9] Zhen B., Wu X., Liu Z., Chi H.: On the use of bandpass liftering in speaker recognition. Proc. 6th Int. Conf. of Spoken Lang. Processing (ICSLP), Beijing, China, 2000.