YCSLA 298 No. of Pages 28, DTD = September 2005 Disk Used ARTICLE IN PRESS. Computer Speech and Language xxx (2005) xxx xxx UNCORRECTED PROOF


Step-by-step and integrated approaches in broadcast news speaker diarization

Sylvain Meignier a,c, Daniel Moraru b, Corinne Fredouille a,*, Jean-François Bonastre a, Laurent Besacier b

a Laboratoire Informatique d'Avignon (LIA)/CNRS, Department of Computing, University of Avignon, BP1228, Avignon Cedex 9, France
b CLIPS, IMAG (UJF & CNRS), BP 53, Grenoble Cedex 9, France
c LIUM/CNRS, Université du Maine, Avenue Laennec, Le Mans Cedex 9, France

Received 2 November 2004; received in revised form 1 August 2005; accepted 3 August 2005

* Corresponding author.
E-mail addresses: sylvain.meignier@univ-lemans.fr (S. Meignier), daniel.moraru@imag.fr (D. Moraru), corinne.fredouille@lia.univ-avignon.fr (C. Fredouille), jean-francois.bonastre@lia.univ-avignon.fr (J.-F. Bonastre), laurent.besacier@imag.fr (L. Besacier).
See front matter © 2005 Elsevier Ltd. All rights reserved. doi: /j.csl

Abstract

This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT'03S). The speaker diarization task consists of segmenting a conversation into homogeneous segments which are then grouped into speaker classes.

Two approaches are described and compared for speaker diarization. The first one relies on a classical two-step speaker diarization strategy based on a detection of speaker turns followed by a clustering process, while the second one uses an integrated strategy where both segment boundaries and speaker tying of the segments are extracted simultaneously and challenged during the whole process. These two methods are used to investigate various strategies for the fusion of diarization results.

Furthermore, segmentation into acoustic macro-classes is proposed and evaluated as an a priori step to speaker diarization. The objective is to take advantage of the a priori acoustic information in the diarization process. Along with enriching the resulting segmentation with information about speaker gender, channel quality or background sound, this approach brings gains in speaker diarization performance thanks to the diversity of acoustic conditions found in broadcast news.

The last part of this paper describes some ongoing work carried out by the CLIPS and LIA laboratories and presents some results obtained since 2002 on speaker diarization for various corpora.

Keywords: Speaker indexing; Speaker segmentation and clustering; Speaker diarization; E-HMM; Integrated approach; Step-by-step approach

1. Introduction

The design of efficient indexing algorithms to facilitate the retrieval of relevant information is vital to provide easy access to multimedia documents. Until recently, indexing audio-specific documents such as radio broadcast news or the audio channel of video materials mostly consisted of running automatic speech recognizers (ASRs) on the audio channel in order to extract syntactic or higher level information. Text-based information retrieval approaches were then applied to the transcription issued from speech recognition. The transcription task alone represented one of the main challenges of speech processing during the past decade (see the DARPA workshop proceedings at Darpa speech recognition evaluation workshop) and no specific effort was dedicated to other information embedded in the audio channel. Progress made in broadcast news transcription (Kim et al., 2003; Nguyen and Xiang, 2004) shifts the focus to a new task, denoted Rich Transcription (NIST-RT'03S, 2003), where syntactic information is only one element among various types of information. At the first level, acoustic-based information like speaker turns, the number of speakers, speaker gender, speaker identity, other sounds (music, laughs) as well as speech bandwidth or characteristics (studio quality or telephone speech, clean speech or speech over music) can be extracted and added to syntactic information. At the second level, information directly linked to the spontaneous nature of speech, like disfluencies (hesitations, repetitions, etc.) or emotion is also relevant for rich transcription. On a higher level, linguistic or pragmatic information such as named entity or topic extraction for instance is particularly interesting for seamless navigation or multimedia information retrieval. Finally, some types of information extraction relevant to document structure do not fall exactly into one category; for example, the detection of sentence boundaries can be based on acoustic cues but also on linguistic ones.

This paper concerns information extraction on the first level described above. It is mainly dedicated to the detection of speaker information, such as speaker turns, speaker gender, and speaker identity. These speaker-related tasks correspond to speaker segmentation and clustering, also denoted speaker diarization in the NIST rich transcription (RT) evaluation campaign terminology.

The speaker diarization task consists of segmenting a conversation involving multiple speakers into homogeneous parts which contain the voice of only one speaker, and grouping together all the segments that correspond to the same speaker. The first part of the process is also called speaker change detection while the second one is known as the clustering process. Generally, no prior information is available regarding the number of speakers involved or their identities.

Estimating the number of speakers is one of the main difficulties for the speaker diarization task. To summarize, this task consists of:
- finding the speaker turns,
- grouping the speaker-homogeneous segments into clusters,
- estimating the number of speakers involved in the document.

Classical approaches for speaker diarization (Siu et al., 1992; Wilcox et al., 1994; Siegler et al.; Gauvain et al., 1998; Chen and Gopalakrishnan, 1998) deal with these three points successively: first finding the speaker turns using for example the symmetric Kullback-Leibler (KL2), the generalized likelihood ratio (GLR), or the Bayesian information criterion (BIC) distance approaches, then grouping the segments during a hierarchical clustering phase, and finally estimating the number of speakers a posteriori. Although this strategy has some advantages, such as dealing with quite long and pure segments for the clustering, it also has some drawbacks. For example, knowledge issued from the clustering (like speaker-voice models) could be very useful to estimate segment boundaries as well as to facilitate the detection of other speakers. Contrasting with this step-by-step strategy, an integrated approach, for which the three steps involved in speaker diarization are performed simultaneously, uses all the information currently available for each of the subtasks (Meignier et al., 2001; Ajmera and Wooters, 2003). The main disadvantage of the integrated approach lies in the need to learn robust speaker models using very short segments (rather than a cluster of segments as in classical approaches), even though the speaker models get refined along the process. Mixed strategies are also proposed (Wilcox et al., 1994; Reynolds et al., 2000; Moraru et al., 2004), where classical step-by-step segmentation and clustering are first applied and then refined using a re-segmentation process during which the segment boundaries, the segment clustering and sometimes the number of speakers are challenged jointly.

In addition to the intrinsic speaker diarization subtasks presented above (denoted p1 in the list below), various problems need to be solved in order to segment an audio document into speakers, depending on the environment or the nature of the document:
- to identify the speaker turns and the speaker clusters, and to estimate the number of speakers involved in the document, without any a priori information (p1);
- to be able to process speech documents as well as documents containing music, silence, and other sounds (p2);
- to be able to process spontaneous speech with overlapping voices of speakers, disfluencies, etc. (p3).

The NIST'02 speaker recognition evaluation provided an overview of the performance that can be obtained for:
- conversational telephone speech, involving two speakers and a single acoustic class of signals;
- broadcast news data which often includes various qualities or types of signal (such as studio/telephone speech, music, speech over music, etc.);
- meeting room data in which speech is more spontaneous than in the previous cases, and presents several distortions due to distant microphones (e.g., table microphone) and noisy environment.

Table 1 shows the various classes of problems encountered in each situation (p1, p2, and p3). The increasing difficulty of the tasks is obviously due to their novelty (the last two tasks were introduced for the 2002 evaluation campaign) but also and mainly to the accumulation of problems described in the previous paragraph.

Table 1
Increasing difficulty of the tasks

Task                     Telephone                                 Broadcast news   Meeting
Diarization error rate
Problems involved        p1 (but with fixed number of speakers)    p1 + p2          p1 + p2 + p3

Best results for the speaker diarization task in the NIST'02 speaker recognition evaluation.

Since 2001, two members of the ELISA Consortium, CLIPS and LIA, have been collaborating in order to participate in the yearly evaluation campaigns for the task of speaker segmentation/diarization: NIST'01 (2001) (LIA only), NIST'02 (2002), NIST-RT'03S (2003), and NIST-RT'04S (2004). Since speaker diarization may also be useful for indexing and segmenting videos, CLIPS has also participated in experiments in the last three TREC VIDEO evaluations (Smeaton et al.) since 2002 (Quénot et al., 2002; Quénot et al., 2003).

The ELISA Consortium was originally created by ENST, EPFL, IDIAP, IRISA and LIA with the aim of promoting scientific exchange between members, developing a common state-of-the-art speaker verification system and participating in the yearly NIST speaker recognition evaluation campaigns. Over the years, the composition of the Consortium has changed and today CLIPS, DDL, ENST, IRISA, LIA, LIUM and the Fribourg University are members. Since 1998, the members of the Consortium have participated in the NIST evaluation campaigns in speaker verification; a comparative study of the various systems presented in campaigns up to and including 2001 can be found in ELISA (2000) and Magrin-Chagnolleau et al. (2001).

This paper presents an overview of this long-term collaboration by investigating two main issues. Firstly, the relative advantages of the classical step-by-step approach as well as of a more original integrated strategy are discussed (this part of the work can be linked to the p1 point mentioned above: the intrinsic tasks of speaker diarization). Several fusion strategies that use the advantages of both approaches are also proposed. The second issue addressed in this paper concerns the nature of the audio documents to be segmented (issue denoted as p2). This part of the work is more precisely dedicated to speaker diarization of broadcast news data. The interest of applying an acoustic macro-class segmentation process before speaker segmentation (in order to divide the audio file into bandwidth- or gender-homogeneous parts) is discussed.

This paper is organized as follows: Section 2 is devoted to the description of systems. The acoustic macro-class segmentation process and the two speaker diarization techniques are described successively. Section 3 focuses on the fusion of the two approaches. Performance of the various systems is presented and discussed in Section 4. All the experimental protocols and data come from the NIST-RT'03S development and evaluation corpora (except for some results on meeting data reported in Section 5, which come from the NIST-RT'04S meeting data evaluation (NIST-RT'04S, 2004; Fredouille et al., 2004)). Section 5 presents ongoing work on meeting data and integration of a priori knowledge into a speaker diarization system. Finally, concluding remarks are made in the last section.

2. Speaker diarization approaches

Two different speaker diarization systems are proposed in this paper and described in the next sections. They were developed individually by the CLIPS and LIA laboratories in the framework of the ELISA consortium (Moraru et al., 2003; Moraru et al., 2004). The CLIPS system relies on a classical step-by-step strategy. It involves a distance based detector strategy (Delacourt and Welkens, 2000) followed by a hierarchical clustering. This approach will be denoted as step-by-step strategy in the rest of this paper. The second system developed by the LIA follows an integrated strategy. It is based on an HMM and will be denoted as integrated strategy in this paper.

As illustrated in Fig. 1, both systems use an acoustic macro-class segmentation as a preliminary phase. During this acoustic segmentation, the signal is first divided into four acoustic classes according to different conditions based on gender and wide/narrow band detection. Then, the (CLIPS and LIA) diarization systems are individually applied on each isolated acoustic class. Finally, the four resulting segmentation outputs are merged and consolidated through a re-segmentation phase. The separate application of the speaker diarization systems on each acoustic class assumes that a particular speaker is associated with one of them only. Nevertheless, the re-segmentation process makes it possible to question the relationship between a speaker and a unique acoustic class.

Both diarization approaches and acoustic segmentation were developed independently before investigating different strategies for combining the systems. Therefore, the settings of each of them, like acoustic features or learning methods, may differ but come from experiments conducted over a common development corpus (see Section 4.1).

Fig. 1. Overview of the speaker diarization strategy. [Diagram: acoustic segmentation into Male Wide, Female Wide, Male Narrow and Female Narrow classes; speaker diarization applied to each class; merging and re-segmentation.]

2.1. Acoustic macro-class segmentation

Segmenting an audio signal into acoustic classes was mainly introduced to assist ASR systems within the special context of broadcast news transcription (Hain and Woodland, 1998; Woodland; Gauvain et al., 2002). Indeed, one of the first objectives of acoustic segmentation was to provide ASR systems with an acoustic event classification to discard non-speech signal (silence, music, commercials) and to adapt ASR acoustic models to some particular acoustic environments, like speech over music, telephone speech or speaker gender. Many papers were dedicated to this particular issue and to the evaluation of acoustic segmentation in the context of the ASR task. However, acoustic segmentation may be useful for other tasks linked to broadcast news corpora, although this is rarely discussed in the literature. In this sense, one of the aims of this work is to investigate the impact of acoustic segmentation when it is applied as prior segmentation for speaker diarization.

Speech/non-speech detection is useful for the speaker diarization task in order to avoid music and silence portions being automatically labeled as new speakers. This is particularly true in the context of the NIST-RT evaluation in which both miss and false alarm speech errors are taken into account for the speaker diarization scoring.

Moreover, an acoustic segmentation system can be designed to provide a finer classification. For example, gender and frequency band detection may introduce a priori knowledge in the diarization process. In this paper, the prior acoustic segmentation is done at three different levels:
- Speech/non-speech.
- Clean speech/speech over music/telephone speech (narrow band).
- Male/female speech.

2.1.1. Hierarchical approach

The system relies on a hierarchical segmentation performed in three successive steps as illustrated in Fig. 2:
- During the first step, a speech/non-speech segmentation is performed using two models. The first model, MixS, represents all the speech conditions while the second one, NS, represents the non-speech conditions. Basically, the segmentation process relies on a frame-by-frame best model search. A set of morphological rules is then applied to aggregate frames and to label segments. These rules mainly aim at constraining the duration of segments, by fixing for instance minimum lengths for both speech and non-speech segments. This strategy was preferred to a Viterbi decoding, which tends, in this context, to misclassify non-speech segments.
- During the second step, a segmentation based on three classes, clean speech (S model), speech over music (SM model) and telephone speech (T model), is performed only on the speech segments detected during the previous segmentation step. All the models involved during this step are gender-independent. The segmentation process is a Viterbi decoding applied on an ergodic HMM, composed of three states (S, T, and SM models). The transition probabilities of this ergodic HMM are learnt on the 1996 HUB 4 broadcast news corpus.
- The last step is gender detection. According to the label assigned during the previous step, each segment will be identified as female or male speech by the use of models dependent on both gender and acoustic classes. GT-Fe and GT-Ma models represent female and male telephone speech respectively, GS-Fe and GS-Ma represent female and male clean speech, while GSM-Fe and GSM-Ma represent female and male speech over music. Two additional models, GDS-Fe and GDS-Ma, representing female and male speech recorded under degraded conditions are also used to refine the final segmentation. The segmentation process described in the previous step is applied here again.

Fig. 2. Hierarchical acoustic segmentation. [Diagram: speech/non-speech segmentation (MixS model vs. NS); speech segmentation (S model vs. T vs. SM); gender detection on each class, yielding GS-Ma, GS-Fe, GDS-Ma, GDS-Fe, GT-Ma, GT-Fe, GSM-Ma and GSM-Fe.]

2.1.2. System specifications

The signal is characterized by 39 acoustic features computed every 10 ms on 25 ms Hamming-windowed frames: 12 Mel frequency cepstral coefficients (MFCC) augmented by the normalized log-energy, followed by the delta and delta-delta coefficients. The choice of parameters was mainly guided by the literature (Hain and Woodland, 1998).

All the models mentioned in the previous section are diagonal Gaussian mixture models (GMMs), trained on the 1996 HUB 4 broadcast news corpus. The NS and MixS models are characterized by 1 and 512 Gaussian components respectively, while the other models are characterized by 1024 Gaussian components. All these parameters have been chosen empirically following a set of experiments not reported here.

2.2. Step-by-step speaker diarization

The CLIPS system is a state-of-the-art system based on speaker change detection followed by a hierarchical clustering. The number of speakers involved in the conversation is automatically estimated. The system uses the acoustic macro-class segmentation described in Section 2.1. The CLIPS diarization is applied individually on every acoustic class as explained in Section 2 and the results are merged at the end. The next subsections will provide a detailed description of every module of the system.

2.2.1. Step one: speaker change detection

The goal of the speaker change detection is to cut the audio recording into segments containing only the speech of one single speaker. Its purpose is to find audio signal discontinuities that help to distinguish between two consecutive speakers. Those segments will be used as input data for the clustering module. A distance based approach (Delacourt and Welkens, 2000; Chen and Gopalakrishnan, 1998) is used, relying here on the GLR. Given two acoustic sequences X and Y, we test whether they were produced by the same Gaussian model (the same speaker) $M_{XY}$ or by two different models (two different speakers) $M_X$ and $M_Y$. This question can be answered using the following GLR ratio:

$$R_{GLR}(X, Y) = \log L(X \mid M_X) + \log L(Y \mid M_Y) - \log L(XY \mid M_{XY}). \tag{1}$$

A high value of $R_{GLR}$ means that the two-model hypothesis is more likely than the one-model hypothesis. The first two terms of $R_{GLR}$ are the log-likelihood of the two-model hypothesis and the last term is the log-likelihood of the one-model hypothesis. A GLR curve is extracted from two adjacent windows that move along the audio signal. The window size must be small enough to contain only one speaker and large enough to obtain a reliable model. The two windows advance frame by frame. Mono-Gaussian models with diagonal covariance matrices are used to build the GLR curve. The maximum peaks of the curve are the most likely speaker change points. A threshold is then applied on the GLR curve to find speaker changes. The threshold is tuned so that over-segmentation (more speaker changes detected) is provided, as we prefer to detect more segments (which can be further merged by the clustering process) rather than miss speaker changes (which will never be recovered later). The threshold is computed using the mean value of the current curve. Thus, it adapts itself from one file to another.
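As an illustration of Eq. (1), the following is a minimal sketch of a GLR curve computed with maximum-likelihood diagonal mono-Gaussian models on two adjacent sliding windows. It is not the CLIPS implementation: the function names, the variance floor, the window length in frames and the `scale` factor applied to the mean-based threshold are all assumptions made for the example, since the text only states that the threshold is derived from the mean of the curve.

```python
import math


def diag_gauss_loglik(frames, var_floor=1e-6):
    """Log-likelihood of `frames` under the diagonal Gaussian fitted to them (ML)."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for k in range(dim):
        column = [frame[k] for frame in frames]
        mean = sum(column) / n
        # var_floor is an assumption that guards against log(0) on constant data.
        var = max(sum((v - mean) ** 2 for v in column) / n, var_floor)
        total += sum(
            -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)
            for v in column
        )
    return total


def glr(x, y):
    """Eq. (1): two-model log-likelihood minus one-model log-likelihood."""
    return diag_gauss_loglik(x) + diag_gauss_loglik(y) - diag_gauss_loglik(x + y)


def glr_curve(frames, win):
    """GLR between two adjacent windows of `win` frames, advanced frame by frame.

    curve[i] scores a candidate speaker change at frame index i + win.
    """
    return [
        glr(frames[t - win:t], frames[t:t + win])
        for t in range(win, len(frames) - win + 1)
    ]


def change_points(curve, win, scale=1.0):
    """Local maxima of the curve above scale * mean(curve), as frame indices."""
    threshold = scale * sum(curve) / len(curve)
    return [
        i + win
        for i in range(1, len(curve) - 1)
        if curve[i] > threshold and curve[i - 1] < curve[i] >= curve[i + 1]
    ]
```

A lower `scale` yields more candidate changes, which matches the stated preference for over-segmentation before clustering.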
Another system was presented at the NIST'02 speaker recognition evaluation with a priori segmentation using fixed length segments (0.75 s). It gave approximately the same performance while being 3 times slower due to the uniform segmentation that leads to far more segments as input of the clustering module.

2.2.2. Step two: clustering

Now that we have detected the speaker changes, the segments obtained must be grouped (clustered) by speaker. The CLIPS clustering uses a hierarchical bottom-up algorithm. A clustering algorithm generally relies on two important elements: the distance between classes and the stop criterion. The distance used is the GLR distance and the stop criterion is the estimated number of speakers. The GLR distance is the GLR ratio (see Eq. (1)) computed between classes rather than consecutive windows. Another difference is that the models used are no longer mono-Gaussian as in the speaker change detection but GMMs.

First, a 32-component diagonal GMM background model is trained on the entire file using a classical EM algorithm. We need a background model to compensate for the lack of data for each speaker. The advantage of using a background model trained on the current file is that it is always suited for the current task. A more complex background model (e.g., a 512-component diagonal GMM) trained on external data could perform better but makes the speaker diarization system data dependent (the system would work only on the type of data used to train the background model). The size of the model is a good compromise between complexity and performance: beyond 32 Gaussian components we only gain about 0.5% absolute diarization error rate (DER) but we increase the execution time.

Segment models are then trained using a linear MAP adaptation (see Meignier et al., 2001, for more details on the linear MAP adaptation) of the background model (means only). GLR distances are then computed between models and the closest segments are merged at each step of the algorithm until N segments are left (corresponding to N speakers detected in the conversation).

The number of speakers N is estimated as described in the next section. The clustering is done individually on each acoustic macro-class (namely male/wide, female/wide, male/narrow and female/narrow) and the results are merged in the end.

2.2.3. Step three: estimating the number of speakers

The algorithm that estimates the number of speakers is based on the penalized BIC (Schwarz). At first, the number of speakers is limited to between 1 and 25. The upper limit usually depends on the recording size. We select the number of speakers ($N_{sp}$) that maximizes

$$BIC(M) = \log L(X \mid M) - \lambda \frac{m}{2} N_{sp} \log N_X, \tag{2}$$

where M is the model composed of the $N_{sp}$ speaker models, $N_X$ is the total number of speech frames involved, m is a parameter that depends on the complexity of the speaker models and $\lambda$ is a tuning parameter empirically set at 0.6. In our case (32 diagonal GMM), m is equal to (2 times 32) times the number of acoustic features. The first term is the overall log-likelihood of the data. The second term is used to penalize the complexity of the model. We need the second term because the log-likelihood of the data increases with the number of models (speakers) involved in the calculation of $L(X \mid M)$.

Let $X_i$ and $M_i$ be the data and the model of speaker i respectively. The model is obtained by MAP adaptation of the background model over the speaker data as in the previous section. If we make the hypothesis that data $X_i$ depends only on the speaker model $M_i$, then the overall likelihood of the data factorizes as

$$L(X \mid M) = \prod_{i=1}^{N_{sp}} L(X_i \mid M_i), \tag{3}$$

so that the overall log-likelihood in Eq. (2) is the sum of the per-speaker log-likelihoods.

Results concerning the estimation of the number of speakers will be presented in Section 4.

2.2.4. System specifications

The signal is characterized by 16 MFCC computed every 10 ms on 20 ms windows using filter banks. Then we add the energy parameter. The choice of the number of filters is due to the fact that we work on wide-band data (broadcast news). No frame removal nor coefficient normalization is applied. The parameterization is the same for all system modules of this step-by-step diarization system, but is different from that of the integrated speaker diarization system and the acoustic segmentation, which were all developed separately in different places.
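The penalized BIC selection of Section 2.2.3 (Eq. (2)) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the clustering that produces one overall log-likelihood per candidate number of speakers (the sum of per-speaker log-likelihoods, following Eq. (3)) is assumed to have been run beforehand.

```python
import math


def penalized_bic(loglik, n_speakers, n_frames, m, lam=0.6):
    """Eq. (2): overall log-likelihood minus the complexity penalty.

    `lam` is the tuning parameter of the text (0.6); `m` depends on the
    complexity of one speaker model.
    """
    return loglik - lam * (m / 2.0) * n_speakers * math.log(n_frames)


def select_num_speakers(logliks, n_frames, m, lam=0.6):
    """Return the candidate speaker count with the highest penalized BIC.

    `logliks` maps a candidate number of speakers to the overall
    log-likelihood obtained when the clustering stops at that count.
    """
    return max(
        logliks,
        key=lambda n: penalized_bic(logliks[n], n, n_frames, m, lam),
    )
```

For example, with 17 features per frame (16 MFCC plus energy), m = 2 * 32 * 17 = 1088. Without the penalty term the largest candidate would always win, since the log-likelihood keeps increasing with the number of speaker models.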

2.3. Integrated speaker diarization

The LIA system is based on an evolutive Hidden Markov modeling (E-HMM) of the conversation (Meignier et al., 2000; Meignier et al., 2001; Moraru et al., 2003; Moraru et al., 2004). The HMM is ergodic; all speaker changes are potentially available. Each state of the HMM characterizes a speaker and the transitions model the changes between speakers (Fig. 3). In this iterative approach, both the segmentation and the speaker models are used at each step and are re-evaluated at the next step. During the diarization process, the speakers are detected and added one by one at each iteration. This is the reason why we have named this diarization method integrated approach.

Fig. 3. Integrated approach: evolutive HMM modeling of the conversation and segmentation. Example given for three speakers ($S_0$, $S_1$, $S_2$).

The speaker diarization system relies on the acoustic macro-class segmentation described in Section 2.1. It is applied separately on each of the acoustic classes detected (e.g., male/wide, female/wide, male/narrow and female/narrow). Finally, the separate speaker diarization outputs are merged, followed by a re-segmentation process described later in the paper.

2.3.1. Speaker diarization process

During the diarization, the HMM is generated using an iterative process which detects and adds a new state (i.e., a new speaker) at each iteration. The speaker detection process is performed in four steps (Fig. 4). An example for a two speaker show is given in Fig. 5.

Step 1: Initialization. A first speaker model $S_0$ is trained on the whole show (broadcast news show for instance). The segmentation is modeled by a one-state HMM and the whole signal is assigned to speaker $S_0$. At the beginning of the iterative process, $S_0$ represents all the speakers of the show. At the end of the process, once all the speakers have been detected (the n − 1 first speakers) and their segments assigned to them, $S_0$ should represent a unique speaker, the last one (the nth speaker).

Step 2: Adding a new speaker. A new speaker is extracted from the segments currently labeled $S_0$, representing the speakers that are not detected yet. The new speaker model is trained using the 3-s region of $S_0$ that maximizes the likelihood ratio between model $S_0$ and a universal background model (UBM; Reynolds et al., 2000, see Section 2.3.2). The length of the initial region must be sufficient to initialize a robust speaker model while containing one speaker only. This strategy selects the closest data to speaker model $S_0$. The 3-s length is chosen empirically. A corresponding state, labeled $S_x$ (x is the number of iterations), is added to the previous HMM. The transition probabilities are updated according to a set of rules (more details are given in Section 2.3.2). Finally, the selected 3 s of test are moved from label $S_0$ to label $S_x$ in the segmentation hypothesis. Various selection strategies have been tested, involving either the speaker or UBM models. The selection method described here produces the best accuracy in terms of purity of the segments and of speaker diarization error.

Step 3: Adapting speaker models. This phase allows the detection of the segments belonging to a new speaker $S_x$ and the reallocation of the data between all the speakers. First, all the speaker models are adapted according to the current segmentation. Then, Viterbi decoding produces a new segmentation. The adaptation and decoding tasks are performed while the segmentation differs between two successive adaptation/decoding phases. Two segmentations are different when at least one frame is assigned to two different speakers.

Step 4: Speaker model validation and assessment of the stopping criterion. The likelihood of the previous solution and the likelihood of the current solution are computed using the current HMM model (for example, the solution with two speakers detected and the current solution with three speakers detected). In order to compare the likelihoods of both solutions, the previous one is rescored using the associated HMM where a non-emitting state is added (i.e., the transition probabilities are set to the same values for both HMM). The stopping criterion is reached when no gain in terms of likelihood is observed or when no more speech is left to initialize a new speaker.

Fig. 4. Integrated approach: different steps of the process. [Flowchart: 1 new speaker model; adapting speaker models (MAP adaptation, likelihood, Viterbi) until the last 2 segmentations are equal; validation of speaker models; assessing the stop criteria.]

During the development, experiments show that two heuristics help to minimize the speaker diarization error:
- The first one removes the current speaker if the total time of the segments allocated to that speaker is less than 4 s. Moreover, the 3-s region used for its initialization is never re-employed in step 2 and the process continues with the segmentation of the previous iteration.
- The second one discards the previous speakers from the segmentation if the length of their segments is lower than the current one. This rule, which forces the detection of the longest speaker first, is closely related to the evaluation metric used in NIST campaigns where it is more important to find the longest speaker segments than the shortest ones.
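The adaptation/decoding loop of Step 3 and the 4 s duration heuristic can be sketched as follows. This is a toy illustration, not the LIA system: `toy_adapt` (a per-speaker mean on 1-D frames) and `toy_decode` (nearest-mean labeling) are hypothetical stand-ins for MAP adaptation and Viterbi decoding, and the `max_iter` safety cap and the 10 ms frame shift default are assumptions of the example.

```python
def adapt_until_stable(frames, segmentation, adapt, decode, max_iter=50):
    """Alternate model adaptation and decoding until the labels stop changing.

    `segmentation` holds one speaker label per frame. Two segmentations
    differ as soon as one frame carries a different label.
    """
    models = adapt(frames, segmentation)
    for _ in range(max_iter):
        new_segmentation = decode(frames, models)
        if new_segmentation == segmentation:
            break
        segmentation = new_segmentation
        models = adapt(frames, segmentation)
    return segmentation, models


def is_too_short(segmentation, speaker, frame_shift=0.01, min_seconds=4.0):
    """First heuristic: a speaker holding less than 4 s of signal is removed."""
    return segmentation.count(speaker) * frame_shift < min_seconds


def toy_adapt(frames, segmentation):
    """Toy stand-in for MAP adaptation: one mean per speaker label."""
    sums, counts = {}, {}
    for value, label in zip(frames, segmentation):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def toy_decode(frames, models):
    """Toy stand-in for Viterbi decoding: label each frame by the nearest mean."""
    return [
        min(models, key=lambda label: abs(value - models[label]))
        for value in frames
    ]
```

In the toy setting, a frame that was initially assigned to the wrong speaker is reallocated during the first decoding pass, and the loop stops once a second pass leaves the labels unchanged.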

Fig. 5. Integrated approach: diarization example for a two-speaker show. (Panels: Step 1, initialization; Step 2, adding the new speaker; Step 3, adapting speaker models by training and Viterbi decoding; Step 4, speaker model validation and assessment of the stopping criterion; shown for iterations 1 and 2.)

System specifications
The system specifications are set empirically on a development corpus (see Section 4.1). The next paragraphs give some details on the parameterization of the signal, the speaker model adaptation and the HMM.

Parameterization. The signal is characterized by 20th-order MFCC computed at a 10 ms frame rate using a 20 ms window, and the normalized energy. No coefficient normalization is applied; indeed, cepstral mean subtraction (CMS) or sliding CMS decreases the diarization accuracy.

Speaker models. Speaker models and adaptation techniques used in the E-HMM are similar to those generally used for automatic speaker recognition. Speaker models are based on GMMs derived from a UBM. Only the means are adapted, by a MAP technique. The GMMs are composed of 128 Gaussian components with diagonal covariance matrices.

The UBM is trained with a classical EM algorithm based on the ML principle on a subset of the HUB-4 broadcast news corpus. The UBM learning set is composed of both male and female data and both wide- and narrow-band data. Variance flooring is applied during the training so that the variance of each Gaussian is no less than 0.5 times the variance of the corresponding UBM Gaussian. A sliding CMS is applied on each training data set before learning; the sliding window is 3 s long. The CMS is performed in order to remove the influence of the various channels (due to the high number of speakers and records in the UBM corpus). Moreover, preliminary experiments have shown an improvement of the speaker diarization accuracy when the UBM features were normalized.

The adaptation scheme is based on a variant of MAP developed by the LIA (Meignier et al., 2001). The relative weights of the UBM and the estimated data result from a combination of the UBM and estimated speaker Gaussian weights (respectively $w_i^{UBM}$, $w_i^{E}$ for the Gaussian $i$) and a priori weights (respectively $\alpha$, $1-\alpha$). The mean $\mu_i$ of the speaker model is obtained by

\[
\mu_i = \frac{\alpha\, w_i^{UBM}}{\alpha\, w_i^{UBM} + (1-\alpha)\, w_i^{E}}\, \mu_i^{UBM}
      + \frac{(1-\alpha)\, w_i^{E}}{\alpha\, w_i^{UBM} + (1-\alpha)\, w_i^{E}}\, \mu_i^{E}.
\tag{4}
\]

Experimentally, $\alpha$ is fixed to 0.2 for the UBM. This setting corresponds to the value that minimizes the speaker diarization error over the development corpus.

HMM. The HMM emission probabilities are estimated by computing the mean of the frame-based log likelihoods over a 0.3 s sliding window for each state. This 0.3-s score rate (systems are generally based on a frame score rate) makes it possible to smooth out local speaker changes and to modify the intrinsic exponential duration law of the states.

The HMM transition probabilities are fixed according to the following rules:
- Each transition probability $a_{i,i}$ (from state $S_i$ to state $S_i$) is equal to an a priori value $\gamma$.
- Each transition probability $a_{i,j}$ (from state $S_i$ to state $S_j$) is equal to

\[
a_{i,j} = \frac{1-\gamma}{n-1}
\tag{5}
\]

with $i \neq j$, where $n$ is the number of states (i.e., speakers).

In this paper, the $\gamma$ value is set to 0.6. This setting corresponds to the value that minimizes the speaker diarization error over the development corpus.
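The three numerical ingredients of this section (the mean interpolation of Eq. (4), the transition rule of Eq. (5) and the 0.3 s score smoothing) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the LIA code: the function names are invented here, the single-state case of the transition matrix is handled by an assumed self-loop of 1, and the smoothing window is taken as trailing (the paper does not say whether it is centred); 30 frames correspond to 0.3 s at the 10 ms frame rate.

```python
def map_adapt_mean(mu_ubm, mu_est, w_ubm, w_est, alpha=0.2):
    """Eq. (4) for one Gaussian: interpolate the UBM mean and the estimated mean."""
    denom = alpha * w_ubm + (1.0 - alpha) * w_est
    if denom == 0:
        raise ValueError("both Gaussian weights are zero")
    c_ubm = alpha * w_ubm / denom            # relative weight of the UBM mean
    c_est = (1.0 - alpha) * w_est / denom    # relative weight of the estimated mean
    return [c_ubm * u + c_est * e for u, e in zip(mu_ubm, mu_est)]


def transition_matrix(n_states, gamma=0.6):
    """Eq. (5): gamma on the diagonal, (1 - gamma) / (n - 1) elsewhere."""
    if n_states == 1:
        return [[1.0]]  # assumption: a single state can only loop on itself
    off_diagonal = (1.0 - gamma) / (n_states - 1)
    return [[gamma if i == j else off_diagonal for j in range(n_states)]
            for i in range(n_states)]


def smooth_scores(frame_log_likelihoods, window_frames=30):
    """Mean of the frame log likelihoods over a trailing window (assumed layout)."""
    smoothed = []
    for t in range(len(frame_log_likelihoods)):
        start = max(0, t - window_frames + 1)
        chunk = frame_log_likelihoods[start:t + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

With $\alpha = 0.2$ and equal Gaussian weights, the adapted mean sits at 80% of the way from the UBM mean to the estimated mean, which shows how strongly this setting favours the speaker data.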

Speaker re-segmentation
The use of a re-segmentation phase at the end of a clustering process was earlier proposed, for example in Wilcox et al. (1994), Gauvain et al. (2001), Reynolds et al. (2000) and Adami et al. (2002). The two main methods are based on GMM/HMM models and make decisions at the frame level:
- thanks to Viterbi decoding (Wilcox et al., 1994; Gauvain et al., 2001);
- or over scores computed in a sliding window (Reynolds et al., 2000; Adami et al., 2002).
The process can be run iteratively, but Reynolds et al. (2000) have shown that iterating degrades the performance.

The ELISA re-segmentation stage is also based on a Viterbi decoding (similar to the "Adapting speaker models" step, described in Section 2.3.1). Firstly, the four gender- and channel-dependent segmentations are merged by simply pooling the segmentations (there is no overlap between sub-segmentations). Secondly, the speaker-model adaptation and Viterbi decoding are performed iteratively. At the end of each iteration, the speakers with less than 4 s of signal are removed.

During the re-segmentation process, the parameters are similar to those used for the E-HMM clustering process, except for the model training method. In this case, the classical mean-only MAP adaptation is performed to obtain speaker models (Gauvain and Lee, 1994; Reynolds et al., 2000) instead of the variant MAP technique proposed by the LIA and described above under "Speaker models". The adaptation rate of the means is controlled by the relevance factor (Reynolds et al., 2000), which is experimentally set to 16. Moreover, a tiny gain is obtained over the development corpus when the HMM emission probability score rate is reduced from 0.3 to 0.2 s, since this reduction helps to refine the boundaries of the output segmentation.

Fusion of systems
Since the NIST 2002 evaluation, CLIPS and LIA have investigated different strategies for combining the systems.
In this paper, only strategies for broadcast news data are described [1]. Basically, the aim of these strategies is to benefit from the advantages of both speaker diarization approaches, described in previous sections. Two kinds of strategy are proposed: firstly, a hybridization strategy and secondly, merging various segmentation outputs. The latter is a new way of combining results coming from multiple and unlimited diarization systems.

[1] The reader is invited to look at Moraru et al. (2003) for the telephone strategy.

Hybridization strategy ("piped" system)
The purpose of this hybridization strategy is to use the results of one system to initialize a second one. In this paper, the speakers detected by the step-by-step system (number of speakers and associated audio segments) are inserted in the re-segmentation module of the integrated system (the models are trained using the information provided by the clustering phase) as illustrated in Fig. 6. This solution associates the advantages of longer and (quite) pure segments, provided by the step-by-step approach, with the HMM modeling and decoding power of the integrated strategy.

Fig. 6. Example of a piped system. (Blocks: CLIPS segmentation, followed by LIA re-segmentation.)

Merging strategy ("fusion" system)
The aim of the fusion system consists of using the segmentation outputs issued from as many experts as possible. For example, in this paper the total number of experts is four (see Fig. 7): the step-by-step system, the integrated system, a variant of the integrated system, and the piped system (see Section 3.1). The merging strategy relies on a frame-based decision which consists of grouping the labels proposed by each of the systems at the frame level. An example (for four systems denoted A, B, C and D) is presented below:
- Frame i: System A gives the speaker label A1, System B gives B4, System C gives C1 and System D gives D1. A1B4C1D1 is then the merged label.
- Frame i + 1: System A gives A2, System B gives B4, System C gives C1 and System D gives D1. A2B4C1D1 is then the merged label.

This label merging method generates (before re-segmentation) a large set of potential speakers. The re-segmentation module of the integrated system can be applied on the merged diarization. Between each adaptation/decoding phase, the potential speakers whose total time is shorter than 3 s are deleted. Indeed, 3 s of signal correspond to the minimal length needed to learn a speaker model. The data of these deleted speakers will further be dispatched between the remaining speakers during the next adaptation/decoding phase.

Fig. 7. Example of a merging system. (The per-system labels A0/A1, B0/B1, C0/C1 and D0/D1 are merged frame by frame into labels such as A1B1C1D1, A0B1C0D1, A0B1C1D0, A1B1C1D0 and A0B0C0D0, which are then passed to the LIA re-segmentation.)

Experiments and results
The experiments were carried out in the framework of the NIST-RT'03S speaker diarization evaluation on American broadcast news (NIST-RT'03S, 2003).

Development and evaluation corpora
Following the NIST-RT'03S evaluation campaign, two corpora are available for the speaker diarization task. One of them is used for the development of the systems, which are validated on the second one during a blind evaluation. The development corpus used in this paper is not the official one. It is issued from the RT'02 broadcast news evaluation and was used in the December 2002 RT dry-run. Initially extracted from the HUB-4 evaluation campaign corpus, it is composed of six broadcast news shows of about 10 min each, recorded in 1998 from channels MNB, CNN, NBC, PRI, VOA and ABC. The RT'03S evaluation corpus, on the other hand, is composed of three 30-min shows recorded in 2001 from channels PRI, VOA and MNB. Some information (average size, average number of speakers, etc.) related to these corpora is given in Table 2.

In this paper, these development and evaluation corpora are named respectively RT'03S-Dev and RT'03S-Eva. Two additional corpora are used during the experiments. Both of them are derived from RT'03S-Dev and RT'03S-Eva by manually discarding all the advertisement portions before processing [2]. Besides, some speech material, not used in the RT'03S-Dev and RT'03S-Eva corpora, was maintained in the second ones. They are named ELISA-Dev (derived from RT'03S-Dev) and ELISA-Eva (derived from RT'03S-Eva) and serve the same role as the original corpora, i.e., system development and evaluation purposes. The use of these additional corpora during experiments may explain why some results presented in this paper do not correspond exactly to the official NIST-RT'03S results.

In order to evaluate the accuracy of the acoustic macro-class segmentation, a reference segmentation including the different targeted acoustic classes (speech/non-speech, gender labels, and telephone/non-telephone speech) was necessary. Since NIST does not provide any official reference for the bandwidth classification, the authors have marked their own. Both the boundaries and labels were manually identified. This reference segmentation will be referred to as Hand S/NS-Gender-T/NT later in this paper.

Moreover, it is worth noting that due to the small size of the different corpora, all the results presented in this paper have to be considered with caution.

[2] Commercials, present in the audio documents, are not scored for the RT'03S evaluation campaign. Nevertheless, their presence during the segmentation process may disturb the systems since they involve additional speakers, entirely irrelevant in the output segmentation.

Table 2. Description of the different corpora, in terms of number of shows, average duration, and average number of speakers
Corpus | Number of shows | Average duration (s) | Average number of speakers
RT'03S-Dev | | | >12 (a)
RT'03S-Eva | | | >19
ELISA-Dev | | |
ELISA-Eva | | |
(a) As commercials are not manually transcribed, the exact number of speakers is unknown.

Evaluation metric
The speaker diarization performance is evaluated by comparing the hypothesis segmentation, given by the system, with the reference segmentation provided by NIST. This reference segmentation was generated by hand according to a set of rules described in NIST-RT'03S (2003) and NIST (2003).

The evaluation metric is based on the NIST speaker diarization metric defined in the NIST-RT'03S evaluation plan (NIST-RT'03S, 2003). It is called the diarization metric, and is expressed in terms of diarization error rate (DER). It takes three kinds of error into account (named SE, MiE and FaE, respectively, in the next sections):
- A speaker error, defined below (SE).
- A missed speaker error, relative to a misclassification of speech segments as non-speech segments (MiE).
- A false alarm speaker error, relative to a misclassification of non-speech segments as speech segments (FaE).

To compute the speaker error, the scoring algorithm optimally maps the reference speakers to the hypothesis speakers. Each reference speaker is mapped onto one hypothesis speaker at most and, conversely, each hypothesis speaker is mapped onto one reference speaker at most. The mapping maximizes the overlap in duration between all pairs of reference and hypothesis speakers. The speaker error is finally expressed as the duration of non-matching zones between reference and hypothesis segments.
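A simplified, frame-level version of this metric can be sketched as follows. It is an illustration under assumptions, not the NIST scoring tool: time is discretised into frames carrying at most one speaker each, non-speech is encoded as `None`, the rates are normalised by the number of reference speech frames, and the optimal one-to-one mapping is found by exhaustive search, which is only practical for a handful of speakers.

```python
from collections import Counter
from itertools import permutations


def best_mapping(ref, hyp):
    """One-to-one reference-to-hypothesis mapping maximizing the overlap."""
    overlap = Counter((r, h) for r, h in zip(ref, hyp)
                      if r is not None and h is not None)
    ref_speakers = sorted({r for r in ref if r is not None})
    hyp_speakers = sorted({h for h in hyp if h is not None})
    # Padding with None lets any reference speaker remain unmapped.
    candidates = hyp_speakers + [None] * len(ref_speakers)
    best, best_score = {}, -1
    for assignment in permutations(candidates, len(ref_speakers)):
        pairs = [(r, h) for r, h in zip(ref_speakers, assignment) if h is not None]
        score = sum(overlap[pair] for pair in pairs)
        if score > best_score:
            best, best_score = dict(pairs), score
    return best


def diarization_error_rate(ref, hyp):
    """Return missed, false alarm, speaker and total error rates (fractions)."""
    mapping = best_mapping(ref, hyp)
    missed = false_alarm = speaker_error = ref_speech = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            ref_speech += 1
            if h is None:
                missed += 1             # speech classified as non-speech
            elif mapping.get(r) != h:
                speaker_error += 1      # speech attributed to the wrong speaker
        elif h is not None:
            false_alarm += 1            # non-speech classified as speech
    if ref_speech == 0:
        raise ValueError("the reference contains no speech")
    return {"missed": missed / ref_speech,
            "false_alarm": false_alarm / ref_speech,
            "speaker": speaker_error / ref_speech,
            "der": (missed + false_alarm + speaker_error) / ref_speech}
```

For larger speaker sets, an assignment algorithm such as the Hungarian method would replace the exhaustive search; the search is kept here only because it makes the "at most one partner on each side" constraint explicit.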
The gender- and bandwidth-misclassification errors are measured at the frame level by comparing the hypothesis classification with the reference segmentation proposed by the authors (Hand S/NS-Gender-T/NT).

Acoustic macro-class segmentation experiments
This section presents the evaluation protocol used to measure the impact of the acoustic macro-class segmentation when combined with speaker diarization and discusses the experimental results obtained in this framework. Different levels of acoustic segmentation granularity are evaluated on both speaker diarization systems:

- Speech/non-speech classification only (S/NS). This segmentation corresponds to the first level of the acoustic macro-class segmentation described earlier.
- Segmentation based on speech/non-speech and gender detection (S/NS-Gender). This segmentation is obtained by merging all the labels GS-XX, GSM-XX, GDS-XX and GT-XX yielded by the acoustic macro-class segmentation (see Fig. 2) in a single XX label where XX represents either Ma or Fe.
- Segmentation based on speech/non-speech, gender and telephone/non-telephone speech detection (S/NS-Gender-T/NT). The NT segmentation is obtained by merging all the GS-XX, GSM-XX, and GDS-XX labels (see Fig. 2) in a single NT-XX label where XX represents either Ma or Fe.
- Segmentation based on speech/non-speech, gender and telephone/clean speech/speech over music/degraded speech (S/NS-Gender-T/S/MS/DS). In this segmentation, all the labels yielded by the third level of the acoustic macro-class segmentation system are used (see Fig. 2).
For comparison purposes, speaker diarization results based on the reference acoustic macro-class segmentation, Hand S/NS-Gender-T/NT, are also presented.

Intrinsic performance of acoustic macro-class segmentation
Table 3 provides the performance of the acoustic macro-class segmentation on both RT'03S-Dev and RT'03S-Eva corpora. Some details about the amount of data for each targeted class are reported in Table 4.

The speech/non-speech segmentation error is around 4.9% (in terms of duration), to be compared with the rate of the best system during the NIST-RT'03S evaluation campaign (NIST, 2003). The gender detection error ranges from 1.5% for the RT'03S-Dev set to 5.5% for the RT'03S-Eva set. As said in the description of the corpora, the reference segmentation provided by NIST does not include telephone/non-telephone information. Therefore, the accuracy of the acoustic segmentation system for the telephone and non-telephone classification is evaluated using reference boundaries marked by the authors (Hand S/NS-Gender-T/NT): the error is less than 0.1% for the RT'03S-Dev corpus and 3% for the RT'03S-Eva corpus.

Table 3. Classification error rates made by the acoustic macro-class segmentation system on the RT'03S-Dev and RT'03S-Eva sets according to the different classes available on the audio material (speech, non-speech, gender and telephone/non-telephone). Automatic and manual acoustic segmentations are compared at the frame level.
Corpus | Classification error rate (%): Speech | Non-speech | Gender | Telephone/non-telephone
RT'03S-Dev | | | |
RT'03S-Eva | | | |

Table 4. Amount of data for each targeted acoustic class: speech/non-speech classes, female and male speech classes, telephone and non-telephone speech classes
Corpus | Data amount (s): Speech | Non-speech | Female | Male | Telephone | Non-telephone
RT'03S-Dev | | | | | |
RT'03S-Eva | | | | | |

Performance of speaker diarization
This section presents the experimental results obtained when applying different levels of acoustic macro-class segmentation prior to the speaker diarization systems (integrated and step-by-step methods). Experiments are conducted on ELISA-Dev and ELISA-Eva corpora.

Table 5 provides the results obtained individually by each speaker diarization system before applying the re-segmentation step described in Section 2.4, whereas Table 6 provides the results obtained after the re-segmentation step. Three kinds of observation may be pointed out through these results, expressed in terms of missed speaker error rate (MiE), false alarm speaker error rate (FaE), speaker error rate (SE) and diarization error rate (DER):

(a) Concerning the corpora (ELISA-Dev and ELISA-Eva), a large variation in terms of performance may be observed between the speaker diarization systems depending on the corpus used. Indeed, the performance of the integrated system drastically decreases on the ELISA-Eva corpus compared with ELISA-Dev (e.g., from 14.8% to 27.3% for S/NS-Gender-T/NT acoustic segmentation) while the step-by-step system performance remains quite steady whatever the corpus used.
Table 5. Error rates, expressed in terms of missed speaker (MiE), false alarm speaker (FaE), speaker (SE) and diarization (DER) error rates (%), obtained by each speaker diarization system before applying the re-segmentation step when combined with different levels of acoustic macro-class segmentation. Experiments conducted on ELISA-Dev and ELISA-Eva corpora.
Acoustic segmentation | ELISA-Dev: MiE | FaE | SE | DER | ELISA-Eva: MiE | FaE | SE | DER
Step-by-step system
Hand S/NS-Gender-T/NT | | | | | | | |
S/NS | | | | | | | |
S/NS-Gender | | | | | | | |
S/NS-Gender-T/NT | | | | | | | |
S/NS-Gender-T/S/MS/DS | | | | | | | |
Integrated system
Hand S/NS-Gender-T/NT | | | | | | | |
S/NS | | | | | | | |
S/NS-Gender | | | | | | | |
S/NS-Gender-T/NT | | | | | | | |
S/NS-Gender-T/S/MS/DS | | | | | | | |
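The four levels of acoustic segmentation compared in these experiments differ only in how the fine-grained macro-class labels are collapsed, which can be sketched as a simple label mapping. This is an illustrative sketch with several assumptions: labels are written as `PREFIX-GENDER` strings such as `GS-Ma`, the non-speech label `NS`, the generic speech label `S` and the telephone label `T-XX` are names chosen here and not confirmed by the paper, and the frame-level error rate mirrors the frame-by-frame comparison used for the classification error rates.

```python
LEVELS = ("S/NS", "S/NS-Gender", "S/NS-Gender-T/NT", "S/NS-Gender-T/S/MS/DS")


def collapse_label(label, level):
    """Map a fine-grained macro-class label onto the requested granularity."""
    if level not in LEVELS:
        raise ValueError("unknown segmentation level: " + level)
    if label == "NS":                 # assumed name of the non-speech class
        return "NS"
    prefix, gender = label.split("-")  # e.g. "GSM-Fe" -> ("GSM", "Fe")
    if level == "S/NS":
        return "S"                    # assumed name of the generic speech class
    if level == "S/NS-Gender":
        return gender                 # GS-XX, GSM-XX, GDS-XX, GT-XX -> XX
    if level == "S/NS-Gender-T/NT":
        # GS-XX, GSM-XX and GDS-XX -> NT-XX; GT-XX -> T-XX (assumed name)
        return ("T-" if prefix == "GT" else "NT-") + gender
    return label                      # finest level keeps every label


def frame_error_rate(reference, hypothesis):
    """Fraction of frames whose hypothesis label differs from the reference."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("label sequences must be non-empty and of equal length")
    errors = sum(1 for r, h in zip(reference, hypothesis) if r != h)
    return errors / len(reference)
```

For instance, `collapse_label("GDS-Ma", "S/NS-Gender-T/NT")` yields `NT-Ma`, while the same label at the S/NS-Gender level yields `Ma`.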


Dimensions of Classroom Behavior Measured by Two Systems of Interaction Analysis

Dimensions of Classroom Behavior Measured by Two Systems of Interaction Analysis Dimensions of Classroom Behavior Measured by Two Systems of Interaction Analysis the most important and exciting recent development in the study of teaching has been the appearance of sev eral new instruments

More information

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh

The Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh The Effect of Discourse Markers on the Speaking Production of EFL Students Iman Moradimanesh Abstract The research aimed at investigating the relationship between discourse markers (DMs) and a special

More information

E mail: Phone: LIBRARY MBA MAIN OFFICE

E mail: Phone: LIBRARY MBA MAIN OFFICE MASTER OF BUSINESS ADMINISTRATION 1 Jennifer Brandow, MBA Director E mail: mba@wsc.edu Phone: 402.375.7587 MBA OFFICE Gardner Hall 106 1111 Main St. Wayne, NE 68787 ADMISSIONS 402.375.7234 admissions@wsc.edu

More information

On document relevance and lexical cohesion between query terms

On document relevance and lexical cohesion between query terms Information Processing and Management 42 (2006) 1230 1247 www.elsevier.com/locate/infoproman On document relevance and lexical cohesion between query terms Olga Vechtomova a, *, Murat Karamuftuoglu b,

More information

Evaluation of Teach For America:

Evaluation of Teach For America: EA15-536-2 Evaluation of Teach For America: 2014-2015 Department of Evaluation and Assessment Mike Miles Superintendent of Schools This page is intentionally left blank. ii Evaluation of Teach For America:

More information

New Insights Into Hierarchical Clustering And Linguistic Normalization For Speaker Diarization

New Insights Into Hierarchical Clustering And Linguistic Normalization For Speaker Diarization New Insights Into Hierarchical Clustering And Linguistic Normalization For Speaker Diarization Simon BOZONNET A doctoral dissertation submitted to: TELECOM ParisTech in partial fulfillment of the requirements

More information

Tap vs. Bottled Water

Tap vs. Bottled Water Tap vs. Bottled Water CSU Expository Reading and Writing Modules Tap vs. Bottled Water Student Version 1 CSU Expository Reading and Writing Modules Tap vs. Bottled Water Student Version 2 Name: Block:

More information

ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS

ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS Annamaria Mesaros 1, Toni Heittola 1, Antti Eronen 2, Tuomas Virtanen 1 1 Department of Signal Processing Tampere University of Technology Korkeakoulunkatu

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

Active Learning. Yingyu Liang Computer Sciences 760 Fall

Active Learning. Yingyu Liang Computer Sciences 760 Fall Active Learning Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven,

More information

Disambiguation of Thai Personal Name from Online News Articles

Disambiguation of Thai Personal Name from Online News Articles Disambiguation of Thai Personal Name from Online News Articles Phaisarn Sutheebanjard Graduate School of Information Technology Siam University Bangkok, Thailand mr.phaisarn@gmail.com Abstract Since online

More information

Review in ICAME Journal, Volume 38, 2014, DOI: /icame

Review in ICAME Journal, Volume 38, 2014, DOI: /icame Review in ICAME Journal, Volume 38, 2014, DOI: 10.2478/icame-2014-0012 Gaëtanelle Gilquin and Sylvie De Cock (eds.). Errors and disfluencies in spoken corpora. Amsterdam: John Benjamins. 2013. 172 pp.

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Corpus Linguistics (L615)

Corpus Linguistics (L615) (L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives

More information

Edinburgh Research Explorer

Edinburgh Research Explorer Edinburgh Research Explorer Personalising speech-to-speech translation Citation for published version: Dines, J, Liang, H, Saheer, L, Gibson, M, Byrne, W, Oura, K, Tokuda, K, Yamagishi, J, King, S, Wester,

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Yanzhang He, Eric Fosler-Lussier Department of Computer Science and Engineering The hio

More information

Mandarin Lexical Tone Recognition: The Gating Paradigm

Mandarin Lexical Tone Recognition: The Gating Paradigm Kansas Working Papers in Linguistics, Vol. 0 (008), p. 8 Abstract Mandarin Lexical Tone Recognition: The Gating Paradigm Yuwen Lai and Jie Zhang University of Kansas Research on spoken word recognition

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Segregation of Unvoiced Speech from Nonspeech Interference

Segregation of Unvoiced Speech from Nonspeech Interference Technical Report OSU-CISRC-8/7-TR63 Department of Computer Science and Engineering The Ohio State University Columbus, OH 4321-1277 FTP site: ftp.cse.ohio-state.edu Login: anonymous Directory: pub/tech-report/27

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information

In Workflow. Viewing: Last edit: 10/27/15 1:51 pm. Approval Path. Date Submi ed: 10/09/15 2:47 pm. 6. Coordinator Curriculum Management

In Workflow. Viewing: Last edit: 10/27/15 1:51 pm. Approval Path. Date Submi ed: 10/09/15 2:47 pm. 6. Coordinator Curriculum Management 1 of 5 11/19/2015 8:10 AM Date Submi ed: 10/09/15 2:47 pm Viewing: Last edit: 10/27/15 1:51 pm Changes proposed by: GODWINH In Workflow 1. BUSI Editor 2. BUSI Chair 3. BU Associate Dean 4. Biggio Center

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING

SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING Sheng Li 1, Xugang Lu 2, Shinsuke Sakai 1, Masato Mimura 1 and Tatsuya Kawahara 1 1 School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501,

More information

Foundations of Knowledge Representation in Cyc

Foundations of Knowledge Representation in Cyc Foundations of Knowledge Representation in Cyc Why use logic? CycL Syntax Collections and Individuals (#$isa and #$genls) Microtheories This is an introduction to the foundations of knowledge representation

More information

What do Medical Students Need to Learn in Their English Classes?

What do Medical Students Need to Learn in Their English Classes? ISSN - Journal of Language Teaching and Research, Vol., No., pp. 1-, May ACADEMY PUBLISHER Manufactured in Finland. doi:.0/jltr...1- What do Medical Students Need to Learn in Their English Classes? Giti

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

Preprint.

Preprint. http://www.diva-portal.org Preprint This is the submitted version of a paper presented at Privacy in Statistical Databases'2006 (PSD'2006), Rome, Italy, 13-15 December, 2006. Citation for the original

More information

A Study of the Effectiveness of Using PER-Based Reforms in a Summer Setting

A Study of the Effectiveness of Using PER-Based Reforms in a Summer Setting A Study of the Effectiveness of Using PER-Based Reforms in a Summer Setting Turhan Carroll University of Colorado-Boulder REU Program Summer 2006 Introduction/Background Physics Education Research (PER)

More information

Characteristics of the Text Genre Informational Text Text Structure

Characteristics of the Text Genre Informational Text Text Structure LESSON 4 TEACHER S GUIDE by Taiyo Kobayashi Fountas-Pinnell Level C Informational Text Selection Summary The narrator presents key locations in his town and why each is important to the community: a store,

More information

STUDENTS' RATINGS ON TEACHER

STUDENTS' RATINGS ON TEACHER STUDENTS' RATINGS ON TEACHER Faculty Member: CHEW TECK MENG IVAN Module: Activity Type: DATA STRUCTURES AND ALGORITHMS I CS1020 LABORATORY Class Size/Response Size/Response Rate : 21 / 14 / 66.67% Contact

More information

TEAM NEWSLETTER. Welton Primar y School SENIOR LEADERSHIP TEAM. School Improvement

TEAM NEWSLETTER. Welton Primar y School SENIOR LEADERSHIP TEAM. School Improvement Welton Primar y School February 2016 SENIOR LEADERSHIP TEAM NEWSLETTER SENIOR LEADERSHIP TEAM Nikki Pidgeon Head Teacher Sarah Millar Lead for Behaviour, SEAL and PE Laura Leitch Specialist Leader in Education,

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

How to Judge the Quality of an Objective Classroom Test

How to Judge the Quality of an Objective Classroom Test How to Judge the Quality of an Objective Classroom Test Technical Bulletin #6 Evaluation and Examination Service The University of Iowa (319) 335-0356 HOW TO JUDGE THE QUALITY OF AN OBJECTIVE CLASSROOM

More information

A General Class of Noncontext Free Grammars Generating Context Free Languages

A General Class of Noncontext Free Grammars Generating Context Free Languages INFORMATION AND CONTROL 43, 187-194 (1979) A General Class of Noncontext Free Grammars Generating Context Free Languages SARWAN K. AGGARWAL Boeing Wichita Company, Wichita, Kansas 67210 AND JAMES A. HEINEN

More information

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders

More information

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.

More information

Phonological and Phonetic Representations: The Case of Neutralization

Phonological and Phonetic Representations: The Case of Neutralization Phonological and Phonetic Representations: The Case of Neutralization Allard Jongman University of Kansas 1. Introduction The present paper focuses on the phenomenon of phonological neutralization to consider

More information

Cued Recall From Image and Sentence Memory: A Shift From Episodic to Identical Elements Representation

Cued Recall From Image and Sentence Memory: A Shift From Episodic to Identical Elements Representation Journal of Experimental Psychology: Learning, Memory, and Cognition 2006, Vol. 32, No. 4, 734 748 Copyright 2006 by the American Psychological Association 0278-7393/06/$12.00 DOI: 10.1037/0278-7393.32.4.734

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

CHANCERY SMS 5.0 STUDENT SCHEDULING

CHANCERY SMS 5.0 STUDENT SCHEDULING CHANCERY SMS 5.0 STUDENT SCHEDULING PARTICIPANT WORKBOOK VERSION: 06/04 CSL - 12148 Student Scheduling Chancery SMS 5.0 : Student Scheduling... 1 Course Objectives... 1 Course Agenda... 1 Topic 1: Overview

More information

Data Fusion Models in WSNs: Comparison and Analysis

Data Fusion Models in WSNs: Comparison and Analysis Proceedings of 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1) Data Fusion s in WSNs: Comparison and Analysis Marwah M Almasri, and Khaled M Elleithy, Senior Member,

More information

Digital Signal Processing: Speaker Recognition Final Report (Complete Version)

Digital Signal Processing: Speaker Recognition Final Report (Complete Version) Digital Signal Processing: Speaker Recognition Final Report (Complete Version) Xinyu Zhou, Yuxin Wu, and Tiezheng Li Tsinghua University Contents 1 Introduction 1 2 Algorithms 2 2.1 VAD..................................................

More information

Dates and Prices 2016

Dates and Prices 2016 Dates and Prices 2016 ICE French Language Courses www.ihnice.com 27, Rue Rossini - 06000 Nice - France Phone: +33(0)4 93 62 60 62 / Fax: +33(0)4 93 80 53 09 E-mail: info@ihnice.com 1 FRENCH COURSES - 2016

More information

Measurement. When Smaller Is Better. Activity:

Measurement. When Smaller Is Better. Activity: Measurement Activity: TEKS: When Smaller Is Better (6.8) Measurement. The student solves application problems involving estimation and measurement of length, area, time, temperature, volume, weight, and

More information

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models

Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &

More information

PUPIL PREMIUM REVIEW

PUPIL PREMIUM REVIEW PUPIL PREMIUM REVIEW 2015-2016 Pupil Premium Review 2015/2016 Ambition The school aims to provide pupils with a consistently good quality of provision for all pupils. We aim to maximise the progress of

More information

TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE. Pierre Foy

TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE. Pierre Foy TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE Pierre Foy TIMSS Advanced 2015 orks User Guide for the International Database Pierre Foy Contributors: Victoria A.S. Centurino, Kerry E. Cotter,

More information