THE NIST 2004 SPRING RICH TRANSCRIPTION EVALUATION: TWO-AXIS MERGING STRATEGY IN THE CONTEXT OF MULTIPLE DISTANT MICROPHONE BASED MEETING SPEAKER SEGMENTATION

Corinne Fredouille (2), Daniel Moraru (1), Sylvain Meignier (2), Laurent Besacier (1), Jean-François Bonastre (2)
1 CLIPS-IMAG (UJF & CNRS) - BP 53 - 38041 Grenoble Cedex 9 - France
2 LIA-Avignon - BP 1228 - 84911 Avignon Cedex 9 - France
(daniel.moraru,laurent.besacier)@imag.fr
(sylvain.meignier,corinne.fredouille,jean-francois.bonastre)@lia.univ-avignon.fr

ABSTRACT

This paper presents the ELISA speaker segmentation approach applied to multiple audio channel meeting recordings in the framework of the NIST RT 04s meeting (spring) evaluation campaign. As for broadcast news (BN) speaker segmentation, the ELISA meeting system involves two speaker segmentation systems developed individually by the CLIPS and LIA laboratories. The main originality consists in a two-axis merging strategy, proposed to deal with both multiple expert segmentation outputs and multiple microphone segmentation outputs. While the expert merging strategy did not really improve performance, the individual microphone segmentation merging strategy made it possible to produce a global segmentation output from several audio channels (microphones) with acceptable performance. The best system obtained a 22.6% diarization error rate during the NIST RT 04s meeting evaluation.

1. INTRODUCTION

The goal of speaker diarization (or segmentation) is to segment an N-speaker audio document into homogeneous parts containing the voice of only one speaker (the speaker change detection process) and to associate the resulting segments by matching those belonging to the same speaker (the clustering process). In speaker diarization, the intrinsic difficulty of the task increases with the data concerned: (two-speaker) telephone conversations, broadcast news, meeting data. This paper is related to speaker diarization on meeting data in the framework of the NIST 2004 spring meeting Rich Transcription (RT 04s) evaluation. Meeting data present three main specificities compared to BN data [1].
Firstly, the speech is fully spontaneous, highly interactive across participants, and presents a large number of disfluencies as well as speaker segment overlaps. Secondly, the meeting room recording conditions associated with distant (table) microphones lead to noisy recordings, including background noises, reverberation and distant speakers. Thirdly, meeting conversations are recorded in smart spaces where multiple sensors are used. Thus, the speaker diarization system has to treat multiple speech channels coming from multiple microphones. The choice of an efficient merging strategy that discards the irrelevant information is then an important issue. This last point is the core problem addressed in this paper.

Section 2 of this paper presents the two ELISA speaker diarization systems. Section 3 describes the strategies used to specifically treat meeting data by merging multiple microphone segmentation outputs and, optionally, multiple experts. Section 4 presents the experimental protocols and results. Finally, section 5 concludes this work.

2. SPEAKER SEGMENTATION SYSTEMS

Two speaker segmentation systems are involved in this work, developed individually by the CLIPS and LIA laboratories in the framework of the ELISA consortium [2]. Both of them participated in the Rich Transcription 2003 evaluation campaign (RT 03) for the speaker segmentation task on broadcast news data [3]. No particular tuning was done on either system for the RT 04s evaluation campaign, except the use of a speech/non-speech segmentation as a preliminary phase to deal with the specificities of meeting data.

2.1 Speech/non-speech segmentation

The speech/non-speech segmentation system consists in a silence detection based only on a bi-Gaussian modeling of the energy distribution associated with a detection threshold. The minimal silence segment length is set to 0.5 s.

2.2 The LIA System

The LIA system is based on Hidden Markov Modeling (HMM) of the conversation. Each state of the HMM characterizes a speaker and the transitions model the changes between speakers.
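The energy-based speech/non-speech detector of section 2.1 can be sketched as follows. This is a minimal illustration under our own assumptions, not the ELISA implementation: we fit the two-component energy mixture with a plain 1-D EM and place the decision threshold at the midpoint between the two means (the paper does not specify how the threshold is derived from the bi-Gaussian fit).

```python
import numpy as np

def bigaussian_threshold(energy, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture to frame energies with EM and
    return a silence/speech decision threshold (here: midpoint of the means)."""
    e = np.asarray(energy, dtype=float)
    mu = np.array([np.quantile(e, 0.25), np.quantile(e, 0.75)])  # init from quantiles
    var = np.array([e.var(), e.var()]) + 1e-8
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame
        dens = np.exp(-0.5 * (e[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        n = r.sum(axis=0)
        w = n / len(e)
        mu = (r * e[:, None]).sum(axis=0) / n
        var = (r * (e[:, None] - mu) ** 2).sum(axis=0) / n + 1e-8
    return 0.5 * (mu.min() + mu.max())

def silence_segments(energy, frame_step=0.01, min_len=0.5):
    """Label frames below the bi-Gaussian threshold as silence and keep only
    silence runs at least min_len seconds long (0.5 s in the paper)."""
    thr = bigaussian_threshold(energy)
    sil = np.asarray(energy) < thr
    segments, start = [], None
    for i, s in enumerate(np.append(sil, False)):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * frame_step >= min_len:
                segments.append((start * frame_step, i * frame_step))
            start = None
    return segments
```

The 0.5 s minimum-length rule is the only parameter given in the paper; the quantile initialisation and midpoint threshold are illustrative choices.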
The speaker segmentation system is applied to the speech segments detected by the speech/non-speech segmentation described in section 2.1. During the segmentation, the HMM is generated by an iterative process which detects and adds a new state (i.e. a new speaker) at each iteration. This speaker detection process is then followed by a re-segmentation phase (an iterative adaptation and decoding process) which refines the speaker segmentation. The entire speaker segmentation process is described in detail in [3][4].
Concerning the front-end processing, the signal is characterized by 20 linear cepstral features (LFCC) computed every 10 ms using a 20 ms window. The cepstral features are augmented by the energy. No frame removal or coefficient normalization is applied.

2.3 The CLIPS System

The CLIPS system is based on a BIC [5] (Bayesian Information Criterion) speaker change detector followed by a hierarchical clustering. The clustering stop condition is the estimation of the number of speakers using a penalized BIC criterion. The entire speaker segmentation process is described in detail in [3][4]. Finally, the re-segmentation phase of the LIA system is also applied to the CLIPS segmentation for refinement (1). Like the LIA system, the CLIPS system is applied to the speech segments detected by the speech/non-speech segmentation. The signal is characterized by 16 mel cepstral features (MFCC) computed every 10 ms on 20 ms windows using 56 filter banks. The cepstral features are then augmented by the energy. No frame removal or coefficient normalization is applied.

3. MEETING SPEAKER SEGMENTATION STRATEGIES

Since meetings are generally recorded with multiple distant microphones, the speaker segmentation task differs greatly from other domains like broadcast news or telephone conversations. Indeed, the speaker segmentation system has to deal with multiple speech signals (from the different distant microphones) while the objective is to provide a single meeting speaker segmentation output. Moreover, depending on the distant microphone position on the table, the signal quality may differ hugely from one microphone to another. For instance, the main speaker utterances may be caught by one or two distant microphones while the other microphones mainly provide background voices, long silences, or background noise only. To deal with these different issues, two cooperative merging strategies are presented in this paper. The first one, called the expert merging strategy, aims at merging segmentations provided by different experts (two experts in this paper).
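The BIC change criterion used by the CLIPS detector (section 2.3) can be illustrated with the standard ΔBIC formulation: compare modelling two adjacent feature windows with one full-covariance Gaussian versus one Gaussian per window, minus a penalty for the extra parameters. This is a generic textbook sketch, not the CLIPS code; the penalty weight `lam` is a free tuning parameter.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between two adjacent feature windows x, y (frames x dims).
    A positive value supports a speaker change point between the windows."""
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(x) + len(y)
    d = z.shape[1]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # penalty: extra free parameters of the two-Gaussian hypothesis
    # (one extra mean vector + one extra covariance matrix)
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y)) - lam * penalty
```

Sliding this statistic along the feature stream and keeping the local maxima with ΔBIC > 0 gives the candidate change points; the same penalized criterion, applied to cluster merges, serves as a clustering stop condition.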
It is applied independently to each recording issued from a distant microphone. The second one, called the Individual Microphone Segmentation Merging (IMSM) strategy, is used to produce a single speaker segmentation output from those obtained on each individual distant microphone. The application of both strategies, also referred to as the two merging axes, horizontal and vertical, is illustrated in figure 1.

Figure 1: Two cooperative merging strategies: horizontal and vertical merging combination.

(1) This combination of the CLIPS system and the LIA re-segmentation phase was also proposed as a merging strategy during the RT 03 evaluation [4] and obtained the best performance over all the participants, with a 12.88% speaker diarization error rate.

3.1 Expert Merging Strategy

The idea of this strategy is to merge the segmentations issued from two experts, the CLIPS and LIA systems, computed independently on a given distant microphone. This strategy was already used by the LIA and CLIPS labs for the RT 03 speaker segmentation evaluation campaign on broadcast news data [4]. It relies on a frame-based decision which consists in grouping the labels proposed by both systems at the frame level before applying a re-segmentation process (see figure 2). An example of the label merging approach is illustrated below:

Frame i: Sys1 = S1, Sys2 = T4 -> label S1T4
Frame i+1: Sys1 = S2, Sys2 = T4 -> label S2T4

Figure 2: Expert merging strategy: label merge followed by re-segmentation, producing a new segmentation.

This label merging method generates (before re-segmentation) a large set of virtual speakers composed of:
- virtual speakers that have a large amount of data assigned; these could be considered the correct hypothesis speakers;
- virtual speakers generated by only one of the two systems, for example speakers associated with only one short segment (~3 s up to 10 s); these hypothesis speakers could be suppressed (their weight on the final scoring is marginal);
- virtual speakers that have a smaller amount of data scattered between multiple small segments and that could be considered zones of indecision.
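The frame-level label merge above can be sketched in a few lines. This is a minimal illustration under our own naming (the function names and the 0.01 s frame step are assumptions); the per-virtual-speaker durations it computes are what the subsequent re-segmentation uses to delete speakers shorter than 3 s.

```python
from collections import Counter

def merge_labels(seg_a, seg_b):
    """Frame-level label merge of two expert segmentations: each frame gets a
    'virtual speaker' label made of the pair of labels assigned by the two
    systems, e.g. ('S1', 'T4') -> 'S1T4'."""
    assert len(seg_a) == len(seg_b), "both systems must label the same frames"
    return [a + b for a, b in zip(seg_a, seg_b)]

def virtual_speaker_time(merged, frame_step=0.01):
    """Total time (in seconds) assigned to each virtual speaker; speakers
    under 3 s are deletion candidates during re-segmentation (see paper)."""
    return {spk: n * frame_step for spk, n in Counter(merged).items()}
```

For example, `merge_labels(['S1', 'S2', 'S2'], ['T4', 'T4', 'T4'])` yields `['S1T4', 'S2T4', 'S2T4']`: agreement zones become large virtual speakers, disagreement zones fragment into short ones.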
Based on these considerations, the LIA re-segmentation is then applied to the merged segmentation. During this iterative process, the virtual speakers whose total time is shorter than 3 s are deleted. The data of these deleted speakers are then dispatched between the remaining speakers during the next iteration. After the first iteration, the number of speakers is already drastically reduced, since speakers associated with indecision zones do not catch any data during the Viterbi decoding and are automatically removed. However, the merging strategy cannot generally correct the wrong behaviour of initial systems that split a true speaker into two hypothesis speakers, each tied to a long segment. Suppose all systems agree on a long segment except one, which splits it in two parts. This produces two virtual speakers (associated with long-duration segments) after the label merging phase and, since no clustering is applied before re-segmentation, it leads to a "true" speaker split into two virtual speakers.

3.2 Individual Microphone Segmentation Merging Strategy

The goal of this strategy is to merge the multiple distant microphone segmentations into a single meeting speaker segmentation output. Since no single signal is representative of the overall meeting, this strategy must rely on segment selection rules over the multiple distant microphone speaker segmentations.
To this end, a specific merging algorithm is proposed in this paper. Developed by the LIA and CLIPS labs, it relies on an iterative process which aims at detecting the longest speaker interventions over the set of distant microphone segmentations. This algorithm consists in 3 steps:

Step 1: select the longest speaker intervention over all microphone segmentation outputs taken separately. The longest speaker intervention means all the segments (contiguous or not) attributed to that speaker over a specific microphone segmentation. These segments are definitively attributed to a new speaker in the resulting segmentation.

Step 2: delete, in each distant microphone segmentation, all the segments attributed to the new speaker at the end of step 1.

Step 3: verify the presence of non-selected segments over all the distant microphone segmentations. If segments are still present and their total length is greater than 30 s, go back to step 1 for a new iteration; otherwise stop the process and assign the segments to a last speaker label (this last speaker can be seen as a trash speaker gathering all the short remaining segments).

One rule is used during this iterative process: if the longest speaker intervention selected during step 1 is longer than 60% of the overall signal duration, it is not considered (unless it is the last available intervention). This rule aims at discarding some very long speaker segmentation outputs, which may result from poor individual microphone segmentations (the poor quality of an individual microphone segmentation may be due, for instance, to the dominant presence of background voice/noise on the microphone signal, involving a large rate of speech/non-speech segmentation errors).

4. EXPERIMENTS AND RESULTS

4.1 Evaluation protocols

The RT 04s meeting evaluation campaign [6] proposed two main tasks: speech-to-text transcription (STT) and/or speaker segmentation (so-called diarization). For both tasks, different microphone conditions were available: multiple distant microphones, single distant microphone and individual head microphone (the latter available for STT only).
This paper addresses only speaker segmentation over multiple distant microphones. This section describes the evaluation protocols used to measure performance, presents some results and discusses the behaviour of the two-axis merging strategy.

Scoring

In order to measure performance, an optimum one-to-one mapping of reference speaker IDs to system output speaker IDs is computed, followed by a time-based speaker segmentation error rate. This scoring, proposed by NIST, is described in detail in the RT 04s evaluation plan [7]. Speaker segmentation performance is expressed in terms of speaker diarization error, comprising missed and false alarm speaker errors as well as speaker segmentation errors. NB: in this paper, the areas of overlap between speaker utterances are not scored.

Database

Since this work was done in the context of the RT 04s evaluation campaign, two meeting corpora are available, named in this paper the Dev corpus, used for the development of the systems, and the Eva corpus, used for the evaluation. Both of them are composed of two 10-minute meeting excerpts recorded at each of four different sites (CMU, ICSI, LDC, and NIST). Table 1 provides some details on the different corpora, including, for each meeting excerpt, the number of available distant microphones. For each distant microphone, its position in the meeting room is available as further information and may be used to help the speaker segmentation process. Nevertheless, the approaches presented in this paper do not take advantage of this kind of information. Finally, as for any speaker segmentation evaluation, no prior information about the number of speakers and their identity is available.

Dev                            Eva
Meeting             mic nb     Meeting             mic nb
CMU_20020319-1400   1          CMU_20030109-1530   1
CMU_20020320-1500   1          CMU_20030109-1600   1
ICSI_20010208-1430  6          ICSI_20000807-1000  6
ICSI_20010322-1450  6          ICSI_20011030-1030  6
LDC_20011116-1400   7          LDC_20011121-1700   10
LDC_20011116-1500   8          LDC_20011207-1800   4
NIST_20020214-1148  7          NIST_20030623-1409  7
NIST_20020305-1007  6          NIST_20030925-1517  7

Table 1: Number of distant microphones for each meeting of the Dev and Eva corpora.
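The optimum one-to-one speaker mapping used in scoring can be illustrated with a small stdlib sketch: choose the assignment that maximises the total time during which a reference speaker and its mapped system speaker are both active. The brute-force search below is our own illustration (real scorers solve the same optimisation with the Hungarian algorithm); the function name and the square-matrix test data are assumptions.

```python
from itertools import permutations

def best_speaker_mapping(overlap):
    """Optimum one-to-one mapping of reference to system speakers.
    overlap[i][j] = overlapping speech time between reference speaker i and
    system speaker j.  Brute force over permutations, fine for the handful
    of speakers in a single meeting."""
    n_ref, n_sys = len(overlap), len(overlap[0])
    assert n_ref <= n_sys, "sketch assumes at most as many reference as system speakers"
    best, best_map = -1.0, None
    for perm in permutations(range(n_sys), n_ref):
        score = sum(overlap[i][j] for i, j in enumerate(perm))
        if score > best:
            best, best_map = score, dict(enumerate(perm))
    return best_map, best
```

Once the mapping is fixed, the diarization error rate accumulates, over time, missed speaker time, false alarm speaker time and time attributed to a wrongly mapped speaker, divided by the total reference speaker time.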
4.2 Results

Tables 2 and 3 provide the experimental results obtained on the Dev and Eva corpora for the task of multiple distant microphone speaker segmentation. These results, expressed in terms of speaker diarization error rates, are given for three different systems:
- LIA+IMSM: the LIA speaker segmentation system applied on each individual distant microphone and followed by the Individual Microphone Segmentation Merging (IMSM) process;
- CLIPS+IMSM: the same process applied using the CLIPS speaker segmentation system followed by the IMSM process;
- Two-axis merging: application of the expert merging strategy on the LIA and CLIPS segmentations, followed by the IMSM process.

These results show:
- important differences in performance between the LIA and CLIPS systems on a same meeting file (e.g. 14.1% vs 53.4% for CMU_20020320-1500 on the Dev corpus and 37.9% vs 19.1% for ICSI_20000807-1000 on the Eva corpus);
- important differences in performance between the meetings (e.g. 7.4% vs 54.1% for the LIA system between LDC_20011116-1400 and NIST_20020305-1007 on the Dev corpus);
- a significant difference in performance between the Dev and Eva corpora (22.6% for the best overall error rate on Eva vs 28.3% on Dev), as well as a different behaviour of the systems between corpora (the LIA system is the best one on Dev and the CLIPS system the best one on Eva);
- a small performance improvement observed with the two-axis merging strategy compared to the individual systems, and only on a few meeting files (e.g. 25.3% for two-axis merging vs 28.4% for LIA and 26.7% for CLIPS on LDC_20011207-1800). Nevertheless, no gain is reached on the overall performance compared to the best individual system.

4.3 Discussion

Given the difficulty of the task (compared to broadcast news or conversational telephone data), the performance obtained by the various systems is quite satisfying, especially on the Eva corpus: 22.6% for the best system, to be compared with the 12.88% (1) obtained on BN data during RT 03. Nevertheless, the expert merging strategy applied individually on each individual microphone (two-axis merging) does not provide an additional performance gain compared to the best system. This result differs from the RT 03 one [4], where a 16% relative decrease of the diarization error was observed (from 16.90% for the best individual system to 14.24% for the expert merging based system). Moreover, the behaviour of this strategy greatly depends on the quality of the individual segmentations, which themselves depend on the quality of each stream caught by each individual microphone. One explanation of the disappointing behaviour of the expert merging strategy may be that each expert is applied separately on a missing data file (i.e. on each individual microphone recording). Thus, the performance of the two experts may be very different for a same meeting file, which is a well-known drawback in fusion (it is generally accepted that an efficient fusion must be done between experts whose performances do not differ too widely). Table 4 shows the differences between the microphones taken independently, on two different meeting examples (2). In the first example (LDC_20011116-1500), the results show a large variability in speaker error rates between the microphones (d3, d5, d6 ...). On the contrary, regarding the speech/non-speech detection, a small variability between the microphones is noted.
On this same meeting, the overall merged score is very close to the best individual microphone result, which performs quite well. The second example (NIST_20020305-1007) shows an inverse behaviour: comparable and quite reasonable speaker error rates over the set of microphones vs. high missed speech error rates with a large variability between the microphones. The differences observed between the meetings show the difficulty of defining an efficient merging strategy. To summarize, some comments can be made regarding the results:
- If one microphone is able to catch the information from all the speakers (d2 of LDC_20011116-1500 for example), this microphone could be used alone, achieving good performance (14.5% diarization error on the previous example, to be compared with 12.88% on BN data);
- If the information is present simultaneously on different microphones (with different signal qualities), the fusion process is disturbed, since it is not able to group together two (or more) parts of a given speaker detected on different microphones;
- To take advantage of the multiple microphones, it is necessary to focus on the useful information/speakers present in each recording, i.e. the speech/non-speech process should delete the far speakers (low SNR parts, background voices ...).

(2) Speaker diarization error rates provided in table 4 for each distant microphone are computed by mapping each individual microphone segmentation to the corresponding single meeting reference segmentation.
(3) The speaker error rate is computed only on well-detected speech segments (speech segments present both in the reference and in the system output).

5. CONCLUSION

We have presented the ELISA speaker segmentation approach applied to meeting speech data for the NIST RT 04s (spring) evaluation campaign. The best system obtained a 28.3% diarization error on the development corpus (Dev) and 22.6% on the evaluation corpus (Eva), to be compared with the 12.88% obtained on BN data during the NIST RT 03 evaluation. A simple two-axis merging strategy was proposed to treat multiple expert segmentation outputs and multiple microphone segmentation outputs.
While the expert merging strategy did not really lead to an improvement of the performance, the individual microphone segmentation merging strategy made it possible to provide a global segmentation output from several audio channels (microphones) with acceptable performance. To be efficient when the speaker voices are caught differently by the microphones, our simple merging strategy needs individual microphone segmentations focused only on the well-caught speakers (the background/far speakers should be suppressed). Despite the simplicity of the merging strategy proposed in this paper, the ELISA primary system presented at the RT 04s (spring) meeting evaluation obtained the best performance on the speaker diarization task.

6. REFERENCES

[1] http://www.nist.gov/speech/test_beds/mr_proj/
[2] I. Magrin-Chagnolleau, G. Gravier, and R. Blouet for the ELISA consortium, "Overview of the 2000-2001 ELISA consortium research activities", A Speaker Odyssey, pp. 67-72, Chania, Crete, June 2001.
[3] D. Moraru, S. Meignier, L. Besacier, J.-F. Bonastre, and I. Magrin-Chagnolleau, "The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation", ICASSP'03, Hong Kong, 2003.
[4] D. Moraru, S. Meignier, C. Fredouille, L. Besacier, and J.-F. Bonastre, "The ELISA consortium approaches in Broadcast News speaker segmentation during the NIST 2003 Rich Transcription evaluation", ICASSP'04, Montreal, Canada, May 2004.
[5] P. Delacourt and C. Wellekens, "DISTBIC: a speaker-based segmentation for audio data indexing", Speech Communication, Vol. 32, No. 1-2, September 2000.
[6] http://nist.gov/speech/tests/rt/rt2004/spring/
[7] http://nist.gov/speech/tests/rt/rt2004/spring/documents/rt04s-meeting-eval-plan-v1.pdf
Speaker diarization error (in %), Dev corpus

Meeting              LIA+IMSM  CLIPS+IMSM  Two-axis merging
CMU_20020319-1400    58.5      42.4        47
CMU_20020320-1500    14.1      53.4        52.7
ICSI_20010208-1430   16.9      25.9        18.9
ICSI_20010322-1450   26.5      26.8        27.1
LDC_20011116-1400    7.4       7.5         7.5
LDC_20011116-1500    13.9      16.4        18.1
NIST_20020214-1148   30.8      31.4        33.3
NIST_20020305-1007   54.1      36.8        35.5
Overall (miss. and fa non-speech err. = 5.6%)
                     28.3      29.9        29.8

Table 2: Performance (in terms of speaker diarization error rate) of the individual speaker segmentation systems (LIA and CLIPS) applied on each distant microphone followed by the Individual Microphone Segmentation Merging (IMSM) strategy, and of the two-axis merging based system. Performance given for each Dev corpus meeting signal and overall.

Error rates (in %)

         LDC_20011116-1500       NIST_20020305-1007
Micro    Mis+fa     Speaker      Mis+fa     Speaker
         err. rate  err. rate    err. rate  err. rate
d1       3.7        18.6         34.4       22.9
d2       4.9        9.6          21.8       26.3
d3       4.9        47.9         XX         XX
d4       7.4        11.6         20         29
d5       4.0        48.5         36.2       13.9
d6       3.1        48.5         29.2       19.3
d7       4.5        48.3         25.2       16.2
d8       7.3        47.6         XX         XX
IMSM     2.5        11.4         10.2       43.9

Table 4: Two examples of Individual Microphone Segmentation Merging (IMSM) strategy behaviour for the LIA+IMSM system.

Speaker diarization error (in %), Eva corpus

Meeting              LIA+IMSM  CLIPS+IMSM  Two-axis merging
CMU_20030109-1530    20.8      39.8        41.2
CMU_20030109-1600    13.7      17.8        18.8
ICSI_20000807-1000   37.9      19.1        17.2
ICSI_20011030-1030   52.1      44.2        42.2
LDC_20011121-1700    16.5      7.7         18.0
LDC_20011207-1800    28.4      26.7        25.3
NIST_20030623-1409   10.3      13.9        10.6
NIST_20030925-1517   22.9      22.7        23.8
Overall (miss. and fa non-speech err. = 7%)
                     24.4      22.6        23.4

Table 3: Performance (in terms of speaker diarization error rate) of the individual speaker segmentation systems (LIA and CLIPS) applied on each distant microphone followed by the Individual Microphone Segmentation Merging (IMSM) strategy, and of the two-axis merging based system. Performance given for each Eva corpus meeting signal and overall.