

THE NIST 2004 SPRING RICH TRANSCRIPTION EVALUATION: TWO-AXIS MERGING STRATEGY IN THE CONTEXT OF MULTIPLE DISTANT MICROPHONE BASED MEETING SPEAKER SEGMENTATION

Corinne Fredouille (2), Daniel Moraru (1), Sylvain Meignier (2), Laurent Besacier (1), Jean-François Bonastre (2)

(1) CLIPS-IMAG (UJF & CNRS) - BP 53 - 38041 Grenoble Cedex 9 - France
(2) LIA-Avignon - BP 1228 - 84911 Avignon Cedex 9 - France
(daniel.moraru,laurent.besacier)@imag.fr
(sylvain.meignier,corinne.fredouille,jean-francois.bonastre)@lia.univ-avignon.fr

ABSTRACT

This paper presents the ELISA speaker segmentation approach applied to multiple audio channel meeting recordings in the framework of the NIST RT'04s meeting (spring) evaluation campaign. As for BN data speaker segmentation, the ELISA meeting system involves two speaker segmentation systems developed individually by the CLIPS and LIA laboratories. The main originality consists in a two-axis merging strategy, proposed to deal with both multiple expert segmentation outputs and multiple microphone segmentation outputs. While the expert merging strategy did not really improve performance, the individual microphone segmentation merging strategy made it possible to produce a global segmentation output from several audio channels (microphones) with acceptable performance. The best system obtained a 22.6% diarization error rate during the NIST RT'04s meeting evaluation.

1. INTRODUCTION

The goal of speaker diarization (or segmentation) is to segment an N-speaker audio document into homogeneous parts containing the voice of only one speaker (the speaker change detection process) and to associate the resulting segments by matching those belonging to the same speaker (the clustering process). In speaker diarization, the intrinsic difficulty of the task increases according to the data concerned: (two-speaker) telephone conversations, broadcast news, meeting data. This paper is related to speaker diarization on meeting data in the framework of the NIST 2004 spring meeting Rich Transcription (RT'04s) evaluation. Meeting data present three main specificities compared to BN data [1].
Firstly, the speech is fully spontaneous, highly interactive across participants, and presents a large number of disfluencies as well as speaker segment overlaps. Secondly, the meeting room recording conditions associated with distant (table) microphones lead to noisy recordings, including background noises, reverberation and distant speakers. Thirdly, meeting conversations are recorded in smart spaces where multiple sensors are used. The speaker diarization system thus has to treat multiple speech channels coming from multiple microphones, and the choice of an efficient merging strategy to discard the irrelevant information becomes an important issue. This last point is the core problem addressed in this paper.

Section 2 of this paper presents the two ELISA speaker diarization systems. Section 3 describes the strategies used to specifically treat meeting data by merging multiple microphone segmentation outputs and, optionally, multiple experts. Section 4 presents the experimental protocols and results. Finally, Section 5 concludes this work.

2. SPEAKER SEGMENTATION SYSTEMS

Two speaker segmentation systems are involved in this work, developed individually by the CLIPS and LIA laboratories in the framework of the ELISA consortium [2]. Both of them participated in the Rich Transcription 2003 evaluation campaign (RT'03) for the speaker segmentation task on broadcast news data [3]. No particular tuning was done on either system for the RT'04s evaluation campaign, except the use of a speech/non-speech segmentation as a preliminary phase to deal with the specificities of meeting data.

2.1. Speech/non-speech segmentation

The speech/non-speech segmentation system consists of a silence detection based only on a bi-Gaussian modeling of the energy distribution, associated with a detection threshold. The minimal silence segment length is set to 0.5 s.

2.2. The LIA system

The LIA system is based on Hidden Markov Modeling (HMM) of the conversation. Each state of the HMM characterizes a speaker and the transitions model the changes between speakers.
The speaker segmentation system is applied on the speech segments detected by the speech/non-speech segmentation described in Section 2.1. During the segmentation, the HMM is generated by an iterative process which detects and adds a new state (i.e. a new speaker) at each iteration. This speaker detection process is then followed by a re-segmentation phase (an iterative adaptation and decoding process) which refines the speaker segmentation. The entire speaker segmentation process is described at length in [3][4].
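The bi-Gaussian speech/non-speech detection of Section 2.1 can be illustrated with a short sketch. This is our own minimal reading of the description, not the authors' code: a two-component 1-D EM fit on frame energies, a threshold placed between the two means, and the 0.5 s minimal silence length; all function names are ours.

```python
import numpy as np

def fit_bigaussian(energy, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture to frame energies via EM."""
    lo, hi = energy.min(), energy.max()
    mu = np.array([lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)])
    var = np.array([energy.var(), energy.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each frame
        lik = w / np.sqrt(2 * np.pi * var) * np.exp(
            -0.5 * (energy[:, None] - mu) ** 2 / var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        w = nk / len(energy)
        mu = (resp * energy[:, None]).sum(axis=0) / nk
        var = (resp * (energy[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return mu, var, w

def speech_nonspeech(energy, frame_step=0.01, min_sil=0.5):
    """Label frames as speech/silence with a threshold between the two
    fitted means, keeping only silences longer than min_sil (0.5 s)."""
    energy = np.asarray(energy, dtype=float)
    mu, _, _ = fit_bigaussian(energy)
    thr = mu.mean()               # decision threshold between the two modes
    speech = energy > thr
    # discard silence runs shorter than the minimal silence length
    min_frames = int(min_sil / frame_step)
    i = 0
    while i < len(speech):
        if not speech[i]:
            j = i
            while j < len(speech) and not speech[j]:
                j += 1
            if j - i < min_frames:
                speech[i:j] = True    # too short to count as silence
            i = j
        else:
            i += 1
    return speech
```

The midpoint threshold is one plausible reading of "a detection threshold"; the paper does not specify how it is set.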

Concerning the front-end processing, the signal is characterized by 20 linear cepstral features (LFCC) computed every 10 ms using a 20 ms window. The cepstral features are augmented by the energy. No frame removal or any coefficient normalization is applied.

2.3. The CLIPS system

The CLIPS system is based on a BIC [5] (Bayesian Information Criterion) speaker change detector followed by a hierarchical clustering. The clustering stop condition is the estimation of the number of speakers using a penalized BIC criterion. The entire speaker segmentation process is described at length in [3][4]. Finally, the re-segmentation phase of the LIA system is also applied on the CLIPS segmentation for refinement¹. Like the LIA system, the CLIPS system is applied on the speech segments detected by the speech/non-speech segmentation. The signal is characterized by 16 mel cepstral features (MFCC) computed every 10 ms on 20 ms windows using 56 filter banks. The cepstral features are then augmented by the energy. No frame removal or any coefficient normalization is applied.

3. MEETING SPEAKER SEGMENTATION STRATEGIES

Since meetings are generally recorded with multiple distant microphones, the speaker segmentation task differs greatly from other domains like broadcast news or telephone conversations. Indeed, the speaker segmentation system has to deal with multiple speech signals (from the different distant microphones) while the objective is to provide a single meeting speaker segmentation output. Moreover, depending on the distant microphone position on the table, the quality of the signal may differ hugely from one microphone to another. For instance, the main speaker utterances may be caught by one or two distant microphones while the other microphones mainly provide background voices, long silences, or background noise only. To deal with these different issues, two cooperative merging strategies are presented in this paper. The first one, called the expert merging strategy, aims at merging segmentations provided by different experts (two experts in this paper).
It is applied independently on each recording issued from a distant microphone. The second one, called the Individual Microphone Segmentation Merging () strategy, is used to produce a single speaker segmentation output from those obtained on each individual distant microphone. The application of both strategies, also referred to as the two merging axes (horizontal and vertical), is illustrated in Figure 1.

Figure 1: Two cooperative merging strategies (horizontal and vertical merging combination).

¹ This combination of the CLIPS system and the LIA re-segmentation phase was also proposed as a merging strategy during the RT'03 evaluation [4] and obtained the best performance over all the participants, with a 12.88% speaker diarization error rate.

3.1. Expert merging strategy

The idea of this strategy is to merge the segmentations issued from the two experts, the CLIPS and LIA systems, computed independently on a given distant microphone. This strategy was already used by the LIA and CLIPS labs for the RT'03 speaker segmentation evaluation campaign on broadcast news data [4]. It relies on a frame-based decision which consists in grouping the labels proposed by both systems at the frame level, before applying a re-segmentation process (see Figure 2). An example of the label merging approach:

Frame i:   Sys1 = S1, Sys2 = T4  ->  label S1T4
Frame i+1: Sys1 = S2, Sys2 = T4  ->  label S2T4

Figure 2: Expert merging strategy (label merge followed by re-segmentation).

This label merging method generates (before re-segmentation) a large set of virtual speakers composed of:
- virtual speakers that have a large amount of data assigned; these can be considered the correct hypothesis speakers;
- virtual speakers generated by only one of the two systems, for example speakers associated with only one short segment (~3 s up to 10 s); these hypothesis speakers can be suppressed (their weight on the final scoring is marginal);
- virtual speakers that have a smaller amount of data, scattered between multiple small segments, which can be considered zones of indecision.
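The frame-level label merge above can be sketched as a toy illustration (naming is ours; the 3 s pruning mirrors the short-speaker deletion rule applied during the subsequent re-segmentation, which this sketch otherwise omits):

```python
from collections import Counter

def merge_labels(sys1, sys2):
    """Combine two per-frame label sequences into composite 'virtual
    speaker' labels, e.g. frame i: S1 + T4 -> 'S1T4'."""
    assert len(sys1) == len(sys2)
    return [a + b for a, b in zip(sys1, sys2)]

def prune_virtual(labels, frame_step=0.01, min_dur=3.0):
    """Mark virtual speakers whose total time is below min_dur (3 s in the
    paper) as indecision zones ('?'), to be reassigned by the
    re-segmentation pass."""
    total = Counter(labels)
    keep = {l for l, n in total.items() if n * frame_step >= min_dur}
    return [l if l in keep else "?" for l in labels]
```

For instance, `merge_labels(["S1", "S2"], ["T4", "T4"])` yields `["S1T4", "S2T4"]`, reproducing the example in the text.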
Based on these considerations, the LIA re-segmentation is then applied on the merged segmentation. During this iterative process, the virtual speakers whose total time is shorter than 3 s are deleted. The data of these deleted speakers are then dispatched among the remaining speakers during the next iteration. After the first iteration, the number of speakers is already drastically reduced, since speakers associated with indecision zones do not catch any data during the Viterbi decoding and are automatically removed. However, the merging strategy cannot generally correct the wrong behaviour of the initial systems, which may split a true speaker into two hypothesis speakers, each tied to a long segment. Suppose all systems agreed on a long segment except one, which splits it in two parts. This would produce two virtual speakers (associated with long-duration segments) after the label merging phase and, since no clustering is applied before re-segmentation, it leads to a "true" speaker split into two virtual speakers.

3.2. Individual Microphone Segmentation Merging strategy

The goal of this strategy is to merge the multiple distant microphone segmentations into a single meeting speaker segmentation output. Since no single signal is representative of the overall meeting, this strategy must rely on some segment selection rules over the multiple distant microphone speaker segmentations.

To this end, a specific merging algorithm is proposed in this paper. Developed by the LIA and CLIPS labs, it relies on an iterative process which aims at detecting the longest speaker interventions over the set of distant microphone segmentations. This algorithm consists of 3 steps:

Step 1: select the longest speaker intervention over all microphone segmentation outputs taken separately. The longest speaker intervention means all the segments (contiguous or not) attributed to the speaker over a specific microphone segmentation. These segments are definitively attributed to a new speaker in the resulting segmentation.

Step 2: delete, in each distant microphone segmentation, all the segments attributed to the new speaker at the end of Step 1.

Step 3: check for non-selected segments over all the distant microphone segmentations. If segments are still present and their total length is greater than 30 s, go back to Step 1 for a new iteration; otherwise stop the process and assign the segments to a last speaker label (this last speaker can be seen as a trash speaker gathering all the short remaining segments).

One rule is used during this iterative process: if the longest speaker intervention selected during Step 1 is longer than 60% of the overall signal duration, it is not considered (unless it is the last available intervention). This rule aims at discarding some very long speaker segmentation outputs, which may result from poor individual microphone segmentations (the poor quality of an individual microphone segmentation may be due, for instance, to the dominant presence of background voice/noise on the microphone signal, involving a large rate of speech/non-speech segmentation errors).

4. EXPERIMENTS AND RESULTS

4.1. Evaluation protocols

The RT'04s meeting evaluation campaign [6] proposed two main tasks: speech-to-text transcription (STT) and/or speaker segmentation (so-called diarization). For both tasks, different microphone conditions were available: multiple distant microphones, single distant microphone and individual head microphone (the latter was available for STT only).
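The three-step algorithm of Section 3.2 can be sketched as follows, under assumptions of ours: segmentations are lists of (start, end, speaker) tuples, one list per microphone, and cross-microphone overlap is resolved by segment midpoints, a detail the paper leaves open. The function name imsm_merge is hypothetical (the strategy's acronym does not survive in the source).

```python
def imsm_merge(mic_segs, total_dur, min_left=30.0, max_frac=0.6):
    """Iteratively pick, over all microphone segmentations, the speaker with
    the longest total intervention; attribute those segments to a new output
    speaker, delete them on every microphone, and stop once fewer than
    min_left seconds remain (leftovers go to a trash speaker).
    mic_segs: one list of (start, end, speaker) tuples per microphone."""
    segs = [list(m) for m in mic_segs]
    out, spk_id = [], 0
    while any(segs):
        # Step 1: total duration of each (microphone, speaker) pair
        durations = {}
        for mic, m in enumerate(segs):
            for s, e, spk in m:
                durations[(mic, spk)] = durations.get((mic, spk), 0.0) + (e - s)
        # 60% rule: skip over-long interventions unless nothing else remains
        cands = sorted(durations.items(), key=lambda kv: -kv[1])
        ok = [c for c in cands if c[1] <= max_frac * total_dur] or cands
        (mic, spk), _ = ok[0]
        chosen = [(s, e) for s, e, l in segs[mic] if l == spk]
        out += [(s, e, f"SPK{spk_id}") for s, e in chosen]
        spk_id += 1
        # Step 2: delete, on every microphone, segments covered by the
        # selection (overlap resolved by segment midpoint, our choice)
        def covered(seg):
            mid = 0.5 * (seg[0] + seg[1])
            return any(s <= mid < e for s, e in chosen)
        segs = [[t for t in m if not covered(t)] for m in segs]
        # Step 3: stop when < min_left seconds remain; leftovers -> trash
        left = sum(e - s for m in segs for s, e, _ in m)
        if left < min_left:
            out += [(s, e, "TRASH") for m in segs for s, e, _ in m]
            break
    return sorted(out)
```

With two toy microphones, the longest intervention (55 s of speaker "y" on the second microphone) is attributed first, and the overlapping segments on the other microphone are discarded.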
This paper addresses only speaker segmentation over multiple distant microphones. This section describes the evaluation protocols used to measure the performance, presents some results and discusses the behaviour of the two-axis merging strategy.

Scoring. In order to measure performance, an optimum one-to-one mapping of reference speaker IDs to system output speaker IDs is computed, followed by a time-based speaker segmentation error rate. This scoring, proposed by NIST, is described in detail in the RT'04s evaluation plan [7]. Speaker segmentation performance is expressed in terms of the speaker diarization error, comprising missed and false alarm speaker errors as well as speaker segmentation errors. NB: in this paper, the areas of overlap between speaker utterances are not scored.

Database. Since this work was done in the context of the RT'04s evaluation campaign, two meeting corpora are available, named in this paper the Dev corpus, for the development of the systems, and the Eva corpus, for the evaluation. Both of them are composed of two 10-minute meeting excerpts recorded over four different sites (CMU, ICSI, LDC and NIST). Table 1 provides some details on the different corpora, including, for each meeting excerpt, the number of available distant microphones. For each distant microphone, its position in the meeting room is available as further information and may be used to help the speaker segmentation process. Nevertheless, the approaches presented in this paper do not take advantage of this kind of information. Finally, as for any speaker segmentation evaluation, no prior information about the number of speakers and their identity is available.

Dev                           Eva
Meeting             mic nb    Meeting             mic nb
CMU_20020319-1400   1         CMU_20030109-1530   1
CMU_20020320-1500   1         CMU_20030109-1600   1
ICSI_20010208-1430  6         ICSI_20000807-1000  6
ICSI_20010322-1450  6         ICSI_20011030-1030  6
LDC_20011116-1400   7         LDC_20011121-1700   10
LDC_20011116-1500   8         LDC_20011207-1800   4
NIST_20020214-1148  7         NIST_20030623-1409  7
NIST_20020305-1007  6         NIST_20030925-1517  7

Table 1: Number of distant microphones for each meeting of the Dev and Eva corpora.
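The NIST-style scoring described above can be illustrated at the frame level; this is a simplified sketch of ours that brute-forces the optimal one-to-one speaker mapping over permutations rather than using NIST's md-eval tooling, and ignores collar and overlap handling:

```python
from itertools import permutations

def diarization_error(ref, hyp):
    """Frame-level speaker diarization error between two equal-length label
    sequences; None marks non-speech. The hypothesis-to-reference speaker
    mapping is chosen to minimize the error (optimal one-to-one mapping),
    and the error is normalized by the reference speech duration."""
    assert len(ref) == len(hyp)
    ref_spk = sorted({r for r in ref if r is not None})
    hyp_spk = sorted({h for h in hyp if h is not None})
    # pad the hypothesis side so every permutation is a one-to-one mapping
    hyp_spk += ["<pad%d>" % i for i in range(len(ref_spk) - len(hyp_spk))]
    best = len(ref)
    for perm in permutations(hyp_spk, len(ref_spk)):
        mapping = dict(zip(perm, ref_spk))
        errs = 0
        for r, h in zip(ref, hyp):
            if (r is None) != (h is None):
                errs += 1                  # missed or false alarm speech
            elif r is not None and mapping.get(h) != r:
                errs += 1                  # speaker confusion error
        best = min(best, errs)
    ref_dur = sum(1 for r in ref if r is not None)
    return best / ref_dur if ref_dur else 0.0
```

The brute-force mapping is only practical for the small speaker counts of these 10-minute excerpts; the official scoring solves the same assignment problem exactly.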
4.2. Results

Tables 2 and 3 provide the experimental results obtained on the Dev and Eva corpora for the task of multiple distant microphone speaker segmentation. These results, expressed in terms of speaker diarization error rates, are given for three different systems:
- LIA+: the LIA speaker segmentation system applied on each individual distant microphone, followed by the Individual Microphone Segmentation Merging () process;
- CLIPS+: the same process applied with the CLIPS speaker segmentation system, followed by the () process;
- Two-axis merging: application of the expert merging strategy on the LIA and CLIPS segmentations, followed by the () process.

These results show:
- important differences in performance between the LIA and CLIPS systems on the same meeting file (e.g. 14.1% vs. 53.4% for CMU_20020320-1500 on the Dev corpus, and 37.9% vs. 19.1% for ICSI_20000807-1000 on the Eva corpus);
- important differences in performance between the meetings (e.g. 7.4% vs. 54.1% for the LIA system between LDC_20011116-1400 and NIST_20020305-1007 on the Dev corpus);
- a significant difference in performance between the Dev and Eva corpora (22.6% for the best overall error rate on Eva vs. 28.3% on Dev), as well as a different behaviour of the systems between corpora (the LIA system is the best one on Dev and the CLIPS system the best one on Eva);

- a small performance improvement observed with the two-axis merging strategy compared to the individual systems, and only on a few meeting files (e.g. 25.3% for two-axis merging vs. 28.4% for LIA and 26.7% for CLIPS on LDC_20011207-1800). Nevertheless, no gain is reached on the overall performance compared to the best individual system.

4.3. Discussion

Given the difficulty of the task (compared to broadcast news or conversational telephone data), the performance obtained by the various systems is quite satisfying, especially on the Eva corpus: 22.6% for the best system, to be compared with the 12.88%¹ obtained on BN data during RT'03. Nevertheless, the expert merging strategy applied individually on each microphone ("two-axis merging") does not provide an additional performance gain compared to the best system. This result differs from the RT'03 ones [4], where a 16% relative decrease of the diarization error was observed (from 16.90% for the best individual system to 14.24% for the expert-merging-based system). Moreover, the behaviour of this strategy greatly depends on the quality of the individual segmentations, which are themselves dependent on the quality of each stream caught by each individual microphone. One explanation of the disappointing behaviour of the expert merging strategy may be that each expert is applied separately on a missing-data file (i.e. on each individual microphone recording). Thus, the performance of the two experts may be very different for the same meeting file, which is a well-known drawback in fusion (it is generally accepted that an efficient fusion must be done between experts that do not differ too much in terms of performance). Table 4 shows the differences between the microphones taken independently, on two different meeting examples². In the first example (LDC_20011116-1500), the results show a large variability in terms of speaker error rates between the microphones (d3, d5, d6, ...). On the contrary, regarding the speech/non-speech detection, a small variability between the microphones is noted.
On this same meeting, the overall score is very close to the best individual microphone result, which performs quite well. The second example (NIST_20020305-1007) shows an inverse behaviour: comparable and quite reasonable speaker error rates over the set of microphones vs. high missed speech error rates with a large variability between the microphones. The differences observed between the meetings show the difficulty of defining an efficient merging strategy. To summarize, some comments can be made regarding the results:
- If one microphone is able to catch the information from all the speakers (d2 on LDC_20011116-1500, for example), this microphone could be used alone and achieve good performance (14.5% of diarization error on the previous example, to be compared with 12.88% on BN data);
- If the information is present simultaneously on different microphones (with different signal qualities), the fusion process is disturbed, since it is not able to group together two (or more) parts of a given speaker detected on different microphones;
- To take advantage of the multiple microphones, it is necessary to focus on the useful information/speakers present in each recording, i.e. the speech/non-speech process should delete the far speakers (low SNR parts, background voices, ...).

² Speaker diarization error rates provided in Table 4 for each distant microphone are computed by mapping each individual microphone segmentation to the corresponding single meeting reference segmentation.
³ The speaker error rate is computed only on well-detected speech segments (speech segments present both in the reference and in the system output).

5. CONCLUSION

We have presented the ELISA speaker segmentation approach applied on meeting speech data for the NIST RT'04s (spring) evaluation campaign. The best system obtained a 28.3% diarization error on the development corpus (Dev) and 22.6% on the evaluation corpus (Eva), to be compared with the 12.88% obtained on BN data during the NIST RT'03 evaluation. A simple two-axis merging strategy was proposed to treat multiple expert segmentation outputs and multiple microphone segmentation outputs.
While the expert merging strategy did not really improve performance, the individual microphone segmentation merging strategy made it possible to provide a global segmentation output from several audio channels (microphones) with acceptable performance. To be efficient when the speaker voices are caught differently by the microphones, our simple merging strategy needs individual microphone segmentations focused only on the well-caught speakers (the background/far speakers should be suppressed). Despite the simplicity of the merging strategy proposed in this paper, the ELISA primary system presented at the RT'04s (spring) meeting evaluation obtained the best performance on the speaker diarization task.

6. REFERENCES

[1] http://www.nist.gov/speech/test_beds/mr_proj/
[2] I. Magrin-Chagnolleau, G. Gravier, and R. Blouet for the ELISA consortium, "Overview of the 2000-2001 ELISA consortium research activities", A Speaker Odyssey, pp. 67-72, Chania, Crete, June 2001.
[3] D. Moraru, S. Meignier, L. Besacier, J.-F. Bonastre, and I. Magrin-Chagnolleau, "The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation", ICASSP'03, Hong Kong.
[4] D. Moraru, S. Meignier, C. Fredouille, L. Besacier, and J.-F. Bonastre, "The ELISA consortium approaches in Broadcast News speaker segmentation during the NIST 2003 Rich Transcription evaluation", ICASSP'04, Montreal, Canada, May 2004.
[5] P. Delacourt and C. Wellekens, "DISTBIC: a speaker-based segmentation for audio data indexing", Speech Communication, Vol. 32, No. 1-2, September 2000.
[6] http://nist.gov/speech/tests/rt/rt2004/spring/
[7] http://nist.gov/speech/tests/rt/rt2004/spring/documents/rt04s-meeting-eval-plan-v1.pdf

Speaker diarization error (in %), Dev corpus
Meeting                LIA+    CLIPS+   Two-axis merging
CMU_20020319-1400      58.5    42.4     47
CMU_20020320-1500      14.1    53.4     52.7
ICSI_20010208-1430     16.9    25.9     18.9
ICSI_20010322-1450     26.5    26.8     27.1
LDC_20011116-1400       7.4     7.5      7.5
LDC_20011116-1500      13.9    16.4     18.1
NIST_20020214-1148     30.8    31.4     33.3
NIST_20020305-1007     54.1    36.8     35.5
Overall (miss. and fa non-speech err. = 5.6%)   28.3    29.9    29.8

Table 2: Performance (in terms of speaker diarization error rate) of the individual speaker segmentation systems (LIA and CLIPS) applied on each distant microphone and followed by the Individual Microphone Segmentation Merging () strategy, and of the two-axis merging based system, given for each Dev corpus meeting signal and overall.

Speaker diarization error (in %), Eva corpus
Meeting                LIA+    CLIPS+   Two-axis merging
CMU_20030109-1530      20.8    39.8     41.2
CMU_20030109-1600      13.7    17.8     18.8
ICSI_20000807-1000     37.9    19.1     17.2
ICSI_20011030-1030     52.1    44.2     42.2
LDC_20011121-1700      16.5     7.7     18.0
LDC_20011207-1800      28.4    26.7     25.3
NIST_20030623-1409     10.3    13.9     10.6
NIST_20030925-1517     22.9    22.7     23.8
Overall (miss. and fa non-speech err. = 7%)     24.4    22.6    23.4

Table 3: Performance (in terms of speaker diarization error rate) of the individual speaker segmentation systems (LIA and CLIPS) applied on each distant microphone and followed by the Individual Microphone Segmentation Merging () strategy, and of the two-axis merging based system, given for each Eva corpus meeting signal and overall.

Error rates (in %)
          LDC_20011116-1500        NIST_20020305-1007
Micro     Mis+fa     Speaker       Mis+fa     Speaker
          err. rate  err. rate     err. rate  err. rate
d1        3.7        18.6          34.4       22.9
d2        4.9         9.6          21.8       26.3
d3        4.9        47.9          XX         XX
d4        7.4        11.6          20         29
d5        4.0        48.5          36.2       13.9
d6        3.1        48.5          29.2       19.3
d7        4.5        48.3          25.2       16.2
d8        7.3        47.6          XX         XX
()        2.5        11.4          10.2       43.9

Table 4: Two examples of the Individual Microphone Segmentation Merging () strategy behaviour for the LIA+ system.