Automatic Inference of Cross-modal Nonverbal Interactions in Multiparty Conversations

"Who Responds to Whom, When, and How?" from Gaze, Head Gestures, and Utterances

Kazuhiro Otsuka
NTT Communication Science Laboratories
3-1, Morinosato-Wakamiya, Atsugi, 247-0198 Japan
otsuka@eye.brl.ntt.co.jp

Hiroshi Sawada
NTT Communication Science Laboratories
2-4, Hikaridai, Seika-cho, Kyoto, 619-0237 Japan
sawada@cslab.ntt.co.jp

Junji Yamato
NTT Communication Science Laboratories
3-1, Morinosato-Wakamiya, Atsugi, 247-0198 Japan
yamato@brl.ntt.co.jp

ABSTRACT
A novel probabilistic framework is proposed for analyzing cross-modal nonverbal interactions in multiparty face-to-face conversations. The goal is to determine "who responds to whom, when, and how" from multimodal cues including gaze, head gestures, and utterances. We formulate this problem as the probabilistic inference of the causal relationship among participants' behaviors involving head gestures and utterances. To solve this problem, this paper proposes a hierarchical probabilistic model; the structures of interactions are probabilistically determined from high-level conversation regimes (such as monologue or dialogue) and gaze directions. Based on the model, the interaction structures, gaze, and conversation regimes are simultaneously inferred from observed head motion and utterances, using a Markov chain Monte Carlo method. The head gestures, including nodding, shaking, and tilt, are recognized with a novel Wavelet-based technique from magnetic sensor signals. The utterances are detected using data captured by lapel microphones. Experiments on four-person conversations confirm the effectiveness of the framework in discovering interactions such as question-and-answer and addressing behavior followed by back-channel responses.
Categories and Subject Descriptors
H1.2 [Models and Principles]: User/Machine System - Human Information Processing

General Terms
ALGORITHMS, HUMAN FACTORS

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICMI November 12-15, 2007, Nagoya, Aichi, Japan. Copyright 2007 ACM 978-1-59593-817-6/07/0011...$5.00.

Keywords
Face-to-face multiparty conversation, Eye gaze, Head gestures, Nonverbal behaviors, Bayesian network, Markov chain Monte Carlo, Gibbs sampler, Semi-Markov process

1. INTRODUCTION
Face-to-face conversation is one of the most basic forms of communication in our life and is used for conveying/sharing information, understanding others' intention/emotion, and making decisions. To enhance our communication capability beyond conversations on the spot, the automatic analysis of conversation scenes is a basic technical requisite to enable effective teleconferencing, archiving/summarizing meetings, and to realize communication via social agents and robots. The conversation scene analysis targets various aspects of conversations, from individual/group behaviors such as "who is speaking now?" and "who is talking/listening to whom?", to context/mental status such as "who made him angry?" and "why is she laughing?". In the face-to-face setting, the messages include not only verbal but also nonverbal messages. The nonverbal messages are expressed by nonverbal behaviors in multimodal channels such as eye gaze, facial expressions, head motion, hand gesture, body posture, and prosody; psychologists have elucidated their importance in human communications [1, 26]. Therefore, it is expected that conversation scenes can be largely understood by observing people's nonverbal behaviors with sensing devices such as cameras and microphones.
As a preliminary step, the authors have focused on eye gaze as a nonverbal cue for recognizing addressing/listening behaviors [22], based on the importance of gaze functionality in conversations [16, 12]. We conducted a frequency analysis of the set of gaze directions of all participants, which we call the gaze pattern, and hypothesized that the topology of gaze patterns (convergence and mutual gaze) can indicate the pattern of conversations such as monologue and dialogue; we call these the conversation regimes. As an example, one relationship is "hearers tend to look at the speaker in monologues". To model the relationships between gaze patterns and regimes, we proposed a probabilistic conversation model based on a dynamic Bayesian network; the conversational regime controls the dynamics of gaze patterns and utterances; the gaze pattern is a hidden variable and is estimated from head-direction measurements. With this model, the regimes and the gaze patterns are jointly estimated from the utterances and the head directions measured with sensors [22] or face tracking in videos [23]; the estimation is implemented using the MCMC (Markov chain Monte Carlo) method.

This paper tries to extend our framework to a new target, the automatic inference of nonverbal interaction structures in multiparty conversations; the goal is to determine "who responds to whom, when, and how". In contrast to the authors' previous work [22, 23], this paper tries to recognize more direct nonverbal interactions in conversations. We particularly focus on head gestures (nod, shake, and tilt) and utterances as nonverbal cues, and try to discover the action-reaction pairs of participants' behaviors such as question-and-answer and addressing followed by back-channel responses. The target interactions will yield cross-modal formations such as an utterance acknowledged by nodding where the nod triggers the other's utterances. The interaction structure is the basic primitive in conversations and can reveal how messages are exchanged among people; it can be a clue for inferring how the attitude and minds of participants change. As far as the authors know, this paper is the first one to shed light on the explicit structural analysis of nonverbal interactions in conversations.

We formulate this problem as the probabilistic inference of the causal relationships among participants' behaviors; we call these relationships the interaction structures. To solve this problem, this paper proposes a hierarchical probabilistic model; the interaction structures are probabilistically generated from gaze and conversation regimes; the interaction structures then determine how head gestures and utterances relate to each other, i.e. which behavior is triggered by which behavior, as well as which behaviors are spontaneous and which are reactive. Based on the model, the interaction structures, gaze patterns, and conversation regimes are simultaneously inferred from head directions, head gesture intervals, and utterance intervals, using MCMC.
One of the key features differentiating our model from existing interaction models is the modeling concept: explicit representation of the causal relationships among behaviors in Bayesian network form, the configuration of which is ruled by upper-layer processes, i.e. regimes and gaze. Another key feature is the use of a semi-Markov process [14] to accurately model the temporal structures of interactions; it permits arbitrary distributions of behavior timing, such as duration and pause length. As one such timing distribution, this study employs a Weibull distribution [21] due to its expressiveness. So far, several interaction models have been proposed for conversation scene analysis, based on the coupled-HMM [4] and its derivatives such as the influence model [2]. However, due to the Markov property of these models, only exponential temporal distributions are supported, which does not necessarily match actual phenomena. Moreover, interaction modeling has, so far, mainly targeted the audio modality [2, 7], and the modeling of multimodal interactions remains an open problem.

This paper focuses on head gestures such as nodding, shaking, and tilt, for the following reasons. First, it is well known that head gestures play important roles in face-to-face conversation for both speakers and hearers [18]. The speaker's head gestures appear as visible signs of actions such as addressing, questioning, and stressing. The hearers' head gestures can be interpreted as signs of listening, acknowledgement, agreement/disagreement, and the level of understanding. These gestures are used to regulate various interactions in conversations, such as question & answer, addressing & back-channel response, and turn-taking/yielding. Therefore, head gestures are considered to be a rich information source for understanding conversations. Several head gesture recognition methods have been proposed for man-machine interfaces using techniques such as HMM [15] and FFT+SVM (Support Vector Machine) [20].
Unlike interactions with artificial agents, human-human conversations exhibit a wide variety of gestures, in terms of periodicity, speed, and dynamic range, which are mixed together with other head motions such as those synchronized to utterances, turning the head when changing gaze direction, and so on. To handle such gestures, this paper proposes a novel gesture recognition technique that consists of Wavelet analysis of head pose sequences; an SVM is used as a discriminator.

This paper is organized as follows. Section 2 overviews related works. Section 3 proposes our conversation model, and Section 4 presents an estimation algorithm based on the model. Section 5 describes the experiment conducted to verify the effectiveness of our method. Section 6 presents our conclusion and some discussions.

2. RELATED WORKS
In recent years, conversation scene analysis has emerged as an attractive research area [10], and two streams of research have been gaining attention: the automatic recognition of meeting actions, and using annotated data to explore human mechanisms in meetings. The former study stream aims to realize the automatic recognition of meeting actions such as monologue, dialogue, discussion, note-taking, and presentation, from audio/visual signals. To do this, most prior studies employed low-level features such as global image motion and geometric image primitives detected from video, and tried to build statistical models on machine learning techniques that linked the signals to meeting actions. So far, a number of models have been proposed based on the HMM (Hidden Markov Model) [19], layered-HMM [27], coupled-HMM [2], and dynamic Bayesian networks [9]. However, the explicit measurement and modeling of human behaviors in meetings remain open problems. On the other hand, another line of studies, motivated by the desire to explore human mechanisms in meetings, takes the psychological point of view.
In pioneering work, a group led by Quek focused on the floor control function; they used a multimodal meeting corpus [6], created by a human expert, for analyzing speech patterns such as interruption and delegation of the floor [5]. So far, their research has revealed that multimodal cues such as gaze, gesture, and speech have important roles in floor control. However, fully automatic data annotation remains future work. Given the current status of the field, we have been trying to bridge the gap between the two streams of studies mentioned above: automatic understanding of meeting scenes from direct measurements of nonverbal behaviors, and explicitly modeling the relationship between individual behaviors and the status of conversation, with the help of psychological findings.

[Figure 1: Conversation model, (a) concept, (b) graphical model. Panel (a): top regime layer, middle interaction layer, bottom behavior layer. Panel (b): hidden variables S (conversation regimes), X (gaze patterns), E (interaction structures); observable variables H (head directions), G (head gestures), U (utterances).]

[Figure 2: Gaze patterns and interaction patterns in (a) convergence regime and (b) dyad-link regime. Panel (a): one speaker and the addressees. Panel (b): two speaker/addressees and side-participants.]

3. CONVERSATION MODEL

3.1 Model Concept
This study focuses on group conversations held in a closed environment; the number of participants is N ≥ 3. As shown in Fig. 1(a), we assume that conversations have hierarchical structures. Further, we hypothesize that a high-level process, called a conversation regime, governs how people interact with each other on the interaction layer, and the interaction process governs how each individual behaves on the behavior layer. In [22], the authors have proposed conversation regimes as a global status of conversations, which correspond to addressing/listening patterns such as monologue and dialogue. We focused on the eye gaze of each participant as an interactive behavior. Also, we hypothesized that the gaze pattern of participants can indicate the structure of conversations, and proposed three regimes: Convergence, Dyad-Link, and Divergence. The regime called Convergence (also called monologue) corresponds to the situation that a speaker addresses all others, and the addressees listen to the speaker. This regime is indicated by the convergence of the addressees' gaze onto the speaker, as shown in Fig. 2(a). Second, the regime called Dyad-Link (also called dialogue) corresponds to the situation that two people are talking to each other, and the others are side-participants. This regime is indicated by mutual gaze between the two, as shown in Fig. 2(b).
Third, the regime called Divergence (also called others) corresponds to situations other than the convergence and dyad-link regimes; everyone is silent and/or no organized conversation exists. The gaze pattern does not exhibit any organized pattern.

This paper extends the authors' framework in [22] to infer cross-modal nonverbal interactions. Of particular note, this paper newly introduces head gestures (nod, shake, tilt) and utterances as the interacting nonverbal cues, and tries to find the causal relationship amongst them. Fig. 1(b) provides a graphical representation of our new conversation model, consisting of hidden variables (conversation regimes S, gaze patterns X, and interaction structures E) and observable variables (head directions H, head gestures G, and utterances U). This model assumes that the interaction structures are probabilistically generated depending on the conversation regimes and gaze patterns, and that the head gestures and utterances are probabilistically generated by the interaction structures.

To establish the link between conversation regimes and the interaction structures, we hypothesize that the pattern of interactions resembles the gaze patterns, as shown in Fig. 2. For example, in regime convergence, the addressees often respond to the speaker with nods as the sign of listening, sometimes accompanied with short utterances like "hmm" and "yeah". These addressing/back-channel responses are basic primitives of interactions in conversation. It is assumed that the direction of the responses follows the same pattern noted in the case of gaze, as shown in Fig. 2(a). On the other hand, there is another type of interaction, called question-and-answer; it is an interaction for exchanging messages between two persons. We assume that it often appears in the regime dyad-link, and that it takes the form depicted in Fig. 2(b).

[Figure 3: Temporal representation of model. Upper part (A): regimes, gaze patterns, and head directions over time; lower part (B): interaction structures and behaviors.]

[Figure 4: Interaction network representing causal relationship determined from interaction structures. Behaviors of persons P1-P3 are drawn as boxes; the example neighborhood is Nei(E) = {B_{1,1}, B_{3,2}}.]

3.2 Model Structure
Fig. 3 provides a graphical representation of the proposed conversation model with temporal information. The upper part (A) is the same as that proposed in [22], and the lower part (B) is the novel extension made in this paper. In the upper part, we denote the sequence of regime states as $S = \{S_1, S_2, \ldots, S_T\}$; we target the discrete temporal interval $[1, T]$. The regime state $S_t$ at time $t$ takes one of $N$ convergence regimes, ${}_N C_2$ dyad-link regimes, and the divergence regime. The regime changes are considered to follow a discrete Markov process. The sequence of gaze patterns is represented as $X = \{X_1, X_2, \ldots, X_T\}$, where the gaze pattern $X_t$ at time $t$ is composed of the set of gaze directions of all participants, $X_t = \{X_{i,t}\}_{i=1}^{N}$; each $X_{i,t}$ takes one of $N$ discrete directions: looking at one of the others' faces or averting from all of them. The sequence of head directions is denoted as $H = \{H_1, H_2, \ldots, H_T\}$, and the head directions of the participants at time $t$, $H_t = \{h_{i,t}\}_{i=1}^{N}$, are observed as continuous azimuth (horizontal) angles.

The lower part (B) in Fig. 3 represents the relationship between the interaction structures E and behaviors consisting of temporal intervals of head gestures G and utterances U. In this paper, a head-gesture detector detects the presence/absence of head gestures at each time step, and a voice activity detector detects that of utterances at each time step (see also 5.4 and 5.3). From the detection results, the temporal intervals of continuous head gestures are extracted; here we define a temporal interval $G \in \mathbf{G}$ as the interval bounded by the beginning/ending time steps at which head motion starts/ends. The same definition applies to the utterance interval $U \in \mathbf{U}$. Here, we denote the set of gesture and utterance intervals as $\mathbf{B} = \mathbf{G} \cup \mathbf{U}$; hereafter we refer to behavior $B \in \mathbf{B}$ unless it is necessary to distinguish between gestures and utterances. Note that this paper targets nod, shake, and tilt, and treats them as the same behavior, because it mainly focuses on the temporal aspect of gestures, not the meaning of gestures.

This paper assumes that each behavior is triggered by another's behavior, or appears spontaneously. The interaction structures E determine the causal relationship among behaviors. The proposed model assumes that the interaction is probabilistically generated based on the states of regimes and gaze patterns. Fig. 4, which corresponds to the lower part of Fig. 3, visualizes the causal relationship among behaviors assigned by the interaction structures; we call it the interaction network. In Fig. 4, boxes indicate behaviors and an arrow from a box indicates the reaction target that triggered the behavior. A box without any outgoing arrow indicates spontaneous behavior. The interaction network can be considered as a Bayesian network with inverse arrow directions.
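
As a rough illustration of the state spaces defined in Section 3.2, the following Python sketch enumerates the regime states and counts the gaze patterns for a given N. The function names and the tuple encoding are assumptions made for this example only; they are not taken from [22] or from the implementation used in this paper.

```python
from itertools import combinations

def enumerate_regimes(n):
    """List all regime states: n convergence, C(n, 2) dyad-link, and 1 divergence."""
    persons = range(1, n + 1)
    convergence = [("convergence", (i,)) for i in persons]           # speaker i
    dyad_link = [("dyad-link", pair) for pair in combinations(persons, 2)]
    divergence = [("divergence", ())]
    return convergence + dyad_link + divergence

def gaze_directions(i, n):
    """Gaze options of participant i: any other participant, or None (averted)."""
    return [j for j in range(1, n + 1) if j != i] + [None]

def count_gaze_patterns(n):
    """Number of joint gaze patterns X_t, i.e. one direction per participant."""
    total = 1
    for i in range(1, n + 1):
        total *= len(gaze_directions(i, n))
    return total

regimes = enumerate_regimes(4)
num_gaze_patterns = count_gaze_patterns(4)
```

For the four-person conversations analyzed in Section 5 (N = 4), this gives 4 + 6 + 1 = 11 regime states and 4^4 = 256 gaze patterns per time step.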
The interaction structures consist of a set of elements, called action units, $E \in \mathbf{E}$, each of which corresponds to a behavior interval, where gesture and utterance intervals are linked if their beginning times are similar. Each action unit, E, has attributes including a spontaneous-reactive class and a reaction target. The spontaneous-reactive class indicates that the action is spontaneous (denoted $E \to \emptyset$) or a reaction to another's behavior (denoted $E \to B$, $B \in \mathrm{Nei}(E) \subset \mathbf{B}$), where B denotes the reaction target that triggered action E. Nei(E) denotes a set of others' behaviors that occur in the temporal vicinity of E, as shown in Fig. 4. This paper assumes there is only one reaction target for each action unit, at most. Here, reactive behavior is defined as the direct and immediate response to others' behavior. Spontaneous behavior is one that is not reactive behavior. Typical examples of spontaneous behaviors are the addressing and questioning behavior of speakers. On the other hand, typical reactive behaviors include the hearers' back-channel responses and answers to questions posed.

3.3 Model Definition
Based on the conditional dependency depicted in Fig. 1(b), the joint probability distribution of the model is defined as

$$p(X, S, \mathbf{E}, H, \mathbf{U}, \mathbf{G}, \phi) = F_H(H \mid X, \phi)\, F_B(\mathbf{U}, \mathbf{G} \mid \mathbf{E}, \phi)\, P(\mathbf{E} \mid X, S, \phi)\, P(X \mid S, \phi)\, P(S \mid \phi)\, p(\phi), \quad (1)$$

where ϕ denotes the set of all model parameters. Eq. (1) is composed of the product of the likelihood functions for observed data and the prior distributions of all hidden variables.

Table 1: Spontaneous probabilities

Regime     Role       Utterance       Gesture
Monologue  speaker    η_SMSU (0.95)   η_SMSG (0.93)
Monologue  addressee  η_SMAU (0.06)   η_SMAG (0.05)
Dialogue   dyad       η_SDDU (0.00)   η_SDDG (0.50)
Dialogue   others     η_SDSU (0.00)   η_SDSG (0.00)
Others     -          η_SOU (0.78)    η_SOG (0.71)

Table 2: Directional probabilities

Regime     Condition                Probability
Monologue  Addressee to speaker     η_DA (0.88)
Dialogue   One in dyad to another   η_DD (1.00)
-          Look at response target  η_DG (0.88)
This paper employs the same definitions used in [22] for the priors of regimes $P(S \mid \phi)$ and gaze patterns $P(X \mid S, \phi)$, and for the likelihood for head directions $F_H(\cdot)$, which assumes that head direction follows a Gaussian distribution for any given gaze direction. The prior $p(\phi)$ of model parameters is defined as the product of the priors of the individual parameters; this assumes that the individual parameters are independent.

In Eq. (1), $P(\mathbf{E} \mid X, S, \phi)$ represents the probability that interactions E occur in given regimes S and gaze patterns X. This paper decomposes this into the product of the probabilities of each action unit $E \in \mathbf{E}$, as written in

$$P(\mathbf{E} \mid X, S, \phi) = \prod_{E \in \mathbf{E}} P(E \mid X, S) \prod_{B \in \mathrm{Nei}(E)} \psi(E, B), \quad (2)$$

where the first term, $P(E \mid X, S)$, represents the probability of the action unit state and the second term is a penalty to suppress the case of two behaviors responding to each other ($\psi(E, B) = 0$ in that case; otherwise $\psi(E, B) = 1$). In Eq. (2), $P(E \mid X, S, \phi)$ represents the probability that action unit E is in response to another's behavior or is spontaneous. To define this, this paper introduces spontaneous probabilities and directional probabilities, and defines $P(E \mid X, S, \phi)$ as their product. The former is the probability that an action is spontaneous behavior. This paper assumes that it depends on the regime and the role of the person making action unit E, as summarized in Table 1; each probability is a hidden variable to be estimated. In Table 1, values inside the parentheses are examples obtained from manually annotated data (C1) (see 5.1); they indicate that, for a monologue, the speaker's behaviors are far more spontaneous than those of the addressees. On the other hand, the directional probability represents the probability that action unit E is in response to the target person; it is assumed to depend on gaze and regime. Table 2 summarizes the directional probabilities. As mentioned earlier, we assume that addressees often respond to the speaker in regime convergence, and that the two members of the dyad respond to each other in the dyad-link regime. Also, the reaction target tends to be a gazee, because people tend to look at the target when they respond.
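
The penalty term in Eq. (2) can be illustrated with a short Python sketch. It assumes, for simplicity, that each action unit holds a single behavior, so that a reaction target can be stored as the identifier of another action unit (or None for a spontaneous action); the dictionary layout and the function names are hypothetical.

```python
def psi(targets, e, b):
    """Return 0 if units e and b are each other's reaction target, else 1."""
    mutual = targets.get(e) == b and targets.get(b) == e
    return 0 if mutual else 1

def structure_penalty(targets, neighbors):
    """Product of psi(E, B) over every action unit E and every B in Nei(E)."""
    product = 1
    for e, nei in neighbors.items():
        for b in nei:
            product *= psi(targets, e, b)
    return product

# Hypothetical example: "b" reacts to "a", "c" reacts to "b", "a" is spontaneous.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
valid = {"a": None, "b": "a", "c": "b"}
invalid = {"a": "b", "b": "a", "c": "b"}   # "a" and "b" respond to each other

penalty_valid = structure_penalty(valid, neighbors)
penalty_invalid = structure_penalty(invalid, neighbors)
```

Any structure containing a pair of mutually responding behaviors receives a product of zero, so such a structure has zero prior probability under Eq. (2).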
The second component, which is newly defined for the joint density in Eq. (1), is the likelihood of interactive behavior $F_B(\mathbf{U}, \mathbf{G} \mid \mathbf{E}, \phi)$ for given interaction structures E. This paper defines this component as the product of the likelihood function of each behavior interval, $f_B(B \mid E)$, in each action unit, by assuming the conditional independence of each behavior for given interaction structures E. This likelihood calculation is based on the temporal distributions of duration, pause, and reaction time of gestures and utterances (see Fig. 5). Fig. 6 summarizes the Weibull models employed to represent the distributions. The models in Fig. 6 indicate the tendency observed in the timing of behaviors; e.g. spontaneous utterances are longer than reactive ones, but spontaneous gestures tend to be shorter than reactive ones. Using these models, the likelihood $f_B(B \mid E)$ can be defined separately for each case as in

$$f_B(B \mid E) = \begin{cases} f_B^{S,D}(d)\, f_B^{S,P}(p) & \text{if } E \text{ is spontaneous}, \\ f_B^{R,D}(d)\, f^{R,R}(r) & \text{if } E \text{ is reactive}, \end{cases} \quad (3)$$

where d, p, and r denote the duration, pause length, and reaction time of behavior B, respectively. Note that the Weibull parameters are hidden variables to be estimated. If the target behavior is a gesture, the trigger time is set to the beginning of the interval. Otherwise, the trigger time is considered to be a hidden random variable, which follows the probability distribution of response time.

[Figure 5: Duration, pause length, and reaction time. For spontaneous behaviors of a sender: duration and pause. For reactive behaviors of a receiver (responder): trigger time, reaction time, and duration.]

[Figure 6: Weibull distributions for duration, pause length, and response time. Panels for utterances (U) and gestures (G): spontaneous duration $f^{S,D}(d)$, spontaneous pause length $f^{S,P}(p)$, reactive duration $f^{R,D}(d)$, and reaction time $f^{R,R}(r)$. The unit of the horizontal axis is seconds. Histograms are obtained from the manual annotation of data C1.]

4. ESTIMATION ALGORITHM
Based on the model defined above, the problem is to estimate the interaction E, the regime S, the gaze pattern X, and the model parameters ϕ from measurements $Z = \{H, \mathbf{G}, \mathbf{U}\}$. We employ a Bayesian approach [3] to estimate the joint posterior distribution $p(\mathbf{E}, S, X, \phi \mid Z)$ of all unknown variables from the given measurements.
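
Because the estimation relies on evaluating the timing likelihood of Eq. (3) for candidate interaction structures, a minimal Python sketch of that likelihood is given here. The three-parameter Weibull density (shape, scale, location) is the standard form; the dictionary keys mirror the superscripts in Eq. (3). The numerical parameter values are placeholders chosen only so that the code runs, not estimates from the data, and in practice separate parameter sets would be kept for utterances and gestures.

```python
import math

def weibull_pdf(x, shape, scale, loc=0.0):
    """Density of a three-parameter Weibull distribution at x (zero for x <= loc)."""
    z = (x - loc) / scale
    if z <= 0.0:
        return 0.0
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))

def behavior_likelihood(kind, duration, timing, params):
    """Eq. (3) for one behavior interval.

    kind   : "spontaneous" or "reactive"
    timing : pause length p if spontaneous, reaction time r if reactive
    params : maps "S,D", "S,P", "R,D", "R,R" to (shape, scale, loc) tuples
    """
    if kind == "spontaneous":
        return weibull_pdf(duration, *params["S,D"]) * weibull_pdf(timing, *params["S,P"])
    if kind == "reactive":
        return weibull_pdf(duration, *params["R,D"]) * weibull_pdf(timing, *params["R,R"])
    raise ValueError("kind must be 'spontaneous' or 'reactive'")

# Placeholder parameters (hypothetical, for illustration only).
example_params = {
    "S,D": (1.5, 3.0, 0.0),
    "S,P": (1.2, 1.0, 0.0),
    "R,D": (1.5, 1.0, 0.0),
    "R,R": (2.0, 0.8, -0.5),  # a negative location admits negative reaction times
}
likelihood = behavior_likelihood("reactive", duration=0.6, timing=0.3, params=example_params)
```

With a shape parameter of 1 the Weibull density reduces to an exponential density, which is the only timing distribution that the Markov models mentioned in Section 1 can express; other shape values give the extra flexibility that motivates the semi-Markov formulation.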
To estimate the joint posterior, this study uses the Markov chain Monte Carlo method called the Gibbs sampler [11], which has an advantage when dealing with complex models. The Gibbs sampler repeatedly generates random samples from the full-conditional posterior distributions of each unknown variable, which constitute a Markov chain whose invariant distribution equals the desired joint posterior. The full-conditional distribution is the distribution of a variable when the other variables are given. From the random samples after the Markov chain has converged, the maximum a posteriori estimate is calculated for discrete variables, and the minimum mean-squared error estimates are calculated for continuous variables. Note that this estimation algorithm is a form of unsupervised learning, which does not need training data to obtain the model parameters ϕ. Instead, we need to empirically determine the hyper-parameters of the prior distributions of the parameters ϕ.

The full-conditional distribution of the interaction structure of an action unit can be derived from the joint distribution in Eq. (1), and is written as

$$P(E \to B \mid S, X, \mathbf{E} \setminus E, \phi, Z) \propto \prod_{B' \in B(E)} f_B(B' \mid E)\; P(E \to B \mid X, S)\; \psi(E, B), \quad (4)$$

where $B(E)$ denotes the set of behaviors included in action unit E. According to Eq. (4), the reaction target (and hence the spontaneous-reactive class) $B \in \mathrm{Nei}(E) \cup \{\emptyset\}$ of each action unit E is sampled. The trigger time for each of the candidate utterance intervals in neighborhood Nei(E) is sampled from the reaction time distribution. The full conditional of each spontaneous and directional probability becomes a Beta distribution when assuming Beta priors, and these probabilities are sampled from the corresponding Beta full-conditional posteriors. For the Weibull models, we assume Gamma priors for the Weibull shape and scale parameters, and a truncated uniform prior for the location parameter; these priors are used to represent a priori knowledge about the timing distributions. For other variables and parameters, this paper follows the procedures described in [22].

5. EXPERIMENTS

5.1 Data
This paper targets 4-person group conversations.
The participants were four women within the same age bracket; they were seated as shown in Figure 7. They were instructed to hold a discussion and try to reach a conclusion as a group for a given discussion topic within five minutes. The discussion topics were "Should tax breaks be given to full-time housewives, or not?" and "Is marriage and romantic love the same or different?"; hereafter the recorded conversations are referred to as C1 and C2, respectively. The head directions were measured at 30 Hz using magnetic-based sensors (POLHEMUS Fastrak TM), which were attached to their heads on hair bands. Audio data were recorded by lapel microphones attached to each participant. Also, video sequences, a whole shot (Figure 7(b)) and bust shots (Figure 9(a)), were recorded at 30 frames/sec. These data were synchronized at the unit time step of 1/30 sec. The lengths of data were 10000 and 9100 frames (5.6 and 5.1 min) for C1 and C2, respectively.

[Figure 7: Overview of scene. (a) Plan view of participants' locations, (b) whole view of participants.]

5.2 Manual Annotation
The raw data was manually annotated to permit a quantitative evaluation. Annotation was mainly performed by one female in her 20s. Fig. 8(a) shows a part of the manual annotation (about 20 sec.). For each person, P1-P4, the thick bands show the utterance intervals, manually detected based on an IPU (Inter Pausal Unit) of 0.3 sec. The line segments beneath the utterances indicate gesture intervals, manually detected by visual inspection of the video. The target gestures were nodding, shaking, and tilting. Other head motions were excluded. Next, for each utterance interval and each gesture interval, the spontaneous-reactive class and the reaction target were determined. Also, the trigger time was given for each reactive behavior. In Fig. 8(a), small circles represent the beginning of action units and the arrows from them indicate the reaction targets, while circles with no arrow indicate spontaneous actions. The position of the arrow's head indicates the trigger time step. Each vertically elongated ellipse indicates an integrated action unit consisting of an utterance and a gesture. Also, the ground truth of gaze directions and regimes was manually created by watching the video sequences.

5.3 Voice Activity Detection
To automatically detect utterance intervals from the audio signals captured by lapel microphones, this paper employed a voice activity detection (VAD) method [24] that can robustly detect each person's utterance separately by clustering each person's signal in the time-frequency domain. The detected utterances output by the VAD method were reformed by filling short-term gaps to satisfy the IPU criteria, and eliminating very short intervals as noise. Fig. 8(b) shows some of the utterance intervals so detected. Compared to the manual detection results in Fig. 8(a), the automatic result includes detection lapses due to whisper-like utterances, and over-detection due to breathing, rustling, and coughing. Table 3(a) shows the accuracy of utterance detection in terms of precision, recall, and hit ratio. Here, the hit ratio is the ratio of correct frames to all frames.
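
The reforming step and the frame-level scores described in Section 5.3 can be sketched in Python as follows. Only the 0.3 sec. IPU gap (9 frames at 30 frames/sec.) comes from the text; the minimum interval length, the function names, and the half-open interval encoding are assumptions made for this illustration, and the sketch does not reproduce the VAD method of [24].

```python
FRAME_RATE = 30                     # frames per second (Section 5.1)
IPU_GAP = round(0.3 * FRAME_RATE)   # 0.3 sec. pause expressed in frames

def frames_to_intervals(active):
    """Convert a per-frame 0/1 sequence into (start, end) intervals, end exclusive."""
    intervals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(active)))
    return intervals

def reform_intervals(intervals, max_gap=IPU_GAP, min_length=3):
    """Fill gaps shorter than max_gap, then drop intervals shorter than min_length.

    min_length is a hypothetical threshold; the paper does not state its value.
    """
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], end)   # bridge the short gap
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_length]

def frame_scores(detected, truth):
    """Frame-level precision, recall, and hit ratio (correct frames / all frames)."""
    pairs = list(zip(detected, truth))
    tp = sum(1 for d, g in pairs if d and g)
    fp = sum(1 for d, g in pairs if d and not g)
    fn = sum(1 for d, g in pairs if not d and g)
    tn = sum(1 for d, g in pairs if not d and not g)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "hit_ratio": (tp + tn) / len(pairs) if pairs else 0.0,
    }
```

The same two post-processing functions could be applied to the per-frame output of the gesture classifier in Section 5.4, since the text states that the SVM output was reformed in a similar manner.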
Table 3(a) confirms ha he auomaic voice deecion mehod used was highly accurae and robus even hough he amoun of cross-alk was significan. 5.4 Gesure Recogniion The head gesures were deeced wih a new echnique based on discree Wavele ransform (DWT). Firs, DWT feaures are separaely calculaed for each head pose componen; he componens are azimuh (horizonal), elevaion (verical), and roll (in-plane roaion). This sudy applied he Daubechies wavele of order 10 (db10) and decomposiion scale was se o 2-4; windows size was 16. A each ime sep, we calculaed he DWT coefficiens of deails D2-D4 and final approximaion A4, and hen calculaed he maximum, minimum, mean, sandard deviaion of he wavele coefficiens in each sub-band, as he feaure vecor of gesures. These saisics were used in EEG signal analysis [13]. 4 Nex, we rained an SVM o classify he feaure vecor ino wo caegories; gesure or non-gesure, a each ime sep. This paper employed a polynomial kernel of order 5 and a sof margin crierion. Training and classificaion was done for each person in each conversaion. C1 was classified using he SVM rained wih manually deeced daa of C2, and vice versa. The oupu of he SVM was hen reformed in a manner similar o he uerance inervals o yield he final gesure inervals. Fig. 8(b) shows some of he deeced gesures. A comparison o Fig. 8(a) shows ha here are some errors; overdeecion occurs due o coninuous gesures and small head movemens. Table 3(b) shows he accuracy of gesure deecion, and indicaes ha he deecion was moderaely successful, despie he huge dynamic range of gesures, from almos invisible ones o very large ones. 5.5 Experimen Seing This paper employed he same values as used in [22], for hyper-parameers, which deermine he prior disribuions of gaze, regime, and head direcions. The Gibbs sampling ieraion was 10000, and saisics were calculaed from samples obained from he 5000h 10000h ieraions. The same parameer se were used for boh daa C1 and C2. 5.6 Qualiaive Evaluaion Fig. 
8(b) shows a part of the interaction structures inferred from automatically detected behaviors. Fig. 9 shows three snapshots to illustrate the flow of the conversation. In this scene, the speaking turn changes over time: P4 → P2 → P1. We have confirmed that gaze patterns and regimes were successfully estimated for this scene. First, P4 started to give her opinion to the others, who listened to P4. During this (P4's) turn, the others responded to P4 with utterances and gestures. They synchronized their responses to a break point in P4's discourse. Fig. 8(b) indicates that these back-channel responses were successfully determined. Next, at the end of P4's utterances, she asked the others for agreement and tried to confirm their attitudes; their answers were correctly inferred. At the same time, P2 overlaid her utterance with the end of P4's sentence, and took over the speaking turn. P1, P3, and P4 turned their gaze to P2 and acknowledged her turn. Also, they responded to her tag-question with positive answers; their responses were successfully determined, even though the turn taking by P2 was abrupt. P1 then took advantage of a momentary chance, and took the turn. P2, P3, and P4 laughed at what P1 said; it appeared to be a humorous phrase. These responses toward P1 were successfully determined. Fig. 8(b) indicates that most question/answer and addressees' back-channel responses toward speakers were accurately estimated, and followed the changes in speaking turn. A visual inspection of all inferred interactions confirmed that the inferred interaction structures were reasonably accurate; a few flaws were present.

Figure 8: Interaction network representation of interaction structures (utterance intervals and gesture intervals): (a) manual annotation, (b) inference result from automatically detected behaviors (data = C1). 1 frame = 1/30 sec.; display length ≃ 20 sec.

5.7 Quantitative Evaluation

Table 4 shows the result of the quantitative evaluation of interactions. Table 4(a) shows the ratio of spontaneous-reactive class (correctly estimated) to the manual annotation. Table 4(b) shows the ratio of the number of action units whose target person was correctly inferred to the number of all action units whose spontaneous-reactive class was correctly determined. Table 4(c) shows the ratio of the number of action units in which the difference between the estimated trigger time and the one from the corresponding annotation was less than or equal to 0.3 sec.; the denominator of this ratio is the number of action units whose target person was correctly inferred.

Table 4(a) shows that the accuracy of determining the spontaneous-reactive class is rather modest; misclassification happens often, especially when the action unit has a short preceding pausal length. Table 4(b) indicates that target persons were accurately inferred, and Table 4(c) suggests that the accuracy of identifying the trigger time was reasonably high. C1 yielded superior performance to C2, because C2 was a more complex conversation, with a lot of rapid turn changes. In general, the results gained from automatically detected behaviors were basically comparable to those from the manually detected ones; this verifies the effectiveness of VAD and the gesture detection technique described here. Despite the limited dataset and annotation, the above results suggest the effectiveness of the proposed method in analyzing nonverbal interactions in multiparty conversations.
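The three ratios reported in Table 4 are nested: each one uses the correctly handled units of the previous one as its denominator. The following is a minimal sketch of that bookkeeping, not the evaluation code used for the experiments. The `ActionUnit` record, the function names, and the toy data are hypothetical. The sketch also assumes that annotated and estimated action units are already aligned one-to-one, and that spontaneous units carry no target person or trigger time and therefore count as matching on both.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ActionUnit:
    """One utterance or gesture unit (hypothetical record layout)."""
    reactive: bool                        # True = reactive, False = spontaneous
    target: Optional[str] = None          # person responded to, if reactive
    trigger_time: Optional[float] = None  # trigger time in seconds, if reactive


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is empty."""
    return numerator / denominator if denominator else 0.0


def trigger_matches(a: ActionUnit, e: ActionUnit, tolerance: float) -> bool:
    """True if both trigger times are absent, or differ by at most tolerance."""
    if a.trigger_time is None or e.trigger_time is None:
        return a.trigger_time is e.trigger_time
    return abs(a.trigger_time - e.trigger_time) <= tolerance


def nested_accuracy(annotated: List[ActionUnit],
                    estimated: List[ActionUnit],
                    tolerance: float = 0.3) -> Tuple[float, float, float]:
    """Compute the ratios described for Table 4(a), 4(b), and 4(c)."""
    pairs = list(zip(annotated, estimated))
    # (a) units whose spontaneous-reactive class agrees with the annotation
    class_ok = [(a, e) for a, e in pairs if a.reactive == e.reactive]
    # (b) among those, units whose target person agrees
    target_ok = [(a, e) for a, e in class_ok if a.target == e.target]
    # (c) among those, units whose trigger time is within the tolerance
    trigger_ok = [(a, e) for a, e in target_ok
                  if trigger_matches(a, e, tolerance)]
    return (ratio(len(class_ok), len(pairs)),
            ratio(len(target_ok), len(class_ok)),
            ratio(len(trigger_ok), len(target_ok)))


# Toy data, invented for illustration only.
annotated = [
    ActionUnit(True, "P4", 1.0),
    ActionUnit(True, "P4", 2.0),
    ActionUnit(False),
    ActionUnit(True, "P2", 5.0),
    ActionUnit(True, "P1", 7.0),
]
estimated = [
    ActionUnit(True, "P4", 1.1),  # class, target, and trigger all match
    ActionUnit(True, "P2", 2.0),  # class matches, target does not
    ActionUnit(True, "P4", 3.0),  # class does not match
    ActionUnit(True, "P2", 5.5),  # trigger is off by 0.5 sec.
    ActionUnit(True, "P1", 7.2),  # class, target, and trigger all match
]

acc_class, acc_target, acc_trigger = nested_accuracy(annotated, estimated)
print(f"class={acc_class:.2f} target={acc_target:.2f} trigger={acc_trigger:.2f}")
```

Because the denominators shrink from (a) to (c), a high value in a later column does not by itself imply that most action units were handled correctly end to end; the three values are best read together.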
Future work includes evaluations using a comprehensive dataset that includes various groups and topics, as well as examining the consistency of manual annotations (used as ground truth) given by different annotators.

Figure 9: Snapshots of three time steps, t1, t2, t3, in Fig. 8: (a) each participant, (b) regime estimates and gaze patterns (solid arrows: estimates, wide arrows: ground truth). R^C_i denotes Pi's monologue regime.

Table 3: Detection accuracy of utterance (a) and gesture (b)

        (a) utterance                (b) head gesture
        Precision  Recall  Hit       Precision  Recall  Hit
C1      95.2       85.2    95.0      60.0       86.6    73.3
C2      91.5       81.3    93.6      75.1       60.3    77.2

Table 4: Accuracy of estimated interactions: (Manual) manual annotation, (Auto) automatically detected behaviors

              (a) spon.  (b) person  (c) trigger
C1 (Manual)   88.3       97.7        81.2
C2 (Manual)   77.1       92.2        67.7
C1 (Auto)     75.7       95.9        74.5
C2 (Auto)     71.1       91.4        70.1

6. CONCLUSION AND DISCUSSION

This paper proposed a novel target of conversation scene analysis, the automatic inference of interaction patterns from participants' nonverbal behaviors in multiparty conversations. To that purpose, a hierarchical probabilistic conversation model was introduced. Even though this paper focused on simple interactions conditioned on conversation regimes, gaze patterns, and temporal structures such as duration, pause, and reaction times, the proposed framework is considered to be noteworthy in that it provides a basic methodology for analyzing nonverbal cross-modal interactions in face-to-face conversations, and offers several prospective directions.

First, the interaction structures discovered with the proposed framework can be used as a clue for understanding the mental/context level aspects of conversations. The first step would be classifying the responses into positive and negative types, which can be indicated by head gesture classes. The proposed gesture detector is powerful enough to distinguish various head motions such as nod, shake, and tilt. The problem is to establish useful links between motion features and the inner state of the person, like the degree of agreement. Moreover, it can provide a basic element for analyzing how one person's opinion can spread throughout a human network and how a group's concordance is formed over time. It is also interesting work to relate our framework to psychological/linguistic studies such as adjacency pair analysis [25] and synchrony analysis [8].

The interaction structures can be a useful element of meeting annotation for automatic archiving/summarizing systems. For example, it could provide more semantic-based retrieval capability such as "who had a positive/negative response to this opinion?" and "whose opinion was the most influential?". Also, the identified interaction structures can be used to improve automatic video editing so that viewers can more clearly understand who responds to whom.
Furthermore, it is worth considering a system that can quantify communication skill and provide users with feedback to improve human communication skill in organizations.

To realize real-time applications, future work includes the real-time simultaneous tracking of faces from low-resolution video sequences, image-based head gesture recognition, and voice detection/separation using signals captured with microphone arrays. Our framework can also easily incorporate other modalities such as prosody and facial expressions. We are currently developing a facial expression recognition technique that is robust against head-pose changes [17].

Finally, the authors believe that this work will contribute to opening up a new research field that can explore various aspects of nonverbal cross-modal interactions in multiparty conversations, and bridge related disciplines such as psychology, social linguistics, and multimodal applications.

7. REFERENCES

[1] M. Argyle. Bodily Communication, 2nd ed. Routledge, London and New York, 1988.
[2] S. Basu. Conversational Scene Analysis. Ph.D. thesis, Massachusetts Institute of Technology, 2002.
[3] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons, Ltd., 1994.
[4] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In Proc. CVPR '97, pages 994-999, 1997.
[5] L. Chen, M. Harper, A. Franklin, T. R. Rose, I. Kimbara, Z. Huang, and F. Quek. A multimodal analysis of floor control in meetings. In Proc. MLMI '06, 2006.
[6] L. Chen, R. T. Rose, F. Parrill, X. Han, J. Tu, Z. Huang, M. Harper, D. M. F. Quek, R. Tuttle, and T. Huang. VACE multimodal meeting corpus. In Proc. MLMI, 2005.
[7] T. K. Choudhury. Sensing and Modeling Human Networks. Ph.D. thesis, MIT, 2004.
[8] W. S. Condon and M. B. Ogston. Sound film analysis of normal and pathological behavior patterns. J. Nervous and Mental Disease, 143:338-347, 1966.
[9] A. Dielmann and S. Renals. Dynamic Bayesian networks for meeting structuring. In Proc. IEEE ICASSP '04, 2004.
[10] D. Gatica-Perez. Analyzing group interactions in conversations: a review. In Proc. IEEE Int. Conf. Multisensor Fusion and Integration for Intelligent Systems '06, pages 41-46, 2006.
[11] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1996.
[12] C. Goodwin. Conversational Organization: Interaction between Speakers and Hearers. Academic Press, 1981.
[13] İ. Güler and E. D. Übeyli. Multiclass support vector machines for EEG-signals classification. IEEE Trans. Information Technology in Biomedicine, 11:117-126, 2007.
[14] J. Janssen and R. Manca. Applied Semi-Markov Processes. Springer, 2006.
[15] A. Kapoor and R. W. Picard. A real-time head nod and shake detector. In Proc. Workshop on PUI, 2001.
[16] A. Kendon. Some functions of gaze-direction in social interaction. Acta Psychologica, 26:22-63, 1967.
[17] S. Kumano, K. Otsuka, J. Yamato, E. Maeda, and Y. Sato. Pose-invariant facial expression recognition using variable-intensity templates. In Proc. ACCV '07, 2007.
[18] S. K. Maynard. Interactional functions of a nonverbal sign: Head movement in Japanese dyadic casual conversation. J. Pragmatics, 11:589-606, 1987.
[19] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic analysis of multimodal group actions in meetings. IEEE Trans. PAMI, 27(3), 2005.
[20] L.-P. Morency, C. Sidner, C. Lee, and T. Darrell. Contextual recognition of head gestures. In Proc. ICMI '05, pages 18-24, 2005.
[21] D. N. P. Murthy, M. Xie, and R. Jiang. Weibull Models. John Wiley & Sons, Ltd., 2004.
[22] K. Otsuka, Y. Takemae, J. Yamato, and H. Murase. A probabilistic inference of multiparty-conversation structure based on Markov-switching models of gaze patterns, head directions, and utterances. In Proc. ICMI '05, 2005.
[23] K. Otsuka, J. Yamato, and H. Murase. Conversation scene analysis with dynamic Bayesian network based on visual head tracking. In Proc. ICME '06, 2006.
[24] H. Sawada, S. Araki, K. Otsuka, M. Fujimoto, and K. Ishizuka. Voice activity detection for multiple speakers with multiple pin microphones. In 2007 Spring Meeting, Acoustical Society of Japan, 2007.
[25] E. A. Schegloff and H. Sacks. Opening up closings. Semiotica, 8:289-327, 1973.
[26] R. Virginia and M. James. Nonverbal behavior in interpersonal relations, 5th Ed. Allyn & Bacon, 2003.
[27] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud. Modeling individual and group actions in meetings: A two-layers HMM framework. In Proc. 2nd IEEE Workshop on Event Mining, 2004.