Robust Object Tracking with Online Multi-lifespan Dictionary Learning


2013 IEEE International Conference on Computer Vision

Junliang Xing, Jin Gao, Bing Li, Weiming Hu
National Laboratory of Pattern Recognition, Institute of Automation, CAS
Beijing 100190, P. R. China
{jlxing,jgao10,bli,wmhu}@nlpr.ia.ac.cn

Shuicheng Yan
Dept. of Electrical and Computer Engineering, National University of Singapore
Singapore 117576, Singapore
eleyans@nus.edu.sg

Abstract

Recently, sparse representation has been introduced for robust object tracking. By representing the object sparsely, i.e., using only a few templates via l1-norm minimization, these so-called l1-trackers exhibit promising tracking results. In this work, we address the object template building and updating problem in these l1-tracking approaches, which has not been fully studied. We propose to perform template updating, in a new perspective, as an online incremental dictionary learning problem, which is efficiently solved through an online optimization procedure. To guarantee the robustness and adaptability of the tracking algorithm, we also propose to build a multi-lifespan dictionary model. By building target dictionaries of different lifespans, effective object observations can be obtained to deal with the well-known drifting problem in tracking and thus improve the tracking accuracy. We derive effective observation models both generatively and discriminatively based on the online multi-lifespan dictionary learning model and deploy them in the Bayesian sequential estimation framework to perform tracking. The proposed approach has been extensively evaluated on ten challenging video sequences. Experimental results demonstrate the effectiveness of the online learned templates, as well as the state-of-the-art tracking performance of the proposed approach.

1. Introduction

Visual object tracking, which aims to estimate the target state (e.g. position and size) in video sequences, is a very important research topic in the computer vision field. It has many applications like visual surveillance, vehicle navigation and human-computer interaction [25]. Although many algorithms have been proposed in the last decades, object tracking still remains a very challenging problem for real-world applications due to difficulties like background cluttering, image noises, illumination changes, object occlusions and fast motions, as shown in Figure 1.

Figure 1. Typical difficulties in the object tracking problem. (1) Image noises and background cluttering (top-left image), (2) illumination changes (top-right image), (3) object occlusions (bottom-left image), and (4) fast object motions (bottom-right image). The tracking results using fixed templates, fully updated templates, the update methods in [19], [26], [22], [13] and the proposed method in this work are respectively plotted in teal, olive, purple, cyan, blue, green and red colors. Best viewed in the original color PDF file.

Recently, sparse coding based methods have been successfully applied to the visual tracking problem [19, 20, 5, 28, 13]. The basic assumption in these methods is that the target can be represented as a linear combination of only a few elements in a template set. Then the confidence of a target candidate can be modeled using the reconstruction error of the sparse representation. Denoting a template set of the target with m elements as T = [t_1, t_2, ..., t_m] ∈ R^{n×m}, and a target candidate as y ∈ R^n, which is used as the observation to estimate the object state x (see Section 3.3 for more details), the sparse representation of the candidate using the templates is obtained by solving

    min_c ||T c − y||_2^2 + λ ||c||_1,    (1)

where c ∈ R^m is the coefficient vector of the sparse representation, and λ is the regularization parameter that controls the sparsity of c.
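For concreteness, Eq. (1) is a standard lasso problem and can be solved with any off-the-shelf l1 solver. The snippet below is a minimal illustrative sketch, not the authors' implementation: the template matrix T, the candidate y and the default regularization value are placeholders, and scikit-learn's penalty is rescaled to match the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(T, y, lam=0.01):
    """Solve min_c ||T c - y||_2^2 + lam * ||c||_1, as in Eq. (1).

    T : (n, m) template matrix, one vectorized template per column.
    y : (n,)   vectorized target candidate.
    """
    n = T.shape[0]
    # scikit-learn minimizes (1/(2n)) * ||y - T c||_2^2 + alpha * ||c||_1,
    # so alpha = lam / (2 * n) reproduces the weighting of Eq. (1).
    solver = Lasso(alpha=lam / (2.0 * n), fit_intercept=False, max_iter=10000)
    solver.fit(T, y)
    c = solver.coef_
    recon_err = float(np.sum((T @ c - y) ** 2))   # ||T c - y||_2^2
    return c, recon_err

# Usage with hypothetical data: T = np.column_stack(list_of_templates); y = candidate
# c, err = sparse_code(T, y, lam=0.01)   # err feeds the confidence model of Eq. (2)
```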
Based on the derived result, the confidence of the candidate y with respect to the target state x is naturally modeled as

    p(y | x) = (1/Γ) exp(−α ||T c − y||_2^2),    (2)

where α is a constant and Γ is a normalization factor that makes p(y | x) a valid probability distribution.

Built on this idea, previous works on sparse coding based object tracking have mainly focused on two problems. One is how to design the object templates. Methods using global object templates [19, 26], local image patches [13], or a combination of them [28] have been proposed. The other problem is how to solve the minimization problem to perform efficient tracking; relevant solutions include the Interior Point method [19], the Accelerated Proximal Gradient approach [5], the Least Angle Regression algorithm [28, 13], and the Augmented Lagrange Multiplier method [26].

Supposing that we have already designed the object templates and used them to track the target via a certain optimization procedure, how can we update the object templates effectively and efficiently? This is also a very important problem, but it has not received much attention in previous l1-trackers. Existing l1-trackers usually address the template update problem by adopting some classic methods (e.g. fixed templates with no updates [28], or incremental update with replacement [19, 27]) or some intuitive strategies [13, 26]. In this paper, we propose to treat the template update problem in the tracking scenario as an online incremental dictionary learning problem. We also explore the temporal nature of the object templates and propose to learn a multi-lifespan dictionary to improve the adaptability and robustness of the tracking algorithm. We apply the online multi-lifespan dictionary learning model in the Bayesian sequential estimation framework and design effective observation models both generatively and discriminatively for a particle filter implementation. Extensive experiments on public benchmark video sequences demonstrate the effectiveness of our online learned templates, and the state-of-the-art tracking performance of the proposed approach.

2. Related Work

Object tracking methods can be roughly grouped into two categories: generative and discriminative. Generative methods, which use a descriptive appearance model to represent the object, perform tracking as a searching problem over candidate regions to find the most similar one. Examples of generative tracking methods are the eigentracker [6], the mean shift tracker [7], and the fragment tracker [1]. Discriminative methods formulate object tracking as a binary classification problem and find the location that best separates the target from the background. Examples of discriminative tracking methods are the online boosting tracker [11], the ensemble tracker [3], and the multi-instance learning tracker [4].

With recent advances in sparse representation, sparse coding based object trackers have been demonstrated to be a promising tracking framework [19, 20, 5, 28, 13, 27, 26]. In the l1-tracker first proposed by Mei and Ling [19], a combination of object templates and trivial templates is employed to tackle the occlusion problem. To make the templates directly robust to occlusion, local image patches can be used as the object dictionary [13, 28].

Table 1. Comparison of object templates in popular l1-trackers.
Method              Efficiency  Building  Updating     Robustness  Adaptivity
Mei and Ling [19]   Low         Manually  Intuitively  Low         Low
Mei and Ling [20]   Low         Manually  Intuitively  Low         Low
Bao et al. [5]      High        Manually  Intuitively  Low         Low
Zhang et al. [27]   Low         Manually  Intuitively  Low         High
Zhang et al. [26]   High        Manually  Intuitively  Low         High
Jia et al. [13]     High        Learned   Learned      Low         High
Zhong et al. [28]   High        Manually  Intuitively  High        Low
Proposed Method     High        Learned   Learned      High        High
In order to improve the efficiency of the optimization process, Mei et al. [20] further propose a minimal error bounded strategy to reduce the number of l1-norm related minimization problems, and later some other efficient optimization procedures were adopted to solve the problem [5, 28, 13, 26]. Besides template design and the optimization method, template update is an even more important problem in the sparse coding based tracking framework. Although a fixed object template may work well for short video sequences in which the target stays nearly unchanged, it may be incompetent for long sequences where the target often undergoes different kinds of changes (see Figure 1). To adapt to appearance changes of the target during tracking, the templates in [19, 20, 5, 26] are updated based on the weights assigned to each template and the similarities between the templates and the current estimate of the target candidate. In [28], the global object templates are kept fixed during tracking to ensure the discriminative power of the model, while the local patch templates are constantly updated to adapt to object changes. In order to incrementally update the templates, a more reasonable strategy can be found in [13], where old templates are given slow update probabilities and the incremental subspace learning algorithm in [22], which restricts the template vectors to be orthogonal, is employed. Table 1 gives a comparative summary of the object templates used in the most popular l1-trackers. Although all these methods provide different strategies for template update, most of them use predefined templates and update them intuitively, which may not fully unleash the potential capability of the templates. The proposed online learning based template building and updating algorithm, which is both robust and adaptive for tracking, well addresses the problems of object templates in the l1-trackers.

3. Proposed Approach

Given the object tracking results, the main objective of this paper is to automatically learn good object templates online, which can, in turn, benefit the ongoing object tracking process with improved robustness and adaptability. Our core idea to achieve this objective is to avoid imposing heavy constraints on the template building and updating process while making them most suitable for tracking.

To this end, we formulate this template building and updating problem as online dictionary learning, which automatically updates object templates so that they better adapt to the data for tracking. In order to further improve the robustness and adaptability of the learned templates, we explore the temporal property of the learned dictionary and propose to build a dictionary with multiple lifespans that possess distinct temporal properties. Based on the learned multi-lifespan dictionary, we deduce effective observation models both generatively and discriminatively, and deploy them in the Bayesian sequential estimation framework to perform tracking using a particle filter implementation. Note that Li et al. [15] also use multiple detectors with different lifespans. Their objective is to improve the computational efficiency of the multi-lifespan detectors, which are used sequentially in a cascade particle filter. Our multi-lifespan model, however, mainly aims to ease the contradiction between the adaptivity and robustness of template based object tracking algorithms, and the multi-lifespan dictionaries are fused in parallel in a multi-state particle filter. Figure 2 gives the overall description of our approach, the details of which are elaborated in the following subsections.

Figure 2. Overall description of the proposed approach. We perform template update as an online dictionary learning problem and propose to learn a multi-lifespan dictionary, i.e. the Short Lifespan Dictionary (SLD), the Middle Lifespan Dictionary (MLD) and the Long Lifespan Dictionary (LLD), to model the object. The SLD, learned from samples densely collected from the previous frame, makes the best adaptation to the target. The LLD, learned from the most accurate samples collected from all the frames, ensures the robustness of the model. The MLD, between the SLD and the LLD, balances the adaptability and robustness of the final model. The Online Multi-lifespan Dictionary Learning (OMDL) model, together with the background samples collected in the previous frame, is used to deduce observation models both generatively and discriminatively. These observation models are then applied in the Bayesian sequential estimation framework using a particle filter implementation to perform adaptive and robust object tracking.

3.1. Tracking as Online Dictionary Learning

In sparse coding based tracking algorithms, a target candidate is represented as a linear combination of a few elements from a template set. The building and updating of this template set, therefore, have a great impact on the final tracking results. Previous works usually build this template set by directly sampling from the initialization of tracking, and then use some intuitive strategies to update the set during tracking [19, 28, 13]. From a different viewpoint, here we want to automatically learn this template set to make it best adapt to the video data to be tracked. We do not want to impose any constraints on the learned templates, but only expect that they can better represent the target that has been tracked and will be tracked.
Suppose all the possible target candidates are within a template set Y = {y_1, ..., y_N}, where y_i ∈ R^n denotes one n-dimensional sample and N is the template size, which can be very large since samples are continually obtained whenever a new frame has been tracked. A good template set should then have the minimal cost to represent all the elements in this set. Denote the representation cost function as

    f(D) ≜ (1/|Y|) Σ_{y ∈ Y} l(y, D),    (3)

where D ∈ R^{n×m} is the learned template set, distinguished from the predefined template set T of previous works, and l(·) is the loss function such that l(y, D) is small if D is good at representing the candidate y. In a sparse coding based object tracking framework, the loss function can be naturally modeled as

    l(y, D) ≜ min_{c ∈ R^m} (1/2) ||y − D c||_2^2 + λ ||c||_1.    (4)

To prevent the l2-norm of D from becoming arbitrarily large, which may lead to arbitrarily small values of c, one usually constrains its columns d_1, ..., d_m to have l2-norms less than or equal to 1.

The constraint set of D thus can be represented as

    𝒟 ≜ {D ∈ R^{n×m} s.t. ∀ j ∈ {1, ..., m}, ||d_j||_2 ≤ 1}.    (5)

Now, putting everything together, the template learning can be formulated as the following optimization problem:

    D* = argmin_{D ∈ 𝒟} (1/|Y|) Σ_{y ∈ Y} l(y, D),
    s.t. l(y, D) = min_c (1/2) ||y − D c||_2^2 + λ ||c||_1.    (6)

The above problem is also referred to as dictionary learning, which has found many applications in signal processing [2, 24, 23], machine learning [8, 18] and, lately, computer vision for face recognition [16] and image restoration [17]. Generally, it can be optimized using a two-step iterated procedure: the first step fixes D and minimizes the cost function with respect to the coefficient vector c; the second step fixes the coefficient vector c and performs gradient-descent-like methods to minimize the cost function.

In our scenario of template updating in online object tracking, since the target candidates are obtained consecutively, the above two-step iterated procedure must be redesigned to be performed in an online manner so that the dictionary is learned incrementally. In Algorithm 1, we summarize this redesigned procedure of online dictionary learning for template update. The learning procedure receives the dictionary learned in the previous frame as input, and updates the dictionary incrementally according to the samples collected in the current frame. In the algorithm, Step 1 is solved using the LARS method [9] and Step 2 admits an analytical solution. The involved matrix inversion is calculated in an online manner using the Sherman-Morrison formula [21] to make the learning process more efficient. The introduced variables C_t and Y_t are intermediate results associated with the dictionary D_t and are stored for incremental learning. The initial values D_0, C_0 and Y_0 are obtained from the tracker. To improve the robustness of the learned dictionary, multiple samples are collected around the tracking result x_t in frame I_t, and M is the parameter that controls the number of collected samples. Note that here we do not impose any constraints on the explicit dictionary format, which can be object templates, image patches or even extracted features.

Algorithm 1 Online dictionary learning for template update
Input: frame data I_t, tracking result x_t, learned dictionary D_{t−1}, C_{t−1}, Y_{t−1} from the previous frame, λ (regularization parameter), M (sample drawing number).
Output: learned dictionary D_t in the current frame.
1: Initialization: D_t ← D_{t−1}, C_t ← C_{t−1}, Y_t ← Y_{t−1}.
2: for i = 1 to M do
3:   Step 1: fix D_t to find the best coefficients,
       c^{(i)} = argmin_{c ∈ R^m} (1/2) ||y^{(i)} − D_t c||_2^2 + λ ||c||_1.
4:   Step 2: fix {c^{(j)}} to update the dictionary,
       C_t ← C_t + c^{(i)} c^{(i)T},  Y_t ← Y_t + y^{(i)} c^{(i)T},
       D_t = argmin_{D ∈ 𝒟} (1/i) Σ_{j=1}^{i} ((1/2) ||y^{(j)} − D c^{(j)}||_2^2 + λ ||c^{(j)}||_1)
           = (Σ_{j=1}^{i} y^{(j)} c^{(j)T}) (Σ_{j=1}^{i} c^{(j)} c^{(j)T})^{−1} = Y_t C_t^{−1}.
5: end for
6: Save the dictionary D_t and the intermediate variables C_t and Y_t.
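To make the per-frame update of Algorithm 1 concrete, here is a minimal numpy sketch under simplifying assumptions: the sparse-coding step uses plain ISTA instead of LARS, and the dictionary step uses the standard block-coordinate update on the accumulated statistics C and Y rather than the Sherman-Morrison bookkeeping described above. It is an illustration of the procedure, not the authors' MATLAB code; each lifespan dictionary introduced in the next subsection reuses this same update with its own sample stream.

```python
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def sparse_code_ista(D, y, lam, n_iter=100):
    """Step 1: c = argmin_c 0.5*||y - D c||^2 + lam*||c||_1 (ISTA here; the paper uses LARS)."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12            # Lipschitz constant of the smooth part
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = soft_threshold(c - (D.T @ (D @ c - y)) / L, lam / L)
    return c

def online_dictionary_update(D, C, Y, samples, lam=0.01):
    """One frame of Algorithm 1: fold the M samples drawn in this frame into (D, C, Y)."""
    D, C, Y = D.copy(), C.copy(), Y.copy()
    for y in samples:                                # y^(i), i = 1..M
        c = sparse_code_ista(D, y, lam)              # Step 1: fix D, solve for c^(i)
        C += np.outer(c, c)                          # C_t <- C_t + c c^T
        Y += np.outer(y, c)                          # Y_t <- Y_t + y c^T
        for j in range(D.shape[1]):                  # Step 2: column-wise dictionary update
            if C[j, j] > 1e-12:
                u = D[:, j] + (Y[:, j] - D @ C[:, j]) / C[j, j]
                D[:, j] = u / max(1.0, np.linalg.norm(u))   # project onto ||d_j||_2 <= 1
    return D, C, Y
```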
3.2. Multi-lifespan Dictionary Building

Adaptability and robustness are the two key characteristics that a tracker should possess. Adaptability means that the tracker should accommodate target appearance changes quickly, while robustness refers to the ability to keep working under different situations. These two characteristics, however, often contradict each other in many tracking algorithms. For template based object tracking, if the template is updated at a faster rate, the tracker can better adapt to the changes of the target but may be more likely to drift due to the noise accumulated along with fast updating. On the contrary, if the template is updated at a slower rate, the tracker does not drift easily but may not catch up with the changes of the target.

Based on the formulation of template update as online dictionary learning, we explore the temporal properties of the learned dictionary and propose to build a multi-lifespan dictionary learning model to further improve the tracking effectiveness and guarantee adaptability and robustness simultaneously. In order to explore the properties of the online learned dictionary, we represent it using a 5-tuple based on its learning procedure, i.e.,

    D = ⟨D, C, Y, {y^{(i)}_{s:e}}_{i=1}^{M}, λ⟩,    (7)

where {y^{(i)}_{s:e}}_{i=1}^{M} denotes all the candidates sampled to train the dictionary, which, together with the regularization parameter λ, completely determine the learned dictionary. Here the subscripts s and e, i.e. the start and end frame numbers of the candidates, reflect the temporal property of the training data. By collecting training candidates from different temporal intervals with a corresponding sampling strategy (here, sampling strategy refers to a different sampling variance and candidate number M in each frame), we can learn dictionaries with multiple temporal properties.

Multi-lifespan dictionaries provide a very good solution to this contradiction when simultaneously pursuing the adaptability and robustness of the tracker. As shown in Figure 2, we simultaneously learn three different lifespan dictionaries: the Short Lifespan Dictionary (SLD), the Middle Lifespan Dictionary (MLD), and the Long Lifespan Dictionary (LLD).

The SLD is trained using the candidates densely sampled only in the previous frame (i.e., s = t) and makes the best adaptation to the target in the current frame. The LLD, on the contrary, is trained using accurately sampled candidates from all previous frames to establish a robust object appearance model (i.e., s = 1). Between the SLD and LLD, the MLD tries to build an intermediate model that compromises between the models built by the SLD and LLD (i.e., s = t/2). Denoting the SLD, MLD and LLD respectively as D^S, D^M and D^L, the final online multi-lifespan dictionary learning (OMDL) model is represented as

    D = {D^S, D^M, D^L}.    (8)

In Figure 3, we give some examples of the three lifespan dictionaries learned using the online dictionary learning algorithm. It can be observed that the SLD successfully captures a more adaptive object appearance model, while the LLD builds a more robust object appearance model. The MLD obtains a good intermediate object model that balances the models built by the LLD and SLD.

Figure 3. Online dictionary learning for template update. Line 1: tracking results; Lines 2-4: examples of the learned dictionaries (SLD, MLD, LLD) at frame 100; Lines 5-7: collected negative samples used for the discriminative observation model (see Section 3.3).

3.3. Bayesian Sequential Estimation

We deploy the OMDL model in the Bayesian sequential estimation framework, which performs tracking by solving the maximum a posteriori (MAP) problem

    x̂_t = argmax_{x_t} p(x_t | y_{1:t}),    (9)

where y_{1:t} = {y_1, ..., y_t} represents all the observation candidates up to the current frame. The posterior probability p(x_t | y_{1:t}) is calculated recursively by the Bayesian theorem using a particle filter [12],

    p(x_t | y_{1:t}) ∝ p(y_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1},    (10)

where p(x_t | x_{t−1}) is the dynamic model and p(y_t | x_t) is the observation model. We employ the affine transformation to model the object motion between consecutive frames. The observation model, which is of fundamental importance to the success of the tracker, is modeled using the OMDL in both generative and discriminative manners.

The generative observation model using OMDL first solves the problem

    min_c ||D^P c − y_t||_2^2 + λ ||c||_1,    (11)

where D^P = [D^S, D^M, D^L]. Based on the solution of this problem, the generative observation model is built as

    g(y_t | x_t) = Π_{I ∈ {S,M,L}} exp(−α ||D^I c^I − y_t||_2^2),    (12)

where the superscript I denotes the corresponding decomposition of D^P and c, and α is a constant as in Eqn. (2). Here we fuse the reconstruction confidences of the OMDL model equally to build the generative observation model. Although more sophisticated strategies, e.g. weighted fusion based on the reconstruction error, could be adopted, we find this simple strategy works well in the experiments.

The discriminative observation model using OMDL first collects some background samples (denoted as D^N, see Figure 3) around the target in the previous frame and then selects the discriminative features by solving the problem

    min_s ||D^T s − p||_2^2 + λ ||s||_1,    (13)

where D = [D^P, D^N] and p is the label vector for D (+1 for object samples and −1 for background samples). The solution of this problem is a vector s whose non-zero elements indicate the selected features; it is used to form the projection matrix P by removing the all-zero rows of the matrix S = Diag(s ≠ 0). After solving

    min_c ||D′ c − y′_t||_2^2 + λ ||c||_1,  with D′ = P D, y′_t = P y_t,    (14)

the discriminative observation model is built as

    d(y_t | x_t) = exp(−β (||D′^P c^P − y′_t||_2^2 − ||D′^N c^N − y′_t||_2^2)),    (15)

where the superscripts P and N are used to decompose D′ and c. The final observation model is then represented as

    p(y_t | x_t) ∝ g(y_t | x_t) d(y_t | x_t).    (16)
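To make the generative-discriminative scoring of Eqs. (11)-(16) concrete, the sketch below evaluates the observation model for one candidate. It is an illustration only, not the authors' implementation: the lasso sub-problems are solved with an off-the-shelf scikit-learn solver (with the same penalty rescaling as in the first sketch), and the default constants α = 2.0, β = 6.0 and λ = 0.01 are the settings reported in Section 4.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso(A, b, lam):
    """min_x ||A x - b||_2^2 + lam * ||x||_1 via scikit-learn (alpha rescaled accordingly)."""
    m = Lasso(alpha=lam / (2.0 * A.shape[0]), fit_intercept=False, max_iter=10000)
    m.fit(A, b)
    return m.coef_

def observation_likelihood(y, D_S, D_M, D_L, D_N, lam=0.01, alpha=2.0, beta=6.0):
    """Score one candidate y following Eqs. (11)-(16).

    D_S, D_M, D_L : the learned SLD / MLD / LLD (columns are templates).
    D_N           : background samples collected in the previous frame.
    """
    # Generative model, Eqs. (11)-(12): reconstruct y with the fused OMDL dictionary.
    D_P = np.hstack([D_S, D_M, D_L])
    c = lasso(D_P, y, lam)
    g, start = 1.0, 0
    for D_I in (D_S, D_M, D_L):                      # equal fusion over the three lifespans
        c_I = c[start:start + D_I.shape[1]]
        g *= np.exp(-alpha * np.sum((D_I @ c_I - y) ** 2))
        start += D_I.shape[1]
    # Discriminative model, Eqs. (13)-(15): select features separating object from background.
    D_all = np.hstack([D_P, D_N])
    p = np.concatenate([np.ones(D_P.shape[1]), -np.ones(D_N.shape[1])])
    s = lasso(D_all.T, p, lam)                       # Eq. (13): sparse feature-selection vector
    keep = np.flatnonzero(s)                         # non-zero entries = selected feature rows
    if keep.size == 0:                               # degenerate case: fall back to g alone
        return g
    Dp, yp = D_all[keep, :], y[keep]                 # projection P keeps the selected rows only
    cd = lasso(Dp, yp, lam)                          # Eq. (14)
    cP, cN = cd[:D_P.shape[1]], cd[D_P.shape[1]:]
    d = np.exp(-beta * (np.sum((Dp[:, :D_P.shape[1]] @ cP - yp) ** 2)
                        - np.sum((Dp[:, D_P.shape[1]:] @ cN - yp) ** 2)))
    return g * d                                     # Eq. (16): p(y_t | x_t) up to a constant
```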
4. Experiments

We implement the proposed approach in MATLAB and run the experiments on an Intel Core 3.4 GHz PC with 4 GB memory. The regularization parameter λ in the sparse coding problems is fixed to 0.01. We use global object templates normalized to 32 × 32 pixels as the training data to learn the multi-lifespan dictionary model. The dictionary sizes for the SLD, MLD and LLD are all set to 20, and the dictionaries are incrementally learned with 28, 8 and 1 sample(s), respectively, at every frame. The number of negative samples collected at each frame is set to 60. The constants α and β that control the Gaussian kernel shape are set to 2.0 and 6.0, respectively. Note that these parameters are fixed in all the experiments.

We first conduct experiments that compare the tracking results of our template update method with those of six other template update methods. Then we evaluate the tracking performance of our approach against six state-of-the-art tracking algorithms. At last, we give some analyses and discussions of our method.
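The tracking itself is carried out with the particle filter of Section 3.3. The following sketch of one per-frame step is illustrative only: the affine state is reduced to translation plus scale for brevity, and the particle count and Gaussian motion noise are assumptions rather than values from the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, score_candidate, n_particles=600,
                         motion_std=(4.0, 4.0, 0.02)):
    """One step of the Bayesian sequential estimation of Eqs. (9)-(10).

    particles       : (N, 3) states [x, y, scale]; the paper uses a full affine motion model.
    weights         : (N,) normalized particle weights from the previous frame.
    score_candidate : callable mapping a state to p(y_t | x_t), e.g. cropping the 32x32
                      patch at that state and evaluating the observation model of Eq. (16).
    """
    idx = np.random.choice(len(particles), size=n_particles, p=weights)       # resample
    particles = particles[idx]
    particles = particles + np.random.randn(n_particles, 3) * np.asarray(motion_std)  # p(x_t|x_{t-1})
    weights = np.array([score_candidate(s) for s in particles])               # p(y_t|x_t)
    weights = weights / (weights.sum() + 1e-12)
    x_hat = particles[np.argmax(weights)]   # MAP estimate of Eq. (9), taken as the best particle
    return particles, weights, x_hat
```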

Figure 4. Tracking error (center location distance) and tracking precision (overlap ratio) of the seven different methods for template update (NU, FU, IU1, IU2, IU3, IU4, DLU), plotted against the frame number. Panels (a)-(d) show the tracking error and panels (e)-(h) the tracking precision on the four challenging video sequences car, shaking, faceocc2 and animal. Best viewed in the original color PDF file.

4.1. Template Update Method Evaluation

Template update is very important for object tracking, especially in complex scenes where the target undergoes great changes. We compare our dictionary learning method for template update (DLU) with six typical ones, including no update (NU, using fixed templates), full update (FU, updating the whole template set using the tracking results in the current frame), the intuitive update methods in [19] (IU1) and [26] (IU2), and the incremental subspace learning methods in [22] (IU3) and [13] (IU4). In order to concentrate on the template update method and make a fair comparison, the templates in these seven methods are all built from global target appearances and the number of templates is set to 60. The effectiveness of the obtained templates is judged by their descriptive power for the target, which is evaluated using the same measure generated from Eqn. (2). The experiments are performed on four challenging image sequences, car, shaking, faceocc2 and animal (see Figure 1), with the same initial rectangles in the first frame; these sequences cover most challenging situations for template updating. We employ two well-accepted metrics, center location distance and overlap ratio, to evaluate respectively the tracking error and precision of the seven template update methods. The center location distance is normalized by the object size.

In Figure 4, we plot the full quantitative experimental results of the seven methods on the four test sequences. Our dictionary learning method for template update obtains the minimal tracking error and the highest tracking precision on aggregate, especially on the test sequence animal, where all the other template update methods fail to follow the deer running at high speed from the fifth frame, leaving only our method to track the target until the end of the sequence. It is really surprising that the incremental update method in [22], which uses the eigenvectors of the target samples as templates and updates them incrementally in the eigenspace, performs poorly on the four test sequences and is even no better than the fixed template method and the full update method. The reason behind this may be that forcing the templates to be orthotropic cannot adapt well to the challenging tracking situations with non-white image noise, especially when using these templates to perform sparse representation [23]. This may be the reason why Jia et al. [13] propose a modified template update method to better deploy subspace learning in sparse representation. From Figure 4 it is also observed that the fixed template and full update methods may perform well when the target does not change much. But with the accumulation of tracking errors and when the target undergoes great changes, these two methods tend to perform worse than the other four methods that use incremental template update. Note that in the car sequence, all seven methods fail to track the car at about the 300th frame when the car suddenly turns right. This is because a generative observation model alone may not be enough to perform robust tracking, which is also the reason why we derive two different kinds of observation models for the final algorithm in Section 3.3.
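For reference, the two evaluation metrics used above can be computed as in the sketch below. It is an illustrative reading of them: the center location distance is normalized by the object size as stated in the text, and the overlap ratio is taken to be the usual intersection-over-union of bounding boxes, which is an assumption about its exact definition.

```python
import numpy as np

def center_distance(box_pred, box_gt):
    """Center location distance between (x, y, w, h) boxes, normalized by the object size."""
    cp = np.array([box_pred[0] + box_pred[2] / 2.0, box_pred[1] + box_pred[3] / 2.0])
    cg = np.array([box_gt[0] + box_gt[2] / 2.0, box_gt[1] + box_gt[3] / 2.0])
    size = np.sqrt(box_gt[2] * box_gt[3])          # normalize by the ground-truth object size
    return float(np.linalg.norm(cp - cg) / size)

def overlap_ratio(box_pred, box_gt):
    """Overlap ratio of (x, y, w, h) boxes, read as intersection-over-union."""
    x1 = max(box_pred[0], box_gt[0]); y1 = max(box_pred[1], box_gt[1])
    x2 = min(box_pred[0] + box_pred[2], box_gt[0] + box_gt[2])
    y2 = min(box_pred[1] + box_pred[3], box_gt[1] + box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_pred[2] * box_pred[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0
```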
4.2. Tracking Performance Evaluation

We further evaluate the performance of our final tracking approach on 10 video sequences popularly used in previous works [4, 14, 5, 28, 13, 10], including sylv, bike, david, woman, coke, jumping, and the four sequences used in the first experiment. These ten video sequences together present an even wider range of challenges to a tracking algorithm (see Figure 5).

The tracking results are compared with six state-of-the-art algorithms: the fragment tracker (Frag) [1], the incremental visual tracking (IVT) algorithm [22], the multi-instance learning (MIL) tracker [4], the visual tracking decomposition (VTD) method [14], the latest l1-tracker (l1) [5] and its multi-task tracking (MTT) version [27]. The implementations of these algorithms are all provided by their corresponding authors with the suggested parameter settings. To make a fairer comparison, we set the rotation parameters in the motion models of IVT, l1, MTT and ours to zero, since Frag, MIL and VTD do not rotate the object samples and the ground truths of the test sequences also do not consider the rotation of the target. All seven algorithms are initialized with the same initial bounding box according to the ground truth, with the other parameters set as suggested by their corresponding authors.

Table 2. Average tracking errors (in pixels). The best and second best results are respectively shown in red and blue colors.
Sequence   Frag    IVT     MIL     VTD     l1      MTT     Ours
sylv       0.245   0.875   0.56    0.220   0.96    0.260   0.39
bike       2.09    0.075   0.083   0.086   0.082   0.070   0.054
car        1.436   0.062   0.848   0.065   0.378   0.403   0.6
david      0.946   0.057   0.94    0.35    0.20    1.03    0.0
woman      1.302   1.590   1.35    1.26    1.305   2.28    0.35
animal     0.934   0.0     0.82    0.056   0.059   0.047   0.047
coke       1.247   0.894   0.38    0.759   0.954   0.338   0.78
shaking    0.704   1.005   0.222   0.279   1.286   0.336   0.6
jumping    0.69    0.094   0.245   1.2     1.47    0.666   0.08
faceocc2   0.37    0.0     0.252   0.7     0.49    0.097   0.3
Average    0.923   0.485   0.39    0.48    0.680   0.560   0.8

Table 3. Average tracking precision. The best and second best results are respectively shown in red and blue colors.
Sequence   Frag    IVT     MIL     VTD     l1      MTT     Ours
sylv       0.67    0.450   0.75    0.80    0.323   0.770   0.833
bike       0.36    0.983   0.97    1.000   0.908   1.000   1.000
car        0.097   1.000   0.02    0.972   0.682   0.687   0.78
david      0.089   0.905   0.537   0.65    0.435   0.320   0.779
woman      0.256   0.204   0.209   0.309   0.25    0.98    0.440
animal     0.099   0.887   0.747   0.972   0.972   1.000   1.000
coke       0.05    0.9     0.27    0.068   0.085   0.559   0.678
shaking    0.222   0.025   0.44    0.784   0.0     0.099   0.578
jumping    0.690   0.959   0.233   0.230   0.8     0.98    0.984
faceocc2   0.767   0.772   0.537   0.743   0.49    0.929   0.826
Average    0.302   0.630   0.472   0.650   0.47    0.576   0.790

We analyze the experimental results both quantitatively and qualitatively. Tables 2 and 3 list the average tracking errors and precisions for all seven algorithms. The proposed tracking approach, on the whole, performs well against the other six algorithms, especially on the sequences sylv, woman, animal, coke, and jumping, on which some of the other algorithms fail to follow the targets but ours can successfully track them until the end of the sequence. IVT, VTD and MTT also perform well on these ten sequences and can track the targets in most situations. Together with the results of the first experiment, it can be concluded that the incremental subspace learning method in IVT is more suited to modeling the tracking confidence directly using the reconstruction error over all the orthorhombic templates, rather than over a few templates, which may lose too much information. The templates learned using our dictionary learning method, on the contrary, can adapt well to the tracking data, especially in the sparse coding based tracking framework. In Figure 5, some example tracking results are drawn to give a more vivid comparison. Due to the page limitation, we provide more experimental results in the supplementary material.

4.3. Speed Analysis and Discussions

Our tracking algorithm runs at about 2.5 fps in the current MATLAB implementation without using optimization technologies like parallel computing or GPU acceleration. Table 4 lists the speed of several popular l1-trackers, obtained by running the codes provided by the corresponding authors on our test platform.
For the first two l1-trackers in Table 4, whose implementations are not publicly available, we calculate their speeds based on the report in [5]. It can be observed that our algorithm is faster than most of the other trackers. The main reason is that, thanks to the design of the observation model, our approach does not need to add the trivial templates adopted in [19, 20, 5, 27]. Therefore, although we use multi-lifespan dictionaries, the total number of templates is greatly reduced, e.g., from 1084 (60 + 32 × 32) to 60. What is more, currently our learning for template update is performed frame-by-frame for ease of implementation, which may not be necessary for tracking. It could be performed only every several frames, e.g., every five frames as in [28] and [13]. With that implementation, the speed of the learning procedure can be further improved.

Table 4. Running speed comparison of several popular l1-trackers.
Algorithm   [19]      [20]       [5]      [27]    [28]      [13]      Ours
Speed       0.0 fps   0.05 fps   1 fps    2 fps   2.5 fps   2.5 fps   2.5 fps

5. Conclusions and Future Work

We study the template update problem in the sparse coding based object tracking framework. We formulate the template update problem as online dictionary learning, which makes the templates better adapt to the tracking data. We propose to learn a multi-lifespan dictionary to simultaneously ensure the adaptability and robustness of the tracker. The online learned multi-lifespan dictionary has been deployed in the Bayesian sequential estimation framework using a particle filter to perform tracking. Extensive experiments on challenging image sequences demonstrate the effectiveness of the proposed method. Currently, the lifespan dictionaries are only learned from global object templates; in our future work, we plan to add a local image patch based dictionary to further improve the tracking performance.

Acknowledgement: This work is partly supported by NSFC (Grant No. 60935002), the National 863 High-Tech R&D Program of China (Grant No. 202AA02504), the Natural Science Foundation of Beijing (Grant No. 42003), the Guangdong Natural Science Foundation (Grant No. S202020008), and the Singapore Ministry of Education (Grant No. MOE2010-T2-1-087).

Figure 5. Example tracking results of the seven different algorithms (Frag, IVT, MIL, VTD, l1, MTT, Ours) on the ten test sequences: (a) sylv, in-plane and out-of-plane rotations; (b) bike, background cluttering and fast motions; (c) car, background cluttering and illumination changes; (d) david, scale variations and viewpoint changes; (e) woman, occlusions and viewpoint changes; (f) animal, fast motions and blurs; (g) coke, complex backgrounds and rotations; (h) shaking, dynamic illumination changes and scale variations; (i) jumping, fast motions and blurs; (j) faceocc2, occlusions and rotations. Our tracking algorithm performs well against the six state-of-the-art tracking algorithms, and works robustly and adaptively under a lot of difficult situations, like complex backgrounds ((c) and (g)), illumination changes ((c) and (h)), rotations ((a) and (g)), fast motions ((b) and (c)), scale variations ((d) and (h)), viewpoint changes ((d) and (e)), and occlusions ((e) and (j)). Best viewed in the original color PDF file.

References

[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In CVPR, 2006.
[2] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. TSP, 54(11):4311-4322, 2006.
[3] S. Avidan. Ensemble tracking. TPAMI, 29(2):261-271, 2007.
[4] B. Babenko, M. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. TPAMI, 33(8):1619-1632, 2011.
[5] C. Bao, Y. Wu, H. Ling, and H. Ji. Real time robust l1 tracker using accelerated proximal gradient approach. In CVPR, 2012.
[6] M. Black and A. Jepson. Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. IJCV, 25(1):63-84, 1998.
[7] D. Comaniciu and P. Meer. Kernel-based object tracking. TPAMI, 25(5):564-577, 2003.
[8] K. Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Comput., 15(2):349-396, 2003.
[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407-499, 2004.
[10] M. Godec, P. Roth, and H. Bischof. Hough-based tracking of non-rigid objects. CVIU, 117(10):1245-1256, 2013.
[11] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In BMVC, 2006.
[12] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. IJCV, 29(1):5-28, 1998.
[13] X. Jia, H. Lu, and M. Yang. Visual tracking via adaptive structural local sparse appearance model. In CVPR, 2012.
[14] J. Kwon and K. Lee. Visual tracking decomposition. In CVPR, 2010.
[15] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans. TPAMI, 30(10):1728-1740, 2008.
[16] L. Ma, C. Wang, B. Xiao, and W. Zhou. Sparse representation for face recognition based on discriminative low-rank dictionary learning. In CVPR, 2012.
[17] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. TPAMI, 34(4):791-804, 2012.
[18] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11:19-60, 2010.
[19] X. Mei and H. Ling. Robust visual tracking using l1 minimization. In ICCV, 2009.
[20] X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai. Minimum error bounded efficient l1 tracker with occlusion detection. In CVPR, 2011.
[21] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes: The Art of Scientific Computing (3rd Edition). Cambridge University Press, New York, 2007.
[22] D. Ross, J. Lim, R. Lin, and M. Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125-141, 2008.
[23] I. Tosic and P. Frossard. Dictionary learning. IEEE Signal Process. Mag., 28(2):27-38, 2011.
[24] M. Yaghoobi, T. Blumensath, and M. Davies. Dictionary learning for sparse approximations with the majorization method. TSP, 57(6):2178-2191, 2009.
[25] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), Article 13, 2006.
[26] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Low-rank sparse learning for robust visual tracking. In ECCV, 2012.
[27] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via multi-task sparse learning. In CVPR, 2012.
[28] W. Zhong, H. Lu, and M. Yang. Robust object tracking via sparsity-based collaborative model. In CVPR, 2012.