A Fast Bandit Algorithm for Recommendations to Users with Heterogeneous Tastes


Pushmeet Kohli and Mahyar Salek
Microsoft Research, Cambridge, United Kingdom
{pkohli, mahyar}@microsoft.com

Greg Stoddard
Northwestern University, Evanston, Illinois, USA
gregs@u.northwestern.edu

Abstract

We study recommendation in scenarios where there is no prior information about the quality of content in the system. We present an online algorithm that continually optimizes recommendation relevance based on the behavior of past users. Our method trades weaker theoretical guarantees in asymptotic performance than the state-of-the-art for stronger theoretical guarantees in the online setting. We test our algorithm on real-world data collected from previous recommender systems and show that our algorithm learns faster than existing methods and performs equally well in the long run.

1 Introduction

The market for online content consumption and the rate at which content is produced have experienced immense growth over the past few years. New content is generated on a daily or even hourly basis, creating an incredibly fast turn-over time for relevant content. While traditional search and recommendation engines have the ability to discover quality content in an offline manner, services such as news aggregators need to constantly adjust their recommendations to cater to current hot topics. For example, articles about the U.S. presidential inauguration may be quite popular on January 21st, the day of the inauguration, but they are likely to fall out of favor by the morning of the 22nd. In the face of such rapid changes in relevance, online algorithms which continually optimize recommendations based on user usage data provide an attractive solution.

We propose a simple online recommendation algorithm which learns quickly from user click data to minimize abandonment, the event that a user does not click on any articles in the recommended set (also known as %no in the information retrieval community). Our algorithm operates with minimal assumptions and no knowledge of features of users or articles, and thus is well suited to address the changing environments induced by frequent turn-over in the set of potential articles and shifts in user preferences. We focus on content such as news articles, jokes, or movies, where users have varying tastes but there is no notion of a single correct recommendation.

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recommending relevant content is a key challenge for search engines and recommendation systems and has been extensively studied in the information retrieval community. The early guiding principle in the IR literature was the probability ranking principle (PRP) (Robertson 1977), stating that articles should be ranked in decreasing order of relevance probability. (Chen and Karger 2006) noted that optimizing with the PRP in mind may yield sub-optimal outcomes, particularly when the objective is minimizing abandonment. In recent years, the concept of diversity in recommended sets of content has emerged as a guiding principle which better serves goals such as abandonment minimization. The intuitive goal behind a diverse set of content is to use each article in the set to satisfy a different type of user. This approach is particularly applicable to the canonical problem of handling a variety of user intents; when a user searches for a term such as "jaguar", their intended meaning could be the car, the animal, the American football team, or a number of different meanings. This paper compares the PRP and the diversity principle from an online algorithm perspective.
We compare our online algorithm, which is implicitly based on the PRP, with the Ranked Bandit Algorithm (RBA) of (Radlinski, Kleinberg, and Joachims 2008), which is based on the diversity principle. While the diversity principle yields superior offline performance, our approach has stronger theoretical guarantees in the online case. Our empirical work focuses on a fundamentally different sort of user preference than the previous diversity work. Instead of intent, we cater to a heterogeneity of user tastes, i.e., does the user find this joke funny, or will the user like this news article. Surprisingly, we find that explicitly incorporating diversity in this setting does not yield a large gain; the offline PRP-based solution gives nearly the same performance as the offline diversity-based solution.

At the heart of our method is the use of a stochastic multi-armed bandit algorithm to control the trade-off between exploration and exploitation of articles. A multi-armed bandit problem is an abstract game where a player is in a room with many different slot machines (slot machines are sometimes called one-armed bandits), with no prior knowledge of the payoffs of any of the machines. His goal is to maximize his total reward from the slot machines, and in doing so he must explore machines to test which machine has the highest average payoff, but also exploit those he knows to have high rewards.

Similar to (Radlinski, Kleinberg, and Joachims 2008), the primary contribution of our algorithm is the method by which we combine instances of several MAB algorithms to efficiently approximate this combinatorial set recommendation problem.

1.1 Our Contributions

We present an online algorithm for the minimization of abandonment. Our method uses several instances of a multi-armed bandit algorithm working (almost) independently to recommend a set of articles. Although the independence between bandit instances carries all the drawbacks of the PRP, we use a stochastic optimization concept known as the correlation gap (Agrawal et al. 2010) to prove that our algorithm has near-optimal performance in the online setting. Furthermore, the independence between bandit instances allows for a faster learning rate than online algorithms based on the diversity principle. Our second contribution is an empirical study of bandit-based recommendation algorithms on real-world datasets collected from previous recommendation algorithm research. We find that while in theory the diversity-based solutions yield superior offline solutions, in practice there are only small differences between the offline diversity-based solution and the offline PRP-based solution. We also empirically verify that the learning rate of our method is faster than that of existing methods.

2 Previous Work

Previous work in information retrieval and machine learning has addressed recommendation to heterogeneous populations via the goal of maximizing diversity in search results, but the literature varies widely in modeling assumptions. In some work, diversity refers to increasing the set of topics that a recommended set of articles or search results may cover ((Agrawal et al. 2009) and (Panigrahi et al. 2012)). Other works assume users intrinsically value diversity; (Raman, Shivaswamy, and Joachims 2012) and (Yue and Guestrin 2011) both assume a rich feature model and use online learning techniques to learn user utility functions. (Li et al. 2010) give an online approach for news recommendation using a user's profile as a feature vector. (Chen and Karger 2006) prove in a general sense that the standard submodular greedy algorithm is the optimal way to incorporate diversity into search result rankings. By contrast, our work makes few assumptions. In this sense, our work is closer to the literature on online stochastic submodular maximization, particularly in the bandit setting. (Calinescu et al. 2011) prove that a continuous version of the standard submodular greedy algorithm yields an optimal approximation for all matroid constraints, and (Streeter, Golovin, and Krause 2009) give a similar method (though less general) which can be extended to the online setting. The work most closely related to ours is (Radlinski, Kleinberg, and Joachims 2008) and (Slivkins, Radlinski, and Gollapudi 2010), although the latter work makes strong use of a similarity measure between documents whereas we assume no such construct. Their ranked bandit algorithm serves as our baseline in this paper, and we discuss the relationships between our methods in later sections.

3 Problem Formalization

We consider the problem of minimizing abandonment for an article recommendation system. At the beginning of the day, n articles are submitted to the system. When a user visits our site, they are presented with a set of k articles; if the user finds any of the articles relevant, he clicks on it and we receive a payoff of 1. If the user finds no articles relevant, we receive a payoff of 0. We receive no additional payoff if the user clicks on more than one article. Each user j can be represented by a {0,1}^n vector X^j, where X^j_i = 1 indicates that user j finds article i relevant.
These relevance vectors X^j are distributed according to some unknown distribution D. These relevance vectors can be thought of as representing the type of a user. This type structure allows for a large degree of correlation between article relevances. At each time period t, a random user arrives, corresponding to choosing a vector X^t i.i.d. from D, and we present a set of k articles S^t. Let F(S^t, X^t) denote the payoff for showing set S^t to a user with relevance vector X^t. We will refer to F as the set relevance function; it has the following form:

F(S, X) = { 1 if X_i = 1 for some i ∈ S
          { 0 otherwise                     (1)

The user's relevance vector X^t is not observed before the set is chosen. Thus the value of displaying a set S is the expected value E[F(S, X)], where the expectation is taken over the realization of the relevance vector X from the distribution D. When it is clear, we will write E[F(S)] as shorthand for E[F(S, X)]. In words, E[F(S)] is the fraction of users who will be satisfied by at least one article in S. The problem of minimizing abandonment is equivalent to the problem of maximizing E[F(S)] subject to |S| ≤ k. For the remainder of this paper, we will focus on maximizing the expected set relevance E[F(S)].

Before turning to the online version of this problem, we consider optimization in the offline setting. In the offline setting, an algorithm would have access to the distribution, but even with such assumptions the problem is NP-hard.[1] Despite this intractability, we can take advantage of the structure of F(S), namely that it is submodular, and use the greedy algorithm of (Nemhauser, Wolsey, and Fisher 1978). This yields a (1 - 1/e) approximation, which is the best possible approximation under complexity-theoretic assumptions. (Chen and Karger 2006) argue this greedy approach yields an optimally diverse set of articles.

A set function G is said to be submodular if, for all elements a and sets S, T such that S ⊆ T, G(S ∪ {a}) - G(S) ≥ G(T ∪ {a}) - G(T). The set relevance function F(S, X), as defined in equation 1, is submodular, and this property forms the theoretical basis for the online approaches given in the next section.

[1] This can be shown by a standard reduction from the max coverage problem. See (Radlinski, Kleinberg, and Joachims 2008) for details.
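To make the offline greedy benchmark concrete, the following minimal sketch (our illustration, not the authors' code) runs the greedy algorithm of (Nemhauser, Wolsey, and Fisher 1978) on an empirical sample of relevance vectors; the matrix layout and the names coverage and greedy_set are assumptions of this sketch.

import numpy as np

def coverage(X, S):
    """Empirical E[F(S)]: the fraction of sampled users with at least
    one relevant article in S. X is a (num_users, n) 0/1 relevance matrix."""
    if not S:
        return 0.0
    return float(np.mean(X[:, S].max(axis=1)))

def greedy_set(X, k):
    """Greedily add the article with the largest marginal coverage gain.
    Submodularity of F gives the (1 - 1/e) approximation guarantee."""
    n = X.shape[1]
    S = []
    for _ in range(k):
        best = max((i for i in range(n) if i not in S),
                   key=lambda i: coverage(X, S + [i]))
        S.append(best)
    return S

Each greedy step evaluates O(n) candidate sets over the whole sample; caching which users are already covered would speed this up, but the direct form matches the definition more closely.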

4 The Online Problem

We now turn to the online version of this problem, which presents a classic explore-exploit tradeoff: we must balance the need to learn the average relevance of articles with no feedback against the need to exploit the good articles that we have already discovered. We solve this problem using theoretical results from the multi-armed bandit (MAB) literature, a class of algorithms which solve exploration-exploitation problems. Bandit problems can be distinguished by the assumptions made on the rewards. In the stochastic bandit problem, rewards for each option are drawn from a stationary distribution, while in the adversarial setting, payoffs for each option are determined by an adversary who has knowledge of past play, the history of rewards, and the strategy that the player is using. The objective of an online algorithm is the minimization of regret, where the regret of an algorithm is defined as the expected difference between the accumulated rewards of the single best option and the rewards accumulated by that algorithm. In our context, this is the difference between the fraction of users satisfied by the optimal set of articles and the fraction of users satisfied by the recommendation algorithm. However, as we noted in the previous section, maximization of E[F(S)] is intractable, so we follow the approach of (Streeter, Golovin, and Krause 2009) and (Radlinski, Kleinberg, and Joachims 2008) and use (1 - 1/e) OPT as the offline benchmark. The regret after time T is defined as

R(T) = (1 - 1/e) Σ_{t=0}^{T} E[F(S*)] - Σ_{t=0}^{T} E[F(S^t)]

where S* is the optimal set of k articles and S^t is the set chosen at time t. There are known bandit algorithms which achieve provably minimal regret (up to constant factors), but direct application of these bandit algorithms requires exploring all possible options at least once. In our setting, each subset of articles is a potential option, and hence there are exponentially many options, making standard MAB algorithms impractical. In the next section we present two approaches, one from previous work and our own algorithm, for combining several instances of a bandit algorithm to yield a low-regret and computationally efficient solution to this recommendation problem.

4.1 Ranked Bandit Approach

The work of (Radlinski, Kleinberg, and Joachims 2008) and (Streeter, Golovin, and Krause 2009) introduced the ranked bandit algorithm to solve the problem of minimizing abandonment. The pseudocode is given in Algorithm 1. The idea behind the ranked bandit algorithm is to use k instances of a MAB algorithm to learn the greedy-optimal solution (which is also the diversity-optimal solution). Specifically, k instances of a bandit algorithm are created, where bandit i is responsible for selecting the article to be displayed in slot i. The algorithm is designed such that the bandit in slot i attempts to maximize the marginal gain of the article in slot i. In the context of minimizing abandonment, bandit i attempts to maximize the click-through rate of the article in slot i given that the user has not clicked on any earlier article.

Algorithm 1 Ranked Bandit
1: MAB_i: bandit algorithm for slot i
2: for t = 1...T do
3:   s^t_i <- selectarticle(MAB_i, N)
4:   S^t <- ∪_i {s^t_i}
5:   Display S^t to the user, receive feedback vector X^t
6:   Feedback: z_i = 1 if article s^t_i was the first click, 0 otherwise
7:   update(MAB_i, z_i)
8: end for

While RBA works with any bandit algorithm, the regret of RBA depends on the choice of bandit algorithm. (Radlinski, Kleinberg, and Joachims 2008) use an adversarial bandit algorithm known as EXP3 in their work and show that RBA inherits the regret bounds guaranteed by EXP3. However, the adversarial assumption is overly pessimistic in this problem, and ideally we could make use of the stochastic nature of user behavior.
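To make the feedback rule of Algorithm 1 concrete, here is a sketch of one RBA round in Python (our illustration; the select/update bandit interface is an assumed abstraction, not the authors' API).

def ranked_bandit_step(bandits, n_articles, user_relevance):
    """One round of the Ranked Bandit Algorithm (Algorithm 1).
    bandits: k objects exposing select(candidates) and update(arm, reward).
    user_relevance: the 0/1 vector X^t of the arriving user."""
    slots = [mab.select(range(n_articles)) for mab in bandits]
    # Only the slot holding the FIRST clicked article is rewarded, so
    # slot i learns the marginal value of its article given slots 1..i-1.
    first_click = next((i for i, a in enumerate(slots)
                        if user_relevance[a] == 1), None)
    for i, mab in enumerate(bandits):
        mab.update(slots[i], 1.0 if i == first_click else 0.0)
    return slots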
Stochastic bandit algorithms such as UCB1 have better theoretical and practical performance, but the dependence between slots in RBA violates the necessary independence assumptions for the stochastic setting. In their work, (Radlinski, Kleinberg, and Joachims 2008) show RBA to have regret on the order of O(k √(T n lg n)). Our approach, discussed in the next section, is able to leverage the stochastic nature of the problem without complication and thus achieves a provable regret of O(kn lg T).

In addition to this lack of theoretical guarantees, the learning rate of RBA can be quite slow because of wrong feedback. The correct value of an article in slot i+1 is the marginal value of that article given that slots 1 to i are displaying the correct articles, that is, the first i articles in the greedy solution. In any time period where those articles are not displayed, the marginal value of any article in slot i+1 will not necessarily be correct. Although early slots should display the correct articles most of the time, later slots cannot begin learning correctly until the earlier slots converge. This effectively induces sequential learning across slots, and back-of-the-envelope calculations suggest that correct learning will only begin in slot k+1 after Ω(n^k) time steps have passed.

4.2 Independent Bandit Approach

In this section we describe our method, which we call the independent bandit algorithm (IBA); it is implicitly based on the probability ranking principle. Rather than learning marginal values as in the ranked bandit algorithm, the independent bandit algorithm optimizes the click-through rate of each slot independently of the other slots. Using tools from stochastic optimization theory, we prove that the independent bandit algorithm has near-optimal regret, and our simulations demonstrate that IBA converges to its offline-optimal solution much more quickly than RBA. The pseudocode for the independent bandit algorithm is given in Algorithm 2. Line 5 ensures that the bandit algorithms do not select the same articles, by temporarily removing articles already displayed from the set of potential articles for bandits in later slots.

The main difference between the independent and the ranked bandit algorithms is the feedback: IBA gives a reward of 1 to any article that was clicked on, while RBA only gives a reward of 1 to the first article that was clicked on. This independence between bandit instances in IBA allows learning to happen in parallel, enabling a faster rate of learning for IBA.

Algorithm 2 Independent Bandit
1: MAB_i: bandit algorithm for slot i
2: for t = 1...T do
3:   S^t_0 <- ∅
4:   for i = 1...k do
5:     s^t_i <- selectarticle(MAB_i, N \ S^t_{i-1}); S^t_i <- S^t_{i-1} ∪ {s^t_i}
6:   end for
7:   Display S^t to the user, receive feedback vector X^t
8:   Feedback: z_i = 1 if article s^t_i was clicked on, 0 otherwise
9:   update(MAB_i, z_i)
10: end for

To analyze the regret of IBA, we must first derive an approximation guarantee for what the offline version of the independent algorithm would compute. The independent-optimal solution consists of the k articles with the highest click-through rates. If article relevances were all independent, then the independent-optimal solution would be the optimal solution; however, the independent-optimal solution will be sub-optimal when there are correlations between article relevances. We use the correlation gap result of (Agrawal et al. 2010) to show that the independent-optimal solution yields a (1 - 1/e) approximation to the optimal solution for any distribution over user relevance vectors.

The correlation gap is a concept in stochastic optimization which quantifies the loss incurred by optimizing under the assumption that all random variables are independent. Formally, let G(S, X) be some function where S is the decision variable and X is a vector of {0,1} random variables, where X is drawn from some arbitrary distribution D. Let D_I be the product distribution obtained if each X_i were an independent Bernoulli variable with probability equal to its marginal probability under D. When G is a nondecreasing, submodular function, the correlation gap is quite small.

Theorem (Agrawal et al. 2010). Let G be a nondecreasing, submodular function. Let S* and S*_I be the optimizers of E_D[G(S, X)] and E_{D_I}[G(S, X)], respectively. Then E_D[G(S*_I, X)] ≥ (1 - 1/e) E_D[G(S*, X)].

Now we consider the independent bandit algorithm. The key property of IBA is that individual bandit instances do not affect each other, and this allows us to prove that IBA inherits the low regret of the underlying stochastic bandit algorithms, yielding better regret bounds than RBA. For the purposes of the next theorem, we use the UCB1 algorithm (details are given in section 5), which has regret O(n lg T).

Theorem 1. When UCB1 is used as the bandit algorithm for IBA, the accumulated rewards satisfy

E[Σ_{t=1}^{T} F(S^t, X^t)] ≥ (1 - 1/e) OPT - O(kn lg T)

Proof. The high-level plan is to first show that IBA has low regret when compared with the independent-optimal set. We then apply the correlation gap of (Agrawal et al. 2010) to conclude that the regret is close to (1 - 1/e) OPT. For a given document displayed in slot i, let p_i denote its marginal probability of relevance, that is, p_i = E_X[X_i]. Assume for now that all X_i are independent. Using this independence assumption, for a given set S^t we can write the expected value of F(S^t, X^t) as follows:

E[F(S^t)] = Σ_{i=1}^{k} ( Π_{j=1}^{i-1} (1 - p^t_j) ) p^t_i    (2)

(note that this equation gives the same value for any permutation of S^t). Let S_I denote the set which maximizes the above function under the assumption that all X_i are independent. Trivially, this set consists of the k articles with the largest p_i. Label these elements p*_i for i = 1...k. At a given time t, let S^t denote the set played and let S^t_i represent the i-th element of this set.
Define δ^t_i = p*_i - p^t_i, that is, the difference between the relevance probability of the best article and the relevance probability of the article actually played at time t. Then

E[F(S^t)] = Σ_{i=1}^{k} Π_{j=1}^{i-1} (1 - (p*_j - δ^t_j)) (p*_i - δ^t_i)
          ≥ Σ_{i=1}^{k} Π_{j=1}^{i-1} (1 - p*_j) p*_i - Σ_i δ^t_i
          = E[F(S_I)] - Σ_i δ^t_i.

Now taking the sum of F(S^t, X^t) over time yields

E[Σ_t F(S^t, X^t)] ≥ T · E[F(S_I)] - Σ_i Σ_t δ^t_i.

The term Σ_t δ^t_i is the regret incurred in slot i. (Auer, Cesa-Bianchi, and Fischer 2002) prove that the regret of UCB1 is bounded by O(n lg T), so Σ_t δ^t_i ≤ O(n lg T) for each slot. In the above analysis, we assumed that the probability of an article being relevant was independent of each other X_i, which is usually a faulty assumption. However, the work of (Agrawal et al. 2010) shows that optimizing under the independence assumption yields a provable approximation. Let S* denote the set which maximizes E[F(S, X)]. Then the correlation gap implies E[F(S_I)] ≥ (1 - 1/e) E[F(S*)]. Combining this with the above regret bound yields the result:

E[Σ_t F(S^t, X^t)] ≥ (1 - 1/e) OPT - O(kn lg T).
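Read next to the RBA sketch above, one round of Algorithm 2 differs only in the candidate filtering of line 5 and in the feedback rule; this is again our sketch under the same assumed select/update interface.

def independent_bandit_step(bandits, n_articles, user_relevance):
    """One round of the Independent Bandit Algorithm (Algorithm 2)."""
    slots = []
    for mab in bandits:
        # Line 5: later slots may not re-select articles already shown.
        candidates = [a for a in range(n_articles) if a not in slots]
        slots.append(mab.select(candidates))
    # Unlike RBA, EVERY clicked article earns its slot a reward of 1, so
    # each slot independently estimates its article's click-through rate.
    for i, mab in enumerate(bandits):
        mab.update(slots[i], float(user_relevance[slots[i]]))
    return slots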

Figure 1: Movie-Lens-100 dataset with relevance threshold θ = 2, the low threshold. The Ranked-ɛGreedy method starts performing better after t = 10000 but fails to achieve the theoretical optimum performance within 100000 time steps. The Independent-ɛGreedy algorithm achieves its offline optimum after 50000 time steps.

It is worth noting that the independent-optimal solution is (weakly) worse than the greedy-optimal solution, so RBA will asymptotically outperform IBA. However, the previous theorem shows that IBA has the same worst-case guarantee, along with a better regret bound that holds uniformly throughout time. In the next section, we simulate both algorithms on real-world datasets and show that the asymptotic performances of the two methods are essentially equal, but IBA performs better in the short term.

5 Experimental Results

In this section, we give the results of the experiments we used to test the empirical difference in performance between the ranked bandit algorithm and the independent bandit algorithm.

Datasets. We used two publicly available datasets as our input for user preferences. Our first dataset is from the Jester project (Goldberg et al. 2001) and is a collection of user ratings on jokes, ranging from -10.0 (very not funny) to 10.0 (very funny). Our second dataset comes from the MovieLens project (movielens.umn.edu) and consists of user ratings assigned to movies, where each rating is from 1 (bad) to 5 (very good). Each dataset consists of a collection of <userid, articleid, rating> tuples denoting the rating that the user gave to this article (either a joke or a movie). From the Jester data we used two separate datasets. Jester-Small consists of 25000 users' ratings on 10 articles, where each user has rated most articles in the set. Jester-Large consists of 25000 users' ratings on 100 articles, but there are many unrated articles for each user. In the case where a user did not rate an article, we assign that article the lowest score. Movie-Lens-100 consists of ratings by 943 users on a sub-sampled set of 100 articles from the MovieLens dataset. For all datasets, we convert real-valued ratings to binary relevant-or-not scores by using a threshold rule: if the rating assigned by a user to an article exceeds a threshold θ, then that article is deemed relevant to that user. For each dataset, we tested a high and a low threshold for relevance.[2]

Figure 2: Movie-Lens-100 dataset with relevance threshold θ = 4, the high threshold. The Independent-ɛGreedy method performs the best out of all four methods.

The data we use is of a fundamentally different nature than that generated by (Radlinski, Kleinberg, and Joachims 2008). In that work, they model user intent, i.e., is a user that searches for the term "jaguar" talking about the car, the animal, or some other meaning? In our work, we care about user taste, i.e., which joke or movie will a user like? In the case of intent, there is generally a correct answer, and a single article rarely satisfies multiple types of users. In the case of taste, there is rarely a single correct answer, and a single article may satisfy many different types of users.

Baselines. In our experiments, we used two well-known stochastic multi-armed bandit algorithms to test the Ranked Bandit Algorithm and the Independent Bandit Algorithm. Both algorithms, UCB1 and ɛ-greedy, are examined in detail in (Auer, Cesa-Bianchi, and Fischer 2002), but we briefly review them here. In each time step t, UCB1 plays the option which maximizes x̄_i + √(2 lg t / t_i), where x̄_i denotes the current average reward of option i and t_i denotes the number of times that option i has been played so far. The second term in this expression naturally induces exploration, since it grows for options that have not been played in a while. The second MAB algorithm is the ɛ-greedy algorithm.
At each time t, with probability ɛ a uniformly random arm is played, and with probability 1 - ɛ the option with the current highest average reward is played. Note that this algorithm requires the ɛ parameter to be tuned; for these experiments, we set ɛ = 0.05, which proved to give the best average performance during initial tests.

[2] We only show the results for a few of the datasets due to space constraints. These datasets are representative of the qualitative results from the entire set of experiments.
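For completeness, a minimal implementation of the two baselines, compatible with the RBA and IBA sketches above, might look as follows (our sketch; the class names are illustrative and ɛ defaults to the tuned value of 0.05).

import math
import random

class UCB1:
    """Play the arm maximizing x̄_i + sqrt(2 lg t / t_i)
    (Auer, Cesa-Bianchi, and Fischer 2002)."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # t_i: times arm i has been played
        self.means = [0.0] * n_arms   # x̄_i: average reward of arm i
        self.t = 0
    def select(self, candidates):
        candidates = list(candidates)
        self.t += 1
        untried = [a for a in candidates if self.counts[a] == 0]
        if untried:                   # play every arm once before indexing
            return untried[0]
        return max(candidates, key=lambda a: self.means[a] +
                   math.sqrt(2 * math.log(self.t) / self.counts[a]))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class EpsilonGreedy(UCB1):
    """With probability ɛ play a uniformly random arm; otherwise play the
    arm with the highest current average reward."""
    def __init__(self, n_arms, epsilon=0.05):
        super().__init__(n_arms)
        self.epsilon = epsilon
    def select(self, candidates):
        candidates = list(candidates)
        if random.random() < self.epsilon:
            return random.choice(candidates)
        return max(candidates, key=lambda a: self.means[a])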

Figure 3: Jester-Large dataset with relevance threshold θ = 7, the high threshold. Ranked-ɛGreedy and Independent-ɛGreedy perform similarly; Independent-ɛGreedy performs better until t = 20000, but both remain very close, and well below the offline greedy optimum, for all 100000 time steps.

Our experiment consists of the following steps: at each time t, we draw a random user from the dataset and the algorithm recommends a set of k = 5 articles to display. We assume that the user clicks on any relevant articles displayed. If the user clicks on any articles, we get a payoff of 1, and a payoff of 0 otherwise. Each experiment consists of T = 100000 time steps, and we average our results over 200 repetitions of each experiment. Performance of each algorithm was measured by the percent of sets that contained at least one article relevant to the user. We show datapoints at 1000-time-step increments, and each datapoint shown is the average set relevance over the last 1000 time steps.

Key Results. The results of our experiments are displayed in figures 1, 2, 3, and 4. Each plot shows the performance of the online algorithms as well as the offline benchmarks. The performance of Independent-ɛGreedy and Independent-UCB was roughly the same in all cases, so we omit the results for Independent-UCB for clarity.

Our most surprising finding is the closeness of the greedy-optimal and the independent-optimal solutions. The largest difference between the two solutions, shown in figure 1, is 4%; if we displayed the greedy-optimal set of articles, approximately 92% of users would find at least one relevant article, while if we displayed the independent-optimal set, then 88% of users would find at least one relevant article. This finding suggests that in settings where a recommendation algorithm is catering to tastes (as opposed to intents), explicit consideration of diversity may not be necessary, since the independent-optimal solution yields similar results to the greedy-optimal solution.

Figure 4: Jester-Small dataset with relevance threshold θ = 3.5, the low threshold. In this case, the offline greedy and offline independent solutions were the exact same set.

Our second finding, which goes hand in hand with the previous one, is the favorable performance of the Independent Bandit Algorithm versus the performance of the Ranked Bandit Algorithm. In half of our experiments, either Independent-ɛGreedy or Independent-UCB performs strictly better than Ranked-ɛGreedy. In the experiments shown in figures 1 and 3, Ranked-ɛGreedy performs better than the independent solutions, but only begins to perform better after 10000 or 20000 time steps. The faster learning rate of IBA compared to RBA demonstrates a key feature of IBA: the independence between bandit instances in different slots allows learning to happen in parallel, as opposed to the de facto sequential learning in RBA. This parallel learning allows for a quicker convergence to the independent-optimal solution. In all cases, the Ranked-ɛGreedy algorithm does not converge to the value of the greedy-optimal solution within 100000 time steps.

Lastly, our experiments demonstrate a stark difference between the performance of Ranked-ɛGreedy and Ranked-UCB. As we noted at the end of section 4.1, learning for later slots in RBA is hindered by exploration in early slots. This effect is especially pronounced in the UCB1 algorithm when there are multiple articles that have high average rewards. The relatively low exploration rate of the ɛ-greedy algorithm allows for faster convergence in earlier slots and hence a faster learning rate for later slots. In RBA, low exploration raises the risk of playing a sub-optimal article in an earlier slot, but the gain from the faster learning rate outweighs that potential loss.
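A driver for the protocol just described might look as follows; this is our reconstruction under stated assumptions (strict thresholding, uniform user draws), and it reports the overall satisfied fraction rather than the 1000-step rolling averages plotted in the figures.

import numpy as np

def run_experiment(ratings, theta, step_fn, make_bandit, k=5, T=100000, seed=0):
    """ratings: (num_users, n) real-valued matrix; theta: relevance threshold.
    step_fn is ranked_bandit_step or independent_bandit_step from above."""
    rng = np.random.default_rng(seed)
    X = (ratings > theta).astype(int)   # binary relevant-or-not scores
    n = X.shape[1]
    bandits = [make_bandit(n) for _ in range(k)]
    hits = 0
    for _ in range(T):
        user = X[rng.integers(len(X))]  # draw a random user each round
        slots = step_fn(bandits, n, user)
        hits += int(any(user[a] == 1 for a in slots))
    return hits / T                     # fraction of non-abandoned sets

For example, run_experiment(ratings, 4, independent_bandit_step, lambda n: EpsilonGreedy(n)) would roughly correspond to the Independent-ɛGreedy curve in figure 2, modulo the 200-repetition averaging.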
6 Conclusion

We have presented a simple online algorithm for the problem of abandonment minimization in recommendation systems which has near-optimal performance in the online problem. We have demonstrated, theoretically and empirically, that our approach trades off a small loss in offline performance for a faster learning rate and stronger performance in the online setting. In the future, we would like to investigate the extension of these MAB techniques to general submodular utility functions. Additionally, we would like to investigate how to run algorithms such as IBA or RBA when it is only possible to observe user feedback on the set of articles but not on the individual articles within the set.

References

Agrawal, R.; Gollapudi, S.; Halverson, A.; and Ieong, S. 2009. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 5-14. ACM.

Agrawal, S.; Ding, Y.; Saberi, A.; and Ye, Y. 2010. Correlation robust stochastic optimization. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 1087-1096. Society for Industrial and Applied Mathematics.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2):235-256.

Calinescu, G.; Chekuri, C.; Pál, M.; and Vondrák, J. 2011. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing 40(6):1740-1766.

Chen, H., and Karger, D. 2006. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 429-436. ACM.

Goldberg, K.; Roeder, T.; Gupta, D.; and Perkins, C. 2001. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval 4(2):133-151.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661-670. ACM.

Mahajan, D. K.; Rastogi, R.; Tiwari, C.; and Mitra, A. 2012. LogUCB: An explore-exploit algorithm for comments recommendation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 6-15. ACM.

Nemhauser, G.; Wolsey, L.; and Fisher, M. 1978. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming 14(1):265-294.

Panigrahi, D.; Das Sarma, A.; Aggarwal, G.; and Tomkins, A. 2012. Online selection of diverse results. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, 263-272. ACM.

Radlinski, F.; Kleinberg, R.; and Joachims, T. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, 784-791. ACM.

Raman, K.; Shivaswamy, P.; and Joachims, T. 2012. Online learning to diversify from implicit feedback. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, 705-713. New York, NY, USA: ACM.

Robertson, S. E. 1977. The probability ranking principle in IR. Journal of Documentation 33(4):294-304.

Slivkins, A.; Radlinski, F.; and Gollapudi, S. 2010. Learning optimally diverse rankings over large document collections. arXiv preprint arXiv:1005.5197.

Streeter, M.; Golovin, D.; and Krause, A. 2009. Online learning of assignments. In Neural Information Processing Systems (NIPS).

Yue, Y., and Guestrin, C. 2011. Linear submodular bandits and their application to diversified retrieval. In Neural Information Processing Systems (NIPS).