Opponent Modeling in Deep Reinforcement Learning

He He, University of Maryland, College Park, MD 20740 USA (HHE@UMIACS.UMD.EDU)
Jordan Boyd-Graber, University of Colorado, Boulder, CO 80309 USA (JORDAN.BOYD.GRABER@COLORADO.EDU)
Kevin Kwok, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (KKWOK@MIT.EDU)
Hal Daumé III, University of Maryland, College Park, MD 20740 USA (HAL@UMIACS.UMD.EDU)

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Opponent modeling is necessary in multi-agent settings where secondary agents with competing goals also adapt their strategies, yet it remains challenging because strategies interact with each other and change. Most previous work focuses on developing probabilistic models or parameterized strategies for specific applications. Inspired by the recent success of deep reinforcement learning, we present neural-based models that jointly learn a policy and the behavior of opponents. Instead of explicitly predicting the opponent's action, we encode observation of the opponents into a deep Q-Network (DQN); however, we retain explicit modeling (if desired) using multitasking. By using a Mixture-of-Experts architecture, our model automatically discovers different strategy patterns of opponents without extra supervision. We evaluate our models on a simulated soccer game and a popular trivia game, showing superior performance over DQN and its variants.

1. Introduction

An intelligent agent working in strategic settings (e.g., collaborative or competitive tasks) must predict the actions of other agents and reason about their intentions. This is important because all active agents affect the state of the world. For example, a multi-player game AI can exploit suboptimal players if it can predict their bad moves; a negotiating agent can reach an agreement faster if it knows the other party's bottom line; a self-driving car must avoid accidents by predicting where cars and pedestrians are going.

Two critical questions in opponent modeling are what variable(s) to model and how to use the predicted information. However, the answers depend much on the specific application, and most previous work (Billings et al., 1998a; Southey et al., 2005; Ganzfried & Sandholm, 2011) focuses exclusively on poker games, which require substantial domain knowledge. We aim to build a general opponent modeling framework in the reinforcement learning setting, which enables the agent to exploit idiosyncrasies of various opponents. First, to account for changing behavior, we model uncertainty in the opponent's strategy instead of classifying it into a set of stereotypes. Second, domain knowledge is often required when predictions of the opponents are separated from learning the dynamics of the world. Therefore, we jointly learn a policy and model the opponent probabilistically.

We develop a new model, DRON (Deep Reinforcement Opponent Network), based on the recent deep Q-Network of Mnih et al. (2015, DQN) in Section 3. DRON has a policy learning module that predicts Q-values and an opponent learning module that infers opponent strategy.[1] Instead of explicitly predicting opponent properties, DRON learns a hidden representation of the opponents based on past observations and uses it (in addition to the state information) to compute an adaptive response. More specifically, we propose two architectures, one using simple concatenation to combine the two modules and one based on the Mixture-of-Experts network. While we model opponents implicitly, additional supervision (e.g., the action or strategy taken) can be added through multitasking.

[1] Code and data: https://github.com/hhexiy/opponent

Compared to previous models that are specialized for particular applications, DRON is designed with a general purpose and does not require knowledge of possible (parameterized) game strategies.

A second contribution is DQN agents that learn in multi-agent settings. Deep reinforcement learning has shown competitive performance in various tasks: arcade games (Mnih et al., 2015), object recognition (Mnih et al., 2014), and robot navigation (Zhang et al., 2015). However, it has been mostly applied to single-agent decision-theoretic settings with stationary environments. One exception is Tampuu et al. (2015), where two agents controlled by independent DQNs interact under collaborative and competitive rewards. While their focus is the collective behavior of a multi-agent system with known controllers, we study from the viewpoint of a single agent that must learn a reactive policy in a stochastic environment filled with unknown opponents.

We evaluate our method on two tasks in Section 4: a simulated two-player soccer game in a grid world, and a real question-answering game, quiz bowl, against users playing online. Both games have opponents with a mixture of strategies that require different counter-strategies. Our model consistently achieves better results than the DQN baseline. In addition, we show our method is more robust to non-stationary strategies; it successfully identifies the opponent's strategy and responds correspondingly.

2. Deep Q-Learning

Reinforcement learning is commonly used for solving Markov decision processes (MDPs), where an agent interacts with the world and collects rewards. Formally, the agent takes an action a in state s, goes to the next state s' according to the transition probability T(s, a, s') = Pr(s' | s, a), and receives reward r. States and actions are defined by the state space S and the action space A. Rewards r are assigned by a real-valued reward function R(s, a, s'). The agent's behavior is defined by a policy π such that π(a | s) is the probability of taking action a in state s. The goal of reinforcement learning is to find an optimal policy π* that maximizes the expected discounted cumulative reward R = E[ Σ_{t=0}^{T} γ^t r_t ], where γ ∈ [0, 1] is the discount factor and T is the time step when the episode ends.

One approach to solving MDPs is to compute the Q-function: the expected reward starting from state s, taking action a, and following policy π: Q^π(s, a) ≡ E[ Σ_t γ^t r_t | s_0 = s, a_0 = a, π ]. Q-values of an optimal policy solve the Bellman equation (Sutton & Barto, 1998):

Q*(s, a) = Σ_{s'} T(s, a, s') [ r + γ max_{a'} Q*(s', a') ].

Figure 1. Diagram of the DRON architecture. (a) DRON-concat: the opponent representation is concatenated with the state representation. (b) DRON-MoE: Q-values predicted by K experts are combined linearly by weights from the gating network.

Optimal policies always select the action with the highest Q-value for a given state. Q-learning (Watkins & Dayan, 1992; Sutton & Barto, 1998) finds the optimal Q-values without knowledge of T. Given observed transitions (s, a, s', r), Q-values are updated recursively:

Q(s, a) <- Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ].

For complex problems with continuous states, the Q-function cannot be expressed as a lookup table, requiring a continuous approximation. Deep reinforcement learning such as DQN (Mnih et al., 2015), a deep Q-learning method with experience replay, approximates the Q-function using a neural network. It draws samples (s, a, s', r) from a replay memory M, and the neural network predicts Q by minimizing the squared loss at iteration i:

L_i(θ_i) = E_{(s,a,s',r) ~ U(M)} [ ( r + γ max_{a'} Q(s', a'; θ_{i-1}) - Q(s, a; θ_i) )^2 ],

where U(M) is a uniform distribution over the replay memory.
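This loss translates directly into code. Below is a minimal sketch of a DQN-style update, written in PyTorch for illustration only (not the authors' released implementation); the two 50-unit hidden layers echo the soccer baseline described in Section 4, while the replay-buffer layout, the use of a `done` mask, and the separate target network holding the older parameters θ_{i-1} are standard practice and our own assumptions.

```python
# Illustrative DQN update sketch (not the paper's code).
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP Q-function; the soccer baseline in the paper uses two 50-unit hidden layers."""
    def __init__(self, state_dim, n_actions, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):            # s: (batch, state_dim)
        return self.net(s)           # Q(s, .): (batch, n_actions)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.9):
    """One stochastic update on a uniform mini-batch from the replay memory M."""
    s, a, r, s_next, done = batch    # tensors; `done` masks terminal transitions (our assumption)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta_i)
    with torch.no_grad():                                          # target built from theta_{i-1}
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                    # squared loss L_i(theta_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```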
3. Deep Reinforcement Opponent Network

In a multi-agent setting, the environment is affected by the joint action of all agents. From the perspective of one agent, the outcome of an action in a given state is no longer stable, but depends on the actions of other agents. In this section, we first analyze the effect of multiple agents on the Q-learning framework; we then present DRON and its multitasking variation.

Figure 2. Diagram of the DRON with multitasking. The blue part shows that the supervision signal from the opponent affects the Q-learning network by changing the opponent features.

3.1. Q-Learning with Opponents

In MDP terms, the joint action space is defined by A^M = A^1 × A^2 × ... × A^n, where n is the total number of agents. We use a to denote the action of the agent we control (the primary agent) and o to denote the joint action of all other agents (secondary agents), such that (a, o) ∈ A^M. Similarly, the transition probability becomes T^M(s, a, o, s') = Pr(s' | s, a, o), and the new reward function is R^M(s, a, o, s'). Our goal is to learn an optimal policy for the primary agent given interactions with the joint policy π^o of the secondary agents.[2]

[2] While a joint policy defines the distribution of joint actions, the opponents may be controlled by independent policies.

If π^o is stationary, then the multi-agent MDP reduces to a single-agent MDP: the opponents can be considered part of the world. Thus, they redefine the transitions and reward:

T(s, a, s') = Σ_o π^o(o | s) T^M(s, a, o, s'),
R(s, a, s') = Σ_o π^o(o | s) R^M(s, a, o, s').

Therefore, an agent can ignore other agents, and standard Q-learning suffices. Nevertheless, it is often unrealistic to assume opponents use fixed policies. Other agents may also be learning or adapting to maximize rewards. For example, in strategy games, players may disguise their true strategies at the beginning to fool the opponents; winning players protect their lead by playing defensively; and losing players play more aggressively. In these situations, we face opponents with an unknown policy π^o_t that changes over time.

Considering the effects of other agents, the definition of an optimal policy in Section 2 no longer applies: the effectiveness of a policy now depends on the policies of the secondary agents. We therefore define the optimal Q-function relative to the joint policy of opponents: Q^{*|π^o} = max_π Q^{π|π^o}(s, a) for all s ∈ S and a ∈ A. The recurrent relation between Q-values holds:

Q^{π|π^o}(s_t, a_t) = Σ_{o_t} π^o(o_t | s_t) Σ_{s_{t+1}} T(s_t, a_t, o_t, s_{t+1}) [ R(s_t, a_t, o_t, s_{t+1}) + γ E_{a_{t+1}}[ Q^{π|π^o}(s_{t+1}, a_{t+1}) ] ].   (1)

3.2. DQN with Opponent Modeling

Given Equation 1, we can continue applying Q-learning and estimate both the transition function and the opponents' policy by stochastic updates. However, treating opponents as part of the world can slow responses to adaptive opponents (Uther & Veloso, 2003), because the change in behavior is masked by the dynamics of the world.

To encode opponent behavior explicitly, we propose the Deep Reinforcement Opponent Network (DRON), which models Q^{π|π^o} and π^o jointly. DRON is a Q-Network (N_Q) that evaluates actions for a state and an opponent network (N_o) that learns a representation of π^o. The remaining questions are how to combine the two networks and what supervision signal to use. To answer the first question, we investigate two network architectures: DRON-concat, which concatenates N_Q and N_o, and DRON-MoE, which applies a Mixture-of-Experts model. To answer the second question, we consider two settings: (a) predicting Q-values only, as our goal is the best reward instead of accurately simulating opponents; and (b) also predicting extra information about the opponent when it is available, e.g., the type of their strategy.

DRON-concat. We extract features from the state (φ_s) and the opponent (φ_o) and then use linear layers with rectification or convolutional neural networks, N_Q and N_o, to embed them in separate hidden spaces (h^s and h^o). To incorporate knowledge of π^o into the Q-Network, we concatenate the representations of the state and the opponent (Figure 1a).
The concatenation then jointly predicts the Q-value. Therefore, the last layer(s) of the neural network is responsible for understanding the interaction between opponents and Q-values. Since there is only one Q-Network, the model requires a more discriminative representation of the opponents to learn an adaptive policy. To alleviate this, our second model encodes a stronger prior on the relation between opponents' actions and Q-values based on Equation 1.

DRON-MoE. The right part of Equation 1 can be written as Σ_{o_t} π^o(o_t | s_t) Q^π(s_t, a_t, o_t), an expectation over different opponent behavior. We use a Mixture-of-Experts network (Jacobs et al., 1991) to explicitly model the opponent action as a hidden variable and marginalize over it (Figure 1b). The expected Q-value is obtained by combining predictions from multiple expert networks:

Q(s_t, a_t; θ) = Σ_{i=1}^{K} w_i Q_i(h^s, a_t),
Q_i(h^s, ·) = f(W_i^s h^s + b_i^s).

Each expert network predicts a possible reward in the current state. A gating network based on the opponent representation computes the combination weights (a distribution over experts):

w = softmax( f(W^o h^o + b^o) ).

Here f(·) is a nonlinear activation function (ReLU for all experiments), W represents a linear transformation matrix, and b is the bias term.

Unlike DRON-concat, which ignores the interaction between the world and opponent behavior, DRON-MoE knows that Q-values have different distributions depending on φ_o; each expert network captures one type of opponent strategy.
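To make the two architectures concrete, here is a minimal PyTorch-style sketch of the forward passes described above. It is our own illustration, not the released code: the single linear Q head, the encoder depths, and names such as `state_dim` and `opp_dim` are assumptions; only the concatenation in the first model and the gated mixture in the second follow the equations in the text.

```python
import torch
import torch.nn as nn

class DRONConcat(nn.Module):
    """Embed state and opponent features separately, concatenate h_s and h_o, predict Q."""
    def __init__(self, state_dim, opp_dim, n_actions, hidden=50):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # N_Q encoder -> h_s
        self.opp_net = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())      # N_o encoder -> h_o
        self.q_head = nn.Linear(2 * hidden, n_actions)                            # joint prediction

    def forward(self, phi_s, phi_o):
        h_s, h_o = self.state_net(phi_s), self.opp_net(phi_o)
        return self.q_head(torch.cat([h_s, h_o], dim=-1))          # Q(s, .) for all actions

class DRONMoE(nn.Module):
    """K expert Q-heads over h_s, mixed by a gating distribution computed from h_o."""
    def __init__(self, state_dim, opp_dim, n_actions, n_experts=3, hidden=50):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.opp_net = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_experts)])  # W_i^s h_s + b_i^s
        self.gate = nn.Linear(hidden, n_experts)                        # W^o h_o + b^o

    def forward(self, phi_s, phi_o):
        h_s, h_o = self.state_net(phi_s), self.opp_net(phi_o)
        # Q_i(h_s, .) = f(W_i^s h_s + b_i^s), with f = ReLU as in the text.
        expert_q = torch.relu(torch.stack([e(h_s) for e in self.experts], dim=1))  # (batch, K, A)
        w = torch.softmax(torch.relu(self.gate(h_o)), dim=-1)                       # (batch, K)
        return (w.unsqueeze(-1) * expert_q).sum(dim=1)               # expectation over experts
```

Either module can be dropped into the DQN update sketched earlier by calling `model(phi_s, phi_o)` wherever `q_net(s)` appears.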

Multitasking with DRON. The previous two models predict Q-values only, so the opponent representation is learned indirectly through feedback from the Q-value. Extra information about the opponent can provide direct supervision for N_o. Many games reveal additional information besides the final reward at the end of a game. At the very least the agent has observed the actions taken by the opponents in past states; sometimes their private information, such as the hidden cards in poker. More high-level information includes abstracted plans or strategies. Such information reflects characteristics of opponents and can aid policy learning. Unlike previous work that learns a separate model to predict this information about the opponent (Davidson, 1999; Ganzfried & Sandholm, 2011; Schadd et al., 2007), we apply multitask learning and use the observation as extra supervision to learn a shared opponent representation h^o. Figure 2 shows the architecture of multitask DRON, where the supervision is y^o. The advantage of multitasking over explicit opponent modeling is that it uses high-level knowledge of the game and the opponent, while remaining robust to insufficient opponent data and modeling error from Q-values. In Section 4, we evaluate multitasking DRON with two types of supervision signals: the future action and the overall strategy of the opponent.
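As a rough illustration of this multitask variant (our own sketch, not the paper's implementation), the shared opponent encoder feeds both the Q path and a small supervised head, and the two losses are summed. The experiments in Section 4 use cross entropy for the discrete supervision signals; the 1:1 weighting, the batch layout, and the hypothetical `aux_head` below are our assumptions.

```python
import torch
import torch.nn as nn

def multitask_dron_loss(model, aux_head, batch, gamma=0.9, aux_weight=1.0):
    """Joint objective: DQN squared loss + cross entropy on an opponent label y_o.

    `model` is a DRON-style network like the sketch above, exposing .opp_net (the
    shared h_o encoder) and forward(phi_s, phi_o) -> Q-values. `aux_head` is a
    hypothetical linear layer mapping h_o to logits over the opponent's next
    action or strategy type.
    """
    phi_s, a, r, phi_s_next, phi_o, phi_o_next, done, y_o = batch
    q_sa = model(phi_s, phi_o).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * model(phi_s_next, phi_o_next).max(dim=1).values
    q_loss = nn.functional.mse_loss(q_sa, target)

    h_o = model.opp_net(phi_o)                                   # shared opponent representation
    aux_loss = nn.functional.cross_entropy(aux_head(h_o), y_o)   # direct supervision on h_o
    return q_loss + aux_weight * aux_loss
```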
4. Experiments

In this section, we evaluate our models on two tasks, the soccer game and quiz bowl. Both tasks have two players against each other, and the opponent presents varying behavior. We compare DRON models with DQN and analyze their responses against different types of opponents.

Figure 3. Left: Illustration of the soccer game. Right: Strategies of the hand-crafted rule-based agent (offensive: advance to the opponent's goal with the ball, intercept the ball without it; defensive: avoid the opponent with the ball, defend the goal without it).

All systems are trained under the same Q-learning framework. Unless stated otherwise, the experiments have the following configuration: the discount factor γ is 0.9, parameters are optimized by AdaGrad (Duchi et al., 2011) with a learning rate of 0.0005, and the mini-batch size is 64. We use ε-greedy exploration during training, starting with an exploration rate of 0.3 that linearly decays to 0.1 within 500,000 steps. We train all models for fifty epochs. Cross entropy is used as the loss in multitask learning.

4.1. Soccer

Our first testbed is a soccer variant following previous work on multi-player games (Littman, 1994; Collins, 2007; Uther & Veloso, 2003). The game is played on a 6x9 grid (Figure 3) by two players, A and B.[3] The game starts with A and B in random squares in the left and right halves (except the goals), and the ball goes to one of them. Players choose from five actions: move N, S, W, E, or stand still (Figure 3(1)). An action is invalid if it takes the player to a shaded square or outside of the border. If two players move to the same square, the player who possesses the ball before the move loses it to the opponent (Figure 3(2)), and the move does not take place. A player scores one point if they take the ball to the opponent's goal (Figure 3(3), (4)) and the game ends. If neither player scores a goal within one hundred steps, the game ends with a zero-zero tie.

[3] Although the game is played in a grid world, we do not represent the Q-function in tabular form as in previous work. Therefore it can be generalized to more complex pixel-based settings.

Implementation. We design a two-mode rule-based agent as the opponent (Figure 3, right). In the offensive mode, the agent always prioritizes attacking over defending. In 5000 games against a random agent, it wins 99.86% of the time and the average episode length is 10.46. In the defensive mode, the agent only focuses on defending its own goal. As a result, it wins 31.80% of the games and ties 58.40% of them; the average episode length is 81.70. It is easy to find a strategy to defeat the opponent in either mode; however, that strategy does not work well for both modes, as we will show in Table 2. Therefore, the agent randomly chooses between the two modes in each game to create a varying strategy.
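The two-mode behavior can be summarized in a few lines of Python. This is only a sketch of the scripted opponent described above, not the authors' agent: the greedy Manhattan-distance movement, the `(x, y)` coordinate convention, and the omission of wall and invalid-move handling are all our simplifications.

```python
import random

ACTIONS = {"N": (0, -1), "S": (0, 1), "W": (-1, 0), "E": (1, 0), "stand": (0, 0)}

def greedy_step(src, dst):
    """Pick the action that most reduces Manhattan distance from src to dst."""
    return min(ACTIONS, key=lambda a: abs(src[0] + ACTIONS[a][0] - dst[0])
                                      + abs(src[1] + ACTIONS[a][1] - dst[1]))

class TwoModeOpponent:
    """Sketch of the scripted soccer opponent: one mode is drawn per game."""
    def __init__(self, own_goal, opp_goal):
        self.own_goal, self.opp_goal = own_goal, opp_goal   # (x, y) goal centers
        self.reset()

    def reset(self):
        # A fresh mode each game creates the varying strategy described above.
        self.mode = random.choice(["offensive", "defensive"])

    def act(self, my_pos, their_pos, i_have_ball):
        if self.mode == "offensive":
            # Attack: run at the opponent's goal with the ball, otherwise chase the carrier.
            return greedy_step(my_pos, self.opp_goal if i_have_ball else their_pos)
        if i_have_ball:
            # Defensive with the ball: move away from the opponent.
            return max(ACTIONS, key=lambda a: abs(my_pos[0] + ACTIONS[a][0] - their_pos[0])
                                             + abs(my_pos[1] + ACTIONS[a][1] - their_pos[1]))
        # Defensive without the ball: fall back toward our own goal.
        return greedy_step(my_pos, self.own_goal)
```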

The input state is a 1x15 vector representing the coordinates of the agent and the opponent, the axis limits of the field, the positions of the goal areas, and ball possession. We define a player's move by five cases: approaching the agent, avoiding the agent, approaching the agent's goal, approaching its own goal, and standing still. Opponent features include the frequencies of observed opponent moves, its most recent move and action, and the frequency of losing the ball to the opponent.

The baseline DQN has two hidden layers, both with 50 hidden units. We call this model DQN-world: opponents are modeled as part of the world. The hidden layer of the opponent network in DRON also has 50 hidden units. For multitasking, we experiment with two supervision signals: the opponent's action in the current state (+action) and the opponent's mode (+type).
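As an illustration of how such hand-crafted opponent features might be assembled (our own sketch; the exact dimensionality, the one-hot encodings, and the counters below are assumptions, not the paper's code):

```python
import numpy as np

MOVE_TYPES = ["approach_agent", "avoid_agent", "approach_their_goal",
              "approach_own_goal", "stand_still"]

def opponent_features(move_history, last_move, last_action, n_ball_losses, n_steps):
    """Build a fixed-length opponent feature vector from observed behavior.

    move_history: list of entries from MOVE_TYPES observed so far against this opponent.
    last_move / last_action: most recent abstracted move type and raw action index (0-4).
    n_ball_losses / n_steps: how often we lost the ball to this opponent, and steps observed.
    """
    freqs = np.array([move_history.count(m) for m in MOVE_TYPES], dtype=float)
    freqs /= max(len(move_history), 1)                       # frequencies of observed moves
    last_move_onehot = np.eye(len(MOVE_TYPES))[MOVE_TYPES.index(last_move)]
    last_action_onehot = np.eye(5)[last_action]               # N, S, W, E, stand still
    loss_rate = np.array([n_ball_losses / max(n_steps, 1)])   # frequency of losing the ball
    return np.concatenate([freqs, last_move_onehot, last_action_onehot, loss_rate])
```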
Results. In Table 1, we compare the rewards of the DRON models, their multitasking variations, and DQN-world. After each epoch, we evaluate the policy on 5000 randomly generated games (the test set) and compute the average reward. We report the mean test reward after the model stabilizes and the maximum test reward ever achieved. The DRON models outperform the DQN baseline. Our model also has much smaller variance (Figure 4).

Table 1. Rewards of DQN and DRON models on Soccer. We report the maximum test reward ever achieved (Max R) and the average reward of the last 10 epochs (Mean R). Statistically significant (p < 0.05 in two-tailed pairwise t-tests) improvement over DQN is marked with an asterisk, and over all other models in bold. DRON models achieve higher rewards in both measures.

                 Basic    Multitask +action    Multitask +type
  Max R
    DRON-concat  0.682    0.695                0.690
    DRON-MoE     0.699    0.697                0.686
    DQN-world    0.664    -                    -
  Mean R
    DRON-concat  0.660    0.672                0.669
    DRON-MoE     0.675    0.664                0.672
    DQN-world    0.616    -                    -

Figure 4. Learning curves on Soccer over fifty epochs. DRON models are more stable than DQN.

Adding additional supervision signals improves DRON-concat but not DRON-MoE (multitask columns). DRON-concat does not explicitly learn different strategies for different types of opponents; therefore a more discriminative opponent representation helps model the relation between opponent behavior and Q-values. However, for DRON-MoE, while a better opponent representation is still desirable, the supervision signal may not be aligned with the classification of the opponents learned from the Q-values.

To investigate how the learned policies adapt to different opponents, we test the agents against a defensive opponent and an offensive opponent separately. Furthermore, we train two DQN agents targeting each type of opponent respectively. Their performance is the best an agent can do when facing a single type of opponent (in our setting), as the strategies are learned to defeat this particular opponent. Table 2 shows the average rewards of each model and the DQN upper bounds (in bold). DQN-world is confused by the defensive behavior and significantly sacrifices its performance against the offensive opponent; DRON achieves a much better trade-off, retaining rewards close to both upper bounds against the varying opponent.

Table 2. Average rewards of DQN and DRON models when playing against different types of opponents. Offensive and defensive agents are represented by O and D. "O only" and "D only" mean training against O or D agents only. Upper bounds of rewards are in bold. DRON achieves rewards close to the upper bounds against both types of opponents.

       DQN (O only)   DQN (D only)   DQN-world   DRON-concat   DRON-MoE
  O    0.897          -0.272         0.811       0.875         0.870
  D    0.480          0.504          0.498       0.493         0.486

Finally, we examine how the number of experts in DRON-MoE affects the result. From Figure 5, we see no significant difference when varying the number of experts, and DRON-MoE consistently performs better than DQN across all K. Multitasking does not help here.

Figure 5. Effect of varying the number of experts (2-4) and multitasking on Soccer. The error bars show the 90% confidence interval. DRON-MoE consistently improves over DQN regardless of the number of mixture components. Adding extra supervision does not obviously improve the results.

4.2. Quiz Bowl

Quiz bowl is a trivia game widely played in English-speaking countries between schools, with tournaments held most weekends. It is usually played between two teams. Questions are read to players, and they score points by buzzing in first (often before the question is finished) and answering the question correctly. One example question with buzzes is shown in Figure 7.

A successful quiz bowl player needs two things: a content model to predict answers given (partial) questions and a buzzing model to decide when to buzz.

Content Model. We model the question answering part as an incremental text-classification problem. Our content model is a recurrent neural network with gated recurrent units (GRUs). It reads in the question sequentially and outputs a distribution over answers at each word, given the past information encoded in the hidden states.
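A minimal sketch of such an incremental GRU classifier is shown below (our illustration, not the paper's model). The 1045-answer output matches the dataset described under Implementation; the embedding and hidden widths are assumptions.

```python
import torch
import torch.nn as nn

class ContentModel(nn.Module):
    """Incremental text classifier: a distribution over answers after every word."""
    def __init__(self, vocab_size, n_answers=1045, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_answers)

    def forward(self, word_ids):                        # word_ids: (batch, seq_len)
        h, _ = self.gru(self.embed(word_ids))           # hidden state after each word
        return torch.log_softmax(self.out(h), dim=-1)   # log P(answer | words so far), per position
```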
Buzzing Model. To test depth of knowledge, questions start with obscure information and reveal more and more obvious clues towards the end (e.g., Figure 7). Therefore, the buzzing model faces a speed-accuracy tradeoff: while buzzing later increases one's chance of answering correctly, it also increases the risk of losing the chance to answer. A safe strategy is to always buzz as soon as the content model is confident enough. A smarter strategy, however, is to adapt to different opponents: if the opponent often buzzes late, wait for more clues; otherwise, buzz more aggressively. To model interaction with other players, we take a reinforcement learning approach to learn a buzzing policy. The state includes the words revealed and the predictions from the content model, and the actions are buzz and wait. Upon buzzing, the content model outputs the most likely answer at the current position. An episode terminates when one player buzzes and answers the question correctly. Correct answers are worth 10 points and wrong answers are -5 points.

Implementation. We collect question/answer pairs and log user buzzes from Protobowl, an online multi-player quiz bowl application.[4] Additionally, we include data from Boyd-Graber et al. (2012). Most buzzes are from strong tournament players. After removing answers with fewer than five questions and users who played fewer than twenty questions, we end up with 1045 answers, 37.7k questions, and 3610 users. We divide all questions into two non-overlapping sets: one for training the content model and one for training the buzzing policy. The two sets are further divided into train/dev and train/dev/test sets randomly.

[4] http://protobowl.com

Figure 6. Accuracy vs. the number of words revealed. (a) Real-time user performance. Each dot represents one user; dot size and color correspond to the number of questions the user answered. (b) Content model performance. Accuracy is measured based on predictions at each word. Accuracy improves as more words are revealed.

There are clearly two clusters of players (Figure 6(a)): aggressive players who buzz early with varying accuracies, and cautious players who buzz late but maintain higher accuracy. Our GRU content model (Figure 6(b)) is more accurate with more input words, a behavior similar to human players.

Our input state must represent information from the content model and the opponents. Information from the content model takes the form of a belief vector: a 1x1045 vector with the current estimate (as a log probability) of each possible guess being the correct answer given our input so far. We concatenate the belief vector from the previous time step to capture sudden shifts in certainty, which are often good opportunities to buzz. In addition, we include the number of words seen and whether a wrong buzz has happened. The opponent features include the number of questions the opponent has answered, the average buzz position, and the error rate.

The basic DQN has two hidden layers, both with 128 hidden units. The hidden layer for the opponent has ten hidden units. Similar to soccer, we experiment with two settings for multitasking: (a) predicting how the opponent buzzes, and (b) predicting the opponent type. We approximate the ground truth for (a) by min(1, t/buzz position) and use the mean squared error as the loss function. The ground truth for (b) is based on dividing players into four groups according to their buzz positions (the percentage of the question revealed).
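Putting these pieces together, the buzzing policy's input state might be assembled roughly as follows (our sketch; the ordering of the components and the lack of any normalization are assumptions):

```python
import numpy as np

def buzz_state(belief, prev_belief, n_words_seen, wrong_buzz_happened, opponent):
    """Concatenate content-model and opponent information into one state vector.

    belief, prev_belief: length-1045 log-probability vectors over answers at the
    current and previous word. `opponent` holds the three opponent features.
    """
    return np.concatenate([
        belief,
        prev_belief,                                   # captures sudden shifts in certainty
        [n_words_seen, float(wrong_buzz_happened)],
        [opponent["n_questions_answered"],
         opponent["avg_buzz_position"],
         opponent["error_rate"]],
    ])
```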

Results. In addition to DQN-world, we also compare with DQN-self, a baseline without interaction with opponents at all. DQN-self is ignorant of the opponents and plays the safe strategy: answer as soon as the content model is confident. During training, when the answer prediction is correct, it receives reward 10 for buzz and -10 for wait. When the answer prediction is incorrect, it receives reward -15 for buzz and 15 for wait. Since all rewards are immediate, we set γ to 0 for DQN-self.[5] With data of the opponents' responses, DRON and DQN-world use the game payoff (from the perspective of the computer) as the reward.

[5] This is equivalent to cost-sensitive classification.

First we compare the average test-set rewards of our models, DRON-concat and DRON-MoE (with 3 experts), and the baseline models, DQN-self and DQN-world. From the first column in Table 3, our models achieve statistically significant improvements over the DQN baselines, and DRON-MoE outperforms DRON-concat. In addition, the DRON models have much less variance compared to DQN-world, as the learning curves in Figure 9 show.

Table 3. Comparison between DRON and DQN models. The left columns show the average reward of each model on the test set (Basic, Multitask +action, Multitask +type). The right columns show the performance of the basic models against opponents buzzing at different positions (% of the question revealed, with the number of episodes in parentheses), including the average reward (R), the rate of buzzing incorrectly (rush), and the rate of missing the chance to buzz correctly (miss). Higher is better for R; lower is better for rush and miss. In the left columns, statistically significant results (p < 0.05 in two-tailed pairwise t-tests) were indicated with boldface for vertical comparison and an asterisk for horizontal comparison.

                Basic  +action  +type   0-25% (4.8k)       25-50% (18k)       50-75% (0.7k)      75-100% (1.3k)
                R      R        R       R     rush  miss   R     rush  miss   R     rush  miss   R     rush  miss
  DRON-concat   1.04   1.34     1.25   -0.86  0.06  0.15   1.65  0.10  0.11  -1.35  0.13  0.18   0.81  0.19  0.12
  DRON-MoE      1.29   1.00     1.29   -0.46  0.06  0.15   1.92  0.10  0.11  -1.44  0.18  0.16   0.56  0.22  0.10
  DQN-world     0.95   -        -      -0.72  0.04  0.16   1.67  0.09  0.12  -2.33  0.23  0.15  -1.01  0.30  0.09
  DQN-self      0.80   -        -      -0.46  0.09  0.12   1.48  0.14  0.10  -2.76  0.30  0.12  -1.97  0.38  0.07

Figure 7. Example question: "The antibiotic erythromycin works by disrupting this organelle, which contains E, P, and A sites on its large subunit. The parts of this organelle are assembled at nucleoli, and when bound to a membrane, these create the rough ER. Codons are translated at this organelle where the tRNA and mRNA meet. For 10 points, name this organelle that is the site of protein synthesis." Buzz positions of human players and agents (DQN-self, DQN-world, DRON-MoE, DRON-concat) on this science question, whose answer is ribosome. Words where a player buzzes are displayed in a color unique to the player; a wrong buzz is shown in italics. Words where an agent buzzes are subscripted by a symbol unique to the agent; the color of the symbol corresponds to the player it is playing against. A gray symbol means that the buzz position of the agent does not depend on its opponent. DRON agents adjust their buzz positions according to the opponent's buzz position and correctness. Best viewed in color.

To investigate the strategies learned by these models, we show their performance against different types of players (as defined at the end of "Implementation") in Table 3, right columns. We compare three measures of performance: the average reward (R), the percentage of early and incorrect buzzes (rush), and the percentage of missed chances to buzz correctly before the opponent (miss). All models beat Type 2 players, mainly because they are the majority in our dataset. As expected, DQN-self learns a safe strategy that tends to buzz early. It performs the best against Type 1 players, who answer early.
However, it has a very high rush rate against cautious players, resulting in much lower rewards against Type 3 and Type 4 players. Without opponent modeling, DQN-world is biased towards the majority player, thus having the same problem as DQN-self when playing against players who buzz late. Both DRON models exploit cautious players while holding their own against aggressive players. Furthermore, DRON-MoE matches DQN-self against Type 1 players; thus it discovers different buzzing strategies.

Figure 7 shows an example question with buzz positions labeled. The DRON agents demonstrate dynamic behavior against different players; DRON-MoE almost always buzzes right before the opponent in this example. In addition, when the player buzzes wrong and the game continues, DRON-MoE learns to wait longer since the opponent is gone, while the other agents are still in a rush.

As with the Soccer task, adding extra supervision does not yield better results over DRON-MoE (Table 3) but significantly improves DRON-concat. Figure 8 varies the number of experts in DRON-MoE (K) from two to four. Using a mixture model for the opponents consistently improves over the DQN baseline, and using three experts gives better performance on this task. For multitasking, adding the action supervision does not help at all. However, the more high-level type supervision yields competent results, especially with four experts, mostly because the number of experts matches the number of types.

Figure 8. Effect of varying the number of experts (2-4) and multitasking on quiz bowl. The error bars show the 90% confidence interval. DRON-MoE consistently improves over DQN regardless of the number of mixture components. Supervision of the opponent type is more helpful than the specific action taken.

Figure 9. Learning curves on Quizbowl over fifty epochs. DRON models are more stable than DQN.

5. Related Work and Discussion

Implicit vs. explicit opponent modeling. Opponent modeling has been studied extensively in games. Most existing approaches fall into the category of explicit modeling, where a model (e.g., decision trees, neural networks, Bayesian models) is built to directly predict parameters of the opponent, e.g., actions (Uther & Veloso, 2003; Ganzfried & Sandholm, 2011), private information (Billings et al., 1998b; Richards & Amir, 2007), or domain-specific strategies (Schadd et al., 2007; Southey et al., 2005). One difficulty is that the model may need a prohibitive number of examples before producing anything useful. Another is that, as the opponent behavior is modeled separately from the world, it is not always clear how to incorporate these predictions robustly into policy learning. The results on multitasking DRON also suggest that the improvement from explicit modeling is limited. However, it is better suited to games of incomplete information, where it is clear what information needs to be predicted to achieve a higher reward.

Our work is closely related to implicit opponent modeling. Since the agent aims to maximize its own expected reward without having to identify the opponent's strategy, this approach does not have the difficulty of incorporating predictions of the opponent's parameters. Rubin & Watson (2011) and Bard et al. (2013) construct a portfolio of strategies offline, based on domain knowledge or past experience, for heads-up limit Texas hold'em; they then select strategies online using multi-arm bandit algorithms. Our approach does not have a clear online/offline distinction. We learn strategies and their selector in a joint, probabilistic way. However, the offline construction can be mimicked in our models by initializing the expert networks with DQNs pre-trained against different opponents.

Neural network opponent models. Davidson (1999) applies neural networks to opponent modeling, where a simple multi-layer perceptron is trained as a classifier to predict opponent actions given game logs. Lockett et al. (2007) propose an architecture similar to DRON-concat that aims to identify the type of an opponent. However, instead of learning a hidden representation, they learn a mixture of weights over a pre-specified set of cardinal opponents; and they use the neural network as a standalone solver without the reinforcement learning setting, which may not be suitable for more complex problems. Foerster et al. (2016) use modern neural networks to learn a group of parameter-sharing agents that solve a coordination task, where each agent is controlled by a deep recurrent Q-Network (Hausknecht & Stone, 2015). Our setting is different in that we control only one agent and the policy space of the other agents is unknown. Opponent modeling with neural networks remains understudied, with ample room for improvement.

6. Conclusion and Future Work

Our general opponent modeling approach in the reinforcement learning setting incorporates (implicit) prediction of opponents' behavior into policy learning without domain knowledge.
We use recent deep Q-learning advances to learn a representation of opponents that better maximizes available rewards. The proposed network architectures are novel models that capture the interaction between opponent behavior and Q-values. Our model is also flexible enough to include supervision for parameters of the opponents, much as in explicit modeling.

These gains can further benefit from advances in deep learning. For example, Eigen et al. (2014) extend the Mixture-of-Experts network to a stacked model, the deep Mixture-of-Experts, which can be combined with hierarchical reinforcement learning to learn a hierarchy of opponent strategies in large, complex domains such as online strategy games. In addition, instead of hand-crafting opponent features, we can feed in raw opponent actions and use a recurrent neural network to learn the opponent representation. Another important direction is to design online algorithms that can adapt to fast-changing behavior and balance exploitation and exploration of opponents.

Acknowledgements

We thank Hua He, Xiujun Li, and Mohit Iyyer for helpful discussions about deep Q-learning and our model. We also thank the anonymous reviewers for their insightful comments. This work was supported by NSF grant IIS-1320538. Boyd-Graber is also partially supported by NSF grants CCF-1409287 and NCSE-1422492. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Bard, Nolan, Johanson, Michael, Burch, Neil, and Bowling, Michael. Online implicit agent modelling. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2013.

Billings, Darse, Papp, Denis, Schaeffer, Jonathan, and Szafron, Duane. Opponent modeling in poker. In Association for the Advancement of Artificial Intelligence, 1998a.

Billings, Darse, Papp, Denis, Schaeffer, Jonathan, and Szafron, Duane. Opponent modeling in poker. In Association for the Advancement of Artificial Intelligence, 1998b.

Boyd-Graber, Jordan, Satinoff, Brianna, He, He, and Daumé III, Hal. Besting the quiz master: Crowdsourcing incremental classification games. In Empirical Methods in Natural Language Processing, 2012.

Collins, Brian. Combining opponent modeling and model-based reinforcement learning in a two-player competitive game. Master's thesis, School of Informatics, University of Edinburgh, 2007.

Davidson, Aaron. Using artificial neural networks to model opponents in Texas hold'em. CMPUT 499 - Research Project Review, 1999. URL http://www.spaz.ca/aaron/poker/nnpoker.pdf.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

Eigen, David, Ranzato, Marc'Aurelio, and Sutskever, Ilya. Learning factored representations in a deep mixture of experts. In ICLR Workshop, 2014.

Foerster, Jakob N., Assael, Yannis M., de Freitas, Nando, and Whiteson, Shimon. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv:1602.02672, 2016.

Ganzfried, Sam and Sandholm, Tuomas. Game theory-based opponent modeling in large imperfect-information games. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2011.

Hausknecht, Matthew and Stone, Peter. Deep recurrent Q-learning for partially observable MDPs. arXiv:1507.06527, 2015.

Jacobs, Robert A., Jordan, Michael I., Nowlan, Steven J., and Hinton, Geoffrey E. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

Littman, Michael L. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1994.

Lockett, Alan J., Chen, Charles L., and Miikkulainen, Risto. Evolving explicit opponent models in game playing. In Proceedings of the Genetic and Evolutionary Computation Conference, 2007.

Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, and Kavukcuoglu, Koray. Recurrent models of visual attention. In Proceedings of Advances in Neural Information Processing Systems, 2014.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. URL http://dx.doi.org/10.1038/nature14236.
Richards, Mark and Amir, Eyal. Opponent modeling in Scrabble. In International Joint Conference on Artificial Intelligence, 2007.

Rubin, Jonathan and Watson, Ian. On combining decisions from multiple expert imitators for performance. In International Joint Conference on Artificial Intelligence, 2011.

Schadd, Frederik, Bakkes, Sander, and Spronck, Pieter. Opponent modeling in real-time strategy games. In Proceedings of GAME-ON 2007, pp. 61-68, 2007.

Southey, Finnegan, Bowling, Michael, Larson, Bryce, Piccione, Carmelo, Burch, Neil, Billings, Darse, and Rayner, Chris. Bayes' bluff: Opponent modelling in poker. In Proceedings of Uncertainty in Artificial Intelligence, 2005.

Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

Tampuu, Ardi, Matiisen, Tambet, Kodelja, Dorian, Kuzovkin, Ilya, Korjus, Kristjan, Aru, Juhan, Aru, Jaan, and Vicente, Raul. Multiagent cooperation and competition with deep reinforcement learning. arXiv:1511.08779, 2015.

Uther, William and Veloso, Manuela. Adversarial reinforcement learning. Technical Report CMU-CS-03-107, School of Computer Science, Carnegie Mellon University, 2003.

Watkins, Christopher J. C. H. and Dayan, Peter. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

Zhang, Marvin, McCarthy, Zoe, Finn, Chelsea, Levine, Sergey, and Abbeel, Pieter. Learning deep neural network policies with continuous memory states. arXiv:1507.01273, 2015.