Opponent Modeling in Deep Reinforcement Learning


He He, University of Maryland, College Park, MD, USA
Jordan Boyd-Graber, University of Colorado, Boulder, CO, USA
Kevin Kwok, Massachusetts Institute of Technology, Cambridge, MA, USA
Hal Daumé III, University of Maryland, College Park, MD, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Opponent modeling is necessary in multi-agent settings where secondary agents with competing goals also adapt their strategies, yet it remains challenging because strategies interact with each other and change. Most previous work focuses on developing probabilistic models or parameterized strategies for specific applications. Inspired by the recent success of deep reinforcement learning, we present neural-based models that jointly learn a policy and the behavior of opponents. Instead of explicitly predicting the opponent's action, we encode observations of the opponents into a deep Q-Network (DQN); however, we retain explicit modeling (if desired) using multitasking. By using a Mixture-of-Experts architecture, our model automatically discovers different strategy patterns of opponents without extra supervision. We evaluate our models on a simulated soccer game and a popular trivia game, showing superior performance over DQN and its variants.

1. Introduction

An intelligent agent working in strategic settings (e.g., collaborative or competitive tasks) must predict the actions of other agents and reason about their intentions. This is important because all active agents affect the state of the world. For example, a multi-player game AI can exploit suboptimal players if it can predict their bad moves; a negotiating agent can reach an agreement faster if it knows the other party's bottom line; a self-driving car must avoid accidents by predicting where cars and pedestrians are going.

Two critical questions in opponent modeling are what variable(s) to model and how to use the predicted information. However, the answers depend much on the specific application, and most previous work (Billings et al., 1998a; Southey et al., 2005; Ganzfried & Sandholm, 2011) focuses exclusively on poker games, which require substantial domain knowledge. We aim to build a general opponent modeling framework in the reinforcement learning setting, which enables the agent to exploit idiosyncrasies of various opponents. First, to account for changing behavior, we model uncertainty in the opponent's strategy instead of classifying it into a set of stereotypes. Second, domain knowledge is often required when predictions of the opponents are separated from learning the dynamics of the world. Therefore, we jointly learn a policy and model the opponent probabilistically.

We develop a new model, DRON (Deep Reinforcement Opponent Network), based on the recent deep Q-Network of Mnih et al. (2015, DQN), in Section 3. DRON has a policy learning module that predicts Q-values and an opponent learning module that infers the opponent strategy. Instead of explicitly predicting opponent properties, DRON learns a hidden representation of the opponents based on past observations and uses it (in addition to the state information) to compute an adaptive response. More specifically, we propose two architectures, one using simple concatenation to combine the two modules and one based on the Mixture-of-Experts network. While we model opponents implicitly, additional supervision (e.g., the action or strategy taken) can be added through multitasking.

(Code and data: https://github.com/hhexiy/opponent)

Compared to previous models that are specialized for particular applications, DRON is designed with a general purpose and does not require knowledge of possible (parameterized) game strategies.

A second contribution is DQN agents that learn in multi-agent settings. Deep reinforcement learning has shown competitive performance in various tasks: arcade games (Mnih et al., 2015), object recognition (Mnih et al., 2014), and robot navigation (Zhang et al., 2015). However, it has mostly been applied to single-agent decision-theoretic settings with stationary environments. One exception is Tampuu et al. (2015), where two agents controlled by independent DQNs interact under collaborative and competitive rewards. While their focus is the collective behavior of a multi-agent system with known controllers, we study from the viewpoint of a single agent that must learn a reactive policy in a stochastic environment filled with unknown opponents.

We evaluate our method on two tasks in Section 4: a simulated two-player soccer game in a grid world, and a real question-answering game, quiz bowl, against users playing online. Both games have opponents with a mixture of strategies that require different counter-strategies. Our model consistently achieves better results than the DQN baseline. In addition, we show our method is more robust to non-stationary strategies; it successfully identifies the opponent's strategy and responds correspondingly.

Figure 1. Diagram of the DRON architecture. (a) DRON-concat: the opponent representation is concatenated with the state representation. (b) DRON-MoE: Q-values predicted by K experts are combined linearly by weights from the gating network.

2. Deep Q-Learning

Reinforcement learning is commonly used for solving Markov decision processes (MDPs), where an agent interacts with the world and collects rewards. Formally, the agent takes an action a in state s, goes to the next state s' according to the transition probability T(s, a, s') = Pr(s' | s, a), and receives reward r. States and actions are defined by the state space S and the action space A. Rewards r are assigned by a real-valued reward function R(s, a, s'). The agent's behavior is defined by a policy π such that π(a | s) is the probability of taking action a in state s. The goal of reinforcement learning is to find an optimal policy π* that maximizes the expected discounted cumulative reward

R = E[ Σ_{t=0}^{T} γ^t r_t ],

where γ ∈ [0, 1] is the discount factor and T is the time step when the episode ends.

One approach to solving MDPs is to compute the Q-function: the expected reward starting from state s, taking action a, and following policy π:

Q^π(s, a) = E[ Σ_t γ^t r_t | s_0 = s, a_0 = a, π ].

Q-values of an optimal policy solve the Bellman equation (Sutton & Barto, 1998):

Q*(s, a) = Σ_{s'} T(s, a, s') [ r + γ max_{a'} Q*(s', a') ].

Optimal policies always select the action with the highest Q-value for a given state. Q-learning (Watkins & Dayan, 1992; Sutton & Barto, 1998) finds the optimal Q-values without knowledge of T. Given observed transitions (s, a, s', r), Q-values are updated recursively:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ].
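
To make the update above concrete, here is a minimal tabular Q-learning step in Python. This is our own illustrative sketch rather than code from the paper, and it assumes an environment object whose step method returns (next_state, reward, done):

    import random
    from collections import defaultdict

    Q = defaultdict(float)  # Q-table mapping (state, action) pairs to estimated values

    def q_learning_step(env, state, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        # epsilon-greedy behavior policy
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)  # assumed environment interface
        # TD target r + gamma * max_a' Q(s', a'); future value is zero at episode end
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        return next_state, done
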
For complex problems with continuous states, the Q-function cannot be expressed as a lookup table, requiring a continuous approximation. Deep reinforcement learning such as DQN (Mnih et al., 2015), a deep Q-learning method with experience replay, approximates the Q-function using a neural network. It draws samples (s, a, s', r) from a replay memory M, and the neural network predicts Q by minimizing the squared loss at iteration i:

L_i(θ_i) = E_{(s,a,s',r) ~ U(M)} [ ( r + γ max_{a'} Q(s', a'; θ_{i-1}) - Q(s, a; θ_i) )^2 ],

where U(M) is a uniform distribution over the replay memory.
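
The replay-based loss can be written compactly in PyTorch. The sketch below is a generic DQN update under our own assumptions (a q_net with current parameters θ_i and a frozen target_net holding θ_{i-1}); it is not the authors' released implementation:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.9):
        s, a, r, s_next, done = batch                         # tensors sampled uniformly from replay memory M
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta_i); a holds integer action indices
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values     # max_a' Q(s', a'; theta_{i-1})
            target = r + gamma * (1.0 - done) * q_next        # TD target; done masks terminal states
        return F.mse_loss(q_sa, target)
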

3. Deep Reinforcement Opponent Network

In a multi-agent setting, the environment is affected by the joint action of all agents. From the perspective of one agent, the outcome of an action in a given state is no longer stable, but depends on the actions of the other agents. In this section, we first analyze the effect of multiple agents on the Q-learning framework; then we present DRON and its multitasking variation.

3.1. Q-Learning with Opponents

In MDP terms, the joint action space is defined by A_M = A_1 × A_2 × ... × A_n, where n is the total number of agents. We use a to denote the action of the agent we control (the primary agent) and o to denote the joint action of all other agents (secondary agents), such that (a, o) ∈ A_M. Similarly, the transition probability becomes T_M(s, a, o, s') = Pr(s' | s, a, o), and the new reward function is R_M(s, a, o, s'). Our goal is to learn an optimal policy for the primary agent given interactions with the joint policy π^o of the secondary agents (while a joint policy defines the distribution of joint actions, the opponents may be controlled by independent policies).

If π^o is stationary, then the multi-agent MDP reduces to a single-agent MDP: the opponents can be considered part of the world. Thus, they redefine the transitions and reward:

T(s, a, s') = Σ_o π^o(o | s) T_M(s, a, o, s'),
R(s, a, s') = Σ_o π^o(o | s) R_M(s, a, o, s').

Therefore, an agent can ignore other agents, and standard Q-learning suffices. Nevertheless, it is often unrealistic to assume that opponents use fixed policies. Other agents may also be learning or adapting to maximize rewards. For example, in strategy games, players may disguise their true strategies at the beginning to fool the opponents; winning players protect their lead by playing defensively; and losing players play more aggressively. In these situations, we face opponents with an unknown policy π^o that changes over time.

Considering the effects of other agents, the definition of an optimal policy in Section 2 no longer applies: the effectiveness of a policy now depends on the policies of the secondary agents. We therefore define the optimal Q-function relative to the joint policy of opponents: Q^{*|π^o} = max_π Q^{π|π^o}(s, a), for all s ∈ S and a ∈ A. The following recurrent relation between Q-values holds:

Q^{π|π^o}(s_t, a_t) = Σ_{o_t} π^o(o_t | s_t) Σ_{s_{t+1}} T(s_t, a_t, o_t, s_{t+1}) [ R(s_t, a_t, o_t, s_{t+1}) + γ E_{a_{t+1}} [ Q^{π|π^o}(s_{t+1}, a_{t+1}) ] ].   (1)

Figure 2. Diagram of DRON with multitasking. The blue part shows that the supervision signal from the opponent affects the Q-learning network by changing the opponent features.

3.2. DQN with Opponent Modeling

Given Equation 1, we can continue applying Q-learning and estimate both the transition function and the opponents' policy by stochastic updates. However, treating opponents as part of the world can slow responses to adaptive opponents (Uther & Veloso, 2003), because the change in behavior is masked by the dynamics of the world. To encode opponent behavior explicitly, we propose the Deep Reinforcement Opponent Network (DRON), which models Q^{π|π^o} and π^o jointly. DRON is a Q-Network (N_Q) that evaluates actions for a state and an opponent network (N_o) that learns a representation of π^o. The remaining questions are how to combine the two networks and what supervision signal to use. To answer the first question, we investigate two network architectures: DRON-concat, which concatenates N_Q and N_o, and DRON-MoE, which applies a Mixture-of-Experts model. To answer the second question, we consider two settings: (a) predicting Q-values only, as our goal is the best reward instead of accurately simulating opponents; and (b) also predicting extra information about the opponent when it is available, e.g., the type of its strategy.

DRON-concat. We extract features from the state (φ_s) and the opponent (φ_o) and then use linear layers with rectification or convolutional neural networks, N_Q and N_o, to embed them in separate hidden spaces (h^s and h^o). To incorporate knowledge of π^o into the Q-Network, we concatenate the representations of the state and the opponent (Figure 1a).
The concatenation then jointly predicts the Q-value. Therefore, the last layer(s) of the neural network is responsible for understanding the interaction between opponents and Q-values. Since there is only one Q-Network, the model requires a more discriminative representation of the opponents to learn an adaptive policy. To alleviate this, our second model encodes a stronger prior on the relation between the opponent's actions and Q-values, based on Equation 1.

DRON-MoE. The right part of Equation 1 can be written as Σ_{o_t} π^o(o_t | s_t) Q^π(s_t, a_t, o_t), an expectation over different opponent behavior. We use a Mixture-of-Experts network (Jacobs et al., 1991) to explicitly model the opponent action as a hidden variable and marginalize over it (Figure 1b). The expected Q-value is obtained by combining predictions from multiple expert networks:

Q(s_t, a_t; θ) = Σ_{i=1}^{K} w_i Q_i(h^s, a_t),   Q_i(h^s, ·) = f(W_i^s h^s + b_i^s).

Each expert network predicts a possible reward in the current state. A gating network based on the opponent representation computes the combination weights (a distribution over experts):

w = softmax( f(W^o h^o + b^o) ).

Here f(·) is a nonlinear activation function (ReLU for all experiments), W represents a linear transformation matrix, and b is the bias term. Unlike DRON-concat, which ignores the interaction between the world and opponent behavior, DRON-MoE knows that Q-values have different distributions depending on φ_o; each expert network captures one type of opponent strategy.
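
To make the two heads concrete, the following PyTorch sketch implements DRON-concat and DRON-MoE as we read Figure 1 and the equations above; it is an assumption-level illustration, not the released code. Here h_s and h_o stand for the hidden state and opponent representations produced by the embedding networks N_Q and N_o:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DRONConcat(nn.Module):
        """Concatenate the state and opponent representations, then predict Q-values."""
        def __init__(self, d_s, d_o, n_actions):
            super().__init__()
            self.q_head = nn.Linear(d_s + d_o, n_actions)

        def forward(self, h_s, h_o):
            return self.q_head(torch.cat([h_s, h_o], dim=-1))

    class DRONMoE(nn.Module):
        """K expert Q-networks combined by a gating network over the opponent representation."""
        def __init__(self, d_s, d_o, n_actions, k_experts):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(d_s, n_actions) for _ in range(k_experts)])
            self.gate = nn.Linear(d_o, k_experts)

        def forward(self, h_s, h_o):
            q_i = torch.stack([F.relu(e(h_s)) for e in self.experts], dim=1)  # Q_i(h^s, .) = f(W_i h^s + b_i)
            w = F.softmax(F.relu(self.gate(h_o)), dim=-1)                     # w = softmax(f(W^o h^o + b^o))
            return (w.unsqueeze(-1) * q_i).sum(dim=1)                         # sum_i w_i Q_i(h^s, a)

Because the gating weights depend only on the opponent representation, each expert is free to specialize to one opponent strategy, which is the stronger prior the text describes.
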

Multitasking with DRON. The previous two models predict Q-values only; thus the opponent representation is learned indirectly through feedback from the Q-value. Extra information about the opponent can provide direct supervision for N_o. Many games reveal additional information besides the final reward at the end of a game. At the very least the agent has observed the actions taken by the opponents in past states; sometimes also their private information, such as the hidden cards in poker. More high-level information includes abstracted plans or strategies. Such information reflects characteristics of opponents and can aid policy learning. Unlike previous work that learns a separate model to predict this information about the opponent (Davidson, 1999; Ganzfried & Sandholm, 2011; Schadd et al., 2007), we apply multitask learning and use the observation as extra supervision to learn a shared opponent representation h^o. Figure 2 shows the architecture of the multitask DRON, where the supervision is y^o. The advantage of multitasking over explicit opponent modeling is that it uses high-level knowledge of the game and the opponent, while remaining robust to insufficient opponent data and modeling error from Q-values. In Section 4, we evaluate multitasking DRON with two types of supervision signals: the future action and the overall strategy of the opponent.

4. Experiments

In this section, we evaluate our models on two tasks, the soccer game and quiz bowl. Both tasks have two players against each other, and the opponent presents varying behavior. We compare DRON models with DQN and analyze their response against different types of opponents.

All systems are trained under the same Q-learning framework. Unless stated otherwise, the experiments have the following configuration: the discount factor γ is 0.9, parameters are optimized by AdaGrad (Duchi et al., 2011) with a learning rate of , and the mini-batch size is 64. We use ε-greedy exploration during training, starting with an exploration rate of 0.3 that linearly decays to 0.1 within 500,000 steps. We train all models for fifty epochs. Cross entropy is used as the loss in multitask learning.
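
As a sketch of how the multitask objective and the exploration schedule above fit together (our own wiring, not the paper's exact training code; opponent_logits would come from a hypothetical extra linear head on h^o, and the weighting between the two terms is our assumption):

    import torch.nn.functional as F

    def linear_epsilon(step, start=0.3, end=0.1, decay_steps=500_000):
        # Exploration rate: 0.3 decayed linearly to 0.1 over the first 500,000 steps
        frac = min(step / decay_steps, 1.0)
        return start + frac * (end - start)

    def multitask_loss(q_loss, opponent_logits, y_o, weight=1.0):
        # DQN loss plus cross-entropy supervision on the opponent (its action or strategy type)
        return q_loss + weight * F.cross_entropy(opponent_logits, y_o)
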
4.1. Soccer

Figure 3. Left: Illustration of the soccer game. Right: Strategies of the hand-crafted rule-based agent (with the ball: defensive = avoid the opponent, offensive = advance to the opponent's goal; without the ball: defensive = defend the own goal, offensive = intercept the ball).

Our first testbed is a soccer variant following previous work on multi-player games (Littman, 1994; Collins, 2007; Uther & Veloso, 2003). The game is played on a 6×9 grid (Figure 3) by two players, A and B. The game starts with A and B in randomly chosen squares in the left and right halves (except the goals), and the ball goes to one of them. Players choose from five actions: move N, S, W, E, or stand still (Figure 3(1)). An action is invalid if it takes the player to a shaded square or outside of the border. If two players move to the same square, the player who possessed the ball before the move loses it to the opponent (Figure 3(2)), and the move does not take place. A player scores one point if they take the ball to the opponent's goal (Figure 3(3), (4)) and the game ends. If neither player gets a goal within one hundred steps, the game ends with a zero-zero tie.

(Although the game is played in a grid world, we do not represent the Q-function in tabular form as in previous work; therefore the approach can be generalized to more complex pixel-based settings.)

Implementation. We design a two-mode rule-based agent as the opponent (Figure 3, right). In the offensive mode, the agent always prioritizes attacking over defending. In 5000 games against a random agent, it wins 99.86% of the time and the average episode length is . In defensive mode, the agent only focuses on defending its own goal. As a result, it wins 31.80% of the games and ties 58.40% of them; the average episode length is . It is easy to find a strategy to defeat the opponent in either mode; however, that strategy does not work well for both modes, as we will show in Table 2. Therefore, the agent randomly chooses between the two modes in each game to create a varying strategy.
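
A sketch of the two-mode rule-based opponent described above, written from Figure 3 (right); the concrete move selection for each strategy is left abstract and the helper names are ours:

    import random

    # Strategy table from Figure 3 (right); the grid moves implementing each strategy
    # (path-finding toward a target square) are left abstract in this sketch.
    STRATEGY = {
        ("offensive", True):  "advance to the opponent's goal",
        ("offensive", False): "intercept the ball",
        ("defensive", True):  "avoid the opponent",
        ("defensive", False): "defend own goal",
    }

    def sample_mode():
        # A mode is drawn per game, so the opponent presents a varying strategy
        return random.choice(["offensive", "defensive"])

    def opponent_strategy(mode, has_ball):
        return STRATEGY[(mode, has_ball)]
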

Table 1. Rewards of DQN and DRON models on Soccer, for the basic models and the multitask variants (+action, +type). We report the maximum test reward ever achieved (Max R) and the average reward of the last 10 epochs (Mean R). Statistically significant (p < 0.05 in two-tailed pairwise t-tests) improvement over DQN and over all other models is marked in bold. DRON models achieve higher rewards in both measures.

Figure 4. Learning curves on Soccer over fifty epochs. DRON models are more stable than DQN.

The input state is a 1×15 vector representing coordinates of the agent, the opponent, the axis limits of the field, positions of the goal areas, and ball possession. We define a player's move by five cases: approaching the agent, avoiding the agent, approaching the agent's goal, approaching its own goal, and standing still. Opponent features include frequencies of observed opponent moves, its most recent move and action, and the frequency of losing the ball to the opponent.

The baseline DQN has two hidden layers, both with 50 hidden units. We call this model DQN-world: opponents are modeled as part of the world. The hidden layer of the opponent network in DRON also has 50 hidden units. For multitasking, we experiment with two supervision signals, the opponent action in the current state (+action) and the opponent mode (+type).
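
One plausible way to accumulate the hand-crafted opponent features listed above during play is sketched below; the bookkeeping details are our own assumptions, not the released feature extractor:

    import numpy as np

    MOVES = ["approach agent", "avoid agent", "approach agent's goal",
             "approach own goal", "stand still"]

    class OpponentFeatures:
        """Running statistics about the opponent: frequencies of its observed moves,
        its most recent move and action, and how often we lose the ball to it."""
        def __init__(self, n_actions=5):
            self.n_actions = n_actions
            self.move_counts = np.zeros(len(MOVES))
            self.last_move = np.zeros(len(MOVES))
            self.last_action = np.zeros(n_actions)
            self.steps = 0
            self.ball_losses = 0

        def update(self, move_idx, action_idx, lost_ball):
            self.steps += 1
            self.move_counts[move_idx] += 1
            self.last_move = np.eye(len(MOVES))[move_idx]          # one-hot most recent move
            self.last_action = np.eye(self.n_actions)[action_idx]  # one-hot most recent action
            self.ball_losses += int(lost_ball)

        def vector(self):
            move_freq = self.move_counts / max(self.steps, 1)
            loss_rate = np.array([self.ball_losses / max(self.steps, 1)])
            return np.concatenate([move_freq, self.last_move, self.last_action, loss_rate])
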
Results. In Table 1, we compare rewards of DRON models, their multitasking variations, and DQN-world. After each epoch, we evaluate the policy on 5000 randomly generated games (the test set) and compute the average reward. We report the mean test reward after the model stabilizes and the maximum test reward ever achieved. The DRON models outperform the DQN baseline. Our model also has much smaller variance (Figure 4).

Adding additional supervision signals improves DRON-concat but not DRON-MoE (multitask column). DRON-concat does not explicitly learn different strategies for different types of opponents; therefore a more discriminative opponent representation helps model the relation between opponent behavior and Q-values. However, for DRON-MoE, while a better opponent representation is still desirable, the supervision signal may not be aligned with the classification of the opponents learned from the Q-values.

Table 2. Average rewards of DQN and DRON models when playing against different types of opponents. Offensive and defensive agents are represented by O and D. "O only" and "D only" mean training against the O and D agent only. Upper bounds of rewards are in bold. DRON achieves rewards close to the upper bounds against both types of opponents.

To investigate how the learned policies adapt to different opponents, we test the agents against a defensive opponent and an offensive opponent separately. Furthermore, we train two DQN agents targeting each type of opponent respectively. Their performance is the best an agent can do when facing a single type of opponent (in our setting), as the strategies are learned to defeat this particular opponent. Table 2 shows the average rewards of each model and the DQN upper bounds (in bold). DQN-world is confused by the defensive behavior and significantly sacrifices its performance against the offensive opponent; DRON achieves a much better trade-off, retaining rewards close to both upper bounds against the varying opponent.

Finally, we examine how the number of experts in DRON-MoE affects the result. From Figure 5, we see no significant difference in varying the number of experts, and DRON-MoE consistently performs better than DQN across all K. Multitasking does not help here.

4.2. Quiz Bowl

Quiz bowl is a trivia game widely played in English-speaking countries between schools, with tournaments held most weekends. It is usually played between two teams. The questions are read to players, and they score points by buzzing in first (often before the question is finished) and answering the question correctly. One example question with buzzes is shown in Figure 7.

Figure 5. Effect of varying the number of experts (2-4) and multitasking on Soccer. The error bars show the 90% confidence interval. DRON-MoE consistently improves over DQN regardless of the number of mixture components. Adding extra supervision does not obviously improve the results.

A successful quiz bowl player needs two things: a content model to predict answers given (partial) questions and a buzzing model to decide when to buzz.

Content Model. We model the question-answering part as an incremental text-classification problem. Our content model is a recurrent neural network with gated recurrent units (GRU). It reads in the question sequentially and outputs a distribution over answers at each word, given past information encoded in the hidden states.

Buzzing Model. To test depth of knowledge, questions start with obscure information and reveal more and more obvious clues towards the end (e.g., Figure 7). Therefore, the buzzing model faces a speed-accuracy tradeoff: while buzzing later increases one's chance of answering correctly, it also increases the risk of losing the chance to answer. A safe strategy is to always buzz as soon as the content model is confident enough. A smarter strategy, however, is to adapt to different opponents: if the opponent often buzzes late, wait for more clues; otherwise, buzz more aggressively. To model interaction with other players, we take a reinforcement learning approach to learn a buzzing policy. The state includes the words revealed and predictions from the content model, and the actions are buzz and wait. Upon buzzing, the content model outputs the most likely answer at the current position. An episode terminates when one player buzzes and answers the question correctly. Correct answers are worth 10 points and wrong answers are -5.

Implementation. We collect question/answer pairs and log user buzzes from Protobowl, an online multi-player quiz bowl application (http://protobowl.com). Additionally, we include data from Boyd-Graber et al. (2012). Most buzzes are from strong tournament players. After removing answers with fewer than five questions and users who played fewer than twenty questions, we end up with 1045 answers, 37.7k questions, and 3610 users. We divide all questions into two non-overlapping sets: one for training the content model and one for training the buzzing policy. The two sets are further divided into train/dev and train/dev/test sets randomly.

Figure 6. Accuracy vs. the number of words revealed. (a) Real-time user performance. Each dot represents one user; dot size and color correspond to the number of questions the user answered. (b) Content model performance. Accuracy is measured based on predictions at each word. Accuracy improves as more words are revealed.

There are clearly two clusters of players (Figure 6(a)): aggressive players who buzz early with varying accuracies, and cautious players who buzz late but maintain higher accuracy. Our GRU content model (Figure 6(b)) is more accurate with more input words, a behavior similar to human players.
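
A minimal PyTorch sketch of such a GRU content model (a simplification under assumed embedding and hidden sizes, not the trained system): it reads the question word by word and emits a distribution over the 1045 answers at every position:

    import torch.nn as nn
    import torch.nn.functional as F

    class ContentModel(nn.Module):
        def __init__(self, vocab_size, n_answers=1045, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, n_answers)

        def forward(self, word_ids):
            # word_ids: (batch, seq_len); returns log-probabilities over answers at each word
            h, _ = self.gru(self.embed(word_ids))
            return F.log_softmax(self.out(h), dim=-1)
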
Our input state must represent information from the content model and the opponents. Information from the content model takes the form of a belief vector: a 1×1045 vector with the current estimate (as a log probability) of each possible guess being the correct answer given our input so far. We concatenate the belief vector from the previous time step to capture sudden shifts in certainty, which are often good opportunities to buzz. In addition, we include the number of words seen and whether a wrong buzz has happened. The opponent features include the number of questions the opponent has answered, the average buzz position, and the error rate.

The basic DQN has two hidden layers, both with 128 hidden units. The hidden layer for the opponent has ten hidden units. Similar to soccer, we experiment with two settings for multitasking: (a) predicting how the opponent buzzes and (b) predicting the opponent type. We approximate the ground truth for (a) by min(1, t/buzz position) and use the mean squared error as the loss function. The ground truth for (b) is based on dividing players into four groups according to their buzz positions, i.e., the percentage of the question revealed.
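
For concreteness, the buzzing-policy state and the multitask target for setting (a) could be assembled as in the sketch below; the exact ordering and scaling of the features are our assumptions, and belief / prev_belief are the content model outputs described above:

    import numpy as np

    def buzz_state(belief, prev_belief, words_seen, wrong_buzz,
                   opp_answered, opp_avg_buzz, opp_error_rate):
        # belief / prev_belief: 1045-dim log-probability vectors from the content model
        scalars = np.array([words_seen, float(wrong_buzz),
                            opp_answered, opp_avg_buzz, opp_error_rate])
        return np.concatenate([belief, prev_belief, scalars])

    def buzz_progress_target(t, opponent_buzz_position):
        # Ground truth for multitask setting (a): min(1, t / buzz position), trained with squared error
        return min(1.0, t / opponent_buzz_position)
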

Results. In addition to DQN-world, we also compare with DQN-self, a baseline without any interaction with opponents. DQN-self is ignorant of the opponents and plays the safe strategy: answer as soon as the content model is confident. During training, when the answer prediction is correct, it receives reward 10 for buzzing and -10 for waiting. When the answer prediction is incorrect, it receives reward -15 for buzzing and 15 for waiting. Since all rewards are immediate, we set γ to 0 for DQN-self (this is equivalent to cost-sensitive classification). With data of the opponents' responses, DRON and DQN-world use the game payoff (from the perspective of the computer) as the reward.

Table 3. Comparison between DRON and DQN models. The left column shows the average reward of each model on the test set, for the basic models and the multitask variants (+action, +type). The right column shows the performance of the basic models against opponents buzzing at different positions, grouped by the percentage of the question revealed: 0-25% (4.8k episodes), 25-50% (18k), 50-75% (0.7k), and 75-100% (1.3k). We report the average reward (R), the rate of buzzing incorrectly (rush), and the rate of missing the chance to buzz correctly (miss); higher is better for R and lower is better for rush and miss. In the left column, we indicate statistically significant results (p < 0.05 in two-tailed pairwise t-tests) with boldface for vertical comparison and with a marker for horizontal comparison.

First we compare the average rewards on the test set of our models, DRON-concat and DRON-MoE (with 3 experts), and the baseline models, DQN-self and DQN-world. From the first column in Table 3, our models achieve statistically significant improvements over the DQN baselines, and DRON-MoE outperforms DRON-concat. In addition, the DRON models have much less variance compared to DQN-world, as the learning curves in Figure 9 show.

To investigate the strategies learned by these models, we show their performance against different types of players (as defined at the end of "Implementation") in Table 3, right column. We compare three measures of performance: the average reward (R), the percentage of early and incorrect buzzes (rush), and the percentage of missed chances to buzz correctly before the opponent (miss). All models beat Type 2 players, mainly because they are the majority in our dataset. As expected, DQN-self learns a safe strategy that tends to buzz early. It performs the best against Type 1 players, who answer early. However, it has a very high rush rate against cautious players, resulting in much lower rewards against Type 3 and Type 4 players. Without opponent modeling, DQN-world is biased towards the majority player, thus having the same problem as DQN-self when playing against players who buzz late. Both DRON models exploit cautious players while holding their own against aggressive players.

Figure 7. Buzz positions of human players and agents on one science question whose answer is ribosome: "The antibiotic erythromycin works by disrupting this organelle, which contains E, P, and A sites on its large subunit. The parts of this organelle are assembled at nucleoli, and when bound to a membrane, these create the rough ER. Codons are translated at this organelle where the tRNA and mRNA meet. For 10 points, name this organelle that is the site of protein synthesis." The word where a player buzzes is displayed in a color unique to the player; a wrong buzz is shown in italics. The word where an agent buzzes is marked by a symbol unique to the agent; the color of the symbol corresponds to the player it is playing against. A gray symbol means that the buzz position of the agent does not depend on its opponent. DRON agents adjust their buzz positions according to the opponent's buzz position and correctness. Best viewed in color.
Furthermore, DRON-MoE matches DQN-self against Type 1 players; thus it discovers different buzzing strategies. Figure 7 shows an example question with buzz positions labeled. The DRON agents demonstrate dynamic behavior against different players; DRON-MoE almost always buzzes right before the opponent in this example. In addition, when the player buzzes wrong and the game continues, DRON-MoE learns to wait longer since the opponent is gone, while the other agents are still in a rush.

As with the soccer task, adding extra supervision does not yield better results over DRON-MoE (Table 3) but significantly improves DRON-concat. Figure 8 varies the number of experts in DRON-MoE (K) from two to four. Using a mixture model for the opponents consistently improves over the DQN baseline, and using three experts gives better performance on this task. For multitasking, adding the action supervision does not help at all. However, the more high-level type supervision yields competent results, especially with four experts, mostly because the number of experts matches the number of types.

Figure 8. Effect of varying the number of experts (2-4) and multitasking on quiz bowl. The error bars show the 90% confidence interval. DRON-MoE consistently improves over DQN regardless of the number of mixture components. Supervision of the opponent type is more helpful than the specific action taken.

Figure 9. Learning curves on Quizbowl over fifty epochs. DRON models are more stable than DQN.

5. Related Work and Discussion

Implicit vs. explicit opponent modeling. Opponent modeling has been studied extensively in games. Most existing approaches fall into the category of explicit modeling, where a model (e.g., decision trees, neural networks, Bayesian models) is built to directly predict parameters of the opponent, e.g., actions (Uther & Veloso, 2003; Ganzfried & Sandholm, 2011), private information (Billings et al., 1998b; Richards & Amir, 2007), or domain-specific strategies (Schadd et al., 2007; Southey et al., 2005). One difficulty is that the model may need a prohibitive number of examples before producing anything useful. Another is that, as the opponent behavior is modeled separately from the world, it is not always clear how to incorporate these predictions robustly into policy learning. The results on multitasking DRON also suggest that the improvement from explicit modeling is limited. However, it is better suited to games of incomplete information, where it is clear what information needs to be predicted to achieve higher reward.

Our work is closely related to implicit opponent modeling. Since the agent aims to maximize its own expected reward without having to identify the opponent's strategy, this approach does not have the difficulty of incorporating predictions of the opponent's parameters. Rubin & Watson (2011) and Bard et al. (2013) construct a portfolio of strategies offline, based on domain knowledge or past experience, for heads-up limit Texas hold'em; they then select strategies online using multi-arm bandit algorithms. Our approach does not have a clear online/offline distinction. We learn strategies and their selector in a joint, probabilistic way. However, the offline construction can be mimicked in our models by initializing expert networks with DQNs pre-trained against different opponents.

Neural network opponent models. Davidson (1999) applies neural networks to opponent modeling, where a simple multi-layer perceptron is trained as a classifier to predict opponent actions given game logs. Lockett et al. (2007) propose an architecture similar to DRON-concat that aims to identify the type of an opponent. However, instead of learning a hidden representation, they learn mixture weights over a pre-specified set of cardinal opponents; and they use the neural network as a standalone solver without the reinforcement learning setting, which may not be suitable for more complex problems. Foerster et al. (2016) use modern neural networks to learn a group of parameter-sharing agents that solve a coordination task, where each agent is controlled by a deep recurrent Q-Network (Hausknecht & Stone, 2015). Our setting is different in that we control only one agent and the policy space of the other agents is unknown. Opponent modeling with neural networks remains understudied, with ample room for improvement.

6. Conclusion and Future Work

Our general opponent modeling approach in the reinforcement learning setting incorporates (implicit) prediction of opponents' behavior into policy learning without domain knowledge. We use recent deep Q-learning advances to learn a representation of opponents that better maximizes available rewards.
The proposed network architectures are novel models that capture the interaction between opponent behavior and Q-values. Our model is also flexible enough to include supervision for parameters of the opponents, much as in explicit modeling.

These gains can further benefit from advances in deep learning. For example, Eigen et al. (2014) extend the Mixture-of-Experts network to a stacked model, the deep Mixture-of-Experts, which can be combined with hierarchical reinforcement learning to learn a hierarchy of opponent strategies in large, complex domains such as online strategy games. In addition, instead of hand-crafting opponent features, we can feed in raw opponent actions and use a recurrent neural network to learn the opponent representation. Another important direction is to design online algorithms that can adapt to fast-changing behavior and balance exploitation and exploration of opponents.

Acknowledgements

We thank Hua He, Xiujun Li, and Mohit Iyyer for helpful discussions about deep Q-learning and our model. We also thank the anonymous reviewers for their insightful comments. This work was supported by NSF grant IIS. Boyd-Graber is also partially supported by NSF grants CCF and NCSE. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Bard, Nolan, Johanson, Michael, Burch, Neil, and Bowling, Michael. Online implicit agent modelling. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2013.

Billings, Darse, Papp, Denis, Schaeffer, Jonathan, and Szafron, Duane. Opponent modeling in poker. In Association for the Advancement of Artificial Intelligence, 1998a.

Billings, Darse, Papp, Denis, Schaeffer, Jonathan, and Szafron, Duane. Opponent modeling in poker. In Association for the Advancement of Artificial Intelligence, 1998b.

Boyd-Graber, Jordan, Satinoff, Brianna, He, He, and Daumé III, Hal. Besting the quiz master: Crowdsourcing incremental classification games. In Empirical Methods in Natural Language Processing, 2012.

Collins, Brian. Combining opponent modeling and model-based reinforcement learning in a two-player competitive game. Master's thesis, School of Informatics, University of Edinburgh, 2007.

Davidson, Aaron. Using artificial neural networks to model opponents in Texas hold'em. CMPUT Research Project Review, 1999.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

Eigen, David, Ranzato, Marc'Aurelio, and Sutskever, Ilya. Learning factored representations in a deep mixture of experts. In ICLR Workshop, 2014.

Foerster, Jakob N., Assael, Yannis M., de Freitas, Nando, and Whiteson, Shimon. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv preprint, 2016.

Ganzfried, Sam and Sandholm, Tuomas. Game theory-based opponent modeling in large imperfect-information games. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2011.

Hausknecht, Matthew and Stone, Peter. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint, 2015.

Jacobs, Robert A., Jordan, Michael I., Nowlan, Steven J., and Hinton, Geoffrey E. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

Littman, Michael L. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1994.

Lockett, Alan J., Chen, Charles L., and Miikkulainen, Risto. Evolving explicit opponent models in game playing. In Proceedings of the Genetic and Evolutionary Computation Conference, 2007.

Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, and Kavukcuoglu, Koray. Recurrent models of visual attention. In Proceedings of Advances in Neural Information Processing Systems, 2014.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.

Richards, Mark and Amir, Eyal. Opponent modeling in Scrabble. In International Joint Conference on Artificial Intelligence, 2007.

Rubin, Jonathan and Watson, Ian. On combining decisions from multiple expert imitators for performance. In International Joint Conference on Artificial Intelligence, 2011.
Schadd, Frederik, Bakkes, Sander, and Spronck, Pieter. Opponent modeling in real-time strategy games. In Proceedings of GAME-ON 2007, 2007.

Southey, Finnegan, Bowling, Michael, Larson, Bryce, Piccione, Carmelo, Burch, Neil, Billings, Darse, and Rayner, Chris. Bayes' bluff: Opponent modelling in poker. In Proceedings of Uncertainty in Artificial Intelligence, 2005.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Tampuu, Ardi, Matiisen, Tambet, Kodelja, Dorian, Kuzovkin, Ilya, Korjus, Kristjan, Aru, Juhan, Aru, Jaan, and Vicente, Raul. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint, 2015.

Uther, William and Veloso, Manuela. Adversarial reinforcement learning. Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, 2003.

Watkins, Christopher J. C. H. and Dayan, Peter. Q-learning. Machine Learning, 8(3-4), 1992.

Zhang, Marvin, McCarthy, Zoe, Finn, Chelsea, Levine, Sergey, and Abbeel, Pieter. Learning deep neural network policies with continuous memory states. arXiv preprint, 2015.


More information

Deep Neural Network Language Models

Deep Neural Network Language Models Deep Neural Network Language Models Ebru Arısoy, Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran IBM T.J. Watson Research Center Yorktown Heights, NY, 10598, USA {earisoy, tsainath, bedk, bhuvana}@us.ibm.com

More information

TEAM NEWSLETTER. Welton Primar y School SENIOR LEADERSHIP TEAM. School Improvement

TEAM NEWSLETTER. Welton Primar y School SENIOR LEADERSHIP TEAM. School Improvement Welton Primar y School February 2016 SENIOR LEADERSHIP TEAM NEWSLETTER SENIOR LEADERSHIP TEAM Nikki Pidgeon Head Teacher Sarah Millar Lead for Behaviour, SEAL and PE Laura Leitch Specialist Leader in Education,

More information

Modeling function word errors in DNN-HMM based LVCSR systems

Modeling function word errors in DNN-HMM based LVCSR systems Modeling function word errors in DNN-HMM based LVCSR systems Melvin Jose Johnson Premkumar, Ankur Bapna and Sree Avinash Parchuri Department of Computer Science Department of Electrical Engineering Stanford

More information

College Pricing and Income Inequality

College Pricing and Income Inequality College Pricing and Income Inequality Zhifeng Cai U of Minnesota, Rutgers University, and FRB Minneapolis Jonathan Heathcote FRB Minneapolis NBER Income Distribution, July 20, 2017 The views expressed

More information

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Jana Kitzmann and Dirk Schiereck, Endowed Chair for Banking and Finance, EUROPEAN BUSINESS SCHOOL, International

More information

E mail: Phone: LIBRARY MBA MAIN OFFICE

E mail: Phone: LIBRARY MBA MAIN OFFICE MASTER OF BUSINESS ADMINISTRATION 1 Jennifer Brandow, MBA Director E mail: mba@wsc.edu Phone: 402.375.7587 MBA OFFICE Gardner Hall 106 1111 Main St. Wayne, NE 68787 ADMISSIONS 402.375.7234 admissions@wsc.edu

More information

(Sub)Gradient Descent

(Sub)Gradient Descent (Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

More information

Residual Stacking of RNNs for Neural Machine Translation

Residual Stacking of RNNs for Neural Machine Translation Residual Stacking of RNNs for Neural Machine Translation Raphael Shu The University of Tokyo shu@nlab.ci.i.u-tokyo.ac.jp Akiva Miura Nara Institute of Science and Technology miura.akiba.lr9@is.naist.jp

More information

Preliminary Chapter survey experiment an observational study that is not a survey

Preliminary Chapter survey experiment an observational study that is not a survey 1 Preliminary Chapter P.1 Getting data from Jamie and her friends is convenient, but it does not provide a good snapshot of the opinions held by all young people. In short, Jamie and her friends are not

More information

Knowledge Transfer in Deep Convolutional Neural Nets

Knowledge Transfer in Deep Convolutional Neural Nets Knowledge Transfer in Deep Convolutional Neural Nets Steven Gutstein, Olac Fuentes and Eric Freudenthal Computer Science Department University of Texas at El Paso El Paso, Texas, 79968, U.S.A. Abstract

More information

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Proceedings of 28 ISFA 28 International Symposium on Flexible Automation Atlanta, GA, USA June 23-26, 28 ISFA28U_12 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Amit Gil, Helman Stern, Yael Edan, and

More information

Session 2B From understanding perspectives to informing public policy the potential and challenges for Q findings to inform survey design

Session 2B From understanding perspectives to informing public policy the potential and challenges for Q findings to inform survey design Session 2B From understanding perspectives to informing public policy the potential and challenges for Q findings to inform survey design Paper #3 Five Q-to-survey approaches: did they work? Job van Exel

More information

A Game-based Assessment of Children s Choices to Seek Feedback and to Revise

A Game-based Assessment of Children s Choices to Seek Feedback and to Revise A Game-based Assessment of Children s Choices to Seek Feedback and to Revise Maria Cutumisu, Kristen P. Blair, Daniel L. Schwartz, Doris B. Chin Stanford Graduate School of Education Please address all

More information

The Strong Minimalist Thesis and Bounded Optimality

The Strong Minimalist Thesis and Bounded Optimality The Strong Minimalist Thesis and Bounded Optimality DRAFT-IN-PROGRESS; SEND COMMENTS TO RICKL@UMICH.EDU Richard L. Lewis Department of Psychology University of Michigan 27 March 2010 1 Purpose of this

More information

The Bruins I.C.E. School

The Bruins I.C.E. School The Bruins I.C.E. School Lesson 1: Retell and Sequence the Story Lesson 2: Bruins Name Jersey Lesson 3: Building Hockey Words (Letter Sound Relationships-Beginning Sounds) Lesson 4: Building Hockey Words

More information

Rule-based Expert Systems

Rule-based Expert Systems Rule-based Expert Systems What is knowledge? is a theoretical or practical understanding of a subject or a domain. is also the sim of what is currently known, and apparently knowledge is power. Those who

More information

Go fishing! Responsibility judgments when cooperation breaks down

Go fishing! Responsibility judgments when cooperation breaks down Go fishing! Responsibility judgments when cooperation breaks down Kelsey Allen (krallen@mit.edu), Julian Jara-Ettinger (jjara@mit.edu), Tobias Gerstenberg (tger@mit.edu), Max Kleiman-Weiner (maxkw@mit.edu)

More information

Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2

Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2 Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2 Ted Pedersen Department of Computer Science University of Minnesota Duluth, MN, 55812 USA tpederse@d.umn.edu

More information

Lahore University of Management Sciences. FINN 321 Econometrics Fall Semester 2017

Lahore University of Management Sciences. FINN 321 Econometrics Fall Semester 2017 Instructor Syed Zahid Ali Room No. 247 Economics Wing First Floor Office Hours Email szahid@lums.edu.pk Telephone Ext. 8074 Secretary/TA TA Office Hours Course URL (if any) Suraj.lums.edu.pk FINN 321 Econometrics

More information

Mathematics Success Level E

Mathematics Success Level E T403 [OBJECTIVE] The student will generate two patterns given two rules and identify the relationship between corresponding terms, generate ordered pairs, and graph the ordered pairs on a coordinate plane.

More information

JONATHAN H. WRIGHT Department of Economics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD (410)

JONATHAN H. WRIGHT Department of Economics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD (410) JONATHAN H. WRIGHT Department of Economics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21218. (410) 516 5728 wrightj@jhu.edu EDUCATION Harvard University 1993-1997. Ph.D., Economics (1997).

More information

School Size and the Quality of Teaching and Learning

School Size and the Quality of Teaching and Learning School Size and the Quality of Teaching and Learning An Analysis of Relationships between School Size and Assessments of Factors Related to the Quality of Teaching and Learning in Primary Schools Undertaken

More information

Using focal point learning to improve human machine tacit coordination

Using focal point learning to improve human machine tacit coordination DOI 10.1007/s10458-010-9126-5 Using focal point learning to improve human machine tacit coordination InonZuckerman SaritKraus Jeffrey S. Rosenschein The Author(s) 2010 Abstract We consider an automated

More information

NCEO Technical Report 27

NCEO Technical Report 27 Home About Publications Special Topics Presentations State Policies Accommodations Bibliography Teleconferences Tools Related Sites Interpreting Trends in the Performance of Special Education Students

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Algebra 2- Semester 2 Review

Algebra 2- Semester 2 Review Name Block Date Algebra 2- Semester 2 Review Non-Calculator 5.4 1. Consider the function f x 1 x 2. a) Describe the transformation of the graph of y 1 x. b) Identify the asymptotes. c) What is the domain

More information

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration INTERSPEECH 2013 Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu Microsoft Corporation, One

More information

An empirical study of learning speed in backpropagation

An empirical study of learning speed in backpropagation Carnegie Mellon University Research Showcase @ CMU Computer Science Department School of Computer Science 1988 An empirical study of learning speed in backpropagation networks Scott E. Fahlman Carnegie

More information

Level 1 Mathematics and Statistics, 2015

Level 1 Mathematics and Statistics, 2015 91037 910370 1SUPERVISOR S Level 1 Mathematics and Statistics, 2015 91037 Demonstrate understanding of chance and data 9.30 a.m. Monday 9 November 2015 Credits: Four Achievement Achievement with Merit

More information

Quantitative analysis with statistics (and ponies) (Some slides, pony-based examples from Blase Ur)

Quantitative analysis with statistics (and ponies) (Some slides, pony-based examples from Blase Ur) Quantitative analysis with statistics (and ponies) (Some slides, pony-based examples from Blase Ur) 1 Interviews, diary studies Start stats Thursday: Ethics/IRB Tuesday: More stats New homework is available

More information

STT 231 Test 1. Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point.

STT 231 Test 1. Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point. STT 231 Test 1 Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point. 1. A professor has kept records on grades that students have earned in his class. If he

More information

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February

More information