Teaching a Machine to Read Maps with Deep Reinforcement Learning

Gino Brunner, Oliver Richter, Yuyi Wang and Roger Wattenhofer (names in alphabetical order)
ETH Zurich
{brunnegi, richtero, yuwang, wattenhofer}@ethz.ch

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The ability to use a 2D map to navigate a complex 3D environment is quite remarkable, and even difficult for many humans. Localization and navigation is also an important problem in domains such as robotics, and has recently become a focus of the deep reinforcement learning community. In this paper we teach a reinforcement learning agent to read a map in order to find the shortest way out of a random maze it has never seen before. Our system combines several state-of-the-art methods such as A3C and incorporates novel elements such as a recurrent localization cell. Our agent learns to localize itself based on 3D first person images and an approximate orientation angle. The agent generalizes well to bigger mazes, showing that it learned useful localization and navigation capabilities.

1 Introduction

One of the main success factors of human evolution is our ability to craft and use complex tools. Not only did this ability give us a motivation for social interaction by teaching others how to use different tools, it also enhanced our thinking capabilities, since we had to understand ever more complex tools. Take a map as an example; a map helps us navigate places we have never seen before. However, we first need to learn how to read it, i.e., we need to associate the content of a two-dimensional map with our three-dimensional surroundings. With algorithms becoming increasingly capable of learning complex relations, a way to make machines intelligent is to teach them how to use already existing tools. In this paper, we teach a machine how to read a map with deep reinforcement learning.

The agent wakes up in a maze. The agent's view is an image: the maze rendered from the agent's perspective, like a dungeon in a first person video game. This rendered image is provided by the DeepMind Lab environment (Beattie et al. 2016). The agent can be controlled by a human, or as in our case, by a complex deep reinforcement learning architecture (our code can be found at https://github.com/OliverRichter/map-reader.git). The agent can move (forward, backward, left, right) and rotate (left, right), and its view image will change accordingly. In addition, the agent gets to see a map of the maze, also an image, as can be seen in Figure 1. One location on the map is marked with an X: the agent's target. The crux is that the agent does not know where on the map it currently is. Several locations on the map might correspond well with the current view. Thus the agent needs to move around to learn its position and then move to the target, as illustrated in Figures 6 and 8. We do equip the agent with an approximate orientation angle, i.e., the agent roughly knows the direction in which it is moving or looking. On the map, up is always north. During training the agent learns which approximate orientation corresponds to north.

A complex multi-stage task, such as navigating a maze with the help of a map, can be naturally decomposed into several subtasks: (i) The agent needs to observe its 3D environment and compare it to the map to determine its most likely position. (ii) The agent needs to understand the map, or in our case associate symbols on the map with rewards and thereby gain an understanding of what a wall is, what navigable space is, and what the target is. (iii) Finally, the agent needs to learn how to follow a plan in order to reach the target.
Our contribution is as follows: We present a novel modular reinforcement learning architecture that consists of a reactive agent and several intermediate subtask modules. Each of these modules is designed to solve a specific subtask. The modules themselves can contain neural networks or alternatively implement exact algorithms or heuristics. Our presented agent is capable of finding the target in random mazes roughly three times the size of the largest mazes it has seen during training. Further contributions include:
- The recurrent localization cell, which outputs a location probability distribution based on an estimated stream of visible local maps.
- A simple mapping module that creates a visible local 2D map from 3D RGB input. The mapping module is robust even if the agent's compass is inaccurate.

2 Related Work

Reinforcement learning in relation to AI has been studied since the 1950s (Minsky 1954). Important early work on reinforcement learning includes the temporal difference learning method by Sutton (1984; 1988), which is the basis for actor-critic algorithms (Barto, Sutton, and Anderson 1983) and Q-learning techniques (Watkins 1989; Watkins and Dayan 1992). First works using artificial neural networks for reinforcement learning include (Williams 1992) and (Gullapalli 1990). For an in-depth overview of reinforcement learning we refer the interested reader to (Kaelbling, Littman, and Moore 1996), (Sutton and Barto 1998) and (Szepesvári 2010). The current deep learning boom was started by, among other contributions, the backpropagation algorithm (Rumelhart et al. 1988) and advances in computing power and GPU frameworks. However, deep learning could not be applied effectively to reinforcement learning until recently. Mnih et al. (2015) introduced the Deep-Q-Network (DQN) that uses experience replay and target networks to stabilize the learning process. Since then, several extensions to the DQN architecture have been proposed, such as the Double Deep-Q-Network (DDQN) (van Hasselt, Guez, and Silver 2016) and the dueling network architecture (Wang et al. 2016). These networks rely on replay buffers to stabilize learning, such as prioritized experience replay (Schaul et al. 2015). The state-of-the-art A3C (Mnih et al. 2016) relies on asynchronous actor-learners to stabilize learning. In our system, we use A3C learning on a modified network architecture to train our reactive agent and the localization module in an on-policy manner. We also make use of (prioritized) replay buffers to train our agent off policy.

A major challenge in reinforcement learning is environments with delayed or sparse rewards. An agent that never gets a reward can never learn good behavior. Thus Jaderberg et al. (2016) and Mirowski et al. (2016) introduced auxiliary tasks that let the agent learn based on intermediate intrinsic pseudo-rewards, such as predicting the depth from a 3D RGB image, while simultaneously trying to solve the main task, e.g., finding the exit in a 3D maze. The policies learned by the auxiliary tasks are not directly used by the agent, but solely serve the purpose of helping the agent learn better representations, which improves its performance on the main task. The idea of auxiliary tasks is inspired by prior work on temporal abstractions, such as options (Sutton, Precup, and Singh 1999), whose focus was on learning temporal abstractions to improve high-level learning and planning. In our work we introduce a modularized architecture that incorporates intermediate subtasks, such as localization, local map estimation and global map interpretation. In contrast to (Jaderberg et al. 2016), our reactive agent directly uses the outputs of these modules to solve the main task. Note that we use an auxiliary task inside our localization module to improve the local map estimation.

Kulkarni et al. (2016) introduced a hierarchical version of the DQN to tackle the challenge of delayed and sparse rewards. Their system operates at different temporal scales and allows the definition of goals using entity relations. The policy is learned in such a way as to reach these goals. We use a similar approach to make our agent follow a plan, such as "go north".

Mapping and localization has been extensively studied in the domain of robotics (Thrun, Burgard, and Fox 2005). A robot creates a map of the environment from sensory input (e.g., sonar or LIDAR) and then uses this map to plan a path through the environment. Subsequent works have combined these approaches with computer vision techniques (Fuentes-Pacheco, Ascencio, and Rendón-Mancha 2015) that use RGB(-D) images as input. Machine learning techniques have been used to solve mapping and planning separately, and later also tackled the joint mapping and planning problem (Elfes 1989).
Instead of separating mapping and planning phases, reinforcement learning methods aimed at directly learning good policies for robotic tasks, e.g., for learning human-like motor skills (Peters and Schaal 2008). Recent advances in deep reinforcement learning have spawned impressive work in the area of mapping and localization. The UNREAL agent (Jaderberg et al. 2016) uses auxiliary tasks and a replay buffer to learn how to navigate a 3D maze. Mirowski et al. (2016) came up with an agent that uses different auxiliary tasks in an online manner to understand whether navigation capabilities manifest as a byproduct of solving a reinforcement learning problem. Zhu et al. (2017) tackled the problems of generalization across tasks and data inefficiency. They use a realistic 3D environment with a physics engine to gather training data efficiently. Their model is capable of navigating to a visually specified target. In contrast to other approaches, they use a memoryless feed-forward model instead of recurrent models.

Gupta et al. (2017) simulated a robot that navigates through a real 3D environment. They focus on the architectural problem of learning mapping and planning in a joint manner, such that the two phases can profit from knowing each other's needs. Their agent is capable of creating an internal 2D representation of the local 3D environment, similar to our local visible map. In our work a global map is given, and the agent learns to interpret and read that map to reach a certain target location. Thus, our agent is capable of following complicated long range trajectories in an approximately shortest path manner. Furthermore, their system is trained in a fully supervised manner, whereas our agent is trained with reinforcement learning.

Bhatti et al. (2016) augment the standard DQN with semantic maps in the VizDoom (Kempka et al. 2016) environment. These semantic maps are constructed from 3D RGB-D input, and they employ techniques such as standard computer vision based object recognition and SLAM. They showed that this results in better learned policies. The task of their agent is to eliminate as many opponents as possible before dying. In contrast, our agent needs to escape from a complex maze. Furthermore, our environments are designed to provide as little semantic information as possible to make the task more difficult for the agent; our agent needs to construct its local visible map based purely on the shape of its surroundings.

3 Architecture

Many complex tasks can be divided into easier intermediate tasks which, when all solved individually, solve the complex task. We use this principle and apply it to neural network architecture design. In this section we first introduce our concept of modular intermediate tasks, and then discuss how we implement the modular tasks in our map reading architecture.

Figure 1: Architecture overview and interplay between the four modules. α̂ is the discretized angle, a_{t-1} is the last action taken, r_{t-1} is the last reward received, {p_i^{loc}}_{i=1}^N is the estimated location probability distribution over the N possible discrete locations, H^{loc} is the entropy of the estimated location probability distribution, STTD is the short term target direction suggested by the map interpretation network, V is the estimated state value and π is the policy output from which the next action a_t is sampled.

Figure 2: The visible local map network: The RGB pixel input is passed through two convolutional neural network (CNN) layers and a fully connected (FC) layer before being concatenated to the discretized angle α̂ and further processed by fully connected layers and a gating operation.

3.1 Modular Intermediate Tasks

An intermediate task module can be any information processing unit that takes as input either sensory input and/or the output of other modules. A module is defined and designed after the intermediate task it solves and can consist of trainable and hard coded parts. Since we are dealing with neural networks, the output and therefore the input of a module can be erroneous. Each module adjusts its trainable parameters to reduce its error independently of other modules. We achieve this by stopping error back-propagation at module boundaries. Note that this separation has some advantages and drawbacks: Each module's performance can be evaluated and debugged individually. Small intermediate subtask modules have short credit assignment paths, which reduces the problem of exploding and vanishing gradients during back-propagation. On the other hand, modules cannot adjust their output to fit the input needs of the next module; this has to be achieved through interface design, i.e., intermediate task specification.

Our neural network architecture consists of four modules, each dedicated to a specific subtask. We first give an overview of the interplay between the modules before describing them in detail in the following sections. The architecture overview is sketched in Figure 1. The first module is the visible local map network; it takes the raw visual input from the 3D environment and creates for each frame a two dimensional map excerpt of the currently visible surroundings. The second module, the recurrent localization cell, takes the stream of visible local map excerpts and integrates it into a local map estimation. This local map estimation is compared to the global map to get a probability distribution over the discretized possible locations. The third module is called the map interpretation network; it learns to interpret the global map and outputs a short term target direction for the estimated position. The last module is a reactive agent that learns to follow the estimated short term target direction to ultimately find the exit of the maze.

We allow our agent to have access to a discretized angle α̂ describing the direction it is facing, comparable to a robot having access to a compass. Furthermore, we do not limit ourselves to completely unsupervised learning and allow the agent to use a discretized version of its actual position during training. This could be implemented as a robot training the network with the help of a GPS signal: the robot could train as long as the GPS signal is sufficiently accurate and act on the trained network as soon as the GPS signal gets inaccurate or is lost entirely. We leave such a practical implementation of our algorithm to future work and focus here on the algorithmic structure itself.
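To make the gradient stopping at module boundaries concrete, here is a minimal PyTorch sketch (not the authors' code; module sizes and losses are placeholders) showing how two chained modules can each be trained only on their own loss by detaching the tensor that crosses the boundary:

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):
    """Stand-in for one intermediate-task module (e.g., the visible local map network)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

    def forward(self, x):
        return self.net(x)

# Two chained modules; each is trained only on its own module-specific loss.
module_a = TinyModule(d_in=16, d_out=8)   # e.g., local map estimation
module_b = TinyModule(d_in=8, d_out=4)    # e.g., localization head

x = torch.randn(32, 16)
out_a = module_a(x)

# Stop error back-propagation at the module boundary: module_b's loss
# will not push gradients into module_a's parameters.
out_b = module_b(out_a.detach())

loss_a = out_a.pow(2).mean()              # placeholder loss for module A
loss_b = out_b.pow(2).mean()              # placeholder loss for module B
(loss_a + loss_b).backward()              # gradients stay inside each module
```

Detaching the tensor that crosses the boundary keeps credit assignment paths short, as described above, at the cost that the upstream module never adapts its output to the downstream module's needs.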
We now describe each module architecture individually before we discuss their joint training in Section 3.6. If not specified otherwise, we use rectified linear unit activations after each layer.

3.2 Visible Local Map Network

The visible local map network preprocesses the raw visual RGB input from the environment through two convolutional neural network layers followed by a fully connected layer. We adapted this preprocessing architecture from (Jaderberg et al. 2016). The thereby generated features are concatenated to a 3-hot discretized encoding α̂ of the orientation angle α, i.e., we input the angle as an n-dimensional vector where each dimension represents a discrete state of the angle, with n = 30. We set the three vector components that represent the discrete angle values closest to the actual angle to one while the remaining components are set to zero, e.g., α̂ = [0 ... 0 1 1 1 0 ... 0]. We used a 3-hot instead of a 1-hot encoding to smooth the input. Note that this encoding has an average quantization error of 6 degrees.
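As an illustration, a small numpy sketch of such a 3-hot encoding (the bin layout over 360 degrees is our assumption; the paper only specifies n = 30 and three active components):

```python
import numpy as np

def three_hot_angle(alpha_deg: float, n: int = 30) -> np.ndarray:
    """Encode an orientation angle as a 3-hot vector over n discrete bins.

    The bin closest to the angle and its two neighbours (wrapping around
    360 degrees) are set to 1; all other components stay 0.
    """
    bin_width = 360.0 / n                      # 12 degrees per bin for n = 30
    center = int(round((alpha_deg % 360.0) / bin_width)) % n
    encoding = np.zeros(n)
    for offset in (-1, 0, 1):                  # centre bin plus both neighbours
        encoding[(center + offset) % n] = 1.0
    return encoding

print(three_hot_angle(47.0))                   # three consecutive ones around bin 4
```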

The discretized angle and preprocessed visual features are passed through a fully connected layer to get an intermediate representation from which two things are estimated:

1. A reconstruction of the map excerpt that corresponds to the current visual input.

2. The current field of view, which is used to gate the estimated map excerpt such that only estimates which lie in the line of sight make it into the visible local map. This gating is crucial to reduce noise in the visible local map output.

See Figure 2 for a sketch of the visible local map network architecture.

3.3 Recurrent Localization Cell

Moving around in the environment, the agent generates a stream of visible local map excerpts like the output in Figure 2 or the visible local map input Ṽ_t in Figure 3. The recurrent localization cell then builds an egocentric local map out of this stream and compares it to the actual map to estimate the current position. The agent has to predict its egomotion to shift the egocentric estimated local map accordingly. We refer to Figure 3 for a sketch of the architecture described hereafter.

Figure 3: Sketch of the information flow in the recurrent localization cell. The last egomotion estimation s_{t-1}, the discretized angle α̂, the last action a_{t-1} and reward r_{t-1} are passed through two fully connected (FC) layers and combined with a two dimensional convolution between the former local map estimation LM^{est}_{t-1} and the current visible local map input Ṽ_t to get the new egomotion estimation s_t. This egomotion estimation is used to shift the previously estimated local map LM^{est}_{t-1} and the previous map feedback local map LM^{mfb}_{t-1}. A weighted and clipped combination of these local map estimations, LM^{est+mfb}_t, is convolved with the full map to get the estimated location probability distribution {p_i^{loc}}_{i=1}^N. Recurrent connections are marked by empty arrows.

Let M be the current map, Ṽ_t the output of the visible local map network, α̂ the discretized 3-hot encoded orientation angle, a_{t-1} the 1-hot encoded last action taken, r_{t-1} the extrinsic reward received by taking action a_{t-1}, LM^{est}_t the estimated local map at time step t, LM^{mfb}_t the map feedback local map at time step t, LM^{est+mfb}_t the estimated local map with map feedback at time step t, s_t the estimated necessary shifting (or estimated egomotion) at time step t, and {p_i^{loc}}_{i=1}^N the discrete estimated location probability distribution. Then we can describe the functionality of the recurrent localization cell by the following equations:

s_t = softmax( f(s_{t-1}, α̂, a_{t-1}, r_{t-1}) + LM^{est}_{t-1} ∗ Ṽ_t )

LM^{est}_t = [ LM^{est}_{t-1} ∗ s_t + Ṽ_t ]_{-0.5}^{+0.5}

LM^{est+mfb}_t = [ LM^{est}_t + λ (LM^{mfb}_{t-1} ∗ s_t) ]_{-0.5}^{+0.5}

{p_i^{loc}}_{i=1}^N = softmax( M ∗ LM^{est+mfb}_t )

LM^{mfb}_t = Σ_{i=1}^N p_i^{loc} g(M, i)

Here, f(·) is a two layer feed forward neural network, ∗ denotes a two dimensional discrete convolution with stride one in both dimensions, [·]_{-0.5}^{+0.5} denotes a clipping to [-0.5, +0.5], λ is a trainable map feedback parameter and g(M, i) extracts from the map M the local map around location i.
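To illustrate the core localization step, the following toy numpy sketch correlates an estimated local map with the global map and normalizes the scores with a softmax; it ignores egomotion, map feedback and clipping, and the shapes and value convention are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def locate(global_map: np.ndarray, local_map: np.ndarray) -> np.ndarray:
    """Cross-correlate a small estimated local map with the global map and
    return a probability distribution over all valid placements.

    global_map: 2D array (e.g., 63x63 location cells; walls = -0.5, free = +0.5)
    local_map:  2D array (e.g., 9x9), same value convention, centred on the agent
    """
    H, W = global_map.shape
    h, w = local_map.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(global_map[i:i + h, j:j + w] * local_map)
    return softmax(scores).reshape(-1)     # location probability distribution (flattened)

# toy usage: a matching excerpt of a random map should peak at the true offset
rng = np.random.default_rng(0)
gmap = rng.choice([-0.5, 0.5], size=(63, 63))
lmap = gmap[20:29, 30:39]
p = locate(gmap, lmap)
print(p.argmax())                          # index of the most likely location
```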
3.4 Map Interpretation Network

The goal of the map interpretation network is to find rewarding locations on the map and construct a plan to get to these locations. We achieve this in three stages: First, the network passes the map through two convolutional layers followed by a rectified linear unit activation to create a 3-channel reward map. The channels are trained (as discussed in Section 3.6) to represent wall locations, navigable locations and target locations, respectively. This reward map is then area averaged, rectified and passed to a parameter free 2D shortest path planning module which outputs, for each of the discrete locations on the map, a distribution over {North, East, South, West}, i.e., a short term target direction (STTD), as well as a measure of distance to the nearest target location. This plan is then multiplied with the estimated location probability distribution to get the smooth STTD and target distance of the currently estimated location.

Note that planning for each possible location and querying the plan with the full location probability distribution helps to resolve the exploitation-exploration dilemma of the reactive agent: An uncertain location probability distribution close to the uniform distribution will result in an uncertain STTD distribution over {North, East, South, West}, thereby encouraging exploration. A location probability distribution over locations with similar STTD will accumulate these similarities and result in a clear STTD for the agent, even though the location might still be unclear (exploitation).
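The paper does not spell out the parameter free planner in code; one way such a 2D shortest path planning module could look is a multi-source breadth-first search from all target cells that assigns every free cell a distance and a short term target direction (the grid legend and function names below are our own):

```python
from collections import deque

# Grid legend for this sketch (our convention): 'W' wall, '.' free, 'T' target.
def plan_sttd(grid):
    """Return (direction, distance) per free cell via multi-source BFS from targets.

    direction is one of 'N', 'E', 'S', 'W': the first step of a shortest path
    towards the nearest target.
    """
    H, W = len(grid), len(grid[0])
    dist = {(r, c): 0 for r in range(H) for c in range(W) if grid[r][c] == 'T'}
    direction = {}
    queue = deque(dist)
    moves = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
    opposite = {'N': 'S', 'S': 'N', 'E': 'W', 'W': 'E'}
    while queue:
        r, c = queue.popleft()
        for d, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < H and 0 <= nc < W and grid[nr][nc] != 'W' and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                # From (nr, nc), stepping back towards (r, c) leads towards the target.
                direction[(nr, nc)] = opposite[d]
                queue.append((nr, nc))
    return direction, dist

maze = ["WWWWW",
        "W..TW",
        "W.W.W",
        "W...W",
        "WWWWW"]
sttd, dist = plan_sttd(maze)
print(sttd[(3, 1)], dist[(3, 1)])  # 'E' 4: head east; 4 steps to the target
```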

3.5 Reactive Agent and Intrinsic Reward

As mentioned, the reactive agent faces two partially contradicting goals: following the STTD (exploitation) and improving the localization by generating information rich visual input (exploration), e.g., not excessively staring at walls. The agent learns this trade off through reinforcement learning, i.e., by maximizing the expected sum of rewards. The rewards we provide here are extrinsic rewards from the environment (negative reward for running into walls, positive reward for finding the target) as well as intrinsic rewards linked to the short term goal inputs of the reactive agent. These short term goal inputs are the STTD distribution over {North, East, South, West} and the measure of distance to the nearest target location from the map interpretation network, as well as the normalized entropy H^{loc} of the discrete location probability distribution {p_i^{loc}}_{i=1}^N. H^{loc} represents a measure of location uncertainty which is linked to the need for exploration.

The intrinsic reward consists of two parts to encourage both exploration and exploitation. The exploration intrinsic reward I^{explore}_t in each timestep is the difference in location probability distribution entropy to the previous timestep:

I^{explore}_t = H^{loc}_{t-1} - H^{loc}_t

Note that this reward is positive if and only if the location probability distribution entropy decreases, i.e., when the agent gets more certain about its position. The exploitation intrinsic reward should be a measure of how well the egomotion of the agent aligns with the STTD. For this we calculate an approximate two dimensional egomotion vector e_t from the egomotion probability distribution estimation s_t. Similarly, we calculate an STTD vector d_{t-1} from the STTD distribution over {North, East, South, West} of the previous timestep. We calculate the exploitation intrinsic reward I^{exploit}_t as the dot product between the two vectors:

I^{exploit}_t = e_t^T d_{t-1}

Note that this reward is positive if and only if the angle difference between the two vectors is not bigger than 90 degrees, i.e., if the estimated egomotion was in the same direction as suggested by the STTD in the timestep before.

As input to the reactive agent we concatenate the discretized 3-hot angle α̂, the last extrinsic reward and the location probability distribution entropy H^{loc} to the STTD distribution and the estimated target distance. The agent itself is a simple feed-forward network consisting of two fully connected layers with rectified linear unit activation, followed by a fully connected layer for the policy and a fully connected layer for the estimated state value, respectively. The agent's next action is sampled from the softmax-distribution over the policy outputs.
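For concreteness, a small numpy sketch of the two intrinsic reward terms; as a simplification we represent both the egomotion estimate and the STTD as distributions over the four compass directions and map them to 2D unit vectors (this mapping is our assumption):

```python
import numpy as np

# Unit vectors for {North, East, South, West} in map coordinates (x east, y north).
DIRECTION_VECTORS = np.array([[0, 1], [1, 0], [0, -1], [-1, 0]], dtype=float)

def entropy(p, eps=1e-12):
    """Normalized entropy of a discrete distribution (1 = uniform, 0 = certain)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum() / np.log(len(p)))

def exploration_reward(p_loc_prev, p_loc):
    """Positive iff the location distribution became more certain."""
    return entropy(p_loc_prev) - entropy(p_loc)

def exploitation_reward(egomotion_probs, sttd_probs_prev):
    """Dot product between estimated egomotion and previous short term target direction."""
    e = DIRECTION_VECTORS.T @ np.asarray(egomotion_probs)      # expected 2D egomotion
    d = DIRECTION_VECTORS.T @ np.asarray(sttd_probs_prev)      # expected 2D STTD
    return float(e @ d)

# toy usage: the agent became more certain and moved mostly north while the plan said north
print(exploration_reward([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]))    # > 0
print(exploitation_reward([0.8, 0.1, 0.05, 0.05], [0.9, 0.05, 0.03, 0.02]))  # > 0
```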
3.6 Training Losses

To train our agent, we use a combination of on-policy losses, where the data is generated from rollouts in the environment, and off-policy losses, where we sample the data from a replay memory. More specifically, the total loss is the sum of the four module specific losses:

1. L^{vlm}, the off-policy visible local map loss,
2. L^{loc}, the on-policy localization loss,
3. L^{rm}, the off-policy reward map loss, and
4. L^{a}, the on-policy reactive agent's acting loss.

We train our agent as asynchronous advantage actor critic, or A3C, with additional losses, similar to DeepMind's UNREAL agent (Jaderberg et al. 2016): In each training iteration, every thread rolls out up to 20 steps in the environment and accumulates the localization loss L^{loc} and acting loss L^{a}. For each step, an experience frame is pushed to an experience history buffer of fixed length. Each experience frame contains all inputs the network requires as well as the current discretized true position. From this experience history, frames are sampled and inputs replayed through the network to calculate the visible local map loss L^{vlm} and the reward map loss L^{rm}. We now describe each loss in more detail.

The output Ṽ of the visible local map network is trained to match the visible excerpt of the map V, constructed from the discretized location and angle. In each training iteration, 20 experience frames are uniformly sampled from the experience history and the visible local map loss is calculated as the sum of L2 distances between visible local map outputs Ṽ_k and targets V_k:

L^{vlm} = Σ_{k∈S} || Ṽ_k - V_k ||_2

Here, S denotes the set of sampled frame indices.

Our localization loss L^{loc} is trained on the policy rollouts in the environment. For each step, we compare the estimated position to the actual position in two ways, which results in a cross entropy location loss L^{loc,xent} and a distance location loss L^{loc,d}. The cross entropy location loss is the cross entropy between the location probability distribution {p_i^{loc}}_{i=1}^N and a 1-hot encoding of the actual position. The distance loss L^{loc,d}_t is calculated at each step t as the L2 distance between the actual two dimensional cell position coordinates c^{pos}_t and the estimated centroid of all possible cells c_i weighted by their corresponding probability p_i^{loc}:

L^{loc,d}_t = || c^{pos}_t - Σ_{i=1}^N p_i^{loc} c_i ||_2

In addition to training the location estimation directly, we also assign an auxiliary local map loss L^{loc,lm} to help with the local map construction. We calculate the local map loss only once per training iteration, as the L2 distance between the last estimated local map LM^{est}_t and the actual local map at that point in time.
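A compact numpy sketch of the two on-policy localization loss terms defined above; the grid of cell coordinates and the toy inputs are illustrative:

```python
import numpy as np

def location_cross_entropy(p_loc, true_index, eps=1e-12):
    """Cross entropy between the predicted location distribution and the
    1-hot encoding of the true discretized position."""
    return float(-np.log(p_loc[true_index] + eps))

def location_distance_loss(p_loc, cell_coords, true_coord):
    """L2 distance between the true 2D cell position and the probability-weighted
    centroid of all location cells."""
    centroid = (np.asarray(p_loc)[:, None] * np.asarray(cell_coords)).sum(axis=0)
    return float(np.linalg.norm(np.asarray(true_coord) - centroid))

# toy usage on a 3x3 grid of location cells
coords = np.array([(r, c) for r in range(3) for c in range(3)], dtype=float)
p = np.full(9, 1.0 / 9.0)                        # completely uncertain estimate
print(location_cross_entropy(p, true_index=4))   # = log(9) ≈ 2.20
print(location_distance_loss(p, coords, true_coord=(0.0, 0.0)))  # distance to grid centre
```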

The goal of the reward map loss L^{rm} is to have the three channels of the reward map represent wall locations, free space locations and target locations, respectively. To do this, we leverage the setting that running into a wall gives a negative extrinsic reward, moving in open space gives no extrinsic reward and finding the target gives a positive extrinsic reward. Therefore the problem can be transformed into estimating an extrinsic reward. Each training iteration we sample 20 frames from the experience history. This sampling is independent from the visible local map loss sampling and is skewed to have in expectation equally many frames with positive, negative and zero extrinsic reward. For each frame, the frame's map is passed through the convolution layers of the map interpretation network to create the corresponding reward map, while the visual input and localization state saved in the frame are fed through the network to get the estimated location probability distribution. The reward map loss is the cross entropy prediction error of the reward at the estimated position.

Our reactive agent's acting loss is equivalent to the A3C learning described by Mnih et al. (2016). We also adapted an action repeat of 4 and a frame rate of 15 fps. The whole network is trained by RMSprop gradient descent with gradient back-propagation stopped at module boundaries, i.e., each module is only trained on its module specific loss.

4 Environment and Results

To evaluate our architecture we created a training and test set of mazes with the corresponding black and white maps in the DeepMind Lab environment. The mazes are quadratic grid mazes with each maze cell being either a wall, an open space, the target or the spawn position. The training set consists of 100 mazes of different sizes: 20 mazes each of the sizes 5x5, 7x7, 9x9, 11x11 and 13x13 maze cells. The test set consists of 900 mazes: 100 of each of the sizes 5x5, 7x7, 9x9, 11x11, 13x13, 15x15, 17x17, 19x19 and 21x21. Note that the outermost cells in the mazes are always walls, therefore the maximal navigable space of a 5x5 maze is 3x3 maze cells. Thus the navigable space for the biggest test mazes is roughly 3 times larger than for the biggest training mazes.

Figure 4: Training performance of 8 actor threads that start training on 5x5 mazes. The vertical black lines mark jumps to larger mazes of the thread shown in blue. (y-axis: moving average of steps needed; x-axis: total training steps, in millions.)

Figure 5: All the results of the (at most 100) successful tests for each maze size. Every single test is represented by an x. The line connects the arithmetic averages of each maze size (y-axis: steps needed; x-axis: maze width). The distance between origin and target grows linearly with maze size, as does the number of steps.
We consider each sampled maze an episode sar. The episode ends successfully if he agen manages o find he arge and he seps needed are sored. If he agen does no find he exi in 4500 seps, he episode ends as no successful. Afer an episode ends, a new episode is sared, i.e., a new maze is sampled. Noe ha in his seing he agen is always placed in a newly sampled maze and no in he same maze as in (Jaderberg e al. 2016) and (Mirowski e al. 2016). or each hread we calculae a moving average of seps needed o end he episodes. Once his moving average falls below a maze size specific hreshold, he hread is ransferred o rain on mazes of he nex bigger size. Once a hread s moving average of seps needed in he bigges raining mazes (13x13) falls below he hreshold, he hread is sopped and is raining is considered successful. Once all hreads reach his sage, he overall raining is considered successful and he agen is fully rained. We calculae he moving average over he las 50 episodes and use 60, 100, 140, 180 and 220 seps as hreshold for he maze sizes 5x5, 7x7, 9x9, 11x11 and 13x13, respecively. igure 4 shows he raining performance of 8 acor hreads. One can see ha he agens someimes overfi heir policies which resuls in emporarily decreased performance even hough he maze size did no increase. In he end however, all hreads reach good performance. The rained agen is esed on he 900 es se mazes, he

The trained agent is tested on the 900 test set mazes; the number of required steps per maze size is plotted in Figure 5. We stop a test after 4,500 steps, but even for the biggest test mazes (21x21) the agent found more than 90% of the targets within these 4,500 steps. See Table 1 for the percentage of exits found for all maze sizes. If the agent finds the exit, it does so in an almost shortest path manner, as can be seen in Figure 6. However, the agent needs a considerable number of steps to localize itself. To evaluate this localization overhead, we trained an agent consisting solely of the reactive agent module with access to the perfect location and an optimal short term target direction, and plotted its average performance on the test set in Figure 7. The figure shows a large gap between the full agent and the agent with access to the perfect position. This is due to turning actions, which the full agent performs to localize itself, i.e., it continuously needs to look around to know where it is. For the localization at the beginning of an episode, the agent also mainly relies on turning, as can be seen in the four example frames in Figure 8.

Maze size    Targets found
5x5          100%
7x7          100%
9x9          100%
11x11        99%
13x13        99%
15x15        98%
17x17        93%
19x19        93%
21x21        91%

Table 1: Percentage of targets found in the test mazes. Up to size 9x9 the agent always finds the target. More interestingly, the agent is able to find more than 90% of the targets in mazes that are bigger than any maze it has seen during training.

Figure 6: Example trajectories walked by the agent. Note that the agent walks close to the shortest path, and its continuous localization and planning lets the agent find the path to the target even after it took a wrong turn.

Figure 7: Comparison of our agent (blue lines) to an agent that has perfect position information and an optimal short term target direction input (red lines). The solid lines count all steps (turns and moves). The solid blue line is the same as the average line of Figure 5. The dashed lines do not count the steps in which the agent turns. The figure shows that the overhead is mostly because of turning, as our agent needs to look around to localize itself.

Figure 8: Four example frames to illustrate the typical behavior of the agent: The red line is the trace of its actual position, while the shades of blue represent its position estimate. The darker the blue, the more confident the agent is to be in this location. Frame 1 shows the agent's true starting position as a red dot, frame 2 shows several similar locations identified after a bit of turning, in frame 3 the agent starts to understand the true location, and in frame 4 it has moved.

Conclusion

We have presented a deep reinforcement learning agent that can localize itself on a 2D map based on observations of its 3D surroundings. The agent manages to find the exit in mazes with a high success rate, even in mazes substantially larger than it has ever seen during training. The agent often finds the shortest path, showing that it can continuously retain a good localization. The architecture of our system is built in a modular fashion. Each module deals with a subtask of the maze problem and is trained in isolation. This modularity allows for a structured architecture design, where a complex task is broken down into subtasks, and each subtask is then solved by a module. Modules consist of general architectures, e.g., MLPs, or more task-specific networks such as our recurrent localization cell. It is also possible to use deterministic algorithm modules, such as our shortest path planning module. Architecture design is aided by the possibility to easily replace each module by ground truth values, if available, to find sources of bad performance. Our agent is designed for a specific task.
We plan to make our modular architecture more general and apply it to other tasks, such as playing 3D games. Since modules can be swapped out and arranged differently, it would be interesting to equip an agent with many modules and let it learn which module to use in which situation.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments.

References

Barto, A. G.; Sutton, R. S.; and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Systems, Man, and Cybernetics 13(5):834-846.

Beattie, C.; Leibo, J. Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V.; Sadik, A.; Schrittwieser, J.; Anderson, K.; York, S.; Cant, M.; Cain, A.; Bolton, A.; Gaffney, S.; King, H.; Hassabis, D.; Legg, S.; and Petersen, S. 2016. DeepMind Lab. CoRR abs/1612.03801.

Bhatti, S.; Desmaison, A.; Miksik, O.; Nardelli, N.; Siddharth, N.; and Torr, P. H. S. 2016. Playing Doom with SLAM-augmented deep reinforcement learning. CoRR abs/1612.00380.

Elfes, A. 1989. Using occupancy grids for mobile robot perception and navigation. IEEE Computer 22(6):46-57.

Fuentes-Pacheco, J.; Ascencio, J. R.; and Rendón-Mancha, J. M. 2015. Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. 43(1):55-81.

Gullapalli, V. 1990. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks 3(6):671-692.

Gupta, S.; Davidson, J.; Levine, S.; Sukthankar, R.; and Malik, J. 2017. Cognitive mapping and planning for visual navigation. CoRR abs/1702.03920.

Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.

Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement learning: A survey. J. Artif. Intell. Res. 4:237-285.

Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; and Jaskowski, W. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, CIG 2016, Santorini, Greece, September 20-23, 2016, 1-8.

Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 3675-3683.

Minsky, M. L. 1954. Theory of neural-analog reinforcement systems and its application to the brain model problem. Princeton University.

Mirowski, P.; Pascanu, R.; Viola, F.; Soyer, H.; Ballard, A. J.; Banino, A.; Denil, M.; Goroshin, R.; Sifre, L.; Kavukcuoglu, K.; Kumaran, D.; and Hadsell, R. 2016. Learning to navigate in complex environments. CoRR abs/1611.03673.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529-533.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.

Peters, J., and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4):682-697.

Rumelhart, D. E.; Hinton, G. E.; Williams, R. J.; et al. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5(3):1.

Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2015. Prioritized experience replay. CoRR abs/1511.05952.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press.

Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 112(1-2):181-211.

Sutton, R. S. 1984. Temporal credit assignment in reinforcement learning.

Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.

Szepesvári, C. 2010. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.

Thrun, S.; Burgard, W.; and Fox, D. 2005. Probabilistic Robotics. MIT Press.

van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2094-2100.

Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; and de Freitas, N. 2016. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 1995-2003.

Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279-292.

Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Dissertation, King's College, Cambridge.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229-256.

Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J. J.; Gupta, A.; Fei-Fei, L.; and Farhadi, A. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, Singapore, May 29 - June 3, 2017, 3357-3364.