Reducing state space exploration in reinforcement learning problems by rapid identification of initial solutions and progressive improvement of them

Kary FRÄMLING
Department of Computer Science
Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT
FINLAND

Abstract: Most existing reinforcement learning methods require exhaustive state space exploration before converging towards a problem solution. Various generalization techniques have been used to reduce the need for exhaustive exploration, but for problems like maze route finding these techniques are not easily applicable. This paper presents an approach that reduces the need for state space exploration by rapidly identifying a "usable" solution. Concepts of short- and long term working memory then make it possible to continue exploring and to find better or optimal solutions.

Key-Words: Reinforcement learning, Trajectory sampling, Temporal difference, Working memory, Maze route finding

1 Introduction
The work presented in this paper started from the idea of developing an artificial neural net (ANN) model that solves problems and learns in ways similar to humans and animals. The model would also correspond to some very rough-level ideas and knowledge about how the brain operates, i.e. activations and connections between different areas of the brain and notions of short- and long term working memory.

Animal problem solving mainly seems to be based on trial and learning. The success or failure of a trial modifies behavior in the "right" direction after some number of trials, where "some number" ranges from one (e.g. learning how to turn on the radio with the "power" button) to infinity (e.g. learning how to grab things, which is a life-long adaptation procedure). Such behavior is currently studied mainly in the scientific research area called reinforcement learning (RL). RL methods have been successfully applied to many problems where more "conventional" methods are difficult to use due to factors like missing data about the environment, which forces the neural net to explore its environment and learn interactively. Exploring is a procedure where the agent (see for instance [2] for a discussion on the meaning of the term agent) has to take actions without a priori knowledge about how good or bad each action is, which may become known only much later when the goal is reached or the task has failed.

The RL problem used in this paper, maze route finding, is commonly used in psychological studies of animal learning and behavior [3]. Animals have to explore the maze and construct an internal model of it in order to reach the goal. The more maze runs the animal performs, the quicker it reaches the goal, since solutions become better memorized.

Maze route finding is not a very complicated problem to solve with many existing methods, as pointed out in section 2 of this paper, where the problem setup is explained. Therefore, the ANN solution presented in section 3 is not unique in being the first one able to solve the problem. It does, however, solve the problem in a new way that needs significantly less initial exploration than existing methods. Initial exploration runs are mainly shortened by the SLAP (Set Lowest Action Priority) reinforcement presented here. Identified solutions may be further reinforced by temporal difference (TD) methods [5]. Even though TD learning can be used as an element of the methods described in this paper, the methods presented here do not use classical notions of value functions for indicating how good a state or an action is for reaching the goal. Instead, notions of short term working memory and long term working memory are used for selecting appropriate actions at each state.
Short term working memory exists only during the problem solving, while previous problem-solving instances are stored in long term working memory. This memory organization gives new possibilities to balance between exploration of a new environment and/or new solutions on one hand, and exploitation of existing knowledge on the other hand.

2 Problem Formulation
Sutton and Barto use a maze like the one in Fig 1 in chapter 9 of their 1998 book [7]. The discussion that follows on the advantages and disadvantages of existing RL methods is principally based on this book concerning symbols, equations and method descriptions. An agent is positioned at the starting point inside the maze and has to find a route to the goal point. Each maze position corresponds to a state, where the agent selects one action out of four, i.e. going north, south, east or west, unless some of these are not possible. Each state is uniquely identified in a table-lookup manner. Initially, the agent has no prior knowledge about the maze, so it has no idea of what action should be taken at different states. Therefore it chooses an action randomly the first time it comes to a previously unvisited state, without knowing if it is a good one or not. If it was a bad one, the agent ends up in a dead end and has to walk back and try another direction. Coming back to a state already visited is also a bad sign, since the agent is walking around in circles.

Fig 1. Grid problem. a) The agent is shown in the start position, with the goal position in the upper right corner. b) One of the optimal solutions.

2.1 Symbolic methods
This maze route finding problem is easy to solve with a classical depth-first search through an inference tree representing all possible solutions [1]. The root of the tree is the starting state. The root has links to all next states that can be reached from it, which again have links to all their next states. The inference tree can be constructed recursively, where leaves of the tree are indicated by one of three cases:
1. Goal reached.
2. Dead end, i.e. a state with no next states.
3. Circuit detected, i.e. coming back to a previously encountered state.
Depth-first search can explore the tree until a solution is found, which can be memorized. If the goal is to find the optimal path, breadth-first search [1] or complete exploration of the whole tree can be used. Depth-first and breadth-first search become unfeasible when the search tree grows bigger due to a great number of states or a great number of links (actions). Heuristic approaches are often used to overcome these problems. They make it possible to concentrate only on "interesting" parts of the search tree by associating numerical values with each tree node or each link in the search tree, which indicate the "goodness" of that node or link. Heuristic values can be given directly, calculated, or obtained by learning. Reinforcement learning is one way of learning these values.

2.2 Reinforcement learning principles
In RL, heuristic estimates correspond to the notion of value functions, which are either state-values (i.e. the value of a state in the search tree) or action-values (i.e. the value of a link/action in the search tree). In the maze problem, value functions should be adjusted so that "good" actions, i.e. those leading to the goal as quickly as possible, are selected. One possible RL approach to the maze problem using state values would be to randomly select actions until the goal is reached, which forms one episode. During the episode, a reward of -1 is given for all state transitions except the one leading to the goal state. Then the value of a state s in S (the set of possible states) for a given episode can be defined formally as

V^π(s) = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s },    (1)

where V^π(s) is the state value that corresponds to the expected return when starting in s and following policy π thereafter [7]. A policy is the "rule" used for selecting actions, which can be random selection as assumed here or some other rule. So, for Markov Decision Processes (MDP), E_π{} denotes the expected value given that the agent follows policy π. The value of the terminal state, if any, is always zero. γ is a discounting factor that is less than or equal to one and determines to what degree future rewards affect the value of state s. When the number of episodes using the random policy approaches infinity, the average state value over all episodes converges to the actual state value for policy π.
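To make equation (1) concrete, the following minimal Python sketch estimates V^π(s) for the random policy by averaging sampled discounted returns. The reward scheme (-1 per transition, 0 on the transition that reaches the goal) follows the text above; the tiny corridor environment and the function names are illustrative assumptions, not part of the original model.

import random

def sample_episode(start, goal, neighbours, rng):
    """Follow the random policy until the goal is reached.
    Returns the rewards received after each transition:
    -1 for every move, except 0 for the move that reaches the goal."""
    state, rewards = start, []
    while state != goal:
        state = rng.choice(neighbours[state])
        rewards.append(0.0 if state == goal else -1.0)
    return rewards

def discounted_return(rewards, gamma):
    """Equation (1): sum over k of gamma^k * r_{t+k+1} for one episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def estimate_state_value(start, goal, neighbours, gamma=0.95, episodes=10000):
    """Average the sampled returns; converges to V^pi(start) for the random policy."""
    rng = random.Random(0)
    returns = [discounted_return(sample_episode(start, goal, neighbours, rng), gamma)
               for _ in range(episodes)]
    return sum(returns) / len(returns)

# Illustrative 4-state corridor: 0 - 1 - 2 - 3, goal at state 3.
corridor = {0: [1], 1: [0, 2], 2: [1, 3], 3: []}
print(estimate_state_value(start=0, goal=3, neighbours=corridor))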
Once state values have converged to correct values, states that are "closer" to the goal will have higher state values than states that are further away. If the policy is then changed to greedy exploitation, i.e. always taking the action that leads to the next state with the highest state value, the agent will automatically follow the optimal path. Unfortunately, random initial exploration is too time consuming to be useful in practical problems. The usual way to treat this case is to use ε-greedy exploration, where actions are selected greedily with probability (1 - ε), while random action selection is used with probability ε. A variant of ε-greedy exploration called softmax is sometimes used. Instead of selecting random actions uniformly, softmax selects actions leading to high state values with a higher probability than actions leading to low state values. When ε-greedy exploration and a -1 reward on every state transition are used for the grid world of Fig 1, all state values can be initialized to 0 or small random values. During exploration, states that have not been visited, or that have been visited less than others, will have higher state values than more frequently visited ones. Therefore ε-greedy exploration will by definition tend to exhaustively explore the whole state space, so initial episodes are very long. Convergence towards correct state values also requires a great number of episodes, so this approach is not usable for bigger problems.
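As a reference point for the exploration strategies discussed above, here is a minimal Python sketch of ε-greedy and softmax selection over estimated next-state values; the helper names, the temperature parameter and the example values are assumptions made for illustration.

import math
import random

def epsilon_greedy(next_state_values, epsilon, rng=random):
    """Pick the action leading to the highest-valued next state with
    probability (1 - epsilon), otherwise pick a random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(next_state_values))
    return max(range(len(next_state_values)), key=lambda a: next_state_values[a])

def softmax(next_state_values, temperature=1.0, rng=random):
    """Pick actions with probabilities proportional to exp(value / temperature),
    so actions leading to high state values are chosen more often."""
    prefs = [math.exp(v / temperature) for v in next_state_values]
    total = sum(prefs)
    return rng.choices(range(len(next_state_values)),
                       weights=[p / total for p in prefs])[0]

# Example: four candidate moves with estimated next-state values.
values = [-3.0, -1.0, -2.5, -1.2]
print(epsilon_greedy(values, epsilon=0.1), softmax(values, temperature=0.5))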

2.3 Monte-Carlo methods
Another possibility is to give a positive reward only at the end of an episode and zero reward for all intermediate transitions. Monte-Carlo Policy Evaluation [7] is one possibility for propagating the reward backwards through the state history of one episode. If a reward of +1 is given for reaching the goal, +1 is added to the "return values" of all states appearing in the episode. The state-value of a state is then the average return value over all episodes. Using ε-greedy exploration, state values eventually converge to the optimal policy, even though guaranteed convergence has not yet been formally proved according to [7].

For the maze problem used in this paper, generating episodes using a random policy requires an average of 1700 steps. Even with the TD methods studied in the next section, nearly 30 episodes are required before convergence towards a solution occurs, so for Monte-Carlo simulation the number of episodes needed is probably over 100. This would mean over 170 000 steps, which is very slow compared to all other methods treated later in this paper.

2.4 Temporal difference learning and TD(λ)
Monte-Carlo policy evaluation requires successfully completed episodes in order to learn. Therefore it quickly becomes too slow to be usable for most applications, since it might require a very large number of episodes before starting to select better actions than a random policy. Solving this problem is one of the main issues in so-called bootstrapping methods, like those based on temporal-difference (TD) learning [5]. Bootstrapping signifies that state- or action-value updates occur at every state transition, based not only on the actual reward but also on the difference between the current state value and the state value of the next state, according to:

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ],    (2)

which is known as the TD(0) method. The more advanced TD(λ) algorithm, of which TD(0) is an instance, is currently the most used bootstrapping method. TD(λ) uses the notion of an eligibility trace, which is what λ refers to. An eligibility trace signifies using the state/action history of each episode for propagating rewards backwards, just as in Monte-Carlo methods. The trace is created by associating an eligibility trace value with each state, which is usually increased by one (accumulating eligibility trace) every time the state is encountered during an episode. λ is a trace-decay parameter, which together with γ determines how fast the eligibility trace disappears for each state. For an accumulating eligibility trace, a state's eligibility trace value e_t(s) at time t is calculated by:

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t    (3)

Experience has shown that TD methods generally converge much faster to the optimal solution than Monte-Carlo methods do [7]. Using a model of the environment that is constructed during exploration can further accelerate convergence, as for Dyna agents [6]. In Dyna agents, the model memorizes which states are reached by what actions for each state/action pair encountered, so TD learning can be used for updating value functions both during interaction with the environment and without it. For the maze in Fig 1, Sutton and Barto have compared convergence times between direct reinforcement learning and Dyna agent learning [7]. Since both of these use random exploration on the first run, the first episode lasted about 1700 steps. Direct RL needed about 30 episodes before converging to the optimal path of 14 steps, while the best Dyna agent found it after about five episodes.
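A minimal sketch of the tabular TD(λ) update with accumulating eligibility traces, combining equations (2) and (3); the dictionary-based tables, parameter values and function name are illustrative assumptions rather than the formulation used in [7].

def td_lambda_step(V, E, s, s_next, reward, alpha, gamma, lam):
    """One transition of tabular TD(lambda) with accumulating traces.
    V: dict mapping state -> estimated value
    E: dict mapping state -> eligibility trace value"""
    # Equation (3): decay all traces, then add 1 to the current state's trace.
    for state in E:
        E[state] *= gamma * lam
    E[s] = E.get(s, 0.0) + 1.0

    # Equation (2), applied to every state in proportion to its trace
    # (lam = 0 recovers the plain TD(0) update).
    delta = reward + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    for state, trace in E.items():
        V[state] = V.get(state, 0.0) + alpha * delta * trace

# Example: one transition from state 'A' to state 'B' with reward -1.
V, E = {}, {}
td_lambda_step(V, E, s='A', s_next='B', reward=-1.0,
               alpha=0.5, gamma=0.95, lam=0.8)
print(V)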
However, both methods stay in eternal oscillation between 14 and 16 steps due to ε-greedy exploration, which regularly puts the agent off the optimal path.

The main shortcoming of these techniques is that they have a very long first exploration run, during which they pass through most states numerous times (54 states and 1700 steps => ~32 visits per state). For a simple maze like the one in Fig 1 this is not a big problem, but the length of the initial exploration run can be expected to grow exponentially as the number of states increases. These long exploration runs are due to the need of current methods to first explore the whole state space in order to converge towards an optimal solution.

Exhaustive exploration of the entire state space is impossible in most practical applications reported, such as backgammon [8], which has approximately 10^20 states. However, many states correspond to similar game situations, for which similar moves are appropriate. Therefore, learning results for one state can be applied to numerous other states too, if there is a way to identify similar states based on state descriptions instead of treating each state as a separate case. Many artificial neural networks are capable of such generalization, where actions learned for one state description are automatically applied to similar states even though these states have never been encountered before. Also, in a game like backgammon, most states have a very small or zero probability of occurring in a real game, so they do not need to be learned. In a maze problem, however, this approach does not seem to be applicable, since there are no general rules that could be learned from a general description of possible states. There are 16 different state types depending on which directions are possible, but there is no generally applicable rule for what action is appropriate for each type of state, so the problem of excessively long initial exploration remains. The solution proposed in this paper rapidly finds and memorizes at least one usable solution using minimal exploration effort and then explores towards the optimal solution.

3 Problem Solution
One of the initial ideas of the work presented here was to maintain a link with animal and human problem solving and the brain. This is why the reinforcement learning methods presented here use an artificial neural net (ANN) model, even though they could probably also be implemented in other ways. In this "brain inspired" ANN, neurons are either stimulus or action neurons, which seems more appropriate than speaking about inputs and outputs of the neural net. In the maze solving problem, each state corresponds to one stimulus neuron and each possible action to one action neuron.

When the ANN agent enters a completely unknown maze, it only has four action neurons, which correspond to the four possible actions, but it has no stimulus neurons. Stimulus neurons are created and connected to action neurons for every state encountered for the first time during an episode. When a new stimulus neuron is created, the weights of its connections to action neurons are initialized to small random values, for instance in the interval ]0,1]. Since these stimuli and their connection weights are created during one episode and exist only until the episode is finished, they are here called short term working memory. Once an episode is finished, both short term stimuli and connection weights can be copied as instances into long term working memory.

3.1 ANN architecture
The purpose of long term working memory is to be able to solve the same problem more efficiently in the future. When a stimulus is activated in short term working memory, we can suppose that the corresponding stimulus instances in long term working memory are also activated to a certain degree. Since long term working memory instances are connected to action neurons, they affect which action is selected. Actions are selected according to the winner-takes-all principle, where the action neuron with the biggest activation value wins. Activation values of action neurons are calculated according to:

a_n = Σ_{i=1}^{stim} stw_{i,n} · s_i + α · Σ_{j=1}^{ltm} Σ_{i=1}^{stim} s_i · ltw_{j,i,n},    (4)

where a_n is the activation value of action neuron n, stw_{i,n} is the connection weight from stimulus neuron i to action neuron n, s_i is the current activation value of stimulus neuron i, ltw_{j,i,n} is the connection weight from long term working memory instance j and stimulus i to action neuron n, stim is the number of stimulus neurons and ltm is the number of instances in long term memory. α is a weighting parameter that adjusts to what degree stimulus activations in short term working memory cause activation of the corresponding stimuli in long term working memory. α can also be considered a parameter that adjusts the influence of past experiences on action selection. Since short term working memory connection weights are always initialized to random values when a state is encountered for the first time during an episode, adjusting the α parameter offers an alternative to ε-greedy exploration and softmax for balancing between exploration and exploitation.

Equation (4) can be rewritten in the form

a_n = Σ_{i=1}^{stim} s_i · stw_{i,n} + α · Σ_{i=1}^{stim} s_i · Σ_{j=1}^{ltm} ltw_{j,i,n},    (5)

which shows that long term working memory can be implemented as a vector of sums of stored connection weights. This makes it possible to implement the proposed model in a computation- and memory-efficient way. Only two connection weight matrices are needed, one for short term working memory weights and the other for long term working memory weights. The short term working memory matrix is of size (number of actions) × (number of states encountered during the current episode). The long term working memory matrix is of size (number of actions) × (number of states ever encountered). Straight matrix multiplication and addition is enough to perform the needed calculations.
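The matrix formulation of equation (5) can be sketched as follows; NumPy, the variable names and the binary stimulus activations (one active state at a time, as in the table-lookup maze representation) are assumptions made for illustration.

import numpy as np

def action_activations(s, stw, ltw_sum, alpha):
    """Equation (5): a = stw.s + alpha * ltw_sum.s, where ltw_sum already
    contains the weights summed over long term working memory instances j."""
    return stw @ s + alpha * (ltw_sum @ s)

def select_action(s, stw, ltw_sum, alpha, possible_actions):
    """Winner-takes-all among the actions that are possible in this state."""
    a = action_activations(s, stw, ltw_sum, alpha)
    return max(possible_actions, key=lambda n: a[n])

# Illustrative sizes: 4 actions (N, S, E, W), 6 states encountered so far.
n_actions, n_states = 4, 6
rng = np.random.default_rng(0)
stw = rng.uniform(0.0, 1.0, size=(n_actions, n_states))   # short term weights
ltw_sum = np.zeros((n_actions, n_states))                  # long term memory, empty at first

s = np.zeros(n_states)
s[2] = 1.0                      # the agent is currently in state number 2
print(select_action(s, stw, ltw_sum, alpha=1.0, possible_actions=[0, 2, 3]))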
3.2 Search for a "usable" initial solution
Exploration and exploitation happen simultaneously; which one is predominant depends on the value of α and on the number of instances in long term working memory. Long term working memory is initially empty for a completely unexplored maze. Therefore, action selection according to equation (4) is random the first time a new state is encountered, because the new stimulus neuron created in short term working memory has random initial connection weights.

If a state already encountered during the same episode is visited again, it is either due to coming back from a dead end or to going around in circles. In both cases it would be unwise to take the same action as the previous time in the same state. In order to know which action was taken the previous time, it is sufficient to evaluate equation (4) for the current state and see which action wins. The winning action is punished according to the new Set Lowest Action Priority principle, SLAP for short. The "slapped" action is punished by decreasing its weights enough to make it the least activated among possible actions the next time the agent is in the same state. This is done according to the formula:

stw_{i,n} ← stw_{i,n} − (a_n − a_min) · stw_{i,n} · s_i / a_n,    (6)

where a_n is the activation value of the slapped neuron and a_min is the new activation desired, obtained by taking the lowest action activation among the possible actions (possible directions) and subtracting a small fraction of it. Slapping is not only used for punishing actions that lead to dead ends and circuits; slapping is also applied to the direction the agent came from directly after entering a state. Otherwise the probability that the agent would go back in the direction it came from would be as high as that of taking a new direction. The goal of SLAP is therefore mainly to make exploration reach the goal as quickly as possible with minimal exploration effort. Sutton and Barto [7] call this principle trajectory sampling and show for a simple problem that it greatly reduces computation time compared to exhaustive search, especially for problems with a great number of states.
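A minimal sketch of the SLAP update of equation (6), again using NumPy and binary stimulus activations as illustrative assumptions; for a binary stimulus vector the update reduces the slapped action's activation exactly to a_min, so that action no longer wins in this state.

import numpy as np

def slap(stw, s, slapped_action, possible_actions, margin=0.1):
    """Equation (6): lower the slapped action's weights so that it becomes the
    least activated among the possible actions the next time this state is active.
    stw: (actions x states) short term working memory weight matrix
    s:   binary stimulus activation vector for the current state"""
    activations = stw @ s
    a_n = activations[slapped_action]
    # Desired new activation: slightly below the lowest possible alternative.
    a_min = min(activations[a] for a in possible_actions if a != slapped_action)
    a_min -= margin * abs(a_min)
    stw[slapped_action] -= (a_n - a_min) * stw[slapped_action] * s / a_n

# Example: four actions, three states; the agent re-enters state 1 and
# slaps the action it took there last time (the current winner).
rng = np.random.default_rng(1)
stw = rng.uniform(0.0, 1.0, size=(4, 3))
s = np.array([0.0, 1.0, 0.0])
winner = int(np.argmax(stw @ s))
slap(stw, s, winner, possible_actions=[0, 1, 2, 3])
print(np.argmax(stw @ s) != winner)   # the slapped action no longer wins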
For the sample run in Fig 2, the first episode took only 104 steps and still directly gives the rather good solution of 16 steps on the second episode (14 steps is optimal). This result can be considered excellent compared to the 1700 first-episode steps reported for TD and Dyna-Q in [7], not to mention that those methods need up to 30 episodes before reaching a 16-step solution. The last actions used for each state are implicitly stored in short term working memory weights by SLAP reinforcements. Therefore an exploitation run that uses these weights will directly follow the shortest path discovered, as in Fig 2b. This is also true if the agent starts from some other state encountered during exploration than the initial starting state, as in Fig 2c.

Fig 2. a) First episode, 104 steps. b) Second episode, 16 steps. c) Different starting point than the initial one, 13 steps.

Once an episode is finished, short term working memory weights can be copied as an instance into long term working memory. This can be done either directly or after an additional reinforcement has been applied. The latter is implemented by doing a "replay" of all stimulus activations (states) and rewarding the winning actions by increasing the connection weight between the stimulus and the winning action by a final reward value. Dispatching the final reward in this way is actually very similar to TD(λ) with γ = 1. The main difference is that the eligibility trace is not stored anywhere; it is reconstructed instead. Even though no formal proof of the similarity with TD(λ) is given here, it can still be assumed that TD(λ) methods could be used for propagating the final reward backwards as well. Therefore most existing experience and knowledge about TD methods could be applicable concerning convergence, calculation complexity, etc.

3.3 Search for the optimal solution
Since only a part of the state space is usually visited during the initial exploration, and only a part of the possible actions in different states are used, the initially identified solution has a high probability of being sub-optimal. This is also the case in Fig 2, where the optimal solution would be 14 steps. However, the optimal solution is very difficult to find, since it has a much smaller probability of occurring during random exploration than other solutions. At least two possibilities exist for finding the optimal solution:
1. Letting several neural net agents search for a solution and seeing which agent found the best one.
2. Using a low α and/or a low final reward at the end of episodes and letting the same agent do a great number of exploration runs. This could also be combined with ε-greedy and softmax exploration.

The first possibility might seem rather wasteful, but since initial exploration only requires an average of 115 steps, 15 agents can explore before reaching the 1700 steps used by the initial run of TD(λ) in [7]. Classical TD(λ) (without a model such as the one in Dyna-Q) apparently needs far over 10 000 exploration steps before finding a path requiring only 16 steps, but it then quite rapidly finds the optimal path of 14 steps.
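The first possibility can be sketched as a simple best-of-many loop under a step budget. Here run_first_episode is a hypothetical stand-in for one SLAP agent's initial exploration run, simulated with random numbers roughly in the ranges observed for the sample agents; it is not the actual agent implementation.

import random

def run_first_episode(seed):
    """Hypothetical stand-in for one SLAP agent's first exploration episode.
    Returns (exploration_steps, length_of_memorized_solution); a real agent
    would build short term working memory, apply SLAP reinforcements and store
    the result in long term working memory instead of drawing random numbers."""
    rng = random.Random(seed)
    steps = rng.randint(26, 336)
    solution = rng.choice([14, 16, 18, 20, 22, 24, 26])
    return steps, solution

def best_of_agents(step_budget=1700):
    """Possibility 1: let independent agents explore until the step budget of a
    single TD(lambda) first episode is spent, and keep the best solution found."""
    best, spent, agents = None, 0, 0
    while spent < step_budget:
        steps, solution = run_first_episode(agents)
        spent += steps
        best = solution if best is None else min(best, solution)
        agents += 1
    return best, spent, agents

print(best_of_agents())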
The experimental probability of an agent using a random policy finding the optimal path is 0.007, which means that it takes about 143 agents on average before the optimal path is found. Therefore, an average of 16 445 (143 × 115) steps is needed for SLAP agents to find the optimal solution. This number is approximately the same as for TD(λ), but the huge advantage of SLAP agents over TD(λ) is that they find a rather good solution already after one episode and about 115 exploration steps. Such a solution can be directly usable, so further exploration can be deferred until there is spare time for it. This also corresponds rather well to human behavior: first find a "usable" solution, and be curious about other solutions when there is time for it.

Table 1 shows the number of steps needed for the first and second episodes of ten sample SLAP agents. After each episode, the agents received a final reward of one before storing the solution as an instance in long term working memory. All agents had α = 1. For 30 sample agents, the longest initial episode took 336 steps and the shortest took 26 steps. The total number of initial-episode steps for the 30 agents was 3460, and the optimal solution of 14 steps was found by one of them. Even the worst second-episode solution, requiring 26 steps, could be usable in many applications.

Table 1. Exploration steps for the first episode versus the second episode for ten different agents.
Run #       1    2    3    4    5    6    7    8    9   10
Episode 1  44  174   66  148   26  110  136  218   40  168
Episode 2  22   18   14   24   22   18   18   16   20   16

The second possibility for finding the optimal path is to use the same agent all the time and let it gradually improve.

This possibility has so far only been studied for the case of using low α values and a final reward of one. However, only episodes that are shorter than any previous episode are stored as instances in long term working memory, which means that episodes tend to get shorter as the number of episodes increases. When using α = 0.01, the path followed became stable after an average of about 500 episodes and a total of about 15 000 steps. All agents that discovered the 14-step solution at least once (about one agent out of five) eventually converged to that solution, while the others converged to a solution of 16 steps.

Convergence could certainly be made much quicker in several ways. One way would be to use adaptive final reward values, where a reward counter would count the total amount of final rewards given and then give a bigger final reward than this amount for better solutions, thus slightly overriding all previous solutions. Despite its simplicity, this method has not been tested yet.

Adjusting the values of α, the final episode reward and the interval for the random initial weights of new stimuli in short term working memory determines the balance between random exploration and greedy exploitation. But if the solutions found during the first episodes are too far from the optimal solution, these parameters are not sufficient for converging to the optimal solution. Using ε-greedy exploration should solve this problem, since it would introduce stochastic behavior. Testing this is one of the first issues of future research.

Future research will also focus on comparing existing RL methods and those proposed in this paper on other mazes and on other kinds of problems. It would be especially interesting to extend the approach to problems requiring generalization over different states based on state descriptions. One such problem is the minefield navigation problem treated in [4], which is more general than well-known cases like backgammon [8] that require a great amount of domain knowledge. In the minefield navigation problem there are no discrete states, only continuous-valued state descriptions, where the number of stimuli is constant while the degree of activation of the stimuli changes. All calculations used in this paper are applicable to this kind of stimuli, but they will certainly need to be developed further in order to solve this kind of problem.

4 Conclusion
This paper presents how initial exploration runs in reinforcement learning can be significantly shortened. This is achieved by the SLAP reinforcement learning principle, which makes the agent avoid coming back to states already visited. SLAP also has the side effect of memorizing the shortest path found during an episode in the weights of the neural net model presented here, thus finding "usable" solutions with minimal exploration. Since "usable" solutions are found very quickly, it becomes feasible to let multiple agents explore simultaneously and retain the best ones. Letting these agents communicate and exchange their information would be an interesting topic for future research, since it could further reduce exploration time.

The notions of short- and long term memory presented here offer agents a way to maintain a balance between previously found solutions and the search for even better ones. This gives agents much more "human like" behavior than existing RL methods, i.e. first finding a usable solution and then being curious enough to improve it when there is time. Most current RL methods first exhaustively explore the whole state space several times and then converge towards an optimal solution, which is definitely not how a human individual finds a new way to navigate through a town, for instance.
The methods presented here are still at an early stage of research, so a lot of work remains before their position in the research area of reinforcement learning can be established. The results presented in this paper should still give a clear indication that the methods developed offer several big advantages compared to existing methods. If similar results are obtained for other problems and problem domains, reinforcement learning could probably be used in many new application areas where it is not yet feasible due to excessive exploration times.

References:
[1] Genesereth, M.R., Nilsson, N.J., Logical Foundations of Artificial Intelligence, Morgan Kaufmann Publishers, 1987.
[2] Jennings, N.R., Sycara, K., Wooldridge, M., A Roadmap of Agent Research and Development, Autonomous Agents and Multi-Agent Systems, Vol. 1, No. 1, 1998, pp. 3-38.
[3] Louie, K., Wilson, M.A., Temporally Structured Replay of Awake Hippocampal Ensemble Activity during Rapid Eye Movement Sleep, Neuron, Vol. 29, No. 1, 2001, pp. 145-156.
[4] Sun, R., Merrill, E., Peterson, T., From Implicit Skills to Explicit Knowledge: A Bottom-Up Model of Skill Learning, Cognitive Science, Vol. 25, No. 2, 2001.
[5] Sutton, R.S., Learning to Predict by the Methods of Temporal Differences, Machine Learning, Vol. 3, 1988, pp. 9-44.
[6] Sutton, R.S., Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, in Proceedings of the Seventh International Conference on Machine Learning, Morgan Kaufmann Publishers, 1990.
[7] Sutton, R.S., Barto, A.G., Reinforcement Learning: An Introduction, A Bradford Book, MIT Press, Cambridge, MA, 1998.
[8] Tesauro, G.J., Temporal Difference Learning and TD-Gammon, Communications of the ACM, Vol. 38, 1995, pp. 58-68.