USING REINFORCEMENT LEARNING TO INTRODUCE ARTIFICIAL INTELLIGENCE IN THE CS CURRICULUM

Scott M. Thede
Department of Computer Science, DePauw University
E-Mail: sthede@depauw.edu
Phone: (765) 658-4736

ABSTRACT:

There are many interesting topics in artificial intelligence that can be used to stimulate student interest at various levels of the computer science curriculum. They can also be used to illustrate basic concepts of computer science, such as arrays. One such topic is reinforcement learning: teaching a computer program how to play a game or traverse an environment using a system of rewards and punishments. Reinforcement learning projects can be devised for all levels of the computer science curriculum. This paper describes a few examples of reinforcement learning and how they might be used.

INTRODUCTION TO REINFORCEMENT LEARNING:

Reinforcement learning is a topic from artificial intelligence, specifically machine learning. It works by allowing a computer program (or agent) to learn correct choices by repeatedly making choices and observing the results. For example, a program would learn to play a good game of chess by playing many games, favoring moves that have led to wins in the past while avoiding moves that have led to losses. There are many ways to implement a reinforcement learning program, but most work by keeping a value (usually called a utility value or Q value) associated with each state and/or move option, increasing that value when the state or move leads to a win (or a reward) and decreasing it when it leads to a loss (or a penalty).

Reinforcement learning is typically taught in artificial intelligence courses, but it can be used in other courses in the computer science curriculum, either to stimulate student interest, to spotlight a specific topic in the course, or both. A simple version of reinforcement learning is appropriate at the end of an introductory programming course, since all that is required is a knowledge of two-dimensional arrays.

A SIMPLE REINFORCEMENT LEARNING ASSIGNMENT:

A good first assignment for reinforcement learning is the implementation of a simple computer game. We typically use the game of NIM as our example. For those unfamiliar with the game, it involves two players with a pile of sticks between them (the number of sticks can vary; we typically start with ten). Players alternate turns, each of which consists of taking some number of sticks from the pile (the number a player may take ranges from one to some maximum, which can also vary; we typically allow a maximum of three). The player who takes the last stick is the winner. This is a particularly good example for reinforcement learning because the game is simple and easily defined.

For game playing, the computer must calculate a utility value for each possible action that can be taken in each possible state. For a game like chess or checkers, this is an exceptionally large number of values. For NIM, with ten sticks in the starting pile and players allowed to pick up at most three sticks at a time, this amounts to only thirty values. For example, the program would calculate the value of picking up two sticks when there are six sticks in the pile.

We can teach the computer to play NIM (or, more accurately, create a program that teaches itself to play NIM) by creating an M x N array of numbers (typically integers, but any numerical type will work), where N is the number of sticks initially in the pile and M is the maximum number of sticks a player can take at one time. (This could become a more advanced assignment by allowing the user to choose the initial number of sticks and the maximum a player may take, thus requiring dynamically allocated two-dimensional arrays; a vector-based matrix class could also be used, if you are working in C++ with the standard template library.) When it is the computer's turn to move, it looks at the column that corresponds to the number of sticks left in the pile; whichever row in that column contains the largest value gives the number of sticks the computer should remove. (Of course, the rows are numbered beginning with zero, while the smallest number of sticks you can choose is one, so an appropriate mapping must be made; this sort of thing is frequently missed by students.)

The learning comes from the initial values of the array and the modifications the game program makes to it after a game is completed. Initially, the array should contain all zeroes. The program then needs to keep track of which moves it made in which states during the game. For example, the program should remember that when there were ten sticks it chose to take three, and when there were five sticks it chose two. How this is remembered is not important; we use an array of N elements, where the i-th element indicates the number of sticks chosen in state i, but a linked list of state/action pairs would also be appropriate.

When the game concludes, the computer player has either won or lost. If it has lost, it goes to the entries in the array indicated by its move list and decrements them; if it has won, it increments them. In this way, it raises the values of moves that led to a win and lowers the values of moves that led to a loss. It is fascinating to watch the computer play, seeing it try out different moves and eventually become a player that is impossible to beat. It takes a number of games for the computer to learn the game well enough to always win (with N = 10 and M = 3, it took probably 40-50 games before the computer won every time), so the assignment should also involve saving the current array to disk when the human quits playing and loading it again when the game begins.

The training may still take longer than you want, even with saving and loading, so an interesting twist is to let the computer play itself to train, with both computer players sharing the array; this allows fast training. It also allows the concept to be applied to larger games. Tic-tac-toe, for example, requires 177,147 values to be calculated (not all of which are legal position/move combinations, however). An array this size is large but not unmanageable, yet the number of games required for training is prohibitive if played by hand.

Other questions arise in the coding of the algorithm. For example, if two or more moves have the same value, how do you choose between them? It doesn't really matter: random choice is one solution, and simply picking the lower number of sticks will also work (this might even be a better choice than random selection, since it allows more moves per game, which allows faster training).
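To make the scheme concrete, here is a minimal sketch in Python (the paper does not prescribe a language, and names such as choose_move, play_one_game, and nim_qtable.json are illustrative, not part of the original assignment). It uses the increment/decrement update described above, breaks ties toward taking fewer sticks, saves the array between sessions, and trains against a random opponent as a stand-in for a human player or a second copy of the learner.

import json
import random

N_STICKS = 10   # initial number of sticks in the pile (N)
MAX_TAKE = 3    # maximum number of sticks a player may take per turn (M)

# q_table[take - 1][sticks - 1]: value of taking `take` sticks when `sticks` remain.
# Note the off-by-one mapping between array indices and stick counts.
q_table = [[0] * N_STICKS for _ in range(MAX_TAKE)]

def choose_move(sticks_left):
    # Pick the row with the largest value in the column for the current pile
    # size, breaking ties toward taking fewer sticks.
    legal = range(1, min(MAX_TAKE, sticks_left) + 1)
    return max(legal, key=lambda take: q_table[take - 1][sticks_left - 1])

def play_one_game():
    # The learner plays a random opponent; returns its move list and whether it won.
    sticks = N_STICKS
    moves = []                      # (sticks_before_move, sticks_taken) pairs
    learners_turn = True
    while True:
        if learners_turn:
            take = choose_move(sticks)
            moves.append((sticks, take))
        else:
            take = random.randint(1, min(MAX_TAKE, sticks))
        sticks -= take
        if sticks == 0:             # whoever took the last stick wins
            return moves, learners_turn
        learners_turn = not learners_turn

def update(moves, won):
    # Naive updating: every move from the game is incremented after a win
    # and decremented after a loss.
    for sticks, take in moves:
        q_table[take - 1][sticks - 1] += 1 if won else -1

def save(filename="nim_qtable.json"):
    with open(filename, "w") as f:
        json.dump(q_table, f)

def load(filename="nim_qtable.json"):
    global q_table
    with open(filename) as f:
        q_table = json.load(f)

for _ in range(500):                # training by repeated play
    game_moves, learner_won = play_one_game()
    update(game_moves, learner_won)
save()

Loading the saved table at the start of the next session, or replacing the random opponent with a second learner that shares q_table, are natural extensions that match the variations discussed above.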

The algorithm will not work if the parameters N and M are changed between games, particularly M. Changing N can be handled as long as the array can be resized, but changing M invalidates the stored data, so the program would have to begin learning all over again. We have used this assignment in our artificial intelligence course, but it is well suited to a beginning-level computer science course. It would be a particularly good assignment when introducing two-dimensional arrays; at our school, this would be done either at the end of CS 1 or the beginning of CS 2.

A MORE COMPLICATED ASSIGNMENT:

The assignment described in the previous section is a simple form of reinforcement learning. It uses an approach known as naïve updating, which updates each state based upon the final result of the state sequence. This is suitable for a simple assignment, but some might wish to go into greater detail; this works especially well for a more mathematically oriented class.

When using reinforcement learning for a game, we are estimating a function Q(a, i) that gives the value of taking action a when the game is in state i. At some point in the process, the program updates its estimates of the Q values using some rule. Once we have reached equilibrium (trained with enough games), the Q values will accurately reflect the actual value to the program of taking action a in state i. In naïve updating, the update occurs only at the end of the game and takes the form

    Q(a, i) ← Q(a, i) + reward received at the end of the sequence

Initial values should be Q(a, i) = 0. The reward received at the end is typically +1 for a win, -1 for a loss, and 0 for a draw (if one is possible), but other rewards can be used; backgammon, for example, can have payoffs from -192 to +192.
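In code, naïve updating is just one pass over the moves of a finished game once the final reward is known. A minimal sketch, with Q stored in a dictionary keyed by (action, state) pairs (the names here are illustrative):

from collections import defaultdict

Q = defaultdict(float)   # Q[(action, state)], initially 0 for every pair

def naive_update(episode, final_reward):
    # episode: the (action, state) pairs visited during one game, in order.
    # Every pair is nudged by the reward received at the end of the sequence
    # (+1 for a win, -1 for a loss, 0 for a draw, or a larger payoff).
    for action, state in episode:
        Q[(action, state)] += final_reward

# Example: a winning game of NIM where the learner took 3 sticks at 10,
# 1 stick at 6, and the last 2 sticks at 2.
naive_update([(3, 10), (1, 6), (2, 2)], +1)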

Naïve updating is a very simple approach; there are other update methods that converge faster to the actual values. For example, using an approach called temporal difference learning, or TD-learning, the updates occur after every move, using the following update rule:

    Q(a, i) ← Q(a, i) + α [ R(i) + max_a' Q(a', j) - Q(a, i) ]

In this update equation, R(i) is the reward received in state i (for most game playing this term is 0 and disappears, since most games give a reward only at completion), and j is the state that results from taking action a in state i. α is the learning rate parameter, which determines how fast the function converges to an equilibrium value. So this equation says that the new value of Q(a, i) is the old value plus α times the reward received in state i plus the difference between the value of the best move available from the next state and the current value. In this way, moves that lead to good positions have their values gradually increased, while moves that lead to bad positions have their values decreased.

For a more challenging assignment, then, we can ask students to use this equation to do the learning. The challenge is perhaps not in the programming, but in understanding the mathematical concepts behind it. In fact, we could have students implement both versions, play them against each other, and see which one learns faster. Experiments with various values of α are also possible.
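A minimal sketch of this update, applied after every move rather than at the end of the game, follows; Q is again a dictionary keyed by (action, state), and ALPHA and the legal_actions argument are illustrative assumptions rather than details from the paper.

from collections import defaultdict

Q = defaultdict(float)   # Q[(action, state)], initially 0
ALPHA = 0.1              # learning rate

def td_update(action, state, reward, next_state, legal_actions):
    # Q(a, i) <- Q(a, i) + alpha * [ R(i) + max_a' Q(a', j) - Q(a, i) ]
    # legal_actions lists the moves available in next_state; when the game is
    # over there are none, and the max term is taken to be 0.
    best_next = max((Q[(a, next_state)] for a in legal_actions), default=0.0)
    Q[(action, state)] += ALPHA * (reward + best_next - Q[(action, state)])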

One problem with this sort of approach is that it requires tabulating the values of all possible actions in all possible states. For any non-trivial game, this table will be exceptionally large and exceptionally slow to learn. We can address this issue by using a heuristic function to estimate the Q values. If we assume that the function Q(a, i) can be expressed in the form

    Q(a, i) = w_1 f_1(a, i) + w_2 f_2(a, i) + ... + w_n f_n(a, i)

then we can use reinforcement learning to learn the weights in this equation. Of course, we need to choose appropriate feature functions f to combine in the Q function; in tic-tac-toe, for example, one function might be the number of rows that contain two Os, and another might be the number of rows containing an X. Other choices are of course possible.

When learning the weights, we need to adjust our update rule a bit. For TD-learning, we can use the following update equation:

    w ← w + α [ R(i) + max_a' Q(a', j) - Q(a, i) ] ∇_w Q(a, i)

This performs a gradient descent in weight space. It might be a bit advanced mathematically, but it can be written and presented as an individual update rule for each weight if we do not want to get into the linear algebra.
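Because Q is linear in the weights, the gradient ∇_w Q(a, i) is simply the vector of feature values f_1(a, i), ..., f_n(a, i), so the per-weight rule reduces to one multiply-and-add. A minimal sketch under that assumption (the feature values are passed in as a list, and all names are illustrative):

ALPHA = 0.1   # learning rate

def q_value(weights, features):
    # Q(a, i) = w_1*f_1(a, i) + ... + w_n*f_n(a, i) for one (action, state) pair.
    return sum(w * f for w, f in zip(weights, features))

def td_weight_update(weights, features, reward, best_next_q):
    # w_k <- w_k + alpha * [ R(i) + max_a' Q(a', j) - Q(a, i) ] * f_k(a, i),
    # since the gradient of a linear Q with respect to w_k is just f_k(a, i).
    error = reward + best_next_q - q_value(weights, features)
    return [w + ALPHA * error * f for w, f in zip(weights, features)]

For tic-tac-toe, the features passed in might be the counts suggested above (rows with two Os, rows containing an X), evaluated on the position reached by the move being scored.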

LEARNING AN ENVIRONMENT:

Another classic problem to which reinforcement learning is applied is learning an environment. The typical example is an agent that wanders through a world laid out in a grid pattern, collecting rewards and penalties on its way to a goal. The idea is that the agent will learn the environment and eventually find the best path from start to finish, collecting the most rewards along the way. For this particular problem, we are typically not interested in learning the value of taking certain actions in certain states; we are interested simply in the value of being in a certain state, and we assume the agent can move from state to state easily. This can be modeled using TD-learning with an equation similar to the one outlined in the previous section:

    U(i) ← U(i) + α [ R(i) + U(j) - U(i) ]

Here, U(i) is the utility value of being in state i, and the update is applied immediately after the transition from state i to state j. In this case, R(i) is likely to be non-zero in many states. Some models make it slightly negative in all non-reward states, to impose a penalty for the agent wandering about in the environment; this depends on whether the goal of the agent is to find a path from start to finish or simply to map the environment.

This problem illustrates an issue that may not be obvious in the game-playing problems. A typical approach for the agent is to move to the state with the highest utility value U(i). This strategy seems to make sense, and it is probably the best choice for a game-playing agent: after all, if the strategy is winning, keep using it. When exploring an environment, however, it may not be optimal, since there may be a better path somewhere that has not been explored yet. So we should encourage our learning agent to explore the terrain more and rely on the utility values less. On the other hand, if we favor exploration over the utility values, the agent will learn the utility values very well but will never really make use of them, so pure exploration is not perfect either.

What we need is a combination of strategies, in which the agent is encouraged to explore early on and to exploit the calculated utilities later. The exact way to make this switch is not obvious, but it could involve making the learning rate α vary with the number of explorations. Another option is to alter the perceived utility of a state so that it seems larger if that state has not been explored, or has been explored only rarely. The options are wide open; this is a good assignment to give a class that enjoys working on open-ended problems, and it could make a good research-type problem as well.
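As one illustration of the second option, here is a minimal sketch that learns U(i) with the update above while adding an exploration bonus, which fades as a state is visited more often, to the utility used for choosing moves. The bonus size, the helper names, and the overall scheme are illustrative assumptions, not details from the paper.

from collections import defaultdict

ALPHA = 0.1    # learning rate
BONUS = 1.0    # extra appeal given to unexplored states

U = defaultdict(float)      # U[state]: learned utility of being in a state
visits = defaultdict(int)   # how many times each state has been entered

def td_utility_update(state, reward, next_state):
    # U(i) <- U(i) + alpha * [ R(i) + U(j) - U(i) ], applied right after the move.
    U[state] += ALPHA * (reward + U[next_state] - U[state])

def perceived_utility(state):
    # Unexplored states look better than their learned utility alone suggests,
    # so the agent is drawn toward them early in training.
    return U[state] + BONUS / (1 + visits[state])

def choose_next_state(neighbors):
    # Move to the reachable state with the highest perceived utility.
    chosen = max(neighbors, key=perceived_utility)
    visits[chosen] += 1
    return chosen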

SUMMARY:

In summary, reinforcement learning is not only a very interesting area of artificial intelligence, but also an interesting area of computer science in general. It can be used to teach various points about computer science, from two-dimensional arrays to exploration agents. There are fundamental mathematical concepts underlying it that can be used to tie the areas of mathematics and computer science more closely together in students' minds. And it is something that is fun for the students: they can write a program that actually learns!

BIBLIOGRAPHY:

Russell, Stuart and Norvig, Peter. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

Tesauro, Gerald. "Temporal Difference Learning and TD-Gammon." Communications of the ACM, March 1995, Volume 38, Number 3.

www.cut-the-knot.com. Contains the rules for NIM, as well as many other games and puzzles.