# FF+FPG: Guiding a Policy-Gradient Planner


## Fast Forward (FF) and FF-replan

### Fast Forward

A detailed description of the Fast Forward planner (FF) can be found in Hoffmann & Nebel (2001) and Hoffmann (2001). FF is a forward-chaining heuristic state-space planner for deterministic domains. Its heuristic is based on solving, with a Graphplan algorithm, a relaxation of the problem in which negative effects are removed, which provides a lower bound on each state's distance to the goal. This estimate guides a local search strategy, enforced hill-climbing (EHC), in which each step of the algorithm looks for a sequence of actions ending in a strictly better state (better according to the heuristic). Because there is no backtracking in this process, it can get trapped in dead ends. In this case, a complete best-first search (BFS) is performed.

### FF-replan

Variants of FF include Metric-FF, Conformant-FF and Contingent-FF. FF has also been used successfully in the probabilistic track of the International Planning Competition, in a version called FF-replan (Yoon, Fern, & Givan 2004; 2007). FF-replan plans in a determinized version of the domain and executes its plan as long as no unexpected transition is met. When one is met, FF is called to replan from the current state.

One choice is how to turn the original probabilistic domain into a deterministic one. Two possibilities have been tried:

- in IPC4 (FF-replan-4): for each probabilistic action, keep its most probable outcome as the deterministic outcome; a drawback is that the goal may no longer be reachable; and
- in IPC5 (FF-replan-5, not officially competing): for each probabilistic action, create one deterministic action per possible outcome; a drawback is that the number of actions grows quickly.

Both approaches are potentially interesting: the former should give more efficient plans if it is not necessary to rely on low-probability outcomes of actions (it is necessary in Zenotravel), but will otherwise get stuck in some situations.
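The two determinization schemes can be illustrated with a small sketch. This is not FF-replan's code: the classes `Outcome`, `ProbabilisticAction` and `DeterministicAction`, both function names, and the representation of effects as sets of strings are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    probability: float
    effects: frozenset  # hypothetical encoding: a set of fact strings


@dataclass(frozen=True)
class ProbabilisticAction:
    name: str
    preconditions: frozenset
    outcomes: tuple  # tuple of Outcome


@dataclass(frozen=True)
class DeterministicAction:
    name: str
    preconditions: frozenset
    effects: frozenset


def determinize_most_probable(action):
    """IPC4-style: keep only the most probable outcome of the action."""
    best = max(action.outcomes, key=lambda o: o.probability)
    return [DeterministicAction(action.name, action.preconditions, best.effects)]


def determinize_all_outcomes(action):
    """IPC5-style: create one deterministic action per possible outcome."""
    return [
        DeterministicAction(f"{action.name}#{i}", action.preconditions, o.effects)
        for i, o in enumerate(action.outcomes)
    ]


# Toy action: succeeds with probability 0.75, otherwise changes nothing.
pick_up = ProbabilisticAction(
    name="pick-up",
    preconditions=frozenset({"clear(b1)"}),
    outcomes=(
        Outcome(0.75, frozenset({"holding(b1)"})),
        Outcome(0.25, frozenset()),
    ),
)

ipc4_actions = determinize_most_probable(pick_up)
ipc5_actions = determinize_all_outcomes(pick_up)
```

The sketch makes the trade-off visible: the first scheme yields one action per original action but discards low-probability effects, while the second keeps every effect at the cost of one action per outcome.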
Simple experiments with the blocksworld show that FF-replan-5 can prefer to execute actions that, with a low probability, achieve the goal very fast. An example is the use of put-on-block ?b1 ?b2 when put-down ?b1 would be equivalent (it puts block ?b1 on the table) and safer from the point of view of reaching the goal with the highest probability. This illustrates the drawback of the FF approach of determinizing the domain. Good translations partly avoid this by removing action B in cases where actions A and B have the same effects, action A's preconditions are less restrictive than or equally restrictive as action B's preconditions, and action A is more probable than action B.

Note: at the time of this work, no details about FF-replan had been published; they have since appeared (Yoon, Fern, & Givan 2007).

## Off-Policy FPG

FPG relies on OLPOMDP, which assumes that the policy being learned is the one used to draw actions while learning. As we intend to also take FF's decisions into account while learning, OLPOMDP has to be turned into an off-policy algorithm by the use of importance sampling.

### Importance Sampling

Importance sampling (IS) is typically presented as a method for reducing the variance of the estimate of an expectation by carefully choosing a sampling distribution (Rubinstein 1981). For a random variable $X$ distributed according to $p$, $E_p[f(X)]$ is estimated by $\frac{1}{n}\sum_i f(x_i)$ with i.i.d. samples $x_i \sim p(x)$. But a lower-variance estimate can be obtained with a sampling distribution $q$ having higher density where $f(x)$ is larger. Drawing $x_i \sim q(x)$, the new estimate is $\frac{1}{n}\sum_i f(x_i) K(x_i)$, where $K(x_i) = \frac{p(x_i)}{q(x_i)}$ is the importance coefficient for sample $x_i$.

### IS for OLPOMDP: Theory

Unlike Shelton (2001), Meuleau, Peshkin, & Kim (2001) and Peshkin & Shelton (2002), we do not want to estimate $R(\theta)$ but its gradient.
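Rewriting the gradient estimate given by Baxter, Bartlett, & Weaver (2001), we get:

$$\hat{\nabla} R(\theta) = \sum_X r(X)\, \frac{\nabla p(X)}{p(X)}\, \frac{p(X)}{q(X)}\, q(X),$$

where the random variable $X$ is sampled according to distribution $q$ rather than its real distribution $p$. In our setting, a sample $X$ is a sequence of states $s_0, \ldots, s_t$ obtained while drawing a sequence of actions $a_0, \ldots, a_t$ from $Q[a \mid s; \theta]$ (the teacher's policy). This leads to:

$$\hat{\nabla} R(\theta) = \sum_{t=0}^{T} r(s_t)\, \frac{\nabla p(s_t)}{p(s_t)}\, \frac{p(s_t)}{q(s_t)},$$

where

$$p(s_t) = \prod_{t'=0}^{t-1} P[a_{t'} \mid s_{t'}; \theta]\, P[s_{t'+1} \mid s_{t'}, a_{t'}],$$

$$q(s_t) = \prod_{t'=0}^{t-1} Q[a_{t'} \mid s_{t'}; \theta]\, P[s_{t'+1} \mid s_{t'}, a_{t'}],$$

and

$$\frac{\nabla p(s_t)}{p(s_t)} = \sum_{t'=0}^{t-1} \frac{\nabla \left( P[a_{t'} \mid s_{t'}; \theta]\, P[s_{t'+1} \mid s_{t'}, a_{t'}] \right)}{P[a_{t'} \mid s_{t'}; \theta]\, P[s_{t'+1} \mid s_{t'}, a_{t'}]} = \sum_{t'=0}^{t-1} \frac{\nabla P[a_{t'} \mid s_{t'}; \theta]}{P[a_{t'} \mid s_{t'}; \theta]},$$

hence

$$\frac{\nabla p(s_t)}{p(s_t)}\, \frac{p(s_t)}{q(s_t)} = \left( \sum_{t'=0}^{t-1} \frac{\nabla P[a_{t'} \mid s_{t'}; \theta]}{P[a_{t'} \mid s_{t'}; \theta]} \right) \prod_{t'=0}^{t-1} \frac{P[a_{t'} \mid s_{t'}; \theta]}{Q[a_{t'} \mid s_{t'}; \theta]}.$$

The off-policy update of the eligibility trace is then:

$$e_{t+1} = e_t + K_{t+1} \nabla \log P[a_t \mid s_t; \theta],$$

where

$$K_{t+1} = \frac{\prod_{t'=0}^{t} P[a_{t'} \mid s_{t'}; \theta]}{\prod_{t'=0}^{t} Q[a_{t'} \mid s_{t'}; \theta]} = K_t\, \frac{P[a_t \mid s_t; \theta]}{Q[a_t \mid s_t; \theta]}.$$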
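The recursion for the importance coefficient and the eligibility trace can be sketched in a few lines. The sketch makes strong simplifying assumptions that are not in the text: the learned policy is a state-independent softmax over action preferences `theta`, the teacher's distribution `teacher_probs` is fixed, and all function names are hypothetical. It is meant only to show how $K$ and $e$ evolve together.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [x / z for x in exps]


def grad_log_softmax(prefs, action):
    """Gradient of log softmax(prefs)[action] with respect to prefs."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def off_policy_trace(theta, teacher_probs, actions):
    """Apply K_{t+1} = K_t * P/Q and e_{t+1} = e_t + K_{t+1} * grad log P.

    theta         -- action preferences of the learned (toy) policy P
    teacher_probs -- sampling distribution Q over actions
    actions       -- action indices that were drawn from Q
    """
    policy_probs = softmax(theta)
    K = 1.0
    e = [0.0] * len(theta)
    for a in actions:
        K *= policy_probs[a] / teacher_probs[a]
        g = grad_log_softmax(theta, a)
        e = [e_i + K * g_i for e_i, g_i in zip(e, g)]
    return K, e


# When Q equals P the coefficient stays at 1 and the trace is the on-policy one.
K_same, e_same = off_policy_trace([0.0, 0.0], [0.5, 0.5], [0, 1, 0])
# When Q over-samples action 0, its contribution is scaled down by P/Q.
K_skew, e_skew = off_policy_trace([0.0, 0.0], [0.8, 0.2], [0])
```

Because $K$ is a running product of ratios, it shrinks or grows geometrically along a trajectory whenever $P$ and $Q$ disagree, which is easy to observe by lengthening `actions` in the second call.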

### IS for OLPOMDP: Practice

It is known that instabilities can arise if the true distribution differs a lot from the one used for sampling, which is the case in our setting. Indeed, $K_t$ is the probability of a trajectory if generated by $P$ divided by the probability of the same trajectory if generated by $Q$. This typically converges to 0 when the horizon increases. Weighted importance sampling solves this by normalizing each IS sample by the average importance coefficient. This is normally performed in a batch setting, where the gradient is estimated from several runs before following its direction. With our online policy-gradient ascent we use an equivalent batch size of 1. The update becomes:

$$K_{t+1} = \frac{1}{t} \sum_{t'=1}^{t} k_{t'}, \quad \text{and} \quad k_t = \frac{P[a_t \mid s_t; \theta]}{Q[a_t \mid s_t; \theta]},$$

$$e_{t+1} = e_t + K_{t+1} \nabla \log P[a_t \mid s_t; \theta],$$

$$\theta_{t+1} = \theta_t + \frac{1}{K_{t+1}}\, r\, e_{t+1}.$$

## Learning from FF

We have turned FF into a library (LIBFF) that makes it possible to ask for FF's action in a given state. There are two versions:

- EHC: use enforced hill-climbing only, or
- EHC+BFS: do a best-first search if EHC fails.

Often, the current state appears in the last plan found, so that the corresponding action is already in memory. In addition, to make LIBFF more efficient, we cache frequently encountered state-action suggestions.

### Choice of the Sampling Distribution

Off-policy learning requires that each trajectory possible under the true distribution be possible under the sampling distribution. Because FF acts deterministically in any state, the sampling distribution cannot be based on FF's decisions alone. Two candidate sampling distributions are:

1. FF(ε)+uni(1−ε): use FF with probability ε, a uniform distribution with probability 1−ε; and
2. FF(ε)+FPG(1−ε): use FF with probability ε, FPG's distribution with probability 1−ε.

As long as ε < 1, the resulting sampling distribution has the same support as FPG. The first distribution favors a small constant degree of uniform exploration.
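The two candidate sampling distributions listed above can be written down directly. The following is a minimal sketch with hypothetical function names; it assumes actions are indexed from 0, that the teacher's suggestion is given as an action index, and that the learned policy assigns non-zero probability to every action.

```python
import math
import random


def mix_with_uniform(ff_action, n_actions, eps):
    """FF(eps)+uni(1-eps): FF's action with prob. eps, uniform otherwise."""
    q = [(1.0 - eps) / n_actions] * n_actions
    q[ff_action] += eps
    return q


def mix_with_policy(ff_action, policy_probs, eps):
    """FF(eps)+FPG(1-eps): FF's action with prob. eps, the policy otherwise."""
    q = [(1.0 - eps) * p for p in policy_probs]
    q[ff_action] += eps
    return q


def sample_with_coefficient(q, policy_probs, rng):
    """Draw an action from q and return it with its coefficient k = P/Q."""
    action = rng.choices(range(len(q)), weights=q)[0]
    return action, policy_probs[action] / q[action]


policy = [0.2, 0.8]          # toy FPG action distribution in some state
q_uni = mix_with_uniform(1, 4, 0.5)
q_fpg = mix_with_policy(0, policy, 0.5)
action, k = sample_with_coefficient(q_fpg, policy, random.Random(0))
```

With ε < 1 every action the policy can choose keeps a non-zero probability under the mixture, so the ratio P/Q in `sample_with_coefficient` is always defined.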
The second distribution mixes the FF-suggested action with FPG's distribution, and for high ε we expect FPG to learn to mimic FF's action choice closely. Apart from the expense of evaluating the teacher's suggestion, the additional computational cost of using importance sampling is negligible. An obvious idea is to reduce ε over time, so that FPG takes over completely. However, the rate of this reduction is highly domain dependent, so we chose a fixed ε for the majority of optimization, reverting to standard FPG towards the end of optimization.

## FF+FPG in Practice

Both FF and FPG accept the language considered in the competition (with minor exceptions), i.e., PDDL with extensions for probabilistic effects (Younes et al. 2005). Note that source code is available for FF [2], FPG [3] and libpg [4] (the policy-gradient library used by FPG). Excluding parameters specific to FF or FPG, one has to choose:

1. whether to translate the domain into an IPC4-type or IPC5-type deterministic domain for FF;
2. whether to use EHC or EHC+BFS;
3. ε ∈ (0, 1); and
4. how long to learn with and without a teacher.

## Experiments

The aim is to let FF help FPG. Thus the experiments focus on problems from the 5th International Planning Competition for which FF clearly outperformed FPG, in particular the Blocksworld and Pitchcatch domains. In the other 6 IPC domains FPG was close to, or better than, the version of FF we implemented. However, we begin by analyzing the behavior of FF+FPG.

### Simulation speed

The speed of the simulation+learning loop in FPG (without FF) essentially depends on the time taken for simple matrix computations. FF, on the other hand, enters a complete planning cycle for each new state, slowing down planning dramatically in order to help FPG reach a goal state. Caching FF's answers greatly reduced the slowdown due to FF.
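To picture why caching helps, here is a small sketch of a memoizing wrapper around a teacher. LIBFF's actual interface is not described in the text, so `plan_fn`, `CachedTeacher` and the toy planner `countdown_plan` are all hypothetical; the only idea taken from the text is that every state along a plan found by the planner has its suggested action stored for later reuse.

```python
class CachedTeacher:
    """Memoize a planner's suggestions for every state along its plans."""

    def __init__(self, plan_fn):
        # plan_fn(state) is assumed to return a list of (state, action)
        # pairs along the plan, starting at the queried state.
        self._plan_fn = plan_fn
        self._cache = {}
        self.planner_calls = 0

    def action(self, state):
        """Return the suggested action for state, planning only on a miss."""
        if state not in self._cache:
            self.planner_calls += 1
            for s, a in self._plan_fn(state):
                self._cache.setdefault(s, a)
        # None means the planner had no suggestion (e.g. a goal state).
        return self._cache.get(state)


def countdown_plan(n):
    """Toy planner: from integer state n, decrement until reaching 0."""
    return [(k, "dec") for k in range(n, 0, -1)]


teacher = CachedTeacher(countdown_plan)
first = teacher.action(5)    # planner is called, states 5..1 are cached
second = teacher.action(3)   # answered from the cache
calls_after_two = teacher.planner_calls
third = teacher.action(7)    # new state: the planner is called again
```

In this toy setting a longer plan fills the cache with more future states, which is the same effect the text attributes to the more expensive teacher.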
Thus, an interesting reference measure is the number of simulation steps performed in 10 minutes while not learning (FPG's default behavior being a random walk), as it helps evaluate how time-consuming the teacher is. Various settings are considered:

- the teacher can be EHC, EHC+BFS or none; and
- the type of deterministic domain is IPC4 (most probable effects) or IPC5 (all effects).

Table 1 gives results for the blocksworld [5] problems p05 and p10 (involving respectively 5 and 10 blocks), with different ε values. Having no teacher is here equivalent to no learning at all, as there are very few successes.

Considering the number of simulation steps, we observe that EHC is faster than EHC+BFS only for p05, with ε = 0.5. Indeed, although one run of EHC+BFS is more time-consuming, it usually caches more future states, which are only likely to be re-encountered if ε = 1. With p05, the score of the fastest teacher ( ) is close to the score of FPG alone ( ), which reflects the predominance of matrix computations compared to FF's reasoning. But this changes with p10, where the teacher becomes necessary to get FPG to the goal in a reasonable number of steps. Finally, we clearly observed that the simulation speeds up as the cache fills up.

Footnotes:

[2] joergh/ff.html daa/software.htm

[5] Errors appear in this blocksworld domain, but we use it as a reference from the competition.

Table 1: Number of simulation steps (×10³), [number of successes (×10³)] and (average reward) in 10 minutes in the Blocksworld. Only the bracketed success counts and the parenthesized average rewards are available for the teacher rows.

| Teacher | p05, ε = 0.5, IPC4 | p05, ε = 0.5, IPC5 | p05, ε = 1, IPC4 | p05, ε = 1, IPC5 | p10, ε = 1, IPC4 | p10, ε = 1, IPC5 |
|---|---|---|---|---|---|---|
| EHC | [1.9] (0.5) | [7.9] (2.4) | [5.2] (1.5) | [26.5] (6.5) | [0.05] (0.1) | [1.0] (1.7) |
| EHC+BFS | [7.6] (3.6) | [10.1] (6.7) | [199.2] (55.2) | [65.5] (30.7) | [10.3] (20.0) | [5.0] (9.0) |

No teacher: p05: [0.022] (4.96e-3); p10: … × 5 = 1375 steps, [0 × 5] (0 × 5).

Note: FPG with no teacher stopped after 2 minutes in p10, because of its lack of success. (Experiments performed on a P4-2.8GHz.)

### Success Frequency

Another important aspect of the choice of a teacher is how efficiently it achieves rewards. Two interesting measures are: 1) the number of successes, which shows how rewarding these 10 minutes have been; and 2) the average reward per time step (which is what FPG optimizes directly). As can be expected, both measures increase with ε (ε = 0 implies no teacher) and decrease with the size of the problem. With a larger problem, there is a cumulative effect of FF's reasoning being slower and the number of steps to the goal getting larger. Unsurprisingly, EHC+BFS is more efficient than EHC alone when wall-clock time is not important. Also unsurprisingly, in blocksworld, IPC4 determinizations are better than IPC5, because blocksworld is made probabilistic by giving the highest probability (0.75) to the normal deterministic effect.

### Learning Dynamics

We now look at the dynamics of FPG while learning, focusing on two difficult but still accessible problems: Blocksworld/p10 and Pitchcatch/p07. EHC+BFS was applied in both cases. Pitchcatch/p07 required an IPC5-type domain, while IPC4 was used for Blocksworld/p10. Figures 2 and 3 show the average number of successes per time step when using FPG alone or FPG+FF. But, as can be observed in Table 1, FPG's original random walk does not initially find the goal by chance.
To overcome this problem, the competition version of FPG implemented a simple progress estimator, counting how many facts from the goal are gained or lost in a transition, to modify the reward function, i.e., reward shaping. This leads us to also consider results with and without the progress estimator (the measured average reward not taking it into account).

In the experiments, performed on a P4-2.8GHz, the teacher is always used during the first 60 seconds (for a total learning time of 900 seconds, as in the competition). The settings include two learning step sizes: α and α_tea (a specific step size while teaching). If a progress estimator is used, each goal fact made true (respectively false) brings a reward of +100 (resp. −100). Note that we used our own simple implementation of FF-replan. Based on published results, the IPC FF-Replan (Yoon, Fern, & Givan 2004) performs slightly better.

The curves appearing in Figs. 2 and 3 are from a single run, with a view to exhibiting typical behaviors that have been observed repeatedly. No accurate comparison between the various settings should be made.

In Fig. 2, it appears that the progress estimator is not sufficient for Blocksworld/p10, so that no teacher-free approach starts learning. With the teacher used for 60 seconds, a first high-reward phase is observed before a sudden fall when teaching stops. Yet, this is followed by a progressive growth up to higher rewards than with just the teacher. Here, ε is high to ensure that the goal is met frequently. Combining the teacher and the progress estimator led to quickly saturating parameters θ, causing numerical problems.

In Pitchcatch/p07, vanilla FPG fails, but the progress estimator makes learning possible, as shown in Fig. 3. Using the teacher, or a combination of the progress estimator and the teacher, also works. The three approaches give similar results.
As with blocksworld, a decrease is observed when teaching ends, but the first phase is much lower than the optimum, essentially because ε is set to a relatively low 0.5.

Figure 2: Average reward per time step on Blocksworld/p10 (curves: FPG, FPG+prog, FF+FPG; axes: R against t). ε = 0.95, α = 5·10⁻⁴, α_tea = 10⁻⁵, β = 0.95.

Figure 3: Average reward per time step on Pitchcatch/p07 (curves: FPG, FPG+prog, FF+FPG, FF+FPG+prog; axes: R against t). ε = 0.5, α = 5·10⁻⁴, α_tea = 10⁻⁵, β = 0.85.

### Blocksworld Competition Results

We recreated the competition environment for the 6 hardest blocksworld problems, which the original IPC FPG planner struggled with despite the use of progress rewards. Optimization was limited to 900 seconds. The EHC+BFS teacher was used throughout the 900 seconds with ε = 0.9 and discount factor β = 1 (the eligibility trace is reset after reaching the goal). The progress reward was not used. P10 contains 10 blocks, and the remaining problems contain 15. As in the competition, evaluation was over 30 trials of the problem. FF was not used during evaluation.

Table 2 shows the results. The IPC results were taken from the 2006 competition results. The FF row shows our implementation of the FF-based replanner without FPG, using the faster IPC-4 determinization of domains, hence the discrepancy with the IPC5-FF row. The results demonstrate that FPG is at least learning to imitate FF well, and in particular, in the case of Blocksworld-P15, FPG bootstraps from FF to find a better policy. This is a very positive result considering how difficult these problems are.

Table 2: Number of successes out of 30 for the hardest probabilistic IPC5 blocksworld problems. Rows (planners): FF+FPG, FF, IPC5-FPG, IPC5-FF. Columns: p10, p11, p12, p13, p14, p15.

## Where FPG Fails: A XOR Problem

We present here some experiments on a toy problem whose optimal solution cannot be represented with the usual linear networks. In this XOR problem, the state is represented by two predicates A and B (randomly initialised), and the only two actions are α and β. Applying α if A ⊕ B leads to a success, as does applying β if ¬(A ⊕ B). Any other decision leads to a failure. Table 3 shows results for various planners, two function approximators being used within FPG: the usual linear network (noted 2L because it is a 2-layer perceptron) and a 3-layer perceptron 3L (with two hidden units).

Table 3: Success probability on the XOR problem

| FPG(2L) | FPG(3L) | FF | FF+FPG(2L) | FF+FPG(3L) |
|---|---|---|---|---|
| 74% | 81% | 100% | 44% | 100% |

The observed results can be interpreted as follows:

- FPG(2L) finds the best policy it can express: it picks one action in 3 cases out of 4, and the other in the last case; there is a misclassification in only a quarter of all situations;
- FPG(3L) usually falls into a local optimum achieving the same result as FPG(2L);
- FF always finds the best policy;
- with FF+FPG(2L), FPG tries with no success to learn the true optimal policy, as exhibited by FF; the result is a stochastic policy finding the appropriate action only half of the time;
- with FF+FPG(3L), FPG really learns FF's behavior, i.e. the optimal policy.

## Discussion

Because classical planners like FF return a plan quickly compared to probabilistic planners, using them as a heuristic input to probabilistic planners makes sense. Our experiments demonstrate that this is feasible in practice, and makes it possible for FPG to solve new problems efficiently, such as 15-block probabilistic blocksworld problems.

Choosing ε well for a large range of problems is difficult. Showing too much of a teacher's policy (ε close to 1) will lead to copying this policy (provided it does reach the goal). This is close to supervised learning, where one tries to map states to actions exactly as proposed by the teacher, which may be a local optimum. Avoiding local optima is made possible by more exploration (ε close to 0), but at the expense of losing the teacher's guidance.

Another difficulty is finding an appropriate teacher. As we use it, FF proposes only one action (no heuristic value for each action), making it a poor choice of sampling distribution unless it is mixed with another. Computation times can be expensive; however, this is more than offset by its ability to initially guide FPG to the goal in combinatorial domains. And the choice between IPC-4 and IPC-5 determinization of domains is not straightforward. There is room to improve FF, which may result in FF being an even more competitive stand-alone planner, as well as assisting stochastic local search based planners.
In particular, recently published details on the original implementation of FF-rePlan (Yoon, Fern, & Givan 2007) should help us develop a better replanner than the version we are using.

In many situations, the best teacher would be a human expert. But importance sampling cannot be used straightforwardly in this situation.

In a similar approach to ours, Mausam, Bertoli, & Weld (2007) use a non-deterministic planner to find potentially useful actions, whereas our approach exploits a heuristic borrowed from a classical planner. Another interesting comparison is with Fern, Yoon, & Givan (2003) and Xu, Fern, & Yoon (2007). Here, the relationship between heuristics and learning is inverted, as the heuristics are learned rather than used for learning. Given a fixed planning domain, this can be an efficient way to gain knowledge from some planning problems and reuse it in more difficult situations.

## Conclusion

FPG's benefits are that it learns a compact and factored representation of the final plan, represented as a set of parameters; and that the per-step learning algorithm complexity does not depend on the complexity of the problem. However, FPG suffers in problems where the goal is difficult to achieve via initial random exploration. We have shown how to use a non-optimal planner to help FPG find the goal, while still allowing FPG to learn a better policy than the original teacher, with initial success on IPC planning problems that FPG could not previously solve.

## Acknowledgments

We thank Sungwook Yoon for his help on FF-replan. This work has been supported in part via the DPOLP project at NICTA.

## References

- Aberdeen, D., and Buffet, O. 2007. Temporal probabilistic planning with policy-gradients. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling (ICAPS 07).
- Baxter, J.; Bartlett, P.; and Weaver, L. 2001. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research 15.
- Buffet, O., and Aberdeen, D. 2006. The factored policy gradient planner (IPC 06 version). In Proceedings of the Fifth International Planning Competition (IPC-5).
- Fern, A.; Yoon, S.; and Givan, R. 2003. Approximate policy iteration with a policy language bias. In Advances in Neural Information Processing Systems 15 (NIPS 03).
- Glynn, P., and Iglehart, D. 1989. Importance sampling for stochastic simulations. Management Science 35(11).
- Hoffmann, J., and Nebel, B. 2001. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research 14.
- Hoffmann, J. 2001. FF: The fast-forward planning system. AI Magazine 22(3).
- Little, I. 2006. Paragraph: A graphplan-based probabilistic planner. In Proceedings of the Fifth International Planning Competition (IPC-5).
- Mausam; Bertoli, P.; and Weld, D. S. 2007. A hybridized planner for stochastic domains. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07).
- Meuleau, N.; Peshkin, L.; and Kim, K. 2001. Exploration in gradient-based reinforcement learning. Technical Report, AI Memo, MIT AI Lab.
- Peshkin, L., and Shelton, C. 2002. Learning from scarce experience. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 02).
- Rubinstein, R. 1981. Simulation and the Monte Carlo Method. John Wiley & Sons, Inc., New York, NY, USA.
- Sanner, S., and Boutilier, C. 2006. Probabilistic planning via linear value-approximation of first-order MDPs. In Proceedings of the Fifth International Planning Competition (IPC-5).
- Shelton, C. 2001. Importance sampling for reinforcement learning with multiple objectives. Technical Report, AI Memo, MIT AI Lab.
- Teichteil-Königsbuch, F., and Fabiani, P. 2006. Symbolic stochastic focused dynamic programming with decision diagrams. In Proceedings of the Fifth International Planning Competition (IPC-5).
- Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3).
- Xu, Y.; Fern, A.; and Yoon, S. 2007. Discriminative learning of beam-search heuristics for planning. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07).
- Yoon, S.; Fern, A.; and Givan, R. 2004. FF-rePlan. sy/ffreplan.html.
- Yoon, S.; Fern, A.; and Givan, B. 2007. FF-Replan: a baseline for probabilistic planning. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling (ICAPS 07).
- Younes, H. L. S.; Littman, M. L.; Weissman, D.; and Asmuth, J. 2005. The first probabilistic track of the international planning competition. Journal of Artificial Intelligence Research 24.

### Discriminative Learning of Beam-Search Heuristics for Planning

Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University

### Lecture 10: Reinforcement Learning

Lecture 1: Reinforcement Learning Cognitive Systems II - Machine Learning SS 25 Part III: Learning Programs and Strategies Q Learning, Dynamic Programming Lecture 1: Reinforcement Learning p. Motivation

### Learning and Transferring Relational Instance-Based Policies

Learning and Transferring Relational Instance-Based Policies Rocío García-Durán, Fernando Fernández y Daniel Borrajo Universidad Carlos III de Madrid Avda de la Universidad 30, 28911-Leganés (Madrid),

### Lecture 1: Machine Learning Basics

1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

### Reinforcement Learning by Comparing Immediate Reward

Reinforcement Learning by Comparing Immediate Reward Punit Pandey DeepshikhaPandey Dr. Shishir Kumar Abstract This paper introduces an approach to Reinforcement Learning Algorithm by comparing their immediate

### Artificial Neural Networks written examination

1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

### Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

### The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

### ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM

Proceedings of 28 ISFA 28 International Symposium on Flexible Automation Atlanta, GA, USA June 23-26, 28 ISFA28U_12 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Amit Gil, Helman Stern, Yael Edan, and

### Axiom 2013 Team Description Paper

Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association

### Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

### On the Combined Behavior of Autonomous Resource Management Agents

On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science

### Generative models and adversarial training

Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

### Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

### TD(λ) and Q-Learning Based Ludo Players

TD(λ) and Q-Learning Based Ludo Players Majed Alhajry, Faisal Alvi, Member, IEEE and Moataz Ahmed Abstract Reinforcement learning is a popular machine learning technique whose inherent self-learning ability

### Georgetown University at TREC 2017 Dynamic Domain Track

Georgetown University at TREC 2017 Dynamic Domain Track Zhiwen Tang Georgetown University zt79@georgetown.edu Grace Hui Yang Georgetown University huiyang@cs.georgetown.edu Abstract TREC Dynamic Domain

### University of Groningen. Systemen, planning, netwerken Bosman, Aart

University of Groningen Systemen, planning, netwerken Bosman, Aart IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

### OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

### Python Machine Learning

Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

### AMULTIAGENT system [1] can be defined as a group of

156 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS, VOL. 38, NO. 2, MARCH 2008 A Comprehensive Survey of Multiagent Reinforcement Learning Lucian Buşoniu, Robert Babuška,

### Regret-based Reward Elicitation for Markov Decision Processes

444 REGAN & BOUTILIER UAI 2009 Regret-based Reward Elicitation for Markov Decision Processes Kevin Regan Department of Computer Science University of Toronto Toronto, ON, CANADA kmregan@cs.toronto.edu

### A Reinforcement Learning Variant for Control Scheduling

A Reinforcement Learning Variant for Control Scheduling Aloke Guha Honeywell Sensor and System Development Center 3660 Technology Drive Minneapolis MN 55417 Abstract We present an algorithm based on reinforcement

### Software Maintenance

1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

### Introduction to Simulation

Introduction to Simulation Spring 2010 Dr. Louis Luangkesorn University of Pittsburgh January 19, 2010 Dr. Louis Luangkesorn ( University of Pittsburgh ) Introduction to Simulation January 19, 2010 1 /

### Exploration. CS : Deep Reinforcement Learning Sergey Levine

Exploration CS 294-112: Deep Reinforcement Learning Sergey Levine Class Notes 1. Homework 4 due on Wednesday 2. Project proposal feedback sent Today s Lecture 1. What is exploration? Why is it a problem?

(Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include

### Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

### Evolutive Neural Net Fuzzy Filtering: Basic Description

Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

### ACTL5103 Stochastic Modelling For Actuaries. Course Outline Semester 2, 2014

UNSW Australia Business School School of Risk and Actuarial Studies ACTL5103 Stochastic Modelling For Actuaries Course Outline Semester 2, 2014 Part A: Course-Specific Information Please consult Part B

### Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

### Improving Fairness in Memory Scheduling

Improving Fairness in Memory Scheduling Using a Team of Learning Automata Aditya Kajwe and Madhu Mutyam Department of Computer Science & Engineering, Indian Institute of Tehcnology - Madras June 14, 2014

### An Introduction to Simio for Beginners

An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

### Improving Action Selection in MDP s via Knowledge Transfer

In Proc. 20th National Conference on Artificial Intelligence (AAAI-05), July 9 13, 2005, Pittsburgh, USA. Improving Action Selection in MDP s via Knowledge Transfer Alexander A. Sherstov and Peter Stone

### A simulated annealing and hill-climbing algorithm for the traveling tournament problem

European Journal of Operational Research xxx (2005) xxx xxx Discrete Optimization A simulated annealing and hill-climbing algorithm for the traveling tournament problem A. Lim a, B. Rodrigues b, *, X.

### SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

### Running Head: STUDENT CENTRIC INTEGRATED TECHNOLOGY

Running Head: STUDENT CENTRIC INTEGRATED TECHNOLOGY Instructional Design Based on Student Centric Integrated Technology Model Robert Newbury, MS December, 2008 Abstract The ADDIE

### Knowledge Transfer in Deep Convolutional Neural Nets

Knowledge Transfer in Deep Convolutional Neural Nets Steven Gutstein, Olac Fuentes and Eric Freudenthal Computer Science Department University of Texas at El Paso El Paso, Texas, 79968, U.S.A. Abstract

### CS 478 - Machine Learning

CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes Programming Issues: Program in any platform you want. Realize that you will be doing

### CSL465/603 - Machine Learning

CSL465/603 - Machine Learning Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Introduction CSL465/603 - Machine Learning 1 Administrative Trivia Course Structure 3-0-2 Lecture Timings Monday 9.55-10.45am

### Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

### Major Milestones, Team Activities, and Individual Deliverables

Major Milestones, Team Activities, and Individual Deliverables Milestone #1: Team Semester Proposal Your team should write a proposal that describes project objectives, existing relevant technology, engineering

### A Comparison of Annealing Techniques for Academic Course Scheduling

A Comparison of Annealing Techniques for Academic Course Scheduling M. A. Saleh Elmohamed 1, Paul Coddington 2, and Geoffrey Fox 1 1 Northeast Parallel Architectures Center Syracuse University, Syracuse,

### An empirical study of learning speed in backpropagation

Carnegie Mellon University Research Showcase @ CMU Computer Science Department School of Computer Science 1988 An empirical study of learning speed in backpropagation networks Scott E. Fahlman Carnegie

### Laboratorio di Intelligenza Artificiale e Robotica

Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning

### Machine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler

Machine Learning and Data Mining Ensembles of Learners Prof. Alexander Ihler Ensemble methods Why learn one classifier when you can learn many? Ensemble: combine many predictors (Weighted) combina

### Continual Curiosity-Driven Skill Acquisition from High-Dimensional Video Inputs for Humanoid Robots

Continual Curiosity-Driven Skill Acquisition from High-Dimensional Video Inputs for Humanoid Robots Varun Raj Kompella, Marijn Stollenga, Matthew Luciw, Juergen Schmidhuber The Swiss AI Lab IDSIA, USI

### Planning with External Events

Planning with External Events Jim Blythe School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 blythe@cs.cmu.edu Abstract I describe a planning methodology for domains with uncertainty

### THE PENNSYLVANIA STATE UNIVERSITY SCHREYER HONORS COLLEGE DEPARTMENT OF MATHEMATICS ASSESSING THE EFFECTIVENESS OF MULTIPLE CHOICE MATH TESTS

THE PENNSYLVANIA STATE UNIVERSITY SCHREYER HONORS COLLEGE DEPARTMENT OF MATHEMATICS ASSESSING THE EFFECTIVENESS OF MULTIPLE CHOICE MATH TESTS ELIZABETH ANNE SOMERS Spring 2011 A thesis submitted in partial

### Softprop: Softmax Neural Network Backpropagation Learning

Softprop: Softmax Neural Network Backpropagation Learning Michael Rimer Computer Science Department Brigham Young University Provo, UT 84602, USA E-mail: mrimer@axon.cs.byu.edu Tony Martinez Computer Science

### Speeding Up Reinforcement Learning with Behavior Transfer

Speeding Up Reinforcement Learning with Behavior Transfer Matthew E. Taylor and Peter Stone Department of Computer Sciences The University of Texas at Austin Austin, Texas 78712-1188 {mtaylor, pstone}@cs.utexas.edu

### High-level Reinforcement Learning in Strategy Games

High-level Reinforcement Learning in Strategy Games Christopher Amato Department of Computer Science University of Massachusetts Amherst, MA 01003 USA camato@cs.umass.edu Guy Shani Department of Computer

### Designing a Rubric to Assess the Modelling Phase of Student Design Projects in Upper Year Engineering Courses

Designing a Rubric to Assess the Modelling Phase of Student Design Projects in Upper Year Engineering Courses Thomas F.C. Woodhall Masters Candidate in Civil Engineering Queen's University at Kingston,

### Activities, Exercises, Assignments Copyright 2009 Cem Kaner 1

Patterns of activities, exercises and assignments Workshop on Teaching Software Testing January 31, 2009 Cem Kaner, J.D., Ph.D. kaner@kaner.com Professor of Software Engineering Florida Institute of

### Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling Notebook for PAN at CLEF 2013 Andrés Alfonso Caurcel Díaz 1 and José María Gómez Hidalgo 2 1 Universidad

### AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders

### AN EXAMPLE OF THE GOMORY CUTTING PLANE ALGORITHM

AN EXAMPLE OF THE GOMORY CUTTING PLANE ALGORITHM Consider the integer programme $\max z = 3x_1 + 4x_2$ subject to $3x_1 - x_2 \le 12$, $3x_1 + 11x_2 \le 66$, $x \in \mathbb{N}^2$. The first linear programming relaxation is

### Knowledge-Based - Systems

Knowledge-Based - Systems ; Rajendra Arvind Akerkar Chairman, Technomathematics Research Foundation and Senior Researcher, Western Norway Research institute Priti Srinivas Sajja Sardar Patel University

### Predicting Students' Performance with SimStudent: Learning Cognitive Skills from Observation

School of Computer Science Human-Computer Interaction Institute Carnegie Mellon University Year 2007 Predicting Students' Performance with SimStudent: Learning Cognitive Skills from Observation Noboru Matsuda

### Acquiring Competence from Performance Data

Acquiring Competence from Performance Data Online learnability of OT and HG with simulated annealing Tamás Biró ACLC, University of Amsterdam (UvA) Computational Linguistics in the Netherlands, February

### Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

### A Pipelined Approach for Iterative Software Process Model

A Pipelined Approach for Iterative Software Process Model Ms.Prasanthi E R, Ms.Aparna Rathi, Ms.Vardhani J P, Mr.Vivek Krishna Electronics and Radar Development Establishment C V Raman Nagar, Bangalore-560093,

### Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

### An OO Framework for building Intelligence and Learning properties in Software Agents

An OO Framework for building Intelligence and Learning properties in Software Agents José A. R. P. Sardinha, Ruy L. Milidiú, Carlos J. P. Lucena, Patrick Paranhos Abstract Software agents are defined as

### QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić, John C. Platt, Tony Wang Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

### Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I

Session 1793 Designing a Computer to Play Nim: A Mini-Capstone Project in Digital Design I John Greco, Ph.D. Department of Electrical and Computer Engineering Lafayette College Easton, PA 18042 Abstract

### Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

### Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

### Objectives. Chapter 2: The Representation of Knowledge. Expert Systems: Principles and Programming, Fourth Edition

Chapter 2: The Representation of Knowledge Expert Systems: Principles and Programming, Fourth Edition Objectives Introduce the study of logic Learn the difference between formal logic and informal logic

### Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

### Domain Knowledge in Planning: Representation and Use

Domain Knowledge in Planning: Representation and Use Patrik Haslum Knowledge Processing Lab Linköping University pahas@ida.liu.se Ulrich Scholz Intellectics Group Darmstadt University of Technology scholz@thispla.net

### Truth Inference in Crowdsourcing: Is the Problem Solved?

Truth Inference in Crowdsourcing: Is the Problem Solved? Yudian Zheng, Guoliang Li #, Yuanbing Li #, Caihua Shan, Reynold Cheng # Department of Computer Science, Tsinghua University Department of Computer

### System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

### A Version Space Approach to Learning Context-free Grammars

Machine Learning 2: 39-74, 1987. 1987 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands A Version Space Approach to Learning Context-free Grammars KURT VANLEHN (VANLEHN@A.PSY.CMU.EDU)

### On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC

On Human Computer Interaction, HCI Dr. Saif al Zahir Electrical and Computer Engineering Department UBC Human Computer Interaction HCI HCI is the study of people, computer technology, and the ways these

### Using focal point learning to improve human machine tacit coordination

DOI 10.1007/s10458-010-9126-5 Using focal point learning to improve human machine tacit coordination Inon Zuckerman, Sarit Kraus, Jeffrey S. Rosenschein The Author(s) 2010 Abstract We consider an automated

### Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

### An Investigation into Team-Based Planning

An Investigation into Team-Based Planning Dionysis Kalofonos and Timothy J. Norman Computing Science Department University of Aberdeen {dkalofon,tnorman}@csd.abdn.ac.uk Abstract Models of plan formation

### Dual Memory Architectures for Fast Deep Learning of Stream Data via an Online-Incremental-Transfer Strategy

Dual Memory Architectures for Fast Deep Learning of Stream Data via an Online-Incremental-Transfer Strategy arXiv:1506.04477v1 [cs.LG] 15 Jun 2015 Sang-Woo Lee Min-Oh Heo School of Computer Science and

### Automatic Discretization of Actions and States in Monte-Carlo Tree Search

Automatic Discretization of Actions and States in Monte-Carlo Tree Search Guy Van den Broeck 1 and Kurt Driessens 2 1 Katholieke Universiteit Leuven, Department of Computer Science, Leuven, Belgium guy.vandenbroeck@cs.kuleuven.be

### Lecture 1: Basic Concepts of Machine Learning

Lecture 1: Basic Concepts of Machine Learning Cognitive Systems - Machine Learning Ute Schmid (lecture) Johannes Rabold (practice) Based on slides prepared March 2005 by Maximilian Röglinger, updated 2010

### Switchboard Language Model Improvement with Conversational Data from Gigaword

Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

### BMBF Project ROBUKOM: Robust Communication Networks

BMBF Project ROBUKOM: Robust Communication Networks Arie M.C.A. Koster Christoph Helmberg Andreas Bley Martin Grötschel Thomas Bauschert supported by BMBF grant 03MS616A: ROBUKOM Robust Communication Networks,

### Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

### Learning to Schedule Straight-Line Code

Learning to Schedule Straight-Line Code Eliot Moss, Paul Utgoff, John Cavazos Doina Precup, Darko Stefanović Dept. of Comp. Sci., Univ. of Mass. Amherst, MA 01003 Carla Brodley, David Scheeff Sch. of Elec.

### Given a directed graph G = (N, A), where N is a set of m nodes and A. destination node, implying a direction for flow to follow. Arcs have limitations

4 Interior point algorithms for network flow problems Mauricio G.C. Resende AT&T Bell Laboratories, Murray Hill, NJ 07974-2070 USA Panos M. Pardalos The University of Florida, Gainesville, FL 32611-6595

### Why Did My Detector Do That?!

Why Did My Detector Do That?! Predicting Keystroke-Dynamics Error Rates Kevin Killourhy and Roy Maxion Dependable Systems Laboratory Computer Science Department Carnegie Mellon University 5000 Forbes Ave,

### School Competition and Efficiency with Publicly Funded Catholic Schools David Card, Martin D. Dooley, and A. Abigail Payne

School Competition and Efficiency with Publicly Funded Catholic Schools David Card, Martin D. Dooley, and A. Abigail Payne Web Appendix See paper for references to Appendix Appendix 1: Multiple Schools

### College Pricing. Ben Johnson. April 30, 2012. Abstract. Colleges in the United States price discriminate based on student characteristics

College Pricing Ben Johnson April 30, 2012 Abstract Colleges in the United States price discriminate based on student characteristics such as ability and income. This paper develops a model of college

### GCSE Mathematics B (Linear) Mark Scheme for November 2014. Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education

GCSE Mathematics B (Linear) Component J567/04: Mathematics Paper 4 (Higher) General Certificate of Secondary Education Mark Scheme for November 2014 Oxford Cambridge and RSA Examinations OCR (Oxford Cambridge

### Active Learning. Yingyu Liang Computer Sciences 760 Fall 2017

Active Learning Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven,

### BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

### OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

### Learning Cases to Resolve Conflicts and Improve Group Behavior

From: AAAI Technical Report WS-96-02. Compilation copyright 1996, AAAI (www.aaai.org). All rights reserved. Learning Cases to Resolve Conflicts and Improve Group Behavior Thomas Haynes and Sandip Sen Department

### Chapter 4 - Fractions

Fractions Chapter 4 - Fractions Michelle Manes, University of Hawaii Department of Mathematics These materials are intended for use with the University of Hawaii Department of Mathematics Math course

### UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society

UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society Title Multi-modal Cognitive Architectures: A Partial Solution to the Frame Problem Permalink https://escholarship.org/uc/item/8j2825mm

### Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany

Entrepreneurial Discovery and the Demmert/Klein Experiment: Additional Evidence from Germany Jana Kitzmann and Dirk Schiereck, Endowed Chair for Banking and Finance, EUROPEAN BUSINESS SCHOOL, International

### IAT 888: Metacreation Machines endowed with creative behavior. Philippe Pasquier Office 565 (floor 14)

IAT 888: Metacreation Machines endowed with creative behavior Philippe Pasquier Office 565 (floor 14) pasquier@sfu.ca Outline of today's lecture A little bit about me A little bit about you What will that