
Model Free Deep Learning With Deferred Rewards For Maintenance Of Complex Systems

*Alan DeRossett 1, Pedro V. Marcal 2
1 Boxx Health Inc., 3538 S. Thousand Oaks Blvd., Thousand Oaks, CA 91362
2 MPACT Corp., 5297 Oak Bend Lane, Suite 105, Oak Park, CA 91377
*Presenting and corresponding author: alan@mb1.com

Abstract

This paper reviews progress in deep learning and the successful application of Deep Q Networks (DQN) to competitive games. The basis of the method is Watkins' deferred rewards learning [1]; its implementation in its current form, however, is due to D. Hassabis et al. [2]. In its current form the technology is heavily dependent on image processing. This suggests that, with the addition of sound monitoring, the method may be adapted to establish optimal maintenance policies for complex systems such as airliners and racing cars.

Keywords: deep learning, deferred rewards learning, game theory, maintenance of complex systems.

Introduction

The authors started off by investigating the recent explosive growth of the GPU in numerical processing. We soon discovered that the deep learning community accounted for the largest growth in GPU use. In fact, it may be said that deep learning owes its recent progress to massive computing, which achieved the scale necessary to solve problems in imaging. Two programs that enabled this were Theano [3] from the University of Montreal and TensorFlow [4], the recent open-source system from Google. Theano may be looked upon as a system for code optimization and deployment on GPUs; the program was developed and used for neural network problems. TensorFlow achieves the same end by allowing its users to express their algorithms in a data flow model. Once users cast their algorithms in data flow form, the program brings all the available computing power to bear on completing the data flow object (a minimal example of this style is sketched at the end of this section).

The goal of massive parallelization has long been elusive. For a long time the people in this audience have tried to apply it to FEM, for example. The existence of multiple-core machines has sped up the solution of our problems, but the artificial neural network community has shown us that concurrency becomes much simpler when problems are increased by two orders of magnitude. It is interesting to speculate that the FEM community may be able to take advantage of these two programs for FEA.

In order to achieve our objective of explaining the technology behind the success of DQN [2], and behind DeepMind's spectacular achievement of beating the world Go champion, we will start by introducing the three technologies behind it. We should also note that one of us (ADR) has spent considerable time and effort downloading and installing Theano and TensorFlow on local computers as well as on the cloud.
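As promised above, here is a minimal sketch of the data flow style (our illustration, not from the paper), written against the graph-and-session TensorFlow API of the period: the constants and the matmul operation are nodes of a data flow graph, and the session maps the graph onto whatever devices are available.

```python
# A minimal data-flow sketch (our illustration, not from the paper),
# using the TensorFlow 1.x graph-and-session API of the period.
import tensorflow as tf

# Build the graph: two constant tensors and a matrix-product op node.
a = tf.constant([[1.0, 2.0]])       # 1x2 matrix
b = tf.constant([[3.0], [4.0]])     # 2x1 matrix
product = tf.matmul(a, b)           # node that depends on a and b

# The graph is only a description of the computation; the session
# places the ops on the available devices (CPU/GPU) and executes them.
with tf.Session() as sess:
    print(sess.run(product))        # [[11.]]
```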

Theoretical Considerations

Deep Learning

Nielsen [5] has pre-released the first chapter of a book in preparation by Bengio et al. [6] that explains the theory of deep learning and applies it to deciphering handwritten digits. The exposition is particularly instructive because the theory is demonstrated in the form of a Python computer program.

Following [5] we start by defining a sigmoid network.

Fig. 1: Sigmoid neuron.

The neuron has 3 inputs x and a bias b (scalar) with weights w. The input to the neuron is z = w.x + b. The output is modified by the sigmoid, or logistic, function

σ(z) = 1 / (1 + e^(-z)).

Fig. 2: Sigmoid function.

The sigmoid function σ(z) has the desirable property that small input changes result in small and smooth changes in output. In deep networks, we introduce additional layers between the input and output.

Fig. 3: Deep network with two hidden layers.

We introduce an additional index to indicate the layer: the input to neuron i in layer j is z_ij = Σ w_ij x_i,j-1 + b_ij, and its output is σ(z_ij), for j = 0 to d, where d is the output layer.
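As a concrete illustration of these definitions, here is a minimal sketch of our own, in the spirit of the Python program in [5]; the layer sizes and random parameters are arbitrary assumptions.

```python
# A minimal sketch (our illustration, in the spirit of [5]) of a sigmoid
# neuron and a layered feed-forward pass; layer sizes are arbitrary.
import numpy as np

def sigmoid(z):
    """The logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """Propagate input x through the layers: z = W.a + b, a = sigma(z)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Example: 3 inputs -> 4 hidden neurons -> 1 output, random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(1)]
print(feedforward(np.array([0.5, -1.0, 2.0]), weights, biases))
```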

Hence an input x_i0 results in a nonlinearly mapped output σ(z_id). The network is defined by a number of training inputs x with n outputs y(x). We define a cost function

C(w,b) = (1/2n) Σ (y_id - σ(z_id))^2,

and the objective is to vary the weights w and biases b to reduce the value of C over all the members of the training set. So we see the training problem as the minimization of the quadratic function C. The numerical procedure for solving such problems is well known; the most used method is that of back propagation of errors. As mentioned earlier, we are at a stage where the combined power of software and hardware can be applied to large problems with many layers (typically up to 10). One may think of this process as building a large interpolation function, so that any other input x results in an output that conforms to the function defined by the training set of inputs x and outputs y. This concludes our brief discussion of deep learning.

Learning With Delayed Rewards

In his thesis [1], Watkins investigated animal behavior both in the wild and under controlled laboratory conditions. There is a rich set of experiences and theories: one may view an animal's reaction as an intelligent response to a set of stimuli, an immediate as well as a long-term reaction with a view to survival (the ultimate response). Watkins framed the problem as a number of variables x_i with state a_it at time t. At any stage the problem was assumed to be a Markov process. The problem is framed in terms of a cycle, or epoch, within which all the states are changed in sequence according to some set policy; initially this could even be a random one. The epoch is assumed to take n time steps. At step t we assume that some action Q(a_it) results in a reward r_it at every time step, depreciated by a factor γ. Because γ < 1, the return tends to 0 as the number of steps n becomes large. Hence we have the n-step truncated return for a change in a_i at time t: the rewards for actions Q(a_it) over n steps are given by the shifted n-step return

R_n = Σ_{k=0}^{n-1} γ^k r_{i,t+k}.

At this stage the framing of the problem has introduced a further number of unknowns. We note that the state a changes with every action Q; the reward function r is not known, and the policy for selecting Q is also not known. The problem is, however, given specificity by selecting the changes Q(a_it) according to the principle of dynamic programming (DP) [7]. Watkins showed that with DP the action Q can be obtained iteratively, and Watkins and Dayan [8] gave further proofs governing the iterative procedure. Watkins also assumed that the reward function could be defined by repeated observation of the actual game over time. The thesis did not go into the application of the theory; one must assume that a large amount of numerical calculation was performed for Watkins to be able to speak so authoritatively on the problem. The reader is referred to [9] for a tutorial on learning with deferred rewards.
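The iterative procedure analyzed in [8] is the now-standard one-step Q-learning update, Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]. Below is a minimal tabular sketch of it; the environment interface (reset/step/actions) and all parameter values are our assumptions, not taken from the thesis.

```python
# A minimal tabular Q-learning sketch. The env interface (reset/step/actions)
# and all hyper-parameter values are illustrative assumptions.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy policy: mostly exploit, occasionally explore.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Watkins' update: move Q(s,a) toward r + gamma * max_a' Q(s',a'),
            # with no bootstrap term once the episode has terminated.
            target = r if done else r + gamma * max(
                Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```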
Theory of Games With Deep Learning and DQN

It was up to Hassabis and his colleagues at DeepMind [2] to bring substance to Watkins' methods. Hassabis reformulated the problem in neural network terms. This was demonstrated by applying it to a whole set of Atari games, and in many ways the Atari games were the perfect project: the games already had a scoring system that provided the reward function R, and the games were built for a user to specify an action Q at any stage. The pixels on the screen were used as input; they were turned into values by applying a convolution mapping to the screen. The input state a_i0 was defined by the screen image. The Q function was defined as a square matrix giving all the possible combinations of changes in the state sequence with time. The preprocessing of the raw input for the neural net is shown in Fig. 6: the Q functions are given by the coding on the right, while the conversion to convolutions is obtained on the left side; both results are fed into the neural network.

Fig. 6: Schematic for preprocessing raw input to the neural net.

The project proceeded in two phases. In the first, training, phase the program was set to collect a massive amount of data resulting from changes in the Q functions for different starting state values a; the program stores all the experiences in a database D_e, where e is the total number of recordings for training a particular game. The training of DQN networks is known to be unstable. In the second phase the database D was used to train the neural net in what is known as experience replay, developed in [2]: a biologically inspired mechanism that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. A second improvement used an iterative update that adjusts Q towards targets that are only periodically updated.
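Here is a minimal sketch of the two stabilizing devices just described: a replay buffer sampled uniformly at random, and a target network that is synchronized only periodically. This is our illustration, not the DeepMind code; the buffer capacity, batch size, and sync interval are assumptions.

```python
# Sketch of DQN's two stabilizers (our illustration, not the DeepMind code):
# a replay buffer sampled at random, and a periodically synchronized target.
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):       # capacity is an assumption
        self.buffer = deque(maxlen=capacity)    # oldest experiences fall off

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlations in the
        # observation sequence, as described above.
        return random.sample(list(self.buffer), batch_size)

# Inside the training loop (sketch): regression targets come from a frozen
# copy of the network that is refreshed only every TARGET_SYNC steps, so the
# targets move slowly while the online network is trained.
#
#   if step % TARGET_SYNC == 0:
#       target_weights = list(online_weights)    # periodic copy
#   for s, a, r, s2, done in replay.sample():
#       y = r if done else r + GAMMA * max_q(target_weights, s2)
#       ... gradient step moving online Q(s, a) toward y ...
```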

Results

Here we show the results for the Space Invaders game using two useful metrics: the average score and the average predicted action-value Q.

Fig. 7: Scores for Space Invaders.

We conclude by noting that the results for all the Atari games beat those obtained by specialized game-playing programs (each tailored to one game); hence the project, even at this first step, proved the power of the method of implementing DQN as a deep learning network.

Possible Applications in Complex Engineering Systems

Such a general approach to applying DQN to model-free problems has many potential applications, so we should pick the more important problems. One problem that would require little alteration of the open-source Lua program provided by the authors is the study of the maintenance problem in airlines. Currently the fleet is overhauled and serviced on a regular basis; at such times parts are examined and sometimes replaced. The maintenance actions have an impact on the performance of an aircraft. Perhaps this improvement can be detected by video cameras that record the visual performance of the plane during taxiing and parking; such records could replace those captured for the Atari games. In a similar vein, we could also record the roar of racing car engines and add this signal when determining the best maintenance policy. A hypothetical sketch of this framing follows.
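To make the proposal concrete, the maintenance problem would have to be cast in the same interface DQN expects: an observation (features from video or audio), a discrete action set (maintenance choices), and a scalar reward. The sketch below is entirely hypothetical; every signal, action, and cost in it is our assumption, not part of the paper.

```python
# Hypothetical framing of airline maintenance in a DQN-style interface.
# Every signal, action, and cost below is an illustrative assumption.

ACTIONS = ["no_action", "inspect", "replace_part", "full_overhaul"]
COSTS = {"no_action": 0.0, "inspect": 1.0,
         "replace_part": 5.0, "full_overhaul": 20.0}

class MaintenanceEnv:
    """State = features extracted from taxiing video (or engine audio);
    reward = observed performance minus maintenance cost, so the value of
    an action may only show up many steps later (a deferred reward)."""

    actions = ACTIONS

    def reset(self):
        return self._observe()

    def step(self, action):
        performance = self._performance()        # e.g. from video frames
        reward = performance - COSTS[action]
        return self._observe(), reward, False    # fleet runs indefinitely

    def _observe(self):
        return [0.0] * 16                        # placeholder feature vector

    def _performance(self):
        return 0.0                               # placeholder performance score
```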

References

[1] C.J.C.H. Watkins, 'Learning from Delayed Rewards', Ph.D. Thesis, King's College, Cambridge, 1989.
[2] Google DeepMind, 'Human-level Control Through Deep Reinforcement Learning', Nature, Vol. 518, February 2015.
[3] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley and Y. Bengio, 'Theano: A CPU and GPU Math Expression Compiler', Proceedings of the Python for Scientific Computing Conference (SciPy), June 30 - July 3, 2010, Austin, TX.
[4] Google Brain Team, 'TensorFlow', open source, 2016.
[5] M. Nielsen, 'Deep Learning', Chapter 1 of [6], January 2016.
[6] Y. Bengio, I. Goodfellow and A. Courville, 'Deep Learning', MIT Press (in press), 2016.
[7] R.E. Bellman and S.E. Dreyfus, 'Applied Dynamic Programming', RAND Corp., 1962.
[8] C.J.C.H. Watkins and P. Dayan, 'Q-Learning', Machine Learning, 8, 279-292, 1992.
[9] 'A Painless Q-Learning Tutorial', http://mnemstudio.org/path-finding-q-learning-tutorial.htm, 2016.