Introduction to the Theories of Machine Learning


Introduction to the Theories of Machine Learning with Feed-Forward Artificial Neural Networks and Evolving with Genetic Algorithms

Second Research Paper, Bachelor course on Media Technology at St. Pölten University of Applied Sciences

by: Peter Alexander Kopciak mt

Supervising Tutors: FH-Prof. Dipl.-Ing. Markus Seidl, Dipl.-Ing. Mag. Marlies Temper

Vienna,

Declaration of Authorship

- The attached research paper is my own, original work undertaken in partial fulfillment of my degree.
- I have made no use of sources, materials or assistance other than those which have been openly and fully acknowledged in the text. If any part of another person's work has been quoted, this either appears in inverted commas or (if beyond a few lines) is indented.
- Any direct quotation or source of ideas has been identified in the text by author, date, and page number(s) immediately after such an item, and full details are provided in a reference list at the end of the text.
- I understand that any breach of the fair practice regulations may result in a mark of zero for this research paper and that it could also involve other repercussions.

Place, Date    Signature

Abstract

The main objective of machine learning in supervised systems is to find a hypothesis that fits the data as well as possible. A hypothesis can take various forms, although a linear fit is much simpler and faster to compute than a non-linear fit. Predicting future numbers or classifying inputs can be achieved by regression and classification. The best linear fit for a regression or classification problem is found by minimizing the expected empirical loss and the error rate. To classify data, thresholds can be used, whereby a soft or logistic threshold is preferred, especially when the problem is not linearly separable. Artificial Neural Networks are structures consisting of Artificial Neurons, which are units that take a number of inputs, sum them, and then pass the sum through their activation or threshold function to return a specific output. Depending on its threshold, the classic Artificial Neuron is either a hard perceptron or a soft sigmoid; the sigmoid is preferred most of the time. There are a number of different network types, such as feed-forward networks, where information flows in one direction only, and recurrent networks, where information can be fed back into the network, thus creating loops. A network consists of input, output, and hidden layers, and depending on the nature of the problem, more hidden layers with a number of hidden neurons have to be used. One of the most common learning algorithms for neural networks is the back-propagation algorithm, where the network feeds the error back to the former layers so they may adjust their weights and, in turn, feed their error back to another layer. Genetic algorithms are inspired by Darwinian evolution: selection, sexual reproduction and diversity. These search algorithms can explore the function space at different positions and find global maxima and minima, overcoming local optima and plateaus more easily than, for example, classic back-propagation.
Thus, genetic algorithms are sometimes used to find the weights in a neural network, especially in unsupervised or reinforcement learning or when we have no knowledge of the function space. We can also use them to find an optimal network structure by coding the network and connection structure as a formal language, and to find the optimal learning parameters and initial weight settings.

Table of Contents

Declaration of Authorship
Abstract
Table of Contents
1. Introduction
2. Machine Learning
   A. What is Machine Learning
   B. Supervised Learning
      i. Fitting Data with a Hypothesis
      ii. Errors and Loss
      iii. Minimizing Loss
      iv. Linear Regression
         a. Univariate Regression
         b. Multivariate Regression
      v. Linear Separators
         a. Hard Threshold
         b. Soft Threshold
3. Artificial Neural Networks
   A. Biology and Background
   B. The Artificial Neuron
   C. Network Types
   D. Learning in Feed-Forward Networks
      i. Single-Layer Networks
      ii. Multi-Layer Networks
   E. Choosing a Network Structure
4. Training with Genetic Algorithms
   A. Genetic Algorithms
   B. Evolving Weights
   C. Evolving Connections
      i. Direct Encoding
      ii. Grammar Encoding
   D. Evolving Parameters
5. Conclusion and Future
Figures
Bibliography


1. Introduction

Human nature and the nature of evolution are fascinating fields that have attracted much attention throughout history. People have always wondered what makes us humans so intelligent and proficient at specific tasks. Various scientific fields research the inner workings of our brain, its functions and its structure. Since the emergence of computers, numerous scientists have struggled to emulate human intellect and to solve basic problems and tasks of pattern recognition. Humans can identify various symbols as numbers, discuss philosophies in coherent speech and even discern a brown cat from a brown dog, whereas computers fail at understanding these concepts, which make up the human experience. Computers are programmed and act according to their code and their algorithms; they do not possess a brain or consciousness. But what if? What if we could somehow teach computers the difference between a cat and a dog and grant them the ability to think? The field of machine learning and computational intelligence tries to answer these questions and attempts to solve these problems by learning from the human brain and recreating it with the help of so-called artificial neural networks (ANN). However, recreating the structure is just one half of the equation. We also have to give these networks the ability to learn from examples and let them evolve an optimal structure, just as a real brain evolved over thousands of years.

In this paper, we will try to learn the basic principles of machine learning by studying the mathematics behind these processes to find out what learning looks like in equations. We will also find out what artificial neural networks are and how they are able to learn from examples. Finally, we want to find out how we can train and adjust them by using genetic algorithms.
This paper is not a complete description and reference for machine learning, artificial neural networks, and genetic algorithms; it is merely a quick introduction for the interested student who wants to gain knowledge about the mathematical background of machine learning and a basic understanding of neural networks and genetic algorithms. Knowledge of algebra, calculus, probability, and algorithm theory is not strictly a prerequisite but will help the reader understand the key concepts of the mathematics explained in this paper. A citation refers to a single sentence if placed inside that sentence, or to the whole paragraph if placed outside of a sentence or at the end of a paragraph.

2. Machine Learning

A. What is Machine Learning

Machine Learning is a subfield of computer science and has its roots in neuroscience, cognitive science, theoretical informatics, and computational learning. It is the study of algorithms for intelligent agents: teaching them to recognize patterns and make educated guesses on new data based on previous knowledge, experience and learning, and improving the ability of these agents so they may one day outperform humans in pattern recognition tasks. In short, an agent is learning if it improves its performance on future tasks after making observations about the world and taking its past experience and training into account (Norvig & Russel, 2013, p. 705).

We can discern two types of learning. The first one is deductive learning, which starts from rules and draws conclusions from them to find new rules and especially patterns based on prior knowledge (Norvig & Russel, 2013, p. 705). This can often be found in agents exercising first-order logic as they try to learn and recognize patterns and objects following rules and imposed axioms. The other type is called inductive learning: the agent is presented with a limited number of examples, tries to notice or grasp a pattern in the data, attempts to classify it and formulates a rule or a hypothesis which fits the examples as well as possible (2013, p. 705).

These are two different approaches to the same problem. Whereas deductive learning formulates a general rule and leaves the agent to fit and rate examples as well as it can, inductive learning feeds the agent with examples and leaves the agent to learn a general rule for the data. Additionally, inductive learning can be subdivided into three classes: Supervised, Unsupervised, and Reinforcement Learning.
If we show our agent pairs of input and output data, thus training it to associate a specific input with a desired output, it can try to find a rule which fits the data; we call this Supervised Learning (2013, p. 705). On the other hand, we can gather a large amount of data and let the agent find a pattern or classification on its own. There is no feedback or desired output in this type of learning, thus it is called Unsupervised Learning or clustering (2013, p. 705). Finally, Reinforcement Learning needs another agent or human to act as a teacher; the learning agent acts and tries to pick up a rule by reinforcement, i.e. positive feedback (reward) or punishment (2013, p. 705). In this paper we will focus solely on Supervised Learning and on how to find a good representation for input and output pairs.

B. Supervised Learning

i. Fitting Data with a Hypothesis

In supervised learning, the task is to find a hypothesis, or general rule, which fits the data. For training, we show our agent a set of corresponding data points, called the training set. This set consists of data pairs $[(x_1, y_1), (x_2, y_2), \ldots]$, where $x_n$ is the input and $y_n$ the corresponding output. This dataset can be plotted on a graph and is represented by a specific but unknown function $y = f(x)$. The task is now to find a hypothesis which approximates the unknown function as well as possible, thus finding an $h(x) \approx f(x)$. (Norvig & Russel, 2013, p. 706)

Because there are many functions $h$ which could represent $f$, it is important to define a space where this potential $h$ could come from. This is called the hypothesis space $H$ (2013, p. 707). Looking at Figure 1, we can see various proposed hypotheses for different datasets. It is important not just to find a well-fitting and generalizing hypothesis but also to take into account that more complex hypotheses (1.b as opposed to 1.a) are more expensive to compute and may overfit the data. Overfitting means the hypothesis tries to find a pattern where there is no pattern (such as trying to determine the result of throwing a fair die) (2013, p. 716). This happens when we give in to more complex hypotheses (such as polynomials), have insufficient data, have too many attributes and inputs, or use a hypothesis space which does not contain a hypothesis with a good enough approximation for our data (2013, pp. 707, 716). So sometimes, the solution is expanding $H$ to include more candidates. In Figure 1.c we see a complex polynomial of degree 6 proposed as the hypothesis. If we extend the hypothesis space to include polynomials over $x$ and $\sin(x)$, we quickly find a consistent hypothesis (2013, p. 708).

Figure 1: Data and proposed hypotheses which fit the data. Examples a/b and c/d are based on the same data samples. a) An exactly fitting hypothesis. b) Another fitting but very complex hypothesis for the same dataset. c) A complex and fitting hypothesis and a linear approximation. d) A simple, exact sinusoidal fit. Adapted from (Norvig & Russel, 2013, p. 707)

Of course, we do not know if $H$ contains a consistent hypothesis $h(x)$, and if it contains a solution, we do not know how good and how complex it is. Thus, it is vital to be careful when defining the hypothesis space and to let the agent judge solutions based on different parameters. Most of the time we have to decide between a complex, well-fitting but possibly overfitting hypothesis and a simpler, generalizing but possibly underfitting hypothesis. We can let the agent decide based on various parameters such as the number of nodes in a tree, the number of variables or the degree of the polynomial, and then just try different hypotheses until we find the best one, first underfitting the dataset and then increasing complexity until we overfit. (2013, p. 720)

Even then, searching for the true $f$ instead of $h$ is complicated and sometimes even impossible. Sometimes the problem may not be realizable, which means $f$ is not included in our model of the hypothesis space. It could also be hidden behind many good-enough local minima in hypothesis space, so that we will not reach the global minimum (which is the function $f$). The data could possess a high amount of variance, therefore leading to different hypotheses. For this, we could just take the mean or the majority of the predictions of a limited number of hypotheses. Sometimes, $f$ could be non-deterministic or noisy and return different $y$ for the same $x$. If our data is noisy, we could remove outliers outside of a statistically significant range or just gather more data samples; both methods promote a more generalized hypothesis. Finally, we could fail due to complexity. The true $f$ could be a high-degree polynomial, the hypothesis space $H$ which contains $f$ could be too complex to formulate, or we could simply reach the limit of our available computational power. (2013, p. 723)

Following that, we can say that the learning problem is realizable if the hypothesis $h$ for approximating the function $f$ is found inside the hypothesis space $H$ and fits the data as well as possible, thus formulating a probable hypothesis. (2013, p. 708)

$$h^* = \underset{h \in H}{\operatorname{argmax}}\; P(\text{Data} \mid h)\, P(h)$$

In this equation, $P(h)$ is the prior probability of a hypothesis, where $P$ is high for a simpler hypothesis (2013, p. 708).

After we have found a candidate for a valid hypothesis, it is important to test it on new data. Just because the hypothesis fits the data in the training set, it does not mean it will perform well on new data. During training, we can measure the effectiveness of our algorithm with a learning curve. With learning curves we can also compare different learning algorithms. (2013, p. 714)

Figure 2: A learning curve for an algorithm. The y-axis shows the accuracy of the prediction for a test set, the x-axis shows the amount of training data. Adapted from (Norvig & Russel, 2013, p. 714)

There are a number of ways to validate our hypothesis. The most common are holdout cross-validation and k-fold cross-validation. The simpler approach is holdout cross-validation. With this method, we just split our available data into two groups: a training set to learn a hypothesis and a test set to validate our hypothesis. Although this type of cross-validation follows the principle that we should not mix test and training data, its disadvantage is that we do not use all available data for both training and validation. If our training set is too small, we may get a weak, wrong, or overfitting hypothesis. If the test set is too small, we cannot validate our hypothesis reliably. (2013, p. 719)

The other method, which is more commonly used, is k-fold cross-validation. During this process, we split our available data into $k$ groups and then perform $k$ rounds of training and testing. During each round, a different $\frac{1}{k}$ of the data is held out as test and validation data and the rest is used as training data. Usually $k$ is a number between 5 and 10. (2013, p. 719)
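The k-fold procedure described above can be sketched as follows. This is a minimal illustration, not part of the cited source: the function name `k_fold_splits`, the fixed `seed` and the toy data are assumptions chosen for the example.

```python
import random


def k_fold_splits(examples, k, seed=0):
    """Yield (training, validation) pairs; every example is held out exactly once."""
    shuffled = list(examples)
    # Shuffle once so that the folds are random but reproducible (seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th example starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, validation


# Hypothetical toy data: ten (x, y) pairs.
data = [(x, 2 * x + 1) for x in range(10)]
splits = list(k_fold_splits(data, k=5))
```

Each round trains on four fifths of the data and validates on the remaining fifth, so all data is eventually used for both purposes, which is the advantage over a single holdout split.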

ii. Errors and Loss

When searching for our hypothesis, our true goal is not finding an exact replica of the function $f$ but finding a good enough hypothesis where the error $|h(x) - y|$ is small, or more specifically, where the difference between the desired and the actual output (or $|h(x) - f(x)|$) is small. (Norvig & Russel, 2013, p. 719)

For this, we first make an assumption, called the stationarity assumption: we assume that all of our data is independent and identically distributed (i.i.d.). This means every data pair is as valid and as probable as every other, that no data pair depends on other data pairs, and that we do not have a time dimension in our data which could alter the probability of its occurrence or relevance. (2013, p. 719)

Now we can turn our informal error function into something more specific, called the loss function. It is defined as the amount of utility lost by predicting $h(x) = \hat{y}$ when the correct answer is $f(x) = y$. (2013, p. 721)

$$L(x, y, \hat{y}) = \text{Utility}(\text{using } y \text{ when input is } x) - \text{Utility}(\text{using } \hat{y} \text{ when input is } x)$$

With this, we can attach a number to the impact of errors. When talking about the loss function, the $x$ is dropped most of the time, leaving $L(y, \hat{y})$. However, not all errors are the same. Most of the time we prefer one kind of error over the other, e.g. classifying spam as non-spam is mildly dissatisfying whereas classifying non-spam as spam could be much more disastrous. (2013, p. 722)

Even then, how can we measure the difference between our desired and actual output? For this, one of three common types of loss function is used: the absolute value of the difference with the $L_1$ loss, the squared difference with the $L_2$ loss, and the binary $L_{0/1}$ loss, which returns 1 for a wrong answer and 0 for an exact answer. (2013, p. 722)

$$L_1(y, \hat{y}) = |y - \hat{y}|$$

$$L_2(y, \hat{y}) = (y - \hat{y})^2$$

$$L_{0/1}(y, \hat{y}) = 0 \text{ if } y = \hat{y}, \text{ else } 1$$
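As a small illustration, the three loss functions can be written directly as code. The function names are assumptions made for this sketch; `y` is the desired output and `y_hat` the output of the hypothesis.

```python
def l1_loss(y, y_hat):
    """Absolute difference between desired and actual output."""
    return abs(y - y_hat)


def l2_loss(y, y_hat):
    """Squared difference; large errors are penalized more strongly."""
    return (y - y_hat) ** 2


def l01_loss(y, y_hat):
    """Binary loss: 0 for an exact answer, 1 for any wrong answer."""
    return 0 if y == y_hat else 1
```

The $L_{0/1}$ loss is mostly useful for discrete outputs such as class labels, since real-valued predictions are rarely exactly equal to the desired value.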

iii. Minimizing Loss

The agent could try finding a hypothesis by minimizing the loss for each data pair $(x, y)$ in the set of all possible data pairs $\mathcal{E}$. For this we need the probability $P(x, y)$ of each data pair, i.e. how likely it is that this particular combination of $x$ and $y$ occurs. Then we could find the hypothesis with the smallest generalization loss. (Norvig & Russel, 2013, p. 722)

$$\text{GenLoss}_L(h) = \sum_{(x,y) \in \mathcal{E}} L(y, h(x))\, P(x, y)$$

The best hypothesis $h^*$ is the one with the minimum generalization loss.

$$h^* = \underset{h \in H}{\operatorname{argmin}}\; \text{GenLoss}_L(h)$$

Of course, we do not know the true distribution $P(x, y)$, so we have to assume that we have gathered the data to the best of our knowledge, beliefs, and skills. Thus, we treat every collected example as equally likely. In addition, we do not have the whole data set $\mathcal{E}$, but rather a subset. With these adjustments in mind, the agent can then try to find the best estimated hypothesis $\hat{h}^*$ by minimizing the estimated generalization loss, which is the empirical loss on our collected set $E$ of $n$ elements. (2013, p. 723)

$$\text{EmpLoss}_{L,E}(h) = \frac{1}{n} \sum_{(x,y) \in E} L(y, h(x))$$

$$\hat{h}^* = \underset{h \in H}{\operatorname{argmin}}\; \text{EmpLoss}_{L,E}(h)$$

Finally, we can also take into account the complexity cost of the hypothesis. Of course, defining complexity is another part of the endeavor of finding the best $h$. One can take the sum (or something similar) of the coefficients (later called weights), the highest exponent, i.e. the degree of the polynomial, the number of variables used, and so forth. (2013, p. 724)

$$\text{Cost}(h) = \text{EmpLoss}_{L,E}(h) + u \cdot \text{Complex}(h)$$

In this equation, $u$ describes the relationship between the complexity of the hypothesis and the estimated loss, or in other words, what we value more in our search.
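The following sketch shows how empirical loss and complexity cost might be combined to choose among a handful of candidate hypotheses. Everything here is illustrative: the candidate functions, their hand-assigned complexity values and the trade-off value `u=0.1` are assumptions, not values from the cited source.

```python
def l2_loss(y, y_hat):
    return (y - y_hat) ** 2


def empirical_loss(h, examples, loss=l2_loss):
    """Average loss of hypothesis h over the collected examples E."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)


def cost(h, complexity, examples, u=0.1, loss=l2_loss):
    """Empirical loss plus a complexity penalty weighted by u."""
    return empirical_loss(h, examples, loss) + u * complexity


# Hypothetical data generated by y = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5), (3, 7)]

# Tiny hypothesis space: name -> (hypothesis, assumed complexity = number of coefficients).
candidates = {
    "constant": (lambda x: 4, 1),
    "linear": (lambda x: 2 * x + 1, 2),
    "cubic": (lambda x: 2 * x + 1 + 0 * x ** 3, 4),
}

best = min(candidates, key=lambda name: cost(*candidates[name], examples))
```

Both the linear and the cubic candidate fit the data exactly, so the complexity term is what makes the search prefer the simpler linear hypothesis.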

iv. Linear Regression

We will often face two different problems in machine learning. The first, regression, is fitting a line to represent numerical values and predict future numbers. The second, classification, is separating data with a line (or a plane) to divide it between two or more classes and return a classification based on inputs. (Norvig & Russel, 2013, p. 707)

a. Univariate Regression

In this section we will focus on linear regression, which is fitting a straight line to our data set (2013, p. 729). We will not deal with the mathematics of higher-degree polynomials, since this is beyond the scope of this thesis. First, we will discuss univariate linear regression, with only one $x$.

$$h_w(x) = w_1 x + w_0$$

The hypothesis has the form of a linear equation, and we will be searching for the coefficients that give the best linear fit for our data. These coefficients, $w_1$ and $w_0$, are called weights. We will also refer to them as the vector $\mathbf{w}$. We again take a loss function, this time $L_2$, and calculate the weights that best fit the available data. (2013, p. 729)

$$\text{Loss}(h_w) = \sum_{j=1}^{n} L_2(y_j, h_w(x_j)) = \sum_{j=1}^{n} (y_j - h_w(x_j))^2 = \sum_{j=1}^{n} (y_j - (w_1 x_j + w_0))^2$$

So to find the ideal weight vector $\mathbf{w}^*$ we have to minimize the loss (2013, p. 730). How do we achieve this? With partial derivatives: a minimum or maximum of a function is reached where the first derivative, or the slope, is zero. Furthermore, we know that there is no maximum, because as we move away from the global minimum of the function, the error gets bigger and bigger. In Figure 3.b we can observe what is called the weight space, which shows us the relation between $w_1$, $w_0$ and the loss function. We are using the $L_2$ loss to solve a linear regression problem, so the weight space will always look like this: convex, with one global minimum and no local minima. (2013, p. 730)

Figure 3: a) Best approximate linear fit for data, depicting the relationship between price and m² for houses. b) The convex weight space of a linear regression problem with $L_2$ loss. Adapted from (Norvig & Russel, 2013, p. 729)

Finding the weights in this special example is easy, as mentioned before, with partial derivatives, and leads to a unique solution (2013, p. 730).

$$\frac{\partial}{\partial w_0} \sum_{j=1}^{n} (y_j - (w_1 x_j + w_0))^2 = 0 \qquad \frac{\partial}{\partial w_1} \sum_{j=1}^{n} (y_j - (w_1 x_j + w_0))^2 = 0$$

$$w_1 = \frac{n \left(\sum x_j y_j\right) - \left(\sum x_j\right)\left(\sum y_j\right)}{n \left(\sum x_j^2\right) - \left(\sum x_j\right)^2} \qquad w_0 = \frac{\sum y_j - w_1 \left(\sum x_j\right)}{n}$$

For this special kind of regression problem we have arrived at a solution: we can solve the equations and calculate our weights. However, when we go beyond linear univariate regression, we will often not be able to solve the problem so easily, especially when there are many solutions or when we leave the world of closed-form equations. For this we will use the gradient descent algorithm: we choose a random point in weight space as the initial weights and then go downhill, descending to the next neighbor until we converge on a solution (2013, p. 730). We subtract the first derivative, the direction of the gradient, from our position. This works because of the nature of the weight space and because there is just one global minimum. When the minimum lies ahead of us, the gradient is negative, so subtracting it moves us forward and further down. If the minimum is behind us, the gradient is positive, so subtracting it moves us backwards and again downhill.

$$w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} \text{Loss}(\mathbf{w})$$

The parameter $\alpha$ is the learning constant, which can be fixed or change over time. Finding the ideal learning constant is a separate task. If it is too low, the algorithm may take a very long time; if it is too high, the weights may oscillate around the global minimum (Norvig & Russel, 2013, pp. 730, 731).
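The closed-form solution for univariate linear regression can be checked with a few lines of code. This is a minimal sketch using the standard least-squares formulas; the function name `fit_line` and the sample points are assumptions for illustration.

```python
def fit_line(points):
    """Return (w0, w1) minimizing the summed L2 loss of w1 * x + w0."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    # Slope first, because the intercept depends on it.
    w1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    w0 = (sum_y - w1 * sum_x) / n
    return w0, w1


# Hypothetical points lying exactly on y = 2x + 1.
points = [(1, 3), (2, 5), (3, 7), (4, 9)]
w0, w1 = fit_line(points)
```

Note that the denominator becomes zero if all $x_j$ are identical; the sketch does not guard against that case.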

J. Freeman and D. Skapura suggest two ways to find an optimal learning constant. The first is a vague rule that one should choose $\alpha$ small enough that the weights do not change by more than a small fraction of their current value, usually a number between 0.05 and 0.25 (Freeman & Skapura, 1991, p. 105). The other is a bit more specific, but requires knowledge of the underlying statistics of the example data in the form of an input correlation/covariance matrix $R$. Having this, the ideal learning constant lies in the range $\frac{1}{\lambda_{max}} > \alpha > 0$, where $\lambda_{max}$ is the largest eigenvalue of the matrix $R$. The latter won't work for us this time, because $\lambda_{max}$ is 1 (Freeman & Skapura, 1991, p. 67).

So, again we will be differentiating. For this problem, we use the $L_2$ loss function and the chain rule, $\frac{\partial}{\partial w} f(b(w)) = f'(b(w)) \cdot b'(w)$.

$$\frac{\partial}{\partial w_i} L_2(\mathbf{w}) = \frac{\partial}{\partial w_i} (y - h_w(x))^2 = 2(y - h_w(x)) \cdot \frac{\partial}{\partial w_i} (y - h_w(x)) = 2(y - h_w(x)) \cdot \frac{\partial}{\partial w_i} (y - (w_1 x + w_0))$$

If we differentiate with respect to $w_0$, the last term is $-1$, and with respect to $w_1$, the last term is $-x$.

$$\frac{\partial}{\partial w_0} L_2(\mathbf{w}) = -2(y - h_w(x)), \qquad \frac{\partial}{\partial w_1} L_2(\mathbf{w}) = -2(y - h_w(x)) \cdot x$$

We then insert this into our gradient descent equation. The factor 2 is absorbed into $\alpha$, the minus signs cancel, and of course we want to sum the errors over all of our $n$ training examples to get an accurate weight adjustment.

$$w_0 \leftarrow w_0 + \alpha \sum_j (y_j - h_w(x_j)), \qquad w_1 \leftarrow w_1 + \alpha \sum_j (y_j - h_w(x_j)) \cdot x_j$$

This formula is also called the batch gradient descent learning rule. It guarantees convergence, but it can sometimes be slow due to the amount of data. A quicker way would be stochastic gradient descent, where we take one example at a time; with a fixed learning rate it will not always converge on a solution, whereas a learning rate that decreases over time helps it settle. (Norvig & Russel, 2013, p. 731)
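A minimal sketch of the batch gradient descent learning rule for univariate regression follows. The learning constant `alpha=0.01`, the number of epochs and the starting point of zero (instead of a random point, to keep the run reproducible) are assumptions chosen so that this toy example converges.

```python
def batch_gradient_descent(points, alpha=0.01, epochs=5000):
    """Fit h_w(x) = w1 * x + w0 by repeatedly applying the batch update rule."""
    w0, w1 = 0.0, 0.0  # assumption: start at the origin instead of a random point
    for _ in range(epochs):
        # Errors (y_j - h_w(x_j)) for all examples, computed with the current weights.
        errors = [y - (w1 * x + w0) for x, y in points]
        step0 = alpha * sum(errors)
        step1 = alpha * sum(e * x for e, (x, _) in zip(errors, points))
        # Both weights are updated together, using the same set of errors.
        w0, w1 = w0 + step0, w1 + step1
    return w0, w1


# Hypothetical points lying exactly on y = 2x + 1.
points = [(1, 3), (2, 5), (3, 7), (4, 9)]
w0, w1 = batch_gradient_descent(points)
```

With a noticeably larger `alpha` the weights of this example would overshoot and diverge, which illustrates why the choice of the learning constant matters.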

b. Multivariate Regression

The same principle can be applied to learning in multivariate linear regression problems, i.e. data pairs whose inputs consist of a vector of multiple $x$, written as $\mathbf{x}$ with $n$ dimensions. (2013, p. 731)

$$h_w(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_{i=1}^{n} w_i x_{j,i}$$

Here $j$ denotes the $j$-th input and $i$ the $i$-th dimension of an $n$-dimensional vector $\mathbf{x}_j$. To make things a little easier on the eyes, we introduce a dummy input $x_{j,0}$ called the bias, which is always 1 and serves to include $w_0$ in the summation, which can then be expressed as a dot product. (2013, p. 732)

$$h_w(\mathbf{x}_j) = \sum_{i=0}^{n} w_i x_{j,i} = \mathbf{w} \cdot \mathbf{x}_j$$

Again, we want to find the best weight vector $\mathbf{w}^*$, which is the weight vector for which the squared loss is minimized. (2013, p. 732)

$$\mathbf{w}^* = \underset{\mathbf{w}}{\operatorname{argmin}} \sum_j L_2(y_j, \mathbf{w} \cdot \mathbf{x}_j)$$

Following that, we can adapt our batch gradient descent learning rule from univariate regression by taking the rule for $w_1$ and formulating it for the general case of multivariate regression. (2013, p. 732)

$$w_i \leftarrow w_i + \alpha \sum_j (y_j - h_w(\mathbf{x}_j)) \cdot x_{j,i}$$

During learning, we can start increasing the learning rate once the error reaches a smaller value, so that we may converge faster on a specific solution. We can also implement an optional term called momentum, which adds a fraction of the previous weight change to the current weight when calculating the next weight. In the following equation, $\beta$ is a positive number between 0 and 1, and $\Delta_p w_i$ is the previous change, i.e. the difference between the weight at time $t$ and the weight at time $t-1$. (Freeman & Skapura, 1991, p. 105)

$$w_i \leftarrow w_i + \alpha \sum_j (y_j - h_w(\mathbf{x}_j)) \cdot x_{j,i} + \beta\, \Delta_p w_i$$

$$\Delta_p w_i = w_i(t) - w_i(t-1)$$

Now we may run into the aforementioned problem of overfitting. Since we have more inputs, some of which may be less relevant or even irrelevant for the output, it may happen that they are still considered useful and distort the linear function (Norvig & Russel, 2013, p. 732).
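The multivariate update rule with momentum described above can be sketched as follows. The values of `alpha`, `beta` and `epochs`, the zero starting weights and the toy data are assumptions picked for this illustration.

```python
def predict(w, x):
    """Dot product w . x, where x already contains the dummy input x_0 = 1."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train(examples, alpha=0.01, beta=0.5, epochs=5000):
    """Batch gradient descent for multivariate linear regression with momentum."""
    # Prepend the bias input so that w[0] plays the role of w_0.
    data = [([1.0] + list(x), y) for x, y in examples]
    w = [0.0] * len(data[0][0])
    prev_change = [0.0] * len(w)  # previous weight change, w(t) - w(t-1)
    for _ in range(epochs):
        errors = [y - predict(w, x) for x, y in data]
        change = [
            alpha * sum(e * x[i] for e, (x, _) in zip(errors, data))
            + beta * prev_change[i]
            for i in range(len(w))
        ]
        w = [wi + ci for wi, ci in zip(w, change)]
        prev_change = change
    return w


# Hypothetical data generated by y = 1 + 2 * x1 - x2.
examples = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2), ((2, 1), 4)]
w = train(examples)
```

Setting `beta=0` recovers the plain batch rule; a larger `beta` lets earlier steps carry over into the current one, which can speed up movement along shallow directions of the weight space.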

$$\text{Cost}(h) = \text{EmpLoss}_{L,E}(h) + u \cdot \text{Complex}(h)$$

Here we have our cost function, defined in a former section. For linear multivariate regression we can define the complexity as a sum over all the weights using a loss function (2013, p. 732).

$$\text{Complex}(h) = L_q(\mathbf{w}) = \sum_i |w_i|^q$$

The question is now which loss function we should use. Most of the time we prefer the sparse model using $L_1$, thus preferring hypotheses with fewer weights and variables. Looking at Figure 4, we can see that with an $L_1$ loss a weight can reach 0, because the intersection lies exactly on the axis, thus eliminating this dimension from our equation. With $L_2$ loss, a weight will just reach a very small number but still influence the output. (2013, p. 732)

Figure 4: a) The $L_1$ loss function (blue square) in relationship to the minimal achievable loss. b) The same data set with the same achievable loss, now in relationship with the $L_2$ loss (blue circle). Adapted from (Norvig & Russel, 2013, p. 733)

Another point which can influence our decision is the nature of our data. Because the $L_2$ loss function is a circle in weight space, it can be rotated freely and does not depend on the order and orientation of the axes. On the other hand, changing or rotating the axes while using $L_1$ loss means that a different weight is eliminated. Looking at Figure 4, if we rotated the weight space by 45° counterclockwise or switched the axes, the $L_1$ loss function would eliminate $w_2$ instead of $w_1$. (2013, p. 733)
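The complexity term can be written as a one-line function. This is only a sketch; the function name `complexity` and the example weight vector are assumptions for illustration.

```python
def complexity(weights, q):
    """L_q complexity of a weight vector: the sum of |w_i| ** q."""
    return sum(abs(w) ** q for w in weights)


# Hypothetical weight vector in which one weight has already been eliminated.
weights = [0.0, 3.0, -4.0]
l1 = complexity(weights, 1)  # 0 + 3 + 4
l2 = complexity(weights, 2)  # 0 + 9 + 16
```

A weight of exactly 0 contributes nothing to either measure, so an input whose weight has been driven to 0 effectively drops out of the hypothesis.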

v. Linear Separators

a. Hard Threshold

Linear functions are good for regression problems, i.e. finding a numerical output from a numerical input based on previous data, but they are also helpful for classification, i.e. deciding whether an observation is a specific thing or not. As an example, let us take a look at Figure 5, where we have seismic data, listed as $x_1$ and $x_2$. Based on this data, we want to learn whether a future event described by $x_1$ and $x_2$ is an earthquake (white points) or an explosion (black points). For this, our algorithm has to find a linear separator, also called a decision boundary, to classify the data and divide it between different groups or regions. Finding a linear separator is only possible for data sets which can be separated into two regions by a single straight line; such data sets are called linearly separable. (Norvig & Russel, 2013, p. 734)

Figure 5: a) Mapping of the parameters concerning seismic activities. The white dots are earthquakes and the black dots are explosions. b) More data and noise, producing a problem that is not linearly separable and making classification harder. Adapted from (Norvig & Russel, 2013, p. 734)

Therefore, if a new input lands above the line we want to receive an output of 0, or earthquake, and if it lands beneath the line we want an output of 1, or not an earthquake. For this kind of output, we need a hard threshold function, as in Figure 6.a. (2013, p. 735)

In our example, the linear separator is a linear function of the form $x_2 = 1.7x_1 - 4.9$, or $0 = -4.9 + 1.7x_1 - x_2$. Everything under the line satisfies $0 < -4.9 + 1.7x_1 - x_2$ and everything above the line satisfies $0 > -4.9 + 1.7x_1 - x_2$. For our hypothesis this means that $h_w(\mathbf{x}) = 1$ if $\mathbf{w} \cdot \mathbf{x} \geq 0$, otherwise it is 0, which can also be written as $\text{Threshold}(z) = 1$ if $z \geq 0$, else 0, applied to $z = \mathbf{w} \cdot \mathbf{x}$. (2013, p. 735)
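A hard threshold classifier can be sketched in a few lines. The weight vector below is a made-up example (decision boundary $x_2 = x_1 - 1$), not the separator from the seismic data, and the function names are assumptions.

```python
def threshold(z):
    """Hard threshold: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def classify(w, x):
    """Apply the hard threshold to w . x, with a dummy input x_0 = 1 prepended."""
    z = sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))
    return threshold(z)


# Hypothetical weights: 0 = -1 + x1 - x2, i.e. the line x2 = x1 - 1.
w = [-1.0, 1.0, -1.0]
below = classify(w, (3.0, 1.0))  # point beneath the line
above = classify(w, (1.0, 3.0))  # point above the line
```

Points lying exactly on the line give $z = 0$ and are assigned to class 1 by this definition of the threshold.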

But now we face a small problem if we want to find $\mathbf{w}$: the threshold function is discontinuous and not differentiable. Its gradient is zero almost everywhere, except at the points which lie on the linear separator at $\mathbf{w} \cdot \mathbf{x} = 0$, where the gradient is not defined. So we cannot solve by differentiation or by using our gradient descent. (Norvig & Russel, 2013, p. 735)

Figure 6: a) A hard threshold function, which is not differentiable and returns either 0 or 1. b) A logistic threshold function, which returns a continuous value. Adapted from (Norvig & Russel, 2013, p. 737)

For this problem, there is a simple learning rule, based on our previous gradient descent learning rule, which converges on a solution if the data is linearly separable. If it is not, the weights will oscillate and never come to rest, unless we settle for the best we can get and reduce the learning constant over time. This is the perceptron learning rule; for a data pair $(\mathbf{x}, y)$ (2013, p. 735):

$$w_i \leftarrow w_i + \alpha\, (y - h_w(\mathbf{x})) \cdot x_i$$

Just as in stochastic gradient descent, learning proceeds gradually by feeding the algorithm one data pair after another. The rule looks very similar, but there are important differences. The desired output $y$ and the received output $h_w(\mathbf{x})$ can each only be 0 or 1. This has the following implications for learning:

- If $h_w(\mathbf{x}) = y$, the weights stay the same.
- If $y = 1$ but $h_w(\mathbf{x}) = 0$, we have to increase $\mathbf{w} \cdot \mathbf{x}$ so that $h_w(\mathbf{x})$ returns 1 and includes this input. For this, the specific weight $w_i$ gets bigger if $x_i$ is positive, or smaller if $x_i$ is negative.
- If $y = 0$ but $h_w(\mathbf{x}) = 1$, we have to decrease $\mathbf{w} \cdot \mathbf{x}$ so that $h_w(\mathbf{x})$ returns 0 and excludes this input. For this, the specific weight $w_i$ gets smaller if $x_i$ is positive, or bigger if $x_i$ is negative. (2013, p. 735)

In Figure 7.a we can see from the learning curve that the algorithm finds a hypothesis for the data. The x-axis is the number of iterations, the y-axis is the error.
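The perceptron learning rule given above can be sketched as follows. The training data (the logical AND function, which is linearly separable), the learning constant `alpha=1.0`, the zero starting weights and the early stop after a mistake-free pass are assumptions chosen to keep this example small and exact.

```python
def threshold(z):
    return 1 if z >= 0 else 0


def predict(w, x):
    """Hard-threshold output for an input x that already includes x_0 = 1."""
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(examples, alpha=1.0, epochs=100):
    """Apply the perceptron learning rule one data pair at a time."""
    data = [([1.0] + list(x), y) for x, y in examples]
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(w, x)  # one of -1, 0, +1
            if error != 0:
                mistakes += 1
                w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
        if mistakes == 0:  # every example is classified correctly
            break
    return w


# Hypothetical linearly separable data: logical AND of two binary inputs.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(examples)
predictions = [predict(w, [1.0] + list(x)) for x, _ in examples]
```

For data that is not linearly separable, the inner loop would keep making mistakes until `epochs` is exhausted, which corresponds to the oscillating weights described above.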
In Figure 7.b, by contrast, the perceptron learning rule fails at fitting a hypothesis for the noisy data set depicted in Figure 5.b. In Figure 7.c we reduced the learning constant over time, just as proposed, and thus approach a more or less consistent hypothesis.

Figure 7: a) The algorithm finds a hypothesis for the linearly separable data. b) The algorithm fails at a noisy (and not linearly separable) data set, even after many iterations. c) The same algorithm with the same data set as in b, but this time with a learning rate that decays over time. Adapted from (Norvig & Russel, 2013, p. 736)

b. Soft Threshold

As mentioned, a hard threshold function confronts us with a few problems. Primarily, it is not differentiable, which hinders us in devising better learning algorithms and makes the learning process hard to predict (we can see this in the very unstable learning curve in Figure 7.a). In addition, it can only return a true or false value, even if the data point is really close to the threshold. We want something smoother, which can make a statement about the probability that something lies in a region. We can solve these issues if we change our hard threshold function to a soft threshold function, called a logistic threshold or sigmoid threshold (due to the S shape of the function). We can see how the function behaves in Figure 6.b. (2013, p. 736)

$$g(z) = \mathrm{Logistic}(z) = \frac{1}{1 + e^{-z}}$$

If we use this for our threshold function, we get:

$$h_w(x) = \mathrm{Logistic}(w \cdot x) = g(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}$$

This returns a floating-point number between 0 and 1, which acts as the probability of an object belonging to a specific class. Now, with this differentiable form, we can differentiate, find weights, and get back to our gradient descent algorithm. The process of fitting the weights of this model by minimizing the loss is called logistic regression. (2013, pp. 737, 738)

For this derivation we will be using the chain rule two times: first for the $L_2$ loss and second for the new logistic function $g(w \cdot x)$.

$$
\begin{aligned}
\frac{\partial}{\partial w_i} L_2(w) &= \frac{\partial}{\partial w_i} (y - h_w(x))^2 \\
&= 2\,(y - h_w(x)) \cdot \frac{\partial}{\partial w_i} (y - h_w(x)) \\
&= 2\,(y - h_w(x)) \cdot \frac{\partial}{\partial w_i} (y - g(w \cdot x)) \\
&= -2\,(y - h_w(x)) \cdot g'(w \cdot x) \cdot \frac{\partial}{\partial w_i} (w \cdot x) \\
&= -2\,(y - h_w(x)) \cdot g'(w \cdot x) \cdot x_i
\end{aligned}
$$

For $g'(w \cdot x)$ we now insert the term $g(w \cdot x)\,(1 - g(w \cdot x))$, which can also be written as $h_w(x)\,(1 - h_w(x))$. Updating our gradient descent weight learning rule with this derivative of the $L_2$ loss (we step against the gradient and absorb the constant factor 2 into $\alpha$), we receive our new gradient descent logistic learning rule for a single example $(x, y)$. (2013, p. 738)

$$w_i \leftarrow w_i + \alpha \, (y - h_w(x)) \cdot h_w(x)\,(1 - h_w(x)) \cdot x_i$$

If we now train our algorithm, we can see the advantage. In Figure 8.a, logistic regression converges in a much smoother and more predictable way, although it takes longer for this specific data set. In Figure 8.b we can observe the true advantage when searching for a hypothesis for noisy data: the algorithm converges on a solution, unlike the perceptron learning rule (Figure 7.b). The same can be observed in Figure 8.c, where we choose a decaying learning constant $\alpha$ and reach a consistent hypothesis much sooner. (2013, p. 738)

Figure 8: a) The same linearly separable data set as in 7.a, this time with logistic regression, where we converge in a more predictable fashion. b) In the noisy data set used in 7.b, logistic regression outperforms the perceptron learning rule. c) The same as in 8.b, with the addition of a decaying learning parameter. Adapted from (Norvig & Russel, 2013, p. 738)
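The logistic learning rule can be sketched in Python as follows. The function names, the toy data and the learning constant are assumptions for illustration only; the sketch is not the experiment behind Figure 8.

```python
import math


def logistic(z):
    """Soft threshold g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def h(w, x):
    """Hypothesis h_w(x) = g(w . x); x already contains the dummy input x[0] = 1."""
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))


def logistic_update(w, x, y, alpha):
    """One step of w_i <- w_i + alpha * (y - h) * h * (1 - h) * x_i."""
    out = h(w, x)
    return [wi + alpha * (y - out) * out * (1.0 - out) * xi
            for wi, xi in zip(w, x)]


def l2_loss(w, data):
    """Sum of squared errors over all data pairs."""
    return sum((y - h(w, x)) ** 2 for x, y in data)


# Assumed toy data: logical AND with dummy input x_0 = 1.
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
for _ in range(1000):
    for x, y in data:
        w = logistic_update(w, x, y, alpha=0.5)
print(l2_loss(w, data))
```

Because $g$ never reaches 0 or 1 exactly, a single update always moves the output for that example closer to its target, which is one way to see why the learning curve is smoother than with the hard threshold.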

3. Artificial Neural Networks

A. Biology and Background

Neural networks are inspired by the human brain, which has evolved over millions of years for tasks such as pattern recognition, classification and finding solutions to problems in uncertain environments based on prior knowledge (Rojas, 1996, p. 9).

Figure 9: A representation of the structure of a single neuron. Adapted from (Freeman & Skapura, 1991, p. 8)

A neuron is a unit composed of the cell body with its nucleus, the dendrites, the axon with the axon hillock, and the synapses (Rojas, 1996, p. 10). The mitochondria inside the cell body are the powerhouses of the cell and provide it with the energy for the chemical and electrical processes in the neuron. The neuron receives signals from other neurons at its dendrites, processes them at the axon hillock and transmits its output through the axon to other neurons (1996, p. 10). The contact points where the dendrites or axons are connected to another neuron are called synapses. At these synapses the neuron receives, through a complex chemical and electrical process, impulses called action potentials, which are fed into the neuron (1996, p. 15). A greater stimulus can produce stronger action potentials (1996, p. 18). Depending on the type of the synapse, these action potentials can be inhibitory, hindering the neuron from being activated, or excitatory, encouraging it to be activated. If the neuron receives enough strong excitatory action potentials, it may reach its activation threshold at the axon hillock, fire, and send an action potential along its axon to the next neuron (Rojas, 1996, p. 19). The influence of one neuron on another can be very diverse, and not all connections are the same. Neurons learn this influence over time through Hebbian learning, which proposes that every neuron is responsible

for a specific kind or class of input; thus, if a neuron a takes part in activating neuron b often enough, the bond between those cells grows stronger, which increases the efficiency and influence of this connection and allows the neural network to change during its lifetime (Freeman & Skapura, 1991, p. 15).

Following that, we can use biology as inspiration and create an artificial neuron. This neuron, or unit, has a cell body, which is responsible for the activation and the processing of the input, various input and output connections, similar to dendrites and axons, and finally a weight for each connection, similar to a synapse, which indicates the intensity and type of the bond. (Rojas, 1996, p. 11)

Many artificial neurons make up an artificial neural network, which is a hierarchical and sometimes unidirectional multilayered system. The neurons inside the network can form links to any other neuron and pass information around. This differs from classical models of computation, like Turing machines, where information is processed sequentially. In addition, no program or code is handed over to the network: the network has to learn its parameters itself, based on examples and learning algorithms (Rojas, 1996, pp. 6, 8).

B. The Artificial Neuron

Figure 10: An artificial neuron and its various components. Adapted from (Norvig & Russel, 2013, p. 739)

A neural network consists of several artificial neurons, also called units or nodes. These are connected by directed links from a neuron i to a neuron j, where each unit has a weight attached to each of its input links to determine the strength of the connection (Norvig & Russel, 2013, p. 739). The connection from neuron i to neuron j has a weight $w_{i,j}$, which can be any number. Sometimes networks only use positive floating-point numbers as weights, which means they have to use absolute inhibition. However, if we allow negative weights, we are dealing with relative inhibition, which is similar to real neural networks with inhibitory and excitatory synapses and is more common (Rojas, 1996, p. 40). A negative weight is equivalent to an inhibitory synapse, and a positive weight to an excitatory synapse. An example for this would be a classification problem where we want to stop one neuron from firing when a different neuron has been activated, or where we are trying to train the nodes in the output layer to each specialize on one specific learned pattern (Freeman & Skapura, 1991, p. 23).

A unit receives a number of signals $a_i$ via its input links from other neurons. These signals are multiplied by the weights and, most of the time, simply summed up by the input function, which corresponds to our linear hypothesis. Like in linear regression, there is a dummy input $a_0$ associated with the dummy or bias weight $w_{0,j}$, where j is the unit this weight belongs to and $a_i$ is the output of a neuron i, used as an input for unit j. (Norvig & Russel, 2013, p. 739)

$$in_j = \sum_{i=0}^{n} w_{i,j} \, a_i$$

Afterwards, the activation function g is applied to the calculated sum and returns an output $a_j$, which depends on the type of activation threshold (2013, p. 739).

$$a_j = g(in_j) = g\left(\sum_{i=0}^{n} w_{i,j} \, a_i\right)$$

Usually we differentiate between two types of activation thresholds and thus two types of neurons. If we apply a hard threshold, we are dealing with a perceptron, and if we use a soft or logistic threshold, we are dealing with a sigmoid perceptron or logistic perceptron (2013, p. 740). Other activation functions exist, but naming them all would be beyond the scope of this paper. For our artificial neurons there is no limit on the number of inputs and outputs. This model goes by the name of unlimited fan-in (Rojas, 1996, p. 31).
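A single artificial neuron, as described above, could be sketched in Python like this. The function names and the example weights are hypothetical; the dummy input $a_0 = 1$ is added inside the function so that `weights[0]` plays the role of the bias weight $w_{0,j}$.

```python
import math


def hard_threshold(z):
    """Activation of a perceptron."""
    return 1.0 if z >= 0 else 0.0


def logistic(z):
    """Activation of a sigmoid perceptron."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, g):
    """a_j = g(in_j) with in_j = sum_i w_ij * a_i and dummy input a_0 = 1."""
    activations = [1.0] + list(inputs)
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)


# Assumed weights that make a hard-threshold unit behave like a logical AND.
and_weights = (-1.5, 1.0, 1.0)
print(neuron_output(and_weights, (1, 1), hard_threshold))
print(neuron_output(and_weights, (1, 0), hard_threshold))
print(neuron_output(and_weights, (1, 1), logistic))
```

The same weights give a crisp 0 or 1 with the hard threshold and a graded value between 0 and 1 with the logistic threshold.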

C. Network Types

Now we are going to combine a number of neurons to form an artificial neural network (abbreviated as ANN). There are many types of ANNs, differing in implementation, application and network design, but we can discern two basic types of network design: the feed-forward network and the recurrent network. The feed-forward network connects neurons in one direction only, usually drawn from left to right, starting at the input neurons and ending at the output neurons, traversing the neurons in between. The output of a node flows downstream, just like water, whereas neurons receive their input from the nodes before them, i.e. from upstream. Most of the time, networks are composed of an input layer, an output layer and a number of hidden layers, which are located between the input and the output layer. They are called hidden because they neither receive the direct input nor return the final output. In Figure 11 we can see two kinds of feed-forward networks: 11.a is a single-layer network, also called a perceptron network, and 11.b is a multi-layer network. (Norvig & Russel, 2013, p. 740)

Figure 11: a) A single-layer feed-forward ANN with two input nodes and two output nodes (bias weights are not shown). b) A multi-layer feed-forward ANN with two input nodes, two output nodes and one hidden layer consisting of two hidden nodes (bias weights are not shown). Adapted from (Norvig & Russel, 2013, p. 741)

The other type of network is the recurrent network. This type of network is extended to include a time parameter t. In a recurrent network we can find loops and recursive behavior of nodes, which means that a neuron can compute its output and feed it back either into the network or into itself. The network can also take the output of the whole network and use it in the next iteration as another input, just like the output of an ordinary neuron (Rojas, 1996, p. 30).
This means we have to synchronize the outputs of all neurons and the output of the network in order to prepare the input for the next iteration. Altogether, in contrast to the feed-forward network, we can view the recurrent network as a stateful network which depends on its previous outputs and on time (Norvig & Russel, 2013, p. 740). Another way to look at this can be examined in Figure 12, which shows the recurrent network as a feed-forward network unfolded over several time steps. At the beginning, at $t = 0$, the network receives an input $x^{(0)}$; every neuron executes its functions and at the end the network returns an output $o^{(0)}$. Then, during the next cycle at $t = 1$, the network receives another input $x^{(1)}$. However, this time the neurons of the network compute their outcome with an additional input, the result of the last iteration $o^{(0)}$, producing another output $o^{(1)}$. This cycle is repeated until a solution, an endpoint or the maximum lifetime T has been reached. (1996, p. 172)

$$o^{(0)} = \mathrm{Network}(x^{(0)}); \qquad o^{(t)} = \mathrm{Network}(x^{(t)}, o^{(t-1)})$$

Figure 12: A diagram of an unfolded recurrent ANN. Adapted from (Rojas, 1996, p. 172)
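The unfolding over time can be sketched with a deliberately tiny, hypothetical network consisting of a single sigmoid unit. The weights `w_in`, `w_rec` and `w_bias` are assumptions for illustration, and at $t = 0$ the missing previous output is treated as 0.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def network(x_t, o_prev, w_in=1.0, w_rec=0.5, w_bias=0.0):
    """Hypothetical one-unit network: o(t) depends on x(t) and on o(t-1)."""
    return logistic(w_bias + w_in * x_t + w_rec * o_prev)


def run_unfolded(xs):
    """Compute o(0) = Network(x(0)) and o(t) = Network(x(t), o(t-1)) for t >= 1."""
    outputs = []
    for t, x_t in enumerate(xs):
        o_prev = outputs[-1] if t > 0 else 0.0  # no feedback exists at t = 0
        outputs.append(network(x_t, o_prev))
    return outputs


outputs = run_unfolded([1.0, 1.0, 1.0])
print(outputs)
```

Although the input is identical at every time step, the outputs differ, because each step also sees the previous output. This is the statefulness that distinguishes the recurrent network from the feed-forward network.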


D. Learning in Feed-Forward Networks

i. Single-Layer Networks

We have arrived at one of the central questions of this paper: How do neural networks learn? We will start with learning in perceptron networks, which may seem daunting at first, but with the right perspective it is solvable with our knowledge from Sections 2.B.iv and 2.B.v. As mentioned before, a perceptron network consists of neurons, which are just units representing linear separators, and weights. If we look again at Figure 11.a, a network with two inputs and two outputs, we can identify two relationships: the weights $w_{1,3}$ and $w_{2,3}$ influence neuron 3, and the weights $w_{1,4}$ and $w_{2,4}$ influence neuron 4. This means that these are two equations requiring two different learning processes.

$$a_3 = g(w_{0,3} + w_{1,3}\,a_1 + w_{2,3}\,a_2); \qquad a_4 = g(w_{0,4} + w_{1,4}\,a_1 + w_{2,4}\,a_2)$$

With this, we can just let our network train the weights by using (depending on the threshold function) either the perceptron learning rule or the gradient descent logistic learning rule. Generally speaking, if we have m output nodes in the output layer, we need m learning processes, because each weight influences only one specific output. This means that we could just split the network into smaller networks with one output each and with different weights. (Norvig & Russel, 2013, p. 741)

We will now proceed to train our perceptron network to learn how to add two bits together. In Figure 13 we see the desired outputs $y_3$ for neuron 3 and $y_4$ for neuron 4, given the inputs $x_1$ for neuron 1 and $x_2$ for neuron 2. This is also our training data.

Figure 13: The training data and truth table for the two-bit adder. Adapted from (Norvig & Russel, 2013, p. 740)

If we let the perceptron network try to solve this problem, it will fail. Well, not completely: whereas neuron 3 converges on a solution, neuron 4 will struggle to find the right weights, regardless of the type of threshold.
The reason for this is the nature of the task for neuron 4. If we take a look at Figure 13, we can identify the Carry task as a logical AND problem, while the Sum task is a logical XOR problem. If we plot these in input space, like in Figure 14, we can see that the XOR problem is not linearly separable and thus unsolvable for our single-layer perceptron network. (Norvig & Russel, 2013, p. 741)

Figure 14: Plotting the input space for the logical a) AND b) OR c) XOR. The black dots are 1, the white dots are 0. In a) and b) we can observe that the problem is linearly separable and that there is a linear separator. In c) no such separator can be found. Adapted from (Norvig & Russel, 2013, p. 741)

This is the big disadvantage of single-layer networks: they can only solve linear regression and linear classification problems; nothing outside this problem set will converge on a solution. This is true for the hard threshold as well as for the logistic threshold function. A single-layer network is only able to find a solution for linearly separable problems. Among the basic logical operations, AND, OR and NOT are linearly separable, whereas XOR is not. (Norvig & Russel, 2013, p. 742)

We could also try to take this problem to a higher dimension ($d = 3$ instead of 2) and search inside this d-dimensional hyperspace for a separator using a hyperplane, which is a linear separator with $d - 1$ dimensions (Freeman & Skapura, 1991, p. 28). In this example, this would really be a plane in a cube, like in Figure 15.a. Or we could move away from linear equations and separators for the neuron's input or sum function and use a decision curve (see Figure 15.b) with a hypothesis space of polynomials or splines, but there is an easier solution for now (Rojas, 1996, p. 66).

Figure 15: a) Example of a hyperspace for the XOR problem, solved with a hyperplane. b) Example of a decision curve, which also solves the XOR problem. Adapted from (Rojas, 1996, pp. 64, 66)
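The difference between AND, OR and XOR can be made tangible with a small brute-force search over candidate weights for a single hard-threshold unit. This is only an illustrative sketch: the weight grid and all names are assumptions, and failing to find a separator on a finite grid is not a proof, although for XOR it agrees with the known result that no linear separator exists.

```python
from itertools import product

# Truth tables of the three logical operations in input space (x1, x2).
TABLES = {
    "AND": {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1},
    "OR": {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1},
    "XOR": {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0},
}


def unit(w, x1, x2):
    """Single perceptron with bias weight w[0] and a hard threshold."""
    return 1 if w[0] + w[1] * x1 + w[2] * x2 >= 0 else 0


def find_separator(table, grid):
    """Return the first weight triple on the grid that reproduces the table."""
    for w in product(grid, repeat=3):
        if all(unit(w, x1, x2) == y for (x1, x2), y in table.items()):
            return w
    return None


GRID = [i / 2 for i in range(-8, 9)]  # candidate weights -4.0, -3.5, ..., 4.0

for name, table in TABLES.items():
    print(name, find_separator(table, GRID))
```

For AND and OR the search returns a weight triple, while for XOR it returns `None`, in line with Figure 14.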

ii. Multi-Layer Networks

To solve the XOR problem, we have to abandon our single-layer approach. The idea is that, given a large enough network, we can combine a number of AND, OR and NOT gates, form NOR and NAND gates and then combine them in various ways to solve the task. (Norvig & Russel, 2013, p. 742)

For this, we will be implementing a hidden layer, as in Figure 11.b. This network has one hidden layer with two hidden units. With this, we can solve the XOR problem. Looking at Figure 16.a, we can see what the network function looks like as a function of its inputs: by combining two hidden units with opposing soft thresholds we get a ridge, which is already able to solve non-linear regression problems. If we take this further and combine four hidden units, i.e. two ridges standing at a right angle to each other, we get a bump like in Figure 16.b. With enough hidden nodes we can express any continuous function. If we now add another hidden layer, we can create more ridges and bumps in different places, which makes it possible to solve discontinuous problems and more complex classifications. (2013, p. 743)

Figure 16: a) A ridge after thresholding two hidden sigmoid perceptrons. b) A bump after thresholding four hidden sigmoid perceptrons. Adapted from (Norvig & Russel, 2013, p. 743)

Again, our network has to learn the weights. We will start again from the back, solving for the weights of the output layer. Figure 11.b shows our multi-layer network, which will try to solve our XOR problem. This time we will not be able to split this into m different learning problems, because the outputs $a_5$ and $a_6$ both depend on all the weights of all the nodes in the former layers, and all the weights in the former layer depend on the errors at $a_5$ and $a_6$. (2013, p. 744)

$$
\begin{aligned}
a_5 &= g(w_{0,5} + w_{3,5}\,a_3 + w_{4,5}\,a_4) \\
a_5 &= g(w_{0,5} + w_{3,5}\,g(w_{0,3} + w_{1,3}\,a_1 + w_{2,3}\,a_2) + w_{4,5}\,g(w_{0,4} + w_{1,4}\,a_1 + w_{2,4}\,a_2)) \\
a_5 &= g(w_{0,5} + w_{3,5}\,g(w_{0,3} + w_{1,3}\,x_1 + w_{2,3}\,x_2) + w_{4,5}\,g(w_{0,4} + w_{1,4}\,x_1 + w_{2,4}\,x_2))
\end{aligned}
$$

This is how the output for node 5 is obtained, where $x_1$ and $x_2$ make up the input vector $\mathbf{x}$, an n-dimensional vector for n input nodes, here $\mathbf{x} = (x_1, x_2) = (a_1, a_2)$. In order to compute the weights, we need to know the output of both output nodes and the error the network produced for the input vector $\mathbf{x}$. For this, we have to extend our scalar function $h_w(x) = y$ to a vector function $\mathbf{h}_w(\mathbf{x}) = \mathbf{y}$, which returns an m-dimensional output vector $\mathbf{y}$. We will now again use our loss function $L_2$ to calculate the weights in our output layer. Because our loss function is additive, we compute the total loss by adding the individual losses of the m output nodes. For each output node k, the individual loss is based on the difference between the desired output $y_k$ (taken from our training data) and the actual output $a_k$ of the unit. (2013, p. 744)

$$\frac{\partial}{\partial w} L_2(w) = \frac{\partial}{\partial w} \left|\mathbf{y} - \mathbf{h}_w(\mathbf{x})\right|^2 = \frac{\partial}{\partial w} \sum_{k} (y_k - a_k)^2 = \sum_{k} \frac{\partial}{\partial w} (y_k - a_k)^2$$

With this, we can again split learning into m different learning processes, provided that we later add up the derived gradients when updating the weights and that we update the weights layer after layer. Now we have a small problem. To update the weights of the nodes in the output layer, we have to know their output and the desired output, both of which we know. However, we do not know the desired output for the neurons in the hidden layer. For this problem we will use a technique called back-propagation: we feed a specific fraction of the error from the output layer back to the former hidden layers and repeat this process until we reach the input layer. (2013, p. 744)

$$w_j \leftarrow w_j + \alpha \, (y - h_w(x)) \cdot h_w(x)\,(1 - h_w(x)) \cdot x_j$$

This was our gradient descent logistic learning rule, which we will now use to formulate a learning rule for back-propagation. First we define $Err_k$, which is the error $y_k - a_k$ of the kth output unit and the kth element of the m-dimensional error vector $\mathbf{y} - \mathbf{h}_w$ from our loss function mentioned earlier. Then we define $\Delta_k = Err_k \cdot g'(in_k)$ as the modified error, which shows us the direction and steepness of the slope. For the output unit k, the hypothesis $h_w(x)$ is its output $a_k$ and the input $x_j$ is the output $a_j$ of neuron j. (2013, p. 744)

$$g'(in_k) = g(w \cdot x)\,(1 - g(w \cdot x)) = h_w(x)\,(1 - h_w(x))$$

$$w_{j,k} \leftarrow w_{j,k} + \alpha \cdot Err_k \cdot h_w(x)\,(1 - h_w(x)) \cdot x_j$$

$$w_{j,k} \leftarrow w_{j,k} + \alpha \cdot \Delta_k \cdot a_j$$

The last line is the update rule for the output layer and our back-propagation update rule. It changes the weight from neuron j to output neuron k, depending on the output $a_j$ of neuron j and the error $\Delta_k$. Now it is time to back-propagate the errors, so that we can update the weights in the hidden layer. The idea is that neuron j has an influence on the $\Delta_k$ of all connected

neurons in the following layer. The bigger the weight $w_{j,k}$ is, the stronger this influence is. Thus we take the errors of all connected neurons and calculate the error of j (2013, p. 744):

$$\Delta_j = g'(in_j) \sum_{k} w_{j,k} \, \Delta_k$$

With this, we can then update the weights of the hidden neurons. For our example from Figure 11.b, the following rule updates the connections between the input layer neuron i and the hidden layer neuron j.

$$w_{i,j} \leftarrow w_{i,j} + \alpha \cdot \Delta_j \cdot a_i$$

This concludes the basic back-propagation algorithm. If we had more hidden layers, we would just repeat the step of calculating a new $\Delta_j$ and adjusting the weights accordingly. Of course, this method is still susceptible to converging on a local minimum instead of the global minimum. In practice, this is addressed by restarting the learning process with a different number of hidden units, different connections, different initial weights or a different learning parameter. It is important to note that we do not have to reach the global minimum. If we have found a solution which satisfies our problem and has an acceptable error rate, we can stop learning at a local minimum. (Freeman & Skapura, 1991, p. 105) In addition, back-propagation cannot be used for neural networks with discontinuous activation functions (such as the hard threshold) or discontinuous error and loss functions, and classic back-propagation is harder to apply and even unreliable for unsupervised and reinforcement learning problems (Montana D. & Davis L., 1989, p. 763).

Figure 17: The weight space of a learning task. There are two local minima at $z_1$ and $z_2$, and one global minimum at $z_{min}$. If the gradient descent algorithm converges on $z_1$, we will restart learning because the error rate is still too high. An acceptable solution can be found at $z_2$ and $z_{min}$. Adapted from (Freeman & Skapura, 1991, p. 106)
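The update rules above can be collected into a short Python sketch for a network with one hidden layer of sigmoid units, shaped like the network in Figure 11.b. All names, the random initial weights, the learning constant and the number of epochs are assumptions for illustration; the training data is the two-bit adder with the outputs (carry, sum). Convergence on the XOR part is not guaranteed, since the procedure can end in a local minimum.

```python
import math
import random


def g(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(W_h, W_o, x):
    """Return the activations of input, hidden and output layer (index 0 = bias)."""
    a_in = [1.0] + list(x)
    a_h = [1.0] + [g(sum(w * a for w, a in zip(row, a_in))) for row in W_h]
    a_o = [g(sum(w * a for w, a in zip(row, a_h))) for row in W_o]
    return a_in, a_h, a_o


def backprop_step(W_h, W_o, x, y, alpha):
    """One back-propagation update for a single example (x, y)."""
    a_in, a_h, a_o = forward(W_h, W_o, x)
    # Output layer: Delta_k = Err_k * g'(in_k) with g'(in_k) = a_k * (1 - a_k).
    d_o = [(yk - ak) * ak * (1.0 - ak) for yk, ak in zip(y, a_o)]
    # Hidden layer: Delta_j = g'(in_j) * sum_k w_jk * Delta_k (j + 1 skips the bias).
    d_h = [a_h[j + 1] * (1.0 - a_h[j + 1])
           * sum(W_o[k][j + 1] * d_o[k] for k in range(len(W_o)))
           for j in range(len(W_h))]
    # Weight updates: w_jk <- w_jk + alpha * Delta_k * a_j, same for the hidden layer.
    new_W_o = [[w + alpha * d_o[k] * a for w, a in zip(W_o[k], a_h)]
               for k in range(len(W_o))]
    new_W_h = [[w + alpha * d_h[j] * a for w, a in zip(W_h[j], a_in)]
               for j in range(len(W_h))]
    return new_W_h, new_W_o


# Two-bit adder: inputs (x1, x2), desired outputs (carry, sum).
adder = [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 0), (0, 1)), ((1, 1), (1, 0))]

rng = random.Random(0)
W_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W_o = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
for epoch in range(2000):
    for x, y in adder:
        W_h, W_o = backprop_step(W_h, W_o, x, y, alpha=0.5)
print([[round(a, 2) for a in forward(W_h, W_o, x)[2]] for x, _ in adder])
```

Note that the hidden errors are computed from the old output weights before any weight is changed, so that all updates of one step refer to the same network.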

E. Choosing a Network Structure

Creating a neural network is thus composed of three tasks: finding the correct weights which minimize the estimated loss of the hypothesis, choosing the kind of network and the number of layers, and finally determining the number of neurons in each layer and the best connections between them (Rojas, 1996, p. 24). After we have adjusted our weights according to the whole training set, we have finished a training epoch, which consists of one training cycle for every data pair (Mitchell, 1998, p. 66).

However, how do we find the right structure and the right number of layers and nodes for our network? Depending on the network, it may happen that it ends up with a hypothesis that overfits the data, like in Figure 1.c. This happens more often if our network has many parameters, e.g. many inputs, or too many neurons or layers (Norvig & Russel, 2013, p. 747). We can make a guess concerning the number of nodes and layers based on the kind of function space we need or the kind of problem we think we are facing. Alternatively, we could create many different network designs, measure the performance of each topology, or network structure, with k-fold cross-validation and choose the best performing one with the lowest cost (2013, p. 746).

If we want to find a network with a lower number of connections, we could also try an approach called optimal brain damage. First, we train a fully connected network, which means that every node in a layer is connected to every node in the next layer. Then we drop some connections, based on a specific criterion, such as the smallest weights or the lowest output. If the performance after retraining stays the same or gets better, we repeat the process until there is too much damage (2013, p. 748).
We could also use a tiling algorithm, which starts with a single unit that tries to learn as many examples as possible. Then we add another unit, which tries to learn the examples the first unit got wrong, and continue in this way until we reach a state of overfitting (2013, p. 748). Another approach is divide-and-conquer, which divides the problem into many smaller parts. For each of these smaller problems we try to generate a network which solves it and thus conquers the issue (2013, p. 711). On the other hand, we could do it like nature and find the ideal network structure and weights with evolution!

4. Training with Genetic Algorithms

A. Genetic Algorithms

Inspired by Darwinian evolution, a special kind of search algorithm has emerged in the last decades: genetic algorithms. Most of the terms and the underlying procedure are borrowed from evolution, so we will first give a quick overview of the principles and terms used. All organisms consist of cells, and every cell contains chromosomes, strings of DNA which act as a blueprint for the organism. A chromosome is a collection of a specific number of genes, with each gene encoding a particular trait. Each gene and trait can be turned off or on, or, depending on the gene, have even more states; these states are called alleles. A gene has a specific position in the chromosome; this position is the locus. The whole set of chromosomes is called the genome, and the particular set of genes it contains is the genotype, which is the genetic representation and coding of an organism. How the genes are expressed and how they manifest in the real world is the phenotype. In Figure 18 we can observe the difference between genotype and phenotype using the example of neural networks. (Mitchell, 1998, p. 5)

Figure 18: a) A chromosome and the genotype of a network with 6 genes. b) Two different phenotype networks based on the same genotype. Adapted from (Schaffer, Whitley, & Eshelman, 1992, p. 5)

Evolution works by letting two organisms, the parents, combine and shuffle their genes during sexual reproduction and thus create one or more children. The process of gene combination is called recombination or crossover. During reproduction, the child is also subject to mutation, which occurs at a random locus in the genome and changes the allele of the gene, or sometimes swaps the gene with a gene at another locus. Whether an organism can reproduce depends on how well it is adapted to its environment and on whether it lives long enough. (Mitchell, 1998, p. 6)

This is called fitness: a fitter organism is more likely to mate and to beget more children, which in turn tend to be fitter than other children and are more likely to survive. (Mitchell, 1998, p. 6)

A genetic algorithm is a special kind of stochastic beam search and a blend of a global and a local search algorithm. In local search we have one node, which tries to find a solution, such as a global minimum, in function space; our gradient descent algorithm is a kind of local search algorithm (Norvig & Russel, 2013, p. 128). To maximize our chances of finding a solution, we can use k nodes, which start at random locations and then try to find a solution. Of course, some of these nodes will not be as successful as others. Therefore we choose the best m node positions as starting points and then distribute a new set of k nodes, which continue their search from these points; this is beam search (2013, p. 128). But it could happen that our best nodes were all exploring a local minimum. To prevent this, we perform stochastic beam search and choose the m nodes at random, with better nodes being more likely to be chosen (2013, p. 129). The big difference to classic stochastic beam search is that a genetic algorithm combines two nodes to create new nodes instead of modifying a single node. Due to this and some other factors, it makes faster progress in the earlier stages (2013, p. 131).

In genetic algorithms (GA), a node or solution candidate is called an individual or sometimes a chromosome. The underlying data structure is the genotype of the individual, where information such as weights, the number of nodes, or a function is encoded in the form of genes. Most of the time, the information represented by the genes uses one of three types of alleles: a bit (1/0 or on/off), a bit string (e.g. 1011) or an alphanumerical value. Which type is better depends on the task and on the encoded information.
A chromosome is a one-dimensional array of length l, where l is the number of genes or traits we want to encode (Mitchell, 1998, p. 7). Each trait has a specific location in the genome, which means that changing the locus of a gene also changes the trait it represents, although there are models where a gene can switch its locus and still represent the same trait (Mitchell, 1998, p. 160). Depending on its fitness, i.e. on how valuable the solution candidate is, the organism may reproduce with another fit candidate. To create new offspring, the chromosomes of the parents are either copied or split at one or more points, and the crossover process then recombines the parts. It can also happen that a gene flips or changes its allele, based on a small mutation chance; this ensures diversity (Mitchell, 1998, p. 7). In Figure 19 we can observe how two chromosomes are recombined and how two genes are flipped during this process.
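Crossover at a single split point and bit-flip mutation can be sketched in Python as follows. The chromosomes are assumed to be lists of bits, and the function names as well as the parameter `p_m` for the mutation chance are chosen for illustration.

```python
import random


def crossover(parent_a, parent_b, point):
    """Single-point crossover: the two children swap the genes after `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, p_m, rng):
    """Flip every bit independently with the mutation chance p_m."""
    return [1 - gene if rng.random() < p_m else gene for gene in chromosome]


rng = random.Random(42)
parent_a = [0, 0, 0, 0, 0, 0]
parent_b = [1, 1, 1, 1, 1, 1]
child_a, child_b = crossover(parent_a, parent_b, point=2)
print(child_a, child_b)
print(mutate(child_a, p_m=0.1, rng=rng))
```

With more than one split point, the same idea applies: the children alternate between the parents at every split point.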


Figure 19: A recombination of two different chromosomes at a single split point. The chromosome is also mutated at two loci. Adapted from (Rojas, 1996, p. 431)

Like weight space, there is also a plot representation for genetic algorithms, called the fitness landscape. The landscape shows us the relationship between the various combinations of genes in the chromosome and the fitness of each combination; thus it has $l + 1$ dimensions, where the $(l + 1)$th dimension is the fitness value. Like any such space, the fitness landscape can have local and global minima and maxima. Most of the time the fitness landscape just shows us the fitness of all current individuals, which together are called the population, and serves as an indicator of where our population will probably evolve to. The difference between two individuals is measured by the number of genes in which they differ, the Hamming distance. Sometimes the difference between the alleles of the differing genes is included as well. In Figure 20 we can see a fitness landscape for a chromosome of length $l = 2$. (Mitchell, 1998, pp. 7, 8)

Figure 20: A fitness landscape for a population of four individuals (0,0), (1,0), (0,1) and (1,1). We can see that the fittest individual is (0,1). Adapted from (Mitchell, 1998, p. 8)

After we have evaluated the fitness of all individuals, we can begin to choose the individuals which are allowed to reproduce. We take a fixed number of m

individuals and, depending on their fitness, some individuals have a larger chance of being selected than others. This is fitness-proportionate selection or roulette-wheel sampling (Mitchell, 1998, p. 11). There are basically two types of selection: fitness-based selection and rank-based selection. For fitness-based selection, the probability of being selected is proportional to the ratio of the individual fitness to the mean fitness of the population. An alternative model, especially when the algorithm has been running for a longer time and the differences between the fitness values of the individuals are small, is rank-based selection. It assigns a fixed probability to each rank, where the highest ranked individual is the one with the highest fitness value and thus the one with the highest probability of being chosen for reproduction. (Koehn Philipp, 1994, p. 13)

Figure 21: Two roulette-wheel methods, based on either relative fitness or rank in the population. Adapted from (Koehn Philipp, 1994, p. 13).

After the parents have been selected, crossover and mutation happen. These processes, together with selection, are the genetic operators of our algorithm (Mitchell, 1998, p. 11). Crossover is defined by two parameters: the crossover rate $p_c$ and an optional splitting point mechanism (1998, p. 10). With probability $p_c$, the two parents combine their chromosomes according to the splitting point mechanism; otherwise they just copy themselves, creating two children, one for each parent. Mutation is defined by the mutation rate $p_m$ (1998, p. 10). This is a small number which tells us the chance of a single gene changing its allele by randomly choosing another allele. The mutation rate can be changed during the algorithm if we want to use adaptive mutation (1998, p. 75). This means that the mutation rate is lower if the parents are more diverse, i.e. have a larger Hamming distance, and higher if the parents are more similar, which helps to keep diversity when the individuals in a population become more similar.
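The two selection schemes can be sketched in Python as follows. The sketch assumes positive fitness values where a higher value is better. The linear assignment of probabilities to ranks is an assumption made for illustration; the text only requires that each rank has a fixed probability.

```python
import random


def fitness_proportionate_probs(fitnesses):
    """Selection probability proportional to the individual fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def rank_based_probs(fitnesses):
    """Assumed linear ranking: the worst individual gets weight 1, the best weight n."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])
    weights = [0] * n
    for rank, index in enumerate(order, start=1):
        weights[index] = rank
    total = n * (n + 1) / 2
    return [w / total for w in weights]


def roulette_select(population, probs, m, rng):
    """Spin the roulette wheel m times (the same individual may be drawn twice)."""
    return rng.choices(population, weights=probs, k=m)


rng = random.Random(7)
population = ["A", "B", "C"]
fitnesses = [10.0, 10.1, 9.9]  # nearly equal fitness values
print(fitness_proportionate_probs(fitnesses))
print(rank_based_probs(fitnesses))
print(roulette_select(population, rank_based_probs(fitnesses), m=4, rng=rng))
```

With nearly equal fitness values, fitness-proportionate selection gives almost uniform probabilities, whereas rank-based selection still clearly favours the best individual.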
After this process, the children become the new population and replace the old population, thus starting a new generation. A genetic algorithm runs for a designated number of generations, which together form a run. This distinction is important because every run, even if we do not change population size, mutation rate, crossover rate etc., will lead to a different solution, due to the random factor in search and reproduction and the fact that each starting population is randomly generated with a random combination of genes in the genome. Most of the time, a genetic algorithm is run a number of times, each with a specific number of generations; we can then look at the best evolved solutions or compare the performance (Mitchell, 1998, p. 11).

In the next sections, we will show how we can use neuroevolution to combine genetic algorithms and neural networks by evolving weights, the number of connections and units, and the learning parameter.
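The interplay of generations and runs described above can be made concrete with a minimal sketch. Everything specific here is an assumption for illustration: the toy fitness function one_max (number of 1-alleles), the parameter values, binary genes, and the function names. Selection is fitness-proportionate, crossover uses a single split point with rate p_c, and mutation flips each gene with rate p_m.

```python
import random


def one_max(chromosome):
    """Toy fitness (an assumption for illustration): number of 1-alleles."""
    return sum(chromosome)


def crossover(p1, p2, p_c, rng):
    """With probability p_c recombine at one split point, else copy the parents."""
    if len(p1) > 1 and rng.random() < p_c:
        point = rng.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1[:], p2[:]


def mutate(chromosome, p_m, rng):
    """Flip each binary gene independently with probability p_m."""
    return [1 - g if rng.random() < p_m else g for g in chromosome]


def run_ga(length=20, pop_size=30, generations=40, p_c=0.7, p_m=0.01, seed=None):
    """One run: a random starting population evolved for a fixed number of generations."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=one_max)
    for _ in range(generations):
        # Small offset keeps the roulette wheel valid if every fitness is zero.
        weights = [one_max(c) + 1e-9 for c in population]
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            c1, c2 = crossover(p1, p2, p_c, rng)
            children.extend([mutate(c1, p_m, rng), mutate(c2, p_m, rng)])
        population = children[:pop_size]  # children replace the old population
        best = max(population + [best], key=one_max)
    return best


if __name__ == "__main__":
    # Several runs with different random seeds generally end in different solutions.
    for seed in range(3):
        result = run_ga(seed=seed)
        print(seed, one_max(result), result)
```

Fixing the seed makes a single run reproducible; comparing several seeds shows why results are usually reported over a number of runs rather than one.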

B. Evolving Weights

Classic backpropagation is not the most efficient method for finding weights and cannot be used for all learning problems. When we have no knowledge of the gradient of the learning space, or have no data pairs and move away from supervised learning to unsupervised learning or to reinforcement learning with sparse feedback, backpropagation will fail or perform poorly. In addition, learning in a fully connected recurrent network poses a problem, as does learning with non-differentiable thresholds and nodes. Additionally, if we have a big network or a big set of training data, a genetic algorithm could be faster than classic backpropagation. (Schaffer et al., 1992, p. 12)

Each chromosome represents a vector of genes for the weights of a network with a fixed number of neurons, as in Figure 22. The fitness of an individual is calculated by running the network on a set of training data (thus, supervised learning) and then calculating the total L2 loss, i.e. the sum of the squared errors. Here, a smaller number means a higher fitness for the individual chromosome. (Montana D. & Davis L., 1989, p. 764) (Mitchell, 1998, p. 67)

Then, based on fitness, we choose a number of fit chromosomes and recombine them: for each non-input neuron (in Figure 22 neurons 4, 5, and 6) we pick a random parent and copy that parent's incoming weights to create an offspring. In Figure 22 we can observe that the child has copied the incoming weights for neuron 4 from the left-hand parent and the incoming weights for neurons 5 and 6 from the right-hand parent. If applicable, we also mutate the incoming weights of a neuron by adding a number between -1 and 1, without exceeding the limits of -1 and 1 (Mitchell, 1998, p. 68), preferring weak nodes with very small weights, which don't contribute to or in fact hinder the network (Montana D. & Davis L., 1989, p. 765).

Figure 22: Two parents create an offspring with crossover. The chromosome representation is depicted under the network. Adapted from (Mitchell, 1998, p. 69)
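A rough sketch of the node-wise crossover and mutation described above follows. The data layout (a dictionary mapping each non-input neuron to the list of its incoming weights), the example topology, and all function names are assumptions made for illustration. Unlike the cited approach, this sketch picks the neurons to mutate uniformly at random and does not prefer weak nodes; the clipping to [-1, 1] follows the limit stated in the text.

```python
import random


def crossover_nodes(parent_a, parent_b, rng):
    """For every non-input neuron, copy all incoming weights from a random parent."""
    child = {}
    for neuron in parent_a:
        donor = parent_a if rng.random() < 0.5 else parent_b
        child[neuron] = list(donor[neuron])
    return child


def mutate_nodes(weights, n_nodes, rng, limit=1.0):
    """Add a random value in [-1, 1] to each incoming weight of n_nodes neurons.

    Neurons are chosen uniformly (a simplification); results are clipped to
    [-limit, limit].
    """
    mutated = {neuron: list(w) for neuron, w in weights.items()}
    chosen = rng.sample(sorted(mutated), k=min(n_nodes, len(mutated)))
    for neuron in chosen:
        mutated[neuron] = [
            max(-limit, min(limit, w + rng.uniform(-1.0, 1.0)))
            for w in mutated[neuron]
        ]
    return mutated


def l2_loss(outputs, targets):
    """Total squared error; a smaller loss means a fitter individual."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical network: inputs 1-3, hidden neurons 4 and 5, output neuron 6.
    left = {4: [0.3, -0.2, 0.7], 5: [-0.4, 0.1, 0.9], 6: [0.5, -0.8]}
    right = {4: [-0.6, 0.2, 0.1], 5: [0.8, -0.3, 0.4], 6: [-0.1, 0.6]}
    child = crossover_nodes(left, right, rng)
    print(child)
    print(mutate_nodes(child, n_nodes=1, rng=rng))
```

Copying the incoming weights of a neuron as one block keeps the weights that jointly define a unit's behaviour together, which is the motivation for recombining per neuron rather than per gene.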

D. Montana and L. Davis (1989) applied this approach to evolve weights for an underwater sonic detector. They used a fully connected feed-forward network with four input units, one output unit and two hidden layers, the first with 7 hidden units and the second with 10 hidden units. The fitness was evaluated with the help of 236 training examples; it was based on the total L2 loss, and individuals were selected according to their rank in the population. The population of the genetic algorithm was fixed at 50 (where each individual is a specific network) and was run for 200 generations, which is equivalent to 10,000 network evaluations (50 individuals × 200 generations), where one network evaluation means running all examples through the network and returning a loss. A backpropagation algorithm was allowed to run for 5000 training epochs (where one epoch is learning on all 236 examples) (Mitchell, 1998, p. 68). This means two individuals are treated as equivalent to one backpropagation epoch. These numbers were chosen because one backpropagation epoch has to feed-forward 236 examples and then back-propagate the error 236 times, whereas an individual only has to feed-forward the 236 examples and calculate the sum of the error. Still, two individuals are faster at computing their total L2 loss than one backpropagation epoch. (Montana D. & Davis L., 1989, p. 767)

In Figure 23 we can see how the genetic algorithm outperforms backpropagation in terms of speed and error minimization, providing us with a better solution while using less computational power (Mitchell, 1998, p. 69). This does not mean per se that backpropagation is bad, but that, depending on the problem, some approaches work better than others and that knowledge about the specific domain of application helps with finding a satisfying solution (Montana D. & Davis L., 1989, p. 767). Apart from that, a different backpropagation algorithm (such as quickprop) might have yielded better results (Mitchell, 1998, p. 70).

Figure 23: Comparing the performance of the genetic algorithm to backpropagation on the basis of total loss. Adapted from (Mitchell, 1998, p. 70)

C. Evolving Connections

i. Direct Encoding

Another problem is finding the right number of connections and the right number of nodes. Generally we have to choose them before we start learning, or we create a number of networks and test them manually. Because there are no special rules but only general guidelines, like more nodes and layers for more complex problems, we are left estimating and guessing, or we can use genetic algorithms (Mitchell, 1998, p. 70). There are basically two ways to encode the needed information in the genome: direct encoding and grammar encoding.

Direct encoding is used to determine the connections of a network of fixed size. We take an N × N matrix for N units, where each entry encodes a connection. In Figure 24 you can see a network encoded with a connection matrix and the corresponding chromosome for the network. Each row depicts a single neuron i, and each element in the row stands for an outgoing connection. With this, one can create simple feed-forward networks or recurrent networks, although it is possible to ignore connections leading back to former layers or input layers as well as self-referencing nodes (Mitchell, 1998, p. 71). Measuring fitness is similar to the approach taken by Montana and Davis: we create our network, learn the weights with either backpropagation or a genetic algorithm, and then measure the sum of the squared error (L2 loss). Crossover takes a randomly chosen row from each parent, and mutation then flips alleles in the columns (Mitchell, 1998, p. 72).

The major disadvantages of this system are the quadratic growth of the genome with N², where the genome becomes bigger and bigger until computing becomes too slow, the fact that it cannot repeat structures or create nested structures, and that it cannot create networks of varying size as easily (Schaffer et al., 1992, p. 18).

Figure 24: A direct coding of a network, with the representing chromosome. Adapted from (Schaffer et al., 1992)
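To illustrate direct encoding, the following sketch flattens an N × N connection matrix into a chromosome of N² binary genes and back, and applies a row-wise crossover and bit-flip mutation. The function names and the example network are hypothetical, and the crossover shown (exchanging one randomly chosen row between two parents) is one possible reading of the operator described above, not a confirmed reproduction of it.

```python
import random


def matrix_to_chromosome(matrix):
    """Concatenate the rows of the connection matrix into one gene vector."""
    return [bit for row in matrix for bit in row]


def chromosome_to_matrix(chromosome, n):
    """Rebuild the n x n connection matrix from a chromosome of n * n genes."""
    if len(chromosome) != n * n:
        raise ValueError("chromosome length must be n * n")
    return [chromosome[i * n:(i + 1) * n] for i in range(n)]


def row_crossover(parent_a, parent_b, rng):
    """Exchange one randomly chosen row (one neuron's outgoing connections)."""
    row = rng.randrange(len(parent_a))
    child_a = [list(r) for r in parent_a]
    child_b = [list(r) for r in parent_b]
    child_a[row], child_b[row] = list(parent_b[row]), list(parent_a[row])
    return child_a, child_b


def flip_mutation(matrix, p_m, rng):
    """Flip each connection bit independently with probability p_m."""
    return [[1 - bit if rng.random() < p_m else bit for bit in row] for row in matrix]


if __name__ == "__main__":
    # Hypothetical 5-unit network: units 0 and 1 feed units 2 and 3, which feed unit 4.
    # Entry [i][j] == 1 means a connection from unit i to unit j.
    matrix = [
        [0, 0, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0],
    ]
    chromosome = matrix_to_chromosome(matrix)
    print(len(chromosome), chromosome)  # N * N genes for N units
    print(chromosome_to_matrix(chromosome, 5) == matrix)
```

The genome length of N² genes is what makes this encoding expensive for larger networks: doubling the number of units quadruples the chromosome.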

ii. Grammar Encoding

Due to the mentioned problems of direct encoding, Kitano (1990) proposed another way of encoding network structure, in the form of a formal language called graph-generation grammar (Mitchell, 1998, p. 73).

Figure 25: a) A base set of grammar rules. b) An example of an executed grammar rule, unfolded to show the connection matrix. c) The resulting network. Adapted from (Mitchell, 1998, p. 74)

An example of such a generated grammar rule set can be seen in Figure 25.a. The grammar consists of non-terminal upper-case symbols, which can be replaced by other non-terminal symbols or by terminal lower-case symbols. The lower-case symbols are predetermined and not encoded in the genome; they represent all possible 2 × 2 combinations of 0 and 1. This means that there are 16 terminal symbols a–p, and the algorithm can encode up to 26 different patterns as non-terminal symbols A–Z. The algorithm then generates rules based on these symbols, where a gene consists of five symbols, the first being the symbol that is substituted by the following four symbols. An example of a genome can be seen in Figure 26, which is then used to create the connection matrix in Figure 25.b. Every genome starts with a gene for the start symbol S; everything else is subject to the algorithm (Mitchell, 1998, p. 74).

Figure 26: A part of the genome for the network displayed in Figure 25. Adapted from (Mitchell, 1998, p. 74)
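The unfolding of such a genome into a connection matrix can be sketched as follows. Several details are assumptions for illustration only: the example genome is made up and is not the one from Figure 26, the order in which the letters a–p are mapped to the 16 possible 2 × 2 blocks is arbitrary, the first rule found for a symbol is used, and the sketch assumes exactly three rewriting steps (S to upper-case symbols, to lower-case symbols, to bits).

```python
def terminal_blocks():
    """Map 'a'..'p' to the 16 possible 2 x 2 blocks of 0/1.

    Assumed ordering: the letter index read as a 4-bit number,
    so a = 0000, b = 0001, ..., p = 1111 (row by row).
    """
    blocks = {}
    for k in range(16):
        bits = [(k >> shift) & 1 for shift in (3, 2, 1, 0)]
        blocks[chr(ord("a") + k)] = [bits[:2], bits[2:]]
    return blocks


def parse_genome(genome):
    """Split the genome into genes of five symbols: one left-hand side symbol
    followed by the four symbols of its 2 x 2 replacement."""
    if len(genome) % 5 != 0:
        raise ValueError("genome length must be a multiple of five")
    rules = {}
    for i in range(0, len(genome), 5):
        lhs, rhs = genome[i], genome[i + 1:i + 5]
        if lhs not in rules:  # assumption: the first rule for a symbol wins
            rules[lhs] = [list(rhs[:2]), list(rhs[2:])]
    return rules


def expand(grid, rules):
    """Replace every symbol by its 2 x 2 block, doubling the side length."""
    size = len(grid)
    out = [[None] * (2 * size) for _ in range(2 * size)]
    for r, row in enumerate(grid):
        for c, symbol in enumerate(row):
            block = rules[symbol]
            for dr in range(2):
                for dc in range(2):
                    out[2 * r + dr][2 * c + dc] = block[dr][dc]
    return out


def unfold(genome, steps=3):
    """Unfold the start symbol S into a connection matrix of 0s and 1s."""
    rules = {**terminal_blocks(), **parse_genome(genome)}
    grid = [["S"]]
    for _ in range(steps):
        grid = expand(grid, rules)
    if not all(isinstance(x, int) for row in grid for x in row):
        raise ValueError("grammar did not reach terminal bits in the given steps")
    return grid


if __name__ == "__main__":
    # Hypothetical genome with five genes: S, A, B, C, D.
    genome = "SABCD" "Acpac" "Baaae" "Caaaa" "Daaab"
    for row in unfold(genome):
        print(row)
```

In this example a genome of 25 symbols unfolds into an 8 × 8 connection matrix with 64 entries, which shows how grammar encoding can keep the genome smaller than the matrix it describes.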

Then, after the whole genome has been unfolded into a connection matrix, we take the same approach as with direct encoding, where every row represents a neuron (here starting at 0) and every column depicts the receiving neuron. In addition to this, Kitano extended the meaning of the connection matrix by defining that a neuron i is only present in the network structure if the cell at row i and column i is a 1; otherwise the connections from and to the neuron are ignored. Depending on what we want, we can also ignore connections back to former layers or input layers. (Mitchell, 1998, p. 73)

The big advantage of this approach is the smaller genome and the possibility to create clusters and patterns of structure designs, which may help with some specific problems. Additionally, the network size is variable because the connection matrix has no fixed dimensions. This approach could also create sub-networks, which could solve problems in a way similar to a divide-and-conquer approach. Kitano tested and compared his grammar encoding with classic direct encoding, measuring fitness and performance as the total L2 loss, and concluded that grammar encoding outperformed direct encoding due to the smaller genome and repeated structures (Figure 27) (Mitchell, 1998, p. 75).

Figure 27: A direct comparison of both types of encoding. Adapted from (Mitchell, 1998, p. 76)
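The rule that a neuron only exists if its diagonal cell is 1 can be sketched as a small decoding step. The function name and the example matrix are hypothetical, and the optional feed-forward filter assumes that neurons are numbered so that a lower index means an earlier position in the network.

```python
def decode_network(matrix, feed_forward_only=True):
    """Return the present neurons and their connections.

    Neuron i is present only if matrix[i][i] == 1. Entry [i][j] == 1 with
    i != j is a connection from neuron i to neuron j, kept only if both
    neurons are present. With feed_forward_only, connections to a lower
    index are dropped (assumes index order follows the network order).
    """
    n = len(matrix)
    present = [i for i in range(n) if matrix[i][i] == 1]
    connections = []
    for i in present:
        for j in present:
            if i == j or matrix[i][j] != 1:
                continue
            if feed_forward_only and j < i:
                continue
            connections.append((i, j))
    return present, connections


if __name__ == "__main__":
    # Hypothetical unfolded matrix: neuron 1 has a 0 on the diagonal and is absent.
    matrix = [
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 0, 1],
    ]
    print(decode_network(matrix))
    print(decode_network(matrix, feed_forward_only=False))
```

Because absent neurons simply drop out, matrices of the same dimensions can describe networks of different sizes.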
