2D1431 Machine Learning: Exam preparation



Question of the Day

Take two ordinary Swedish kronor coins and touch them together. Tough, huh? Now take a third coin and position it so that it touches the other two. How many coins can you add so that each coin touches every other coin?

Exam preparation

Exam
- Time: Saturday 14/12/
- Place: L21+L22, Drottn. Kristinas Väg 30
- No books or other study aids allowed
- Language: English & Swedish
- About 14 theoretical and practical questions
- 40 points max. (2-5 points per question)
- Grade: 0-22 points: U; ... points: ...; ... points: ...; ... points: 5

Exam topics
- Introduction to Machine Learning
- Concept Learning
- Decision Tree Learning
- Artificial Neural Networks
- Bayesian Learning
- Boosting
- Computational Learning Theory
- Evolutionary Algorithms
- Reinforcement Learning / Dynamic Programming

Exam Material
- Machine Learning book: chapters 1, 2, 3, 4, 6, 7, 8, 9, 13
- Articles:
  - Tesauro: TD-Gammon
  - Schapire: A Brief Introduction to Boosting
  - Kaelbling: Reinforcement Learning: A Survey (pp. 1-19)
  - Schwefel, Bäck: An Overview of Evolutionary Algorithms for Parameter Optimization
- Lecture slides
- Lab description and assignments

Theoretical Exam Questions
- Describe, explain and compare machine learning algorithms
- Describe and explain key concepts, definitions and principles of machine learning
- No proofs or derivations
- Knowledge of some key equations, e.g. backpropagation and perceptron learning rule, k-nearest neighbors, regression, Bayes theorem, TD-learning rule, policy iteration, information gain, entropy

Practical Exam Questions
- Related to lab assignments
- Simple calculations that you can do by hand
- Good knowledge of the algorithms that you programmed or applied in the labs:
  - Candidate elimination
  - Decision tree learning
  - Bayesian learning
  - Boosting
  - Cross-validation
  - Instance based learning
  - Dynamic programming
  - Temporal difference learning

Practical

To which of the three learning paradigms (learning with a teacher, learning with a critic, unsupervised learning) do the following algorithms belong?
- Instance based learning
- Reinforcement learning
- Backpropagation (multi-layer perceptron)
- Evolutionary algorithms
- Expectation maximization (EM)
- Decision tree learning

Practical

To which of the three learning paradigms (learning with a teacher, learning with a critic, unsupervised learning) do the following algorithms belong?
- Instance based learning (supervised)
- Reinforcement learning (learning with a critic)
- Backpropagation (multi-layer perceptron) (supervised)
- Evolutionary algorithms (learning with a critic)
- Expectation maximization (EM) (unsupervised)
- Decision tree learning (supervised)

Practical

Assume the following error function:

$E(w) = \tfrac{1}{2}\sigma^2 - r_d w + \tfrac{1}{2} r_x w^2$

where $\sigma$, $r_d$ and $r_x$ are constants. Explore the method of gradient descent on the single weight $w$. Write down the update equation for $w(k+1)$ given $w(k)$. Find the optimum value of $w$ for which $E(w)$ becomes minimal.

Gradient:

$\frac{dE}{dw} = -r_d + r_x w$

The weight changes in the direction of the negative gradient:

$-\frac{dE}{dw} = r_d - r_x w$

Update rule:

$w(k+1) = w(k) - \alpha \frac{dE}{dw} = w(k) + \alpha \, (r_d - r_x w(k))$

This is a quadratic optimization problem: $\frac{dE}{dw} = 0$ implies

$w_{opt} = \frac{r_d}{r_x}$, at which $E(w_{opt}) = \tfrac{1}{2}\sigma^2 - \tfrac{1}{2}\frac{r_d^2}{r_x}$

A basic limitation of the single-layer perceptron is that it cannot implement the XOR (exclusive-or) function. Explain the reason for this limitation. Are there Boolean functions that cannot be implemented with a two-layer perceptron? Explain why.
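The gradient-descent update for the error function $E(w)$ above can be checked numerically. The following is a minimal sketch, not part of the original slides; the constants `sigma`, `r_d`, `r_x`, the learning rate `alpha` and the number of steps are hypothetical values chosen only for illustration.

```python
def error(w, sigma, r_d, r_x):
    """E(w) = 1/2 sigma^2 - r_d w + 1/2 r_x w^2."""
    return 0.5 * sigma ** 2 - r_d * w + 0.5 * r_x * w ** 2


def gradient_descent(r_d, r_x, alpha, w0=0.0, steps=200):
    """Iterate w(k+1) = w(k) - alpha * dE/dw = w(k) + alpha * (r_d - r_x * w(k))."""
    w = w0
    for _ in range(steps):
        grad = -r_d + r_x * w  # dE/dw at the current weight
        w = w - alpha * grad   # step along the negative gradient
    return w


# Hypothetical constants, assumed for illustration only.
sigma, r_d, r_x = 1.0, 0.8, 2.0

w_final = gradient_descent(r_d, r_x, alpha=0.1)
w_opt = r_d / r_x  # closed-form optimum from dE/dw = 0

print(w_final, w_opt)
print(error(w_opt, sigma, r_d, r_x))
```

For this quadratic error the iteration contracts toward $w_{opt}$ only if $0 < \alpha < 2 / r_x$ (with $r_x > 0$); larger learning rates make the weight oscillate with growing amplitude.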

A perceptron can only learn linearly separable concepts. The four points of the XOR problem in a two-dimensional space cannot be separated by a line.

A multi-layer perceptron with sigmoid activation can implement arbitrary Boolean functions. Proof by construction:
- Write the Boolean function in disjunctive normal form (DNF), for example (X1 and not X2 and X3) or (X1 and X2 and not X3).
- For each term add a hidden neuron with weight +1 if x_i occurs as a literal and weight -1 if not x_i occurs as a literal. With inputs coded as -1/+1 the threshold is n - 0.5, where n is the number of inputs that occur in the term. With inputs coded as 0/1 the threshold is p - 0.5, where p is the number of non-negated literals in the term.
- The weights from the hidden to the output layer are all +1 and the threshold of the output neuron is 0.5, so it computes the OR (at least one hidden neuron active) of all the conjunctions.

Describe the interaction between policy update and policy evaluation in the policy iteration algorithm.

- Policy evaluation step: compute the state value function $V^{\pi}$ for the current policy $\pi$.
- Policy improvement (update) step: update the current policy with respect to the state value function by taking, in each state, the action that maximizes the sum of the reward and the V-value of the successor state.

1. Start with an arbitrary initial policy $\pi_0$.
2. Compute the state value $V^{\pi_n}(s)$ and/or the state-action value $Q^{\pi_n}(s,a)$ for all states $s$ and actions $a$ using policy evaluation (iterate the Bellman equations):
   $V_{k+1}^{\pi}(s) = r(s,\pi(s)) + \gamma V_k^{\pi}(\delta(s,\pi(s)))$
3. For each state $s$ compute the improved (greedy) policy:
   $\pi_{n+1}(s) = \arg\max_a Q^{\pi_n}(s,a) = \arg\max_a \left[ r(s,a) + \gamma V^{\pi_n}(\delta(s,a)) \right]$
4. Repeat steps 2 and 3 until $V$ no longer changes.

Practical

The figure shows a neural network involving a single hidden neuron, but where the input neurons are directly connected to the output. Specify the weights w_ij and bias terms θ_i such that the network solves the XOR problem.

x     t
0,0   0
0,1   1
1,0   1
1,1   0

[Figure: network with weights w_13, w_23, w_14, w_24, w_34 and bias terms θ_3, θ_4]
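The four policy iteration steps listed above can be sketched for a deterministic environment with reward $r(s,a)$ and successor function $\delta(s,a)$. This is an illustrative sketch, not the course's lab code: the three-state chain, its rewards and the discount factor are hypothetical, and the evaluation step updates values in place rather than keeping separate $V_k$ and $V_{k+1}$ tables.

```python
def policy_iteration(states, actions, delta, reward, gamma=0.9, tol=1e-10):
    """Policy iteration for a deterministic MDP given as lookup tables."""
    policy = {s: actions[0] for s in states}  # step 1: arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Step 2: policy evaluation, iterating the Bellman equation for the
        # current policy until the values stop changing (in-place sweeps).
        while True:
            max_change = 0.0
            for s in states:
                a = policy[s]
                v_new = reward[(s, a)] + gamma * V[delta[(s, a)]]
                max_change = max(max_change, abs(v_new - V[s]))
                V[s] = v_new
            if max_change < tol:
                break
        # Step 3: policy improvement, greedy with respect to V.
        new_policy = {
            s: max(actions, key=lambda a: reward[(s, a)] + gamma * V[delta[(s, a)]])
            for s in states
        }
        # Step 4: stop once the policy (and hence V) no longer changes.
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical chain A - B - C: staying in C by moving right pays reward 1.
states = ["A", "B", "C"]
actions = ["left", "right"]
delta = {
    ("A", "left"): "A", ("A", "right"): "B",
    ("B", "left"): "A", ("B", "right"): "C",
    ("C", "left"): "B", ("C", "right"): "C",
}
reward = {key: 0.0 for key in delta}
reward[("C", "right")] = 1.0

policy, V = policy_iteration(states, actions, delta, reward, gamma=0.9)
print(policy)
print(V)
```

In this toy chain the alternation is visible directly: each evaluation makes the value of a state closer to C positive, and the next improvement step then switches the neighbouring state to "right".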

Practical

The figure shows a neural network involving a single hidden neuron, but where the input neurons are directly connected to the output. Specify the weights w_ij and bias terms θ_i such that the network solves the XOR problem.

x     t
0,0   0
0,1   1
1,0   1
1,1   0

[Figure: inputs 1 and 2 feed hidden neuron 3 (weights w_13, w_23, bias θ_3) and output neuron 4 (weights w_14, w_24); neuron 3 feeds neuron 4 (weight w_34, bias θ_4)]

Solution: w_13 = 1, w_23 = 1, θ_3 = -1.5, w_14 = 1, w_24 = 1, w_34 = -2, θ_4 = -0.5

Practical

Draw the Bayesian belief network that represents the conditional independence assumption of the naïve Bayes classifier for the PlayTennis problem. Give the conditional probability tables associated with the nodes Wind and Outlook. What is, according to the Bayesian belief network, the probability of playing tennis on a day with weak wind and sunny outlook, assuming one ignores the other attributes?

[Table: training examples D1-D14 with columns Day, Outlook, Temp., Humidity, Wind, Play Tennis]

[Figure: Bayesian belief network with nodes PlayTennis, Outlook, Humidity, Temperature and Wind, with conditional probability tables for Outlook and Wind]
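The XOR weights given in the solution above can be verified by evaluating the network on all four inputs. This is a small sketch under two assumptions that the slides do not state explicitly: each neuron uses a hard threshold (output 1 if the net input is positive, else 0), and the bias terms θ are added to the weighted sum.

```python
def step(z):
    """Hard threshold activation (assumed): 1 for positive net input, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Inputs 1 and 2, hidden neuron 3, output neuron 4, weights from the slide."""
    w13, w23, theta3 = 1, 1, -1.5
    w14, w24, w34, theta4 = 1, 1, -2, -0.5
    h = step(w13 * x1 + w23 * x2 + theta3)          # neuron 3 acts as AND
    return step(w14 * x1 + w24 * x2 + w34 * h + theta4)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))
```

The hidden neuron fires only for input (1,1); its weight of -2 into the output then cancels the two direct connections, which is what turns the OR-like output unit into XOR.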

Bayesian Belief Network

Naïve Bayes classifier:

$P(\langle a_1, a_2, \ldots, a_n \rangle \mid c_j) = \prod_i P(a_i \mid c_j)$

$P(c_j \mid \langle a_1, a_2, \ldots, a_n \rangle) \propto P(c_j) \prod_i P(a_i \mid c_j)$

P(yes) = 9/14, P(no) = 5/14

P(yes | wind=weak, outlook=sunny) ~ 9/14 * 0.66 * 0.22
P(no | wind=weak, outlook=sunny) ~ 5/14 * 0.66 * 0.4

Normalization: P(yes | wind=weak, outlook=sunny) is the score for yes divided by the sum of the scores for yes and no.

Practical

Assume the current action value function of a system with three states s_1, s_2, s_3 and two actions a_1, a_2 in each state:

Q(s_1,a_1) = 10, Q(s_1,a_2) = 5
Q(s_2,a_1) = 15, Q(s_2,a_2) = 0
Q(s_3,a_1) = 5,  Q(s_3,a_2) = 10

The agent observes the following episode of states, actions and rewards:

s_1, a_1, r=5
s_3, a_2, r=10
s_2, a_1, r=5
s_2, a_2, r=10
s_3, a_2

Compute the new values of Q(s,a) for the visited states and actions, assuming a discount factor γ = 0.8 and a learning rate α = 0.5. Consider on-policy TD-learning.

Update rule:

$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right]$

$Q(s_1,a_1) \leftarrow 10 + 0.5 \, [5 + 0.8 \cdot 10 - 10] = 11.5$
$Q(s_3,a_2) \leftarrow 10 + 0.5 \, [10 + 0.8 \cdot 15 - 10] = 16$
$Q(s_2,a_1) \leftarrow 15 + 0.5 \, [5 + 0.8 \cdot 0 - 15] = 10$
$Q(s_2,a_2) \leftarrow 0 + 0.5 \, [10 + 0.8 \cdot 16 - 0] = 11.4$

The last update uses the already updated value Q(s_3,a_2) = 16.

Explain the difference between the perceptron learning rule and the delta rule for a single-layer network. In particular, from what principles are the rules derived, and under what conditions do the algorithms converge and find a solution that classifies all training examples correctly?
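The on-policy TD (SARSA) updates for the episode above can be replayed in a few lines. This is a minimal sketch; the tuple layout of `episode` and the function name `sarsa_episode` are choices made here for illustration, while the Q-values, rewards, γ and α come from the exercise.

```python
def sarsa_episode(Q, episode, alpha, gamma):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)] along an episode."""
    Q = dict(Q)  # work on a copy so the initial table stays untouched
    for (s, a, r), (s_next, a_next, _) in zip(episode, episode[1:]):
        td_target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q


Q0 = {
    ("s1", "a1"): 10.0, ("s1", "a2"): 5.0,
    ("s2", "a1"): 15.0, ("s2", "a2"): 0.0,
    ("s3", "a1"): 5.0,  ("s3", "a2"): 10.0,
}

# (state, action, reward received after taking the action); the final
# pair has no reward because the episode ends there.
episode = [
    ("s1", "a1", 5), ("s3", "a2", 10), ("s2", "a1", 5),
    ("s2", "a2", 10), ("s3", "a2", None),
]

Q1 = sarsa_episode(Q0, episode, alpha=0.5, gamma=0.8)
for key in [("s1", "a1"), ("s3", "a2"), ("s2", "a1"), ("s2", "a2")]:
    print(key, round(Q1[key], 3))
```

Because the updates are applied in episode order, the update of Q(s_2,a_2) bootstraps from the value of Q(s_3,a_2) that was already changed earlier in the same episode.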

The perceptron rule is derived from the manipulation of a hyperplane separating the two classes. If an example is wrongly classified ($t \neq y$), the weight vector $w$ is moved either towards or away from the input $x$. Learning rule:

$\Delta w = \alpha \, (t - y) \, x$

The delta rule is derived from gradient descent (with respect to the weights) on the quadratic error between target and network output:

$\Delta w = -\alpha \, \frac{\partial E}{\partial w}$

The perceptron rule converges if and only if the training patterns are linearly separable and the learning rate is sufficiently small. In that case it finds a solution that classifies all examples correctly.

The delta rule always converges to the solution with minimum quadratic error if the learning rate is sufficiently small (decreasing over time), even if the examples are not linearly separable. For a one-layer ANN the delta rule converges to the unique global minimum (there are no local minima). The network with minimal error might still classify examples incorrectly even if the training data is linearly separable, and it will never classify all examples correctly if the data is not linearly separable.

Genetic algorithms and neural networks are both adaptive but in different ways. Briefly describe how both algorithms draw inspiration from biology. Which key concepts and mechanisms do both approaches inherit from their biological counterparts?

Genetic algorithms are inspired by natural evolution and genetics. GA key concepts and mechanisms:
- learning on the population level
- the focus is on adaptation to the environment rather than just learning
- a population of competing individuals
- the concept of a fitness that describes the chance of producing offspring
- Darwinian selection: survival of the fittest
- generation of new variants through mutation and recombination
- the concept of a genetic code (DNA) that serves as a blueprint for a phenotype

Neural networks are inspired by the biological neural processes occurring in human and animal brains or nervous systems. Mechanisms and key concepts:
- inductive learning on an individual basis
- distributed parallel processing: many computationally simple but highly connected processing elements (neurons) operate in parallel
- learning takes place in the synaptic connections (weights)
- robustness towards noisy, incomplete patterns
- the ability to generalize from a set of training examples to previously unseen instances
- the activity of a neuron depends on the summation of weighted inputs from other neurons
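To make the earlier comparison between the perceptron rule and the delta rule concrete, the sketch below trains a single unit on the Boolean AND function with both rules. It is an illustration under assumptions, not lab code: the AND data set, the learning rates, the epoch counts and the use of a batch (summed) gradient for the delta rule are all choices made here.

```python
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def net_input(w, x):
    """Weighted sum; w[0] is the bias weight and x[0] is the constant input 1."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(data, alpha=0.5, epochs=100):
    """Perceptron rule: change weights only on misclassified examples."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        errors = 0
        for (x1, x2), t in data:
            x = (1.0, x1, x2)
            y = 1 if net_input(w, x) > 0 else 0      # thresholded output
            if y != t:
                errors += 1
                w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:                              # all examples correct
            break
    return w


def train_delta(data, alpha=0.1, epochs=2000):
    """Delta rule: batch gradient descent on the quadratic error of a linear unit."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), t in data:
            x = (1.0, x1, x2)
            y = net_input(w, x)                      # linear output, no threshold
            grad = [g - (t - y) * xi for g, xi in zip(grad, x)]
        w = [wi - alpha * g for wi, g in zip(w, grad)]
    return w


w_perceptron = train_perceptron(AND_DATA)
w_delta = train_delta(AND_DATA)
print(w_perceptron)
print([round(wi, 4) for wi in w_delta])
```

The perceptron stops as soon as every example is classified correctly, so its final weights depend on the presentation order. The delta rule instead approaches the least-squares weights, which for this data set are a bias of -0.25 and input weights of 0.5 each.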

Explain the terms reward, policy, state value function and action value function in the context of reinforcement learning. In what way do dynamic programming and temporal difference learning differ? What is an optimal policy and how is it computed from the state value function?

Genetic algorithms, evolution strategies and genetic programming are three different variants of evolutionary algorithms. Describe which concepts they share and what their differences are. For each method, describe in detail one key feature in which it differs from the others.

Question of the Day

You can arrange up to five coins in a way that each coin touches every other coin. Starting with one horizontal coin as the base, arrange two additional coins to form a small "tent" over the first, resting lightly on top of it. The tricky part is adding two more. These two must touch each other at the tip, with their bases just barely filling the gap left by the previous two. You would think that this would make for an excellent bar bet, but before you challenge someone with this, try it yourself. It's a lot harder than it appears: you may need additional objects to keep the coins from sliding about as you try to arrange them.
