
The Government of the Russian Federation
The Federal State Autonomous Institution of Higher Education "National Research University - Higher School of Economics"
Faculty of Business Informatics
Department of Innovation and Business in Information Technology
Master's Program 38.04.05 "Big Data Systems"

Author: Dr. Sci., Prof. Andrey Dmitriev, a.dmitriev@hse.ru
Moscow, 2014

This document may not be reproduced or redistributed by other Departments of the University without permission of the Authors.

Field of Application and Regulations

The course "Applied Machine Learning" syllabus lays down minimum requirements for students' knowledge and skills; it also provides a description of both the contents and the forms of training and assessment in use. The course is offered to students of the Master's Program "Big Data Systems" (area code 080500.68) in the Faculty of Business Informatics of the National Research University "Higher School of Economics". The course is part of the curriculum pool of elective courses (1st year, M.2.B.1 Optional courses, M.2 Courses required by the Master's program of the 2014-2015 academic year's curriculum), and it is a two-module course (3rd and 4th modules). The duration of the course amounts to 48 class periods (lectures and practice), divided into 24 lecture hours and 24 practice hours. In addition, 96 academic hours are set aside for students' self-study.

The syllabus is prepared for teachers responsible for the course (or closely related disciplines), teaching assistants, students enrolled on the course "Applied Machine Learning", as well as experts and statutory bodies carrying out assigned or regular accreditations in accordance with the educational standards of the National Research University Higher School of Economics, the curriculum ("Business Informatics", area code 38.04.05), Big Data Systems specialization, 1st year, 2014-2015 academic year.

1 Course Objectives

The main objective of the course is to present, examine and discuss with students the fundamentals and principles of machine learning. The course is focused on understanding the role of machine learning in big data analysis. The objective can be thought of as a combination of the following constituents:
familiarity with the peculiarities of supervised learning, parametric and multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, kernel machines and Bayesian estimation as applied areas of big data analysis,
understanding of the main notions of machine learning theory and of the framework of machine learning as the most significant areas of big data analysis,
understanding of the role of machine learning in big data analysis,
obtaining skills in applying machine learning to big data analysis.

2 Students' Competencies to be Developed by the Course

While mastering the course material, the student will:
know the main notions of supervised learning, parametric and multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, kernel machines and Bayesian estimation,
acquire skills of big data analysis,
gain experience in big data analysis using the main notions of supervised learning, parametric and multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, kernel machines and Bayesian estimation.

In short, the course contributes to the development of the following professional competencies:

Competencies (FSES/HSE code; descriptors, i.e. main mastering features and indicators of result achievement; training forms and methods contributing to the formation and development of the competence):
SC-2. Ability to offer concepts and models, and to invent and test methods and tools for professional work. Descriptor: Demonstrates. Training forms and methods: lecture, practice, home tasks.
PC-13. Ability to apply the methods of system analysis and modeling to assess, design and develop the strategy of enterprise architecture. Descriptor: Owns and uses. Training forms and methods: lecture, practice, home tasks.
PC-14. Ability to develop and implement economic and mathematical models to justify project solutions in the field of information and computer technology. Descriptor: Owns and uses. Training forms and methods: lecture, practice, home tasks.
PC-16. Ability to organize individual and collective research work in the enterprise and manage it. Descriptor: Demonstrates. Training forms and methods: lecture, practice, home tasks.

3 The Course within the Program's Framework

The course "Applied Machine Learning" syllabus lays down minimum requirements for students' knowledge and skills; it also provides a description of both the contents and the forms of training and assessment in use. The course is offered to students of the Master's Program "Big Data Systems" (area code 080500.68) in the Faculty of Business Informatics of the National Research University "Higher School of Economics". The course is part of the curriculum pool of required courses (1st year, M.2.B.1 Optional courses, M.2 Courses required by the Master's program of the 2014-2015 academic year's curriculum), and it is a two-module course (3rd and 4th modules). The duration of the course amounts to 48 class periods (lectures and practice), divided into 24 lecture hours and 24 practice hours. In addition, 96 academic hours are set aside for students' self-study.

Academic control forms include:
2 home tasks done by students individually; each student has to prepare an electronic report (PDF format only); all reports have to be submitted in the LMS; all reports are checked and graded by the instructor on a ten-point scale by the end of the 3rd module and the 4th module,
a pass-final examination, which implies a written test and computer-based problem solving.

The Course is based on the acquisition of the following courses:
Calculus
Linear Algebra
Probability Theory and Mathematical Statistics
Data Analysis
Economic and Mathematical Modeling
Discrete Mathematics

The Course requires the following students' competencies and knowledge:
the main definitions, theorems and properties from Calculus, Linear Algebra, Probability Theory and Mathematical Statistics, Data Analysis, Economic and Mathematical Modeling and Discrete Mathematics,
the ability to communicate both orally and in written form in English,
the ability to search for, process and analyze information from a variety of sources.

Main provisions of the course should be used to further the study of the following courses:
Risk Analysis Based on Big Data
Predictive Modeling
Marketing Analytics Based on Big Data

4 Thematic Course Contents

For each topic/lecture: total hours (lecture hours, seminar/practice hours, independent work hours).

3rd Module
1. Supervised Learning: 12 (2, 2, 8)
2. Bayesian Decision Theory: 12 (2, 2, 8)
3. Parametric and Multivariate Methods: 12 (2, 2, 8)
4. Dimensionality Reduction: 12 (2, 2, 8)
5. Clustering: 12 (2, 2, 8)
3rd Module total: 10 lecture hours, 10 practice hours, 40 hours of independent work.

4th Module
6. Nonparametric Methods: 12 (2, 2, 8)
7. Decision Trees: 12 (2, 2, 8)
8. Linear Discrimination: 12 (2, 2, 8)
9. Multilayer Perceptrons: 12 (2, 2, 8)
10. Kernel Machines: 12 (2, 2, 8)
11. Bayesian Estimation: 12 (2, 2, 8)
12. Design and Analysis of Machine Learning Experiments: 12 (2, 2, 8)
4th Module total: 14 lecture hours, 14 practice hours, 56 hours of independent work.

Course total: 24 lecture hours, 24 practice hours, 96 hours of independent work.

5 Forms and Types of Testing

Type of control; form of control; week of the 1st year; department; parameters:
Current control: Home task 1; week 29; Department of Innovation and Business in Information Technology; problem solving, written report (paper).
Current control: Home task 2; week 40; Department of Innovation and Business in Information Technology; problem solving, written report (paper).
Resultant control: Pass-fail exam; week 41; Department of Innovation and Business in Information Technology; written test (paper) and computer-based problem solving.

Evaluation Criteria

Current and resultant grades are made up of the following components:
2 home tasks done by students individually; each student has to prepare an electronic report (PDF format only). All reports have to be submitted in the LMS. All reports are checked and graded by the instructor on a ten-point scale by the end of the 3rd module and the 4th module. All home tasks (HT) are assessed on the ten-point scale in summary.
a pass-final examination, which implies a written test (WT) and computer-based problem solving (CS).

Finally, the total course grade on the ten-point scale is obtained as

O(Total) = 0.6 * O(HT) + 0.1 * O(WT) + 0.3 * O(CS).

A grade of 4 or higher means successful completion of the course ("pass"), while a grade of 3 or lower means an unsuccessful result ("fail"). The concluding rounded grade O(Total) is converted to a five-point scale grade.
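
The weighting above is plain arithmetic on the ten-point grades; purely as an illustration (the grade values below are hypothetical, and this sketch is not part of the official grading procedure), the computation can be written as:

```python
# Illustrative sketch only: total course grade from the syllabus weights
# (0.6 home tasks, 0.1 written test, 0.3 computer-based problem solving).
def total_grade(o_ht, o_wt, o_cs):
    return 0.6 * o_ht + 0.1 * o_wt + 0.3 * o_cs

o_total = total_grade(o_ht=8, o_wt=6, o_cs=7)   # 0.6*8 + 0.1*6 + 0.3*7 = 7.5
print(round(o_total), "pass" if round(o_total) >= 4 else "fail")
```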

6 Detailed Course Contents

Lecture 1. Supervised Learning
Examples of Machine Learning Applications. Learning Associations: Classification, Regression, Unsupervised Learning, Reinforcement Learning. Learning a Class from Examples. Vapnik-Chervonenkis (VC) Dimension. Probably Approximately Correct (PAC) Learning. Noise. Learning Multiple Classes. Regression. Model Selection and Generalization. Dimensions of a Supervised Machine Learning Algorithm.

Practice 1. Probably Approximately Correct (PAC) Learning. Noise. Learning Multiple Classes. Regression. Model Selection and Generalization. Dimensions of a Supervised Machine Learning Algorithm.

Core textbook: Alpaydin E. Introduction to Machine Learning, 2nd Edition, MIT Press, Cambridge, 2010.
Additional reading:
1. Angluin, D. 1988. Queries and Concept Learning. Machine Learning 2: 319-342.
2. Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. 1989. Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM 36: 929-965.
3. Dietterich, T. G. 2003. Machine Learning. In Nature Encyclopedia of Cognitive Science. London: Macmillan.
4. Hirsh, H. 1990. Incremental Version Space Merging: A General Framework for Concept Learning. Boston: Kluwer.

Lecture 2. Bayesian Decision Theory
Introduction. Classification. Losses and Risks. Discriminant Functions. Utility Theory. Association Rules.

Practice 2. Classification. Losses and Risks. Discriminant Functions. Utility Theory. Association Rules.

Core textbook: Alpaydin E. Introduction to Machine Learning, 2nd Edition, MIT Press, Cambridge, 2010.
Additional reading:
1. Agrawal, R., H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. 1996. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, ed. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307-328. Cambridge, MA: MIT Press.
2. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
Li, J. 2006. On Optimal Rule Discovery. IEEE Transactions on Knowledge and Data Discovery 18: 460-471.
3. Newman, J. R., ed. 1988. The World of Mathematics. Redmond, WA: Tempus.
Omiecinski, E. R. 2003. Alternative Interest Measures for Mining Associations in Databases. IEEE Transactions on Knowledge and Data Discovery 15: 57-69.
4. Russell, S., and P. Norvig. 1995. Artificial Intelligence: A Modern Approach. New York: Prentice Hall.
Shafer, G., and J. Pearl, eds. 1990. Readings in Uncertain Reasoning. San Mateo, CA: Morgan Kaufmann.
5. Zhang, C., and S. Zhang. 2002. Association Rule Mining: Models and Algorithms. New York: Springer.
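
Lecture 2's losses-and-risks material (see also exam Problem 1 below) boils down to choosing the action with the smallest expected risk given the posteriors. Purely as an illustration, and not part of the official course materials, the following NumPy sketch applies the rule to a made-up loss matrix and made-up posterior probabilities:

```python
# Illustrative sketch (made-up numbers): Bayes decision rule with a loss
# matrix, cf. Lecture 2 "Losses and Risks".
import numpy as np

# loss[i, k] = loss of choosing class i when the true class is k
loss = np.array([[0.0, 10.0],    # lambda_11 = 0, lambda_12 = 10
                 [1.0,  0.0]])   # lambda_21 = 1, lambda_22 = 0

posterior = np.array([0.3, 0.7])        # P(C_1 | x), P(C_2 | x) for some x
expected_risk = loss @ posterior        # R(action_i | x) for each action i
print(expected_risk)                    # [7.0, 0.3]
print("choose class", np.argmin(expected_risk) + 1)   # class 2
```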

Lecture 3. Parametric and Multivariate Methods
Maximum Likelihood Estimation: Bernoulli Density, Multinomial Density, Gaussian (Normal) Density. Evaluating an Estimator: Bias and Variance. The Bayes Estimator. Parametric Classification. Regression. Tuning Model Complexity: Bias/Variance Dilemma. Model Selection Procedures. Multivariate Data. Parameter Estimation. Estimation of Missing Values. Multivariate Normal Distribution. Multivariate Classification. Tuning Complexity. Discrete Features. Multivariate Regression.

Practice 3. Maximum Likelihood Estimation. Multivariate Classification. Tuning Complexity. Discrete Features. Multivariate Regression.

Core textbook: Alpaydin E. Introduction to Machine Learning, 2nd Edition, MIT Press, Cambridge, 2010.
Additional reading:
1. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
2. Friedman, J. H. 1989. Regularized Discriminant Analysis. Journal of American Statistical Association 84: 165-175.
3. Harville, D. A. 1997. Matrix Algebra from a Statistician's Perspective. New York: Springer.
4. Manning, C. D., and H. Schutze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
5. McLachlan, G. J. 1992. Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley.
6. Rencher, A. C. 1995. Methods of Multivariate Analysis. New York: Wiley.
7. Strang, G. 1988. Linear Algebra and its Applications, 3rd ed. New York: Harcourt Brace Jovanovich.

Lecture 4. Dimensionality Reduction
Subset Selection. Principal Components Analysis. Factor Analysis. Multidimensional Scaling. Linear Discriminant Analysis. Isomap. Locally Linear Embedding.

Practice 4. Principal Components Analysis. Factor Analysis. Multidimensional Scaling. Linear Discriminant Analysis.

Reading:
1. Balasubramanian, M., E. L. Schwartz, J. B. Tenenbaum, V. de Silva, and J. C. Langford. 2002. The Isomap Algorithm and Topological Stability. Science 295: 7.
2. Chatfield, C., and A. J. Collins. 1980. Introduction to Multivariate Analysis. London: Chapman and Hall.
3. Cox, T. F., and M. A. A. Cox. 1994. Multidimensional Scaling. London: Chapman and Hall.
4. Devijer, P. A., and J. Kittler. 1982. Pattern Recognition: A Statistical Approach. New York: Prentice-Hall.
5. Flury, B. 1988. Common Principal Components and Related Multivariate Models. New York: Wiley.
6. Fukunaga, K., and P. M. Narendra. 1977. A Branch and Bound Algorithm for Feature Subset Selection. IEEE Transactions on Computers C-26: 917-922.
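
Principal components analysis (Lecture 4) amounts to projecting centered data onto the leading eigenvectors of the sample covariance matrix. Purely as an illustration outside the official materials (synthetic data, NumPy assumed rather than the course's Mathematica toolchain), a minimal sketch:

```python
# Illustrative sketch (synthetic data): PCA by eigendecomposition of the
# sample covariance matrix, cf. Lecture 4.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples, 5 features
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]  # introduce correlation

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # 5 x 5 sample covariance
eigval, eigvec = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]         # sort by explained variance
W = eigvec[:, order[:2]]                 # top 2 principal directions
Z = Xc @ W                               # dimensionality-reduced data
print(Z.shape, eigval[order][:2] / eigval.sum())  # (200, 2) and variance ratios
```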

Lecture 5. Clustering
Mixture Densities. k-means Clustering. Expectation-Maximization Algorithm. Mixtures of Latent Variable Models. Supervised Learning after Clustering. Hierarchical Clustering. Choosing the Number of Clusters.

Practice 5. Mixture Densities. k-means Clustering. Expectation-Maximization Algorithm. Mixtures of Latent Variable Models.

Reading:
1. Alpaydın, E. 1998. Soft Vector Quantization and the EM Algorithm. Neural Networks 11: 467-477.
2. Barrow, H. B. 1989. Unsupervised Learning. Neural Computation 1: 295-311.
3. Bezdek, J. C., and N. R. Pal. 1995. Two Soft Relatives of Learning Vector Quantization. Neural Networks 8: 729-743.
4. Bishop, C. M. 1999. Latent Variable Models. In Learning in Graphical Models, ed. M. I. Jordan, 371-403. Cambridge, MA: MIT Press.
5. Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39: 1-38.
6. Gersho, A., and R. M. Gray. 1992. Vector Quantization and Signal Compression. Boston: Kluwer.
7. Ghahramani, Z., and G. E. Hinton. 1997. The EM Algorithm for Mixtures of Factor Analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto.

Lecture 6. Nonparametric Methods
Nonparametric Density Estimation. Histogram Estimator: Kernel Estimator, k-Nearest Neighbor Estimator. Generalization to Multivariate Data. Nonparametric Classification. Condensed Nearest Neighbor. Nonparametric Regression: Smoothing Models, Running Mean Smoother, Kernel Smoother, Running Line Smoother. How to Choose the Smoothing Parameter.

Practice 6. Nonparametric Density Estimation. Nonparametric Regression.

Reading:
1. Aha, D. W., ed. 1997. Special Issue on Lazy Learning, Artificial Intelligence Review 11(1-5): 7-423.
2. Aha, D. W., D. Kibler, and M. K. Albert. 1991. Instance-Based Learning Algorithm. Machine Learning 6: 37-66.
3. Atkeson, C. G., A. W. Moore, and S. Schaal. 1997. Locally Weighted Learning. Artificial Intelligence Review 11: 11-73.
4. Cover, T. M., and P. E. Hart. 1967. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory 13: 21-27.
5. Dasarathy, B. V. 1991. Nearest Neighbor Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.
6. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
Geman, S., E. Bienenstock, and R. Doursat. 1992. Neural Networks and the Bias/Variance Dilemma. Neural Computation 4: 1-58.
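
One of Lecture 6's nonparametric regression tools, the Gaussian kernel smoother, is just a locally weighted average controlled by a bandwidth h (the smoothing parameter whose choice the lecture discusses, and which home task Problem 7 touches on). A hedged NumPy sketch on synthetic data, with an arbitrary bandwidth value and no claim to match the course materials:

```python
# Illustrative sketch (synthetic data): a Gaussian kernel smoother for
# nonparametric regression, cf. Lecture 6.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)   # noisy targets

def kernel_smooth(x_query, x, y, h=0.3):
    """Kernel-weighted average of y around each query point; h is the
    smoothing (bandwidth) parameter."""
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

grid = np.linspace(0, 2 * np.pi, 50)
print(kernel_smooth(grid, x, y)[:5])   # smoothed estimates near x = 0
```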

Lecture 7. Linear Discrimination
Generalizing the Linear Model. Geometry of the Linear Discriminant: Two Classes, Multiple Classes. Pairwise Separation. Parametric Discrimination Revisited. Gradient Descent. Logistic Discrimination: Two Classes, Multiple Classes. Discrimination by Regression.

Practice 7. Generalizing the Linear Model. Geometry of the Linear Discriminant. Logistic Discrimination.

Reading:
1. Aizerman, M. A., E. M. Braverman, and L. I. Rozonoer. 1964. Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control 25: 821-837.
2. Anderson, J. A. 1982. Logistic Discrimination. In Handbook of Statistics, Vol. 2, Classification, Pattern Recognition and Reduction of Dimensionality, ed. P. R. Krishnaiah and L. N. Kanal, 169-191. Amsterdam: North Holland.
3. Bridle, J. S. 1990. Probabilistic Interpretation of Feedforward Classification Network Outputs with Relationships to Statistical Pattern Recognition. In Neurocomputing: Algorithms, Architectures and Applications, ed. F. Fogelman-Soulie and J. Herault, 227-236. Berlin: Springer.
4. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. London: Chapman and Hall.

Lecture 8. Multilayer Perceptrons
Introduction: Understanding the Brain, Neural Networks as a Paradigm for Parallel Processing. The Perceptron. Training a Perceptron. Learning Boolean Functions. Multilayer Perceptrons. MLP as a Universal Approximator. Backpropagation Algorithm: Nonlinear Regression, Two-Class Discrimination, Multiclass Discrimination, Multiple Hidden Layers. Training Procedures: Improving Convergence, Overtraining, Structuring the Network, Hints. Tuning the Network Size. Bayesian View of Learning. Dimensionality Reduction. Learning Time. Time Delay Neural Networks. Recurrent Networks.

Practice 8. Backpropagation Algorithm. Training Procedures.

Reading:
1. Abu-Mostafa, Y. 1995. Hints. Neural Computation 7: 639-671.
2. Aran, O., O. T. Yıldız, and E. Alpaydın. 2009. An Incremental Framework Based on Cross-Validation for Estimating the Architecture of a Multilayer Perceptron. International Journal of Pattern Recognition and Artificial Intelligence 23: 159-190.
3. Ash, T. 1989. Dynamic Node Creation in Backpropagation Networks. Connection Science 1: 365-375.
4. Battiti, R. 1992. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method. Neural Computation 4: 141-166.
5. Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Bourlard, H., and Y. Kamp. 1988. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics 59: 291-294.
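
Lecture 8's backpropagation algorithm for one hidden layer can be stated in a few lines of gradient-descent updates. As a hedged, self-contained illustration only (toy XOR data echoing "Learning Boolean Functions", made-up hyperparameters, NumPy rather than the course's Mathematica toolchain):

```python
# Illustrative sketch (toy data): a one-hidden-layer MLP trained by plain
# gradient descent / backpropagation on XOR, cf. Lecture 8.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.5                                       # a single learning rate (cf. Problem 8)

for _ in range(10000):
    H = np.tanh(X @ W1 + b1)            # forward pass: hidden activations
    P = sigmoid(H @ W2 + b2)            # predicted probabilities
    dZ2 = P - y                         # cross-entropy gradient w.r.t. output pre-activation
    dW2, db2 = H.T @ dZ2, dZ2.sum(0)
    dZ1 = (dZ2 @ W2.T) * (1 - H ** 2)   # backpropagate through tanh
    dW1, db1 = X.T @ dZ1, dZ1.sum(0)
    W1 -= eta * dW1; b1 -= eta * db1    # gradient-descent updates
    W2 -= eta * dW2; b2 -= eta * db2

print(P.ravel())                        # typically close to [0, 1, 1, 0]
```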

Lecture 9. Multilayer Perceptrons
Introduction: Understanding the Brain, Neural Networks as a Paradigm for Parallel Processing. The Perceptron. Training a Perceptron. Learning Boolean Functions. Multilayer Perceptrons. MLP as a Universal Approximator. Backpropagation Algorithm: Nonlinear Regression, Two-Class Discrimination, Multiclass Discrimination, Multiple Hidden Layers. Training Procedures: Improving Convergence, Overtraining, Structuring the Network, Hints. Tuning the Network Size. Bayesian View of Learning. Dimensionality Reduction. Learning Time. Time Delay Neural Networks. Recurrent Networks.

Practice 9. Backpropagation Algorithm. Training Procedures.

Reading:
1. Abu-Mostafa, Y. 1995. Hints. Neural Computation 7: 639-671.
2. Aran, O., O. T. Yıldız, and E. Alpaydın. 2009. An Incremental Framework Based on Cross-Validation for Estimating the Architecture of a Multilayer Perceptron. International Journal of Pattern Recognition and Artificial Intelligence 23: 159-190.
3. Ash, T. 1989. Dynamic Node Creation in Backpropagation Networks. Connection Science 1: 365-375.
4. Battiti, R. 1992. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method. Neural Computation 4: 141-166.
5. Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Bourlard, H., and Y. Kamp. 1988. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics 59: 291-294.

Lecture 10. Kernel Machines
Optimal Separating Hyperplane. The Nonseparable Case: Soft Margin Hyperplane. ν-SVM. Kernel Trick. Vectorial Kernels. Defining Kernels. Multiple Kernel Learning. Multiclass Kernel Machines. Kernel Machines for Regression. One-Class Kernel Machines. Kernel Dimensionality Reduction.

Practice 10. The Nonseparable Case: Soft Margin Hyperplane. ν-SVM. Multiclass Kernel Machines. Kernel Machines for Regression.

Reading:
1. Allwein, E. L., R. E. Schapire, and Y. Singer. 2000. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research 1: 113-141.
2. Burges, C. J. C. 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2: 121-167.
3. Chang, C.-C., and C.-J. Lin. 2008. LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/cjlin/libsvm/.
4. Cherkassky, V., and F. Mulier. 1998. Learning from Data: Concepts, Theory, and Methods. New York: Wiley.
5. Cortes, C., and V. Vapnik. 1995. Support Vector Networks. Machine Learning 20: 273-297.
6. Dietterich, T. G., and G. Bakiri. 1995. Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 2: 263-286.
7. Gonen, M., and E. Alpaydın. 2008. Localized Multiple Kernel Learning. In 25th International Conference on Machine Learning, ed. A. McCallum and S. Roweis, 352-359. Madison, WI: Omnipress.
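
Lecture 10's soft-margin kernel machines are available off the shelf. Purely as an illustration (scikit-learn is an assumed dependency here, not part of the course toolchain, and the data are synthetic), a short sketch fitting an RBF-kernel SVM:

```python
# Illustrative sketch (synthetic data, scikit-learn assumed installed):
# a soft-margin SVM with an RBF kernel, cf. Lecture 10 "Kernel Machines".
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # nonlinearly separable labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # C controls the soft margin
clf.fit(X[:200], y[:200])
print("held-out accuracy:", clf.score(X[200:], y[200:]))
print("support vectors per class:", clf.n_support_)
```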

Lecture 11. Bayesian Estimation
Estimating the Parameter of a Distribution: Discrete Variables, Continuous Variables. Bayesian Estimation of the Parameters of a Function: Regression, The Use of Basis/Kernel Functions, Bayesian Classification. Gaussian Processes.

Practice 11. Estimating the Parameter of a Distribution. Bayesian Estimation of the Parameters of a Function. Gaussian Processes.

Reading:
1. Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.
2. Figueiredo, M. A. T. 2003. Adaptive Sparseness for Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25: 1150-1159.
3. Gelman, A. 2008. Objections to Bayesian Statistics. Bayesian Statistics 3: 445-450.
4. MacKay, D. J. C. 1998. Introduction to Gaussian Processes. In Neural Networks and Machine Learning, ed. C. M. Bishop, 133-166. Berlin: Springer.
5. MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press.
6. Rasmussen, C. E., and C. K. I. Williams. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
7. Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58: 267-288.

Lecture 12. Design and Analysis of Machine Learning Experiments
Factors, Response, and Strategy of Experimentation. Response Surface Design. Randomization, Replication, and Blocking. Guidelines for Machine Learning Experiments. Cross-Validation and Resampling Methods: K-Fold Cross-Validation, 5×2 Cross-Validation, Bootstrapping. Measuring Classifier Performance. Interval Estimation. Hypothesis Testing. Assessing a Classification Algorithm's Performance: Binomial Test, Approximate Normal Test, t Test. Comparing Two Classification Algorithms: McNemar's Test, K-Fold Cross-Validated Paired t Test, 5×2 cv Paired t Test, 5×2 cv Paired F Test. Comparing Multiple Algorithms: Analysis of Variance. Comparison over Multiple Datasets: Comparing Two Algorithms, Multiple Algorithms.

Practice 12. Cross-Validation and Resampling Methods. Assessing a Classification Algorithm's Performance.

Reading:
1. Alpaydın, E. 1999. Combined 5×2 cv F Test for Comparing Supervised Classification Learning Algorithms. Neural Computation 11: 1885-1892.
2. Bouckaert, R. R. 2003. Choosing between Two Learning Algorithms Based on Calibrated Tests. In Twentieth International Conference on Machine Learning, ed. T. Fawcett and N. Mishra, 51-58. Menlo Park, CA: AAAI Press.
3. Demsar, J. 2006. Statistical Comparison of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7: 1-30.
4. Dietterich, T. G. 1998. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation 10: 1895-1923.
5. Fawcett, T. 2006. An Introduction to ROC Analysis. Pattern Recognition Letters 27: 861-874.
6. Montgomery, D. C. 2005. Design and Analysis of Experiments, 6th ed. New York: Wiley.
7. Ross, S. M. 1987. Introduction to Probability and Statistics for Engineers and Scientists. New York: Wiley.
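
Lecture 12's comparison machinery, K-fold cross-validation followed by a paired test over the folds, can be exercised in a few lines. As a hedged illustration only (scikit-learn, SciPy and the bundled breast-cancer dataset are assumptions, and this shows the plain K-fold cross-validated paired t test rather than the 5×2 cv variants):

```python
# Illustrative sketch: comparing two classifiers with K-fold cross-validation
# and a paired t test over the fold accuracies, cf. Lecture 12.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models

acc_lr = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
acc_dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(acc_lr, acc_dt)        # paired over the 10 folds
print(acc_lr.mean(), acc_dt.mean(), t_stat, p_value)
```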

7 Educational Technology

During classes various types of active methods are used: analysis of practical problems, group work, computer simulations in the computational software program Mathematica 10.0, and distance learning with the use of the LMS.

8 Methods and Materials for Current Control and Attestation

8.1 Examples of Problems for Home Tasks

Problem 1. Imagine you have two possibilities: you can fax a document, that is, send the image, or you can use an optical character reader (OCR) and send the text file. Discuss the advantages and disadvantages of the two approaches in a comparative manner. When would one be preferable over the other?

Problem 2. Somebody tosses a fair coin and if the result is heads, you get nothing; otherwise you get $5. How much would you pay to play this game? What if the win is $500 instead of $5?

Problem 3. Show that as we move an item from the consequent to the antecedent, confidence can never decrease: confidence(ABC → D) ≥ confidence(AB → CD).

Problem 4. Write the code that generates a normal sample with given μ and σ, and the code that calculates m and s from the sample. Do the same using the Bayes estimator, assuming a prior distribution for μ. (A minimal illustrative sketch follows the problem list below.)

Problem 5. In Isomap, instead of using Euclidean distance, we can also use Mahalanobis distance between neighboring points. What are the advantages and disadvantages of this approach, if any?

Problem 6. In image compression, k-means can be used as follows: the image is divided into nonoverlapping c × c windows and these c²-dimensional vectors make up the sample. For a given k, which is generally a power of two, we do k-means clustering. The reference vectors and the indices for each window are sent over the communication line. At the receiving end, the image is then reconstructed by reading from the table of reference vectors using the indices. Write a computer program that does this for different values of k and c. For each case, calculate the reconstruction error and the compression rate.

Problem 7. In the running smoother, we can fit a constant, a line, or a higher-degree polynomial at a test point. How can we choose among them?

Problem 8. What is the implication of the use of a single η for all x_j in gradient descent?

Problem 9. Consider an MLP architecture with one hidden layer where there are also direct weights from the inputs to the output units. Explain when such a structure would be helpful and how it can be trained.

Problem 10. Incremental learning of the structure of an MLP can be viewed as a state space search. What are the operators? What is the goodness function? What type of search strategies are appropriate? Define these in such a way that dynamic node creation and cascade correlation are special instantiations.
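
For Problem 4, one possible starting point, offered only as a hedged NumPy sketch and not as a model answer, is shown below; the prior parameters are illustrative assumptions and the variance is treated as known for the Bayes estimator:

```python
# Hedged sketch for Problem 4 (not a model answer): sampling from N(mu, sigma^2),
# maximum likelihood estimates m and s, and the Bayes estimator of the mean
# under a normal prior N(mu0, sigma0^2) with known variance.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 2.0, 1.5, 100
x = rng.normal(mu, sigma, size=n)        # sample from N(mu, sigma^2)

m = x.mean()                             # ML estimate of mu
s = x.std(ddof=0)                        # ML estimate of sigma

mu0, sigma0 = 0.0, 10.0                  # illustrative prior parameters
post_mean = (n / sigma**2 * m + mu0 / sigma0**2) / (n / sigma**2 + 1 / sigma0**2)
print(m, s, post_mean)                   # the Bayes estimate shrinks m toward mu0
```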

8.2 Questions for the Pass-Final Examination

Theoretical Questions
1. Examples of Machine Learning Applications.
2. Learning Associations: Classification, Regression, Unsupervised Learning, Reinforcement Learning. Learning a Class from Examples. Vapnik-Chervonenkis (VC) Dimension. Probably Approximately Correct (PAC) Learning. Noise.
3. Learning Multiple Classes. Regression. Model Selection and Generalization. Dimensions of a Supervised Machine Learning Algorithm.
4. Introduction. Classification. Losses and Risks. Discriminant Functions. Utility Theory. Association Rules.
5. Maximum Likelihood Estimation: Bernoulli Density, Multinomial Density, Gaussian (Normal) Density.
6. Evaluating an Estimator: Bias and Variance. The Bayes Estimator. Parametric Classification. Regression.
7. Tuning Model Complexity: Bias/Variance Dilemma.
8. Model Selection Procedures.
9. Multivariate Data. Parameter Estimation. Estimation of Missing Values. Multivariate Normal Distribution. Multivariate Classification. Tuning Complexity. Discrete Features. Multivariate Regression.
10. Subset Selection. Principal Components Analysis.
11. Factor Analysis.
12. Multidimensional Scaling.
13. Linear Discriminant Analysis. Isomap. Locally Linear Embedding.
14. Mixture Densities. k-means Clustering. Expectation-Maximization Algorithm.
15. Mixtures of Latent Variable Models.
16. Supervised Learning after Clustering. Hierarchical Clustering. Choosing the Number of Clusters.
17. Nonparametric Density Estimation. Histogram Estimator: Kernel Estimator, k-Nearest Neighbor Estimator.
18. Generalization to Multivariate Data. Nonparametric Classification. Condensed Nearest Neighbor.
19. Nonparametric Regression: Smoothing Models, Running Mean Smoother, Kernel Smoother, Running Line Smoother. How to Choose the Smoothing Parameter.
20. Generalizing the Linear Model. Geometry of the Linear Discriminant: Two Classes, Multiple Classes.
21. Pairwise Separation. Parametric Discrimination Revisited. Gradient Descent. Logistic Discrimination: Two Classes, Multiple Classes. Discrimination by Regression.

22. Understanding the Brain, Neural Networks as a Paradigm for Parallel Processing.
23. The Perceptron. Training a Perceptron. Learning Boolean Functions. Multilayer Perceptrons. MLP as a Universal Approximator.
24. Backpropagation Algorithm: Nonlinear Regression, Two-Class Discrimination, Multiclass Discrimination, Multiple Hidden Layers.
25. Training Procedures: Improving Convergence, Overtraining, Structuring the Network, Hints.
26. Tuning the Network Size. Bayesian View of Learning. Dimensionality Reduction.
27. Learning Time. Time Delay Neural Networks. Recurrent Networks.
28. Optimal Separating Hyperplane.
29. The Nonseparable Case: Soft Margin Hyperplane, ν-SVM.
30. Kernel Trick. Vectorial Kernels. Defining Kernels.
31. Multiple Kernel Learning. Multiclass Kernel Machines.
32. Kernel Machines for Regression. One-Class Kernel Machines.
33. Kernel Dimensionality Reduction.
34. Estimating the Parameter of a Distribution: Discrete Variables, Continuous Variables.
35. Bayesian Estimation of the Parameters of a Function: Regression, The Use of Basis/Kernel Functions, Bayesian Classification. Gaussian Processes.
36. Factors, Response, and Strategy of Experimentation. Response Surface Design. Randomization, Replication, and Blocking.
37. Guidelines for Machine Learning Experiments.
38. Cross-Validation and Resampling Methods: K-Fold Cross-Validation, 5×2 Cross-Validation, Bootstrapping.
39. Measuring Classifier Performance. Interval Estimation. Hypothesis Testing. Assessing a Classification Algorithm's Performance: Binomial Test, Approximate Normal Test, t Test.
40. Comparing Two Classification Algorithms: McNemar's Test, K-Fold Cross-Validated Paired t Test, 5×2 cv Paired t Test, 5×2 cv Paired F Test.
41. Comparing Multiple Algorithms: Analysis of Variance. Comparison over Multiple Datasets: Comparing Two Algorithms, Multiple Algorithms.

Examples of Problems

Problem 1. In a two-class problem, suppose we have the loss matrix with λ11 = λ22 = 0, λ21 = 1 and λ12 = α. Determine the threshold of decision as a function of α.

Problem 2. The K-fold cross-validated t test only tests for the equality of error rates. If the test rejects, we do not know which classification algorithm has the lower error rate. How can we test whether the first classification algorithm does not have a higher error rate than the second one? Hint: we have to test H0: μ ≤ 0 vs. H1: μ > 0.

Problem 3. If we have two variants of algorithm A and three variants of algorithm B, how can we compare the overall accuracies of A and B taking all their variants into account?

9 Teaching Methods and Information Provision

9.1 Core Textbook
Alpaydin E. Introduction to Machine Learning, 2nd Edition, MIT Press, Cambridge, 2010.

9.2 Required Reading
Han, J., and M. Kamber. 2006. Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann.
Leahey, T. H., and R. J. Harris. 1997. Learning and Cognition, 4th ed. New York: Prentice Hall.
Witten, I. H., and E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann.

9.3 Supplementary Reading
Dietterich, T. G. 2003. Machine Learning. In Nature Encyclopedia of Cognitive Science. London: Macmillan.
Hirsh, H. 1990. Incremental Version Space Merging: A General Framework for Concept Learning. Boston: Kluwer.
Kearns, M. J., and U. V. Vazirani. 1994. An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press.
Mitchell, T. 1997. Machine Learning. New York: McGraw-Hill.
Valiant, L. 1984. A Theory of the Learnable. Communications of the ACM 27: 1134-1142.
Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. New York: Springer.
Winston, P. H. 1975. Learning Structural Descriptions from Examples. In The Psychology of Computer Vision, ed. P. H. Winston, 157-209. New York: McGraw-Hill.

9.4 Handbooks
Handbook of Statistics, Vol. 2, Classification, Pattern Recognition and Reduction of Dimensionality, ed. P. R. Krishnaiah and L. N. Kanal, 1982, Amsterdam: North Holland.

9.5 Software
Mathematica v. 10.0

9.6 Distance Learning
MIT Open Course (Machine Learning)
HSE Learning Management System

10 Technical Provision
Computer, projector (for lectures and practice), computer class.