Machine Learning Lecture 1: Introduction to Machine Learning Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set of notes is based on internet resources and KP Murphy (2012). Machine learning: a probabilistic perspective. MIT Press. (Chapter 1) Nevin L. Zhang (HKUST) Machine Learning 1 / 24
What is Machine Learning? We are in the era of big data. There are about 1 trillion web pages; one hour of video is uploaded to YouTube every second, amounting to 10 years of content every day; the genomes of thousands of people, each about 3.8 × 10^9 base pairs long, have been sequenced by various labs; Walmart handles more than 1M transactions per hour and has databases containing more than 2.5 petabytes (2.5 × 10^15 bytes) of information; ... This deluge of data calls for automated methods of data analysis, which is what machine learning provides.
What is Machine Learning? We define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or to perform other kinds of decision making under uncertainty.
Types of Machine Learning Machine learning algorithms are divided into three main types: supervised learning, unsupervised learning, and reinforcement learning. Deep learning can be applied to all three types of tasks. There are also many more specialized paradigms: semi-supervised learning, active learning, ensemble learning, transfer learning, ...
Supervised Learning Problem statement: Given: a labeled training set D = {(x_i, y_i)}_{i=1}^N. To learn: a mapping y = f(x) from inputs x to outputs y. A training input x_i can be simply a vector of features (aka attributes, covariates), or a complex structured object such as an image, a document, a graph, etc. The output (aka response variable) y_i can be a categorical/nominal variable, in which case we have a classification problem; or a real-valued variable, in which case we have a regression problem.
Classification From labeled training data, learn a mapping y = f(x) where y ∈ {1, ..., C}. When C = 2, we have a binary classification problem. When C > 2, we have a multiclass classification problem. We regard it as a function approximation problem: we assume that x and y are related by an unknown function y = f(x); the task is to obtain an estimate f̂ of f from the labeled training data. We want to use f̂ to make predictions on novel inputs, meaning ones that we have not seen before (this is called generalization).
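As a minimal sketch of this function-approximation view, the following fits a simple 1-nearest-neighbour rule (a stand-in estimator, not the lecture's method) to a made-up labeled training set and then generalizes to a novel input:

```python
import numpy as np

# Hypothetical toy training set D = {(x_i, y_i)}: 2-D feature vectors with
# binary class labels (made-up data, for illustration only).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

def f_hat(x):
    """1-nearest-neighbour estimate of the unknown mapping y = f(x):
    predict the label of the closest training input."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return int(y_train[np.argmin(distances)])

# Generalization: predict on a novel input not seen during training.
prediction = f_hat(np.array([0.95, 0.9]))
```

Any classifier can play the role of f̂ here; the nearest-neighbour rule is chosen only because it makes the idea of "estimate f from labeled examples, then predict on new x" visible in a few lines.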
Classification: An Illustrative Example Left: a training set of colored shapes. Right: the representation of the data as a design matrix. There are three test cases to classify. It is clear how to classify the blue crescent; the other two cases are less clear. This example shows that we need to use probability in classification.
A Probabilistic Perspective on Classification A probabilistic formulation of classification: from training data D = {(x_i, y_i)}_{i=1}^N, learn a conditional distribution p(y | x). Assign an instance x to the class with the maximum probability: ŷ = f̂(x) = argmax_{c = 1, ..., C} p(y = c | x). An advantage of the probabilistic method is that uncertainty is explicitly modeled. If the probability max_{c = 1, ..., C} p(y = c | x) is not high enough, we might want to delay the decision until more information becomes available.
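The argmax rule, and the option to defer a low-confidence decision, can be sketched directly (the probabilities below are made-up numbers standing in for a trained model's output):

```python
import numpy as np

# Hypothetical conditional distribution p(y = c | x) over C = 3 classes
# for one input x.
p = np.array([0.15, 0.70, 0.15])

y_hat = int(np.argmax(p))     # y_hat = argmax_c p(y = c | x)
confidence = float(p[y_hat])  # max_c p(y = c | x)

# If the maximum probability is not high enough, defer the decision
# until more information becomes available.
threshold = 0.9
decision = y_hat if confidence >= threshold else "defer"
```

With these numbers the most probable class is 1, but its probability (0.70) falls below the chosen threshold, so the sketch defers rather than committing to a prediction.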
Real-World Classification Problems Object recognition and image classification (ImageNet), character recognition (recognizing handwritten characters), document classification (is a customer review positive or negative?), spam detection and filtering, intrusion detection, medical diagnosis, ...
Regression From labeled training data, learn a mapping y = f(x) where y is continuous. Example: each training example consists of a single real-valued input x_i ∈ R and a single real-valued response y_i ∈ R. Two possible models to fit to the data: a straight line and a quadratic function. In general, the inputs are high dimensional.
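The straight-line vs. quadratic comparison can be reproduced on made-up 1-D data (generated here from a quadratic plus noise, purely for illustration):

```python
import numpy as np

# Made-up 1-D training data: real-valued inputs x_i with real-valued
# responses y_i, generated from a quadratic plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x + 3.0 * x**2 + 0.1 * rng.standard_normal(50)

# Two candidate models: a straight line (degree 1) and a quadratic (degree 2).
line = np.polyfit(x, y, deg=1)
quad = np.polyfit(x, y, deg=2)

def mse(coeffs):
    """Mean squared training error of a fitted polynomial."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))
```

Since the data really does contain a quadratic component, the quadratic model achieves lower error than the line, which mirrors the slide's two-model comparison.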
Real-World Regression Problems Predict tomorrow's stock market price given current market conditions and other possible side information. Predict the age of a viewer watching a given video on YouTube. Predict the location in 3D space of a robot arm's end effector, given control signals (torques) sent to its various motors. Predict the amount of prostate-specific antigen (PSA) in the body as a function of a number of different clinical measurements. Predict the temperature at any location inside a building using weather data, time, door sensors, ...
Unsupervised Learning Sometimes we have only unlabeled data D = {x_i}_{i=1}^N, where there is no response variable. The goal of unsupervised learning is to discover interesting structures/patterns in the data. Some examples of unsupervised learning: clustering, dimension reduction, structure discovery, ...
Clustering Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
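A minimal sketch of this grouping idea is k-means, written here from scratch on two made-up, well-separated blobs of unlabeled points (the farthest-point initialisation is a simplification chosen for this example, not part of the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=20):
    """Minimal k-means: group points so that each point is closer to its
    own cluster centre than to the other centres."""
    # Farthest-point initialisation: start from X[0], then repeatedly add
    # the point farthest from the centres chosen so far.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iters):
        # Assign each point to its nearest centre ...
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2),
                           axis=1)
        # ... then move each centre to the mean of its assigned points.
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers

# Two well-separated blobs of unlabeled 2-D points (made-up data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```

On this data the algorithm recovers the two blobs: all points in one blob share a label, and the two blobs get different labels.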
Real-World Clustering Problems Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers, for use in market segmentation, product positioning, new product development, and selecting test markets. In the study of social networks, clustering may be used to recognize communities within large groups of people. In human genetic clustering, the similarity of genetic data is used to infer population structures. Recommender systems are designed to recommend new items based on a user's tastes; they sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the same cluster. ...
Dimensionality Reduction When dealing with high-dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower-dimensional subspace. This is called dimensionality reduction. It is often used in data visualization.
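One standard way to find such a lower-dimensional subspace is principal component analysis (PCA), sketched here via the SVD on made-up 5-D data that really varies mostly along one direction (PCA itself is an assumption of this example; the slide does not name a specific method):

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the d directions of greatest variance."""
    Xc = X - X.mean(axis=0)                        # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                           # coordinates in the subspace

# 100 made-up points in 5 dimensions that mostly vary along one direction.
rng = np.random.default_rng(0)
t = rng.standard_normal((100, 1))
X = t @ rng.standard_normal((1, 5)) + 0.01 * rng.standard_normal((100, 5))

Z = pca_project(X, d=2)   # 5-D data reduced to 2-D, e.g. for visualization
```

The first projected coordinate captures almost all of the variance, which is exactly why a 2-D plot of Z is a useful visualization of the original 5-D data.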
Structure Discovery Sometimes we would like to discover a graph structure describing how a set of variables are related. In the following example, we have a structure describing how word occurrences are related in a collection of documents. There are latent variables, which can be interpreted as topics.
Reinforcement Learning In reinforcement learning, an agent learns how to act or behave from occasional reward or punishment signals. It is the way dolphins in Ocean Park learn amazing tricks. Currently, the most famous reinforcement learning system is AlphaGo.
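The idea of learning to act from reward signals alone can be illustrated with a tiny two-armed bandit and an epsilon-greedy agent (a standard textbook example; the reward probabilities and the epsilon-greedy strategy are assumptions of this sketch, not content from the slide):

```python
import numpy as np

# Two-armed bandit: the agent sees only stochastic 0/1 reward signals and
# must learn by trial and error which action ("arm") is better.
rng = np.random.default_rng(0)
true_reward = [0.2, 0.8]          # success probabilities, unknown to the agent
Q = [0.0, 0.0]                    # the agent's running reward estimates
counts = [0, 0]                   # how often each arm has been pulled

for t in range(2000):
    # Epsilon-greedy: mostly exploit the best-looking arm, sometimes explore.
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(Q))
    r = float(rng.random() < true_reward[a])       # occasional reward signal
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]                 # incremental mean update
```

After enough trials the agent's estimates identify the better arm, and it spends most of its pulls on it, which is the essence of learning behavior from reward alone.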
Deep Learning Deep learning is a class of machine learning algorithms that: use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation, where each successive layer uses the output from the previous layer as input; learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners; learn multiple levels of representations that correspond to different levels of abstraction, with the levels forming a hierarchy of concepts.
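The "cascade of layers" structure can be sketched as a forward pass through a tiny three-layer network (the weights here are random placeholders; in practice they would be learned):

```python
import numpy as np

# Three layers in cascade: each layer applies a linear map followed by a
# nonlinearity to the previous layer's output.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))   # layer 1: 4 inputs  -> 8 units
W2 = rng.standard_normal((8, 8))   # layer 2: 8 units   -> 8 units
W3 = rng.standard_normal((8, 2))   # output:  8 units   -> 2 outputs

def forward(x):
    h = np.maximum(0.0, x @ W1)    # layer 1: linear map + ReLU nonlinearity
    h = np.maximum(0.0, h @ W2)    # layer 2 consumes layer 1's output
    return h @ W3                  # output layer (no nonlinearity)

out = forward(rng.standard_normal((1, 4)))
```

Each successive layer consumes the previous layer's output, which is exactly the cascading structure the definition above describes.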
Deep Learning Deep learning has a key advantage: automatic feature extraction. The algorithm automatically learns the features relevant to solving the problem, which reduces the burden on the programmer to select features explicitly.
A Brief History of AI and Machine Learning
A Brief History of Machine Learning
We will cover... Supervised Learning: Linear and Polynomial Regression; Logistic and Softmax Regression; Generative Models for Classification; Learning Theory. Deep Learning: Deep Feedforward Networks; Convolutional Neural Networks; Recurrent Neural Networks. Unsupervised Learning: Variational Autoencoders; Generative Adversarial Networks; Mixture Models. Reinforcement Learning: Basic RL; Value-Based Deep RL; Policy-Based Deep RL.
The No Free Lunch Theorem The No Free Lunch theorem states that no single algorithm works best for every problem: the assumptions that make an algorithm great for one problem may not hold for another. It is therefore common in machine learning to try multiple algorithms and find the one that works best for a particular problem.
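"Try multiple algorithms and keep the best" can be sketched as model selection on a held-out validation set; here the competing models are polynomial fits of different degrees on made-up data (the data, the candidate degrees, and the train/validation split are all assumptions of this example):

```python
import numpy as np

# Made-up regression problem: noisy samples of a nonlinear function.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(60)

# Hold out a validation set: compare models on data they were not fit to.
idx = rng.permutation(60)
train, val = idx[:40], idx[40:]

def val_error(deg):
    """Validation-set error of a polynomial model of the given degree."""
    coeffs = np.polyfit(x[train], y[train], deg)
    return float(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))

errors = {deg: val_error(deg) for deg in (1, 3, 9)}
best_degree = min(errors, key=errors.get)
```

The point is not which degree wins here, but the procedure: no candidate is assumed best a priori, and the winner is chosen empirically on held-out data for this particular problem.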
Questions about an ML Algorithm 1 What does it do? (User) 2 How does it work? (Programmer) 3 Why does it work the way it does? (Algorithm Designer) Pros and cons w.r.t. alternatives 4 Why can it achieve its goal? (Theoretician) We will focus mostly on the first three questions.