Introduction to ML. Abhijit Mishra, Research Scholar, Center for Indian Language Technology, Department of Computer Science and Engineering, Indian Institute of Technology Bombay. Email: abhijitmishra@cse.iitb.ac.in URL: http://www.cse.iitb.ac.in/~abhijitmishra
Task: Get mangoes of a particular type from the market
Task 1: Solve an equation Task 2: Get mangoes of a particular type from the market Randomness?? Ambiguity?? Nuances??
Randomness: slight variation in shape, size, color, odor, etc. Ambiguity: similar in size and color, but belonging to different categories. Nuances: different in size and color, but belonging to the same category. How to make machines understand the
Introduction to ML Roadmap Definition of Machine Learning Learning to predict Classification Regression Learning Paradigms Rule based Statistical Example Based Statistical Machine Learning Supervised Semi-supervised Unsupervised Reinforcement Supervised approaches Probabilistic approaches Non-probabilistic approaches Example - Text Classification Books, Online Courses and Tools
Definition of Machine Learning Machine learning [1] is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. It explores the study and construction of algorithms that can learn from data and make predictions on it. Applications: Pattern recognition (e.g., handwriting recognition, face detection, gesture detection). Prediction of events (e.g., stock market prediction, weather forecasting, prediction of diseases based on symptoms). Almost all popular online services (e.g., Google, Facebook, Amazon) use ML. [1] https://en.wikipedia.org/wiki/machine_learning
Learning to Predict Classification Classification is the problem of predicting to which of a set of categories (sub-populations) a new observation belongs. Input: Properties of the new observation, $\mathbf{x} = (x_1, \dots, x_n)$. Output: $y \in \{c_1, \dots, c_N\}$, the class of the new observation. When $N = 2$, the problem is called a binary classification problem (e.g., classifying emails into spam or non-spam categories). When $N > 2$, the problem is called an N-class/multi-class classification problem (e.g., classifying documents into multiple categories like sports, health, politics etc.).
Learning to Predict Regression When the output space of a predictor is the real numbers instead of nominal categories (as in classification), the prediction problem is referred to as statistical regression or simply regression. Input: Properties of the new observation. Output: $y$, where $y \in \mathbb{R}$. Example: Predicting the temperature of a day given the climatic conditions of the previous day; estimating the number of units of a new product to be sold in a year.
Note: Structured prediction deals with more complex output than the scalar output of classification and regression. Output: a structured object such as a sequence or a tree. Example: Automatic text translation (output is a sentence in another language), parse tree generation (output is a tree structure), image captioning. We will only focus on classification problems.
Learning Objective Back to Mangoes Task: Given some basic measurable properties of a certain mango, predict which category it belongs to. Measurable properties/Attributes/Features: Color, Weight, Smell, Dimensions, Taste?? Classes: Alphonso/Alice/Irwin
Learning Objectives What to learn? Correspondences between various attributes of the input object and the classes How to learn? Rule based learning Statistical learning Example based learning
Learning Paradigms Rule Based Learning is based on a set of rules handcrafted by humans. if (weight < 0.5 && (color == yellow || color == green)) { category = Alphonso; } else if ( ... ) { category = Alice; } The collection of rules, or the rule base, has to be exhaustive enough to capture all the corner cases. Problems: Building it is extremely hard, needs domain expertise and is highly time-consuming.
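A minimal sketch of the rule-based idea in Python. The 0.5 weight threshold and the yellow/green colors for Alphonso come from the slide; the slide leaves the condition for Alice blank, so the rule used here (weight of at least 0.5 and red color) is purely hypothetical, as is the "unknown" fallback.

```python
def classify_mango(weight, color):
    """Hand-written rule base: every rule is authored by a person."""
    # Rule from the slide: light mangoes that are yellow or green.
    if weight < 0.5 and color in ("yellow", "green"):
        return "Alphonso"
    # Hypothetical rule: the slide does not specify this condition.
    if weight >= 0.5 and color == "red":
        return "Alice"
    # Anything the rule base does not cover stays unlabelled.
    return "unknown"


print(classify_mango(0.4, "yellow"))
print(classify_mango(0.7, "green"))
```

The second call falls through every rule, which illustrates why the rule base has to be exhaustive.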
Learning Paradigms Example Based A very small set of examples with complete information (both input and classes) is available. Templates for each class are learned automatically. When a new observation arrives, class prediction is made based on the template that fits the observation best. Problems: Templates are generic representatives that are supposed to stand for the whole sub-population belonging to a class. For many problems, it is quite hard to come up with such representatives from a small number of examples. The approach is also susceptible to changes in the nature of the input data.
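One possible reading of "template" is the average feature vector of each class, with prediction by nearest template. The sketch below assumes that reading; the slide does not say how templates are built, and the feature values (weight in kg, length in cm) are made up for illustration.

```python
import math


def learn_templates(examples):
    """Average the feature vectors of each class into one template."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(templates, features):
    """Return the class whose template is closest to the observation."""
    return min(templates, key=lambda label: math.dist(templates[label], features))


# Hypothetical training examples: (weight in kg, length in cm).
examples = [
    ([0.30, 9.0], "Alphonso"),
    ([0.34, 10.0], "Alphonso"),
    ([0.60, 14.0], "Alice"),
    ([0.64, 15.0], "Alice"),
]
templates = learn_templates(examples)
print(predict(templates, [0.35, 10.5]))
```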
Learning Paradigms Statistical Beneficial if a large set of diversified examples is available. Feature-class correspondences are learned better. It is easy to update the classifier if the nature of the input data changes. It can leverage the huge volume of available web data. Problems: Overlearning can sometimes happen (referred to as overfitting). The choice of features affects system performance.
Statistical Machine Learning- Supervised Approaches Learning is based on a set of observations for which class labels are available. [Figure: mangoes labelled Alphonso, Alice and Irwin are used to train a Learned Model, which then labels a new mango as Alphonso]
Statistical Machine Learning- Semi-Supervised Approaches Learning is based on a set of observations for which class labels are available AND another set of observations (typically larger than the labelled set) for which class labels are not available. [Figure: labelled mangoes (Alphonso, Alice, Irwin) and unlabelled mangoes are used to train a Learned Model, which then labels a new mango as Alphonso]
Statistical Machine Learning- Un-Supervised Approaches Learning when no class labels are available.
Statistical Machine Learning- Reinforcement Learning Learning happens with the objective of maximizing the reward associated with the task. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented. Association is captured in terms of rewards.
Supervised Approaches Recap: Measurable properties/Attributes/Features: Color, Weight, Smell, Dimensions, Taste?? Classes: Alphonso/Alice/Irwin
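Supervised Approaches Probabilistic Models Given a set of features $x_1, \dots, x_n$, the classification decision of probabilistic models can be expressed as $\hat{c} = \arg\max_{c} P(c \mid x_1, \dots, x_n)$, where $c$ ranges over the possible classes and $P(c \mid x_1, \dots, x_n)$ is the probability of class $c$ given the features.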
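Supervised Approaches Naïve Bayes By Bayes' rule, $P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}$, where $P(c \mid x_1, \dots, x_n)$ is the posterior, $P(c)$ is the prior and $P(x_1, \dots, x_n \mid c)$ is the likelihood. The prior can be assumed to be a multinomial distribution for classification problems.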
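Supervised Approaches Naïve Bayes (1) Now if we assume that the features $x_1, \dots, x_n$ are independent of each other given the class $c$, the likelihood factorizes: $P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$. Note: The independence assumption may not hold true for many real-life problems.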
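Under the independence assumption, a class can be scored by its prior multiplied by one likelihood term per feature. The following is a small sketch for categorical features, not a reference implementation: the add-one smoothing, the log-space scoring, and the toy (color, size) data are all assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count classes and per-class feature values from labelled examples."""
    class_counts = Counter(label for _, label in examples)
    # value_counts[label][i][value]: times feature i took `value` in class `label`
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for features, label in examples:
        for i, value in enumerate(features):
            value_counts[label][i][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, features):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(features):
            # Add-one smoothed estimate of P(x_i | c) (an assumption of this sketch).
            numerator = value_counts[label][i][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: (color, size) -> class.
examples = [
    (("yellow", "small"), "Alphonso"),
    (("yellow", "medium"), "Alphonso"),
    (("green", "small"), "Alphonso"),
    (("red", "big"), "Alice"),
    (("red", "medium"), "Alice"),
    (("green", "big"), "Alice"),
]
model = train_naive_bayes(examples)
print(predict(model, ("yellow", "small")))
```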
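Supervised Approaches Logistic Regression Remember: In Logistic Regression, $P(c \mid x_1, \dots, x_n)$ is directly estimated: $P(c = 1 \mid x_1, \dots, x_n) = \frac{1}{1 + e^{-u}}$, where $u$ follows a regular weighted linear equation, $u = w_0 + \sum_{i=1}^{n} w_i x_i$. The coefficients ($w_0$ and $w_1, \dots, w_n$) have to be learned during training.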
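A rough sketch of how the coefficients might be learned, assuming the standard logistic (sigmoid) function applied to a weighted linear sum and plain stochastic gradient descent on the log loss. The learning rate, epoch count and one-dimensional toy data are arbitrary choices, not taken from the slides.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one observation."""
    u = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(u)


def train(examples, epochs=1000, learning_rate=0.1):
    """Fit weights and bias with stochastic gradient descent."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the log loss with respect to u is (prediction - label).
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical one-feature data: small values are class 0, large values class 1.
examples = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
weights, bias = train(examples)
print(predict_proba(weights, bias, [0.5]) < 0.5)
print(predict_proba(weights, bias, [3.5]) > 0.5)
```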
Supervised Approaches: Non-Probabilistic Models [Figure: observations of Class-1 and Class-2 plotted as points in a two-dimensional feature space (x1, x2)]
Supervised Approaches: K-Nearest Neighbor The K closest neighbors are decided based on a pre-defined distance measure. The class to which the maximum number of these neighbors belongs becomes the winner class. [Figure: a new point among Class-1 and Class-2 neighbors]
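The voting procedure can be sketched in a few lines. This version assumes Euclidean distance as the pre-defined measure and k = 3; the two-dimensional points are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points in a two-dimensional feature space.
training = [
    ((1.0, 1.0), "Class-1"),
    ((1.5, 2.0), "Class-1"),
    ((2.0, 1.0), "Class-1"),
    ((5.0, 5.0), "Class-2"),
    ((6.0, 5.5), "Class-2"),
    ((5.5, 6.5), "Class-2"),
]
print(knn_predict(training, (1.2, 1.4)))
```

An odd k avoids tied votes when there are only two classes.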
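Distance/similarity measures Euclidean Distance (between vectors $X_1$ and $X_2$ with components $x_{1i}$ and $x_{2i}$): $d(X_1, X_2) = \sqrt{\sum_{i} (x_{1i} - x_{2i})^2}$, which is a special case ($p = 2$) of Minkowski Distance: $d_p(X_1, X_2) = \left(\sum_{i} |x_{1i} - x_{2i}|^p\right)^{1/p}$. Cosine Distance: $d_{\cos}(X_1, X_2) = 1 - \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|}$.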
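For concreteness, here is one way to write these measures in Python, using their standard textbook definitions. Cosine distance is taken to be one minus cosine similarity, which is a common convention but an assumption here.

```python
import math


def minkowski_distance(x1, x2, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


def euclidean_distance(x1, x2):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski_distance(x1, x2, 2)


def cosine_distance(x1, x2):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return 1.0 - dot / (norm1 * norm2)


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

Cosine distance ignores vector length, so (1, 2) and (2, 4) are at distance zero even though their Euclidean distance is not.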
Supervised Approaches: Support Vector Machines The line $w \cdot x - b = 0$ separates Class-1 from Class-2. Classifier: $f(x, w, b) = \mathrm{sign}(w \cdot x - b)$
SVMs: Specifying the boundary Plus-Plane: $w \cdot x - b = +1$ (beyond it lies the "Predict Class = +1" zone). Classifier Boundary: $w \cdot x - b = 0$. Minus-Plane: $w \cdot x - b = -1$ (beyond it lies the "Predict Class = -1" zone). M = Margin Width = $\frac{2}{\sqrt{w \cdot w}}$. Given a guess of w and b we can: compute whether all data points are in the correct half-planes; compute the width of the margin. So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. This is primarily done
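The two checks for a guessed w and b can be sketched as follows. The code assumes the w . x - b convention of the classifier f(x,w,b) = sign(w. x - b) and labels of +1 and -1; the data points and candidate guesses are hypothetical. It only evaluates guesses and does not perform the search itself.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def all_correct(points, w, b):
    """True if every +1 point has w.x - b >= 1 and every -1 point has w.x - b <= -1."""
    return all(label * (dot(w, x) - b) >= 1 for x, label in points)


def margin_width(w):
    """Distance between the plus-plane and the minus-plane: 2 / sqrt(w.w)."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical linearly separable data with labels +1 and -1.
points = [
    ((3.0, 3.0), +1),
    ((4.0, 4.0), +1),
    ((1.0, 1.0), -1),
    ((0.0, 0.0), -1),
]

# Candidate guesses (w, b); the first two are valid, the third is not.
for w, b in [((1.0, 0.0), 2.0), ((0.5, 0.5), 2.0), ((1.0, 0.0), 5.0)]:
    print(w, b, all_correct(points, w, b), round(margin_width(w), 3))
```

Among the valid guesses, the one with the smaller w . w gives the wider margin, which is the quantity the search tries to maximize.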
Supervised Approaches Decision Tree [Figure: example decision tree over categorical and continuous attributes; internal nodes test size (Small vs. Big, Medium), Color (Green vs. Yellow) and a numeric threshold (< 80), and the leaves carry the class labels A and B] There could be more than one tree that fits the same data!
Supervised Approaches Note It is important to decide on a set of features that adequately explains the data. Selecting an extremely small number of features may underspecify the data and may not help the classifier learn properly. As the number of features increases, the model complexity increases (i.e., more parameters have to be learned and the chance of overfitting increases). Very high-dimensional feature vectors are unintuitive to analyze and make it hard to design distance functions and to perform combinatorics and optimization. This is known as the Curse of Dimensionality.
Example Text Classification Text classification is an important problem in the field of Natural Language Processing and Machine Learning. Objective: Assign a class label to a given text. Example: 1: Obama won the election: Politics 2: Brasil lost the football match: Sports
Problems in Text Classification Lexical Problems: Presence of ambiguous words, e.g., Cricket (game) vs Cricket (insect). Structural Problems: Complexity at the syntactic level, e.g., Mohd. Kaif, who was the hero of the Natwest final match against England in 2002, has joined BJP and will be running for an MP position. (Politics) Semantic Problems: Complexity at the semantic level, e.g., With the humiliating defeat in Bihar, INC's innings seems to be over. Pragmatic Problems: e.g., India lost to Zimbabwe yesterday. (Sports) Bernie lost to Clinton in New York. (Politics)
Text Classification Method Training: Some Documents -> Annotation -> Training Data (Features and Labels) -> MODEL (Naïve Bayes, SVM, Decision Tree etc.). Testing: Any Unseen Document -> Compute Features -> MODEL -> Prediction.
Text Classification Feature Extraction Example: Training Sample (Domain classification): 1: Obama won the election: Politics 2: Brasil lost the football match: Sports. Features: Vocabulary: <Obama, won, the, election, Brasil, lost, football, match>. Bag of Words features based on presence/absence, followed by the label (0 = Politics, 1 = Sports): 1: <1,1,1,1,0,0,0,0>:0 2: <0,0,1,0,1,1,1,1>:1
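The presence/absence vectors above can be reproduced with a short script. This is a minimal sketch that assumes whitespace tokenization and a vocabulary ordered by first appearance; real systems usually also normalize case and punctuation.

```python
def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for text in documents:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def presence_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary]


documents = ["Obama won the election", "Brasil lost the football match"]
vocabulary = build_vocabulary(documents)
vectors = [presence_vector(text, vocabulary) for text in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```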
Text Classification Training and Testing Training: The weight of each feature towards a label is computed by the training algorithm. The weight decides how strongly the feature predicts the label. Test: Based on the features present in the test data, the combined weight is computed and a label is decided. Problem: A feature in the test data may not have been seen in the training data (data sparsity problem). Solution: instead of taking Bag of Words based features, consider bag of senses, word
Text Classification Evaluation Metric Performance of classifiers is typically measured by Accuracy, Precision, Recall and F-Measure. For a binary classification problem, if the class labels are positive and negative: True Positive (TP): Number of test documents that are actually positive and are predicted positive. True Negative (TN): Number of test documents that are actually negative and are predicted negative. False Positive (FP): Number of test documents that are actually negative but are predicted positive. False Negative (FN): Number of test documents that are actually positive but are predicted negative.
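Text Classification Evaluation Metric (1) Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$. Precision = $\frac{TP}{TP + FP}$. Recall = $\frac{TP}{TP + FN}$. F-Measure = $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.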
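As an illustration, the four counts and the usual definitions of the metrics can be computed from lists of actual and predicted labels. The label lists below are invented, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def evaluate(actual, predicted, positive="positive"):
    """Compute accuracy, precision, recall and F-measure for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (an assumption of this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f_measure": f_measure,
    }


# Hypothetical test set of eight documents.
actual = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
predicted = ["positive", "positive", "negative", "negative",
             "negative", "positive", "negative", "positive"]
result = evaluate(actual, predicted)
print(result)
```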
Text Classification - DEMO Package: Scikit-learn (install numpy, scipy, matplotlib and scikit-learn packages) Demo: Naïve Bayes SVM KNN Decision Tree
Books and Online Courses Books: Machine Learning by Tom Mitchell; Pattern Recognition and Machine Learning by Christopher M. Bishop; Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar; Machine Learning: A Probabilistic Perspective by Kevin Murphy; Bayesian Reasoning and Machine Learning by David Barber; Probabilistic Graphical Models: Principles and Techniques by Daphne Koller, Nir Friedman. Courses: Machine Learning, Stanford University (Coursera), Andrew Ng; Mining Massive Datasets, Stanford Online.
Tools Java: Weka (for supervised/semi-supervised) (www.cs.waikato.ac.nz/ml/weka/); Mallet (for unsupervised) (www.mallet.cs.umass.edu). Python: Scikit-Learn (http://scikit-learn.org/); Statsmodels (www.statsmodels.sourceforge.net). R: statistical packages (https://cran.r-project.org/web/packages/).
Thank you
Questions?