Machine Learning: Basic Concepts
Joakim Nivre
Uppsala University and Växjö University, Sweden
E-mail: nivre@msi.vxu.se
Machine Learning 1(24)

Machine Learning
Idea: Synthesize computer programs by learning from representative examples of input (and output) data.
Rationale:
1. For many problems, there is no known method for computing the desired output from a set of inputs.
2. For other problems, computation according to the known correct method may be too expensive.
Machine Learning 2(24)

Well-Posed Learning Problems
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Examples:
1. Learning to classify chemical compounds
2. Learning to drive an autonomous vehicle
3. Learning to play bridge
4. Learning to parse natural language sentences
Machine Learning 3(24)

Designing a Learning System
In designing a learning system, we have to deal with (at least) the following issues:
1. Training experience
2. Target function
3. Learned function
4. Learning algorithm
Example: Consider the task T of parsing Swedish sentences, using the performance measure P of labeled precision and recall in a given test corpus (gold standard).
Machine Learning 4(24)

Training Experience
Issues concerning the training experience:
1. Direct or indirect evidence (supervised or unsupervised).
2. Controlled or uncontrolled sequence of training examples.
3. Representativity of training data in relation to test data.
Training data for a syntactic parser:
1. Treebank versus raw text corpus.
2. Constructed test suite versus random sample.
3. Training and test data from the same/similar/different sources with the same/similar/different annotations.
Machine Learning 5(24)

Target Function and Learned Function
The problem of improving performance can often be reduced to the problem of learning some particular target function.
A shift-reduce parser can be trained by learning a transition function f : C → C, where C is the set of possible parser configurations.
In many cases we can only hope to acquire some approximation to the ideal target function.
The transition function f can be approximated by a function f̂ : Σ → Action from stack (top) symbols to parse actions.
Machine Learning 6(24)

Learning Algorithm
In order to learn the (approximated) target function we require:
1. A set of training examples (input arguments)
2. A rule for estimating the value corresponding to each training example (if this is not directly available)
3. An algorithm for choosing the function that best fits the training data
Given a treebank on which we can simulate the shift-reduce parser, we may decide to choose the function that maps each stack symbol σ to the action that occurs most frequently when σ is on top of the stack.
Machine Learning 7(24)

Supervised Learning
Let X and Y be the set of possible inputs and outputs, respectively.
1. Target function: Function f from X to Y.
2. Training data: Finite sequence D of pairs ⟨x, f(x)⟩ (x ∈ X).
3. Hypothesis space: Subset H of functions from X to Y.
4. Learning algorithm: Function A mapping a training set D to a hypothesis h ∈ H.
If Y is a subset of the real numbers, we have a regression problem; otherwise we have a classification problem.
Machine Learning 8(24)
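As a concrete illustration of the frequency-based choice described above, here is a minimal Python sketch (not from the slides; the function name and the (stack symbol, action) pair format are assumptions) of a learning algorithm A that maps each stack-top symbol to its most frequent action in the simulated treebank data. In the terms of the Supervised Learning slide, X is the set of stack symbols, Y the set of parse actions, and H the set of functions representable as such lookup tables.

from collections import Counter, defaultdict

def train_most_frequent_action(examples):
    """Learn f_hat: stack symbol -> parse action by picking, for each
    stack-top symbol, the action that occurs most frequently with it."""
    counts = defaultdict(Counter)
    for symbol, action in examples:
        counts[symbol][action] += 1
    return {symbol: ctr.most_common(1)[0][0] for symbol, ctr in counts.items()}

# Toy training data: (stack-top symbol, action) pairs from simulating the parser.
examples = [("NP", "reduce"), ("NP", "shift"), ("NP", "reduce"), ("Det", "shift")]
f_hat = train_most_frequent_action(examples)
print(f_hat)  # {'NP': 'reduce', 'Det': 'shift'}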

Variations of Machine Learning
Unsupervised learning: Learning without output values (data exploration, e.g. clustering).
Query learning: Learning where the learner can query the environment about the output associated with a particular input.
Reinforcement learning: Learning where the learner has a range of actions which it can take to attempt to move towards states where it can expect high rewards.
Batch vs. online learning: All training examples at once or one at a time (with estimate and update after each example).
Machine Learning 9(24)

Learning and Generalization
Any hypothesis that correctly classifies all the training examples is said to be consistent. However:
1. The training data may be noisy so that there is no consistent hypothesis at all.
2. The real target function may be outside the hypothesis space and has to be approximated.
3. A rote learner, which simply outputs y for every x such that ⟨x, y⟩ ∈ D, is consistent but fails to classify any x not in D (see the sketch below).
A better criterion of success is generalization, the ability to correctly classify instances not represented in the training data.
Machine Learning 10(24)
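For concreteness, the rote learner mentioned above can be sketched as a simple lookup table (a minimal Python sketch, not from the slides; returning None stands for "no classification"):

def rote_learner(D):
    """Return a classifier that outputs y for every x with (x, y) in D
    and abstains (None) on anything not seen in training."""
    table = dict(D)
    return lambda x: table.get(x)

h = rote_learner([(("Yes", "No", "No"), 1), (("No", "No", "No"), 0)])
print(h(("Yes", "No", "No")))   # 1    (consistent with the training data)
print(h(("Yes", "Yes", "No")))  # None (no generalization to unseen instances)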

Concept Learning
Concept learning: Inferring a boolean-valued function from training examples of its input and output.
Terminology and notation:
1. The set of items over which the concept is defined is called the set of instances and denoted by X.
2. The concept or function to be learned is called the target concept and denoted by c : X → {0, 1}.
3. Training examples consist of an instance x ∈ X along with its target concept value c(x). (An instance x is positive if c(x) = 1 and negative if c(x) = 0.)
Machine Learning 11(24)

Hypothesis Spaces and Inductive Learning
Given a set of training examples of the target concept c, the problem faced by the learner is to hypothesize, or estimate, c.
The set of all possible hypotheses that the learner may consider is denoted H.
The goal of the learner is to find a hypothesis h ∈ H such that h(x) = c(x) for all x ∈ X.
The inductive learning hypothesis: Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
Machine Learning 12(24)

Hypothesis Representation
The hypothesis space is usually determined by the human designer's choice of hypothesis representation. We assume:
1. An instance is represented as a tuple of attributes ⟨a_1 = v_1, ..., a_n = v_n⟩.
2. A hypothesis is represented as a conjunction of constraints on instance attributes.
3. Possible constraints are a_i = v (specifying a single value), ? (any value is acceptable), and ∅ (no value is acceptable); see the code sketch below.
Machine Learning 13(24)

A Simple Concept Learning Task
Target concept: Proper name.
Instances: Words (in text).
Instance attributes:
1. Capitalized: Yes, No.
2. Sentence-initial: Yes, No.
3. Contains hyphen: Yes, No.
Training examples: ⟨Yes, No, No⟩, 1, ⟨No, No, No⟩, 0, ...
Machine Learning 14(24)
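To make the representation and the proper-name task concrete, the following Python sketch (an assumed encoding, not from the slides) represents instances as value tuples and hypotheses as constraint tuples, with "?" for "any value" and None standing in for ∅:

# An instance is a tuple of attribute values, e.g. ("Yes", "No", "No") for a
# capitalized, non-sentence-initial, hyphen-free word. A hypothesis is a tuple
# of constraints: a specific value, "?" (any value), or None (no value accepted).

def satisfies(instance, hypothesis):
    """h(x) = 1 iff every constraint in the hypothesis is met by the instance."""
    return all(c == "?" or c == v for c, v in zip(hypothesis, instance))

print(satisfies(("Yes", "No", "No"), ("Yes", "?", "?")))   # True
print(satisfies(("Yes", "No", "No"), ("?", "Yes", "?")))   # False
print(satisfies(("Yes", "No", "No"), (None, None, None)))  # False: matches nothing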

Concept Learning as Search
Concept learning can be viewed as the task of searching through a large, sometimes infinite, space of hypotheses implicitly defined by the hypothesis representation.
Hypotheses can be ordered from general to specific. Let h_j and h_k be boolean-valued functions defined over X:
h_j ≥_g h_k if and only if (∀x ∈ X)[(h_k(x) = 1) → (h_j(x) = 1)]
h_j >_g h_k if and only if (h_j ≥_g h_k) ∧ ¬(h_k ≥_g h_j)
Machine Learning 15(24)

Algorithm 1: Find-S
The algorithm Find-S for finding a maximally specific hypothesis:
1. Initialize h to the most specific hypothesis in H (∀x ∈ X : h(x) = 0).
2. For each positive training instance x: For each constraint a in h, if x satisfies a, do nothing; else replace a by the next more general constraint satisfied by x.
3. Output hypothesis h.
Machine Learning 16(24)
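Using the representation sketched above, Find-S can be written out as follows (a sketch under the assumption of conjunctive hypotheses and an (instance, label) input format; the helper name is not from the slides):

def find_s(examples, n_attributes=3):
    """Start from the most specific hypothesis (all constraints empty) and
    minimally generalize it on each positive example; negatives are ignored."""
    h = [None] * n_attributes
    for x, label in examples:
        if label != 1:
            continue                    # Find-S never looks at negative examples
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value            # generalize empty constraint to this value
            elif h[i] != value:
                h[i] = "?"              # conflicting values: allow any value
    return tuple(h)

examples = [(("Yes", "No", "No"), 1), (("No", "No", "No"), 0), (("Yes", "Yes", "No"), 1)]
print(find_s(examples))  # ('Yes', '?', 'No')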

Open Questions
Has the learner converged to the only hypothesis in H consistent with the data (i.e. the correct target concept) or are there many other consistent hypotheses as well?
Why prefer the most specific hypothesis (in the latter case)?
Are the training examples consistent? (Inconsistent data can severely mislead Find-S, given the fact that it ignores negative examples.)
What if there are several maximally specific consistent hypotheses? (This is a possibility for some hypothesis spaces but not for others.)
Machine Learning 17(24)

Algorithm 2: Candidate Elimination
Initialize G and S to the set of maximally general and maximally specific hypotheses in H, respectively.
For each training example d ∈ D:
1. If d is a positive example, then remove from G any hypothesis inconsistent with d and make minimal generalizations to all hypotheses in S inconsistent with d.
2. If d is a negative example, then remove from S any hypothesis inconsistent with d and make minimal specializations to all hypotheses in G inconsistent with d.
Output G and S.
Machine Learning 18(24)
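The sketch below implements Candidate Elimination exactly as stated on this slide (it does not additionally prune one boundary against the other, as Mitchell's full version does). The DOMAINS table for three Yes/No attributes and all function names are assumptions for the proper-name task, not the slides' own code:

DOMAINS = (("Yes", "No"),) * 3          # possible values of the three attributes

def covers(h, x):
    """h(x) = 1 iff every constraint in h is satisfied by x."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def min_generalization(s, x):
    """Minimally generalize s so that it covers the positive instance x."""
    return tuple(v if c is None else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, x):
    """Minimally specialize g so that it no longer covers the negative instance x."""
    out = []
    for i, c in enumerate(g):
        if c == "?":
            out.extend(g[:i] + (v,) + g[i + 1:] for v in DOMAINS[i] if v != x[i])
    return out

def candidate_elimination(examples, n_attributes=3):
    G = {("?",) * n_attributes}         # maximally general boundary
    S = {(None,) * n_attributes}        # maximally specific boundary
    for x, label in examples:
        if label == 1:                  # positive example
            G = {g for g in G if covers(g, x)}
            S = {s if covers(s, x) else min_generalization(s, x) for s in S}
        else:                           # negative example
            S = {s for s in S if not covers(s, x)}
            G = {h for g in G
                   for h in ([g] if not covers(g, x) else min_specializations(g, x))}
    return G, S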

Example: Candidate Elimination
Initialization:
G = {⟨?, ?, ?⟩}
S = {⟨∅, ∅, ∅⟩}
Instance 1: ⟨Yes, No, No⟩, 1:
G = {⟨?, ?, ?⟩}
S = {⟨Yes, No, No⟩}
Instance 2: ⟨No, No, No⟩, 0:
G = {⟨Yes, ?, ?⟩, ⟨?, Yes, ?⟩, ⟨?, ?, Yes⟩}
S = {⟨Yes, No, No⟩}
Machine Learning 19(24)

Remarks on Candidate-Elimination 1
The sets G and S summarize the information from previously encountered negative and positive examples, respectively.
The algorithm will converge toward the hypothesis that correctly describes the target concept, provided there are no errors in the training examples and there is some hypothesis in H that correctly describes the target concept.
The target concept is exactly learned when the S and G boundary sets converge to a single identical hypothesis.
Machine Learning 20(24)
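Running the candidate-elimination sketch given earlier on the two training instances of the example reproduces the boundary sets shown above (set iteration order may vary):

examples = [(("Yes", "No", "No"), 1),   # Instance 1 (positive)
            (("No", "No", "No"), 0)]    # Instance 2 (negative)
G, S = candidate_elimination(examples)
print(S)  # {('Yes', 'No', 'No')}
print(G)  # {('Yes', '?', '?'), ('?', 'Yes', '?'), ('?', '?', 'Yes')}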

Remarks on Candidate-Elimination 2
If there are errors in the training examples, the algorithm will remove the correct target concept, and S and G will converge to an empty target space.
A similar result will be obtained if the target concept cannot be described in the hypothesis representation (e.g. if the target concept is a disjunction of feature attributes and the hypothesis space supports only conjunctive descriptions).
Machine Learning 21(24)

Inductive Bias
The inductive bias of a concept learning algorithm L is any minimal set of assertions B such that for any target concept c and set of training examples D_c:
(∀x_i ∈ X)[(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]
where L(x_i, D_c) is the classification assigned to x_i by L after training on the data D_c.
We use the notation (D_c ∧ x_i) ≻ L(x_i, D_c) to say that L(x_i, D_c) follows inductively from (D_c ∧ x_i) (with implicit inductive bias).
Machine Learning 22(24)

Inductive Bias: Examples
Rote-Learning: New instances are classified only if they have occurred in the training data. No inductive bias and therefore no generalization to unseen instances.
Find-S: New instances are classified using the most specific hypothesis consistent with the training examples. Inductive bias: The target concept c is contained in the given hypothesis space and all instances are negative unless proven positive.
Candidate-Elimination: New instances are classified only if all members of the current set of hypotheses agree on the classification. Inductive bias: The target concept c is contained in the given hypothesis space H (e.g. it is non-disjunctive).
Machine Learning 23(24)

Inductive Inference
A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
To eliminate the inductive bias of, say, Candidate-Elimination, we can extend the hypothesis space H to be the power set of X. But this entails that:
S = {x ∈ D_c | c(x) = 1}
G = ¬{x ∈ D_c | c(x) = 0}
That is, S converges to exactly the observed positive examples and G to everything except the observed negatives, so the two boundaries disagree on every unseen instance. Hence, Candidate-Elimination is reduced to rote learning.
Machine Learning 24(24)