
Data Structures
Notes for Lecture 13: Techniques of Data Mining
By Ass. Prof. Dr. Samaher Al_Janabi, 2017-2018

Classification: Basic Concepts

1. Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

2. Illustrating Classification Task
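As a minimal illustration of this task, the sketch below divides a small made-up data set into training and test portions, builds a classifier on the training portion, and checks its accuracy on the test portion. It assumes the scikit-learn library is available; the records, attribute values, and class labels are invented purely for illustration and are not from the lecture.

# Minimal sketch: training/test split and classification (assumes scikit-learn).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Invented records: each row is one record's attribute values, y holds the class.
X = [[25, 40000], [47, 25000], [52, 110000], [33, 60000],
     [61, 52000], [29, 30000], [45, 90000], [38, 75000]]
y = ["no", "no", "yes", "no", "yes", "no", "yes", "yes"]

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Build the model on the training set ...
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# ... and use the test set to estimate accuracy on previously unseen records.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))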

3. Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

3.1. Decision Tree
Decision trees are one of the fundamental techniques used in data mining. They are tree-like structures used for classification, clustering, feature selection, and prediction. Decision trees are easily interpretable and intuitive for humans, are well suited to high-dimensional applications, are fast, and usually produce high-quality solutions. Decision tree objectives are consistent with the goals of data mining and knowledge discovery. This lecture reviews the concept of decision trees in data mining.
A decision tree consists of a root and internal nodes. The root and the internal nodes are labeled with questions in order to find a solution to the problem under consideration. The root node is the first state of a DT, and all of the examples from the training data are assigned to it. If all examples belong to the same group, no further decisions are needed to split the data set. If the examples in this node belong to two or more groups, a test is made at the node that results in a split. A DT is binary if each node is split into two parts, and it is nonbinary (multi-branch) if each node is split into three or more parts.
A decision tree model consists of two parts: creating the tree and applying the tree to the database. To achieve this, decision trees use several different algorithms; the algorithms most widely used by computer scientists are ID3, C4.5, and C5.0.
Example: (figure)
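To make this structure concrete, one possible in-memory representation of such a tree is sketched below in Python. The class and field names (DTNode, attribute, children) are assumptions made for illustration, and the small hand-built tree uses the classic weather data rather than anything from these notes. A leaf stores a class label; an internal node stores the attribute it tests and one child per outcome, so a node with three or more children gives a nonbinary (multi-branch) tree.

# Illustrative decision tree node; all names are assumptions, not a standard API.
class DTNode:
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at this node (None for a leaf)
        self.label = label           # class label (meaningful only for a leaf)
        self.children = {}           # maps an outcome value to a child DTNode

    def is_leaf(self):
        return self.attribute is None

# A tiny hand-built tree: the root tests "outlook", with one branch per outcome.
root = DTNode(attribute="outlook")
root.children["sunny"] = DTNode(label="no")
root.children["overcast"] = DTNode(label="yes")
root.children["rain"] = DTNode(label="yes")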

Another example of a decision tree: (figure)

We can construct a decision tree from a set T of training cases as follows. Let the classes be denoted by {C1, C2, ..., Cn}. There are three possibilities:
(i) T contains one or more cases, all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.
(ii) T contains no cases. The decision tree is again a leaf, but the class to be associated with the leaf must be determined from sources other than T.
(iii) T contains cases that belong to a mixture of classes. T is partitioned into subsets T1, T2, ..., Tk, where Ti contains all cases in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome. This process is applied recursively to each subset of the training cases, so that the ith branch leads to the decision tree constructed from the subset Ti (a code sketch of this recursion is given after the list below).
Generally, a decision tree algorithm is most relevant to the third case, where it can be stated informally as follows:
- From the training data set, identify a target variable and a set of input variables.
- Examine each input variable one at a time: create two or more groupings of its values, and measure how similar items are within each group and how different items are between groups. Select the grouping that maximizes similarity within groupings and differences between groupings.
- Once the groupings have been calculated for each input variable, select the single input variable that maximizes similarity within groupings and differences between groupings.
- Repeat this process in each group that contains a convincing percentage of the information in the original data; the process terminates only when all divisible groups have been divided.
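A rough Python sketch of this recursive procedure is given below. It reuses the DTNode class from the earlier sketch and assumes a helper choose_test(cases, attributes) that picks the attribute to split on (for example, by the gain or gain ratio criteria described next); every name here is illustrative rather than part of the lecture material.

from collections import Counter

def build_tree(cases, attributes, default_label=None):
    # cases is a list of (record, class_label) pairs; record maps attribute -> value.
    if not cases:
        # (ii) T contains no cases: the class must be determined from sources
        # other than T, e.g. the majority class of the parent node.
        return DTNode(label=default_label)
    labels = [label for _, label in cases]
    if len(set(labels)) == 1:
        # (i) all cases belong to a single class Cj: the tree is a leaf for Cj.
        return DTNode(label=labels[0])
    # (iii) mixed classes: choose a test, partition T, and recurse on each subset.
    attribute = choose_test(cases, attributes)          # assumed helper
    node = DTNode(attribute=attribute)
    majority = Counter(labels).most_common(1)[0][0]
    for value in {record[attribute] for record, _ in cases}:
        subset = [(r, c) for r, c in cases if r[attribute] == value]
        remaining = [a for a in attributes if a != attribute]
        node.children[value] = build_tree(subset, remaining, default_label=majority)
    return node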

3.1.1. ID3 Algorithm
Below is the decision tree algorithm for ID3, which describes the general layout of DT algorithms. This algorithm uses the gain ratio as the evaluating test. The gain criterion selects a test to maximize the mutual information between the test and the class. The process of determining the gain for a test is as follows.
Imagine selecting one case at random from a set S of cases and announcing that it belongs to some class Cj. Let freq(Cj, S) denote the frequency (count) of class Cj cases in set S, so that this message has probability
    freq(Cj, S) / |S|.
The information the message conveys is therefore
    -log2( freq(Cj, S) / |S| ) bits.
The expected information from such a message pertaining to class membership is the sum over the classes, in proportion to their frequencies in S; that is,
    Info(S) = - Σ_j ( freq(Cj, S) / |S| ) × log2( freq(Cj, S) / |S| ).
When applied to the set of training cases, Info(T) measures the average amount of information needed to identify the class of a case in set T. This quantity is also known as the entropy of the set T.
Now consider a similar measurement after T has been partitioned into subsets T1, ..., Tn in accordance with the n outcomes of a test X. The expected information requirement is the weighted sum over the n subsets:
    Info_X(T) = Σ_i ( |Ti| / |T| ) × Info(Ti).
The quantity
    gain(X) = Info(T) - Info_X(T)
measures the information that is gained by partitioning T in accordance with the test X.
Even though the gain criterion yields good results, it has a serious deficiency: it is biased towards tests with many outcomes. The bias can be corrected by normalizing the apparent gain of tests. By analogy, the split info is defined by
    split info(X) = - Σ_i ( |Ti| / |T| ) × log2( |Ti| / |T| ).
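These quantities translate almost directly into code. The sketch below computes Info(T), Info_X(T), and the gain for a candidate attribute, using the same (record, class_label) representation as the earlier sketches; it is an illustrative reading of the formulas above, not code from the lecture.

from math import log2
from collections import Counter

def info(cases):
    # Info(T): entropy of the class distribution in T, in bits.
    total = len(cases)
    counts = Counter(label for _, label in cases)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def info_x(cases, attribute):
    # Info_X(T): weighted entropy after partitioning T on the outcomes of test X.
    total = len(cases)
    weighted = 0.0
    for value in {record[attribute] for record, _ in cases}:
        subset = [(r, c) for r, c in cases if r[attribute] == value]
        weighted += (len(subset) / total) * info(subset)
    return weighted

def gain(cases, attribute):
    # gain(X) = Info(T) - Info_X(T)
    return info(cases) - info_x(cases, attribute)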

This represents the potential information generated by dividing T into n subsets, whereas the information gain measures the information relevant to classification that arises from the same division. Then
    gain ratio(X) = gain(X) / split info(X)
expresses the proportion of the information generated by the split that is useful for classification. The gain ratio criterion selects a test to maximize this ratio, subject to the constraint that the information gain must be large, at least as large as the average gain over all tests examined.

ID3 Decision Tree Algorithm
Given: Examples (S); Target attribute (C); Attributes (R)
Initialize Root

Function ID3(S, C, R)
  Create a Root node for the tree
  IF S is empty, return a single node with value Failure
  IF every example in S has the same value of the target attribute C, return a single node with that value
  IF R is empty, return a single node with the most frequent value of the target attribute C
  ELSE BEGIN
    Let D be the attribute with the largest GainRatio(D, S) among the attributes in R
    Let {dj | j = 1, 2, ..., n} be the values of attribute D
    Let {Sj | j = 1, 2, ..., n} be the subsets of S consisting respectively of the records with value dj for attribute D
    Return a tree with root labeled D and arcs d1, d2, ..., dn going respectively to the subtrees:
      For each branch j, IF Sj is empty, add a leaf with the most frequent value of C;
      ELSE attach the subtree ID3(Sj, C, R - {D})
  END ID3
Return Root
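Continuing the same illustrative sketch, split info and the gain ratio can be added on top of the functions above, and a choose_test of the kind assumed earlier can then pick the attribute with the best gain ratio among those whose gain is at least average. This is one plausible reading of the criterion, not a faithful reimplementation of ID3 or C4.5.

from math import log2
from collections import Counter

def split_info(cases, attribute):
    # split info(X): potential information generated by dividing T into the subsets.
    total = len(cases)
    sizes = Counter(record[attribute] for record, _ in cases).values()
    return -sum((s / total) * log2(s / total) for s in sizes)

def gain_ratio(cases, attribute):
    s = split_info(cases, attribute)
    return gain(cases, attribute) / s if s > 0 else 0.0

def choose_test(cases, attributes):
    # Prefer the largest gain ratio, restricted to attributes whose gain is at
    # least as large as the average gain over the attributes examined.
    gains = {a: gain(cases, a) for a in attributes}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attributes if gains[a] >= average]
    return max(candidates, key=lambda a: gain_ratio(cases, a))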

3.1.2. Decision Tree Classification Task
A. Apply Model to Test Data
(Figure panels (A)-(F) illustrate applying the model to test data.)
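With the hand-built DTNode representation sketched earlier, applying the model to a test record simply means walking from the root down the branch that matches the record's value for each tested attribute until a leaf assigns a class. A minimal sketch, again with illustrative names only:

def classify(node, record):
    # Follow the branch for the record's value of the attribute tested at each
    # decision node until a leaf is reached; the leaf's label is the prediction.
    while not node.is_leaf():
        value = record.get(node.attribute)
        if value not in node.children:
            # No branch for this outcome, so no confident prediction can be made.
            return None
        node = node.children[value]
    return node.label

# Applying the tiny tree built in Section 3.1 to two test records:
print(classify(root, {"outlook": "sunny"}))      # -> no
print(classify(root, {"outlook": "overcast"}))   # -> yes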