Machine Learning: Algorithms and Applications
Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012

Lecture 11: 21 May 2012
Unsupervised Learning (cont'd)
Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/webminingbook.html
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Mixed attributes
- The distance functions we have seen are for data whose attributes are all numeric, all nominal, etc.
- In many practical cases, data have attributes of different types, drawn from the following six: interval-scaled, ratio-scaled, symmetric binary, asymmetric binary, nominal, ordinal
- Clustering a data set with mixed attributes is a challenging problem
Convert to a single type
One common way of dealing with mixed attributes is to:
1. Choose a dominant attribute type
2. Convert the other types to this type
E.g., if most attributes in a data set are interval-scaled:
- we convert ordinal attributes and ratio-scaled attributes to interval-scaled attributes
- it is also appropriate to treat symmetric binary attributes as interval-scaled attributes

Convert to a single type (cont'd)
- It does not make much sense to convert a nominal attribute or an asymmetric binary attribute to an interval-scaled attribute, but it is frequently done in practice by assigning numbers to the values according to some hidden ordering, e.g., the prices of the fruits
- Alternatively, a nominal attribute can be converted to a set of (symmetric) binary attributes, which are then treated as numeric attributes
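The conversion of a nominal attribute to a set of binary attributes can be sketched as follows (a minimal illustration; the attribute values are made up for the example):

```python
def nominal_to_binary(values):
    """Map each nominal value to a vector of 0/1 (symmetric) binary attributes."""
    categories = sorted(set(values))   # fix an ordering of the categories
    return [[1 if v == c else 0 for c in categories] for v in values]

fruit = ["apple", "banana", "apple", "cherry"]
print(nominal_to_binary(fruit))
# each value becomes a numeric vector, e.g. "apple" -> [1, 0, 0]
```

Each resulting vector can then be treated as a set of numeric attributes in a distance computation.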
Combining individual distances
This approach computes individual attribute distances and then combines them. A combination formula, proposed by Gower, is

    dist(x_i, x_j) = ( Σ_{f=1}^{r} δ_ij^f d_ij^f ) / ( Σ_{f=1}^{r} δ_ij^f )    (4)

- The distance dist(x_i, x_j) is between 0 and 1
- r is the number of attributes
- δ_ij^f = 1 if x_if and x_jf are not missing; δ_ij^f = 0 if x_if or x_jf is missing, or if attribute f is asymmetric binary and x_if and x_jf are both 0
- d_ij^f is the distance contributed by attribute f, in the range [0, 1]

Combining individual distances (cont'd)
If f is a binary or nominal attribute:

    d_ij^f = 1 if x_if ≠ x_jf, 0 otherwise

With this choice, distance (4) reduces to
- equation (3) of lecture 10 if all attributes are nominal
- the simple matching distance, (1) of lecture 10, if all attributes are symmetric binary
- the Jaccard distance, (2) of lecture 10, if all attributes are asymmetric binary

If f is interval-scaled:

    d_ij^f = |x_if − x_jf| / R_f

where R_f = max(f) − min(f) is the value range of f. If all attributes are interval-scaled, distance (4) reduces to the Manhattan distance, assuming that all attribute values are standardized.

Ordinal and ratio-scaled attributes are converted to interval-scaled attributes and handled in the same way.
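Gower's formula (4) can be sketched as follows, assuming for simplicity that each attribute is tagged as "nominal" or "interval" and that the ranges R_f of the interval attributes have been precomputed over the data set:

```python
def gower_distance(x, y, types, ranges):
    """Gower's combined distance.
    types[f] in {"nominal", "interval"}; ranges[f] = max(f) - min(f);
    None marks a missing value."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        if x[f] is None or y[f] is None:   # delta = 0: skip missing values
            continue
        if t == "nominal":
            d = 0.0 if x[f] == y[f] else 1.0
        else:                              # interval-scaled, normalized to [0, 1]
            d = abs(x[f] - y[f]) / ranges[f]
        num += d                           # delta = 1 for this attribute
        den += 1.0
    return num / den if den > 0 else 0.0

x = ["red", 2.0, 10.0]
y = ["blue", 4.0, 10.0]
print(gower_distance(x, y, ["nominal", "interval", "interval"],
                     [None, 10.0, 20.0]))   # (1 + 0.2 + 0) / 3 = 0.4
```

The asymmetric-binary case (δ = 0 when both values are 0) would add one more branch to the same loop.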
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

How to choose a clustering algorithm
- Clustering research has a long history; a vast collection of algorithms is available, of which we introduced only a few main ones
- Choosing the best algorithm is challenging
  - Every algorithm has limitations and works well only with certain data distributions
  - It is very hard, if not impossible, to know what distribution the application data follow
  - The data may not fully follow any ideal structure or distribution required by the algorithms
- One also needs to decide how to standardize the data, choose a suitable distance function, and select other parameter values
How to choose a clustering algorithm (cont'd)
Due to these complexities, the common practice is to:
1. run several algorithms using different distance functions and parameter settings
2. carefully analyze and compare the results
The interpretation of the results must be based on
- insight into the meaning of the original data
- knowledge of the algorithms used
Clustering is highly application dependent and, to a certain extent, subjective (personal preferences)

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary
Cluster evaluation: a hard problem
The quality of a clustering is very hard to evaluate because we do not know the correct clusters. Some methods that are used:
- User inspection
  - A panel of experts inspects the resulting clusters and scores them
  - Study centroids and spreads
  - Examine rules (e.g., from a decision tree) that describe the clusters
  - For text documents, one can inspect the clusters by reading the documents
  - The final score is the average of the individual scores
  - Manual inspection is labor intensive and time consuming

Cluster evaluation: ground truth
- We use some labeled data (for classification)
- Assumption: each class is a cluster
- Let the classes in the data D be C = (c_1, c_2, ..., c_k)
- The clustering method produces k clusters, which divide D into k disjoint subsets D_1, D_2, ..., D_k
- After clustering, a confusion matrix is constructed
- From the matrix, we compute various measures: entropy, purity, precision, recall, and F-score
Evaluation measures: entropy
For each cluster D_i, we measure its entropy as

    entropy(D_i) = − Σ_{j=1}^{k} Pr_i(c_j) log_2 Pr_i(c_j)

where Pr_i(c_j) is the proportion of class c_j in cluster D_i. The entropy of the whole clustering is

    entropy_total(D) = Σ_{i=1}^{k} (|D_i| / |D|) entropy(D_i)

where |D_i| / |D| is the weight of cluster D_i, proportional to its size.

Evaluation measures: purity
Purity measures the extent to which a cluster contains only one class of data:

    purity(D_i) = max_j Pr_i(c_j)

The purity of the whole clustering is

    purity_total(D) = Σ_{i=1}^{k} (|D_i| / |D|) purity(D_i)

where |D_i| / |D| is the weight of cluster D_i, proportional to its size.

Precision, recall, and F-measure can be computed as well, based on the class that is most frequent in the cluster.
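The entropy and purity measures above can be sketched in a few lines, taking as input the per-class counts of each cluster (i.e., the rows of the confusion matrix); the example counts are made up for illustration:

```python
from math import log2

def cluster_entropy(class_counts):
    """entropy(D_i) = -sum_j Pr_i(c_j) * log2(Pr_i(c_j))."""
    n = sum(class_counts)
    return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)

def cluster_purity(class_counts):
    """purity(D_i) = max_j Pr_i(c_j)."""
    return max(class_counts) / sum(class_counts)

def total_measure(clusters, measure):
    """Size-weighted average of a per-cluster measure over the whole clustering."""
    total = sum(sum(c) for c in clusters)
    return sum(sum(c) / total * measure(c) for c in clusters)

clusters = [[45, 5], [10, 40]]   # rows: clusters, columns: classes
print(total_measure(clusters, cluster_purity))    # 0.85
print(total_measure(clusters, cluster_entropy))
```

A pure cluster has entropy 0 and purity 1; a cluster split evenly over two classes has entropy 1 and purity 0.5.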
An example
- We can use the total entropy or purity to compare different clustering results, from the same algorithm or from different algorithms
- Precision, recall, and F-measure can be computed as well for each cluster
- The precision of Science in cluster 1 is 0.89, the recall is 0.83, and the F-measure is thus 0.86

A remark about ground-truth evaluation
- It is commonly used to compare different clustering algorithms
- A real-life data set for clustering has no class labels
- Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand
- The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm
- This evaluation method is said to be based on external data or information
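The F-measure is the harmonic mean of precision and recall; the Science-in-cluster-1 numbers quoted above can be reproduced as:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.89, 0.83), 2))   # 0.86
```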
Evaluation based on internal information
- Intra-cluster cohesion (compactness): cohesion measures how near the data points in a cluster are to the cluster centroid. The sum of squared error (SSE) is a commonly used measure
- Inter-cluster separation (isolation): separation means that different cluster centroids should be far away from one another
- In most applications, expert judgment is still the key

Indirect evaluation
- In some applications, clustering is not the primary task but is used to help perform another task
- We can use the performance on the primary task to compare clustering methods
- For instance, in an application the primary task may be to provide book-purchasing recommendations to online shoppers
- If we can cluster shoppers according to their features, we might be able to provide better recommendations
- We can then evaluate different clustering algorithms based on how well they help with the recommendation task
- Here, we assume that the recommendations can be reliably evaluated
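The SSE cohesion measure mentioned above sums, over all clusters, the squared distances of each point to its cluster centroid; a minimal sketch, with made-up points:

```python
def sse(clusters):
    """Sum of squared error. clusters: list of clusters, each a list of
    numeric vectors of equal dimension."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # centroid = mean of the points in the cluster, per dimension
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in points)
    return total

clusters = [[[0.0, 0.0], [2.0, 0.0]], [[5.0, 5.0]]]
print(sse(clusters))   # 2.0  (each point of the first cluster is 1 away from (1, 0))
```

A smaller SSE means more compact clusters; note that SSE always decreases as the number of clusters grows, so it cannot be compared across different k without care.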
Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Summary
- Clustering has a long history and is still an active research area
- There is a huge number of clustering algorithms, and more are still coming every year
- We only introduced several main algorithms. There are many others, e.g., density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
- Clustering is hard to evaluate, but very useful in practice. This partially explains why a large number of clustering algorithms are still being devised every year
- Clustering is highly application dependent and to some extent subjective
Reinforcement Learning
These slides are an adaptation of slides drawn by Tom Mitchell and modified by Liviu Ciortuz

Introduction
- Supervised learning is the simplest and most studied type of learning
- How can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
  - The agent has a task to perform
  - It takes some actions in the world
  - At some later point, it gets feedback telling it how well it did on performing the task
  - The agent performs the same task over and over again
- This problem is called reinforcement learning:
  - The agent gets positive reward for tasks done well
  - The agent gets negative reward for tasks done poorly
Introduction (cont'd)
- The goal is to get the agent to act in the world so as to maximize its rewards
- The agent has to figure out which of its actions made it get the reward/punishment; this is known as the credit assignment problem
- Reinforcement learning can be used to train computers to do many tasks, such as: playing board games, job-shop scheduling, controlling robots, flight/taxi scheduling

Overview
- Task: control learning: make an autonomous agent (robot) perform actions, observe the consequences, and learn a control strategy
- The Q-learning algorithm: acquire optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment
- Reinforcement learning is related to dynamic programming, which is used to solve optimization problems
- While DP assumes that the agent/program knows the effects (and rewards) of all its actions, in RL the agent has to experiment in the real world
Reinforcement Learning Problem
- Target function to learn: π : S → A
- Goal: maximize r_0 + γ r_1 + γ^2 r_2 + ..., where 0 ≤ γ < 1
- Example: playing Backgammon (TD-Gammon [Tesauro, 1995]); immediate reward +100 if win, −100 if lose, 0 otherwise

Control learning characteristics
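The discounted sum of rewards in the goal above can be computed directly for a finite reward sequence; the reward values below are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ..., with 0 <= gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# a reward of 100 received two steps in the future, discounted by gamma = 0.9
print(discounted_return([0, 0, 100], 0.9))   # 81.0
```

The discount factor γ makes rewards received sooner worth more than the same rewards received later.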
Learning Sequential Control Strategies Using Markov Decision Processes

Agent's Learning Task