A Rough Set Approach to Data Mining:
Extracting a Logic of Default Rules from Data

Torulf Mollestad

Knowledge Systems Group
Department of Computer Systems
The Norwegian University of Science and Technology

February 11, 1997


excusez-moi je parle cheval ("excuse me, I speak horse") (J. Prévert)


Abstract

In this thesis, the problem of Data Mining is investigated, that is, the construction of decision rules from a set of primitive input data. The main contention of the present work is that there is a need to be able to reason also in the presence of inconsistencies, and that more general, possibly unsafe rules should be made available through the Data Mining process. Such rules are typically simpler in structure and allow the user to reason in the absence of information. In this work, Rough Set theory is used as the underlying framework for Data Mining of rules that reflect limited knowledge. A framework is suggested for the automatic extraction of propositional default rules that reflect normal intra-dependencies in the data. The proposed algorithm introduces indeterminacy by removing condition attributes in a controlled manner. The selection of attributes to be removed is made from the factors in the discernibility function, thereby removing information needed to discern classes in the original information system. By this procedure, a number of different default decision algorithms (sets of default rules) are obtained, each of which classifies according to information over a subset of the condition attributes. Hence, when classifying new cases, the default decision algorithm best suited to the information at hand may be selected and applied. The approach offers the possibility to direct an information gathering process, through upward traversal of the lattice of information systems. At each point, the attribute(s) may be selected that are presumed to give the most information relative to the current situation. In this light, a link is drawn to prioritised default frameworks, arguing that an upward traversal of the lattice enables the use of increasingly more specific rules. If the more specific rules are in conflict with conclusions drawn on the basis of less information, then the latter conclusions are retracted. A number of properties of the framework are investigated, with special emphasis on methods for limiting an exponential search space. Also, a framework is defined for making an exhaustive search for functional dependencies in an information system; the connection with the default rule extraction algorithm is intuitive. Tests have been run on several different data sets. The results suggest that default rules are good for classifying new objects in situations of limited knowledge, and also that default rules give a good view of the relative importance of the different attributes. The knowledge is presented in an explicit way, in a manner which is easily understandable to a human being.
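As a concrete illustration of the rough-set notions the abstract builds on, the following minimal Python sketch constructs a toy decision table, computes its discernibility factors (the sets of condition attributes that distinguish objects belonging to different decision classes), and shows how projecting away an attribute that occurs in such a factor glues objects together and thereby introduces indeterminacy. The toy data and all names (table, discernibility_matrix, projection) are hypothetical illustrations, not the thesis' actual algorithm or implementation.

# Minimal sketch, assuming a toy decision table; not the thesis' implementation.
from itertools import combinations

# A decision system: objects described by condition attributes plus a decision 'd'.
CONDITIONS = ["temp", "cough", "headache"]
DECISION = "d"

table = [
    {"temp": "high",   "cough": "yes", "headache": "no",  "d": "flu"},
    {"temp": "high",   "cough": "no",  "headache": "yes", "d": "flu"},
    {"temp": "normal", "cough": "yes", "headache": "no",  "d": "cold"},
    {"temp": "normal", "cough": "no",  "headache": "no",  "d": "healthy"},
]

def discernibility_matrix(rows, conditions, decision):
    """For every pair of objects with different decisions, collect the set of
    condition attributes on which they differ (one 'factor' per pair)."""
    factors = []
    for x, y in combinations(rows, 2):
        if x[decision] == y[decision]:
            continue  # pairs within the same decision class need not be discerned
        diff = frozenset(a for a in conditions if x[a] != y[a])
        if diff:
            factors.append(diff)
    return factors

def projection(rows, keep):
    """Drop condition attributes not in `keep`; this gluing of objects is what
    introduces indeterminacy and, eventually, default rather than definite rules."""
    return [{a: r[a] for a in list(keep) + [DECISION]} for r in rows]

factors = discernibility_matrix(table, CONDITIONS, DECISION)
print("Discernibility factors:", [set(f) for f in factors])

# Removing an attribute that occurs in some factor (here: 'temp') destroys part of
# the discerning information, so the projected table may no longer be deterministic.
reduced = projection(table, [a for a in CONDITIONS if a != "temp"])
print("Projected table:", reduced)

In this sketch the projection that removes "temp" makes two objects with different decisions indistinguishable, which is exactly the kind of controlled loss of discernibility the extraction procedure described above exploits.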


Contents

I Preliminaries 1

1 Introduction 3
  1.1 Living is Learning 3
  1.2 Learning is Observing 3
  1.3 Observing is Discerning 4
  1.4 The Information Explosion 5
  1.5 Knowledge Discovery and Data Mining 6
  1.6 Overview of the Thesis 12

2 Background 13
  2.1 Knowledge Discovery and Data Mining 13
    2.1.1 Methods for Doing Knowledge Discovery 14
    2.1.2 The Problem of Uncertainty 18
  2.2 Reasoning under Uncertainty 19
    2.2.1 Probability and Certainty Factors 19
    2.2.2 Evidence Theory 20
    2.2.3 Default Reasoning 22
  2.3 A Forward Pointer 27

3 Overview of the Rough Set Theory 29
  3.1 Introduction 29
  3.2 Information Systems 29
  3.3 Reasoning about Approximate Concepts 35
  3.4 Taking the Decision into Consideration 37
  3.5 Extraction of Rules from Information Systems 42
    3.5.1 Determinism and Indeterminism 42
    3.5.2 Decision Rule Generation 43

II Main Results 51

4 Generation of Default Rules 53
  4.1 Discernibility Factor Sets 54
  4.2 What's in a Default? 56
  4.3 Handling Inconsistent Examples 58
  4.4 Finding More General Patterns 60
    4.4.1 Projections of Information Systems 61
    4.4.2 Gluing Object Classes through Making Projections 66
    4.4.3 Gluing Generates Indeterminacy 69
    4.4.4 The Space of Decision Systems 72
    4.4.5 Blocking of Default Rules 74
  4.5 Generation of Default Rules and Blocks 76
  4.6 A Sketch of the Algorithm 79

5 The Framework for Default Rules Generation 81
  5.1 The Algorithm 81
    5.1.1 The Main Function - Compute Decision Algorithms 81
    5.1.2 Computing Variants - Compute Variants 82
    5.1.3 Rule Generation - Gen DRules DBlocks 84
  5.2 The Original Example Continued 86
  5.3 Related Work 92
  5.4 Discussion 95

6 Modelling Specificity through Prioritisation 97
  6.1 Conflict Resolution through Voting 98
  6.2 Lattice Ascent and Prioritised Knowledge 98

7 Analysis of the Framework 105
  7.1 The Complexity of the Algorithm 105
  7.2 The Properties of the Membership Function 108
  7.3 The Change in Membership Degree 110
  7.4 A Local Test on Projection Quality 111
    7.4.1 Monotonically Decreasing Membership Degree 114
    7.4.2 Interesting Nodes are Reached 116

  7.5 Selecting Projections with the Least Loss 118

8 Heuristics and User Control 121
  8.1 Handling the Complexity 121
    8.1.1 A Representation of Decision Class Distribution 122
  8.2 Minimising the Number of Objects Lost 124
  8.3 Counting the Size of Classes 131
  8.4 Selecting Projections from the UDFS 132
  8.5 The Threshold Function 133

9 Functional Dependencies 135
  9.1 Reducts and Minimal Descriptions 135
  9.2 The Definition of Functional Dependency 137
  9.3 A Framework for Finding Dependencies 139
    9.3.1 Functional Dependencies in A 140
    9.3.2 Functional Dependencies in Variants of A 142

III Experiences and Conclusion 151

10 Experiments and Observations 153
  10.1 Implementation 153
    10.1.1 The Output from the Algorithm 154
  10.2 The Case Studies 156
    10.2.1 Australian Credit Card Approval 157
    10.2.2 The Hepatitis Domain 163
    10.2.3 The Heart-Disease Domain 167
    10.2.4 The Invariance With the Size of the Training Set 171
    10.2.5 Two Additional Examples 171
  10.3 Some Notes on the Experiments 172
  10.4 Further Improvements 174

11 Summary and Future Research 177
  11.1 Contributions and Experience 177
  11.2 Further Work 180

IV Appendices 193

A Proofs of Prop. ?? and ?? 195
B Notation 197
C Boolean Simplification 199

List of Figures

1.1 The Information Pyramid 6
1.2 The Projection of the Universe onto an Information System 10
2.1 A Decision Tree 15
2.2 A Generalisation Hierarchy 17
3.1 The B-Upper and B-Lower Approximation of X 35
4.1 Performing a projection on A = (U, A) 62
4.2 Deterministic and Indeterministic Decision Systems 63
4.3 Joining Classes as a Consequence of Projection 70
4.4 Pushing Objects into the Border Region 70
5.1 The Search in the Lattice for Ex. ?? 88
6.1 A Lattice over Penguins 102
7.1 The Squeezing of the membership value 112
7.2 Construction of Variants through other Paths 114
7.3 Monotonic Descent 116
8.1 Five Classes in a Three-Dimensional Vector Space 123
8.2 Gluing Classes Together 124
8.3 Objects Lost at each Path 129
8.4 Cutting Paths in the Lattice 130
9.1 The regions sub-(A) and sub+(A) 144
9.2 The search in the Information System over A = {e, d, m, s} 148
10.1 The Australian Credit Domain 162
10.2 The Hepatitis Domain 168

10.3 The Heart-Disease Domain 170

List of Tables

2.1 A Basic Probability Assignment, and the Belief and Plausibility Values 21
3.1 An Information System 30
3.2 An Example Information System 31
3.3 The Discernibility Matrix for the Example in Tab. ?? 33
3.4 An Example Decision System 39
3.5 The Discernibility Matrix Modulo Decision Attribute d 41
3.6 Gen Definite rules((u, (C, D))) 47
4.1 Adding and Removing Objects from A 64
4.2 Adding and Removing Condition Attributes from A 65
4.3 Adding and Removing Decision Attributes from A 65
4.4 Modifying a Decision System 66
5.1 Compute Decision Algorithms(A) 82
5.2 Compute Variants(A, C_Cut, tr) 83
5.3 Gen DRules DBlocks(A, C_Cut, tr) 85
5.4 Gen DRClass(A, E, X, C, D) 86
5.5 The Example Information System (re-shown) 86
5.6 The Discernibility Matrix Modulo Decision d (re-shown) 87
5.7 The Discernibility Matrices over the Lattice 89
5.8 The Rules, Sorted by Membership Degree 90
5.9 The Lattice and Computed Rules 91
6.1 The Penguins Example 100
7.1 An information system with maximal breadth 107
7.2 An information system with full traversal 107
9.1 Eighty objects from Ex. ?? 138

9.2 The Discernibility Matrix for the (Flattened) Example 140
9.3 Compute FD(A, A_Cut) 147
10.1 Attribute Information for Australian Credit 158
10.2 The Results of the Best Decision Algorithms on the Test Table 159
10.3 The Australian Credit Domain 161
10.4 Attribute Information for Hepatitis 163
10.5 Sample Objects from Hepatitis 164
10.6 The same Objects, Unknowns Corrected, Table Scaled and Coded 164
10.7 The Hepatitis Domain 167
10.8 The Heart-Disease Domain 170

Acknowledgements

First of all, I thank my supervisor, Professor Jan Komorowski, NTNU, and Professor Andrzej Skowron, Warsaw University, who started me up on the topic of Data Mining. Professor Skowron hosted two visits to Poland for me, and the spark of the work was ignited during long and interesting discussions with him. Thanks also go to Anna Gomolinska and Pjotr Synak for their constructive and encouraging comments. I also thank the current and previous members of the Knowledge Systems group: Tore Amble, Mihhail Matskin, Agata Wrzos-Kaminska, Jacek Wrzos-Kaminski, Staal Vinterbo, Aleksander Øhrn, Vidar Sørhus, Danilo Montesi and Ahmed Guessoum. Our secretary Lisbeth Waagan also deserves credit for her kind help over the course of these years. Thanks also to Leszek Polkowski, with whom I had some interesting conversations at the early stages of this work.

Jon Petter Hjulstad implemented the algorithm in his project and diploma work, which was continued by the students Øyvind Tuseth Aasheim and Helge Grenager Solheim. The work of these three has been of great value to the final result. Thanks also go to Vidar Larsen and Anders Christensen, who struggled with UNIX for me.

Many thanks to my family, my parents Kari and Ulf, my sister Ingrid and her son Eirik, for believing in me when I did not. Finally, I would like to thank all my friends for numerous cups of coffee, long discussions and a lot of joy. A special thanks to the people of Kameleonteateret. Impro has given me great pleasure and inspiration, and I will continue.

During the last part of my study I have been sponsored by a scholarship from the Institute of Computer Systems, Norwegian University of Science and Technology. My trips to Warsaw were hosted by the Department of Mathematics, University of Warsaw, under support from the National Research Committee (KBN) 8T11C01011. This research was also supported in part by the ESPRIT BRA COMPUNET/NFR contract # 469.92/011 and by the Human Capital and Mobility NFR contract # 101341/410.

Til minne om m


With a squeaking sound the old robot approaches the small, just born computer. - Isn't he lovely? says his mother. - Yes, says the robot. - By the way, have you heard this: "Is it true that we descend from the humans?" said the little VW. The robot laughs, a creaky, old laugh. The mother laughs too. The son is the only one silent. He doesn't get the point. But he hides the information in his brain, just in case.

T. A. Bringsværd, from The City in the Bubble (1967)


Part I
Preliminaries

No man's knowledge here can go beyond his experience. (J. Locke)

Chapter 1

Introduction

1.1 Living is Learning

In order to support life, every individual, human being or animal, is fundamentally dependent on its ability to adapt to the environment. It has to be able to collect information from its surroundings and to react in a manner suited to the situation. The individual also has to be able to handle changes that occur in that environment, possibly by altering its behaviour. Ultimately, the understanding of what constitutes rational behaviour in human beings has to be seen in light of their previous experiences, which have themselves been obtained through interaction with the environment. Previously experienced situations that are in some sense similar to the current one play an important role as references when choosing our actions. Hence, through our interaction with the world, we learn by updating our reference knowledge to include new experiences. In this light, learning is a continuous process, calibrated by our actions in diverse situations.

1.2 Learning is Observing

We may give the simplistic, yet intuitive, meaning to the word "learning" as the incitation of ideas. According to the philosopher John Locke, all ideas ultimately stem from experience and take two forms: the ideas received through our senses, such as sight and hearing, and the ideas reached from other ideas through reflection, i.e. the functions of conscious thinking. As an illustration of his views, Locke put forward the following riddle: A man who is born blind can distinguish a ball from a cube by feeling the objects. If he suddenly gained sight, would he then be able to distinguish the two objects without touching them?

This little anecdote is interesting from the following perspective: in the philosophy of this thesis, the man's knowledge is defined in terms of his ability to discern the two objects through the use of his hands. He is, however, (obviously) not able to see the objects. Therefore, he cannot make any explicit experiences with respect to the connection between the feel of an object and the way it looks. If he suddenly gains sight, sees the objects, but is unable to touch them, will he be able to derive the connection for himself? The question is highly debatable, but it illustrates the fundamental dependency between knowledge on the one hand and the utilitarian view of it, i.e. the use of the knowledge, on the other.

1.3 Observing is Discerning

Our ability to act intelligently in an environment relies fundamentally on our ability to discern. Only if we are able to distinguish different objects and events in the world are we able to react in a proper manner, to classify them, again according to their distinguishing features. In this light, the study of the relationship between certain properties of an object and its classification with respect to other properties is interesting. Here, the word "object" may mean anything that one can think of, i.e. physical objects, abstract concepts/ideas or situations. One main contention of this work is that observing is fundamentally dependent on the ability to distinguish objects and, correspondingly, the ability to see similarities between objects. Interesting information can be obtained only in the presence of contrast.

The study of objects is governed by perception. Normally, we do not distinguish between two small rocks lying in the road, nor the birds that we see off on their way to Africa. Looking at them, we rest assured that they are birds; the fact that they may be of the same (or different) species, or that they are indeed individuals with a number of differences between them, escapes us; it is not considered relevant. We are, however, able to discern them from all other objects that are not birds.

During a phase of information collection, observations are made by taking a number of "atomic" measurements that are used to characterise the objects of interest. The importance of such an atomic piece of information may be judged by its ability to discern a number of objects. In other words, we wish to capture the most general distinguishing properties of objects. In this light, there is a very natural preference for gathering the most important information first, but the cost of collecting the information may also be taken into account. For instance, a medical doctor has a repertoire of very simple tests that may be performed easily and that give important information. He does not, when introduced to a new and completely unknown patient, immediately set out performing X-ray tests, extensive cardiology or brain scans. Evidently, many such tests would be futile in that they would not provide any interesting information at all, simply not being relevant to the patient's actual medical condition. The patient might, for instance, have but a slight cold, a fact which would have been revealed by a couple of very simple and cheap tests.

In all, the ability to distinguish between objects is a core aspect of learning and of intelligent behaviour in general. Moreover, distinction is made at a level which is pragmatically suitable to the situation. Hence, the problem of finding the properties of objects that are potentially most helpful for discerning at the desired level is very important. If a person flouts this kind of accepted code, going intensely into the minute details of things without apparent cause, we regard his behaviour as a symptom of obsession, if not insanity.

In this thesis we investigate exactly the phenomenon of discernment, i.e. the selection of certain (small) pieces of information that will allow a conclusion to be drawn. These pieces of information will typically have the property that they are descriptive of the objects at the particular level that we are interested in. If we classify objects according to incomplete information, it should be evident that the conclusions reached may not be correct in the logical sense (i.e. logically valid). However, if they are significantly supported by the knowledge, they may be used, provided that no contradictory information is presented. Again, a medical doctor may work out an opinion about a patient based only on a minimal number of tests, and base his further analysis on this. His conclusion could, however, eventually turn out to be incorrect; more elaborate (and expensive) tests may reveal this.

1.4 The Information Explosion

Mankind is an active information collector. A continuous search for information and, more importantly, the desire to analyse and understand the information stands out as one of the most fundamental objectives of human behaviour. In this light, understanding involves something more than mere adaptation to the environment. Whereas animals learn in order to be able to cope with the world, human learning extends beyond this; our intelligence indulges in a search for ways to describe the processes that take place around us.

Moreover, mankind is a very active information conveyor. The communication of information depends on the use of language, which allows us to express the properties of objects and the relations that exist between them. It is imperative that the language have a well-defined meaning, about which the communicating parties agree. The use of language allows human beings to pass information on to others in a manner which is unique among living creatures.

Human experience is registered in a multitude of ways, and knowledge is potentially more available to the individual than ever before. Whereas the means for collecting information were traditionally restricted to the use of our natural senses, external devices now cater for a steadily increasing fraction of the total amount of information collected in the world. Electronic equipment, operative in fields as diverse as geology, meteorology, astronomy or medicine, is capable of gathering vast amounts of information, which may in turn be stored at a low cost. The information potentially available today has grown to incredible amounts, and consequently, a steadily increasing fraction of the data has remained unanalysed. The word data is used here to denote primitive, unanalysed pieces of information, whereas the term knowledge denotes the result of a further analysis of the data. Figure 1.1 illustrates the phenomenon. Primitive data is gathered in huge amounts, but the step leading to the extraction of interesting knowledge is not always easy to take. Hence, in practice, only a small fraction of the information collected has so far been analysed or seen by human beings, and far less has been used for any practical purpose. It has been estimated that the amount of information in the world doubles every 20 months. The size and number of databases probably increases even faster. [...]