This presentation is intended to be a brief overview of what educational data mining is (and what it isn't), how it can be used, and what it can tell you. A hypothetical example will be used to illustrate these points, and the presentation will end with a real-life example of how I've been using educational data mining to examine data from an educational video game.

Educational data mining is nothing more than the process of automatically identifying patterns in educational data sets too large to analyze by hand. It is generally used for one of three purposes. The most common purpose is to identify things that co-occur. It finds that people who do A do B most of the time, or that when X occurs Y also often occurs. Association rule mining is used to find these co-occurrences. Association rule mining is often called market basket research, where it is found that people who buy milk also buy eggs most of the time (in the same basket at the grocery store). There is no order information, because no one cares which items in the basket get scanned first. Sequence mining imposes a constraint on association rule mining so that B has to occur within a certain time window after A. This sort of analysis is what companies like Amazon and Netflix use to recommend other books or movies that you might like. Cluster analysis is used to identify naturally occurring groups. Given a whole bunch of people, it will tell you that there is a group of people doing one set of actions and another group of people doing a different set of actions. This type of analysis isn't as popular in marketing but is becoming very popular in medical and biomedical research, where it can be used to analyze gene expression or to find groups of people who react differently to the same medication. Classification rule discovery often goes hand-in-hand with cluster analysis and is used to identify the defining characteristics of previously identified groups. It will tell you that Group A is older and smokes more than Group B and that Group B exercises more than Group A.

Since educational data mining does all these great things, why don't we use it more? The answer is that educational data generally isn't large enough for educational data mining to be appropriate. Educational data mining requires a dataset that is both long (i.e., lots of participants) and wide (i.e., lots of variables). While this sort of data is rarely generated by subject matter tests or classroom observations, educational technological environments such as interactive tutors, online educational environments, and educational video games and simulations often produce such data because they can record every action taken by the students rather than just the answers they give to questions.

This data is more fine-grained and more detailed than standard educational data, because it includes the specific steps taken in the course of solving each problem. It is also very context-dependent, because a given step can be appropriate in one problem and inappropriate in another. Additional detail is added to this data by including time and order information, so it is clear which actions occurred before which other actions as well as how long it took a student to move from one action to another. Most importantly for the purposes of analysis, this data is not prescreened for relevance. Every mouse click can be captured, not just the ones that indicate understanding of the targeted concept (or lack thereof).

Think of it as if it were qualitative data, except that it isn't data you collected yourself. Someone else made the observations. They didn't know what your research questions would be, and they didn't have any research questions of their own, so they wrote down EVERYTHING. Bob stood up. Bob looked left. Bob scratched his ear. Bob yawned. Everything Bob did, in exquisite detail.

For example, given the problem of crossing the street safely, standard educational data would record either a successful street crossing or an unsuccessful street crossing.

An educational video game about crossing the street would record: Person stops at crosswalk, person looks left, person waits, person looks left again, person looks right, person enters crosswalk, person exits crosswalk. It records everything the person did, rather than recording a judgment about that person's performance on the task.

This data also includes important timing and context information, so it records that the person stopped at a crosswalk with no traffic light, two seconds later looked left and saw a car coming, etc.

If we aren't using data mining techniques, what do we do with this data? We do what quantitative people usually do with other people's qualitative data: we count things. In this example we might count how often people wait at crosswalks with no traffic light or how often people look right at a crosswalk. We might even count how often people exit crosswalks with no incident (in the simulation). These counts could then be used to predict success or failure for people crossing the street in real life or to predict their performance on a standardized test on the topic of street crossing. The first two counts could also be used to predict success or failure in the simulation (i.e., the third count).

Association rule mining finds actions or events that frequently co-occur, provided that the event pair accounts for more than a given percentage of the data (say 10%). The association pictured here shows that 84% of people who look left and see a car wait at a crosswalk with no traffic light (the rule's confidence). This pattern of looking and waiting is observed in 12% of all street crossers (the rule's support).
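As a rough illustration of the two percentages described above, the sketch below computes the share of all crossers who show a pattern and the share of crossers with the first event who also show the second. The event names and the five crossings are made up for illustration and are not taken from the presentation's data.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both sides of the rule
    confidence = share of transactions with the antecedent that also
                 contain the consequent
    """
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical data: one set of observed events per street crosser.
crossings = [
    {"looks_left_sees_car", "waits_no_light"},
    {"looks_left_sees_car", "waits_no_light"},
    {"looks_left_sees_car", "enters_crosswalk"},
    {"looks_left_no_car", "enters_crosswalk"},
    {"looks_left_no_car", "enters_crosswalk"},
]

support, confidence = rule_stats(
    crossings, {"looks_left_sees_car"}, {"waits_no_light"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A real association rule miner searches over all candidate rules and keeps those above the chosen thresholds; this sketch only scores one rule that has already been chosen.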

Sequence mining is similar to association rule mining, except that it contains a time element. Sequence mining finds actions or events that frequently co-occur, provided that the second event occurs within a given window of time after the first event (say 60 seconds). The sequence pictured here shows that 74% of people who look left and don't see a car exit the crosswalk with no incident within 30 seconds. This pattern of looking and crossing safely is observed in 48% of all street crossers.
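The time-window constraint can be sketched in a few lines. This is a minimal illustration that scores one candidate sequence, assuming each person's log is a list of (seconds, event) pairs; the event names and timings are hypothetical.

```python
def sequence_stats(logs, first, second, window):
    """Score the pattern 'first, then second within `window` seconds'.

    logs: list of per-person lists of (seconds, event) tuples.
    Returns (support, confidence), each counted once per person.
    """
    n_first = 0
    n_both = 0
    for events in logs:
        first_times = [t for t, e in events if e == first]
        if not first_times:
            continue
        n_first += 1
        second_times = [t for t, e in events if e == second]
        # The second event must come after the first and inside the window.
        if any(0 < t2 - t1 <= window
               for t1 in first_times for t2 in second_times):
            n_both += 1
    support = n_both / len(logs)
    confidence = n_both / n_first if n_first else 0.0
    return support, confidence


# Hypothetical logs for four street crossers.
logs = [
    [(0, "looks_left_no_car"), (12, "exits_no_incident")],
    [(0, "looks_left_no_car"), (45, "exits_no_incident")],   # too slow
    [(0, "looks_left_sees_car"), (20, "exits_no_incident")],  # wrong first event
    [(0, "looks_left_no_car"), (25, "exits_no_incident")],
]

support, confidence = sequence_stats(
    logs, "looks_left_no_car", "exits_no_incident", window=30
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

The only difference from the unordered case is the check on the elapsed time between the two events.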

Cluster analysis looks for groups of associations and separates the actions or events into groups that frequently co-occur. The groups pictured here consist of a group (on the left) that look left and don't see a car, look right and don't see a car, and exit the crosswalk with no incident, and a group (on the right) that enter the crosswalk against the light and have an incident in the crosswalk. The distance between circles indicates their co-occurrence, with circles that are close to each other frequently co-occurring and circles that are far from each other rarely if ever co-occurring. This would indicate that, for example, looking left and not seeing a car rarely co-occurs with having an incident in the crosswalk. Note that, similar to factor analysis, cluster analysis does not come up with names for the different groups. Those names must be determined by the researcher or a content expert. For example, the left group might be named "looks both ways before crossing the street" and the right group might be named "enters crosswalk against the light."

Classification rule discovery starts with previously known groups and identifies which actions or events are indicative of membership in one group rather than the other. This is essentially a series of significance tests, similar to what you might analyze in a MANOVA. Does event 1 occur significantly more often in one group than the other(s)? Does event 2 occur significantly more often in one group than the other(s)? Etc. In this example, Group 1 looks left and doesn't see a car and looks right and doesn't see a car more often than Group 2. Group 2 enters against the light more often than Group 1.
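One way to picture the "series of significance tests" is to compare, event by event, how often each group shows the event. The presentation does not say which test is used, so the sketch below assumes a two-sided two-proportion z-test as a stand-in; the group sizes and counts are hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Assumes the pooled proportion is strictly
    between 0 and 1, otherwise the standard error would be zero.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: members of each group who showed each event.
group_sizes = {"group_1": 200, "group_2": 150}
event_counts = {
    "looks_left_no_car": {"group_1": 160, "group_2": 60},
    "enters_against_light": {"group_1": 10, "group_2": 90},
    "stops_at_crosswalk": {"group_1": 120, "group_2": 90},
}

results = {}
for event, counts in event_counts.items():
    results[event] = two_proportion_z_test(
        counts["group_1"], group_sizes["group_1"],
        counts["group_2"], group_sizes["group_2"],
    )
    z, p = results[event]
    print(f"{event}: z={z:.2f} p={p:.4f}")
```

With these made-up counts, the first two events separate the groups and the third does not, which is the kind of result that would be used to characterize each group.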

While at first glance none of the data mining techniques seem to be much more advanced than counting, they provide a number of practical advantages. This is mostly because data mining operates over datasets that include hundreds or thousands (or millions) of unique actions. If you counted up each of these unique actions and then tested each one to see if it affected your outcome of interest, you would be fishing, and your results would be suspect because running hundreds or thousands of significance tests at a five percent error rate (p ≤ .05) would produce false positives for roughly one in every twenty actions that have no real effect. If you limit those hundreds or thousands of actions to fewer than twenty, how do you know which ones to count? How do you know you haven't left out an important mediator? Finally, how do you identify slight variations in the actions? Is waiting at a crosswalk with no traffic light different from waiting at a crosswalk with a traffic light? Is waiting at a crosswalk with no traffic light during the day different from waiting at a crosswalk with no traffic light at night?
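The "one in every twenty" figure can be checked with a little arithmetic. The sketch below assumes the tests are independent and that none of the tested actions has a real effect, which is a simplification; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when no tested action has a real effect."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (20, 100, 1000):
    print(
        n,
        expected_false_positives(n),
        round(chance_of_any_false_positive(n), 4),
    )
```

Even at twenty tests, the chance of at least one spurious "significant" action is already well above one half under these assumptions.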

Most importantly, you can't count what you don't know to look for. If you are only counting actions or patterns of actions that you have decided a priori are important, then you won't find this type of street crosser. If you don't know that some of your street crossers are ducks, it can bias your results. Ducks (or geese) are fairly poor street crossers. They don't look both ways, they take a really long time to get to the other side, and they don't usually use crosswalks. But they usually reach the other side without incident. Most data mining results will be things you already knew about (e.g., identifying a group of street crossers that look both ways before they cross the street). In fact, if data mining doesn't identify the things you already know about, then either you were wrong and there is no such thing as a street crosser who looks both ways or something is wrong with the data mining algorithm. The gold is in the unexpected results. What are the ducks in your data?

While educational data mining is a pretty fantastic tool for data discovery, it does not do the things most statistical techniques can do. It does not do hypothesis testing. There is no way to look for a specific sequence or group and determine whether or not it occurs more often than chance. There is also no way to determine whether or not the sequences or groups that were identified will exist in your next set of data. Data mining techniques have no methods of determining the reliability of your results. Nor can they be used directly for prediction. For example, data mining techniques themselves do not provide any way to determine the likelihood of a specific group performing better than other groups either in the simulation or in real life. All of which means that data mining is unlikely to answer the question you are really interested in. However, data mining provides you with the variables that will allow you to answer that question. The real power of data mining comes when the results are used as variables in standard statistical techniques such as regression, hierarchical linear models, IRT, or Bayesian networks.

So when should you use educational data mining? As is true with every other statistical technique, you should use it when you have the right kind of data and the right kind of question. You should use it when you have a lot of very detailed data and when you want to know what people are doing in a given situation or you are interested in the different ways people are going about solving a given problem.

I have been using data mining techniques to analyze log data from an educational video game called Save Patch that is intended to teach students about fractions. In this game students break whole unit ropes (on the left-hand side of the screenshot, under "Path Options") into fractional pieces using the up and down arrows next to each rope. Students then place the ropes on each sign in the grid to guide the puppet safely to the cage. The level in the screenshot represents two units (indicated by gray posts at intersections) broken into thirds (indicated by the red posts). To solve this level correctly students must place 3/3 (or 1/1) on the first sign, 1/3 on the second sign, 1/3 on the third sign, and hit the Go button at the bottom of the page. The most recent dataset for this game consists of 859 students, and each student played the game for approximately two hours.

The log data from Save Patch looks like this in its original form. Each row in the data represents a specific action. Each action includes detailed information about what the action was (e.g., the Data_01 to Data_03 columns) as well as valuable context information (e.g., the ID and Game Time columns). The 859 students in this study generated 1,208,133 rows of log data.

Each of these 1.2 million rows was translated into a mnemonic that combined all detailed information about the action into a single tag. For example, the first mnemonic indicates that a student scrolled a whole unit rope (1o1) to thirds (3o3). The second mnemonic indicates that the student selected a one-third rope.

The data was then transformed into a matrix, with a row for each attempt each student makes at each level in the game and a column for each unique mnemonic in the data. A 1 under a given mnemonic indicates that that action was performed in that attempt at that level by that student. A 0 indicates that the action was not performed in that attempt. The mnemonics in this example have been renamed M1, M2, etc. so that I could show a number of them on the same slide. They are not numbered in the actual data. This process took the 1.2 million rows of log data and transformed them into a sparse binary matrix that is 55,039 rows long and 17,685 columns wide.
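The transformation described above can be sketched as follows. This is a minimal illustration, assuming each log row has already been reduced to a (student, level, attempt, mnemonic) tuple; the student IDs and rows are made up, and a dense list of lists is used for readability where data of the real size would call for a sparse structure.

```python
def build_binary_matrix(log_rows):
    """Turn (student, level, attempt, mnemonic) rows into a 0/1 matrix.

    Returns (row_keys, mnemonics, matrix), where matrix[i][j] is 1 if
    mnemonic j occurred at least once in attempt i, else 0.
    """
    log_rows = list(log_rows)
    row_keys = sorted({(s, l, a) for s, l, a, _ in log_rows})
    mnemonics = sorted({m for _, _, _, m in log_rows})
    row_index = {key: i for i, key in enumerate(row_keys)}
    col_index = {m: j for j, m in enumerate(mnemonics)}
    matrix = [[0] * len(mnemonics) for _ in row_keys]
    for s, l, a, m in log_rows:
        matrix[row_index[(s, l, a)]][col_index[m]] = 1
    return row_keys, mnemonics, matrix


# Hypothetical log rows, using the renamed mnemonics M1, M2, M3.
log_rows = [
    ("s01", 4, 1, "M1"),
    ("s01", 4, 1, "M2"),
    ("s01", 4, 1, "M1"),  # a repeated action still yields a single 1
    ("s01", 4, 2, "M3"),
    ("s02", 4, 1, "M2"),
]

row_keys, mnemonics, matrix = build_binary_matrix(log_rows)
for key, row in zip(row_keys, matrix):
    print(key, row)
```

Because each cell records only whether an action occurred, counts and ordering within an attempt are dropped at this stage.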

Cluster analysis takes this matrix and calculates each mnemonic's distance from each other mnemonic and plots them in n-dimensional space (where n is the number of mnemonics). Using the fanny algorithm in R on a given dataset (data) to find a given number of clusters (4) would take the image on the left and identify four groups of points (the image on the right). The fanny algorithm cannot tell you how many clusters are in your data. You have to run the algorithm starting with two clusters and then increase the number one cluster at a time until either the algorithm will not produce the desired number of clusters or it starts breaking up mnemonics into artificially small pieces (e.g., breaking the group of three mnemonics on the left side of the screen into three separate clusters). In the Save Patch data, each cluster of mnemonics corresponded to a specific strategy that was being used to solve the problem.
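The distance step can be illustrated on a tiny binary matrix. The presentation does not say which distance measure was used, so the sketch below assumes Jaccard distance between mnemonic columns; it is not a reimplementation of fanny, only of the pairwise distances a clustering algorithm would start from. The matrix is made up.

```python
def jaccard_distance(col_a, col_b):
    """1 - (attempts with both actions) / (attempts with either action)."""
    both = sum(1 for a, b in zip(col_a, col_b) if a and b)
    either = sum(1 for a, b in zip(col_a, col_b) if a or b)
    return 1 - both / either if either else 0.0


def column_distances(matrix):
    """Pairwise distances between the columns (mnemonics) of a 0/1 matrix."""
    columns = list(zip(*matrix))
    n = len(columns)
    return [
        [jaccard_distance(columns[i], columns[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical matrix: rows are attempts, columns are mnemonics M1..M3.
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]

distances = column_distances(matrix)
for row in distances:
    print([round(d, 2) for d in row])
```

Mnemonics that tend to appear in the same attempts end up close together, and mnemonics that never appear together end up at the maximum distance of 1.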

One of the most common incorrect mathematical strategies identified by the cluster analysis was a Unitizing Error. Students who make Unitizing Errors assume that the whole representation is one unit across, regardless of the number of units represented in the image. This strategy leads students to place 3/6 on the first sign (rather than 3/3), and 1/6 on the next two signs (rather than 1/3). This cluster was identified as a Unitizing Error because the only logical reason why a student would attempt to solve the level in sixths would be if they thought the entire representation was one unit across. Had there been multiple explanations for a student solving the level in sixths, we would not have been able to name this cluster.

Another common incorrect mathematical strategy identified by the cluster analysis was a Partitioning Error. Students who make Partitioning Errors don't know that the denominator of a fraction is the number of pieces it is cut into. Rather, they think that the denominator of a fraction is the number of cuts (or marks) that break up the whole unit. These students count the number of dividing marks (red posts) between unit marks to determine the denominator rather than counting the number of spaces. This strategy leads students to place 3/2 on the first sign (rather than 3/3), and 1/2 on the next two signs (rather than 1/3).

Once the different strategies students were using to solve the fractions problems in the game were identified, the data was transformed into a much smaller dataset that still included all 55,039 rows but reduced the 17,685 columns to a single Strategy column. Time and order information was retained so that the data could be used in sequence mining.

Sequence mining in R using the cspade algorithm took this data (data) and found every sequence of strategies where the second strategy immediately followed the first (maxwin = 1) and at least one percent of students performed that sequence (support = 0.01). This resulted in a list of frequent sequences and the percent of students who performed each sequence. Note that the percentages do not have to add up to 100% because some students performed more than two (or three, or four) strategies before they figured out how to solve a given level. For example, in the level represented in the slide, 23.27% of all students made a game error {G} (e.g., going the wrong direction from a sign) and then solved the level correctly {S} on their next attempt at the level. On the other hand, 4.11% of students made partitioning errors {P} four or more times in a row.
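The idea of counting strategies that immediately follow one another, and keeping only those shared by a minimum share of students, can be sketched as below. This is not the cspade algorithm: it handles only two-step sequences and is meant to show the counting logic. The codes G, P, and S follow the text; the code U, the student IDs, the sequences, and the threshold are hypothetical.

```python
from collections import Counter


def frequent_adjacent_pairs(strategy_sequences, min_support):
    """Find two-step sequences where the second strategy directly follows the first.

    strategy_sequences: dict mapping student -> ordered list of strategy codes.
    Returns {(first, second): share_of_students} for pairs whose share is
    at least min_support. Each student counts once per pair.
    """
    counts = Counter()
    for sequence in strategy_sequences.values():
        counts.update(set(zip(sequence, sequence[1:])))
    n_students = len(strategy_sequences)
    return {
        pair: count / n_students
        for pair, count in counts.items()
        if count / n_students >= min_support
    }


# Hypothetical per-student strategy sequences for one level.
sequences = {
    "s01": ["G", "S"],
    "s02": ["P", "P", "S"],
    "s03": ["G", "S"],
    "s04": ["U", "P", "S"],
}

frequent = frequent_adjacent_pairs(sequences, min_support=0.5)
for pair, share in sorted(frequent.items()):
    print(pair, f"{share:.0%}")
```

The shares do not sum to one here either, because a student with a longer sequence contributes to more than one pair.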

To make this data easier to read, the results were represented as graphs. These graphs show the frequent sequences ending in an {S} in level 4 (Level 4: Paths to Solutions) and the sequences that do not end in an {S} in level 4 (Level 4: Paths Between Errors). These graphs indicate, for example, that 44.54% of students made a partitioning error while trying to solve the level, but that 26.09% of students (just over half of those who made partitioning errors) solved the level immediately after making a partitioning error. We are currently working on operationalizing these sequences into variables that can be used to predict either in-game performance or performance on a paper-and-pencil fractions posttest.

Thank you for your interest. If you would like to know more about our work, we have a number of CRESST reports available online at http://www.cse.ucla.edu/products/reports.php, including:

Using Cluster Analysis to Extend Usability Testing to Instructional Content
http://www.cse.ucla.edu/products/reports/r816.pdf

A Primer on Data Logging to Support Extraction of Meaningful Information from Educational Games: An Example from Save Patch
http://www.cse.ucla.edu/products/reports/r814.pdf

The Feasibility of Using Cluster Analysis to Examine Log Data from Educational Video Games
http://www.cse.ucla.edu/products/reports/r790.pdf

We also have an article in the Journal of Educational Data Mining titled Identifying Key Features of Student Performance in Educational Video Games and Simulations through Cluster Analysis
http://www.educationaldatamining.org/JEDM/images/articles/vol4/issue1/KerrEtAlVol4Issue1P111_152.pdf