Comparing Regression Algorithms for Predicting Students' Marks in Hellenic Open University

S.B. Kotsiantis, P.E. Pintelas
Department of Mathematics, University of Patras
Educational Software Development Laboratory
Hellas
{sotos, pintelas}@math.upatras.gr

4th ETPE Conference "ICT in Education", 29/09-03/10/2004, University of Athens

ABSTRACT

The ability to provide assistance for a student at the appropriate level is invaluable in the learning process. Not only does it aid the student's learning process, but it also prevents problems such as student frustration and floundering. Students' key demographic characteristics and their marks in a small number of written assignments can constitute the training set for a regression method in order to predict the student's performance. This work compares some of the state-of-the-art regression algorithms in the application domain of predicting students' marks. A number of experiments have been conducted with six algorithms, which were trained using datasets provided by the Hellenic Open University. Finally, a prototype version of a software support tool for tutors has been constructed implementing the M5rules algorithm, which proved to be the most appropriate among the tested regression algorithms.

KEYWORDS: student performance, machine learning, data mining.

INTRODUCTION

The application of machine learning techniques to predicting students' performance has proved helpful for identifying poor performers. It can enable tutors to take remedial measures at an earlier stage, even from the very beginning of an academic year using only students' demographic data, in order to provide additional help to the groups at risk (Kotsiantis et al., 2004). The accuracy of the diagnosis of students' performance increases as new curriculum data is entered during the academic year, offering the tutors more effective results. Kotsiantis et al. (2004) showed that the most accurate machine learning algorithm for identifying predicted poor performers is the Naïve Bayes classifier. However, that work could only predict whether a student passes a course module or not.

This paper uses existing regression techniques in order to predict the students' marks in a distance learning system. It compares some of the state-of-the-art regression algorithms to find out which algorithm is more appropriate not only to predict students' performance accurately but also to be used as an educational supporting tool for tutors. For the purpose of our study the informatics course of the Hellenic Open University (HOU) provided the data set.

Generally, the use of regression analysis to classify data can be an extremely useful tool for researchers and Open University administrators. A plethora of data can be utilized simultaneously to classify cases, and the resultant model can be evaluated for usefulness relatively easily. The ability to develop a predictive model based on the model produced through the regression analysis procedure increases its usefulness substantially. Open Universities can utilize this dynamic and powerful procedure to target services and interventions to the students who need them most, thereby utilizing their resources more effectively.

The following section describes in brief the Hellenic Open University (HOU) distance learning methodology and the data of our study. Some very basic definitions about regression techniques are given in section 3. Section 4 presents the experiment results for all the tested algorithms and at the same time compares these results. Section 5 presents the produced educational decision support tool. Finally, section 6 discusses the conclusions and some future research directions.

HELLENIC OPEN UNIVERSITY AND DATA DESCRIPTION

The mission of the Hellenic Open University (HOU) is to offer university-level education using the distance learning methodology. The basic educational unit of the HOU is the course module (referred to simply as module from now on), which covers a specific subject at graduate and postgraduate level. A module is equivalent to three semester academic lessons of Hellenic Universities, while a student may register for up to three modules per year. The informatics course of HOU is composed of 12 modules and leads to a Bachelor Degree. For the purpose of our study the informatics course provided the training set. A total of 354 instances (students' records) have been collected from the module "Introduction to Informatics" (INF10) (Xenos et al., 2002).

Regarding the INF10 module of HOU, during an academic year students have to hand in 4 written assignments, optionally participate in 4 face-to-face meetings with their tutor and sit for final examinations after an 11-month period. A student with a mark >=5 passes a lesson or a module, while a student with a mark <5 fails to complete a lesson or a module. Generally, a student must submit at least three assignments (out of 4). Subsequently, the tutors evaluate these assignments, and a total mark greater than or equal to 20 should be obtained in order for the student to successfully complete the INF10 module.
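The assignment criteria just described can be written down as a small check. The sketch below is only an illustration under stated assumptions: the function names, the use of `None` for an assignment that was not handed in, and the reading that the 20-mark threshold applies to the sum of the submitted assignments are hypothetical choices, not taken from HOU regulations.

```python
from typing import Optional, Sequence

PASS_MARK = 5        # a mark >= 5 passes a lesson or module
MIN_SUBMITTED = 3    # at least three of the four written assignments
MIN_TOTAL = 20       # required total over the written assignments


def meets_assignment_criteria(assignments: Sequence[Optional[float]]) -> bool:
    """Check the assignment criteria stated in the text.

    `assignments` holds one entry per written assignment; None marks an
    assignment that was not handed in (hypothetical encoding).
    """
    submitted = [mark for mark in assignments if mark is not None]
    return len(submitted) >= MIN_SUBMITTED and sum(submitted) >= MIN_TOTAL


def passes(final_mark: float) -> bool:
    """A final mark of 5 or more on the 0-10 scale is a pass."""
    return final_mark >= PASS_MARK
```

For instance, `meets_assignment_criteria([7, 6, None, 8])` holds (three submissions, total 21), whereas two perfect assignments alone would not satisfy the submission count.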
Students who meet the above criteria may sit the final examination test.

The attributes (features) of our dataset are presented in Table 1 along with the values of every attribute. The set of attributes was divided into 3 groups: the Registry Class, the Tutor Class and the Classroom Class. The Registry Class represents attributes which were collected from the Student's Registry of the HOU concerning students' sex, age, marital status, number of children and occupation. In addition to the above attributes, previous post-high-school education in the field of informatics and the association between students' jobs and computer knowledge were also taken into account. If a student has attended at least one seminar (of 100 hours or more) on informatics after high school, then he/she qualifies as "yes" in computer literacy. Moreover, students who use software packages (such as a word processor) at their job without having any deep knowledge of informatics were considered junior users, while students who work as programmers or in data processing departments were considered senior users. The remaining students' jobs were listed as "no" concerning association with computers. The Tutor Class represents attributes which were collected from tutors' records concerning students' marks on the written assignments and their presence or absence in face-to-face meetings. Finally, the class attribute represents the result of the final examination test.

According to the data collected in the framework of this research, the students' age follows a normal distribution with an average value of 31.1 years (±5.1). It must be noted that no students under the age of 24 years can be accepted according to the regulations of HOU, since it is considered that such students could easily attend conventional Hellenic Universities. The analysis of the demographic attributes showed that the ratio of men who passed the exams vs. men who failed is 48% to 52%, while for women this ratio drops to 39% to 61%. Moreover, it should be noted that the percentage of students below 32 years old who pass the exams is 46%, while the corresponding number for older students is 44%. Another interesting fact relates student performance to marital status: a married student is about as likely to pass the exams as to fail (51%), while a single student has only a 41% probability of passing the module. A similar situation holds for the existence of children: a student with children has a 52% probability of passing the module, while a student without children has only 43%. This is probably due to the fact that family obligations are known and have been taken into consideration prior to the commencement of the studies. It must also be mentioned that the workload splits the probabilities right down the middle.

Attribute                          Values
Student's Registry (demographic) attributes
  Sex                              male, female
  Age                              24-46
  Marital status                   single, married, divorced, widowed
  Number of children               none, one, two or more
  Occupation                       no, part-time, full-time
  Computer literacy                no, yes
  Job associated with computers    no, junior-user, senior-user
Attributes from tutors' records
  1st face-to-face meeting         absent, present
  1st written assignment           no, 0-10
  2nd face-to-face meeting         absent, present
  2nd written assignment           no, 0-10
  3rd face-to-face meeting         absent, present
  3rd written assignment           no, 0-10
  4th face-to-face meeting         absent, present
  4th written assignment           no, 0-10
Class
  Final examination test           0-10

Table 1. The attributes used and their values

On the contrary, as far as the demographic attributes are concerned, a stronger correlation exists between student performance and the existence of previous education in the field of informatics. The ratio of students who have previous education in the field of informatics and pass the exams vs. those who fail is 51% to 49%, while for the remaining students this ratio drops to 28% to 72%. A similar correlation exists for involvement in professional activities demanding the use of a computer. The students who use a computer in their job have a 52% probability of passing the module, while the remaining students have only 32%.

Until now, we have described how each demographic attribute influences the prediction based on our dataset. In order to show in which direction (pass or fail) the values of each of the remaining attributes push the induction, some practical probabilities are estimated in Table 2.
The interpretation of Table 2 is straightforward: it shows, for example, that a student with a mark of more than 6 in WRI-4 is about 4 times more likely to pass than to fail (0.65/0.17 ≈ 3.8).

Attribute   Value         Pass   Fail
WRI-4       Mark<3        0.04   0.68
            3<=Mark<=6    0.31   0.15
            Mark>6        0.65   0.17
WRI-3       Mark<3        0.03   0.61
            3<=Mark<=6    0.21   0.2
            Mark>6        0.66   0.19
WRI-2       Mark<3        0.08   0.52
            3<=Mark<=6    0.15   0.26
            Mark>6        0.77   0.22
FTOF-4      Absent        0.23   0.76
            Present       0.77   0.24
FTOF-3      Absent        0.2    0.65
            Present       0.8    0.35
WRI-1       Mark<3        0.02   0.19
            3<=Mark<=6    0.14   0.35
            Mark>6        0.84   0.46
FTOF-2      Absent        0.22   0.54
            Present       0.78   0.46

Table 2.

Subsequently, in an attempt to show how much each attribute influences the induction, we rank the influence of each one according to a statistical measure, RRELIEF (Sikonja and Kononenko, 1997). The key idea of the RRELIEF algorithm is to estimate the quality of attributes according to how well their values distinguish between instances that are near to each other. In regression problems the predicted value (class) is continuous, therefore the (nearest) hits and misses cannot be used. Instead of requiring exact knowledge of whether two instances belong to the same class or not, we can introduce a kind of probability that the predicted values of two instances are different. This probability can be modeled with the relative distance between the predicted (class) values of the two instances.

The average RRELIEF score of each attribute according to our dataset is presented in Table 3. The larger the RRELIEF score, the greater the influence of the attribute on the induction.

Attribute                        RRELIEF
W_ASS-4                          0.11799
W_ASS-3                          0.09263
W_ASS-2                          0.03932
sex                              0.01266
F_MEET1                          0.0104
F_MEET4                          0.00989
children                         0.00307
Job associated with computers    0.00102
domestic                        -0.00563
F_MEET3                         -0.00601
occupation                      -0.00903
F_MEET2                         -0.00931
Computer Knowledge              -0.01091
age                             -0.01416
W_ASS-1                         -0.03098

Table 3. The average RRELIEF score of each attribute

Thus, the demographic attributes that most influence the induction are sex and children. In addition, it was found that the 1st written assignment does not have a large influence. The reason is that almost all students try harder with the first written assignment, thus making the information offered by this attribute minimal and possibly confusing.
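The RRELIEF idea described above, replacing hits and misses by the relative distance between target values of neighbouring instances, can be sketched in a few lines. This is a simplified illustration and not the implementation used to produce Table 3. It assumes numeric attributes only, uses every instance once as the reference instance, gives each of the k nearest neighbours the same influence 1/k, and assumes the targets are not all identical. The function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def rrelief_weights(X: Sequence[Sequence[float]],
                    y: Sequence[float],
                    k: int = 3) -> List[float]:
    """Simplified RRELIEF-style quality estimate for numeric attributes."""
    n, n_attrs = len(X), len(X[0])

    def value_range(values):
        span = max(values) - min(values)
        return span if span > 0 else 1.0   # constant column: diff is always 0

    attr_range = [value_range([row[a] for row in X]) for a in range(n_attrs)]
    y_range = value_range(y)

    def diff_attr(a, i, j):
        return abs(X[i][a] - X[j][a]) / attr_range[a]

    def diff_target(i, j):
        return abs(y[i] - y[j]) / y_range

    n_dc = 0.0                   # weight of "targets differ"
    n_da = [0.0] * n_attrs       # weight of "attribute values differ"
    n_dc_da = [0.0] * n_attrs    # weight of "both differ"

    for i in range(n):
        # Nearest neighbours in attribute space (normalised Manhattan distance).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum(diff_attr(a, i, j) for a in range(n_attrs)))
        for j in others[:k]:
            d_y = diff_target(i, j)
            n_dc += d_y / k
            for a in range(n_attrs):
                d_a = diff_attr(a, i, j)
                n_da[a] += d_a / k
                n_dc_da[a] += d_y * d_a / k

    # High when attribute differences coincide with target differences,
    # penalised when the attribute differs although the target does not.
    return [n_dc_da[a] / n_dc - (n_da[a] - n_dc_da[a]) / (n - n_dc)
            for a in range(n_attrs)]


# Toy data (hypothetical): attribute 0 equals the target, attribute 1 is
# unrelated to it, attribute 2 is constant.
y = [1.0, 2.0, 4.0, 7.0, 8.0, 10.0, 3.0, 5.5]
X = [[v, float((i * 3) % 8), 1.0] for i, v in enumerate(y)]
weights = rrelief_weights(X, y, k=3)
```

In this toy setting the attribute that mirrors the target receives a positive weight and the constant attribute receives exactly zero, which matches the reading of Table 3 that larger scores mean more influence.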

REGRESSION ISSUES

The problem of regression consists in obtaining a functional model that relates the value of a continuous target variable y to the values of variables x_1, x_2, ..., x_k (the predictors). This model is obtained using samples of the unknown regression function. These samples describe different mappings between the predictor and the target variables. For the purpose of our comparison, six of the most common regression techniques are used, namely model trees (Wang & Witten, 1997), M5rules (Witten & Frank, 2000), neural networks (Mitchell, 1997), linear regression (Fox, 1997), locally weighted linear regression (Atkeson et al., 1997) and support vector machines (Shevade et al., 2000). In the following we briefly describe these regression techniques.

Linear regression is the simplest statistical technique used to find the best-fitting linear relationship between the class and its predictors (other features):

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$$

The values of $\beta$ are chosen to minimize Q over the n training instances:

$$Q = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}) \right)^2$$

Note that nominal features with m values are converted into m-1 binary features, and a Wald test is used to test the statistical significance of each coefficient ($\beta_i$) in the model (Fox, 1997). A standard linear regression method may employ an attribute deletion strategy, which simplifies the prediction task.

Model trees are the counterpart of decision trees for regression tasks. As in a decision tree, an instance is sorted from the root node down to a leaf according to its attribute values. The most well-known model tree inducer is M5 (Wang & Witten, 1997). A model tree is generated in two stages. The first builds an ordinary decision tree, using as splitting criterion the minimization of the intra-subset variation of the target value (Witten & Frank, 2000). The second prunes this tree back by replacing subtrees with linear regression functions wherever this seems appropriate.
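The first stage of model tree construction can be illustrated with the standard deviation reduction (SDR) that M5 uses to score candidate splits: a split is good when the target values inside each resulting subset vary little. The sketch below is a minimal illustration, not the M5 implementation used in the experiments. The function names and the toy marks are hypothetical, and only a single numeric attribute is considered.

```python
from statistics import pstdev
from typing import Sequence, Tuple


def sdr(parent: Sequence[float], subsets: Sequence[Sequence[float]]) -> float:
    """Standard deviation reduction obtained by splitting `parent` into `subsets`."""
    weighted = sum(len(s) / len(parent) * pstdev(s) for s in subsets if s)
    return pstdev(parent) - weighted


def best_threshold(values: Sequence[float],
                   targets: Sequence[float]) -> Tuple[float, float]:
    """Try the midpoint between each pair of consecutive distinct attribute
    values and return (threshold, sdr) for the split with the largest SDR."""
    pairs = sorted(zip(values, targets))
    ys = [t for _, t in pairs]
    best = (float("nan"), float("-inf"))
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate equal attribute values
        score = sdr(ys, [ys[:i], ys[i:]])
        if score > best[1]:
            best = ((low + high) / 2, score)
    return best


# Hypothetical data: mark in one written assignment vs. final exam mark.
assignment = [2, 3, 4, 7, 8, 9]
final_mark = [1, 1, 1, 5, 5, 5]
threshold, reduction = best_threshold(assignment, final_mark)
```

Here the best threshold falls between the marks 4 and 7, because that split leaves no variation of the final mark inside either subset.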
If this second step is omitted and the prediction at each leaf is taken to be the average target value of the training examples that reach that leaf, then the tree is called a regression tree instead. Although model trees are smaller and more accurate than regression trees, regression trees are more comprehensible (Witten & Frank, 2000).

The M5rules algorithm produces propositional regression rules in IF-THEN format, using routines for generating a decision list from M5 model trees (Witten & Frank, 2000). The algorithm is able to deal with both continuous and nominal variables, and obtains a piecewise linear model of the data.

Artificial Neural Networks (ANNs) are another method of inductive learning, based on computational models of biological neurons and networks of neurons as found in the central nervous system of humans (Mitchell, 1997). A multilayer neural network consists of a large number of units (neurons) joined together in a pattern of connections. Units in a net are usually segregated into three classes: input units, which receive information to be processed; output units, where the results of the processing are found; and units in between, called hidden units. Regression with a neural network takes place in two distinct phases. First, the network is trained on a set of paired data to determine the input-output mapping. The weights of the connections between neurons are then fixed and the network is used to predict the numerical class values of a new set of data.

Locally weighted linear regression (LWR) is a combination of instance-based learning and linear regression (Atkeson et al., 1997). Instead of performing a linear regression on the full, unweighted dataset, it performs a weighted linear regression, weighting the training instances according to their distance to the test instance at hand. This means that a linear regression has to be done for each new test instance, which makes the method computationally quite expensive. However, it also makes it highly flexible, and enables it to approximate non-linear target functions.

The sequential minimal optimization algorithm (SMO) has been shown to be an effective method for training support vector machines (SVMs) on classification tasks defined on sparse data sets (Platt, 1999). SMO differs from most SVM algorithms in that it does not require a quadratic programming solver. Shevade et al. (2000) generalize SMO so that it can handle regression problems. This implementation globally replaces all missing values and transforms nominal attributes into binary ones.

For regression methods there is no single evaluation criterion. Table 4 presents the most well-known ones. Fortunately, it turns out that in most practical situations the best regression method is still the best no matter which error measure is used.

Mean absolute error:
$$\frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n}$$

Root mean squared error:
$$\sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}}$$

Relative absolute error:
$$\frac{|p_1 - a_1| + \dots + |p_n - a_n|}{|a_1 - \bar{a}| + \dots + |a_n - \bar{a}|}$$

Root relative squared error:
$$\sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{(a_1 - \bar{a})^2 + \dots + (a_n - \bar{a})^2}}$$

Table 4. Regressor criteria ($p_i$: predicted values, $a_i$: actual values, $\bar{a} = \frac{1}{n}\sum_i a_i$)

EXPERIMENTS RESULTS

Learning algorithms are useful as a tool for identifying predicted poor performers (Kotsiantis et al., 2003). With the help of machine learning, tutors are in a position to know from the beginning of the module, based only on curriculum-based data, which of the students will complete the module. The accuracy of this prediction reaches 64% in the initial forecasts and exceeds 80% before the middle of the period (Kotsiantis et al., 2004). After the middle of the period, we can use existing regression techniques in order to predict the students' marks.

The experiments took place in two distinct phases. During the first phase (training phase) the algorithms were trained using the data collected from the academic year 2000-1.
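The four error measures of Table 4, which are used to score the predictions in the experiments, can be computed directly from paired lists of predicted and actual marks. The following is a minimal sketch. The function name, the dictionary keys and the sample marks are hypothetical, and the actual values are assumed not to be all identical, since otherwise the relative measures are undefined.

```python
import math
from typing import Dict, Sequence


def regression_errors(predicted: Sequence[float],
                      actual: Sequence[float]) -> Dict[str, float]:
    """Compute the four regressor criteria for paired predictions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Errors of the trivial predictor that always answers the mean.
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": abs_err / n,                  # mean absolute error
        "rmse": math.sqrt(sq_err / n),       # root mean squared error
        "rae": abs_err / abs_dev,            # relative absolute error
        "rrse": math.sqrt(sq_err / sq_dev),  # root relative squared error
    }


# Hypothetical final marks and predictions for four students.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 10.0]
errors = regression_errors(predicted, actual)
```

The two relative measures compare the regressor with a predictor that always returns the mean actual value, so values below 1 indicate an improvement over that baseline.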
The training phase was divided into 5 consecutive steps. The 1st step included the demographic data, the first two face-to-face meetings and written assignments, as well as the resulting class (final mark). The 2nd step additionally included the third face-to-face meeting. The 3rd step additionally included the third written assignment. The 4th step additionally included the fourth face-to-face meeting, and finally the 5th step included all attributes described in Table 1.

Subsequently, ten groups of data for the new academic year (2001-2) were collected from 10 tutors, together with the corresponding data from the HOU registry. Each one of these 10 groups was used to measure the accuracy within these groups (testing phase). The testing phase also took place in 5 steps. During the 1st step, the demographic data as well as the first two face-to-face meetings and written assignments of the new academic year were used to predict the class (final student mark) of each student. This step was repeated 10 times (once for every tutor's data). During the 2nd step these data along with the data from the third face-to-face meeting were used in order to predict the class of each student. This step was also repeated 10 times. During the 3rd step the data of the 2nd step along with the data from the third written assignment were used in order to predict the student class. The remaining steps use data of the new academic year in the same way as described above. These steps are also repeated 10 times.

It must be mentioned that we used the freely available source code by Witten and Frank (2000) for our experiments. We have tried to minimize the effect of any expert bias by not attempting to tune any of the algorithms to the specific data set. Wherever possible, default values of learning parameters were used. This naïve approach results in lower estimates of the true mean absolute error, but it is a bias that affects all the learning algorithms equally.

Table 5 presents the most easily understandable measure, the mean absolute error, of each algorithm for all the testing steps of the experiment.

Step      M5     BP     LR     LWR    SMOreg   M5rules
WRI-2     1.83   2.15   1.89   1.84   1.84     1.83
FTOF-3    1.74   2.08   1.83   1.79   1.78     1.74
WRI-3     1.55   1.79   1.6    1.53   1.56     1.55
FTOF-4    1.54   1.8    1.56   1.5    1.55     1.54
WRI-4     1.23   1.65   1.5    1.4    1.44     1.21

Table 5. Mean absolute error

According to the results, M5rules is the most accurate regression algorithm to be used for the construction of a software support tool. An advantage of M5rules, besides its better performance, is its comprehensibility.

SOFTWARE SUPPORT TOOL

A prototype version of the software support tool has already been constructed and is in use by the tutors. The tool expects the training set as a spreadsheet in CSV (Comma-Separated Values) file format (Figure 1). The tool assumes that the first row of the CSV file contains the names of the attributes.
There is no restriction on the order of the attributes. However, the class attribute must be in the last column. It must be mentioned that the attributes used are not an exhaustive list. An extension can introduce new attributes that were not in the current database but are collectable by tutors and may potentially contribute to the prediction of academic achievement, for example measures of different intellectual abilities, interests, motivation and personality traits of students. Once the database is in a single relation, each attribute is automatically examined to determine its data type (for example, whether it contains numeric or symbolic information). A feature must have the value ? to indicate that no measurement was recorded.

After opening the data set that characterizes the problem for which the user wants a prediction, the tool automatically uses the corresponding attributes for training. After the training of the model, the user is able to see the produced regressor. The tool (Figure 2) can also predict the output of either a single instance or an entire set of instances (batch of instances). It must be mentioned that for a batch of instances the user must import an Excel CSV file with all the instances for which he/she wants predictions.
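The input conventions described above (attribute names in the first row, class attribute in the last column, ? for a missing measurement, automatic detection of numeric versus symbolic attributes) can be sketched as follows. This is not the tool's own code, which runs on the Java Virtual Machine. It is a hypothetical Python illustration, and the function name, the column names and the sample rows are invented for the example.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple, Union

Value = Optional[Union[float, str]]


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def load_training_set(
        csv_text: str) -> Tuple[List[str], Dict[str, str], List[List[Value]]]:
    """Parse CSV text into attribute names, inferred types and data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, raw = rows[0], rows[1:]   # the first row holds the attribute names

    # An attribute is numeric if every recorded (non-"?") value parses as a number.
    types: Dict[str, str] = {}
    for col, name in enumerate(names):
        present = [r[col] for r in raw if r[col] != "?"]
        numeric = bool(present) and all(_is_number(v) for v in present)
        types[name] = "numeric" if numeric else "symbolic"

    data: List[List[Value]] = []
    for r in raw:
        parsed: List[Value] = []
        for col, name in enumerate(names):
            cell = r[col]
            if cell == "?":
                parsed.append(None)          # no measurement recorded
            elif types[name] == "numeric":
                parsed.append(float(cell))
            else:
                parsed.append(cell)
        data.append(parsed)
    return names, types, data


# Hypothetical sample file: the class attribute is the last column.
SAMPLE = """sex,age,wri_1,final_mark
male,31,8,6.5
female,?,7,5
male,28,?,3
"""
names, types, data = load_training_set(SAMPLE)
class_attribute = names[-1]
```

Keeping missing values as `None` rather than dropping the row leaves the decision on how to replace them to the learning algorithm, in line with the text's remark that some implementations replace missing values globally.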

Figure 1. The CSV file of the use case

Figure 2. The prototype tool

The ranking of the attributes' influence brought considerable benefits by helping the tutors to better understand the characteristics of the population that most affect academic achievement. For example, the prototype tool shows that, for the dataset used, the attributes that most influence the induction are WRI-4 and WRI-3 (Figure 3).

Figure 3. Ranking the attributes' influence on the final prediction in our use case

What is more, the implemented tool can present useful information about the imported data set, such as the presence or absence of missing attribute values, the frequency of each attribute value, etc. Finally, the tool provides on-line help for novice users.

CONCLUSION

This paper aims to fill the gap between empirical prediction of student performance and the existing regression techniques. Our data set is from the module INF10, but most of the conclusions are wide-ranging and are of interest for the majority of programs of study of the Hellenic Open University and, more generally, for all distance education programs. It would be interesting to compare our results with those from other open and distance learning programs offered by other open universities. So far, however, we have not been able to find such results.

Generally, the education domain offers many interesting and challenging applications for data mining. Firstly, an educational institution often has many diverse and varied sources of information. There are the traditional databases (e.g. students' information, teachers' information, class and schedule information, alumni information), online information (online web pages and course content pages) and, more recently, multimedia databases. Secondly, there are many diverse interest groups in the educational domain that give rise to many interesting mining requirements. For example, the administrators may wish to find out information such as admission requirements and to predict the class enrollment size for timetabling. The students may wish to know how best to select courses based on predictions of how well they will perform in the courses selected.
With so much information and so many diverse needs, it is foreseeable that an integrated data mining system that is able to cater for the special needs of an education institution will be in great demand, particularly in the 21st century.

In a future study we intend to apply data mining methods with the goal of answering the following two research questions:

1) Do there exist groups of students who use online resources in a similar way? If so, can we identify the class an individual student belongs to? Based on the usage of the resource by other students in the group, can we help a new student use the resources better?
2) Can we classify the learning difficulties of the students? If so, can we show how different types of problems impact students' achievement? Can we help instructors to develop the homework more effectively and efficiently?

APPENDIX

The tool is available on the web page: http://www.math.upatras.gr/~esdlab/regression-tool/
The Java Virtual Machine (JVM) 1.2 or newer is needed for the execution of the program.

REFERENCES

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11-73.

Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. ISBN: 080394540X, Sage Pubs.

Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2003). Preventing student dropout in distance learning systems using machine learning techniques. Proceedings of the Seventh International Conference on Knowledge-Based Intelligent Information & Engineering Systems, 3-5 September 2003, Lecture Notes in AI, Springer-Verlag, Vol. 2774, pp. 267-274.

Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2004). Predicting Students' Performance in Distance Learning Using Machine Learning Techniques. Accepted for publication in Applied Artificial Intelligence (AAI).

Mitchell, T. (1997). Machine Learning. McGraw Hill.

Platt, J. (1999). Using sparseness and analytic QP to speed training of support vector machines. In: Kearns, M. S., Solla, S. A., & Cohn, D. A. (Eds.), Advances in Neural Information Processing Systems 11. MA: MIT Press.

Shevade, S., Keerthi, S., Bhattacharyya, C., & Murthy, K. (2000). Improvements to the SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11(5), 1188-1193.

Sikonja, M., & Kononenko, I. (1997). An adaptation of Relief for attribute estimation in regression. Proceedings of the Fourteenth International Conference (ICML'97), ed. Doug Fisher, pp. 296-304. Morgan Kaufmann Publishers.

Wang, Y., & Witten, I. H. (1997). Induction of model trees for predicting continuous classes. In Proc. of the Poster Papers of the European Conference on ML, Prague (pp. 128-137). Prague: University of Economics, Faculty of Informatics and Statistics.

Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Mateo, CA.

Xenos, M., Pierrakeas, C., & Pintelas, P. (2002). A survey on student dropout rates and dropout causes concerning the students in the course of informatics of the Hellenic Open University. Computers & Education, 39, 361-377.