Student Performance Prediction and Risk Analysis by Using Data Mining Approach

Bilal Mehboob, Rao Muzamal Liaqat, Nazar Abbas
CEME, NUST, Pakistan
bilalmehboob.pk@gmail.com, muzammilliaqat@gmail.com, nazeerebbas@gmail.com

ABSTRACT: Today we are surrounded by large volumes of data related to student performance (class participation, attendance, prior student history, quiz results, subject dependencies, and student CGPA up to the final semester). In this paper we evaluate the reasons for student failure on the basis of previous data and predict the risk of failure for the next course, so that students can be mentally prepared for an offered course and aware of its dependency level. In engineering it is common that a student who does not know a basic course cannot perform well in advanced courses of the same scope. In this paper we back-trace the causes of failure with the help of a decision tree. This work also helps to estimate risk at an early phase, which can help teachers design effective plans for students who are at risk. We have used six algorithms for prediction and risk analysis; ID3 gives the best results compared to the others. We use the dataset of CEME, NUST, consisting of 450 records extracted from five degree cohorts (DE_29, DE_30, DE_31, DE_32, and DE_33).

Keywords: Data Mining, Risk Analysis, ID3, Performance Prediction

Received: 10 January 2017, Revised 19 February 2017, Accepted 1 March 2017

© 2017 DLINE. All Rights Reserved

1. Introduction

Education plays an important role in the development of a country, especially for underdeveloped countries like Pakistan. It is therefore important to find out the reasons for student failure in order to improve educational growth and close the gaps in this domain. Traditionally, teachers predict student performance on the basis of their experience: they understand each student's nature and temperament well.
In this paper we use a data mining approach to predict student performance and to identify the risk of failure. Data mining techniques and methods can be applied in various fields such as marketing, sales, trade, business, real estate, and web engineering, and their use in the education domain is increasing rapidly. Data mining is closely related to knowledge discovery, in which we extract useful information or patterns from data. Several tools are used in data mining, such as RapidMiner, Weka, and SPSS.

Journal of Intelligent Computing, Volume 8, Number 2, June 2017

In data mining, two important paradigms are used: unsupervised learning and supervised learning. In unsupervised learning we use clustering to extract useful information from the data; well-known clustering algorithms include k-means, DBSCAN, and Expectation Maximization Clustering (EMC), while k-NN and SVM (Support Vector Machine) are well-known supervised learners. Our main focus in this paper is on supervised learning, in which we use decision tree, naïve Bayes, ID3, and random forest algorithms for student performance prediction and risk analysis. The application of data mining in education is known as EDM (Educational Data Mining), a term established by the International Society of Educational Data Mining [1]; this society also deals with the different types of data that come from the educational domain. In this paper a data mining approach is used to predict students' early performance and risk level. We have used decision tree, random tree, random forest, ID3, CHAID, and decision stump algorithms to predict student performance and the reasons for failure, and we select ID3 on the basis of accuracy, AUC, sensitivity, and error rate. We have performed a comparative analysis of these algorithms to evaluate our work.

The rest of the paper is organized as follows. Section 2 describes related work in this domain. Sections 3 to 6 describe the steps of the proposed methodology: data selection, preprocessing and cleaning, feature selection and extraction, and data mining. Section 7 presents the results and their interpretation, and Section 8 concludes the paper and discusses future work.

2. Related Work

Data mining has emerged as an active research area in the educational domain because much can be exposed by using a data mining approach: we can analyze students' performance, faculty performance, the difficulty levels of different subjects, and the reasons for failure among students.
According to a survey, thousands of students drop out due to poor academic performance [15]. The strengths and weaknesses of students can be extracted by using a data mining approach [2]. Data mining has long been used to extract hidden patterns and useful information from data [3]. Romero and Ventura give an exhaustive overview of the data mining approaches used by different researchers for educational data mining from 1995 to 2005 [4]. Ryan Baker has described the state of data mining in educational data [5]. Traditional statistical methods have been used to calculate student performance, but these methods do not give satisfactory results for performance prediction based on previous data [6]. Researchers have used tree-based structures to obtain useful information from data; in a tree-like structure there are root nodes and leaf nodes, and the information between these nodes is depicted in the form of layers [5]. Clustering algorithms have also been used for performance prediction [7]. Ganesh and Jamesmanoharan used k-means clustering to predict student performance, dividing the data into different sets of clusters [7]. Srividhya and Keerthana used both classification and clustering techniques to predict student performance [8].

[Figure 1. Proposed Methodology]

Carnegie and Watterson used high-school records to predict students' GPA in the first year of engineering [9]. Dekker and Pechenizkiy used a data mining approach to find dropout information and student performance [10]. Bydzovska and Bayer introduced the idea of using student data along with social data to predict student performance more precisely [11]. Sangeeta and Mishra used the Random Tree and J48 algorithms to predict student performance [12]. Lopez and Romero introduced the concept of a meta-classifier for clustering and used the EM (Expectation Maximization) algorithm to predict the academic performance of students [13]. Pallamreddy et al. applied the Decision Tree (DT) algorithm to a dataset; this algorithm gives a tree-like model that helps in understanding decisions and their consequences based on the nature of the data [14].

3. Data Selection

We now explain each step of the proposed methodology (Figure 1) one by one. In this paper we use the dataset provided by the College of Electrical and Mechanical Engineering (CEME), NUST. The dataset consists of 450 records extracted from degrees DE_29 through DE_33. We selected these degrees because we had complete access to profile data as well as individual academic records.

4. Preprocessing & Cleaning

In this step we convert raw data into machine-understandable data by applying some preprocessing steps. The data has to be converted according to the nature of each algorithm: for example, to run ID3 and CHAID we convert the data into polynomial form, and to run DT we have to assign the label attribute. In preprocessing we exclude missing values to make the data compatible with the algorithms. We applied a filtering mechanism to remove the missing values; filters also let us extract the data of interest by using the built-in operators in filters. If a column carries little information with respect to the label attribute, it can be excluded from the data. We used RapidMiner Studio 7 to perform the preprocessing steps. The filtering mechanism and data formatting are shown in Figures 2 and 3, respectively.

[Figure 2. Filtering Mechanism]
[Figure 3. Data Formatting]

5. Feature Selection and Extraction

This is an important step in which we select the most important attributes, those that have a direct effect on the label attribute. Many methods are used for feature selection and extraction; examples include entropy, information gain (IG), and reducts and core. In this paper we use a correlation matrix to find the most important attributes. In the correlation matrix, each attribute is assigned a numeric value known as its weight, with a minimum of 0 and a maximum of 1; a higher weight indicates greater importance, or discriminability, of that attribute. On the basis of the assigned weights we can apply a threshold value to reduce the attributes. The weights assigned to the different attributes by correlation are shown in Table 1.

[Figure 4. Correlation Matrix]

Table 1. Weights by Using Correlation

    Attribute                 Weight     Attribute                 Weight
    Category                  1.000      Database Engineering      0.158
    Mathematics_1             0.256      OOP                       0.172
    Mathematics_2             0.344      Algorithm and Computing   0.329
    Mathematics_3             0.318      Data Structure            0.298
    Mathematics_4             0.228      PL&E                      0.247
    Mathematics_5             0.242      Computer Networks         0.156
    Engineering Mechanics     0.215      Mobile Networks           0.223
    Pakistan Studies          0.467      Network Analysis          0.226
    Digital System Design     0.121      Digital Communication     0.064
    Electronic Circuit        0.242      Computer Aided Drawing    0.546
    Control System            0.139      Design Project            0.390
    AI                        0.125      Matric/O Level            0.867
    Engineering Economics     0.288      FSC/A Level               0.886
    Software Engineering      0.171      Games/Activities          0.878

6. Data Mining
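As context for this step, ID3 (applied below) chooses each split by information gain, the same entropy-based measure listed among the feature-selection methods above. A minimal pure-Python sketch of these two computations follows; the records and attribute values are illustrative placeholders, not the CEME dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Reduction in label entropy after splitting rows on attribute attr."""
    n = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in by_value.values())
    return entropy(labels) - remainder

# Illustrative records (hypothetical attribute values, not real data):
rows = [
    {"Mathematics_1": "Risk",          "Attendance": "Low"},
    {"Mathematics_1": "Risk",          "Attendance": "High"},
    {"Mathematics_1": "Average",       "Attendance": "High"},
    {"Mathematics_1": "Above Average", "Attendance": "High"},
]
labels = ["Risk", "Risk", "Average", "Above Average"]

# ID3 picks the attribute with the highest gain as the next split.
best = max(rows[0], key=lambda a: information_gain(rows, a, labels))
print(best)  # Mathematics_1
```

Here Mathematics_1 separates the labels perfectly, so its gain equals the full label entropy and an ID3 learner would split on it first.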
Now we apply DT (decision tree), random forest, random tree, ID3, CHAID, and decision stump to extract useful information from the data. To get a better understanding of the data, we mapped the 8th-semester CGPA to the student performance prediction label. We calculated the performance of each algorithm on the basis of accuracy and per-class precision. A comparative analysis of the algorithms is shown in Table 2.

Table 2. Comparative Analysis of Algorithms

    Algorithm        Accuracy   Average      Risk         Below Avg.   Above Avg.
                                Precision    Precision    Precision    Precision
    Decision Tree    55.52      42.86        58.82        25.00        59.20
    Random Tree      54.11      39.13        75.00        30.00        58.48
    Random Forest    61.97      40.00        72.22        36.36        65.85
    ID3              79.23      78.23        88.00        91.21        93.50
    CHAID            49.50       0.00         0.00         0.00        49.54
    Decision Stump   50.95       0.00        50.00         0.00        50.94

7. Results & Interpretation
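The accuracy and per-class precision figures of Table 2 can be reproduced from a list of predictions as follows; the labels below are illustrative examples, not the actual CEME results.

```python
def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true label."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, cls):
    """Per-class precision: of all predictions of cls, the fraction correct."""
    hits = [t for t, p in zip(y_true, y_pred) if p == cls]
    return 100.0 * hits.count(cls) / len(hits) if hits else 0.0

# Illustrative predictions over the four performance labels:
y_true = ["Risk", "Risk", "Average", "Above Average", "Below Average", "Average"]
y_pred = ["Risk", "Average", "Average", "Above Average", "Risk", "Average"]

print(round(accuracy(y_true, y_pred), 2))           # 66.67
print(round(precision(y_true, y_pred, "Risk"), 2))  # 50.0
```

A class for which no prediction is ever correct, or that is never predicted at all, shows a precision of 0.00, as in the CHAID and Decision Stump rows of Table 2.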
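The four performance labels used throughout derive from the 8th-semester CGPA; the conclusion below states the thresholds. A minimal sketch of that mapping follows; the gaps between the stated bands (e.g. between 2.5 and 2.6) are closed here with open-ended comparisons, which is an assumption.

```python
def cgpa_to_label(cgpa):
    """Map an 8th-semester CGPA to a performance category.

    Thresholds follow the paper's categorization; open-ended
    comparisons closing the gaps between bands are an assumption.
    """
    if cgpa >= 3.0:
        return "Above Average"
    if cgpa >= 2.6:
        return "Average"
    if cgpa >= 2.3:
        return "Below Average"
    return "Risk"

for g in (3.4, 2.75, 2.4, 2.0):
    print(g, cgpa_to_label(g))
```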
8. Conclusion & Future Work

In this paper we have used six algorithms; the results obtained with each are given in the comparative analysis table (Table 2). ID3 gives the best results compared to the others on the basis of accuracy and precision for performance prediction and risk analysis. The descriptive output of ID3 lets us extract the following observations from the data:

- Students who were at Risk in Mathematics_1 were also at Risk in Mathematics_2.
- Students who were Average in Digital Image Processing were also Average in Digital Signal Processing.
- Students who were Above Average in DBMS performed well in Design Projects.
- Students with the status of Hafiz-e-Quran performed well compared to others.
- Students who scored low grades in DBMS also got low grades in Database Engineering.
- Most students who were Above Average in Microprocessor Based Design were also Above Average in Computer Architecture.
- Students who were Below Average or at Risk in Mathematics_1 got low grades in Mathematics_5.

Each of these observations is extracted from the ID3 model description. We divided the student results into four categories on the basis of 8th-semester CGPA: students with CGPA >= 3.0 are categorized as Above Average, CGPA 2.6 to 2.9 as Average, CGPA 2.3 to 2.5 as Below Average, and CGPA <= 2.2 as Risk. In future work we will design a tool that predicts next-semester subject grades and performs risk analysis on the basis of current results.

References

[1] Educational Data Mining Society. [Online]. Available: http://www.educationaldatamining.org/
[2] Personal learning plan, The Glossary of Educational Reform, Great Schools Partnership, Portland, ME, USA. [Online]. Available: http://edglossary.org/personal-learning-plan/
[3] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases, AI Magazine, Fall 1996.
[4] Romero, C., Ventura, S. (2007).
Educational data mining: a survey from 1995 to 2005, Expert Systems with Applications, 33 (1), p. 135-146.
[5] Baker, R.S.J.D., Yacef, K. (2009). The State of Educational Data Mining in 2009: A Review and Future Visions, Journal of Educational Data Mining, 1 (1).
[6] Sharabiani, A., Karim, F., Sharabiani, A., Atanasov, M., Darabi, H. (2014). An Enhanced Bayesian Network Model for Prediction of Students' Academic Performance in Engineering Programs, In: 2014 IEEE Global Engineering Education Conference, p. 832-837.
[7] Jamesmanoharan, J., Ganesh, S.H., Felciah, M.L.P., Shafreenbanu, A.K. (2014). Discovering Students' Academic Performance Based on GPA Using K-Means Clustering Algorithm, In: 2014 World Congress on Computing and Communication Technologies (WCCCT), p. 200-202.
[8] Keerthana, G., Srividhya, V. (2014). Performance Enhancement of Classifiers using Integration of Clustering and Classification Techniques, International Journal of Computer Science Engineering (IJCSE), May, p. 200-203.
[9] Carnegie, D. A., Watterson, C., Andreae, P., Browne, W. N. (2012). Prediction of success in engineering study, In: 2012 IEEE Global Engineering Education Conference (EDUCON), p. 1-9.
[10] Dekker, G. W., Pechenizkiy, M., Vleeshouwers, J. M. (2009). Predicting Students Drop Out: A Case Study, In: Proceedings of the 2nd International Conference on Educational Data Mining, Cordoba, Spain, p. 41-50.
[11] Bayer, J., Bydžovská, H., Géryk, J., Obšívač, T., Popelínský, L. (2012). Predicting drop-out from social behaviour of students, In: Proceedings of the 5th International Conference on Educational Data Mining (EDM 2012), Chania, Greece, p. 103-109.
[12] Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules, In: Piatetsky-Shapiro, G., Frawley, W. J. (eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
[13] Lopez, M.I., Romero, C., Ventura, S., Luna, J.M. (2012). Classification via clustering for predicting final marks starting from the student participation in forums, In: Proceedings of the 5th International Conference on Educational Data Mining (EDM 2012), p. 148-151.
[14] Venkatasubbareddy, P., Sreenivasarao, V. (2010). The Result Oriented Process for Students Based On Distributed Data Mining, International Journal of Advanced Computer Science and Applications, 1 (5), November, p. 22-25.
[15] Hammond, L. D., Zielezinski, M. B., Goldman, S. (2014). Using technology to support at-risk students' learning, Alliance for Excellent Education, Stanford Center for Opportunity Policy in Education. [Online]. Available: https://edpolicy.stanford.edu/sites/default/files/scope-pub-using-technologyreport.pdf