Student Life and Grade Correlation

CSC 177-05/04/17
Professor Mei Lu
By David Judilla, Bryce Hairabedian, Justin Mendiguarin - Team 6

Objective

Student life is not one and the same for everyone. As students, we all have different things going on in our lives. Whether a student has a great family relationship, regularly attends class, or even has internet at home, these are all peripheral life circumstances. Our main objective in implementing the Student Life Grade Correlation data mart is to benefit educational entities. By exploring the data mart through our web interface, faculty, staff, and even students will be able to see whether certain life circumstances, such as family support, generally affect a student's performance.

Goals

There are three goals we would like to achieve with our data mart.

1. Faculty and staff can gain a better understanding of factors that can affect student performance.
2. Educational entities can gain a general understanding of the student demographic correlating to a certain grade range.
3. Students or educational entities can find the grade distribution and its correlation to the time a student spends outside the classroom.

Student Life Grade Correlation can help in the following areas:

A. Find the correlation between the grade received and circumstances such as:
- Age
- Absences
- Failures in the Past
- Family Relationship
- Free Time
- Going Out with Friends
- Daily Alcohol Consumption
- Weekly Alcohol Consumption
- Health

B. Depending on the grade received, find the most commonly reported statistic for:
- Internet Access at home
- Attending Nursery School
- In romantic relationship
- Extra educational support
- Sex
- Study Time
- Travel Time
- Age
- Absences
- Failures in the Past
- Family Relationship
- Free Time
- Going Out with Friends
- Daily Alcohol Consumption
- Weekly Alcohol Consumption
- Health

C. Find the grade distribution depending on a time circumstance value, for example how many students with a low amount of free time get a high score. The data mart will also provide the grade distribution for other time factors such as:
- Travel Time
- Free Time
- Time spent out with friends
- Study Time

Background Information

The data comes from the UCI Machine Learning Repository website (http://archive.ics.uci.edu/ml/datasets/student+performance), where it is titled Student Performance Data Set. The data set consists of a survey of high school level students in Portugal. The survey was taken over the course of the 2008 school year at two different schools and in two different courses, Mathematics and Language (Portuguese). There were 395 student records for Mathematics and 649 student records for Language, for a total of 1044 student records from both schools and both subjects. The survey consists of multiple student life attributes, or facts, such as family size, study time, and health. Grades were recorded three times over the course of the school year: two progress reports and one final grade.

The Student Performance Data Set contains the following attributes:

school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
sex - student's sex (binary: "F" - female or "M" - male)
age - student's age (numeric: from 15 to 22)
address - student's home address type (binary: "U" - urban or "R" - rural)
famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
guardian - student's guardian (nominal: "mother", "father" or "other")
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)

Grade recordings:

G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

Design

Our design for the data warehouse and data mart consists of merging, cleaning, reducing, and transforming the data before the final design of the star schema.

Data Preprocessing

The data did not require extensive cleaning, as it was already parsed into two separate CSV files. The cleaning consisted of merging both flat student record files with a full join in R-Studio. The full join was used because the same columns and facts were recorded in both sets. The initial data sets did not have an ID field, so a column was added and placed first as the Student ID.

Data reduction focused on removing unnecessary facts from the student records. The two facts removed from the data set were reason and school. We felt that the attribute reason (the reason for choosing to attend this school) and the attribute school (the school attended) were not valuable to our goals for the data mart or the data mining.

The data was transformed on two levels. The first level was relabeling all the binary survey facts: facts such as Internet at Home were mapped from yes and no values to 0 and 1 values. The other level of data transformation took place in the client data mart web application. There we map the students' Portuguese grading scale (0-20) to a grading scale most people are familiar with (percentage grades). We also map their grade results to meaningful labels for how well the student performed (from Excellent to Poor), making them relatable to all audiences.
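The preprocessing steps above can be sketched in a few lines. The project performed them in R-Studio with dplyr; the Python below is only an illustrative equivalent using the standard library. The inline sample rows, the comma delimiter, the helper names, and the direction of the yes/no mapping (yes to 1, no to 0) are all assumptions made for the example. The sketch simply stacks the rows of both files, which matches the outcome the report describes (395 + 649 = 1044 records).

```python
import csv
import io

# Hypothetical miniature versions of the two flat files (real files have 30+ columns).
MATH_CSV = """school,sex,age,internet,reason,G3
GP,F,18,no,course,6
GP,M,17,yes,home,15
"""
PORT_CSV = """school,sex,age,internet,reason,G3
MS,F,16,yes,reputation,11
"""

DROPPED = {"school", "reason"}   # facts removed during data reduction
BINARY = {"internet"}            # binary survey facts to relabel
YES_NO = {"yes": 1, "no": 0}     # assumption: the report does not spell out the direction


def load(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(*tables):
    """Merge the tables, add an id column first, drop and relabel facts."""
    merged = [row for table in tables for row in table]
    cleaned = []
    for student_id, row in enumerate(merged, start=1):
        out = {"id": student_id}          # id is placed first for easy reading
        for key, value in row.items():
            if key in DROPPED:
                continue
            out[key] = YES_NO[value] if key in BINARY else value
        cleaned.append(out)
    return cleaned


def to_percent(grade, scale_max=20):
    """Map a 0-20 Portuguese grade to a percentage grade."""
    return 100 * grade / scale_max


students = preprocess(load(MATH_CSV), load(PORT_CSV))
for student in students:
    print(student, to_percent(int(student["G3"])))
```

The labels from Excellent to Poor are not included because the report does not list the cut-off values used in the web application.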

Final Star Schema Design

The final star schema consists of 1 fact table and 4 dimensional tables:

- The Student fact table contains the basic information relating to the student and all grades recorded.
- The Home Life dimensional table includes any facts dealing with life at home: mother's and father's education, parents' marital status, student's guardian, family size, whether the address is rural or urban, internet at home, mother's and father's occupation, and daily and weekly alcohol consumption.
- The Relationships dimensional table contains any facts relating to human relationships: family relationship, family support, and romantic relationship.
- The Educational dimensional table includes all facts relating to education in the student's life, excluding grades: failures in the past, absences, paid classes, tutoring, and wanting to attend college.
- The Time dimensional table contains time-related factors in student life: time to travel to school, study time, free time, and how much the student goes out.
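As a rough illustration of the schema described above, the following sketch builds one fact table and four dimensional tables in an in-memory SQLite database. Only the table names Student and Time and the shared id column appear in the project's sample query; the other table names (HomeLife, Relationships, Education), the column-to-table assignments, and the column types are assumptions based on the attribute list and the table descriptions.

```python
import sqlite3

# Assumed layout: every dimensional table shares the student's id as its key.
SCHEMA = """
CREATE TABLE Student (
    id INTEGER PRIMARY KEY,
    sex TEXT, age INTEGER,
    G1 INTEGER, G2 INTEGER, G3 INTEGER
);
CREATE TABLE HomeLife (
    id INTEGER PRIMARY KEY REFERENCES Student(id),
    Medu INTEGER, Fedu INTEGER, Pstatus TEXT, guardian TEXT,
    famsize TEXT, address TEXT, internet INTEGER,
    Mjob TEXT, Fjob TEXT, Dalc INTEGER, Walc INTEGER
);
CREATE TABLE Relationships (
    id INTEGER PRIMARY KEY REFERENCES Student(id),
    famrel INTEGER, famsup INTEGER, romantic INTEGER
);
CREATE TABLE Education (
    id INTEGER PRIMARY KEY REFERENCES Student(id),
    failures INTEGER, absences INTEGER, paid INTEGER,
    schoolsup INTEGER, higher INTEGER
);
CREATE TABLE Time (
    id INTEGER PRIMARY KEY REFERENCES Student(id),
    traveltime INTEGER, studytime INTEGER,
    freetime INTEGER, goout INTEGER
);
"""


def build_star_schema(conn):
    """Create the fact table and the four dimensional tables."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)

# One made-up student, split across the fact table and the Time dimension.
conn.execute(
    "INSERT INTO Student (id, sex, age, G1, G2, G3) VALUES (1, 'F', 17, 12, 13, 14)"
)
conn.execute(
    "INSERT INTO Time (id, traveltime, studytime, freetime, goout) VALUES (1, 1, 3, 2, 2)"
)

# Joining the fact table to a dimension recovers a grade alongside a time factor.
row = conn.execute(
    "SELECT Student.G3, Time.studytime "
    "FROM Student JOIN Time ON Student.id = Time.id"
).fetchone()
print(row)
```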

Data Mart Web Application

Data Mart Implementation

Client Side: The data mart client is built as a Model-View-Controller web app. The website is a single page that serves the user through multiple API calls to the server. The AngularJS framework was used to call the backend server API for easy JSON resolution. HTML5, CSS, and Bootstrap were used to make the UI user- and mobile-friendly. The Google Charts API was used for visualization in the web app, displaying rich and dynamic charts.

Server Side: The server is what industry currently calls a RESTful service. Its sole purpose is to query our SQLite database and send back a JSON object depending on which endpoint is queried. The server was written in Python using Flask (a micro web framework) and ran locally, so during the demo it was served from localhost (http://localhost:5000/<api endpoint>). The client-side web app calls the server through three main API endpoints:

- /getgradefromcol? (@param column name, @param column value): returns the final grade distribution for the user-selected column name and value
- /gradestocol? (@param column): returns a JSON object with an array of the averages of a given column for all grades
- /gradeavgstats? (@param grade): returns a JSON object with an array of the most common value of each column for a selected grade

If a valid argument is passed, the server queries the SQLite database and returns the results in the form of a JSON object. The object is then parsed on the client side to display a chart or table, depending on the user's input.

Data Mart Use

The data mart web app is divided into three sections, which were created with these goals in mind:

1. Faculty and staff can gain a better understanding of factors that can affect student performance.
2. Educational entities can gain a general understanding of the student demographic correlating to a certain grade range.
3. Students or educational entities can find the grade distribution and its correlation to the time a student spends outside the classroom.

1. The first section, Line Graph, is meant to show the correlation between students' distributed grade averages and a certain fact of relationship, time, education, or home life.

   a. Select the fact you would like to see from the drop-down menu.
   b. Select Submit. A line graph will show how that particular fact fluctuates as the grade percentage increases.
2. The next section down on the data mart web app is the most common attribute table. This table shows the most commonly recorded value of every attribute in the survey for a certain grade range.
   a. Choose the grade range you would like to see from the drop-down menu.
   b. Select Submit to see the most commonly recorded value of each fact.
3. The third section is aimed at exploring time factors and how they may affect a student's performance, by showing a final grade distribution that depends on a time factor and a high-low range.
   a. First, select the time factor attribute you would like to see the grade distribution for.
   b. Second, select a range (4 = High to 1 = Low) for the time attribute you have selected.
      i. The value of the range is relative to the student. For example, if a student rated Free Time as a 4, this would be the highest, so the student has a lot of free time.
   c. Select Submit to see the distribution of grades on the donut chart.

Data Mining

Introduction

For the data mining portion of the project we aimed to answer one question, "Can we predict a student's grade?", given the other columns in our dataset. The columns of the dataset contained information highly coupled to a student's performance in school (e.g. age, absences, final grade, etc.). The dataset contained 1044 rows of information. We decided on an 80/20 split, where 836 rows (80%) were used for the training set and 208 rows (20%) were used for the test set. The primary tools used for data mining were R Studio and a machine learning algorithm called Random Forest.

Machine Learning Algorithm - Random Forest

To make predictions, the Random Forest machine learning algorithm was used.
Random Forest was appealing for the problem at hand because it can handle both regression and classification problems, and in the early stages of the project it was unknown which type of question (regression or classification) the experiment would be solving. Random Forest is also very similar to some of the decision tree algorithms that were learned in class. The main difference is that Random Forest is an ensemble learning algorithm: it creates a defined number of decision trees and uses the mean or mode of all the trees to create a prediction model.

Advantages:
- Above-average accuracy rate
- Resistant, but not immune, to overfitting
- Doesn't require cross-validation: each decision tree created gets a new bootstrap sample, which removes the need for cross-validation.

Disadvantages:
- Model creation takes a notable amount of time: creating a Random Forest of 2000 decision trees took 30 seconds, which could be an issue if we wanted to use Random Forest for live production predictions.
- The output isn't as easily interpreted as that of a simple decision tree.

Variable Importance

Another advantage of using Random Forest was that it can provide the importance of each column/variable in the dataset. %IncMSE indicates how much influence a variable has on the accuracy of the model: it measures how much the model's mean squared error increases when that variable's values are randomly permuted. From the variable importance graphs, it can be inferred that failures has a %IncMSE of about 60, so the model is noticeably less accurate when it cannot use failures. Variables that have very little influence on the grade predictions can also be seen, such as after-school activities and family relationship quality.
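To make the ensemble idea from the Random Forest discussion concrete, here is a minimal sketch of bootstrap sampling and aggregation. It is not the randomForest R package used in the project. Each "tree" is a stand-in that simply predicts the mean grade of its own bootstrap sample, whereas a real forest grows a full decision tree on each sample. The grades, the seed, the tree count, and the function names are made up for the example.

```python
import random
from statistics import mean, mode


def bootstrap_indices(n_rows, rng):
    """Draw n_rows indices with replacement; also return the out-of-bag indices."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def aggregate(predictions, task):
    """Combine per-tree predictions: mean for regression, mode for classification."""
    if task == "regression":
        return mean(predictions)
    if task == "classification":
        return mode(predictions)
    raise ValueError(f"unknown task: {task!r}")


grades = [6, 15, 11, 10, 14, 8, 12, 17, 9, 13]   # hypothetical final grades
rng = random.Random(177)                          # fixed seed for repeatability

tree_predictions = []
for _ in range(25):                               # 25 stand-in "trees"
    in_bag, out_of_bag = bootstrap_indices(len(grades), rng)
    # Stand-in tree: predict the mean grade of this tree's bootstrap sample.
    tree_predictions.append(mean(grades[i] for i in in_bag))

forest_prediction = aggregate(tree_predictions, "regression")
print(round(forest_prediction, 2))
```

The rows a tree never sees (the out-of-bag indices) are what allow a forest to estimate its error without a separate cross-validation pass.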

Grade Prediction - Regression

The regression problem was to predict a student's final grade (G3, described above), a numerical value from 0 to 20. Two sets of variables were used for the predictions: one with all columns except G3, and another with all columns except G1, G2, and G3.

[Figure: Sample of Results Data Frame]

During this portion of the analysis, an assumption was made that a prediction was correct when the difference between the predicted and actual grade was less than or equal to 2. With that assumption, an accuracy of 74.4% was found when predicting without a student's previous grades, and an accuracy of 95.3% when predicting with the student's previous grades. Plotting the actual grades against the predicted grades shows that, without previous grades, the predictions deviate more than when the student's previous grades are used, in which case the relationship is clearly linear and has less deviation.
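The accuracy rule described above (a prediction counts as correct when it is within 2 points of the actual grade) can be written as a short function. This is a sketch with made-up grades, not the project's R code, and the function name is hypothetical.

```python
def tolerance_accuracy(actual, predicted, tolerance=2.0):
    """Fraction of predictions within `tolerance` points of the actual grade."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)


# Hypothetical final grades and model outputs on the 0-20 scale.
actual_grades = [10, 12, 15, 8]
predicted_grades = [11.5, 9.0, 15.2, 10.0]

accuracy = tolerance_accuracy(actual_grades, predicted_grades)
print(accuracy)
```

Here three of the four predictions fall within 2 points; the second one misses by 3.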

Grade Prediction - Classification

The most glaring issue with the regression predictions was the assumption that a predicted grade within 2 points of the actual grade was a correct prediction. The problem is that 2 points is a whole letter grade in Portugal (2/20 = 0.1), which could be the difference between a passing and a failing grade. The solution was to turn the regression problem into a classification problem: a new column, `pass`, was created in the data frame, expressing whether or not a student passed the class (True or False). The Random Forest formula was then altered to predict the new column, `pass`.
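The following sketch shows how a `pass` column could be derived and why the 2-point tolerance is a problem near the pass mark. The passing threshold of 10 out of 20 is an assumption for illustration, since the report does not state the cut-off it used; the rows and helper names are also made up.

```python
PASS_THRESHOLD = 10  # assumption: the report does not state its pass cut-off


def add_pass_column(rows, threshold=PASS_THRESHOLD):
    """Return copies of the rows with a boolean `pass` column derived from G3."""
    return [{**row, "pass": row["G3"] >= threshold} for row in rows]


def within_tolerance(actual, predicted, tolerance=2):
    """The regression rule: correct when within `tolerance` points."""
    return abs(actual - predicted) <= tolerance


students = add_pass_column([{"id": 1, "G3": 9}, {"id": 2, "G3": 14}])
print(students)

# A prediction can be "correct" for regression yet wrong about passing.
actual, predicted = 9, 11
regression_ok = within_tolerance(actual, predicted)
same_outcome = (actual >= PASS_THRESHOLD) == (predicted >= PASS_THRESHOLD)
print(regression_ok, same_outcome)
```

In the last lines, a student who actually scored 9 is predicted at 11: the regression rule accepts the prediction, but the pass/fail outcome is flipped.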

[Figure: Without Previous Grades]

Conclusion

As previously stated, classification and regression predictions were made throughout the experiment using two sets of columns as independent variables (with and without the student's previous grades).

Regression without previous grades - 74.4% accuracy
Regression with previous grades - 95.3% accuracy
Classification without previous grades - 79.7% accuracy
Classification with previous grades - 91.3% accuracy

Both experiments resulted in above-average accuracies. Predicting a student's grade using previous grades resulted in accuracies roughly 12 to 21 percentage points higher than predicting without previous grades. The takeaway is that in future studies or predictions of student grades, one of the most important survey questions to ask would be for a student's previous grades in a similar subject.

This experiment could be improved by using a student's previous grades from a different class or subject. The previous grades used in this experiment were essentially progress report grades from the course taken. It would be interesting to see how well grades from a history course can predict the grade for a math course.

Data Mining Notes: The work of P. Cortez and A. Silva (see reference #4) was used to validate design choices for the data mining portion of this project. Cortez and Silva made similar decisions for their analysis in that they chose multiple sets of variables for their model (i.e. with and without previous grades), but they were more granular in their variable sets: they made three variable sets, each containing all other variables plus only G1, G2, or G3, respectively. They also changed their regression problem into a classification problem just as we did, but for a different reason; we changed our problem because we had made assumptions that we thought could lead to erroneous results.

Learning Experience

- Working with R. It is a very different language from any I have encountered before, being so statistically driven. Using this tool in the future would be beneficial for some of my more mathematical applications.
- Visualizing data. Use of different APIs for charts and graphs.
- Creating efficient star schemas.

References

1. Student Performance Data Set. http://archive.ics.uci.edu/ml/datasets/student+performance#
2. Portuguese Grading Scale, High School and College. https://dre.pt/application/file/606224
3. R Package - Dplyr - Easy Merging. https://www.r-project.org/nosvn/pandoc/dplyr.html
4. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
5. "Titanic: Getting Started With R - Part 5: Random Forests." Trevor Stephens. N.p., 18 Jan. 2014. Web. 15 May 2017.
6. Bagging vs Boosting vs Stacking in Machine Learning. https://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning
7. Association Rule Mining. http://athena.ecs.csus.edu/~associationcw/
8. "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations" by Ian H. Witten, Eibe Frank, and Mark A. Hall, 3rd edition, Morgan Kaufmann 2011.
9. "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling" by Ralph Kimball and Margy Ross, Wiley; 2nd edition 2002.
10. "Data Mining: Introductory and Advanced Topics" by Margaret Dunham, Prentice Hall 2003.

Appendix

Sample Cleaning in R-Studio

Both flat files are read into the tables mathstud and portstud; together they contain all records within the student performance data set. The dplyr library was used for easy joining. Each student record was then given an ID as the primary key, which becomes a foreign key once the dimensional tables are created. ID was placed as the first column for easy reading.

Sample Transformation in R-Studio

Part of the data transformation changes labels from Yes, No to 0, 1.

Sample data mart query for time factors

A sample server-side query for the donut graph on the data mart web app is shown below, pulled from server.py lines 60-64.

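Below, cur is the cursor that executes the SQLite query. {0} is replaced with the first parameter, the column name, and ? is bound to the second parameter, the column value. The conditions are combined with AND, because & is SQLite's bitwise operator rather than a logical one.

    # colname must be checked against the known Time columns before formatting,
    # because column names cannot be bound as query parameters.
    cur.execute("""
        SELECT G3
        FROM Student, Time
        WHERE {0} = ? AND Student.id = Time.id
    """.format(colname), (colvalue,))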
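A self-contained version of this time-factor query can be sketched as follows. It is not the project's server.py: the function name, the set of allowed columns, and the sample rows are assumptions, and the Flask routing and JSON response are left out so the example runs with the standard library alone. The column name is checked against a whitelist and the value is bound as a query parameter.

```python
import sqlite3
from collections import Counter

# Assumed names of the time-factor columns in the Time dimensional table.
TIME_COLUMNS = {"traveltime", "studytime", "freetime", "goout"}


def grade_distribution(conn, colname, colvalue):
    """Count final grades (G3) of students whose time factor equals colvalue."""
    if colname not in TIME_COLUMNS:
        raise ValueError(f"unknown time factor: {colname!r}")
    # Safe to interpolate: colname was validated against the whitelist above.
    query = f"""
        SELECT Student.G3
        FROM Student JOIN Time ON Student.id = Time.id
        WHERE Time.{colname} = ?
    """
    return Counter(g3 for (g3,) in conn.execute(query, (colvalue,)))


# Tiny made-up database with the two tables the query touches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (id INTEGER PRIMARY KEY, G3 INTEGER);
    CREATE TABLE Time (
        id INTEGER PRIMARY KEY,
        traveltime INTEGER, studytime INTEGER,
        freetime INTEGER, goout INTEGER
    );
    INSERT INTO Student VALUES (1, 14), (2, 9), (3, 14);
    INSERT INTO Time VALUES (1, 1, 3, 4, 2), (2, 2, 1, 4, 5), (3, 1, 2, 1, 3);
""")

distribution = grade_distribution(conn, "freetime", 4)
print(dict(distribution))
```

A server endpoint could serialize the returned counts to JSON for the donut chart; requests with an unknown column name are rejected before any SQL is built.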