INTRO:

Every high school student who intends to pursue higher education needs to provide personal background information and academic achievements as part of the application documents for a university's admission committee. As students, we cannot deny that an applicant's background influences the viewpoint of admission officers as they examine the applications. Admission committees gather large volumes of data from past and current students. Advances in the data mining field make it possible for universities to establish sets of criteria for future applicants and provide innovative ways of supporting both teachers and students.

RESULTS:

Classification:

The results from the decision tree algorithm (J48) are very poor. Only 47% to 50% of instances are correctly classified for students' final grades, which is basically useless for predicting grades. For the k-nearest neighbors algorithm (KNN) with k = 1, 5, 10, and 20, the prediction accuracy gradually increases from 42% to 49%. We attribute this low accuracy to overfitting and to the complexity of the attributes. Based on our investigation, the instances classified as A, B, or C have very similar attribute values, and of all 30 attributes, some contribute little to the final grades. Despite the poor prediction performance, we studied the correlation of the attributes with the final grades. The attributes that most influence students' academic performance are listed below, and they seem very reasonable.

Number of past class failures
Intention for pursuing higher education
Number of school absences
Mother's education level
Weekday alcohol consumption
Father's education level
Weekend alcohol consumption
Weekly study time

Figure 1: Important attributes that influence student’s academic performance sorted by percentage
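
A ranking like the one in Figure 1 could be produced along the following lines. This is only a sketch: the report does not say which correlation measure was used, so Pearson correlation is an assumption here, and the column names and values are made-up toy data rather than the real data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_attributes(columns, target):
    """Sort attributes by the absolute value of their correlation with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data, invented for illustration only.
final_grade = [5, 8, 10, 12, 15, 17]
toy_columns = {
    "past_failures": [3, 2, 1, 1, 0, 0],
    "study_time": [1, 1, 2, 2, 3, 4],
    "unrelated_attribute": [4, 10, 2, 6, 0, 8],  # hypothetical filler column
}

ranking = rank_attributes(toy_columns, final_grade)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking the absolute value treats strongly negative attributes (such as past failures) as equally informative as strongly positive ones, which matches the idea of sorting by importance rather than by direction.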

Regression:

Unlike the classification results, our regression algorithms produce more plausible predictions. The results from the linear regression, random forest, and multi-layer perceptron models are plotted in Figure 2. With 888 randomly selected training instances and 156 test instances, the best machine learning algorithm for predicting the final grade is random forest. Its average error and standard deviation are 2.2 and 2.15 points. The second best algorithm is linear regression, with an average error of 2.5 points and a standard deviation of 2.29 points. Random forest likely outperforms linear regression because it works better when the relation between a feature and the output is conditional on the values of other features. Random forest is also better at capturing non-monotonic relations between features and the output. The multi-layer perceptron, on the other hand, has the worst prediction accuracy of the three, as shown in Figure 2. Its average error is nearly 6 points.
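
The two numbers reported for each algorithm can be computed as in the sketch below. It assumes that "average error" means the mean absolute difference between predicted and actual grades and that the standard deviation is taken over those absolute errors; the report does not state either choice. The grades in the example are invented.

```python
from statistics import mean, pstdev

def error_summary(actual, predicted):
    """Return (mean absolute error, population standard deviation of the absolute errors)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return mean(errors), pstdev(errors)

# Invented grades on the 0 to 20 scale, for illustration only.
actual_grades = [10, 12, 15, 8]
predicted_grades = [11, 12, 13, 12]

avg_error, spread = error_summary(actual_grades, predicted_grades)
print(f"average error: {avg_error:.2f} points, standard deviation: {spread:.2f} points")
```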

Figure 2: Prediction error comparison for three different regression algorithms.

FUTURE WORK:

Several things could extend this project further. Obtaining more data, such as grades from more courses, students’ lifestyle, and the difficulty of the exams, is expected to increase prediction accuracy. Some features depend on each other, so figuring out how the features are interrelated will help us eliminate unnecessary attributes.

DATA:

Our data are adopted from the paper "Using Data Mining to Predict Secondary School Student Alcohol Consumption". That paper predicted a student’s alcohol consumption rate based on their background. We think this data set, which involves 1044 students and 30 features, is sufficient for us to predict academic performance. The attributes and data types are listed below.

METHODOLOGY:

Our goal is to predict a student’s academic performance, measured by the grade assigned at the end of the term. In the original data sets, the final grade is numerical and ranges from 0 to 20 points. Since the target is numerical, we initially planned two strategies to tackle this problem. The first strategy is classification, which involves preprocessing the grades and dividing them into several categories. The second strategy is regression, which is the more intuitive and natural fit for a numerical target.

For classification, we use decision tree (J48) and k-nearest neighbors (KNN) classifiers, since we are most familiar with these methods. We divide the final grades into five nominal levels: A, B, C, D, and E. In addition, we conduct regression analysis on the data to predict academic performance with linear regression, random forest, and multi-layer perceptron models, using the final grades directly in numerical form.
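
The preprocessing step for classification can be sketched as follows. The report does not give the boundaries between the five levels, so the cut-offs below are hypothetical placeholders chosen only to make the example run.

```python
# Hypothetical cut-offs: the report does not state where each level begins.
ASSUMED_CUTOFFS = [(16, "A"), (14, "B"), (12, "C"), (10, "D")]

def grade_to_level(points):
    """Map a 0-20 final grade to a nominal level A-E using the assumed cut-offs."""
    if not 0 <= points <= 20:
        raise ValueError("final grade must be between 0 and 20")
    for lower_bound, level in ASSUMED_CUTOFFS:
        if points >= lower_bound:
            return level
    return "E"

levels = [grade_to_level(g) for g in [18, 15, 12, 10, 4]]
print(levels)
```

Keeping the cut-offs in one table makes it easy to try different boundaries, which matters here because neighboring levels were found to have very similar attribute values.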

Our data include 1044 students from two different schools and their grades from a math course and a Portuguese language course. We use 85% of the original data for training and the remaining 15% for testing.
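
A random 85/15 split of this kind can be sketched as below. The function name, the fixed seed, and the rounding rule (rounding the test share down) are assumptions; the report does not describe how the split was drawn, although this rule happens to give 888 training and 156 test instances for 1044 rows.

```python
import random

def split_train_test(rows, test_percent=15, seed=0):
    """Shuffle the rows and hold out test_percent of them (rounded down) for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = len(shuffled) * test_percent // 100
    return shuffled[n_test:], shuffled[:n_test]

# Row indices stand in for the 1044 student records.
train_rows, test_rows = split_train_test(range(1044))
print(len(train_rows), len(test_rows))
```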