Predicting School Ranks Through Data Mining
Micah Oppenheim, Boston University, CS 105, Professor David G. Sullivan, Ph.D.

Introduction

Everyone wants to ensure that they are receiving the best education available to them. It is no secret that some public schools are better than others; however, it is not always clear what makes one school better than another. This project attempts to answer the following question: What factors contribute to the ranking of a public school system? To answer it, I mine a dataset that originally contained information on 151 Greater Boston area high schools. The dataset contains information on the schools themselves, economic factors of the towns that the schools are in, the success of the students, and a numbered rank given by Boston Magazine. I create and analyze several regression functions that attempt to predict the rank of each school.

Dataset Description

The dataset used in this project was constructed by Boston Magazine and obtained from the Many Eyes website at http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/2005-public-school-rankings/versions/1. The version I used contains the names of 146 schools, their rank, and 21 other attributes:

ATTRIBUTE (TYPE): DESCRIPTION
school (nominal): Name of the school.
median_home_price (numeric): Median price of homes in the town.
residential_tax_rate (numeric): Residential tax rate per $1000 assessed valuation in the town.
enrollment (numeric): Number of students in the school.
cost_per_pupil (numeric): Mean estimated cost of attending the school.
sat_math (numeric): Average SAT Math score.
sat_participation_rate (numeric): Percentage of students who took the SAT.
dropout_rate (numeric): Percentage of the student body that did not graduate.
average_teacher_salary (numeric): Average salary of teachers.
classrooms_with_internet (numeric): Percentage of classrooms with internet access.
student_per_computer (numeric): Number of students for each computer.
mcas_math (numeric): Average score on the Massachusetts Comprehensive Assessment System Math exam.
student_teacher_ratio (numeric): Number of students for each teacher.
four_year_public (numeric): Percentage of the student body that attends a four-year public school after graduation.
two_year_public (numeric): Percentage of the student body that attends a two-year public school after graduation.
four_year_private (numeric): Percentage of the student body that attends a four-year private school after graduation.
two_year_private (numeric): Percentage of the student body that attends a two-year private school after graduation.
other (numeric): Percentage of the student body that does not attend college, join the military, or work after graduation.
work (numeric): Percentage of the student body that works after graduation.
military (numeric): Percentage of the student body that joins the military after graduation.
sat_verbal (numeric): Average SAT Verbal score.
mcas_english (numeric): Average score on the Massachusetts Comprehensive Assessment System English exam.
rank (numeric): Rank of the school as determined by Boston Magazine.

Data Preparation

I had to take several steps to prepare the dataset for data mining. I manually split the attribute SAT math/verbal into two separate attributes: SAT math and SAT verbal. I also split the attribute MCAS math/English into two separate attributes: MCAS math and MCAS English. This was necessary because each of these columns originally held two scores separated by a “/”, which the data mining program Weka would not have been able to interpret as numeric test scores. I then removed 5 schools that were missing data for certain attributes, reducing the dataset from 151 schools to 146. This let me create more accurate models, and the loss of 5 schools was not detrimental to the overall process. Next I focused on formatting the data table so it could be loaded into Weka. To do this I manually removed all commas, dashes, and percent signs in the data.
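The manual cleaning steps described above could also be scripted. The following is a minimal sketch, not the procedure actually used in this project: the function names and the sample cell values are hypothetical, and it covers only the "/" split and the removal of commas and percent signs (how dashes appeared in the raw data is not stated, so they are left out).

```python
def split_pair(cell):
    """Split a combined score cell such as '520/510' into two numbers."""
    first, second = cell.split("/")
    return float(first), float(second)


def clean_number(cell):
    """Remove commas and percent signs, then parse the cell as a number."""
    cleaned = cell.replace(",", "").replace("%", "").strip()
    return float(cleaned)


if __name__ == "__main__":
    # Hypothetical raw cells, for illustration only.
    sat_math, sat_verbal = split_pair("520/510")
    print(sat_math, sat_verbal)
    print(clean_number("412,500"))
    print(clean_number("87%"))
```

Keeping the two steps as separate small functions makes it easy to apply each one only to the columns that need it.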
I also removed the names of the schools, as the rank itself acts as an ID number for each school. Once the data was successfully loaded into Weka, I used Weka to randomize the data and then split it into training data, used to create the models, and test data, used to test them. The split was 115/31 schools, or roughly 80%/20%. Since I also created a Python program that uses a SQL query, I made another version of the data that could be loaded into a SQL database. To do this I removed the header of each attribute and added the names of the schools back into the data.

Data Analysis

I performed numeric estimation through several forms of regression analysis to predict the rank of schools. I used Weka to create regression functions based on 4 different models:

SimpleLinearRegression: A linear regression function using only one attribute to predict the output.
LinearRegression (no attribute selection): A linear regression function using all possible attributes.
LinearRegression (attributes are pruned using m5 method): A linear regression function in which attributes are pruned by Weka.
MultilayerPerceptron: A model trained with backpropagation. It connects layers of nodes, each computing a weighted function of its inputs, and combines their results to produce the output.

The complete models created by each form of regression can be found in Appendix A.

Results

Results of the four algorithms used:

Model                            Corr. coef. (training)   Corr. coef. (test)   Error (training)   Error (test)
SimpleLinearRegression           0.8811                   0.9156               44.69%             37.88%
LinearRegression (no pruning)    0.9783                   0.9726               19.30%             22.30%
LinearRegression (m5 pruning)    0.9769                   0.9677               19.99%             24.49%
MultilayerPerceptron             0.9976                   0.9897               6.94%              13.55%

SimpleLinearRegression: This regression function chose mcas_math as the sole attribute to be used.
Although this was the least accurate of the four models, it produced a respectable correlation coefficient of 0.8811 on the training data. This also suggests that, looking at each attribute independently, MCAS math scores have the most direct correlation with the perceived quality of a school. Interestingly, this model was noticeably more accurate on the test data than on the training data, which suggests that the correlation holds across the schools with relative consistency. A visualization helps display the connection between the two attributes:

[Figure: Average MCAS Math Score by Rank. x-axis: school rank; y-axis: MCAS math score.]

The figure shows the relatively linear relation between a school’s rank and the average MCAS math score of its student body. This is not surprising, as it implies that students who attend better schools receive better test scores.

LinearRegression (no pruning): The linear regression function created using all the attributes is significantly more accurate than simple linear regression, with an overall accuracy of 77.70% on the test data (100% minus the 22.30% test error). This shows that a school’s rank depends on many attributes and cannot be accurately determined by one, or even a few, of them. However, examining the coefficients assigned to the attributes does reveal which attributes seem to have the least effect on rank: median home price (coefficient of 0), enrollment (coefficient of -.0015), and average teacher salary (coefficient of -.0001) barely influence the output of the function and are therefore likely to be less important to the rank.

LinearRegression (m5 pruning): The linear regression function created with pruning scores slightly worse than the one without pruning in both correlation coefficient and accuracy, on both the training and test data. This is intuitive, as it suggests that the more information is available about a school, the better one can predict its rank.
This reinforces the notion that a school’s rank depends on numerous attributes, not just a few. Examining the attributes left in the pruned function may reveal the more important factors in determining a school’s rank: most of the attributes in this function have to do with test scores and students’ futures after graduation. The more physical characteristics of the school, such as student/teacher ratio, classrooms with Internet, and enrollment, are left out. This is intuitive, as it treats the performance of the students as the foremost influence on a school’s rank.

MultilayerPerceptron: This model is much more in depth than the others, as it feeds the values of all the attributes into each of 11 hidden nodes and then combines their outputs. This allows the model to be much more accurate than the others, with an overall accuracy of 86.45% on the test data (100% minus the 13.55% test error). This result makes intuitive sense, as it supports the idea that the more deeply the information about a school is examined, the more accurately its rank can be determined.

With this information I was able to create a Python program that allows a user to input the name of a school and get the school’s predicted rank, as determined by the pruned linear regression model, along with the actual rank given by Boston Magazine. The program can be seen in Appendix B.

Conclusion

With the relatively high accuracy rates and correlation coefficients obtained through linear regression, it seems that the attributes in this dataset do directly influence the rank of a school. However, it is hard to say exactly which attributes of a school give it a higher ranking. Test scores and the success of students after graduation, as conveyed by college entry statistics, seem to have the most influence on a school’s rank. This supports the argument that a school’s quality is based on the performance of its student body, as that is the best indication that a school is doing its job.
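To make the MultilayerPerceptron description more concrete, here is a minimal sketch of how a network shaped like the one listed in Appendix A (sigmoid hidden nodes feeding one linear output node) turns attribute values into a single number. It assumes each threshold acts as an additive bias, and the tiny weights in the usage example are hypothetical, not the project's. Weka may also normalize the attributes and the target internally, so plugging raw attribute values into the Appendix A weights would not necessarily reproduce its predictions.

```python
import math


def sigmoid(x):
    """Logistic function used by each hidden node."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_nodes, output_node):
    """Compute the output of a one-hidden-layer network.

    hidden_nodes: list of (threshold, weights) pairs, one per sigmoid node.
    output_node:  (threshold, weights) pair for the linear output node,
                  with one weight per hidden node.
    """
    hidden_outputs = []
    for threshold, weights in hidden_nodes:
        total = threshold + sum(w * x for w, x in zip(weights, inputs))
        hidden_outputs.append(sigmoid(total))
    out_threshold, out_weights = output_node
    return out_threshold + sum(w * h for w, h in zip(out_weights, hidden_outputs))


if __name__ == "__main__":
    # Hypothetical two-input, two-hidden-node network for illustration.
    hidden = [(-0.3, [0.2, -0.1]), (0.1, [0.4, 0.3])]
    output = (0.05, [0.7, -0.5])
    print(forward([0.5, 0.25], hidden, output))
```

In the model from this project the same pattern would have 21 inputs and 11 hidden nodes, which is why Appendix A lists 11 blocks of 21 attribute weights.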
Appendix A

SimpleLinearRegression function:
-1.99 * MCAS math +
195.11

LinearRegression Function (no pruning):
0 * median home price +
3.2171 * residential tax rate +
-0.0015 * enrollment +
-0.0035 * cost per pupil +
-0.2046 * SAT math +
-22.0749 * SAT participation +
0.588 * dropout rate +
-0.0001 * average teacher salary +
0.0496 * classrooms with internet +
0.0217 * students per computer +
-0.6226 * MCAS math +
0.6543 * student to teacher ratio +
1.2052 * four year public +
1.031 * two-year public +
0.5295 * four year private +
0.537 * two-year private +
1.0611 * other +
1.3179 * work +
1.3801 * military +
0.0269 * SAT verbal +
-0.4808 * MCAS English +
151.3666

LinearRegression Function (m5 pruning):
3.3348 * residential tax rate +
-0.0044 * cost per pupil +
-0.2337 * SAT math +
-21.6186 * SAT participation +
-0.6591 * MCAS math +
0.6854 * four year public +
0.4835 * two-year public +
0.5861 * other +
0.8442 * work +
-0.4095 * MCAS English +
241.0377

MultilayerPerceptron Function:
Linear Node 0 Inputs Weights Threshold 0.013608543390385267 Node 1 -0.7018654675930166 Node 2 0.8377544996547287 Node 3 0.05080787892170032 Node 4 0.6044820299992532 Node 5 -0.08997647801025567 Node 6 0.18270215970649528 Node 7 0.324201319750299 Node 8 0.6150776623869614 Node 9 -0.9556171321828326 Node 10 -0.7389437359732564 Node 11 0.4963316013623846 Sigmoid Node 1 Inputs Weights Threshold -0.3567897469061703 Attrib median home price 0.174950912414142 Attrib residential tax rate -0.1932557871681284 Attrib enrollment 0.37725504739258997 Attrib cost per pupil 0.40587368647992705 Attrib SAT math -0.210798428824892 Attrib SAT participation -0.3224444948882534 Attrib dropout rate 0.6179652209508388 Attrib average teacher salary -0.46743686705284365 Attrib classrooms with internet -0.1383468728439659 Attrib students per computer 0.1930770003055754 Attrib MCAS math -0.41764160546287415 Attrib student to teacher ratio 0.16861098457603077 Attrib four year public -0.08291590499652372 Attrib
two-year public 0.5352051777872039 Attrib four year private 0.08253526979616456 Attrib two-year private 0.4891101125699421 Attrib other 0.09336708055476663 Attrib work 0.12588125555015162 Attrib military 0.43349416260813445 Attrib SAT verbal -0.5033755686155623 Attrib MCAS English -0.19079596640767774 Sigmoid Node 2 Inputs Weights Threshold -0.5207972266294439 Attrib median home price 0.25821146768527287 Attrib residential tax rate 0.4031875509894643 Attrib enrollment -0.26105557255140266 Attrib cost per pupil -1.4260965171011224 Attrib SAT math -0.7130861415786883 Attrib SAT participation -0.35151570920661374 Attrib dropout rate 0.7468319080478203 Attrib average teacher salary -0.7314950102156266 Attrib classrooms with internet 0.6845526965445545 Attrib students per computer -0.09053897293236066 Attrib MCAS math -0.277816939834462 Attrib student to teacher ratio 1.1152688981379313 Attrib four year public 0.7133771376861674 Attrib two-year public -0.0023339699657199025 Attrib four year private -0.15641889645710239 Attrib two-year private -0.11129947932646851 Attrib other 0.278064101751817 Attrib work 0.5113972394392824 Attrib military -0.35811310950941244 Attrib SAT verbal 0.05566470749684916 Attrib MCAS English -0.8432412500028468 Sigmoid Node 3 Inputs Weights Threshold -0.3458254870540105 Attrib median home price 0.09601057737055244 Attrib residential tax rate -0.0544514773552502 Attrib enrollment 0.18500856142680458 Attrib cost per pupil 0.16010892053583115 Attrib SAT math -0.1292231456043694 Attrib SAT participation -0.11200983538390738 Attrib dropout rate 0.23943777652480003 Attrib average teacher salary -0.08451340006347438 Attrib classrooms with internet -0.35490416530066715 Attrib students per computer 0.3007880100531612 Attrib MCAS math -0.14414222601868165 Attrib student to teacher ratio 0.18291646206003273 Attrib four year public 0.10612807446857082 Attrib two-year public 0.07860376579517057 Attrib four year private 0.10565899167130793 Attrib two-year 
private 0.287428261571298 Attrib other 0.2696629440919717 Attrib work 0.1964208254666402 Attrib military 0.21555985172048864 Attrib SAT verbal -0.11328401408414435 Attrib MCAS English -0.10124937631370062 Sigmoid Node 4 Inputs Weights Threshold -0.3090453671223993 Attrib median home price 0.025880255501360486 Attrib residential tax rate 0.152764093911713 Attrib enrollment 0.25421414981664897 Attrib cost per pupil 0.3231117833844099 Attrib SAT math -0.16035632891630053 Attrib SAT participation -0.20454542519258434 Attrib dropout rate -0.018186931964321872 Attrib average teacher salary 0.07732224712626272 Attrib classrooms with internet -0.73693164711512 Attrib students per computer 0.09701699517714785 Attrib MCAS math -0.03961820597924514 Attrib student to teacher ratio -0.09660501624154445 Attrib four year public 0.3449839796560578 Attrib two-year public -0.14842734685247372 Attrib four year private -0.08822891333041889 Attrib two-year private 0.42392666138561386 Attrib other 0.3800811090502082 Attrib work 0.12062171885397874 Attrib military 0.10645103968735717 Attrib SAT verbal -0.07739187781620664 Attrib MCAS English -0.3545241261886391 Sigmoid Node 5 Inputs Weights Threshold -0.38121280503557375 Attrib median home price 0.1275193805434692 Attrib residential tax rate -0.02058191349778571 Attrib enrollment 0.1285542203918258 Attrib cost per pupil 0.22693172584472165 Attrib SAT math -0.07549746798329492 Attrib SAT participation -0.06118224488370498 Attrib dropout rate 0.2706023781836587 Attrib average teacher salary -0.1819557811510326 Attrib classrooms with internet -0.2476072951760343 Attrib students per computer 0.28274969305634007 Attrib MCAS math -0.1201465762648191 Attrib student to teacher ratio 0.2649075771129865 Attrib four year public 0.05007444680065536 Attrib two-year public 0.12669164223100682 Attrib four year private 0.14316362657367424 Attrib two-year private 0.23363998242505601 Attrib other 0.2596434340860038 Attrib work 0.12363002925868323 Attrib 
military 0.18162037263028563 Attrib SAT verbal -0.1616977662116066 Attrib MCAS English -0.12742114131973759 Sigmoid Node 6 Inputs Weights Threshold -0.34269614825290634 Attrib median home price 0.2403923821885783 Attrib residential tax rate 0.4961627461261628 Attrib enrollment 0.13411605190224818 Attrib cost per pupil -0.07352750976547436 Attrib SAT math -0.30893932179575023 Attrib SAT participation -0.16703637506791122 Attrib dropout rate 0.2507020138815577 Attrib average teacher salary -0.18393782824093172 Attrib classrooms with internet -0.15930006086524007 Attrib students per computer 0.011250960961260706 Attrib MCAS math -0.027358293288930708 Attrib student to teacher ratio 0.36100767570039544 Attrib four year public 0.4752998409758364 Attrib two-year public 0.07220637569623042 Attrib four year private -0.1300719794522889 Attrib two-year private 0.19385720021178823 Attrib other 0.23188529897386306 Attrib work 0.08689055519938249 Attrib military -0.09507713352419005 Attrib SAT verbal -0.10493504099440695 Attrib MCAS English -0.4771481783777194 Sigmoid Node 7 Inputs Weights Threshold -0.3566067488535791 Attrib median home price 0.10329026219384638 Attrib residential tax rate -0.008599086101692995 Attrib enrollment 0.19802880925851699 Attrib cost per pupil 0.23156890837827698 Attrib SAT math -0.1370781964522465 Attrib SAT participation -0.14469579713981875 Attrib dropout rate 0.14287155858014886 Attrib average teacher salary -0.006987647804318273 Attrib classrooms with internet -0.5155257359980489 Attrib students per computer 0.2616193199878401 Attrib MCAS math -0.004039460521691838 Attrib student to teacher ratio -0.00509687753663684 Attrib four year public 0.07441241709561938 Attrib two-year public -0.01129371115505712 Attrib four year private 0.04478129608912071 Attrib two-year private 0.41010253508529615 Attrib other 0.2986528256627126 Attrib work 0.17568588451946335 Attrib military 0.2539743420181896 Attrib SAT verbal -0.10559943233091952 Attrib MCAS English 
-0.2017741840607495 Sigmoid Node 8 Inputs Weights Threshold -0.34165325489145665 Attrib median home price 0.10317253440725992 Attrib residential tax rate 0.018323637705576646 Attrib enrollment 0.1689102682944753 Attrib cost per pupil 0.31431713755064633 Attrib SAT math -0.13547516916073102 Attrib SAT participation -0.23323630224105263 Attrib dropout rate 0.03184525457371614 Attrib average teacher salary 0.14478672420243854 Attrib classrooms with internet -0.6765463761958007 Attrib students per computer 0.09483468383119627 Attrib MCAS math -0.012681267433821421 Attrib student to teacher ratio -0.11684187264270346 Attrib four year public 0.19559653553705053 Attrib two-year public -0.1607957719976626 Attrib four year private -0.06687713408880765 Attrib two-year private 0.40312804095030924 Attrib other 0.4487227906917138 Attrib work 0.018006525432432953 Attrib military 0.2831418167858799 Attrib SAT verbal -0.14717691734303537 Attrib MCAS English -0.33614465069566296 Sigmoid Node 9 Inputs Weights Threshold -1.1235806723021264 Attrib median home price 0.22605426706935328 Attrib residential tax rate 0.026252067338904832 Attrib enrollment 0.032590738585206454 Attrib cost per pupil 0.4023378772747055 Attrib SAT math 1.8819736033267496 Attrib SAT participation 0.6704741735071301 Attrib dropout rate -0.6700214285635534 Attrib average teacher salary 0.40052145927002536 Attrib classrooms with internet -0.2610108834644476 Attrib students per computer -0.24335533764219958 Attrib MCAS math 1.468770294954177 Attrib student to teacher ratio -0.1623924862500453 Attrib four year public -0.14170743185246382 Attrib two-year public 0.12356711399416426 Attrib four year private 1.0437423556714254 Attrib two-year private 0.14875207301385504 Attrib other 0.49673462614505925 Attrib work 0.3642671957329207 Attrib military -0.6314910686716961 Attrib SAT verbal 0.9499315140389681 Attrib MCAS English -0.0987184008753146 Sigmoid Node 10 Inputs Weights Threshold -0.40418796685836106 Attrib median 
home price 0.3405698390928264 Attrib residential tax rate -0.7028437189292482 Attrib enrollment 0.06948470039648492 Attrib cost per pupil 0.12833608522425186 Attrib SAT math 0.09505382695009755 Attrib SAT participation 0.15180627878480898 Attrib dropout rate 0.3815558655280989 Attrib average teacher salary -0.3316160353765328 Attrib classrooms with internet -0.3867919852082672 Attrib students per computer 0.007041211668194398 Attrib MCAS math 0.0652510719586196 Attrib student to teacher ratio 0.5190391594138813 Attrib four year public -0.293899577402547 Attrib two-year public -0.034046356704053836 Attrib four year private 0.4548182080242436 Attrib two-year private 0.14126941910613805 Attrib other 0.32838914161298577 Attrib work 0.2032023062956632 Attrib military -0.09736401563590788 Attrib SAT verbal 0.2497239355375822 Attrib MCAS English 0.03912380974263083 Sigmoid Node 11 Inputs Weights Threshold -0.3932053332535418 Attrib median home price 0.24183849645158534 Attrib residential tax rate 0.6662651161613324 Attrib enrollment 0.3438620577197615 Attrib cost per pupil 0.22037487235458286 Attrib SAT math -0.10686103138175339 Attrib SAT participation 0.2373014029859392 Attrib dropout rate 0.18844524242505936 Attrib average teacher salary -0.03136644236191362 Attrib classrooms with internet -0.1236231410015428 Attrib students per computer -0.18078067511599424 Attrib MCAS math 0.33817856268792973 Attrib student to teacher ratio 0.16933757710699898 Attrib four year public 0.5874629902235733 Attrib two-year public 0.37411586916303186 Attrib four year private -0.044205836299185716 Attrib two-year private 0.1846655669367499 Attrib other 0.1432454788316435 Attrib work -0.3020891615328035 Attrib military -0.17221599887322916 Attrib SAT verbal -0.2167647306197881 Attrib MCAS English -0.34154972142115037

Appendix B: Python Program

#######################################################################
#
# A program by Micah Oppenheim
#
# This program opens the school database file used in my project. It
# allows the user to enter the name of a school and see its rank as
# determined by Boston Magazine, and its predicted rank as calculated
# by a pruned linear regression function created by Weka.
#
#######################################################################

# Imports the SQLite module, connects to the database and creates a cursor.
import sqlite3
db = sqlite3.connect('schools.db')
cursor = db.cursor()

# User inputs the school they want information on. The name of a school
# is a proper noun so capitalize accordingly!
schoolSelect = input("Enter name of school: ")
print()

# Finds the attributes in the database that are needed to calculate the
# predicted rank.
command = ("SELECT school, residential_tax_rate, cost_per_pupil, sat_math, "
           "sat_participation_rate, mcas_math, four_year_public, "
           "two_year_public, other, work, mcas_english, rank "
           "FROM schools WHERE school = ?;")
cursor.execute(command, [schoolSelect])

# Calculates and prints the school, predicted rank and actual rank
# using the information found in the query.
for row in cursor:
    predictedRank = int(3.3348*row[1] + -0.0044*row[2] + -0.2337*row[3]
                        + -21.6186*row[4] + -0.6591*row[5] + 0.6854*row[6]
                        + 0.4835*row[7] + 0.5861*row[8] + 0.8442*row[9]
                        + -0.4095*row[10] + 241.0377)
    print(schoolSelect, "has a predicted rank of", predictedRank,
          "and an actual rank of", row[11])

# The query only reads data, so the connection is simply closed.
db.close()
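The regression arithmetic in the program above can also be separated from the database so that it is easy to check by hand. The sketch below is one possible arrangement, not part of the original program: the names PRUNED_COEFFICIENTS, PRUNED_INTERCEPT, and predict_rank are hypothetical, while the coefficients are the m5-pruned values from Appendix A. Like the original, it truncates the result with int().

```python
# Coefficients of the m5-pruned linear regression from Appendix A,
# keyed by the database column names used in the SQL query.
PRUNED_COEFFICIENTS = {
    "residential_tax_rate": 3.3348,
    "cost_per_pupil": -0.0044,
    "sat_math": -0.2337,
    "sat_participation_rate": -21.6186,
    "mcas_math": -0.6591,
    "four_year_public": 0.6854,
    "two_year_public": 0.4835,
    "other": 0.5861,
    "work": 0.8442,
    "mcas_english": -0.4095,
}
PRUNED_INTERCEPT = 241.0377


def predict_rank(attributes):
    """Return the predicted rank for a dict of attribute values.

    Every key in PRUNED_COEFFICIENTS must be present in `attributes`.
    The result is truncated toward zero with int(), as in the program above.
    """
    total = PRUNED_INTERCEPT
    for name, coefficient in PRUNED_COEFFICIENTS.items():
        total += coefficient * attributes[name]
    return int(total)


if __name__ == "__main__":
    # With every attribute at zero only the intercept remains.
    zeros = {name: 0.0 for name in PRUNED_COEFFICIENTS}
    print(predict_rank(zeros))
```

Keeping the coefficients in a dictionary ties each one to a column name rather than a tuple position, which makes a mismatch between the query and the formula easier to spot.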