Predicting School Ranks Through Data Mining
Micah Oppenheim—Boston University
CS 105—Professor David G. Sullivan, Ph.D
Introduction
Everyone wants to ensure that they are receiving the best education available to them. It
is no secret that some public schools are better than others; however, it is not always clear
what makes one school better than another. This project attempts to answer the following
question: What factors contribute to the ranking of a public school system?
To answer this question, I conduct data mining on a dataset that originally contained
information on 151 Greater Boston area high schools. The dataset contains information on the
schools themselves, economic factors of the towns that the schools are in, the success of the
students, and a numbered rank given by Boston Magazine. I create and analyze several
regression functions that attempt to predict the rank of each school.
Dataset Description
The dataset used in this project was constructed by Boston Magazine and obtained from
the Many Eyes website at
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/2005-public-school-rankings/versions/1.
The dataset I used contains the names of 146 schools, their rank, and 21 other attributes:
ATTRIBUTE                 TYPE     DESCRIPTION
school                    nominal  Name of the school.
median_home_price         numeric  Median price of homes in the town.
residential_tax_rate      numeric  Residential tax rate per $1000 assessed valuation in the town.
enrollment                numeric  Number of students in the school.
cost_per_pupil            numeric  Mean estimated cost of attending the school.
sat_math                  numeric  Average SAT Math scores.
sat_participation_rate    numeric  Percentage of students that took SAT tests.
dropout_rate              numeric  Percentage of student body that did not graduate.
average_teacher_salary    numeric  Average salary of teachers.
classrooms_with_internet  numeric  Percentage of classrooms with internet.
student_per_computer      numeric  Number of students for each computer.
mcas_math                 numeric  Average score of Massachusetts Comprehensive Assessment System Math exam.
student_teacher_ratio     numeric  Number of students for each teacher.
four_year_public          numeric  Percentage of student body that attend a four-year public school after graduation.
two_year_public           numeric  Percentage of student body that attend a two-year public school after graduation.
four_year_private         numeric  Percentage of student body that attend a four-year private school after graduation.
two_year_private          numeric  Percentage of student body that attend a two-year private school after graduation.
other                     numeric  Percentage of student body that neither attend college, join the military, or work after graduation.
work                      numeric  Percentage of student body that work after graduation.
military                  numeric  Percentage of student body that join the military after graduation.
sat_verbal                numeric  Average SAT Verbal scores.
mcas_english              numeric  Average score of Massachusetts Comprehensive Assessment System English exam.
rank                      numeric  Rank of school as determined by Boston Magazine.
Data Preparation
There were several steps I had to take in order to prepare the dataset for data mining. I
manually split the attribute SAT math/verbal into two separate attributes: SAT math and
SAT verbal. I also split the attribute MCAS math/English into two separate attributes:
MCAS math and MCAS English. This was necessary because each of these columns originally
held two scores separated by a “/”. Left as is, this would disrupt the data mining process,
as the data mining program Weka would not be able to interpret the combined values as
numeric test scores.
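The split was done by hand, but the same step could be scripted. The following is a minimal sketch, assuming each combined cell is a string such as "520/510"; the function name split_scores and the sample value are hypothetical and are not taken from the dataset.

```python
def split_scores(cell):
    """Split a combined 'first/second' score cell into two numbers."""
    first, second = cell.split("/")
    return float(first.strip()), float(second.strip())


# Hypothetical combined SAT math/verbal cell, for illustration only.
sat_math, sat_verbal = split_scores("520/510")
print(sat_math, sat_verbal)
```

Each half then becomes its own numeric attribute, which is the form Weka needs for regression.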
I then removed 5 schools that were missing data for certain attributes, leaving 146
schools. This let me build the models on complete records, and removing 5 of the 151
schools was unlikely to be detrimental to the overall process.
Next I formatted the data table so it could be loaded into the program Weka. To do this
I manually removed all commas, dashes, and percent signs from the data. I also removed
the names of the schools, as the rank itself acts as an ID number for each school.
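The character cleanup described above could be sketched as follows. This is only an illustration of the manual step: the function name clean_cell and the sample values are hypothetical, and the sketch assumes no attribute holds negative numbers, since removing dashes would also remove a minus sign.

```python
def clean_cell(cell):
    """Remove commas, dashes, and percent signs from one cell of text."""
    for character in (",", "-", "%"):
        cell = cell.replace(character, "")
    return cell.strip()


# Hypothetical raw values, for illustration only.
print(clean_cell("412,500"))
print(clean_cell("97%"))
```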
Once the data was successfully loaded into Weka, I used Weka to randomize the data and
then split it into training data, used to create the models, and test data, used to test
the models. The data was split 115/31, or roughly 80%/20%.
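The randomize-then-split step was done inside Weka. As a rough sketch of the same idea in Python, assuming a fixed random seed for repeatability (the seed, the function name shuffle_and_split, and the placeholder rows are all assumptions):

```python
import random


def shuffle_and_split(rows, train_size, seed=0):
    """Shuffle a copy of rows and split it into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# 146 placeholder rows stand in for the 146 schools.
rows = list(range(146))
train, test = shuffle_and_split(rows, train_size=115)
print(len(train), len(test))  # 115 31
```

Shuffling before splitting matters here because the rows may be stored in rank order, and an unshuffled split would put only the lowest-ranked schools in the test data.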
Since I also created a Python program that uses a SQL query, I created another version
of the data that could be loaded into a SQL database. To do this I removed the header row
of attribute names and added the names of the schools back into the data.
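One way such a headerless file could be loaded into SQLite is sketched below. The table name schools and the column names come from the query in Appendix B and the attribute list above; the column order of the real file, the function name load_schools, and the sample row are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Column names as used in this report; their order in the real file
# is an assumption.
COLUMNS = [
    "school", "median_home_price", "residential_tax_rate", "enrollment",
    "cost_per_pupil", "sat_math", "sat_participation_rate", "dropout_rate",
    "average_teacher_salary", "classrooms_with_internet",
    "student_per_computer", "mcas_math", "student_teacher_ratio",
    "four_year_public", "two_year_public", "four_year_private",
    "two_year_private", "other", "work", "military", "sat_verbal",
    "mcas_english", "rank",
]


def load_schools(csv_text, db):
    """Create the schools table in db and fill it from headerless CSV text."""
    typed = ['"school" TEXT'] + ['"%s" REAL' % name for name in COLUMNS[1:]]
    db.execute("CREATE TABLE schools (%s)" % ", ".join(typed))
    placeholders = ", ".join("?" for _ in COLUMNS)
    rows = csv.reader(io.StringIO(csv_text))
    db.executemany("INSERT INTO schools VALUES (%s)" % placeholders, rows)
    db.commit()


# One made-up row (not real data) with 23 fields, for illustration only.
sample = "Example High," + ",".join(str(n) for n in range(1, 23)) + "\n"
db = sqlite3.connect(":memory:")
load_schools(sample, db)
print(db.execute("SELECT school, rank FROM schools").fetchall())
```

An in-memory database is used here so the sketch leaves no file behind; the program in Appendix B expects a file named schools.db instead.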
Data Analysis
I performed numeric estimation through several forms of regression analysis to predict
the rank of schools. I used Weka to create regression functions based on 4 different
models:
SimpleLinearRegression: A linear regression function using only one attribute to predict
the output.
LinearRegression (no attribute selection): A linear regression function using all possible
attributes.
LinearRegression (attributes are pruned using the M5 method): A linear regression function
in which attributes are pruned by Weka.
MultilayerPerceptron: A neural network model that is trained with backpropagation. The
model connects layers of nodes, each of which computes a function of its inputs, and
combines them to produce the output.
The complete models created by each form of regression can be found in Appendix A.
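To illustrate the simplest of these models, a one-attribute linear regression can be fit by ordinary least squares. The sketch below is not Weka's implementation; the function name and the data points are made up, and it fits only a single given attribute rather than choosing among attributes.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Made-up (score, rank) pairs, for illustration only.
scores = [95.0, 90.0, 85.0, 80.0]
ranks = [5.0, 15.0, 25.0, 35.0]
slope, intercept = fit_simple_linear_regression(scores, ranks)
print(round(slope, 2), round(intercept, 2))
```

The multi-attribute LinearRegression models extend the same idea to one coefficient per attribute plus a constant term.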
Results
Results of the four algorithms used:
Model                          Correlation      Correlation      Error on        Error on
                               coefficient on   coefficient on   training data   test data
                               training data    test data
SimpleLinearRegression         0.8811           0.9156           44.69%          37.88%
LinearRegression (no pruning)  0.9783           0.9726           19.30%          22.30%
LinearRegression (m5 pruning)  0.9769           0.9677           19.99%          24.49%
MultilayerPerceptron           0.9976           0.9897           6.94%           13.55%
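The report does not state exactly how the error percentage is defined. As a sketch of how such figures can be computed, the code below uses the Pearson correlation between predicted and actual ranks and, as an assumption, a relative absolute error (the model's total absolute error as a percentage of the error made by always predicting the mean). The data are made up.

```python
import math


def correlation(predicted, actual):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


def relative_absolute_error(predicted, actual):
    """Total absolute error as a percentage of the error made by always
    predicting the mean of the actual values (an assumed definition)."""
    mean_a = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for p, a in zip(predicted, actual))
    baseline_error = sum(abs(mean_a - a) for a in actual)
    return 100.0 * model_error / baseline_error


# Made-up ranks, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(round(correlation(predicted, actual), 4))
print(round(relative_absolute_error(predicted, actual), 2))
```

Under this reading, the accuracy figures quoted in the discussion are simply 100% minus the error on the test data.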
SimpleLinearRegression: This regression function chose mcas_math as the sole
attribute to be used. Although this was the least accurate of the four models, it produced
a respectable correlation coefficient of 0.8811 on the training data. This also suggests
that, looking at the attributes independently, MCAS math scores have the most direct
correlation with the perceived quality of a school. Interestingly, this model was
noticeably more accurate on the test data (an error of 37.88%, versus 44.69% on the
training data), which suggests that this correlation holds across the schools with relative
consistency. A visualization of this correlation helps display the connection between the
two attributes:
[Figure: Average MCAS Math Score by Rank. x-axis: school rank; y-axis: MCAS math score.]
The image shows the relatively linear relation between a school’s rank and the average
MCAS math score of its student body. This is not surprising, as it implies that students
who attend better-ranked schools receive better test scores.
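As a worked example, the SimpleLinearRegression function listed in Appendix A (-1.99 * MCAS math + 195.11) can be applied directly. The input score of 90 below is made up for illustration, and the function name is hypothetical.

```python
def predict_rank_from_mcas_math(mcas_math):
    """Apply the SimpleLinearRegression function from Appendix A."""
    return -1.99 * mcas_math + 195.11


# 90 is a made-up average score, for illustration only.
print(round(predict_rank_from_mcas_math(90), 2))  # 16.01
```

Because the coefficient is negative, a higher average MCAS math score yields a numerically lower, and therefore better, predicted rank.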
LinearRegression (no pruning): The linear regression function created using
all the attributes is significantly more accurate than simple linear regression, with
an overall accuracy of 77.70% on the test data (100% minus the 22.30% error). This suggests
that a school’s rank depends on many attributes and cannot be accurately determined by
one, or even a few, attributes. However, examining the coefficients assigned to the
attributes does reveal which attributes seem to have the least effect on rank: median home
price (coefficient of 0), enrollment (coefficient of -0.0015), and average teacher salary
(coefficient of -0.0001) barely influence the output of the function and are therefore
likely to be less important to the rank. Because the coefficients apply to raw attribute
values of very different magnitudes, this comparison is only a rough guide.
LinearRegression (M5 pruning): The linear regression function created with pruning
scores slightly worse than the one without pruning in both correlation coefficient and
accuracy, on both the training and test data. This is intuitive, as it suggests that having
more information about a school allows one to better predict its rank. This reinforces
the notion that a school’s rank depends on numerous attributes, not just a few. Examining
the attributes left in the pruned function may reveal the more important factors in
determining a school’s rank: most of the attributes in this function have to do with test
scores and students’ futures after graduation. The more physical characteristics of the
school, such as student/teacher ratio, classrooms with Internet, and enrollment, are left
out. This makes sense, as it identifies the performance of the students as the foremost
influence on a school’s rank.
MultilayerPerceptron: This model is more elaborate than the others: it feeds the values
of all the attributes into 11 sigmoid nodes, whose outputs are combined by a final linear
node (see Appendix A). This allows the model to be much more accurate than the others,
with an overall accuracy of 86.45% on the test data. This result makes intuitive sense, as
it supports the idea that the more deeply the information about a school is examined, the
more accurately the school’s rank can be determined.
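The structure of the MultilayerPerceptron in Appendix A (sigmoid nodes feeding one linear node) can be sketched as a single forward pass. This is a toy illustration with made-up weights, not the trained model. It assumes each threshold is added to its node's weighted sum, and it ignores any rescaling of inputs or output that Weka may apply.

```python
import math


def sigmoid(value):
    """Logistic function used by the hidden nodes."""
    return 1.0 / (1.0 + math.exp(-value))


def forward(inputs, hidden_nodes, output_node):
    """One forward pass: sigmoid hidden nodes feeding one linear output node.

    Each node is a (threshold, weights) pair. Adding the threshold to the
    weighted sum is an assumption about how to read the Appendix A listing.
    """
    hidden_outputs = []
    for threshold, weights in hidden_nodes:
        total = threshold + sum(w * x for w, x in zip(weights, inputs))
        hidden_outputs.append(sigmoid(total))
    threshold, weights = output_node
    return threshold + sum(w * h for w, h in zip(weights, hidden_outputs))


# A toy network with 2 inputs and 2 hidden nodes; all numbers are made up.
hidden_nodes = [(0.0, [1.0, -1.0]), (0.5, [0.0, 0.0])]
output_node = (0.1, [2.0, 0.0])
print(forward([0.3, 0.3], hidden_nodes, output_node))
```

The model in Appendix A has the same shape with 21 attribute inputs and 11 sigmoid nodes, so each prediction evaluates 11 weighted sums before the final linear combination.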
With this information I created a Python program that allows a user to input the name
of a school and get the school’s predicted rank, as determined by the pruned linear
regression model, and the actual rank given by Boston Magazine. The program can be seen
in Appendix B.
Conclusion
With the relatively high accuracy rates and correlation coefficients obtained through
linear regression, it seems that the attributes in this dataset do directly influence the
rank of a school. However, it is hard to say exactly which attributes of a school give it a
higher ranking. Test scores and the success of students after graduation, as conveyed by
college entry statistics, seem to have the most influence on a school’s rank. This supports
the argument that a school’s quality is based on the performance of the student body, as
that is the best indication that a school is doing its job.
Appendix A
SimpleLinearRegression function:
-1.99 * MCAS math + 195.11
LinearRegression Function (no pruning):
0 * median home price +
3.2171 * residential tax rate +
-0.0015 * enrollment +
-0.0035 * cost per pupil +
-0.2046 * SAT math +
-22.0749 * SAT participation +
0.588 * dropout rate +
-0.0001 * average teacher salary +
0.0496 * classrooms with internet +
0.0217 * students per computer +
-0.6226 * MCAS math +
0.6543 * student to teacher ratio +
1.2052 * four year public +
1.031 * two-year public +
0.5295 * four year private +
0.537 * two-year private +
1.0611 * other +
1.3179 * work +
1.3801 * military +
0.0269 * SAT verbal +
-0.4808 * MCAS English +
151.3666
LinearRegression Function (m5 pruning):
3.3348 * residential tax rate +
-0.0044 * cost per pupil +
-0.2337 * SAT math +
-21.6186 * SAT participation +
-0.6591 * MCAS math +
0.6854 * four year public +
0.4835 * two-year public +
0.5861 * other +
0.8442 * work +
-0.4095 * MCAS English +
241.0377
MultilayerPerceptron Function:
Linear Node 0
Inputs Weights
Threshold 0.013608543390385267
Node 1 -0.7018654675930166
Node 2 0.8377544996547287
Node 3 0.05080787892170032
Node 4 0.6044820299992532
Node 5 -0.08997647801025567
Node 6 0.18270215970649528
Node 7 0.324201319750299
Node 8 0.6150776623869614
Node 9 -0.9556171321828326
Node 10 -0.7389437359732564
Node 11 0.4963316013623846
Sigmoid Node 1
Inputs Weights
Threshold -0.3567897469061703
Attrib median home price 0.174950912414142
Attrib residential tax rate -0.1932557871681284
Attrib enrollment 0.37725504739258997
Attrib cost per pupil 0.40587368647992705
Attrib SAT math -0.210798428824892
Attrib SAT participation -0.3224444948882534
Attrib dropout rate 0.6179652209508388
Attrib average teacher salary -0.46743686705284365
Attrib classrooms with internet -0.1383468728439659
Attrib students per computer 0.1930770003055754
Attrib MCAS math -0.41764160546287415
Attrib student to teacher ratio 0.16861098457603077
Attrib four year public -0.08291590499652372
Attrib two-year public 0.5352051777872039
Attrib four year private 0.08253526979616456
Attrib two-year private 0.4891101125699421
Attrib other 0.09336708055476663
Attrib work 0.12588125555015162
Attrib military 0.43349416260813445
Attrib SAT verbal -0.5033755686155623
Attrib MCAS English -0.19079596640767774
Sigmoid Node 2
Inputs Weights
Threshold -0.5207972266294439
Attrib median home price 0.25821146768527287
Attrib residential tax rate 0.4031875509894643
Attrib enrollment -0.26105557255140266
Attrib cost per pupil -1.4260965171011224
Attrib SAT math -0.7130861415786883
Attrib SAT participation -0.35151570920661374
Attrib dropout rate 0.7468319080478203
Attrib average teacher salary -0.7314950102156266
Attrib classrooms with internet 0.6845526965445545
Attrib students per computer -0.09053897293236066
Attrib MCAS math -0.277816939834462
Attrib student to teacher ratio 1.1152688981379313
Attrib four year public 0.7133771376861674
Attrib two-year public -0.0023339699657199025
Attrib four year private -0.15641889645710239
Attrib two-year private -0.11129947932646851
Attrib other 0.278064101751817
Attrib work 0.5113972394392824
Attrib military -0.35811310950941244
Attrib SAT verbal 0.05566470749684916
Attrib MCAS English -0.8432412500028468
Sigmoid Node 3
Inputs Weights
Threshold -0.3458254870540105
Attrib median home price 0.09601057737055244
Attrib residential tax rate -0.0544514773552502
Attrib enrollment 0.18500856142680458
Attrib cost per pupil 0.16010892053583115
Attrib SAT math -0.1292231456043694
Attrib SAT participation -0.11200983538390738
Attrib dropout rate 0.23943777652480003
Attrib average teacher salary -0.08451340006347438
Attrib classrooms with internet -0.35490416530066715
Attrib students per computer 0.3007880100531612
Attrib MCAS math -0.14414222601868165
Attrib student to teacher ratio 0.18291646206003273
Attrib four year public 0.10612807446857082
Attrib two-year public 0.07860376579517057
Attrib four year private 0.10565899167130793
Attrib two-year private 0.287428261571298
Attrib other 0.2696629440919717
Attrib work 0.1964208254666402
Attrib military 0.21555985172048864
Attrib SAT verbal -0.11328401408414435
Attrib MCAS English -0.10124937631370062
Sigmoid Node 4
Inputs Weights
Threshold -0.3090453671223993
Attrib median home price 0.025880255501360486
Attrib residential tax rate 0.152764093911713
Attrib enrollment 0.25421414981664897
Attrib cost per pupil 0.3231117833844099
Attrib SAT math -0.16035632891630053
Attrib SAT participation -0.20454542519258434
Attrib dropout rate -0.018186931964321872
Attrib average teacher salary 0.07732224712626272
Attrib classrooms with internet -0.73693164711512
Attrib students per computer 0.09701699517714785
Attrib MCAS math -0.03961820597924514
Attrib student to teacher ratio -0.09660501624154445
Attrib four year public 0.3449839796560578
Attrib two-year public -0.14842734685247372
Attrib four year private -0.08822891333041889
Attrib two-year private 0.42392666138561386
Attrib other 0.3800811090502082
Attrib work 0.12062171885397874
Attrib military 0.10645103968735717
Attrib SAT verbal -0.07739187781620664
Attrib MCAS English -0.3545241261886391
Sigmoid Node 5
Inputs Weights
Threshold -0.38121280503557375
Attrib median home price 0.1275193805434692
Attrib residential tax rate -0.02058191349778571
Attrib enrollment 0.1285542203918258
Attrib cost per pupil 0.22693172584472165
Attrib SAT math -0.07549746798329492
Attrib SAT participation -0.06118224488370498
Attrib dropout rate 0.2706023781836587
Attrib average teacher salary -0.1819557811510326
Attrib classrooms with internet -0.2476072951760343
Attrib students per computer 0.28274969305634007
Attrib MCAS math -0.1201465762648191
Attrib student to teacher ratio 0.2649075771129865
Attrib four year public 0.05007444680065536
Attrib two-year public 0.12669164223100682
Attrib four year private 0.14316362657367424
Attrib two-year private 0.23363998242505601
Attrib other 0.2596434340860038
Attrib work 0.12363002925868323
Attrib military 0.18162037263028563
Attrib SAT verbal -0.1616977662116066
Attrib MCAS English -0.12742114131973759
Sigmoid Node 6
Inputs Weights
Threshold -0.34269614825290634
Attrib median home price 0.2403923821885783
Attrib residential tax rate 0.4961627461261628
Attrib enrollment 0.13411605190224818
Attrib cost per pupil -0.07352750976547436
Attrib SAT math -0.30893932179575023
Attrib SAT participation -0.16703637506791122
Attrib dropout rate 0.2507020138815577
Attrib average teacher salary -0.18393782824093172
Attrib classrooms with internet -0.15930006086524007
Attrib students per computer 0.011250960961260706
Attrib MCAS math -0.027358293288930708
Attrib student to teacher ratio 0.36100767570039544
Attrib four year public 0.4752998409758364
Attrib two-year public 0.07220637569623042
Attrib four year private -0.1300719794522889
Attrib two-year private 0.19385720021178823
Attrib other 0.23188529897386306
Attrib work 0.08689055519938249
Attrib military -0.09507713352419005
Attrib SAT verbal -0.10493504099440695
Attrib MCAS English -0.4771481783777194
Sigmoid Node 7
Inputs Weights
Threshold -0.3566067488535791
Attrib median home price 0.10329026219384638
Attrib residential tax rate -0.008599086101692995
Attrib enrollment 0.19802880925851699
Attrib cost per pupil 0.23156890837827698
Attrib SAT math -0.1370781964522465
Attrib SAT participation -0.14469579713981875
Attrib dropout rate 0.14287155858014886
Attrib average teacher salary -0.006987647804318273
Attrib classrooms with internet -0.5155257359980489
Attrib students per computer 0.2616193199878401
Attrib MCAS math -0.004039460521691838
Attrib student to teacher ratio -0.00509687753663684
Attrib four year public 0.07441241709561938
Attrib two-year public -0.01129371115505712
Attrib four year private 0.04478129608912071
Attrib two-year private 0.41010253508529615
Attrib other 0.2986528256627126
Attrib work 0.17568588451946335
Attrib military 0.2539743420181896
Attrib SAT verbal -0.10559943233091952
Attrib MCAS English -0.2017741840607495
Sigmoid Node 8
Inputs Weights
Threshold -0.34165325489145665
Attrib median home price 0.10317253440725992
Attrib residential tax rate 0.018323637705576646
Attrib enrollment 0.1689102682944753
Attrib cost per pupil 0.31431713755064633
Attrib SAT math -0.13547516916073102
Attrib SAT participation -0.23323630224105263
Attrib dropout rate 0.03184525457371614
Attrib average teacher salary 0.14478672420243854
Attrib classrooms with internet -0.6765463761958007
Attrib students per computer 0.09483468383119627
Attrib MCAS math -0.012681267433821421
Attrib student to teacher ratio -0.11684187264270346
Attrib four year public 0.19559653553705053
Attrib two-year public -0.1607957719976626
Attrib four year private -0.06687713408880765
Attrib two-year private 0.40312804095030924
Attrib other 0.4487227906917138
Attrib work 0.018006525432432953
Attrib military 0.2831418167858799
Attrib SAT verbal -0.14717691734303537
Attrib MCAS English -0.33614465069566296
Sigmoid Node 9
Inputs Weights
Threshold -1.1235806723021264
Attrib median home price 0.22605426706935328
Attrib residential tax rate 0.026252067338904832
Attrib enrollment 0.032590738585206454
Attrib cost per pupil 0.4023378772747055
Attrib SAT math 1.8819736033267496
Attrib SAT participation 0.6704741735071301
Attrib dropout rate -0.6700214285635534
Attrib average teacher salary 0.40052145927002536
Attrib classrooms with internet -0.2610108834644476
Attrib students per computer -0.24335533764219958
Attrib MCAS math 1.468770294954177
Attrib student to teacher ratio -0.1623924862500453
Attrib four year public -0.14170743185246382
Attrib two-year public 0.12356711399416426
Attrib four year private 1.0437423556714254
Attrib two-year private 0.14875207301385504
Attrib other 0.49673462614505925
Attrib work 0.3642671957329207
Attrib military -0.6314910686716961
Attrib SAT verbal 0.9499315140389681
Attrib MCAS English -0.0987184008753146
Sigmoid Node 10
Inputs Weights
Threshold -0.40418796685836106
Attrib median home price 0.3405698390928264
Attrib residential tax rate -0.7028437189292482
Attrib enrollment 0.06948470039648492
Attrib cost per pupil 0.12833608522425186
Attrib SAT math 0.09505382695009755
Attrib SAT participation 0.15180627878480898
Attrib dropout rate 0.3815558655280989
Attrib average teacher salary -0.3316160353765328
Attrib classrooms with internet -0.3867919852082672
Attrib students per computer 0.007041211668194398
Attrib MCAS math 0.0652510719586196
Attrib student to teacher ratio 0.5190391594138813
Attrib four year public -0.293899577402547
Attrib two-year public -0.034046356704053836
Attrib four year private 0.4548182080242436
Attrib two-year private 0.14126941910613805
Attrib other 0.32838914161298577
Attrib work 0.2032023062956632
Attrib military -0.09736401563590788
Attrib SAT verbal 0.2497239355375822
Attrib MCAS English 0.03912380974263083
Sigmoid Node 11
Inputs Weights
Threshold -0.3932053332535418
Attrib median home price 0.24183849645158534
Attrib residential tax rate 0.6662651161613324
Attrib enrollment 0.3438620577197615
Attrib cost per pupil 0.22037487235458286
Attrib SAT math -0.10686103138175339
Attrib SAT participation 0.2373014029859392
Attrib dropout rate 0.18844524242505936
Attrib average teacher salary -0.03136644236191362
Attrib classrooms with internet -0.1236231410015428
Attrib students per computer -0.18078067511599424
Attrib MCAS math 0.33817856268792973
Attrib student to teacher ratio 0.16933757710699898
Attrib four year public 0.5874629902235733
Attrib two-year public 0.37411586916303186
Attrib four year private -0.044205836299185716
Attrib two-year private 0.1846655669367499
Attrib other 0.1432454788316435
Attrib work -0.3020891615328035
Attrib military -0.17221599887322916
Attrib SAT verbal -0.2167647306197881
Attrib MCAS English -0.34154972142115037
Appendix B—Python Program
#######################################################################
#
# A program by Micah Oppenheim
#
# This program opens the school database file used in my project. It
# allows the user to enter the name of a school and see its rank as
# determined by Boston Magazine, and its predicted rank as calculated
# by a pruned linear regression function created by Weka.
#
#######################################################################

# Imports SQL, connects to the database and creates a cursor.
import sqlite3

db = sqlite3.connect('schools.db')
cursor = db.cursor()

# User inputs the school they want information on. The name of a school
# is a proper noun so capitalize accordingly!
schoolSelect = input("Enter name of school: ")
print()

# Finds the attributes in the database that are needed to calculate the
# predicted rank.
command = ("SELECT school, residential_tax_rate, cost_per_pupil, "
           "sat_math, sat_participation_rate, mcas_math, four_year_public, "
           "two_year_public, other, work, mcas_english, rank FROM schools "
           "WHERE school = ?;")
cursor.execute(command, [schoolSelect])

# Calculates and prints the school, predicted rank and actual rank
# using the information found in the query. Each row holds the columns
# in the order they appear in the SELECT statement.
for row in cursor:
    predictedRank = int(3.3348*row[1] + -0.0044*row[2] +
                        -0.2337*row[3] + -21.6186*row[4] + -0.6591*row[5] +
                        0.6854*row[6] + 0.4835*row[7] + 0.5861*row[8] +
                        0.8442*row[9] + -0.4095*row[10] + 241.0377)
    print(schoolSelect, "has a predicted rank of", predictedRank,
          "and an actual rank of", row[11])

# Closes the database connection.
db.close()