College Tuition: Data mining and analysis CS105

By Jeanette Chu & Khiem Tran
4/28/2010
Introduction
College tuition increases steadily every year. According to the college pricing trends report
released by the College Board, the average increase in tuition and fees at four-year colleges is around 6
percent. Roughly 15 percent of students who attend four-year colleges have experienced a tuition
increase of 15 percent or more. How do colleges determine their financial worth? We hypothesized
that several factors contribute to this large figure, which numerous students and parents cover with
loans that they pledge 20 years of their lives to paying off. The main contributors include the following
attributes:
Attribute | Reasoning
Size of the Student Body | The students supply the revenue that keeps the college running
Size of Faculty and Staff | The college needs to pay the salaries; a larger budget is needed for a larger staff
Location of the school | The college needs to account for cost of maintenance
Dataset Description
The entire table was compiled by using data from the U.S. News and World Report College Rankings of
National Colleges and the complete University Guide found towards the back of the survey. The
university guide included the setting, room and board expenses, and financial aid packages, all of which
we believed would affect the tuition costs.
Attribute | Type | Description | Hypothesis
Rank | Numeric | Based on the Survey’s overall college scoring | Higher-ranked schools may charge more for the prestige and reputation
Average Freshman Retention | Numeric | % of the freshmen that continue at that college | A higher retention rate would mean more students to pay the tuition. This might lower the tuition per student.
Percent of Class Sizes under 20 | Numeric | % of classes with less than 20 students | Fewer students in the classroom might mean higher tuition since this may increase the student-teacher ratio
Percent of Class Sizes over 50 | Numeric | % of classes with more than 50 students | More students may decrease tuition costs since there are more students paying
Percent of Full-time Faculty | Numeric | % of faculty that work full time | Tuition may be higher if the percentage was higher so that the college can pay salaries
Freshmen in top 10% of HS | Numeric | % of freshmen that graduated in the top 10% of their high schools | Recruiting more of the smarter students may lead to an increase in tuition for an overall perception of a better school
College Acceptance Rate | Numeric | % of applicants who are accepted into the college | A higher acceptance rate may lead to an increase in tuition (supply increases, demand increases)
Alumni Contribution Rate | Numeric | % of alumni contributing to the college | A greater contribution from alumni may lead to a decrease in tuition
Room & Board | Numeric | Room and board expenses at the college | A greater cost of room and board may lead to an increase in tuition (cost of living)?
Percent with Financial Need | Numeric | % of students with determined financial assistance | If more students are determined to need financial assistance, tuition may increase
Average Aid Package | Numeric | Average financial aid package awarded to students | A larger aid package may lead to higher tuition costs; raking in more revenue from students
Setting | Text | Location of the college: rural, suburban, urban | The cost of living varies; the cost of land, building, maintenance may be higher in urban areas compared to rural areas
Size of Full-time Students | Numeric | Size of the full-time student population | More full-time students may lead to a decrease in tuition since more revenue is coming in from current students
Data preparation
Formatting the data
After eliminating the attributes listed below, we briefly swept through the data and corrected minor
issues such as stray apostrophes and spelling mistakes. We formatted the data to exclude symbols such as
percent and dollar signs, and eliminated unique identifiers, in this case the individual college names.
We chose to run numeric estimation, which requires numeric inputs to deliver numeric outputs.
Therefore, to use Setting, we had to encode it as numbers: 3=urban, 2=suburban, 1=rural.
Before running the tests, we randomized our main data (133 instances) and split it into 1/3 for the test
data and 2/3 for the training data. We eliminated the following attributes: Overall Score, Peer Score,
Predicted Retention Rate, Actual Retention Rate. We felt that these four did not have a direct
relationship with the attribute Tuition. We also wanted to use attributes that were more accessible and
readily available to students and users.
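The Setting encoding and the randomized 2/3 to 1/3 split could be sketched as follows. This is only an illustration under assumptions: the report does not say which tool performed the shuffling, and the names `encode_setting` and `split_train_test`, the fixed seed, and the placeholder rows are not from the original project.

```python
import random

# Numeric codes for the Setting attribute, as given in the text.
SETTING_CODES = {"rural": 1, "suburban": 2, "urban": 3}


def encode_setting(setting):
    """Map a setting label to its numeric code (case-insensitive)."""
    return SETTING_CODES[setting.strip().lower()]


def split_train_test(rows, test_fraction=1.0 / 3.0, seed=0):
    """Shuffle a copy of rows and split it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows standing in for the 133 instances of the dataset.
rows = list(range(133))
train, test = split_train_test(rows)
print(len(train), len(test))  # 89 training rows and 44 test rows
```

With 133 instances, one third rounds to 44 test rows, leaving 89 for training.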
Creating the Database
createCollegeDataDB.py: This program uses SQL within Python to create the college data database,
which contains a single table of attributes. It also parses the comma-separated values (CSV) file
to insert the values into the correct columns. Our tuition calculator pulls its data from this database,
so it contains only the attributes relevant to the main model that we chose. (See Appendix
A for full code.)
1. Connected to the database (essentially creating the database file)
2. Created a handler to execute queries
   a. Executed the following query to create the table:
      CREATE TABLE collegeData
      (
          CollegeName text,
          Rank int,
          Classes20 numeric,
          AlumniRate numeric,
          Tuition numeric,
          Board numeric,
          AvgAid numeric,
          SizeFT numeric
      )
3. Connected to the CSV file containing the data
4. Read in each row of the data, splitting the string by the commas
   a. Executed the following query in Python to insert the values:
      sql = """INSERT INTO collegeData
               VALUES (?, ?, ?, ?, ?, ?, ?, ?)"""
      for record in readCSV:
          # drop the trailing newline, then split the row by commas
          record = record.strip().split(',')
          parameters = (record[0], record[1], record[2],
                        record[3], record[4], record[5], record[6], record[7])
          cursor.execute(sql, parameters)
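The steps above can also be sketched with Python's `csv` module, which copes with quoted fields that contain commas (a plain split on commas would break a name such as "Sample College, Main Campus"). This is an alternative sketch, not the program in Appendix A: the function name `load_college_data`, the in-memory database, and the two sample rows are assumptions made up for illustration.

```python
import csv
import io
import sqlite3

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS collegeData (
    CollegeName text, Rank int, Classes20 numeric, AlumniRate numeric,
    Tuition numeric, Board numeric, AvgAid numeric, SizeFT numeric
)"""
INSERT_SQL = "INSERT INTO collegeData VALUES (?, ?, ?, ?, ?, ?, ?, ?)"


def load_college_data(conn, csv_file):
    """Create the table if needed and insert every 8-column CSV row."""
    conn.execute(CREATE_SQL)
    rows = [row for row in csv.reader(csv_file) if len(row) == 8]
    conn.executemany(INSERT_SQL, rows)
    conn.commit()
    return len(rows)


# Made-up sample rows, used only to exercise the function.
sample = io.StringIO(
    "Example University,10,0.65,0.30,38000,11000,29000,6000\n"
    '"Sample College, Main Campus",75,0.40,0.12,24000,9000,15000,21000\n'
)
conn = sqlite3.connect(":memory:")  # in-memory database for the demo
inserted = load_college_data(conn, sample)
print(inserted)
```

Passing all rows to `executemany` in one call keeps the insert inside a single transaction, which is usually faster than committing after every row.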
tuitionCalculator.py: This program projects tuition costs for any college, given the data
inputs used in the model. The calculator is based on the Linear Regression model. (See
Appendix B for full code.)
1. Connects to the database
2. Prompts the user to enter a college name
   a. If the college appears in the table, the tuition is automatically calculated using the model
   b. If the college does not, the user is prompted to enter the attributes, and these are added to the
      table
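The lookup-or-prompt flow in steps 2a and 2b might be organized as below. This is a hedged sketch rather than the code in Appendix B: `find_college`, `add_college`, and `lookup_or_add` are hypothetical names, the prompt is replaced by a stand-in function, and `placeholder_model` is a dummy that does not implement the regression equation.

```python
import sqlite3


def find_college(conn, name):
    """Return the six stored model attributes, or None if the college is absent."""
    return conn.execute(
        "SELECT Rank, Classes20, AlumniRate, Board, AvgAid, SizeFT "
        "FROM collegeData WHERE CollegeName = ?", (name,)
    ).fetchone()


def add_college(conn, name, attrs, tuition):
    """Store user-entered attributes together with the predicted tuition."""
    rank, classes20, alumni_rate, board, avg_aid, size_ft = attrs
    conn.execute(
        "INSERT INTO collegeData VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (name, rank, classes20, alumni_rate, tuition, board, avg_aid, size_ft),
    )
    conn.commit()


def lookup_or_add(conn, name, ask_user, predict):
    """Use stored attributes (step 2a) or ask for and store new ones (step 2b)."""
    attrs = find_college(conn, name)
    if attrs is None:
        attrs = ask_user()
        add_college(conn, name, attrs, predict(attrs))
    return predict(attrs)


# Demo with an empty in-memory table and stand-ins for the prompt and model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collegeData (CollegeName text, Rank int, Classes20 numeric, "
    "AlumniRate numeric, Tuition numeric, Board numeric, AvgAid numeric, "
    "SizeFT numeric)"
)
prompt_calls = []


def fake_prompt():
    """Stand-in for interactive input; the values are made up."""
    prompt_calls.append(1)
    return (10, 0.65, 0.30, 11000, 29000, 6000)


def placeholder_model(attrs):
    """Dummy model (NOT the regression equation): echoes the board cost."""
    return float(attrs[3])


first = lookup_or_add(conn, "Example University", fake_prompt, placeholder_model)
second = lookup_or_add(conn, "Example University", fake_prompt, placeholder_model)
print(first, second, len(prompt_calls))
```

The second call finds the college in the table, so the user is prompted only once.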
Data Analysis
The goal of this project is to understand the weights and relationships of the variables used by U.S.
News in determining the ranking of the Top National Universities, and their effect on tuition. Using data
from the U.S. News college ranking, we ran five different numeric models and derived the equations and
regression tree that estimate tuition based on the given attributes.
We utilized ten-fold cross-validation on all data runs. The following table gives the results of the five
models in Weka:
Approaches | Test Data Correlation Coefficient | Training Data Correlation Coefficient
Linear Regression | 0.8140 | 0.8789
M5P | 0.8783 | 0.8799
LeastMedSqd | 0.7723 | 0.8725
DecisionTable | 0.7115 | 0.6566
SMOreg | 0.8556 | 0.8878
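To make the table's metric concrete, the sketch below computes a Pearson correlation coefficient between actual and predicted values and builds ten-fold train/test index sets. It is an illustration only: Weka performs these steps internally (and shuffles the data before assigning folds), while `pearson_correlation`, `k_fold_indices`, the simple interleaved fold assignment, and the sample tuition values are assumptions.

```python
import math


def pearson_correlation(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of k interleaved folds."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


# Made-up tuition values, used only to show the call.
actual = [20000.0, 25000.0, 30000.0, 35000.0]
predicted = [21000.0, 24000.0, 31000.0, 36000.0]
print(round(pearson_correlation(actual, predicted), 4))

folds = list(k_fold_indices(133, k=10))
print(len(folds), [len(test) for _, test in folds])
```

A coefficient near 1 means the predictions rise and fall with the actual tuition; it does not by itself measure how large the errors are.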
The following models were used for analysis (Weka documentation):
Linear Regression Model: “Class for using linear regression for prediction. Uses the Akaike criterion for
model selection, and is able to deal with weighted instances.” This model gives a strong correlation
(0.8140 on the test data, 0.8789 on the training data) and yields a simple regression equation that is
much more useful in practice. We selected this approach for further analysis.
M5 Pruned Model Tree: “Split the parameter space into areas (subspaces) and build in each of them a
linear regression model.” This model follows a decision tree approach, but uses linear regression. In this
case, the tree breaks down into many levels of nodes. Even though it yields good correlation, this model
is less practical than the Linear Regression.
Least Median Squared: “Implements a least median squared linear regression utilizing the existing Weka
LinearRegression class to form predictions. The basis of the algorithm is Robust regression and outlier
detection.”
DecisionTable: “Class for building and using a simple decision table majority classifier.” The result came
up with 10 rules for the data. There is a loss of accuracy in the training data. We believe this model is too
simple for our data and its many attributes.
SMOreg: “Sequential minimal optimization algorithm for training a support vector regression using
polynomial or RBF kernels. This implementation globally replaces all missing values and transforms
nominal attributes into binary ones.” This model has the best correlation on the training data (0.8878).
However, the algorithm normalizes all attributes.
Results
We selected the Linear Regression Model as our main model due to its correlation, simplicity, and
practicality. The following results and analysis show the relationships between the attributes
and their effect on Tuition. The regression equation offers several insights into the data.
Tuition = | Effect on Tuition | Analysis
-89.1861 * Rank + | Negative | Better rankings (lower number) make tuition more expensive.
-8758.1325 * %ClassesUnder20 + | Negative | Schools with smaller classrooms tend to be cheaper.
-16707.8015 * AlumniGivingRate + | Negative | Schools where alumni donate a lot of money tend to be cheaper.
1.0977 * BoardCost + | Positive | Higher boarding cost means higher tuition cost as well.
0.3953 * AvgFinAid + | Positive | Higher financial aid package means tuition will be higher as well.
-0.2794 * SizeFullTimeStudents + 27317.312 | Negative | More students decrease tuition.
From this table, we can see the weights of these 6 attributes in determining tuition. They give
insights about colleges that one might not otherwise consider. Better-ranked schools have higher tuition,
but they also give higher financial aid. The linear regression shows us “hidden” data.
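As a worked example, the regression equation can be written as a small function. The coefficients are copied from the equation above; the function name `predict_tuition`, the keyword names, and the sample inputs are assumptions made for illustration, and the two rates are assumed to be entered as fractions between 0 and 1, which is how the calculator in Appendix B appears to treat them.

```python
# Coefficients copied from the regression equation in this section.
INTERCEPT = 27317.312
COEFFICIENTS = {
    "rank": -89.1861,
    "classes_under_20": -8758.1325,     # fraction of classes under 20 (assumed 0..1)
    "alumni_giving_rate": -16707.8015,  # fraction of alumni giving (assumed 0..1)
    "board_cost": 1.0977,
    "avg_fin_aid": 0.3953,
    "size_full_time": -0.2794,
}


def predict_tuition(**attributes):
    """Apply the linear model: intercept plus the weighted sum of attributes."""
    return INTERCEPT + sum(
        weight * attributes[name] for name, weight in COEFFICIENTS.items()
    )


# Hypothetical college, not taken from the dataset.
example = predict_tuition(
    rank=10,
    classes_under_20=0.65,
    alumni_giving_rate=0.30,
    board_cost=11000,
    avg_fin_aid=29000,
    size_full_time=6000,
)
print("%.2f" % example)  # 37582.32
```

Term by term, the hypothetical inputs contribute -891.86, -5692.79, -5012.34, +12074.70, +11463.70, and -1676.40 on top of the 27317.31 intercept.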
Graphic 1: Tuition (color) vs. Rank (size) Treemap: Better ranking, higher tuition for colleges
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/rank-vs-tuition-treemap
The smaller boxes indicate better rankings, and darker colors indicate more expensive schools.
In the Urban segment, the concentration of better-ranked colleges is noticeably more expensive.
When a college improves its ranking, it has more leverage in the market to increase its tuition. When its
tuition increases, the college can increase its financial aid as well, and both increases make the
school look better and more prestigious.
Graphic 2: Tuition vs. Rank on the Size of the school
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/tuition-vs-rank-on-size-of-school
This is a different representation of the treemap above. In addition, a larger dot indicates a
larger school. From this depiction, there seems to be no large school with tuition above
$30,000. This is probably due to the differences between public and private schools in terms of ranking.
State schools tend to be cheaper and larger, and they dominate the $20-30 thousand range.
Graphic 3: Average Financial Aid (color) vs. Acceptance Rate (size)
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/average-financial-aid-vs-acceptanc
Larger boxes indicate a higher acceptance rate, and darker boxes indicate better financial assistance.
The graphic makes it apparent that there is a correlation between the two attributes: more
selective schools give out better financial aid packages. In turn, more selective schools are likely
to be better ranked, with higher tuition. The regression model also tells us that higher tuition goes with
higher financial aid.
Together, the regression model and the graphics above tie together the relationships among the
attributes and how a college sets its tuition.
Conclusion
From our analysis, we learned that better-ranked and more selective schools are not simply
more expensive. Many attributes factor into the cost of tuition. Our analysis also shows
the natural relationship among the costs of attending college.
Tuition Cost + Board = Total Cost
Total Cost – Financial Aid = Net Cost
Here, we can see that schools with a high Total Cost compensate by offering better financial
assistance. We would advise high school seniors not to be afraid to apply to better-ranked schools that
look expensive and prestigious but may turn out to have a lower Net Cost.
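The two cost equations can be checked with a few lines of arithmetic. The figures below are hypothetical and chosen only to show how a school with a higher Total Cost can end up with a lower Net Cost; they are not drawn from the dataset, and the function names are assumptions.

```python
def total_cost(tuition, board):
    """Tuition Cost + Board = Total Cost."""
    return tuition + board


def net_cost(tuition, board, financial_aid):
    """Total Cost - Financial Aid = Net Cost."""
    return total_cost(tuition, board) - financial_aid


# Hypothetical schools: (tuition, board, average financial aid).
schools = {
    "Better-ranked school": (38000, 11000, 29000),
    "Lower-ranked school": (24000, 9000, 10000),
}
for name, (tuition, board, aid) in schools.items():
    print(name, total_cost(tuition, board), net_cost(tuition, board, aid))
```

In this made-up case the better-ranked school has a Total Cost of 49,000 against 33,000, yet its Net Cost is 20,000 against 23,000.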
We also conclude that colleges operate on economies of scale. Schools with greater economies of
scale can reduce their average cost of operations and allocate the surplus to financial aid for students.
Extensions
To increase the accuracy of our model, we acknowledge that we could have further categorized our data
by private and public 4-year colleges, and private and public 2-year colleges. Public colleges have a
tendency to charge less for overall tuition and this factor may change our variables.
Possible Extensions to Python modules:
1. We could make this more customizable by calculating the tuition based on other criteria
   a. E.g., user wants the predicted tuition for colleges with a specific class size
   b. This calculator could then return all colleges and predicted tuition costs that fall within the
      specified class size
2. We could turn this into a web page calculator
   a. Command Python to interpret part of the text as HTML
   b. Use forms – either drop down to select a specific college, or input search criteria
   c. Select the necessary attribute information based on the user input
   d. Code in the model to calculate the tuition cost
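The first extension, listing colleges that meet a class-size criterion along with their predicted tuition, might look like the sketch below. It assumes the collegeData table layout from Appendix A and that Classes20 is stored as a fraction; the function names, the threshold parameter, and the two sample rows are hypothetical.

```python
import sqlite3


def predict_tuition(rank, classes20, alumni_rate, board, avg_aid, size_ft):
    """Regression equation from the Results section."""
    return (-89.1861 * rank - 8758.1325 * classes20 - 16707.8015 * alumni_rate
            + 1.0977 * board + 0.3953 * avg_aid - 0.2794 * size_ft + 27317.312)


def colleges_by_class_size(conn, min_share_under_20):
    """Return (name, predicted tuition) for colleges meeting the class-size cut."""
    query = (
        "SELECT CollegeName, Rank, Classes20, AlumniRate, Board, AvgAid, SizeFT "
        "FROM collegeData WHERE Classes20 >= ? ORDER BY CollegeName"
    )
    return [(row[0], predict_tuition(*row[1:]))
            for row in conn.execute(query, (min_share_under_20,))]


# Demo on an in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collegeData (CollegeName text, Rank int, Classes20 numeric, "
    "AlumniRate numeric, Tuition numeric, Board numeric, AvgAid numeric, "
    "SizeFT numeric)"
)
conn.executemany(
    "INSERT INTO collegeData VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Example University", 10, 0.65, 0.30, 38000, 11000, 29000, 6000),
        ("Sample College", 75, 0.40, 0.12, 24000, 9000, 15000, 21000),
    ],
)
matches = colleges_by_class_size(conn, 0.5)
for name, tuition in matches:
    print("%s: %.2f" % (name, tuition))
```

The same query function could back a web form later, since it takes the search criterion as a parameter instead of prompting for it.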
Appendices
Appendix A: createCollegeDataDB.py
# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: createCollegeDataDB.py
# description: create the college data database and insert all data into the table
#

# import necessary modules
import sqlite3 as db

# create/connect to the database
filename = "collegeData.db"
conn = db.connect(filename)
cursor = conn.cursor()

# create the table (skipped if it already exists)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS collegeData
    (
        CollegeName text,
        Rank int,
        Classes20 numeric,
        AlumniRate numeric,
        Tuition numeric,
        Board numeric,
        AvgAid numeric,
        SizeFT numeric
    );
""")

# commit changes
conn.commit()

# open file to read in
readCSV = open('CollegeTuitionDataset-forPythonCalculator.csv', 'r')
sql = """INSERT INTO collegeData
         VALUES (?, ?, ?, ?, ?, ?, ?, ?)"""
for record in readCSV:
    # drop the trailing newline, then split the row by commas
    record = record.strip().split(",")
    # insert the values
    parameters = (record[0], record[1], record[2], record[3], record[4],
                  record[5], record[6], record[7])
    cursor.execute(sql, parameters)
readCSV.close()

# commit changes and close cursor
conn.commit()
cursor.close()
Appendix B: tuitionCalculator.py
# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: tuitionCalculator.py
# description: calculates the tuition based on selected model
#
# import necessary modules
import sqlite3 as db
import math
###############################################################################
# to format dollars
def dollarFormat(total):
    # Converting number to string, cut out cents to work with dollars only
    total = float(total)
    total = "%.2f" % total
    cents = total[-3:]
    total = total[0:-3]
    # Format with dollar sign and commas
    if len(total) <= 3:
        numForm = "$" + total + cents
    else:
        n = len(total)
        numForm = ""
        while n > 3:
            last = total[-3:]
            numForm = "," + last + numForm
            n -= 3
            total = total[0:-3]
        numForm = "$" + total + numForm + cents
    return numForm
###############################################################################
# to format numbers
def numFormat(number):
    # Converting number to a whole-number string
    number = str(int(float(number)))
    # Format with commas (numbers of three digits or fewer are returned as-is)
    numForm = ""
    while len(number) > 3:
        last = number[-3:]
        numForm = "," + last + numForm
        number = number[0:-3]
    numForm = number + numForm
    return numForm
###############################################################################
# to format percent
def percentFormat(decimal):
    decimal = decimal * 100
    decimal = "%.2f" % decimal
    decimal = str(decimal) + "%"
    return decimal
###############################################################################
def main():
    # connect to the database and create handler
    filename = "collegeData.db"
    conn = db.connect(filename)
    cursor = conn.cursor()

    # prompt for college name
    collegeName = input("Enter name of college: ")

    # check for results from the table
    sql = "SELECT * FROM collegeData WHERE CollegeName=?"
    cursor.execute(sql, [collegeName])
    count = 0
    for row in cursor:
        rank = row[1]
        classes20 = row[2]
        alumniRate = row[3]
        board = row[5]
        avgAid = row[6]
        sizeFT = row[7]
        count = count + 1

    if count == 0:
        # prompt user for college info (rates are entered as decimals, e.g. 0.45)
        rank = int(input("Enter the school rank: "))
        classes20 = float(input("Enter % of class sizes under 20 as a decimal (0 if N/A): "))
        alumniRate = float(input("Enter alumni contribution rate as a decimal: "))
        board = float(input("Enter room & board expenses: "))
        avgAid = float(input("Enter average financial aid package: "))
        sizeFT = int(input("Enter size of full-time students: "))

    # apply model equation (parentheses keep the expression on one statement)
    calcTuition = (-89.1861 * rank + -8758.1325 * classes20 + -16707.8015 * alumniRate
                   + 1.0977 * board + 0.3953 * avgAid + -0.2794 * float(sizeFT) + 27317.312)

    if count == 0:
        # update the table to include entered data
        sql = "INSERT INTO collegeData VALUES (?,?,?,?,?,?,?,?)"
        parameters = (collegeName, rank, classes20, alumniRate, calcTuition, board,
                      avgAid, sizeFT)
        cursor.execute(sql, parameters)
        # commit changes
        conn.commit()
    cursor.close()

    # output results to user
    print("The predicted cost for %s is: %s\n" % (collegeName,
                                                  dollarFormat(calcTuition)))
    print("*" * 50)

    # more results hit enter
    print("To see the factors affecting the cost hit enter:")
    enter = input(">")
    print("""
%s
------------------------------------
School Rank:                      %d
%% of Classes < 20 Students:       %s
Alumni Contribution Rate:         %s
Room & Board Expenses:            %s
Average Financial Aid Package:    %s
Size of Full-time Students:       %s
""" % (collegeName, rank, percentFormat(classes20), percentFormat(alumniRate),
       dollarFormat(board), dollarFormat(avgAid), numFormat(sizeFT)))


if __name__ == "__main__":
    main()