CS105 College Tuition: Data Mining and Analysis
By Jeanette Chu & Khiem Tran
4/28/2010

Introduction

College tuition is steadily increasing every year. According to the college pricing trends report released by the College Board, the average increase in tuition and fees at four-year colleges is around 6 percent. Roughly 15 percent of students who attend four-year colleges have experienced a tuition increase of 15 percent or more. How do colleges determine their financial worth? We hypothesized that several factors contribute to this large figure, which is often paid with loans that numerous students and parents pledge 20 years of their lives to paying off. The main contributors include the following attributes:

Attribute                    Reasoning
Size of the Student Body     The students supply the revenue that keeps the college running
Size of Faculty and Staff    The college needs to pay salaries; a larger staff requires a larger budget
Location of the School       The college needs to account for the cost of maintenance

Dataset Description

The entire table was compiled using data from the U.S. News and World Report College Rankings of National Colleges and the complete University Guide found towards the back of the survey. The University Guide included the setting, room and board expenses, and financial aid packages, all of which we believed would affect tuition costs.

Rank (Numeric)
    Description: Based on the survey's overall college scoring
    Hypothesis: Higher-ranked schools may charge more for the prestige and reputation

Average Freshman Retention (Numeric)
    Description: % of the freshmen that continue at that college
    Hypothesis: A higher retention rate would mean more students to pay the tuition. This might lower the tuition per student.

Percent of Class Sizes under 20 (Numeric)
    Description: % of classes with fewer than 20 students
    Hypothesis: Fewer students in the classroom might mean higher tuition, since this may lower the student-teacher ratio (more faculty per student)

Percent of Class Sizes over 50 (Numeric)
    Description: % of classes with more than 50 students
    Hypothesis: More students may decrease tuition costs since there are more students paying

Percent of Full-time Faculty (Numeric)
    Description: % of faculty that work full time
    Hypothesis: Tuition may be higher if this percentage is higher, so that the college can pay salaries

Freshmen in top 10% of HS (Numeric)
    Description: % of freshmen that graduated in the top 10% of their high schools
    Hypothesis: Recruiting more of the smarter students may lead to an increase in tuition for an overall perception of a better school

College Acceptance Rate (Numeric)
    Description: % of applicants who are accepted into the college
    Hypothesis: A higher acceptance rate may lead to an increase in tuition (supply increases, demand increases)

Alumni Contribution Rate (Numeric)
    Description: % of alumni contributing to the college
    Hypothesis: A greater contribution from alumni may lead to a decrease in tuition

Room & Board (Numeric)
    Description: Room and board expenses at the college
    Hypothesis: A greater cost of room and board may lead to an increase in tuition (cost of living)?

Percent with Financial Need (Numeric)
    Description: % of students with a determined need for financial assistance
    Hypothesis: If more students are determined to need financial assistance, tuition may increase

Average Aid Package (Numeric)
    Description: Average financial aid package awarded to students
    Hypothesis: A larger aid package may lead to higher tuition costs; raking in more revenue from students

Setting (Text)
    Description: Location of the college: rural, suburban, urban
    Hypothesis: The cost of living varies; the cost of land, building, and maintenance may be higher in urban areas compared to rural areas

Size of Full-time Students (Numeric)
    Description: Size of the full-time student population
    Hypothesis: More full-time students may lead to a decrease in tuition since more revenue is coming in from current students

Data preparation

Formatting the data

After eliminating the attributes listed below, we briefly swept through the data and corrected minor issues such as apostrophes and spelling mistakes.
We formatted the data to exclude symbols such as percent and dollar signs, and made sure to eliminate unique identifiers, in this case the individual college names. We chose to run numeric estimation, which requires numeric inputs to deliver numeric outputs. Therefore, to use the setting attribute, we encoded it with numbers: 3 = urban, 2 = suburban, 1 = rural. Before running the tests, we randomized our main data (133 instances) and split it into 1/3 for the test data and 2/3 for the training data.

We eliminated the following attributes: Overall Score, Peer Score, Predicted Retention Rate, Actual Retention Rate. We felt that these four did not have a direct relationship with the attribute Tuition. We also wanted to use attributes that were more accessible and readily available to students and users.

Creating the Database

createCollegeDataDB.py: This program uses SQL within Python to create the college data database, which contains a table with the attributes. It also parses the comma-separated values (CSV) file to insert the values into the correct columns. This is the database that our tuition calculator pulls data from, so it contains only the attributes relevant to the main model that we chose. (See Appendix A for full code.)

1. Connected to the database (essentially creating the database file)
2. Created a handler to execute queries
   a. Executed the following query to create the table:

      CREATE TABLE collegeData (
          CollegeName text,
          Rank int,
          Classes20 numeric,
          AlumniRate numeric,
          Tuition numeric,
          Board numeric,
          AvgAid numeric,
          SizeFT numeric
      )

3. Connected to the CSV file containing the data
4. Read in each row of the data, splitting the string by the commas
   a. Executed the following query in Python to insert the values:

      for record in readCSV:
          record = record.strip().split(',')
          sql = "INSERT INTO collegeData VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
          parameters = (record[0], record[1], record[2], record[3],
                        record[4], record[5], record[6], record[7])
          cursor.execute(sql, parameters)

tuitionCalculator.py: This program was written to project tuition costs for any college, given the data inputs used in the model. The calculator is based on the Linear Regression model. (See Appendix B for full code.)

1. Connects to the database
2. Prompts the user to enter a college name
   a. If the college appears in the table, the tuition is automatically calculated using the model
   b. If the college does not, the user is prompted to enter the attributes, and this record is added to the table

Data Analysis

The goal of this project is to understand the weights and relationships of the variables used by U.S. News in determining the ranking of the Top National Universities, and their effect on tuition. Using data from the U.S. News college rankings, we ran five different numeric models and derived the equations and regression tree that estimate tuition based on the given attributes. We used ten-fold cross-validation on all data runs. The following table gives the results of the five models in Weka:

Approach             Test Data                  Training Data
                     Correlation Coefficient    Correlation Coefficient
Linear Regression    0.8140                     0.8789
M5P                  0.8783                     0.8799
LeastMedSqd          0.7723                     0.8725
DecisionTable        0.7115                     0.6566
SMOreg               0.8556                     0.8878

The following models were used for analysis (descriptions quoted from the Weka documentation):

Linear Regression Model: “Class for using linear regression for prediction. Uses the Akaike criterion for model selection, and is able to deal with weighted instances.” This model gives a good correlation (0.8140 on the test data) and yields a simple regression equation that is much more useful. We selected this approach for further analysis.

M5 Pruned Model Tree: “Split the parameter space into areas (subspaces) and build in each of them a linear regression model.” This model follows a decision tree approach, but fits a linear regression model in each subspace. In this case, the tree breaks down into many levels of nodes.
Even though it yields good correlation, this model isn't practical in comparison to Linear Regression.

Least Median Squared: “Implements a least median squared linear regression utilizing the existing Weka LinearRegression class to form predictions. The basis of the algorithm is Robust regression and outlier detection.”

DecisionTable: “Class for building and using a simple decision table majority classifier.” The result was 10 rules for the data. Its correlation on the training data (0.6566) is lower than on the test data (0.7115) and the lowest of all the models. We believe this model is too simple for the data and its many attributes.

SMOreg: “Sequential minimal optimization algorithm for training a support vector regression using polynomial or RBF kernels. This implementation globally replaces all missing values and transforms nominal attributes into binary ones.” This model has the best correlation on the training data (0.8878). However, the algorithm normalizes all attributes.

Results

We selected the Linear Regression model as our main model due to its correlation, simplicity, and practicality. The following result and analysis show the relationships between the attributes and their effect on Tuition. This regression equation gives us many insights into the data.

Tuition =

Term                                Effect on Tuition    Analysis
-89.1861 * Rank +                   Negative             Better rankings (lower number) make tuition more expensive.
-8758.1325 * %ClassesUnder20 +      Negative             Schools with a larger share of small classes tend to be cheaper.
-16707.8015 * AlumniGivingRate +    Negative             Schools where alumni donate a lot of money tend to be cheaper.
1.0977 * BoardCost +                Positive             Higher boarding cost means higher tuition cost as well.
0.3953 * AvgFinAid +                Positive             A higher financial aid package means tuition will be higher as well.
-0.2794 * SizeFullTimeStudents +    Negative             More students decrease tuition.
27317.312

From this table, we can see the weights of these 6 attributes in determining tuition. They give insights about colleges that one might not think about.
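
To see how the terms combine, the regression equation above can be evaluated directly. The following is a minimal Python 3 sketch, not part of the project code: the function name predict_tuition and the sample inputs are hypothetical (they are not a row from the dataset), and the two rate attributes are assumed to be entered as decimals (0.40 for 40%).

```python
# Coefficients copied from the regression equation above.
COEFFICIENTS = {
    "rank": -89.1861,
    "classes_under_20": -8758.1325,
    "alumni_giving_rate": -16707.8015,
    "board_cost": 1.0977,
    "avg_fin_aid": 0.3953,
    "size_full_time": -0.2794,
}
INTERCEPT = 27317.312


def predict_tuition(**attributes):
    """Return the tuition estimated by the linear regression equation."""
    total = INTERCEPT
    for name, weight in COEFFICIENTS.items():
        total += weight * attributes[name]
    return total


# Hypothetical school, not a row from the dataset.
# Rates are assumed to be decimals (0.40 means 40%).
example = dict(rank=50, classes_under_20=0.40, alumni_giving_rate=0.20,
               board_cost=9000, avg_fin_aid=20000, size_full_time=10000)
print("Predicted tuition: $%.2f" % predict_tuition(**example))

# Moving up one place in the ranking (50 -> 49) raises the estimate
# by the size of the Rank coefficient.
better = dict(example, rank=49)
print("Change for one rank better: $%.2f"
      % (predict_tuition(**better) - predict_tuition(**example)))
```

Because the model is linear, each coefficient can be read as the change in estimated tuition per unit change in its attribute while the others are held fixed.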
Better-ranked schools have higher tuition, but they also give more financial aid. The linear regression shows us “hidden” data.

Graphic 1: Tuition (color) vs. Rank (size)
Treemap: Better ranking, higher tuition for colleges
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/rank-vs-tuition-treemap

The smaller boxes indicate better rankings. The darker the color, the more expensive the school. In the Urban segment, the concentration of better-ranked colleges is noticeably more expensive. So when a college improves its ranking, it has more leverage in the market to increase its tuition. When its tuition increases, the college can increase its financial aid as well, and both increases make the school look better and more prestigious.

Graphic 2: Tuition vs. Rank on the Size of the school
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/tuition-vs-rank-on-size-of-school

This is a different representation of the treemap above. In addition, a larger dot indicates a larger school. From this depiction, there seems to be no large school with tuition above $30,000. This is probably due to the differences between public and private schools in terms of ranking. State schools, which tend to be cheaper and larger, dominate the $20-30 thousand range.

Graphic 3: Average Financial Aid (color) vs. Acceptance Rate (size)
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/average-financial-aid-vs-acceptanc

The larger boxes indicate a higher acceptance rate. The darker boxes indicate better financial assistance. In this case, the graphic makes it apparent that there is a correlation between the two attributes: more selective schools give out better financial aid packages. In turn, more selective schools are likely to be better ranked, with higher tuition. The regression model also tells us that higher tuition goes with higher financial aid.
The regression model and the graphics above together tie the relationships among the attributes to how a college sets its tuition.

Conclusion

From our analysis, we learned that better-ranked and more selective schools are not simply more expensive. Many attributes factor into the cost of tuition. Our analysis also shows the natural relationship of costs in attending college.

    Tuition Cost + Board = Total Cost
    Total Cost - Financial Aid = Net Cost

Here, we can see that schools with a high Total Cost compensate by offering better financial assistance. We would advise high school seniors not to be afraid to apply to better-ranked schools that look expensive and prestigious but may turn out to have a lower Net Cost. We also conclude that colleges operate on economies of scale. Schools with greater economies of scale can reduce their average cost of operations and allocate that surplus to financial aid for students.

Extensions

To increase the accuracy of our model, we acknowledge that we could have further categorized our data into private and public 4-year colleges and private and public 2-year colleges. Public colleges tend to charge less for overall tuition, and this factor may change our variables.

Possible Extensions to Python modules:

1. We could make this more customizable by calculating the tuition based on other criteria
   a. E.g., the user wants the predicted tuition for colleges with a specific class size
   b. The calculator could then return all colleges and predicted tuition costs that fall within the specified class size
2. We could turn this into a web page calculator
   a. Command Python to interpret part of the text as HTML
   b. Use forms: either a drop-down to select a specific college, or input fields for search criteria
   c. Select the necessary attribute information based on the user input
   d. Code in the model to calculate the tuition cost

Appendices

Appendix A: createCollegeDataDB.py

# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: createCollegeDataDB.py
# description: create the college data database and insert all data
#              into the table

# import necessary modules
import sqlite3 as db

# create/connect to the database
filename = "collegeData.db"
conn = db.connect(filename)
cursor = conn.cursor()

# create the table; skip this step if the table already exists
try:
    cursor.execute("""
        CREATE TABLE collegeData (
            CollegeName text,
            Rank int,
            Classes20 numeric,
            AlumniRate numeric,
            Tuition numeric,
            Board numeric,
            AvgAid numeric,
            SizeFT numeric
        );
    """)
except db.OperationalError:
    pass

# commit changes
conn.commit()

# open file to read in
readCSV = open('CollegeTuitionDataset-forPythonCalculator.csv', 'r')

sql = """INSERT INTO collegeData VALUES (?, ?, ?, ?, ?, ?, ?, ?)"""
for record in readCSV:
    # drop the trailing newline before splitting on the commas
    record = record.strip().split(",")

    # insert the values
    parameters = (record[0], record[1], record[2], record[3],
                  record[4], record[5], record[6], record[7])
    cursor.execute(sql, parameters)

# commit changes, then close the file, cursor, and connection
conn.commit()
readCSV.close()
cursor.close()
conn.close()

Appendix B: tuitionCalculator.py

# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: tuitionCalculator.py
# description: calculates the tuition based on selected model
#

# import necessary modules
import sqlite3 as db

###############################################################################
# to format dollars
def dollarFormat(total):
    # Convert the number to a string and split off the cents
    # so that only the dollars are grouped
    total = float(total)
    total = "%.2f" % total
    cents = total[-3:]
    total = total[0:-3]

    # Format with dollar sign and commas
    if len(total) <= 3:
        numForm = "$" + total + cents
    else:
        n = len(total)
        numForm = ""
        while n > 3:
            last = total[-3:]
            numForm = "," + last + numForm
            n -= 3
            total = total[0:-3]
        numForm = "$" + total + numForm + cents
    return numForm

###############################################################################
# to format numbers
def numFormat(number):
    # Convert the number to a string of whole digits
    number = str(int(number))
    numForm = number

    # Format with commas
    if len(number) > 3:
        n = len(number)
        numForm = ""
        while n > 3:
            last = number[-3:]
            numForm = "," + last + numForm
            n -= 3
            number = number[0:-3]
        numForm = number + numForm
    return numForm

###############################################################################
# to format percent
def percentFormat(decimal):
    # the value is stored as a decimal, e.g. 0.45 is shown as 45.00%
    return "%.2f%%" % (decimal * 100)

###############################################################################
def main():
    # connect to the database and create handler
    filename = "collegeData.db"
    conn = db.connect(filename)
    cursor = conn.cursor()

    # prompt for college name
    collegeName = input("Enter name of college: ")

    # check for results from the table
    sql = "SELECT * FROM collegeData WHERE CollegeName=?"
    cursor.execute(sql, [collegeName])
    count = 0

    # for row in results:
    for row in cursor:
        rank = row[1]
        classes20 = row[2]
        alumniRate = row[3]
        board = row[5]
        avgAid = row[6]
        sizeFT = row[7]
        count = count + 1

    if count == 0:
        # prompt user for college info (rates are entered as decimals)
        rank = int(input("Enter the school rank: "))
        classes20 = float(input("Enter % of class sizes under 20 as a decimal (0 if N/A): "))
        alumniRate = float(input("Enter alumni contribution rate as a decimal: "))
        board = float(input("Enter room & board expenses: "))
        avgAid = float(input("Enter average financial aid package: "))
        sizeFT = float(input("Enter size of full-time students: "))

    # apply model equation
    calcTuition = (-89.1861 * rank
                   + -8758.1325 * classes20
                   + -16707.8015 * alumniRate
                   + 1.0977 * board
                   + 0.3953 * avgAid
                   + -0.2794 * float(sizeFT)
                   + 27317.312)

    if count == 0:
        # update the table to include entered data
        sql = "INSERT INTO collegeData VALUES (?,?,?,?,?,?,?,?)"
        parameters = (collegeName, rank, classes20, alumniRate,
                      calcTuition, board, avgAid, sizeFT)
        cursor.execute(sql, parameters)

    # commit changes
    conn.commit()
    cursor.close()
    conn.close()

    # output results to user
    print("The predicted cost for %s is: %s\n"
          % (collegeName, dollarFormat(calcTuition)))
    print("*" * 50)

    # more results hit enter
    print("To see the factors affecting the cost hit enter:")
    input(">")
    print("""
    %s
    ------------------------------------
    School Rank: %d
    %% of Classes < 20 Students: %s
    Alumni Contribution Rate: %s
    Room & Board Expenses: %s
    Average Financial Aid Package: %s
    Size of Full-time Students: %s
    """ % (collegeName, rank, percentFormat(classes20),
           percentFormat(alumniRate), dollarFormat(board),
           dollarFormat(avgAid), numFormat(sizeFT)))

if __name__ == "__main__":
    main()
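
The loading steps in Appendix A can also be sketched in Python 3 with the standard csv module, which strips line endings and keeps a quoted college name intact even when it contains a comma. This is a minimal sketch under stated assumptions rather than the project's code: the function name load_college_data, the in-memory database, and the two sample rows are hypothetical and only mirror the column order of the collegeData table.

```python
import csv
import io
import sqlite3

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS collegeData (
    CollegeName text, Rank int, Classes20 numeric, AlumniRate numeric,
    Tuition numeric, Board numeric, AvgAid numeric, SizeFT numeric
)
"""


def load_college_data(conn, csv_file):
    """Create collegeData if needed and insert one row per CSV record."""
    conn.execute(CREATE_SQL)
    # Keep the first eight fields of every complete record.
    rows = [row[:8] for row in csv.reader(csv_file) if len(row) >= 8]
    conn.executemany(
        "INSERT INTO collegeData VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical sample rows standing in for the project's CSV file.
sample_csv = io.StringIO(
    '"Example University, Main Campus",10,0.65,0.30,38000,11000,25000,6000\n'
    "Sample State College,90,0.35,0.10,21000,8500,12000,24000\n"
)
conn = sqlite3.connect(":memory:")
inserted = load_college_data(conn, sample_csv)
print(inserted, "rows inserted")
for name, rank in conn.execute(
        "SELECT CollegeName, Rank FROM collegeData ORDER BY Rank"):
    print(name, rank)
```

Using executemany with "?" placeholders keeps the parameterized style of Appendix A while inserting all rows in one call.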