TOWSON UNIVERSITY
Multiple Regression Analysis
Four Years Graduation Rate
Nhat Nguyen – Minh Tran – Viet Le
12/11/2012

ABSTRACT

The United States education system is widely regarded as one of the best in the world. Most students who graduate from an American college have the knowledge and skills they need to be successful. On the other hand, along with this high-quality school system, the cost of a bachelor’s degree in the U.S. is rising every semester. Surprisingly, research shows that students are taking longer to finish their higher education; they tend to take 5 to 6 years to finish a 4-year program. This is a bad sign for both the students and the American education system. To find out which factors are behind this problem, we decided to build a regression model of the 4-year graduation rate of colleges across the country. For this study, we collected data on the Top 200 American Colleges of 2011 as ranked by Forbes, which is known as one of the most reliable sources of economic, business and financial news and information. Before building the regression model, we reviewed what others have found about graduation rates and ran a hypothesis test, in order to make our regression model more accurate.

LITERATURE REVIEW

In order to conduct the most efficient study of college graduation rates, our group looked at what others have found on the topic. We found a very useful study of the “factors behind college 4-years and 6-years graduation rates”, which is also our main topic. The study was conducted by Brian Gallardo in Fall 2004. Although it is now 8 years old, we believe that the outcome of Gallardo’s study is still valuable for us.
After reviewing several prior studies on graduation rates, Gallardo used an ordinary least squares analysis to understand the impact of different variables on the graduation rate. The graduation rate was defined based on the time it takes a full-time, non-transfer student to receive a bachelor’s degree. The variables in his model are undergraduate population, in-state tuition, student/faculty ratio, retention rate, average high school GPA of admitted students, and the region of the U.S. where the school is located. The data on 100 public schools in the United States were collected from Peterson’s 4-year College and University (2005) and the National Center for Education Statistics website.

Gallardo conducted two separate analyses, one for the 4-year and one for the 6-year graduation rate. Unfortunately, the 4-year analysis did not perform as well as expected: the set of independent variables explains only about 46% of the variation in the 4-year graduation rate. The 6-year analysis performed much better, explaining 81% of the variation in the 6-year graduation rate. From these results, Gallardo concluded that the 4-year and 6-year graduation rates of a school can be explained by the average high school GPA of its admitted students, its retention rate and its in-state tuition. Unlike high school GPA, which was found to be a weak predictor, retention rate and in-state tuition are two very important variables in predicting a school’s graduation rate.

Gallardo’s study also has some odd outcomes and limitations, such as the differences between the 4-year and 6-year analyses and a small sample, both horizontally and vertically (few variables and few schools). Overall, the study “What are the Factors behind College Graduation Rates?” by Brian Gallardo is very valuable for us.
It helps us narrow the scope of our study to certain aspects. By referring back to this study while conducting ours, we can revise our own study to make it more accurate.

We also found a study named “Placing College Graduation Rates in Context”, conducted by the U.S. Department of Education. It shows how gender and ethnicity relate to the graduation rate of a university or college. According to the study, the average 6-year graduation rate for women was 60 percent, nearly 6 percentage points higher than the comparable rate for men. The gap in graduation rates between women and men changed as the share of low-income students increased, and the gender gap was greatest in institutions with large low-income enrollments. Within each racial/ethnic group, as with all students, graduation rates tended to decline as the overall proportion of low-income students in the cohort increased. In general, White and Asian students tended to graduate at higher rates than Black and Hispanic students. The average gap in graduation rates between White and Black students was 18 percentage points, and between White and Hispanic students it was 12 percentage points. However, in very selective baccalaureate institutions with large low-income enrollments, which include many Historically Black Colleges and Universities, the graduation rate of Black students was slightly higher than that of White students.

DESCRIPTIVE STATISTIC

After researching which factors are behind the graduation rate of a college, we decided to build a model using the following variables:

Rank (RANK)

The RANK independent variable shows the position of each school in the “Top 200 American Colleges” ranked by Forbes. The rank runs from 1 to 200.
The higher the rank (in other words, the lower the value of the variable), the better Forbes considers the university to be. We believe that students perform better in higher-ranked schools, which boosts those schools’ graduation rates. The mean of this variable is 100.59. The median is 101, which means about half of the colleges are ranked better than 101 and about half are ranked worse. 25% of the colleges in the data set have ranks below 50 and 25% have ranks above 151, so the inter-quartile range of this variable is 101. The standard deviation, which measures the typical difference between a school’s rank and the mean, is 58.01. For RANK, the case closest to each of these statistics is simply the school at the corresponding rank. For example, the school ranked number 1, Princeton University, is the minimum of the RANK variable. The RANK variable has a Pearson’s Coefficient of Skewness (PCS) of -0.0212, which indicates that the distribution is symmetric and the mean is close to the median. There is no outlier in this variable.

Figure 1: RANK Histogram (number of schools by rank)

Out-of-State Tuition (OUT_STA_TUITION)

This variable is the tuition fee that a student has to pay for one year of a 4-year undergraduate program. Schools with higher tuition fees seem to offer a better education, so we think they will have higher graduation rates. The mean of this variable is $35,742; in other words, the average tuition fee is $35,742. The median is $38,290, which means that 50% of the schools charge less than $38,290; the school with the tuition fee closest to this point is The College of Wooster.
25% of the schools have a tuition fee lower than $31,822 (Marquette University) and 25% have a tuition fee higher than $41,992 (Santa Clara University). The highest tuition fee is $45,290. Our data set also includes some military academies that charge no tuition, which is why the minimum in our data is $0. The standard deviation is $8,875.835. Pearson’s skewness is -0.86, which indicates that the tuition data are negatively skewed: there are more colleges with tuition above the mean than below it. The outliers of this variable are $0 (US Military Academy; US Air Force Academy; US Naval Academy; US Coast Guard Academy), $910 (Berea College) and $4,650 (Brigham Young University).

Figure 2: OUT_STA_TUITION Histogram (number of schools by tuition bracket, from $0 - <$10,000 to $40,000 - <$50,000)

Total Number of Enrolment (TOTAL_ENROLL)

This variable shows the total enrolment of each university or college in the 2011 academic year. The more students want to study at a school, the higher the quality of its education is likely to be, so we think there is a direct relationship between total enrolment and graduation rate. The average enrolment in our data is 12,988. The highest enrolment is 61,545 and the lowest is 171. Half of the universities (100 schools) have an enrolment under 7,195 (the median, which is the total enrolment of the University of Puget Sound), and the other 100 universities have an enrolment higher than 7,195.
25% of the schools have a total enrolment lower than 3,587 (Haverford College and Goucher College) and 25% have a total enrolment higher than 20,828 (Virginia Polytechnic Institute and State University has the closest total enrolment). The average distance from the mean is 12,758, which is shown in the descriptive statistics table as the standard deviation. The data are positively skewed, with a PCS of 1.36, so more colleges and universities have an enrolment below the mean than above it. The outliers of this variable are 54,871 (St. John’s University – New York) and 61,545 (University of California – Los Angeles).

Figure 3: TOTAL_ENROLL Histogram (number of schools by enrolment bracket, from 0 - <10,000 to 60,000 - <70,000)

Undergraduate Population (UNDERGRAD_POP)

The UNDERGRAD_POP variable shows the number of undergraduate students in each school in the data set. We think that students in bigger schools have more activities besides studying and, accordingly, will not perform as well as students from smaller schools. As a result, the graduation rate of a bigger school is expected to be lower than that of a similar school with a smaller undergraduate population. The mean is 7,549.10, which means that, on average, each college in the list has about 7,549 undergraduate students. In our data set, Georgetown University, with 7,590 undergraduate students, is closest to the mean. The median of 2,950 indicates that about 100 colleges in the data set have fewer than 2,950 undergraduate students and the other 100 schools have more than 2,950 undergraduate students.
The College of Holy Cross, which has 2,905 undergraduate students, is the case closest to the median in our data. Among the Top 200 America’s Colleges by Forbes, the College of the Atlantic has the lowest number of undergraduate students, 354, while Ohio State University – Main Campus has the highest, 42,916. The first quartile is 1,769, which means 25% of the colleges in the data set have fewer than 1,769 undergraduate students; the third quartile is 8,127, which means 75% of the colleges have fewer than 8,127 undergraduate students. The schools at the first and third quartiles are Bates College and Columbia University in the City of New York, respectively. The first and third quartiles give an inter-quartile range of 6,358 students. The PCS of this variable is 1.46, which indicates that the UNDERGRAD_POP variable is positively skewed: there are more colleges with a low number of undergraduate students than with a high number. The PCS is also supported by the fact that there are 4 outliers in this variable. They are the four colleges with the highest number of undergraduate students: the University of Texas at Austin (38,437 students), Pennsylvania State University – Main Campus (38,954 students), Texas A&M University (39,867 students) and Ohio State University – Main Campus (42,916 students).

Figure 4: UNDERGRAD_POP Histogram (number of schools by undergraduate population)

Fall 2011 Acceptance Rate (ACCEPT_RATE)

The ACCEPT_RATE variable shows the percentage of applications admitted by each of the Top 200 America’s Colleges in the Fall semester of 2011.
We believe that the lower the acceptance rate, the more competitive the study environment at a school will be. For that reason, students will have more incentive to improve their performance, which will increase the graduation rate. The mean of this variable shows that, on average, 45.19% of applications were admitted to the Top 200 America’s Colleges. The median is 45%, which is very close to the mean. Seven schools accepted 45% of the applications they received in Fall 2011: Grinnell College, Franklin and Marshall College, Smith College, Colorado School of Mines, University of California-Santa Barbara, University of California-Irvine and University of Maryland-College Park. The standard deviation, the typical difference between a school’s acceptance rate and the mean, is 21.636%. 25% of the schools in the data set have a Fall acceptance rate lower than the first quartile of 27%, and 25% have a rate higher than the third quartile of 63%. These two numbers give an inter-quartile range of 36%. Bates College and Hamilton College have the acceptance rates closest to the first quartile; Wabash College, Hiram College, Loyola University Maryland, Brigham Young University, Texas A & M University and Ohio State University-Main Campus accepted 63% of the applications submitted in Fall 2011. The PCS of this variable is 0.027, which is very low, as expected, since the mean and the median are very close. The PCS also indicates that the acceptance rate data are symmetrically distributed. There is no outlier in this variable.
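The Pearson’s Coefficient of Skewness (PCS) values quoted in this section appear consistent with the median-based formula PCS = 3 × (mean − median) / standard deviation. The paper does not state the formula, so this is our assumption. A minimal sketch that recomputes the PCS from the rounded summary statistics reported above (the function name is ours, for illustration only):

```python
def pearson_skewness(mean, median, std_dev):
    """Median-based Pearson skewness: 3 * (mean - median) / standard deviation.

    Assumption: this is the formula behind the PCS values in the text.
    """
    return 3 * (mean - median) / std_dev


# (mean, median, standard deviation) as reported in the text, already rounded.
reported = {
    "RANK": (100.59, 101, 58.01),
    "OUT_STA_TUITION": (35742, 38290, 8875.835),
    "TOTAL_ENROLL": (12988, 7195, 12758),
    "ACCEPT_RATE": (45.19, 45, 21.636),
}

pcs = {name: pearson_skewness(*stats) for name, stats in reported.items()}

for name, value in pcs.items():
    print(f"{name}: PCS = {value:.4f}")
```

With these rounded inputs, the results agree with the reported values for RANK (-0.0212), OUT_STA_TUITION (-0.86) and TOTAL_ENROLL (1.36). ACCEPT_RATE comes out slightly below the reported 0.027, a gap that rounding of the inputs could explain.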
Figure 5: ACCEPT_RATE Histogram (number of schools by acceptance rate)

Students/Faculty Ratio (STU_FAC_RATIO)

The STU_FAC_RATIO variable shows the number of students per faculty member at the school. This ratio indicates how much attention the faculty can give their students. We think that the help students receive from their professors is a very important factor in their performance in school: the more help from faculty, the better the students will perform. A lower ratio means the faculty are more available to help students. As a result, we believe that the lower the variable, the better the students will perform, and hence the higher the graduation rate will be. On average, the 200 colleges in the data set have 11.61 students per faculty member, as shown by the mean. Half of the schools in the list have fewer than 11 students per faculty member and the other half have more than 11. The standard deviation, the typical difference between a school’s ratio and the mean, is 3.713. The California Institute of Technology has the minimum value of 3 students per faculty member, and Florida State University has the highest ratio, 27 students per faculty member. According to the quartiles, a quarter of the schools have a student/faculty ratio below 9 and a quarter have a ratio above 13. The difference between the first and third quartiles is the interquartile range, which is 4 in this case.
The STU_FAC_RATIO variable has a Pearson’s Coefficient of Skewness of 0.49, which indicates that the data are roughly symmetrically distributed; however, since the PCS is very close to 0.5, a high outlier is possible. As expected, there is one outlier among the student/faculty ratios: the ratio of 27 students per faculty member at Florida State University.

Figure 6: STU_FAC_RATIO Histogram (number of schools by student/faculty ratio, in bins 0-5, 6-10, 11-15, 16-20, 21-25, 26-30)

4-Years Graduation Rate (GRAD_RATE)

The “Four Years Graduate Rate” variable is the percentage of students who graduated in four years at each of the Top 200 America’s Colleges ranked by Forbes. This is the dependent variable, which we want to predict using the other variables. The mean shows that, on average, 70.78% of the students of the Top 200 America’s Colleges graduated in 4 years. The median is 72.50%. Thirteen schools have a rate close to the median, among them Clark University (73%), Saint Michaels College (73%), and University of Southern California (72%). The standard deviation, the typical difference between a school’s graduation rate and the mean, is 15.38%. This is a large number; it means that graduation rates differ widely among these schools. 25% of the schools in the data set have a 4-year graduation rate lower than the first quartile of 60%; seven schools are at exactly that rate, such as College of the Atlantic, Southern Methodist University and Virginia Military Institute.
25% of the schools have a rate higher than the third quartile of 83%; five universities are at exactly that rate, such as United States Military Academy, Colgate University and Wake Forest University. These two numbers give an inter-quartile range of 23%. The PCS of this variable is -0.404, which means that our graduation rate data are negatively skewed. There is only one outlier in our data: the University of Utah, whose 4-year graduation rate is only 23%.

Figure 7: GRAD_RATE Histogram (number of schools by graduation rate, in bins 0%-20%, 21%-40%, 41%-60%, 61%-80%, 81%-100%)

Athlete (ATH)

The “Athletic” variable is the percentage of the student body who are varsity athletes at each of the Top 200 America’s Colleges ranked by Forbes. The mean shows that, on average, 13.03% of the student body at the Top 200 America’s Colleges are varsity athletes. The median is 11%; five schools have this percentage, including Rice University, Georgetown University and Saint Marys College of California. Being in better physical shape seems to have a positive effect on the ability to study, so a low percentage of athletes may lower the graduation rate slightly. The standard deviation, the typical difference between a school’s percentage of athletes and the mean, is 10.53%. This is large compared with the mean and median, which means that the percentage of athletes differs widely among these schools.
25% of the schools in the data set have a percentage of athletes lower than the first quartile of 3%; twelve schools are at exactly that percentage, among them University of California-Los Angeles, University of Michigan-Ann Arbor and University of Washington-Seattle Campus. 25% of the schools have a percentage higher than the third quartile of 22%; ten universities share this percentage, such as Lafayette College, Wheaton College (MA) and Lawrence University. These two numbers give an inter-quartile range of 19%. Twenty schools have no varsity athletes at all, such as Harvey Mudd College, Wellesley College and Mount Holyoke College. The PCS of this variable is 0.578, which implies that the athletic data are positively skewed. Erskine College and Seminary is the only school whose percentage of varsity athletes is markedly higher than the others, at 45%.

Figure 8: ATH Histogram (number of schools by percentage of athletic students, in bins 0%-10%, 11%-20%, 21%-30%, 31%-40%, 41%-50%)

Financial Aid (FINANCIAL_AID)

This variable shows the percentage of students who receive financial aid from their university or college. Financial aid is usually given to students with good academic results, so in a school where a high share of students receive financial aid, the graduation rate should be higher. The shape of the histogram is quite similar to the histogram of the tuition fee, so we think there is a direct relationship between tuition fee and financial aid.
The mean is 75.88%, which means that, overall, 75.88% of the students at these colleges receive financial aid. The highest and most frequent value is 100%, which means that many schools offer financial aid to all of their students. The median is 77% (Carleton College; Gettysburg College; Vassar College; Virginia Polytechnic Institute and State University; University of California-Davis): 50% of the colleges provide financial aid to fewer than 77% of their students, and 50% provide it to more than 77%. 25% of the schools provide financial aid to fewer than 64% of their students and 25% provide it to more than about 93%. The first quartile of 64% corresponds to Claremont McKenna College; Amherst College; Vanderbilt University; Brown University; University of Michigan-Ann Arbor. The third quartile of 92.75% corresponds to University of Redlands; Willamette University; Loyola Marymount University; University of Georgia. The average distance from the mean is 18.90, which is shown in the descriptive statistics table as the standard deviation. Our data show essentially no skew, with a skewness of -0.17. 44.5% of the schools provide financial aid to over 80% of their students. Only 3.5% of the schools provide financial aid to fewer than 20% of their students (see frequency table). The outliers of this variable are 0% (US Air Force Academy; US Naval Academy; US Coast Guard Academy) and 11% (US Military Academy).
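The paper does not say how outliers were identified. The cases listed for each variable appear consistent with a rule that flags values more than three standard deviations from the mean, so the sketch below assumes that rule. The helper names and the three-sigma threshold are our assumptions, and the inputs are the rounded means and standard deviations reported above.

```python
def three_sigma_bounds(mean, std_dev, k=3.0):
    """Return (lower, upper) cutoffs located k standard deviations from the mean."""
    return mean - k * std_dev, mean + k * std_dev


def is_outlier(value, mean, std_dev, k=3.0):
    """True if value lies outside the k-standard-deviation band (assumed rule)."""
    lower, upper = three_sigma_bounds(mean, std_dev, k)
    return value < lower or value > upper


# FINANCIAL_AID: mean 75.88, standard deviation 18.90 (percent of students).
fa_lower, fa_upper = three_sigma_bounds(75.88, 18.90)
print(f"FINANCIAL_AID cutoffs: {fa_lower:.2f} to {fa_upper:.2f}")
print("11% flagged:", is_outlier(11, 75.88, 18.90))   # US Military Academy
print("64% flagged:", is_outlier(64, 75.88, 18.90))   # first quartile

# TOTAL_ENROLL: mean 12,988, standard deviation 12,758 (students).
print("54,871 flagged:", is_outlier(54871, 12988, 12758))
print("20,828 flagged:", is_outlier(20828, 12988, 12758))
```

Under this assumed rule, the lower cutoff for FINANCIAL_AID falls just above 19%, which flags the 0% and 11% cases but not every school below 20%. A different rule, such as fences at 1.5 times the inter-quartile range, would flag a different set of schools.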
Figure 9: FINANCIAL_AID Histogram (number of schools by percentage of students receiving aid, in bins 0 - <20%, 20% - <40%, 40% - <60%, 60% - <80%, 80% - <=100%)

Private (PRI)

The “Private” variable distinguishes the private schools from the public schools in the Top 200 America’s Colleges ranked by Forbes. In this variable, 1 represents a private school and 0 represents a public school. The mean of this variable shows that 77% of the schools in the Top 200 America’s Colleges are private. As can be seen from the graph, there are more private schools than public ones, which will have some influence on our regression model. We expect private schools to provide a better environment for students to study and to graduate sooner than public schools.

Percentage of Male Students (MALE_PERCENT)

This variable indicates whether male students make up at least half of a school’s students. Based on the study “Placing College Graduation Rates in Context”, we think that schools with a higher percentage of female students will have a higher graduation rate than the others. A value of “1” means the percentage of male students is equal to or higher than 50%; otherwise the value is “0”. 70.5% of the schools have fewer male students than female students. The mean of the variable is 0.295, which indicates that universities and colleges with 50% or more male students are in the minority.

Percentage of White Students (WHITE_PERCENT)

This variable indicates whether White students make up at least 50% of a school’s total number of students.
Based on the study “Placing College Graduation Rates in Context”, we think that schools with a higher percentage of White students will have a higher graduation rate than the others. A value of “1” represents a school where the percentage of White students is equal to or higher than 50% of the total population; otherwise the value is “0”. We can see in the graph that in only 24% of the schools do White students make up 50% or more of the total number of students. The mean of the variable is 0.24, which indicates that universities and colleges with 50% or more White students are in the minority.

Percentage of Part-time Student (PART)

The “Part-Time” variable reflects the percentage of part-time students in the schools of the Top 200 America’s Colleges ranked by Forbes. In this data set, “1” represents a school where fewer than 2% of the students are part-time, and “0” represents a school where 2% or more of the students are part-time. The mean of this variable is 0.38, which shows that only 38% of the schools in the Top 200 America’s Colleges have fewer than 2% part-time students. This suggests that the part-time student body is sizeable at most schools, which will have an impact on our regression model, since it takes longer for a part-time student to graduate.

After choosing the variables to study and collecting the data, we found that it might be interesting to compare the graduation rate between male-dominated colleges and female-dominated colleges.

HYPOTHESIS TEST

We conduct a hypothesis test on the relationship between the percentage of male students and the graduation rate.
Our conjecture is that the graduation rate is higher at universities or colleges where the percentage of male students is lower than 50% than at universities or colleges where it is 50% or higher. We conduct a hypothesis test to check this claim.

H0: µGRAD_RATE, MALE_PERCENT=1 ≥ µGRAD_RATE, MALE_PERCENT=0
Hα: µGRAD_RATE, MALE_PERCENT=1 < µGRAD_RATE, MALE_PERCENT=0

Since the Levene’s Sig is less than 0.05, we use the “Equal variances not assumed” row. Because the test is one-tailed, the p-value is half of the two-tailed Sig in that row:

\[ p\text{-value} = \frac{\text{Sig}}{2} = \frac{0.249}{2} = 0.1245 \]

This means that if H0 is true, there is a 12.45% chance of getting a difference at least as extreme as the sample difference of 3.0898305 that we have in this test.

α = 0.05
Decision Rule: Reject H0 if p-value < α; otherwise, do not reject H0.

Since 0.1245 > 0.05, we cannot reject H0. We do not have enough evidence to say that the graduation rate is higher at universities or colleges where the percentage of male students is lower than 50% than at universities or colleges where it is 50% or higher.

Independent Samples Test: 4 years graduation rate (%)

                              Levene's Test       t-test for Equality of Means
                              F       Sig.        t      df      Sig. (2-tailed)  Mean Difference  Std. Error Difference
Equal variances assumed       20.454  .000        1.368  197     .173             -3.0898305       2.2582316
Equal variances not assumed                       1.162  80.022  .249             -3.0898305       2.6584805

MULTIPLE REGRESSIONS ANALYSIS

This is the regression model that we will use to predict the 4-year graduation rate of a particular college in the U.S. For the purposes of the model, we express the out-of-state tuition fee, the total enrollment and the undergraduate population in units of 1,000.
For example, University of California-Los Angeles has an out-of-state tuition of $35,564, which shows up in our model as 35.564 (thousands of dollars). We use the IBM SPSS program to conduct our analysis.

The dependent variable of our regression model is the 4-year graduation rate. We expect Out-state Tuition fee, Number of Enrollments, Private, Financial Aid, Percentage of White, and Part-time to have a positive relationship with the dependent variable, and Rank, Fall 2011 Acceptance Rate, Student/Faculty, and Percentage of Male to have a negative relationship with the graduation rate. We do not have any expectation about the relationship between the dependent variable and the Undergraduate Population and Athletic variables.

After running the regression, we chose model 3, which has the highest adjusted R2 and the smallest standard error of the estimate. We find that we should use neither Private nor the percentage of White students, since removing these variables decreases the standard error of the estimate.

Variables Entered/Removed

Model  Variables Entered                            Variables Removed      Method
1      Athletic, Fall 2011 Acceptance Rate,         .                      Enter
       % Male ≥ 50, Part-time,
       Out-state Tuition fee, % White ≥ 50,
       Private, Financial Aid,
       Number of Enrollments, Rank,
       Student/Faculty, Undergraduate Population
2      .                                            Private                Backward (criterion: Probability of F-to-remove >= .100)
3      .                                            % White ≥ 50           Backward (criterion: Probability of F-to-remove >= .100)
4      .                                            Number of Enrollments  Backward (criterion: Probability of F-to-remove >= .100)
5      .                                            Part-time              Backward (criterion: Probability of F-to-remove >= .100)

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .828a  .685      .665               8.4425118
2      .827b  .685      .666               8.4253857
3      .827c  .684      .667               8.4172614
4      .825d  .681      .666               8.4257515
5      .823e  .678      .664               8.4498776

Regression Equation:

4 years graduation rate (%) = 90.570
    - 0.093(Rank)
    + 0.226(Out-State Tuition fee)
    + 0.103(Number of Enrollments)
    - 0.278(Undergraduate Population)
    - 0.076(Fall 2011 Acceptance Rate)
    - 0.537(Student/Faculty)
    - 0.127(Financial Aid)
    - 4.831(% Male ≥ 50)
    + 2.304(Part-time)
    + 0.157(Athletic)

Coefficients (Model 3)

                             Unstandardized Coefficients  Standardized Coefficients
                             B        Std. Error          Beta                       t        Sig.
(Constant)                   90.570   4.997                                          18.126   .000
Rank                         -.093    .017                -.368                      -5.487   .000
Out-state Tuition fee        .226     .086                .138                       2.636    .009
Number of Enrollments        .103     .087                .090                       1.175    .241
Undergraduate Population     -.278    .132                -.182                      -2.097   .037
Fall 2011 Acceptance Rate    -.076    .047                -.112                      -1.609   .109
Student/Faculty              -.537    .294                -.137                      -1.830   .069
Financial Aid                -.127    .042                -.165                      -3.012   .003
% Male ≥ 50                  -4.831   1.456               -.152                      -3.317   .001
Part-time                    2.304    1.443               .077                       1.596    .112
Athletic                     .157     .071                .113                       2.202    .029

Statistical Significance and Interpretation:

RANK is statistically significant at the 1% level of significance. For each one-place drop in a university’s rank (in other words, each one-unit increase in the variable), the percentage of students who graduate in four years decreases by 0.093 percentage points, holding all else constant. This is as expected: because the ranking is based on the quality of the schools, a better-ranked school offers a better environment for students to study and graduate sooner.

PERCENTAGE OF MALE is statistically significant at the 1% level of significance. The coefficient implies that if 50% or more of a school’s students are male, the four-year graduation rate is 4.831 percentage points lower, holding all else constant.
This is not proof that female students are better students than male students, but it does suggest that they performed better at these universities. One possible reason is that female students, on average, work harder and stay more focused.

OUT-STATE TUITION FEE is statistically significant at the 1% level of significance. For each $1,000 increase in the annual out-of-state tuition fee, the percentage of students who graduate in four years increases by 0.226 percentage points, holding all else constant. One possible reason is that a high tuition fee pushes students to finish their degree as soon as possible.

FINANCIAL AID is statistically significant at the 1% level of significance. For each additional percentage point of students who receive financial aid from the university, the percentage of students who graduate in four years decreases by 0.127 percentage points, holding all else constant. One possible explanation is that students who receive scholarships from their school are willing to stay longer and study more.

UNDERGRADUATE POPULATION is statistically significant at the 5% level of significance. For each additional 1,000 undergraduate students at the university, the percentage of students who graduate in four years decreases by 0.278 percentage points, holding all else constant. This relationship is understandable: students in a smaller environment tend to focus more on their studies, while students at a crowded school may be distracted by other activities.

ATHLETIC is statistically significant at the 5% level of significance. For each one-percentage-point increase in the share of the student body that is a varsity athlete, the percentage of students who graduate in four years increases by 0.157 percentage points, holding all else constant.
The connection between the share of physically active students and the graduation rate is plausible: a healthy body gives students the energy to study hard and keep a clear mind, and it also reduces the stress of academic pressure.

STUDENT/FACULTY is statistically significant at the 10% level of significance. For each additional student per faculty member, the percentage of students who graduate in four years decreases by 0.537 percentage points, holding all else constant. The reason is easy to understand: with more students per professor, it is harder for professors to help and teach each individual student.

NUMBER OF ENROLLMENTS, PART-TIME and FALL 2011 ACCEPTANCE RATE are statistically insignificant, so we do not interpret their coefficients, since we cannot reject the hypothesis that each of them is really zero.

Standard Error of the Estimate: The standard error of the estimate in this case is 8.4172614 percentage points, which means that, on average, our prediction of the 4-year graduation rate will be off by about 8.42 percentage points from the actual rate. We consider this an acceptable level, since it is only about 8 points on the 0-100% scale of the graduation rate.

R2 and Adjusted R2: R2 in this case is 68.4%, which means that 68.4% of the variation in the 4-year graduation rate of a university can be explained by the variation in the set of independent variables. Adjusted R2 is 66.7%, which means that 66.7% of the variation in the 4-year graduation rate can be explained by the variation in the set of independent variables, after adjusting for the number of independent variables in our model.
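As a cross-check on these figures, the following Python sketch recomputes R2, adjusted R2, the standard error of the estimate and the F statistic from the Model 3 sums of squares reported in the ANOVA table of the F-test section. The function name fit_statistics and its argument names are illustrative choices of ours, not SPSS output. The sample size of 199 schools and the 10 predictors are taken from the degrees of freedom in that ANOVA table (198 total, 10 regression).

```python
import math


def fit_statistics(ss_regression, ss_residual, n_obs, n_predictors):
    """Recompute OLS summary statistics from the ANOVA sums of squares.

    Hypothetical helper for illustration; it is not part of SPSS.
    """
    ss_total = ss_regression + ss_residual
    df_residual = n_obs - n_predictors - 1

    # Share of the total variation explained by the regression.
    r_squared = ss_regression / ss_total

    # Penalize R2 for the number of predictors in the model.
    adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual

    # Residual mean square and its square root (standard error of the estimate).
    mean_square_residual = ss_residual / df_residual
    std_error_estimate = math.sqrt(mean_square_residual)

    # Overall F statistic: regression mean square over residual mean square.
    f_statistic = (ss_regression / n_predictors) / mean_square_residual

    return {
        "r_squared": r_squared,
        "adjusted_r_squared": adjusted_r_squared,
        "std_error_estimate": std_error_estimate,
        "f_statistic": f_statistic,
    }


# Model 3 sums of squares as reported in the ANOVA table.
model3 = fit_statistics(
    ss_regression=28775.854,
    ss_residual=13319.854,
    n_obs=199,
    n_predictors=10,
)

for name, value in model3.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the sketch should agree, up to rounding, with the values SPSS reports for Model 3: R2 of about .684, adjusted R2 of about .667, a standard error of the estimate of about 8.417 and an F statistic of about 40.615.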
Overall, this is a good enough model, since it explains over 66% of the variation in the graduation rate. However, many other factors that may affect the graduation rate are not in our regression, such as the school retention rate, the average high school GPA of freshmen, the average SAT score of all students, what majors the school offers, the school's calendar system, etc. If we had more information, we might be able to increase the predictive power of our regression.

F-test: The F-test has the following hypotheses.

H0: β1 = β2 = β3 = . . . = βn = 0
H1: At least one β does not equal zero

In this case, since the significance of F is less than 0.01, our overall model is statistically significant at the 1% level of significance. This means that at least one of our coefficients is different from zero and thus our model is able to explain some of the variation in the 4-year graduation rate.

ANOVA (Model 3)

             Sum of Squares   df    Mean Square   F        Sig.
Regression   28775.854        10    2877.585      40.615   .000
Residual     13319.854        188   70.850
Total        42095.709        198

ASSUMPTION ON LINEAR REGRESSION

After obtaining the result of the regression analysis, we conducted a series of assumption tests to learn more about how well our dependent and independent variables satisfy the assumptions of linear regression.

Test 1: Residuals' Expected Value

The residual mean is 0.00. This indicates that our regression meets the test of residuals' expected value. In other words, each case has some residual, and the positive and negative residuals offset each other to give a residual mean of 0.00.

Residuals Statistics

                       Minimum    Maximum   Mean     Std. Deviation   N
Predicted Value        39.849     92.924    70.784   12.068           199
Residual               -28.479    20.092    .000     8.183            199
Std. Predicted Value   -2.563     1.835     .000     1.000            199
Std. Residual          -3.373     2.380     .000     .969             199

Test 2: Residuals' Constant Variance

The scatterplot shows that the spread of the residuals is not constant across the predicted values. The regression therefore fails the residuals' constant variance test, and we say that our regression has heteroskedasticity. The spread appears to narrow as the predicted values increase. Because of this, the standard errors and significance levels reported above should be read with some caution.

Test 3: Residuals are Normally Distributed

The normal P-P plot indicates that our regression meets the normal distribution test. All the observations lie very close to the expected values. We take this to mean that our regression is quite accurate in predicting the 4-year graduation rate of college students. The model is less accurate only for the several universities with a graduation rate between 20% and 40%. Even so, the differences between the predicted values and the actual observations are small.

Test 4: Residuals are Independent

The Durbin-Watson statistic of this regression is 1.767, which is between 1.5 and 2.5. This means that our regression's residuals are independent of each other. In other words, the error for one school does not help us predict the error for another school.

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      .828   .685      .665               8.4425118                   1.767

Test 5: Independent Variables Are Not Related

All of our variables have a VIF of less than 10. This means our regression meets the fifth assumption; in other words, our independent variables are not highly related to each other.

Coefficients (Model 1)

                            Unstandardized         Standardized                      Collinearity Statistics
Variable                    B        Std. Error    Beta      t        Sig.    Tolerance   VIF
(Constant)                  88.769   5.370                   16.532   .000
Rank                        -.092    .017          -.367     -5.449   .000    .373        2.678
Out-state Tuition fee       .221     .092          .135      2.408    .017    .540        1.851
Number of Enrollments       .129     .093          .113      1.388    .167    .256        3.900
Undergraduate Population    -.268    .140          -.176     -1.922   .056    .203        4.932
Fall 2011 Acceptance Rate   -.080    .048          -.119     -1.682   .094    .339        2.950
Student/Faculty             -.538    .301          -.137     -1.791   .075    .289        3.462
Private                     1.142    2.321         .033      .492     .623    .380        2.631
Financial Aid               -.129    .045          -.167     -2.900   .004    .508        1.967
% Male >= 50                -4.739   1.466         -.149     -3.232   .001    .798        1.252
% White >= 50               1.421    1.735         .041      .819     .414    .660        1.516
Part-time                   2.310    1.452         .077      1.590    .113    .719        1.390
Athletic                    .150     .072          .108      2.081    .039    .629        1.589

CONCLUSION

After conducting this study of the 4-year graduation rate of the top 200 American colleges as ranked by Forbes, our team is very satisfied with the result. We were able to create a regression model that explains about 66.7% of the variation in the 4-year graduation rate. Our model also indicates that the out-state tuition fee, the percentage of male students, the school's rank and the percentage of students who receive financial aid, all statistically significant at the 1% level of significance, are important factors in predicting the graduation rate of a particular college in the United States.

A higher ranked school, indicated by a smaller number in the RANK variable, should have a higher graduation rate than a similar school that is ranked below it.
We believe that this result is very reasonable, since a higher ranked school tends to have an overall environment that encourages students to perform better in their academic career. We also think that Forbes is doing a very good job of evaluating colleges across the country. Schools cannot intervene in Forbes's ranking, but they can still try to raise their rank by improving their quality: adopting new technology, creating larger and better academic resources for students, orienting teaching and learning styles toward engagement and sharing, etc.

The out-state tuition fee is also a factor that has a positive relationship with the 4-year graduation rate. There are several possible explanations for this relationship: students may try to finish their program faster when they have to pay more to stay in school, or schools that collect higher tuition may have more capacity to improve their quality, which directly affects students' performance. However, this does not mean that schools should try to increase their tuition fees, which they are already doing. Instead of raising the cost, schools should try to spend their money efficiently to create a better study environment for students.

On the other hand, the percentage of students who receive financial aid has a negative relationship with the graduation rate. A larger share of students receiving financial aid is associated with a lower graduation rate, possibly because those students pay less than students without financial aid and can afford to stay in school for a longer time.

Our model also says that schools with more male students than female students have a lower graduation rate than similar schools with fewer male students than female students.
Although this is a very interesting point, we do not have any further comment on this result, since our hypothesis test failed to conclude that schools with a percentage of male students lower than 50% have a higher graduation rate than the opposite group.

Besides, our regression model indicates that the undergraduate student population, the percentage of varsity athletes and the student/faculty ratio also have an impact on the school's graduation rate, but not one as significant as that of the four variables above. Although the number of enrollments in Fall 2011, the percentage of part-time students and the acceptance rate in Fall 2011 are in our equation, they are statistically insignificant. We believe that these numbers can change easily in a short period of time. They can be very different from one semester to another; hence it is understandable that they have very little effect on the school's graduation rate.

We find one unexpected outcome in our regression model: the independent variable for the percentage of white students was removed by SPSS. We collected data about the percentage of white students in the belief that white students tend to outperform other races at school. This is also what we found in a study by the US Department of Education, which indicates that white and Asian students tended to graduate at a higher rate than African American and Hispanic students. The unexpected outcome may be due to the coarse nature of our data, since it can only separate white students from all other races.

Although we believe that our model is very good, there are still several limitations that need to be addressed. First, we think that the model could be much more accurate if we were able to collect data from more than just 200 schools. The more data we collect, the more precisely the relationship between the dependent variable and each independent variable can be estimated.
Second, our group should have collected some historical data, so that we could understand the data's trend over time and remove the influence of short-term characteristics of the data. Besides, we also suggest adding some other independent variables, such as the school retention rate, the average high school GPA of freshmen, the average SAT score of all students, what majors the school offers, the school's calendar system, etc.

APPENDIX A - FREQUENCY TABLE

Rank (RANK)

Range      Frequency   Cumulative frequency   Percentage   Cumulative percentage
0-25       25          25                     12.56%       12.56%
26-50      25          50                     12.56%       25.12%
51-75      25          75                     12.56%       37.68%
76-100     24          99                     12.06%       49.74%
101-125    25          124                    12.56%       62.30%
126-150    25          149                    12.56%       74.86%
151-175    25          174                    12.56%       87.42%
176-200    25          199                    12.56%       100.00%
Total      199                                100%

Outstate Tuition (OUT_STA_TUITION)

Range               Frequency   Cumulative frequency   Percentage   Cumulative percentage
0 - $10,000         7           7                      3.5%         3.5%
$10,000 - $20,000   3           10                     1.5%         5.0%
$20,000 - $30,000   34          44                     17.0%        22.0%
$30,000 - $40,000   72          116                    36.0%        58.0%
$40,000 - $50,000   84          200                    42.0%        100.0%
Total               199                                100%

Number of Total Enrollment (TOTAL_ENROLL)

Range             Frequency   Cumulative frequency   Percentage   Cumulative percentage
0 - 10,000        116         116                    58.5%        58.5%
10,000 - 20,000   31          148                    15.5%        74.0%
20,000 - 30,000   30          178                    15.0%        89.0%
30,000 - 40,000   13          191                    6.5%         95.5%
40,000 - 50,000   6           197                    3.0%         98.5%
50,000 - 60,000   2           199                    1.0%         99.5%
60,000 - 70,000   1           199                    0.5%         100.0%
Total             199                                100%

Undergraduate Population (UNDERGRAD_POP)

Range           Frequency   Cumulative frequency   Percentage   Cumulative percentage
0-5,000         123         123                    61.81%       61.81%
5,001-10,000    32          155                    16.08%       77.89%
10,001-15,000   10          165                    5.03%        82.91%
15,001-20,000   9           174                    4.52%        87.44%
20,001-25,000   5           179                    2.51%        89.95%
25,001-30,000   9           188                    4.52%        94.47%
30,001-35,000   7           195                    3.52%        97.99%
35,001-40,000   3           198                    1.51%        99.50%
40,001-45,000   1           199                    0.50%        100.00%
Total           199                                100%

Fall 2011 Acceptance Rate (ACCEPT_RATE)

Range      Frequency   Cumulative frequency   Percentage   Cumulative percentage
0-10%      10          10                     5.03%        5.03%
11%-20%    23          33                     11.56%       16.58%
21%-30%    25          58                     12.56%       29.15%
31%-40%    27          85                     13.57%       42.71%
41%-50%    27          112                    13.57%       56.28%
51%-60%    24          136                    12.06%       68.34%
61%-70%    38          174                    19.10%       87.44%
71%-80%    17          191                    8.54%        95.98%
81%-90%    8           199                    4.02%        100.00%
91%-100%   0           199                    0.00%        100.00%
Total      199                                100%

Student/Faculty Ratio (STU_FAC_RATIO)

Range   Frequency   Cumulative frequency   Percentage   Cumulative percentage
0-5     2           2                      1.01%        1.01%
6-10    90          92                     45.23%       46.23%
11-15   74          166                    37.19%       83.42%
16-20   28          194                    14.07%       97.49%
21-25   4           198                    2.01%        99.50%
26-30   1           199                    0.50%        100.00%
Total   199                                100%

4-Years Graduation Rate (GRAD_RATE)

Range        Frequency   Cumulative frequency   Percentage   Cumulative percentage
0 - 20%      0           0                      0.00%        0.00%
21% - 40%    7           7                      3.52%        3.52%
41% - 60%    45          52                     22.61%       26.13%
61% - 80%    82          134                    41.21%       67.34%
81% - 100%   65          199                    32.66%       100.00%
Total        199                                100.00%

Financial Aid (FINANCIAL_AID)

Range        Frequency   Cumulative frequency   Percentage   Cumulative percentage
0 - 20%      5           5                      2.5%         2.5%
21% - 40%    2           7                      1.0%         3.5%
41% - 60%    27          34                     13.5%        17.0%
61% - 80%    77          111                    38.5%        55.5%
81% - 100%   89          199                    44.5%        100.0%
Total        199                                100%

Athlete (ATH)

Range       Frequency   Cumulative frequency   Percentage   Cumulative percentage
0 - 10%     97          97                     48.74%       48.74%
11% - 20%   43          140                    21.61%       70.35%
21% - 30%   47          187                    23.62%       93.97%
31% - 40%   11          198                    5.53%        99.50%
41% - 50%   1           199                    0.50%        100.00%
Total       199                                100.00%

Private (PRI)

Type      Frequency   Percentage
Private   154         77.39%
Public    45          22.61%
Total     199         100.00%

Percentage of Part-time Student (PART)

Range   Frequency   Percentage
<2%     76          38.19%
≥2%     123         61.81%
Total   199         100.00%

Percentage of Male Students (MALE_PERCENT)

Range       Frequency   Percentage
≥ 50% (1)   59          29.5%
< 50% (0)   141         70.5%
Total       200         100.0%

Percentage of White Students (WHITE_PERCENT)

Range   Frequency   Percentage
≥ 50%   48          24.0%
< 50%   152         76.0%
Total   200         100.0%
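
The cumulative columns in the frequency tables above follow mechanically from the frequency column, so they can be cross-checked with a few lines of code. The Python sketch below is a minimal illustration using the Rank (RANK) frequencies from this appendix. The function name frequency_table and the dictionary keys are hypothetical names chosen for this example. Note that it computes cumulative percentages from the running counts, so a value can differ by 0.01 from one obtained by adding up already-rounded percentages.

```python
from itertools import accumulate


def frequency_table(labels, counts):
    """Build frequency-table rows with cumulative counts and percentages.

    Hypothetical helper for illustration only.
    """
    total = sum(counts)
    rows = []
    # accumulate() yields the running total of the counts, one per class.
    for label, count, running in zip(labels, counts, accumulate(counts)):
        rows.append({
            "range": label,
            "frequency": count,
            "cumulative_frequency": running,
            "percentage": round(100 * count / total, 2),
            "cumulative_percentage": round(100 * running / total, 2),
        })
    return rows


# Rank classes and frequencies as listed in the Rank (RANK) table.
rank_labels = ["0-25", "26-50", "51-75", "76-100",
               "101-125", "126-150", "151-175", "176-200"]
rank_counts = [25, 25, 25, 24, 25, 25, 25, 25]

rank_table = frequency_table(rank_labels, rank_counts)
for row in rank_table:
    print(row)
```

Running the same function on the other tables is a quick way to confirm that each cumulative frequency column ends at the table's total and never decreases.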