Multiple Regression Analysis

TOWSON UNIVERSITY
Multiple Regression
Analysis
Four Years Graduation Rate
Nhat Nguyen – Minh Tran – Viet Le
12/11/2012
Multiple Regression Analysis – 4-Years Graduation Rate
Nhat Nguyen – Minh Tran – Viet Le
ABSTRACT
The United States education system is widely regarded as one of the best in the world. Most
students who graduate from an American college have the knowledge and skills they need to be
successful. On the other hand, this high-quality school system comes at a price: the cost of a
bachelor’s degree in the U.S. is rising every semester. Surprisingly, research shows that
students are taking longer to finish their higher education; they tend to take 5 to 6 years to
finish a 4-year program. This is a bad sign for both the students and the American education
system. To find out which factors are behind this problem, we decided to build a regression
model of the 4-year graduation rate of colleges across the country. For this study, we collected
data on the Top 200 American Colleges of 2011, as ranked by Forbes. Forbes is known as one of
the most reliable sources of economic, business and financial news and information. Before
building the regression model, we reviewed what others have found about graduation rates and ran
a hypothesis test in order to make our regression model more accurate.
LITERATURE REVIEW
To conduct the most effective study of college graduation rates, our group first looked at what
others have found on the topic. We found a very useful article describing a study of the
“factors behind college 4-year and 6-year graduation rates”, which is also our main topic. The
study was conducted by Brian Gallardo in Fall 2004. Although it is now 8 years old, we believe
that the outcome of Gallardo’s study is still valuable for us. After reviewing several prior
studies of graduation rates, Gallardo decided to use an ordinary least squares analysis to
understand the impact of different variables on the graduation rate. The graduation rate was
defined in terms of the time it takes a full-time, non-transfer student to receive his/her
bachelor’s degree. The variables in his model are undergraduate population, in-state tuition,
student/faculty ratio, retention rate, average high school GPA of admitted students, and the
region within the U.S. where the school is located. The information on 100 public schools in the
United States was collected from Peterson’s 4-year College and University (2005) and the
National Center for Education Statistics website.
Gallardo conducted two separate analyses, one for the 4-year and one for the 6-year graduation
rate. Unfortunately, the outcome of the 4-year analysis was not as good as expected: the set of
independent variables explains only about 46% of the variation in the 4-year graduation rate. On
the other hand, the analysis of the 6-year graduation rate was very good, explaining 81% of the
variation in the 6-year graduation rate. From these results, Gallardo concluded that the 4-year
and 6-year graduation rates of a school can be explained by the average high school GPA of its
admitted students, its retention rate and its in-state tuition. Unlike high school GPA, which
was found to be a weak predictor, the retention rate and in-state tuition are two very important
variables in predicting the graduation rate of a school.
Gallardo’s study also has some odd outcomes and limitations, such as the differences between
the analyses of the 4-year and 6-year graduation rates and the small sample size, both
horizontally and vertically.
Overall, the study “What are the Factors behind College Graduation Rates?” by Brian Gallardo is
very valuable for us. It helps us narrow the scope of our study to certain aspects. By referring
back to this study while conducting ours, we can revise our own study and make it more accurate.
We also found a report titled “Placing College Graduation Rates in Context”, published by the
U.S. Department of Education. This report shows how gender and ethnicity influence the
graduation rate of a university or college.
According to the report, the average 6-year graduation rate for women was 60 percent, nearly 6
percentage points higher than the comparable rate for men. In general, the gap in graduation
rates between women and men changed as the number of low-income students increased. In addition,
the gender gap was greatest in institutions with large low-income enrollments.
Within each racial/ethnic group, as with all students, graduation rates tended to decline as the
overall proportion of low-income students in the cohort increased. In general, White and Asian
students tended to graduate at higher rates than Black and Hispanic students. The average gap in
graduation rates between White and Black students was 18 percentage points, and between
White and Hispanic students was 12 percentage points. However, in very selective baccalaureate
institutions with large low-income enrollments, which include many Historically Black Colleges
and Universities, the graduation rate of Black students was slightly higher than that of White
students.
DESCRIPTIVE STATISTICS
After researching the factors behind the graduation rate of a college, we decided to build a
model using the following variables:
Rank (RANK)
The RANK independent variable shows each school’s place in the “Top 200 American Colleges”
ranked by Forbes. The rank runs from 1 to 200. The higher the rank is (in other words, the lower
the value of the variable), the better Forbes considers the university to be. We believe that
students perform better in higher-ranked schools, which boosts the school’s graduation rate. The
mean of this variable is 100.59. The median is 101, which means that about half of the colleges
are ranked below 101 and half above it. 25% of the colleges in the data set have ranks below 50
and 25% have ranks above 151, so the interquartile range of this variable is 101. The standard
deviation, which measures the typical difference between a school’s rank and the mean, is 58.01.
Within the RANK variable, the case closest to each of the statistics above is simply the school
at the respective rank. For example, the school ranked number 1, Princeton University, has the
rank closest to the minimum of the variable. The RANK variable also has a Pearson’s Coefficient
of Skewness (PCS) of -0.0212, which indicates that the distribution is symmetric and the mean is
close to the median. There are no outliers in this variable.
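The skewness figures reported in this section are consistent with Pearson's second coefficient of skewness, 3 x (mean - median) / standard deviation. The paper does not state the formula, so that choice is an assumption here. The short Python sketch below (the function name is ours) reproduces the RANK value from the summary statistics quoted above.

```python
def pearson_skewness(mean: float, median: float, std_dev: float) -> float:
    """Pearson's second coefficient of skewness: 3 * (mean - median) / std_dev."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return 3.0 * (mean - median) / std_dev


# Summary statistics for RANK as quoted in the text.
rank_pcs = pearson_skewness(mean=100.59, median=101.0, std_dev=58.01)
print(round(rank_pcs, 4))  # -0.0212
```

A value near zero means the mean and median nearly coincide; negative values indicate a longer left tail and positive values a longer right tail.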
Figure 1: RANK Histogram (x-axis: Rank; y-axis: Number of Schools)
Out-of-State Tuition (OUT_STA_TUITION)
This variable is the tuition fee a student has to pay for one year of a 4-year undergraduate
program. Schools with higher tuition fees seem to offer a better education, so we think they
will have higher graduation rates. The mean of this variable is $35,742; in other words, the
average tuition fee is $35,742. The median is $38,290, which means that 50% of the schools
charge less than $38,290; the school with the tuition fee closest to this point is The College
of Wooster. 25% of the schools have a tuition fee lower than $31,822 (Marquette University) and
25% have a tuition fee higher than $41,992 (Santa Clara University). The highest tuition fee is
$45,290. Our data also include some military academies that charge no tuition, which is why the
minimum in our data is $0. The standard deviation is $8,875.835. Pearson’s skewness is -0.86,
which indicates that the tuition data are negatively skewed: there are more high-cost colleges
(compared to the mean) than low-cost colleges. The outliers of this variable are $0 (US Military
Academy; US Air Force Academy; US Naval Academy; US Coast Guard Academy), $910 (Berea College)
and $4,650 (Brigham Young University).
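The paper does not say how outliers were identified. The cases it lists are consistent with flagging values more than three standard deviations from the mean, so the Python sketch below assumes that rule (an assumption, not the authors' stated method) and applies it to the tuition statistics quoted above. The function names are illustrative.

```python
def outlier_bounds(mean: float, std_dev: float, k: float = 3.0) -> tuple:
    """Return (lower, upper) cutoffs at k standard deviations from the mean.

    k = 3 is an assumed rule; the paper does not state which rule it used.
    """
    return mean - k * std_dev, mean + k * std_dev


def is_outlier(value: float, mean: float, std_dev: float, k: float = 3.0) -> bool:
    """True if value lies outside the k-standard-deviation band."""
    lower, upper = outlier_bounds(mean, std_dev, k)
    return value < lower or value > upper


# OUT_STA_TUITION summary statistics quoted in the text (US dollars).
TUITION_MEAN, TUITION_SD = 35_742, 8_875.835

# The three reported outlier values, plus the reported maximum for comparison.
for tuition in (0, 910, 4_650, 45_290):
    print(tuition, is_outlier(tuition, TUITION_MEAN, TUITION_SD))
```

Under this assumed rule the lower cutoff is roughly $9,100, so the three reported low values are flagged while the maximum of $45,290 is not.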
Figure 2: OUT_STA_TUITION Histogram (title: Tuition Fee; bins of $10,000 from $0 to <$50,000)
Total Number of Enrollment (TOTAL_ENROLL)
This variable shows the total enrollment of each university or college in the 2011 academic
year. The more students want to study at a school, the higher the quality of its education is
likely to be, so we think there is a direct relationship between total enrollment and graduation
rate. The average enrollment in our data is 12,988. The highest enrollment is 61,545 and the
lowest is 171. 100 universities (50% of the data) have an enrollment under 7,195 (the median,
which is the total enrollment of the University of Puget Sound); the other 100 universities have
an enrollment higher than 7,195. 25% of the schools have a total enrollment lower than 3,587
(Haverford College and Goucher College) and 25% have one higher than 20,828 (Virginia
Polytechnic Institute and State University is the school with the closest total enrollment). The
standard deviation, shown in the descriptive statistics table, is 12,758. The data are
positively skewed, with a PCS of 1.36: there are more colleges and universities with an
enrollment lower than the mean. The outliers of this variable are 54,871 (St. John’s University
– New York) and 61,545 (University of California – Los Angeles).
Figure 3: TOTAL_ENROLL Histogram (title: Total Enrollment; bins of 10,000 from 0 to <70,000)
Undergraduate Population (UNDERGRAD_POP)
The UNDERGRAD_POP variable shows the number of undergraduate students in each school in the data
set. We think that students at bigger schools have more activities besides studying and,
accordingly, will not perform as well as students at smaller schools. As a result, the
graduation rate at a bigger school is expected to be lower than at a similar school with a
smaller undergraduate population. The mean is 7,549.10, which means that on average each college
in the list has about 7,549 undergraduate students. In our data set, Georgetown University, with
7,590 undergraduate students, is closest to the mean. The median of 2,950 indicates that about
100 colleges in the data set have fewer than 2,950 undergraduate students and the other 100 have
more. The College of the Holy Cross, which has 2,905 undergraduate students, is the median case
of our data. Among the Top 200 America’s Colleges by Forbes, the College of the Atlantic has the
lowest number of undergraduate students (354), while Ohio State University – Main Campus has the
highest (42,916). The first quartile is 1,769, which means 25% of the colleges in the data set
have fewer than 1,769 undergraduate students; the third quartile is 8,127, which means 75% of
the colleges have fewer than 8,127. The schools at the first and third quartiles are Bates
College and Columbia University in the City of New York, respectively. The first and third
quartiles give an interquartile range of 6,358 students. The PCS of this variable is 1.46, which
indicates that UNDERGRAD_POP is positively skewed: there are more colleges with a low number of
undergraduate students than colleges with a high number. The PCS is also supported by the fact
that there are 4 outliers in this variable. They are the four colleges with the highest number
of undergraduate students: the University of Texas at Austin (38,437 students), Pennsylvania
State University – Main Campus (38,954 students), Texas A&M University (39,867 students) and
Ohio State University – Main Campus (42,916 students).
Figure 4: UNDERGRAD_POP Histogram (x-axis: Undergraduate Population; y-axis: Number of Schools)
Fall 2011 Acceptance Rate (ACCEPT_RATE)
The ACCEPT_RATE variable shows the percentage of applications admitted by the Top 200 America’s
Colleges in the Fall semester of 2011. We believe that the lower the acceptance rate is, the
more competitive the study environment will be at a particular school. For that reason, students
will have more incentive to improve their performance, which will increase the graduation rate.
The mean of this variable shows that, on average, 45.19% of applications to the Top 200
America’s Colleges were admitted. The median is 45%, which is very close to the mean. Seven
schools accepted 45% of the applications they received in Fall 2011: Grinnell College, Franklin
and Marshall College, Smith College, Colorado School of Mines, University of California-Santa
Barbara, University of California-Irvine and University of Maryland-College Park. The standard
deviation is 21.636 percentage points. 25% of the schools in the data set have an acceptance
rate lower than the first quartile of 27%, and 25% have a rate higher than the third quartile of
63%. These two numbers give an interquartile range of 36 percentage points. Bates College and
Hamilton College are the schools with acceptance rates closest to the first quartile; Wabash
College, Hiram College, Loyola University Maryland, Brigham Young University, Texas A & M
University and Ohio State University-Main Campus accepted 63% of submitted applications in Fall
2011. The PCS of this variable is 0.027, which is very low, as expected, since the mean and the
median are very close. The PCS also indicates that the acceptance rate data are symmetrically
distributed. There are no outliers in this variable.
Figure 5: ACCEPT_RATE Histogram (title: Fall 2011 Acceptance Rate; x-axis: Acceptance Rate; y-axis: Number of Schools)
Students/Faculty Ratio (STU_FAC_RATIO)
The STU_FAC_RATIO variable shows the number of students per faculty member at the school. This
ratio indicates how well the faculty can take care of their students. We think that the help
students receive from their professors is a very important factor in their performance in
school: the more help from faculty, the better the student will perform. The lower the ratio is,
the more available the faculty will be to help students. As a result, we believe that the lower
the variable is, the better the students will perform and hence the higher the graduation rate
will be. On average, each of the 200 colleges in the data set has 11.61 students per faculty
member (the mean). Half of the schools in the list have fewer than 11 students per faculty
member and the other half have more than 11 (the median). The standard deviation is 3.713. In
this variable, the California Institute of Technology has the minimum value of 3 students per
faculty member, and Florida State University has the highest ratio of 27. According to the
quartiles, a quarter of the data has a student/faculty ratio below 9 and a quarter has a ratio
above 13. The difference between the first and third quartiles is the interquartile range, which
is 4 in this case. The STU_FAC_RATIO variable has a Pearson’s Coefficient of Skewness of 0.49,
which indicates that the data are roughly symmetrically distributed; however, since the PCS is
very close to 0.5, there may be a high outlier. As expected, there is one outlier among the
student/faculty ratios: the ratio of 27 students per faculty member at Florida State University.
Figure 6: STU_FAC_RATIO Histogram (x-axis: Student/Faculty Ratio, bins 0-5 through 26-30; y-axis: Number of Schools)
4-Years Graduation Rate (GRAD_RATE)
The “Four Years Graduation Rate” variable is the percentage of students who graduated in four
years at each of the Top 200 America’s Colleges ranked by Forbes. This is the dependent
variable, which we want to predict using the other variables. The mean of this variable shows
that, on average, 70.78% of students at the Top 200 America’s Colleges graduated in 4 years. The
median is 72.50%. 13 schools have a rate close to the median, among them Clark University (73%),
Saint Michaels College (73%) and the University of Southern California (72%). The standard
deviation is 15.38 percentage points. This is a large number; it means that graduation rates
differ widely between schools. 25% of the schools in the data set have a 4-year graduation rate
lower than the first quartile of 60%; 7 schools have exactly that rate, such as College of the
Atlantic, Southern Methodist University and Virginia Military Institute. 25% of the schools have
a rate higher than the third quartile of 83%; 5 universities have exactly that rate, such as
United States Military Academy, Colgate University and Wake Forest University. These two numbers
give an interquartile range of 23 percentage points. The PCS of this variable is -0.404, which
means that our graduation rate data are negatively skewed. There is only one outlier in our
data: the University of Utah, with a 4-year graduation rate of only 23%.
Figure 7: GRAD_RATE Histogram (x-axis: Graduation Rate, bins 0%-20% through 81%-100%; y-axis: Number of Schools)
Athlete (ATH)
The “Athletic” variable is the percentage of the student body who are varsity athletes at each
of the Top 200 America’s Colleges ranked by Forbes. The mean of this variable shows that, on
average, 13.03% of the student body at these colleges are varsity athletes. The median is 11%;
5 schools have this percentage, including Rice University, Georgetown University and Saint Marys
College of California. Being in better physical shape seems to have a positive effect on the
ability to study, so a low percentage of athletes may decrease the graduation rate slightly. The
standard deviation is 10.53 percentage points. This is a large number compared with the mean and
median, so the percentage of athletes differs widely between schools. 25% of the schools in the
data set have a percentage of athletes lower than the first quartile of 3%; 12 schools have
exactly that percentage, among them University of California-Los Angeles, University of
Michigan-Ann Arbor and University of Washington-Seattle Campus. 25% of the schools have a
percentage higher than the third quartile of 22%; 10 universities share that percentage, such as
Lafayette College, Wheaton College (MA) and Lawrence University. These two numbers give an
interquartile range of 19 percentage points. 20 schools report no varsity athletes at all, such
as Harvey Mudd College, Wellesley College and Mount Holyoke College. The PCS of this variable is
0.578, which implies that the data are positively skewed. Erskine College and Seminary is the
only school whose percentage of varsity athletes is markedly higher than the others, at 45%.
Figure 8: ATH Histogram (title: Athletic Rate Histogram; x-axis: Athletic Students Ratio, bins 0%-10% through 41%-50%; y-axis: Number of Schools)
Financial Aid (FINANCIAL_AID)
This variable is the percentage of students who receive financial aid at each university or
college. Financial aid is usually given to students with good academic results, so in a school
where a high share of students receive financial aid, the graduation rate should be higher. The
shape of the histogram is quite similar to the histogram of the tuition fee, so we think there
is a direct relationship between tuition fee and financial aid. The mean is 75.88%, which means
that, overall, 75.88% of the students at these colleges receive financial aid. The highest and
most frequent value is 100%, which means that many schools offer financial aid to all of their
students. The median is 77% (Carleton College; Gettysburg College; Vassar College; Virginia
Polytechnic Institute and State University; University of California-Davis): 50% of the colleges
provide financial aid to fewer than 77% of their students and 50% provide it to more than 77%.
25% of the schools provide financial aid to fewer than 64% of their students and 25% provide it
to more than 93%. The first quartile (64%) corresponds to Claremont McKenna College; Amherst
College; Vanderbilt University; Brown University; University of Michigan-Ann Arbor. The third
quartile (92.75%) corresponds to University of Redlands; Willamette University; Loyola Marymount
University; University of Georgia. The standard deviation, shown in the descriptive statistics
table, is 18.90. The data show essentially no skew, with a skewness of -0.17. 44.5% of the
schools provide financial aid to over 80% of their students, and only 3.5% of the schools
provide it to fewer than 20% (see frequency table). The outliers of this variable are 0% (US Air
Force Academy; US Naval Academy; US Coast Guard Academy) and 11% (US Military Academy).
Figure 9: FINANCIAL_AID Histogram (title: Financial Aid; bins of 20 percentage points from 0% to 100%)
Private (PRI)
The “Private” variable distinguishes the private schools from the public schools in the Top 200
America’s Colleges ranked by Forbes. In this variable, 1 represents a private school and 0
represents a public school. The mean of this variable shows that 77% of the schools in the Top
200 America’s Colleges are private. As can be seen from the graph, there are more private
schools than public ones, which will have some influence on our regression model. We expect
private schools to offer a better environment for students to study and to graduate sooner than
public schools.
Percentage of Male Students (MALE_PERCENT)
This dummy variable indicates whether the percentage of male students at a university or college
is at least as large as the percentage of female students. Based on the report “Placing College
Graduation Rates in Context”, we think that schools with a higher percentage of female students
will have a higher graduation rate than the others.
A value of “1” means the percentage of male students is equal to or higher than 50%; otherwise
the value is “0”. 70.5% of the schools have fewer male students than female students. The mean
of the variable is 0.295, which shows that schools with 50% or more male students are a
minority.
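The coding described above can be sketched in a few lines of Python. The sample percentages are made up for illustration and the function name is ours; the point is that the mean of a 0/1 dummy is the share of cases coded 1, which is why a mean of 0.295 corresponds to 29.5% of schools.

```python
def male_dummy(male_percent: float) -> int:
    """Code a school as 1 if at least 50% of its students are male, else 0."""
    return 1 if male_percent >= 50 else 0


# Hypothetical male-student percentages for five schools (not from the data set).
sample_percentages = [48.0, 52.5, 50.0, 39.0, 45.5]
dummies = [male_dummy(p) for p in sample_percentages]

# The mean of a 0/1 variable equals the share of schools coded 1.
share_coded_one = sum(dummies) / len(dummies)
print(dummies, share_coded_one)  # [0, 1, 1, 0, 0] 0.4
```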
Percentage of White Students (WHITE_PERCENT)
This dummy variable indicates whether White students make up at least 50% of the total student
population of a university or college. Based on the report “Placing College Graduation Rates in
Context”, we think that schools with a higher percentage of White students will have a higher
graduation rate than the others. A value of “1” represents a school where the percentage of
White students is equal to or higher than 50% of the total population; otherwise the value is
“0”. We can see in the graph that only 24% of the schools have a White student population of 50%
or more. The mean of the variable is 0.24, which shows that schools with 50% or more White
students are a minority.
Percentage of Part-time Student (PART)
The “Part-Time” variable reflects the percentage of part-time students at the schools in the Top
200 America’s Colleges ranked by Forbes. In this data set, “1” represents a school where fewer
than 2% of students are part-time, and “0” represents a school where 2% or more are part-time.
The mean of this variable is 0.38, which shows that only 38% of the schools in the Top 200
America’s Colleges have fewer than 2% part-time students. The part-time student body is
therefore sizable at most schools, and this will have an impact on our regression model since it
takes longer for a part-time student to graduate.
After choosing the variables and collecting the data, we found that it might be interesting to
compare the graduation rate between male-dominated colleges and female-dominated colleges.
HYPOTHESIS TEST
We test the relationship between the percentage of male students and the graduation rate. Our
claim is that the graduation rate is higher at universities and colleges where the percentage of
male students is lower than 50% than at those where it is 50% or higher.
We conduct a hypothesis test to check that claim.
$H_0:\ \mu_{\text{GRAD\_RATE},\ \text{MALE\_PERCENT}=1} \ge \mu_{\text{GRAD\_RATE},\ \text{MALE\_PERCENT}=0}$
$H_\alpha:\ \mu_{\text{GRAD\_RATE},\ \text{MALE\_PERCENT}=1} < \mu_{\text{GRAD\_RATE},\ \text{MALE\_PERCENT}=0}$
Since the significance of Levene’s test is less than 0.05, we use the “Equal variances not
assumed” row. Because the alternative hypothesis is one-sided, the p-value is half of the
two-tailed significance in that row:
$p\text{-value} = \dfrac{\text{Sig.}}{2} = \dfrac{0.249}{2} = 0.1245$
This means that if H0 is true, there is a 12.45% chance of getting a difference more extreme
than the sample difference of 3.0898305 that we have in this test.
α = 0.05
Decision Rule: Reject H0 if p-value < α; otherwise, do not reject H0.
Since 0.1245 > 0.05, we cannot reject H0. We do not have enough evidence to say that the
graduation rate is higher at universities and colleges where the percentage of male students is
lower than 50% than at those where it is 50% or higher.
Independent Samples Test: 4 years graduation rate (%)

                               Levene's Test for Equality of Variances    t-test for Equality of Means
                               F         Sig.                             t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference
Equal variances assumed        20.454    .000                             1.368    197       .173              3.0898305         2.2582316
Equal variances not assumed                                               1.162    80.022    .249              3.0898305         2.6584805
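As a rough check on the test above, the t statistic is the mean difference divided by its standard error, and the one-tailed p-value is half the two-tailed significance when the sample difference lies in the direction of the alternative hypothesis. The Python sketch below uses the values from the SPSS output as read here (assigning 3.0898305 to the mean difference and 2.6584805 to the standard error is our reading of the table); variable names are illustrative.

```python
# Values from the "Equal variances not assumed" row of the SPSS output.
mean_difference = 3.0898305       # difference in mean 4-year graduation rate
std_error_difference = 2.6584805
sig_two_tailed = 0.249
alpha = 0.05

# t statistic: mean difference divided by its standard error.
t_statistic = mean_difference / std_error_difference

# One-tailed p-value for a directional alternative hypothesis
# (valid when the observed difference is in the hypothesized direction).
p_one_tailed = sig_two_tailed / 2

reject_h0 = p_one_tailed < alpha
print(round(t_statistic, 3), p_one_tailed, reject_h0)
```

The recomputed t agrees with the reported 1.162, and the halved significance stays well above the 0.05 threshold, so H0 is not rejected.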
MULTIPLE REGRESSION ANALYSIS
This is the regression model that we will use to predict the 4-year graduation rate of a
particular college in the U.S. For the purpose of the model, we express the out-of-state tuition
fee, total enrollment and undergraduate population in units of 1,000. For example, University of
California-Los Angeles has an out-of-state tuition of $35,564, which shows up in our model as
35.564 (thousand dollars). We use the IBM SPSS program to conduct our analysis.
The dependent variable of our regression model is the 4-year graduation rate. We expect
Out-state Tuition fee, Number of Enrollments, Private, Financial Aid, Percentage of White, and
Part-time to have a positive relationship with the dependent variable, and Rank, Fall 2011
Acceptance Rate, Student/Faculty, and Percentage of Male to have a negative relationship with
it. We do not have any expectation about the relationship between the dependent variable and the
Undergraduate Population and Athletic variables.
After running the regression, we chose model 3, which has the highest adjusted R² and the
smallest standard error of the estimate. We find that we should not use either Private or the
percentage of White students, since removing these variables decreases our standard error of the
estimate.
Variables Entered/Removed

Model   Variables Entered                               Variables Removed        Method
1       Athletic, Fall 2011 Acceptance Rate,            .                        Enter
        % Male =50, Part-time, Out-state Tuition fee,
        % White =50, Private, Financial Aid,
        Number of Enrollments, Rank, Student/Faculty,
        Undergraduate Population
2       .                                               Private                  Backward (criterion: Probability of F-to-remove >= .100).
3       .                                               % White =50              Backward (criterion: Probability of F-to-remove >= .100).
4       .                                               Number of Enrollments    Backward (criterion: Probability of F-to-remove >= .100).
5       .                                               Part-time                Backward (criterion: Probability of F-to-remove >= .100).

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .828    .685       .665                8.4425118
2       .827    .685       .666                8.4253857
3       .827    .684       .667                8.4172614
4       .825    .681       .666                8.4257515
5       .823    .678       .664                8.4498776
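The adjusted R² column can be reproduced approximately from R² with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of predictors. The Python sketch below assumes n = 200 schools (the paper does not state how many complete cases entered the regression) and counts 12 predictors in model 1 and 10 in model 3. Because the R² inputs are rounded to three decimals, the results agree with the table only to about the third decimal.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_SCHOOLS = 200  # assumption: all 200 schools were used in the regression

# (R Square, number of predictors) taken from the Model Summary table.
models = {1: (0.685, 12), 3: (0.684, 10)}
adjusted = {m: adjusted_r_squared(r2, N_SCHOOLS, k) for m, (r2, k) in models.items()}
for m, value in adjusted.items():
    print(m, round(value, 3))
```

The penalty term grows with the number of predictors, which is why model 3 can have a slightly lower R² than model 1 but a higher adjusted R².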
Regression Equation:
4 years graduation rate (%) = 90.570 - 0.093(Rank) + 0.226(Out-State Tuition fee) +
0.103(Number of Enrollments) - 0.278(Undergraduate Population) - 0.076(Fall 2011 Acceptance
Rate) - 0.537(Student/Faculty) - 0.127(Financial Aid) - 4.831(%Male =50) + 2.304(Part-time) +
0.157(Athletic)
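To illustrate how the equation is applied, the Python sketch below plugs a hypothetical school into the model 3 coefficients. The input values are invented for illustration and the dictionary keys are our own labels; tuition, enrollment and undergraduate population are entered in thousands, as described above.

```python
INTERCEPT = 90.570

# Model 3 coefficients from the regression equation.
COEFFICIENTS = {
    "rank": -0.093,
    "out_state_tuition_k": 0.226,   # tuition in $1,000s
    "total_enroll_k": 0.103,        # enrollment in 1,000s
    "undergrad_pop_k": -0.278,      # undergraduates in 1,000s
    "accept_rate": -0.076,          # percent
    "stu_fac_ratio": -0.537,
    "financial_aid": -0.127,        # percent
    "male_50": -4.831,              # 1 if male share >= 50%, else 0
    "part_time": 2.304,             # 1 if part-time share < 2%, else 0
    "athletic": 0.157,              # percent
}


def predict_grad_rate(school: dict) -> float:
    """Predicted 4-year graduation rate (%) for one school."""
    return INTERCEPT + sum(COEFFICIENTS[name] * school[name] for name in COEFFICIENTS)


# Hypothetical school, not taken from the data set.
example_school = {
    "rank": 100,
    "out_state_tuition_k": 35.0,
    "total_enroll_k": 10.0,
    "undergrad_pop_k": 5.0,
    "accept_rate": 45,
    "stu_fac_ratio": 11,
    "financial_aid": 75,
    "male_50": 0,
    "part_time": 1,
    "athletic": 13,
}
predicted = predict_grad_rate(example_school)
print(round(predicted, 3))  # 74.313
```

Changing one input while holding the others fixed shifts the prediction by that variable's coefficient, which is how the coefficients are read in the interpretation that follows.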
Coefficients (Model 3)

                            Unstandardized          Standardized
                            Coefficients            Coefficients
Variable                    B        Std. Error     Beta           t        Sig.
(Constant)                  90.570   4.997                         18.126   .000
Rank                        -.093    .017           -.368          -5.487   .000
Out-state Tuition fee       .226     .086           .138           2.636    .009
Number of Enrollments       .103     .087           .090           1.175    .241
Undergraduate Population    -.278    .132           -.182          -2.097   .037
Fall 2011 Acceptance Rate   -.076    .047           -.112          -1.609   .109
Student/Faculty             -.537    .294           -.137          -1.830   .069
Financial Aid               -.127    .042           -.165          -3.012   .003
% Male >= 50                -4.831   1.456          -.152          -3.317   .001
Part-time                   2.304    1.443          .077           1.596    .112
Athletic                    .157     .071           .113           2.202    .029
Statistical Significance and Interpretation:
RANK is statistically significant at the 1% level of significance. For each one-place drop in a
university's ranking (that is, each one-unit increase in the RANK variable), the percentage of
students who graduate in four years decreases by 0.093 percentage points, holding all else
constant. This is as expected: because the ranking is based on the quality of the school, a
better-ranked school offers a better environment for students to study and graduate sooner.
PERCENTAGE OF MALE is statistically significant at the 1% level of significance. The
coefficient implies that if male students make up 50% or more of the student body, the four-year
graduation rate is 4.831 percentage points lower, holding all else constant. This is not really
evidence that female students are better students than male students, but female students did
do better than male students at these universities. One possible reason is that most female
students work harder and stay more focused than their male peers.
OUT-STATE TUITION FEE is statistically significant at the 1% level of significance. For
each $1,000 increase in a university's annual out-state tuition fee, the percentage of students
who graduate in four years increases by 0.226 percentage points, holding all else constant. A
possible reason is that a high tuition fee pushes students to make more of an effort to finish
university as soon as possible.
FINANCIAL AID is statistically significant at the 1% level of significance. For each additional
percentage point of students who receive financial aid from the university, the percentage of
students who graduate in four years decreases by 0.127 percentage points, holding all else
constant. It could be that students who receive scholarships from their schools want to stay
longer to study more.
UNDERGRADUATE POPULATION is statistically significant at the 5% level of
significance. For each additional 1,000 undergraduate students at the university, the percentage
of students who graduate in four years decreases by 0.278 percentage points, holding all else
constant. This relationship between the number of undergraduate students and the four-year
graduation rate is understandable: students who study in a smaller environment tend to focus
more on their studies, while students at crowded schools may be distracted by other activities.
ATHLETIC is statistically significant at the 5% level of significance. For each one-percentage-point
increase in the share of the student body who are varsity athletes, the percentage of students
who graduate in four years increases by 0.157 percentage points, holding all else constant. The
connection between the number of healthy students and the graduation rate is clear: a healthy
body gives students the energy and ability to study hard and think clearly. It also reduces the
stress of academic pressure.
STUDENT/FACULTY is statistically significant at the 10% level of significance. For each
additional student per professor, the percentage of students who graduate in four years
decreases by 0.537 percentage points, holding all else constant. It is easy to understand why
the number of students per professor influences the graduation rate: with more students, it is
harder for professors to help and teach each individual student.
NUMBER OF ENROLLMENTS, PART-TIME and FALL 2011 ACCEPTANCE RATE are
statistically insignificant, so we do not interpret their coefficients, since we cannot reject
the hypothesis that their true values are zero.
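The significance levels quoted above follow from comparing each variable's Sig. value with the 1%, 5% and 10% thresholds. The sketch below applies that rule to the Sig. column of the model 3 coefficients table. The function name and the dictionary are our own illustrative choices, and a Sig. value printed as .000 is treated simply as a very small number.

```python
# Sig. (p-value) column of the model 3 coefficients table.
P_VALUES = {
    "Rank": 0.000,
    "Out-state Tuition fee": 0.009,
    "Number of Enrollments": 0.241,
    "Undergraduate Population": 0.037,
    "Fall 2011 Acceptance Rate": 0.109,
    "Student/Faculty": 0.069,
    "Financial Aid": 0.003,
    "% Male >= 50": 0.001,
    "Part-time": 0.112,
    "Athletic": 0.029,
}

# Thresholds checked from strictest to loosest.
LEVELS = [(0.01, "1%"), (0.05, "5%"), (0.10, "10%")]


def significance_level(p_value):
    """Return the strictest level at which p_value is significant, or None."""
    for alpha, label in LEVELS:
        if p_value < alpha:
            return label
    return None


levels = {name: significance_level(p) for name, p in P_VALUES.items()}
insignificant = [name for name, level in levels.items() if level is None]

for name, level in levels.items():
    print(f"{name}: {level or 'not significant'}")
```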
Standard Error of the Estimate:
The standard error of the estimate in this case is 8.4172614, which means that, on average, our
prediction of the percentage of students who graduate in 4 years will be off by about 8.42
percentage points from the actual value. We consider this an acceptable standard error of the
estimate, since it is only about 8 points on the 100-point scale of the graduation rate.
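For readers who want to see where this number comes from, the standard error of the estimate is the square root of the residual mean square. The sketch below recomputes it from the residual sum of squares and degrees of freedom reported in the ANOVA table for model 3 (n = 199 schools, k = 10 independent variables). The variable names are ours.

```python
import math

# Figures reported in the ANOVA table for model 3.
residual_sum_of_squares = 13319.854
n_observations = 199
n_predictors = 10

# Residual degrees of freedom: n - k - 1.
residual_df = n_observations - n_predictors - 1

# Standard error of the estimate = sqrt(SSE / (n - k - 1)).
standard_error = math.sqrt(residual_sum_of_squares / residual_df)

print(residual_df)
print(round(standard_error, 4))
```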
R2 and Adjusted R2:
R2 in this case means that 68.4% of the variation in the percentage of students who graduate in
4 years at a university can be explained by the variation in the set of independent variables.
Adjusted R2 in this case is 66.7%, which means that 66.7% of the variation in the percentage of
students who graduate in 4 years can be explained by the variation in the set of independent
variables, after adjusting for the number of independent variables in our model. Overall, this
is a good enough model, since it explains over 66% of the variation. However, many other
factors may affect the graduation rate and are not in our regression, such as the school
retention rate, the average high school GPA of freshmen, the average SAT score of all students,
what majors the school offers, the school's calendar system, etc. With more information, we may
be able to increase the predictive power of our regression.
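Both figures can be reproduced from the sums of squares in the ANOVA table for model 3. The sketch below uses the usual textbook formulas, R2 = SSR / SST and adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1). The variable names are our own.

```python
# Sums of squares reported in the ANOVA table for model 3.
regression_ss = 28775.854
total_ss = 42095.709
n_observations = 199
n_predictors = 10

# Share of the total variation explained by the regression.
r_squared = regression_ss / total_ss

# Penalise R-square for the number of independent variables.
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / (
    n_observations - n_predictors - 1
)

print(round(r_squared, 3))
print(round(adjusted_r_squared, 3))
```

The adjustment matters most when comparing models with different numbers of variables, which is how it was used to choose model 3.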
F-test:
The F-test has the following hypotheses.
H0: β1 = β2 = β3 = . . . = βn = 0
H1: At least one β does not equal zero
In this case, since the significance of F is less than 0.01, our overall model is statistically
significant at the 1% level of significance. This means that at least one of our coefficients is
different from zero, and thus our model is able to explain some of the variation in the
percentage of students who graduate in 4 years.
ANOVA (Model 3)

             Sum of Squares  df   Mean Square  F       Sig.
Regression   28775.854       10   2877.585     40.615  .000
Residual     13319.854       188  70.850
Total        42095.709       198
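The F statistic in the table is the ratio of the regression mean square to the residual mean square. The following sketch recomputes it from the sums of squares and degrees of freedom above. The names are illustrative, and the p-value itself is not recomputed here because that would need a statistics library.

```python
# Sums of squares and degrees of freedom from the ANOVA table (model 3).
regression_ss, regression_df = 28775.854, 10
residual_ss, residual_df = 13319.854, 188

# Mean square = sum of squares / degrees of freedom.
regression_ms = regression_ss / regression_df
residual_ms = residual_ss / residual_df

# F statistic for H0: all slope coefficients equal zero.
f_statistic = regression_ms / residual_ms

print(round(regression_ms, 3), round(residual_ms, 3), round(f_statistic, 2))
```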
ASSUMPTIONS OF LINEAR REGRESSION
After obtaining the results of the regression analysis, we conducted a series of assumption
tests in order to learn more about our dependent variable as well as our independent
variables.
Test 1: Residuals' Expected Value
The residual mean is 0.00, so our regression meets the test of the residuals' expected value.
In other words, each case has some residual, and the positive and negative residuals offset
each other to give a residual mean of 0.00.
Residuals Statistics

                       Minimum   Maximum  Mean    Std. Deviation  N
Predicted Value        39.849    92.924   70.784  12.068          199
Residual               -28.479   20.092   .000    8.183           199
Std. Predicted Value   -2.563    1.835    .000    1.000           199
Std. Residual          -3.373    2.380    .000    .969            199
Test 2: Residuals' Constant Variance
The scatterplot shows that the spread of the residuals is not the same across the predicted
values. The regression therefore fails the residuals' constant variance test, and we say that
our regression has heteroskedasticity. The spread of the residuals appears to narrow as we move
further along the range of predicted values.
Test 3: Residuals Are Normally Distributed
The normal P-P plot clearly indicates that our regression meets the normal distribution test.
All the observations are very close to the predicted values, which means our regression is
quite accurate in predicting the 4-year graduation rate of college students. The model is less
accurate only for the few universities whose graduation rate is between 20% and 40%. Even so,
the differences between the predicted values and the actual observations are small.
Test 4: Residuals Are Independent
The Durbin-Watson statistic of this regression is 1.767, which is between 1.5 and 2.5. This
means that our regression's residuals are independent of each other. In other words, the error
for one observation does not help us predict the error for another observation.
Model Summary

Model  R     R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      .828  .685      .665               8.4425118                   1.767
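The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below is a plain-Python illustration of that formula together with the 1.5 to 2.5 rule of thumb used in this test. The function names and the short residual list are hypothetical and are not our actual SPSS residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in observation order."""
    # Numerator: squared changes between consecutive residuals.
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    # Denominator: sum of squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def passes_rule_of_thumb(dw, lower=1.5, upper=2.5):
    """Apply the 1.5 to 2.5 rule of thumb used in this paper."""
    return lower < dw < upper


# Hypothetical residuals, for illustration only.
example_residuals = [2.0, -1.0, 0.5, -0.5, 1.5, -2.0, 0.0, -0.5]
print(round(durbin_watson(example_residuals), 3))

# The value reported by SPSS for our regression.
print(passes_rule_of_thumb(1.767))
```

A value near 2 suggests no first-order autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation respectively.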
Test 5: Independent Variables Are Not Related
All of our variables have a VIF of less than 10. This means our regression meets the fifth
assumption; in other words, our independent variables are not highly related to each other.
Coefficients (Model 1)

                            Unstandardized        Standardized
                            Coefficients          Coefficients                    Collinearity Statistics
Variable                    B        Std. Error   Beta           t        Sig.    Tolerance  VIF
(Constant)                  88.769   5.370                       16.532   .000
Rank                        -.092    .017         -.367          -5.449   .000    .373       2.678
Out-state Tuition fee       .221     .092         .135           2.408    .017    .540       1.851
Number of Enrollments       .129     .093         .113           1.388    .167    .256       3.900
Undergraduate Population    -.268    .140         -.176          -1.922   .056    .203       4.932
Fall 2011 Acceptance Rate   -.080    .048         -.119          -1.682   .094    .339       2.950
Student/Faculty             -.538    .301         -.137          -1.791   .075    .289       3.462
Private                     1.142    2.321        .033           .492     .623    .380       2.631
Financial Aid               -.129    .045         -.167          -2.900   .004    .508       1.967
% Male >= 50                -4.739   1.466        -.149          -3.232   .001    .798       1.252
% White >= 50               1.421    1.735        .041           .819     .414    .660       1.516
Part-time                   2.310    1.452        .077           1.590    .113    .719       1.390
Athletic                    .150     .072         .108           2.081    .039    .629       1.589
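The VIF of a variable is the reciprocal of its tolerance, so the check in Test 5 can be repeated from the Tolerance column alone. The sketch below does this with the tolerances copied from the table. Because the tolerances are rounded to three decimals, the recomputed VIFs can differ slightly from the printed ones. The cutoff of 10 is the rule of thumb used in this paper.

```python
# Tolerance column of the model 1 coefficients table.
TOLERANCE = {
    "Rank": 0.373,
    "Out-state Tuition fee": 0.540,
    "Number of Enrollments": 0.256,
    "Undergraduate Population": 0.203,
    "Fall 2011 Acceptance Rate": 0.339,
    "Student/Faculty": 0.289,
    "Private": 0.380,
    "Financial Aid": 0.508,
    "% Male >= 50": 0.798,
    "% White >= 50": 0.660,
    "Part-time": 0.719,
    "Athletic": 0.629,
}

VIF_CUTOFF = 10  # rule of thumb used in this paper

# VIF = 1 / tolerance.
vif = {name: 1 / tol for name, tol in TOLERANCE.items()}

highest = max(vif, key=vif.get)
all_below_cutoff = all(value < VIF_CUTOFF for value in vif.values())

print(highest, round(vif[highest], 3))
print(all_below_cutoff)
```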
CONCLUSION
After conducting this study of the 4-year graduation rate of the top 200 American colleges as
ranked by Forbes, our team is very satisfied with the results. We were able to build a
regression model that explains about 66.7% of the variation in the 4-year graduation rate. Our
model also indicates that the out-state tuition fee, the percentage of male students, the
school's rank and the percentage of students who receive financial aid, all of which are
statistically significant at the 1% level of significance, are important factors in predicting
the graduation rate of a particular college in the United States.
A higher-ranked school, indicated by a smaller number in the RANK variable, should have a
higher graduation rate than a similar school ranked below it. We believe this result is very
reasonable, since a higher-ranked school tends to have an overall environment that encourages
students to perform better academically. We also think that Forbes is doing a very good job in
evaluating colleges across the country. A school cannot intervene in Forbes's work, but it can
still try to raise its rank by improving its quality: adopting new technology, creating larger
and better academic resources for students, orienting teaching and learning styles toward
engagement and sharing, etc.
The out-state tuition fee is another factor that has a positive relationship with the 4-year
graduation rate. There are several possible explanations for this relationship: students may
try to finish their program faster when they have to pay more to stay in school, or schools
that collect higher tuition may have more capacity to improve their quality, which directly
affects students' performance. However, this does not mean that schools should try to increase
their tuition fees, which they are already doing. Instead of raising the cost, schools should
try to spend their money efficiently to create a better study environment for students.
On the other hand, the percentage of students who receive financial aid has a negative
relationship with the graduation rate. A larger share of students receiving financial aid
lowers the school's graduation rate, possibly because those students pay less than students
without financial aid and so can afford to stay in school for a longer time.
Our model also says that schools with more male than female students have a lower graduation
rate than similar schools with fewer male than female students. Although this is a very
interesting point, we do not comment further on this result, since our hypothesis test failed
to conclude that schools where fewer than 50% of students are male have a higher graduation
rate than the opposite group.
In addition, our regression model indicates that the undergraduate student population, the
percentage of varsity athletes and the student/faculty ratio also affect the school's
graduation rate, although not as significantly as the four variables above.
Although the number of enrollments in Fall 2011, the percentage of part-time students and the
acceptance rate in Fall 2011 are in our equation, they are statistically insignificant. We
believe that these numbers can change easily in a short period of time. They can be very
different from one semester to another; hence it is understandable that they have very little
effect on the school's graduation rate.
One unexpected outcome of our regression is that the independent variable for the percentage of
white students was removed by SPSS. We collected data on the percentage of white students in
the belief that white students tend to outperform students of other races at school. This is
also what we found in the study by the US Department of Education, which indicates that white
and Asian students tended to graduate at higher rates than African American and Hispanic
students. The result may be due to the coarse nature of our data, which can only separate white
students from all other races.
Although we believe that our model is very good, it still has several limitations that need to
be addressed. First, we think the model could be much more accurate if we were able to collect
data from more than just 200 schools. The more data we collect, the better the regression can
capture the relationship between the dependent variable and each independent variable. Second,
our group should have collected some historical data, so that we could understand the trend in
the data over time and remove the influence of its short-term characteristics. Finally, we
suggest adding some other independent variables, such as the school retention rate, the average
high school GPA of freshmen, the average SAT score of all students, what majors the school
offers, the school's calendar system, etc.
APPENDIX A - FREQUENCY TABLE
Rank (RANK)

Range     Frequency  Cumulative frequency  Percentage  Cumulative percentage
0-25      25         25                    12.56%      12.56%
26-50     25         50                    12.56%      25.12%
51-75     25         75                    12.56%      37.68%
76-100    24         99                    12.06%      49.74%
101-125   25         124                   12.56%      62.30%
126-150   25         149                   12.56%      74.86%
151-175   25         174                   12.56%      87.42%
176-200   25         199                   12.56%      100.00%
Total     199                              100%
Outstate Tuition (OUT_STA_TUITION)

Range               Frequency  Cumulative frequency  Percentage  Cumulative percentage
0 - $10,000         7          7                     3.5%        3.5%
$10,000 - $20,000   3          10                    1.5%        5.0%
$20,000 - $30,000   34         44                    17.0%       22.0%
$30,000 - $40,000   72         116                   36.0%       58.0%
$40,000 - $50,000   84         200                   42.0%       100.0%
Total               199                              100%
Number of Total Enrollment (TOTAL_ENROLL)

Range             Frequency  Cumulative frequency  Percentage  Cumulative percentage
0 - 10,000        116        116                   58.5%       58.5%
10,000 - 20,000   31         148                   15.5%       74.0%
20,000 - 30,000   30         178                   15.0%       89.0%
30,000 - 40,000   13         191                   6.5%        95.5%
40,000 - 50,000   6          197                   3.0%        98.5%
50,000 - 60,000   2          199                   1.0%        99.5%
60,000 - 70,000   1          199                   0.5%        100.0%
Total             199                              100%
Undergraduate Population (UNDERGRAD_POP)

Range           Frequency  Cumulative frequency  Percentage  Cumulative percentage
0-5,000         123        123                   61.81%      61.81%
5,001-10,000    32         155                   16.08%      77.89%
10,001-15,000   10         165                   5.03%       82.91%
15,001-20,000   9          174                   4.52%       87.44%
20,001-25,000   5          179                   2.51%       89.95%
25,001-30,000   9          188                   4.52%       94.47%
30,001-35,000   7          195                   3.52%       97.99%
35,001-40,000   3          198                   1.51%       99.50%
40,001-45,000   1          196                   0.50%       100.00%
Total           199                              100%
Fall 2011 Acceptance Rate (ACCEPT_RATE)

Range      Frequency  Cumulative frequency  Percentage  Cumulative percentage
0-10%      10         10                    5.03%       5.03%
11%-20%    23         33                    11.56%      16.58%
21%-30%    25         58                    12.56%      29.15%
31%-40%    27         85                    13.57%      42.71%
41%-50%    27         112                   13.57%      56.28%
51%-60%    24         136                   12.06%      68.34%
61%-70%    38         174                   19.10%      87.44%
71%-80%    17         191                   8.54%       95.98%
81%-90%    8          199                   4.02%       100.00%
91%-100%   0          199                   0.00%       100.00%
Total      199                              100%
Student/Faculty Ratio (STU_FAC_RATIO)

Range   Frequency  Cumulative frequency  Percentage  Cumulative percentage
0-5     2          2                     1.01%       1.01%
6-10    90         92                    45.23%      46.23%
11-15   74         166                   37.19%      83.42%
16-20   28         194                   14.07%      97.49%
21-25   4          198                   2.01%       99.50%
26-30   1          199                   0.50%       100.00%
Total   199                              100%
4-Years Graduation Rate (GRAD_RATE)

Range        Frequency  Cumulative frequency  Percentage  Cumulative percentage
0 - 20%      0          0                     0.00%       0.00%
21% - 40%    7          7                     3.52%       3.52%
41% - 60%    45         52                    22.61%      26.13%
61% - 80%    82         134                   41.21%      67.34%
81% - 100%   65         199                   32.66%      100.00%
Total        199                              100.00%
Financial Aid (FINANCIAL_AID)

Range        Frequency  Cumulative frequency  Percentage  Cumulative percentage
0 - 20%      5          5                     2.5%        2.5%
21% - 40%    2          7                     1.0%        3.5%
41% - 60%    27         34                    13.5%       17.0%
61% - 80%    77         111                   38.5%       55.5%
81% - 100%   89         199                   44.5%       100.0%
Total        199                              100%
Athlete (ATH)

Range       Frequency  Cumulative frequency  Percentage  Cumulative percentage
0 - 10%     97         97                    48.74%      48.74%
11% - 20%   43         140                   21.61%      70.35%
21% - 30%   47         187                   23.62%      93.97%
31% - 40%   11         198                   5.53%       99.50%
41% - 50%   1          199                   0.50%       100.00%
Total       199                              100.00%
Private (PRI)

Category  Frequency  Percentage
Private   154        77.39%
Public    45         22.61%
Total     199        100.00%
Percentage of Part-time Student (PART)

Category  Frequency  Percentage
<2%       76         38.19%
≥2%       123        61.81%
Total     199        100.00%
Percentage of Male Students (MALE_PERCENT)

Category    Frequency  Percentage
≥ 50% (1)   59         29.5%
< 50% (0)   141        70.5%
Total       200        100.0%
Percentage of White Students (WHITE_PERCENT)

Category  Frequency  Percentage
≥ 50%     48         24.0%
< 50%     152        76.0%
Total     200        100.0%