Development of the Mathematical Model

advertisement
Predicting Retention Rates from Placement
Exam Scores in Engineering Calculus
First Author
Affiliation
Country
Email Address
Abstract: As part of an NSF Science, Technology, Engineering, and Mathematics Talent
Expansion Program (STEP) grant, a math placement exam (MPE) has been developed at a large
public University for the purpose of evaluating the pre-calculus mathematical skills of entering
students. Approximately 4500 students take the placement exam before beginning their freshman
year. Beginning in 2011, a minimum score of 22 out of 33 is needed to enroll in Calculus I. If the
minimum score is not achieved, students are placed in a pre-calculus course.
In previous work, the authors focused on the psychometric properties of the MPE. In this article
we examine the distribution of MPE scores for two groups of students – those who pass Calculus I
and those who don't. We show that the cumulative distribution function (CDF) of pass/fail
outcomes can be effectively modeled by a logistic regression function.
Using historical data (e.g., pass rates) and data from the MPE test (available before the semester
begins) we show that one can accurately predict the number of students who will pass (as a
function of their MPE scores) as well as overall retention rates.
We examine the data over a 4-year period (2009-2012). Despite the fact that a mandatory MPE
cutoff score was imposed in 2011, we show that the model retains validity.
Background
XYZ University has one of the largest engineering programs in the United States. There are over 7,000
undergraduate engineering majors. Traditionally, students in this program take Calculus during their freshmen year,
along with Physics and other Science, Technology, Engineering, and Mathematics (STEM) courses. Some students
are not sufficiently prepared and have difficulty passing their mathematics courses. In order to identify students
with potential problems, a Math Placement Exam (MPE) was developed using confirmatory and item-response
theory (Muthen & Muthen, 2012; Raykov & Marcoulides, 2011). A comprehensive statistical analysis was detailed
in a previous SITE publication (XXX, 2012).
The MPE consists of 33 multiple choice questions in the following areas: polynomials, functions, graphing,
exponentials, logarithms, and trigonometric functions. Questions were designed by two faculty members
experienced in teaching both pre-calculus and calculus.
Based on historical performance data (MPE scores vs. grades), an MPE cutoff score of 22 was established.
Starting in fall 2011, students with an MPE score below this cutoff were blocked from enrolling in Calculus I.
Passing is defined as receiving a grade of A, B, or C. If they score below 22, they make retake the exam or enroll in
a summer program called the Personalized Pre-Calculus Program (PPP).
Development of the Mathematical Model
Because mathematics is a critical component of the undergraduate engineering major (typically consisting
of three semesters of calculus and one semester of differential equations), it is important to ensure that beginning
students are mathematically prepared to enter Calculus. If a student cannot complete a required mathematics course
with a C or better, they must retake the course, usually resulting in graduation delays and extra cost.
Historically, it has been observed that some students enter the University with weak algebraic skills and
little facility with trigonometric, exponential, and logarithmic functions. These are precisely the areas targeted by
the MPE. Other issues such as over-reliance on graphing calculators and lack of problem-solving ability are not
addressed by the MPE.
Sample Groups
In this study, two semesters of math course grades and MPE data, beginning in Fall 2009, are used to
develop a mathematical model. The sample for this investigation is derived from students who have taken the MPE,
enroll in Calculus I, and receive a grade of A, B, C, D or F. Excluded from the sample are students, enrolled in
Calculus I, who did not take the MPE, or students with MPE scores who dropped, withdrew, or otherwise failed to
complete course requirements (Pitoniak & Morgan, 2012).
From the sample of eligible students (n = 1451), two groups were formed. The pass group (P; n =1193)
was defined as those students who received grades of A, B, or C in Calculus I. The fail group (F; n = 258) consisted
of students who received grades of D, F, or equivalent. Each of these groups has an underlying distribution of MPE
scores, which can be modeled by a logistic regression function.
Table 1.
Fall 2009 data set
MPE score
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
Number Pass
0
0
1
0
1
4
7
2
5
13
11
15
28
24
26
35
40
57
89
81
116
93
111
118
97
90
71
50
8
Number Fail
0
0
0
2
3
1
8
4
8
15
9
8
10
9
10
18
20
21
21
22
9
13
14
8
4
7
10
3
1
Cumulative Pass
0
0
1
1
2
6
13
15
20
33
44
59
87
111
137
172
212
269
348
439
555
648
759
877
974
1064
1135
1185
1193
Cumulative Fail
0
0
0
2
5
6
14
18
26
41
50
58
68
77
87
105
125
146
167
189
198
211
225
233
237
244
254
257
258
MPE score distributions for each group (P and F) are shown in Figure 1. The data is quite noisy, and does
not appear to follow any particular probability distribution. As shown in Figure 2, however, plotting the cumulative
distribution of scores improves the situation.
Pass
Fail
Number of Students
140
120
100
80
60
40
20
0
5
7
9
11 13 15 17 19 21 23 25 27 29 31 33
MPE Correct Responses
Figure 1.
MPE score distributions for the Pass and Fail groups.
Cum Pass
Cum Fail
Number of Students
1400
1200
1000
800
600
400
200
0
5
7
9
11 13 15 17 19 21 23 25 27 29 31 33
MPE Correct Responses
Figure 2.
Cumulative score distributions for the Pass and Fail groups.
The distribution of scores in Figure 2 has the characteristic “S-curve” shape of cumulative distribution
functions corresponding to the logistic or Gaussian probability distributions. We will investigate fitting a
cumulative logistic function for several reasons. First, frequency distributions associated with MPE scores for
groups P and F are not symmetric, in part due to a majority of the scores being in the interval [20, 33]. Secondly,
the cumulative logistic function can be written explicitly in terms of exponentials, unlike the cumulative Gaussian
which is written in terms of error functions (Secolsky, Krishnan, & Judd, 2012).
Although the best fit of a logistic regression function to the data involves the maximum likelihood principle
(Breslow & Holubkov, 1997), we can get a good approximation to this by requiring that the logistic probability
distribution have the same mean and standard deviation as the data.
If we let S denote the MPE score, then the data in Figure 2 can be modeled by two logistic functions
(1)
P( S)=
(2)
F( S)=
NP
1+e a − b
P
P
S
NF
1+e a − b
F
F
S
These have the property that as 𝑆 → ∞ both P and F approach 0. As 𝑆 → ∞, P approaches N P (the total
number of individuals who pass) and F approaches N F (the total number of individuals who fail). The probability
density function is the derivative of the cumulative density function. We can see the fit to both groups in Figure 3.
Number of Students
N Pass
Pass Theoretical
140
120
100
80
60
40
20
0
5
7
9
11 13 15 17 19 21 23 25 27 29 31 33
MPE Questions Correct
Figure 3.
Theoretical vs. actual probability for students passing Calculus I.
Number of Students
N Fail
Fail Theoretical
25
20
15
10
5
0
5
7
9
11 13 15 17 19 21 23 25 27 29 31 33
MPE Questions Correct
Figure 4.
Theoretical vs. actual probability for students failing Calculus I.
This also indicates the problems with predicting the probability that an individual student will pass (or fail) given
only their score, rather than the probability of passing (or failing) given a range of scores.
The idea of a “cutoff” score is introduced in the following way. We wish to find a particular score Sc such
that the 70% of the students with scores above the cutoff pass. This was found to be approximately a score of 22
which was then adopted as the minimum MPE score required for registering for Calculus I.
Paradoxically, since virtually all students in Calculus I now have MPE scores of 22 or higher, if the overall
pass rate (retention rate) is different from 70%, the cutoff score would have to be revised. In our case, we did not
want to have a “floating” cutoff score, so it has been fixed at 22.
Essential Parameters
The statistical properties of each group are determined by three parameters N, ,  which describe the
number of individuals in the group, and the mean and standard deviation of the scores. This data is available for all
students entering Calculus I with an MPE score. We know the number of individuals in Calculus I with MPE
scores, and we know the mean and standard deviation of the groups, but clearly do not yet know membership or
statistics of subgroups P and F until the semester ends.
In this section we will show how to develop a mathematical model which takes historical data from
pervious semesters, as well as the parameters N, ,  from the incoming Calculus I students to provide accurate
estimates of the number of students who will pass or fail.
Mathematical Model
We define the retention rate, R, as the percentage of students in a population who pass. Consequently,
N P  RN
N F  (1  R) N
(3)
(4)
The mean values of the MPE for the two groups P and F are related by
N P  P  N F  F  N
If we define the difference,
   P   F , we get equations (5) and (6)
 P    (1  R)
 F   P      R
(5)
(6)
With a little bit of algebra, and defining the parameter
 as the ratio of standard deviations, we have
 F   P
In terms of this ratio, we have
P 
 2  R(1  R)2
R  (1  R) 2
Computational Algorithm
I.
II.
III.
Let N be the number of students who take the MPE and are admitted into Calculus I
Compute the mean and standard deviation of these students MPE scores, , .
Determine the pass rate of the previous year, R, and the computed values
  P  F
and

IV.
F
P
Estimate the mean and standard deviation of the two subgroups P and F by
 P    (1  R)
 F    R
P 
 2  R(1  R)2
R  (1  R) 2
 F   P
V.
Calculate the logistic regression parameters for the two groups

3
a  b
b
and the cumulative number of individuals with scores less than or equal to S by
N
1  e a bS
Validation Study – Fall 2009 and 2010
The following table shows the actual MPE scores, and statistics, for the four fall semesters, beginning in
2009. The values for N, ,  are known, but the values for ∆, , R must be estimated from previous year’s historical
data. From the computational algorithm, we can therefore compute the estimated values for 2010, 2011, and 2012.
This is contained in Table 3.
Table 2.
Actual MPE data for Fall 2009-Fall 2012
Year
2009
2010
2011
2012
N
1451
1416
1742
1606


NP
NF
P
F
P
F


24.579
25.402
26.713
26.911
5.025
4.733
3.432
3.936
1193
1119
1355
1311
258
297
387
295
25.308
26.174
27.246
27.496
21.205
22.492
24.798
24.922
4.577
4.200
3.149
3.497
5.604
5.451
3.802
4.324
4.103
3.683
2.448
2.574
1.224
1.298
1.207
1.237
R
0.806
0.765
0.771
0.814
Table 3.
Estimated MPE data for Fall 2010-Fall 2012
Year
2010
2011
2012
N
1416
1742
1606


NP
NF
P
F
P
F
25.402
26.713
26.911
4.733
3.432
3.936
1142
1333
1239
274
409
367
26.196
27.568
27.582
22.093
23.885
25.134
4.246
2.863
3.476
5.199
3.716
4.197
Approximation Results
The easiest way to consider the errors is to look at the numbers of students in the P and F groups,
comparing the actual numbers, the “best fit” model numbers, and the estimated numbers. This is shown in Table 4.
Although we can calculate the absolute errors (in terms of numbers of students), it is perhaps more valuable
to consider the retention rate, which is defined as the number of students who pass (with scores greater than or equal
to S) out of the total number of students (with scores greater than or equal to S). The number of students passing
with scores greater than or equal to S is given by
N P  P( S )  N P 
NP
1  e aP bP S
Similarly, the number of students failing with scores greater than or equal to S is given by
N F  F (S )  N F 
NF
1  e aF bF S
NP 
So the retention rate is equal to
NP 
NP
1  e aP bP S
NP
NF
 NF 
a P bP S
1 e
1  e aF bF S
Table 4.
Numbers of Students who Pass/Fail vs MPE score
2010
Scaled
Scaled
Exact
Model
Estimate
MPE
Pass
Fail
Pass
Fail
Pass
Fail
1
0
0
0
0
0
0
2
0
0
0
0
0
0
3
0
2
0
0
0
0
4
0
2
0
1
0
1
5
0
2
0
1
0
1
6
1
2
0
1
0
1
7
1
2
0
2
0
1
8
2
5
0
2
1
2
9
2
7
1
3
1
3
10
2
8
1
5
1
4
11
2
12
2
7
2
6
12
6
19
3
9
3
8
13
10
23
4
13
4
12
14
16
28
6
17
7
16
15
24
35
9
24
10
22
16
33
43
14
32
15
31
17
46
52
22
43
23
42
18
64
63
33
57
35
56
19
87
74
51
74
53
73
20
112
94
76
95
79
93
21
145
108
113
118
117
117
22
192
124
166
143
171
141
23
259
147
237
169
243
166
24
339
173
329
194
337
190
25
434
202
440
217
449
211
26
515
225
564
237
573
229
27
612
247
689
254
700
243
28
739
267
806
268
818
255
29
865
276
905
279
919
264
30
963
289
984
287
1000
270
31
1058
293
1042
294
1061
275
32
1109
297
1084
298
1104
278
33
1119
297
1114
302
1135
281
Note: the theoretical model and estimated model values have been scaled
a  33b
by a factor 1  e
to make sure that the cumulative values sum to the
number of students in the group.
The results are shown below in Figure 5.
Retention %
Actual
Model
Estimate
100
95
90
85
80
75
70
1
3
5
7
9
11 13 15 17 19 21 23 25 27 29
MPE Score
Note: The retention rates do not have to be monotone, due to the increasingly
smaller sample sizes.
Figure 5.
Actual, Theoretical, and Estimated retention rates in Calculus I as a function of MPE score.
Quantifying the notion of “at-risk” students
The mathematical model we have developed for MPE scores allow one to predict the probability of success –
not of individual students, but of groups of students with a similar range of scores. We can define at-risk students as
those whose MPE scores put them in a range which has a probability of success which is less than some prescribed
threshold.
Summary
Using historical data from the previous year, as well as data from the MPE scores of the entering freshman,
we have shown that one can model the cumulative distribution function for the subgroups of passing and failing
students very accurately. This in turn allows one to model retention rates as a function of MPE cutoff score. We
have begun to examine intervention strategies to improve retention among students who are at-risk due to their low
MPE scores.
References
Breslow, N. E. and Holubkov, R. (1997). Maximum likelihood estimation of logistic regression parameters under
two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 59: 447–461. doi: 10.1111/1467-9868.00078
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: The Guilford Press.
Muthén, L.K. and Muthén, B.O. (1998-2013). MPlus User’s Guide. Seventh Edition. Los Angeles, CA: Muthén
& Muthén.
Pitoniak, M.J. & Morgan, D.L. (2012). Setting and validating cut scores for tests. In C. Secolsky and D.B. Denison
(Eds.) Handbook on measurement, assessment, and evaluation in higher education (pp.343-366). New York:
Routledge.
Raykov, T. & Marcoulides, G. A. (2011). Introduction to Psychometric Theory. New York, NY: Routledge.
Secolsky, C., Krishnan, K, & Judd, T. P. (2012). Using logistic regression for validating or invalidating initial
statewide cut-off scores on basic skills placement tests at the community college level, Research in Higher
Education Journal, 19.
XXX (2012). SITE conference paper. SITE Conference Proceedings.
Download