Uploaded by peter vozvoteca

PMST MODELLING TAS Update

PMST MODELLING TASK
USING REGRESSION ANALYSIS TO ESTIMATE THE SCALED SCORE FROM THE RAW SCORE IN
TWO SUBJECTS
BACKGROUND
Scaling is a statistical and mathematical process that converts a raw score derived from
cumulative assignments and exams into a scaled score. This allows a more direct and fairer
comparison for students with different subject combinations and degrees of difficulty. For
example, a student may achieve a lower raw score with a higher level of difficulty compared
to a student with a higher raw score and lower level of difficulty.
By using regression analysis, a function can be fitted to paired data sets to find the model
with the strongest correlation (r). One of the simplest methods is to use a linear function
(straight line) to model the 'line of best fit' on a scatter plot of the data. The aim is to find
a correlation as close as possible to +1 or -1. This is usually achieved when the data points
are evenly distributed around the line of best fit. Correlations XXXXX
This task will use regression modelling to compare the raw and scaled scores for two
subjects: Mathematical Methods and Economics, for the years 2020, 2021 and 2022. These
scores are summarised in Table 1 below.
Table 1

Mathematical Methods:
2020   Raw     Scaled    2021   Raw     Scaled    2022   Raw     Scaled
25%    57.00   77.12     25%    60.00   78.45     25%    63.00   79.42
50%    69.00   88.78     50%    72.00   89.40     50%    73.00   89.00
75%    80.00   94.53     75%    84.00   95.13     75%    84.00   94.81
90%    88.00   96.83     90%    92.00   97.13     90%    92.00   97.06
99%    97.00   98.30     99%    99.00   98.24     99%    98.00   98.09

Economics:
2020   Raw     Scaled    2021   Raw     Scaled    2022   Raw     Scaled
25%    59.00   67.49     25%    63.00   71.10     25%    65.00   71.12
50%    70.00   81.00     50%    74.00   86.37     50%    75.00   84.88
75%    81.00   93.16     75%    84.08   93.84     75%    86.00   96.24
90%    88.00   96.13     90%    90.00   96.16     90%    92.00   95.75
99%    95.00   97.83     99%    96.00   97.67     99%    97.00   97.14
From the data sets it can be seen that there are 'natural' limits of zero (0), the minimum
possible raw and scaled score, and one hundred (100), the maximum possible score. These
limits are important in setting the Domain and Range for each regression function, which
will be Domain {0 ≤ x ≤ 100} and Range {0 ≤ y ≤ 100}. The limits are also useful for testing
the validity of each function when estimating values by interpolation and extrapolation.
For example, a function that produces values outside these ranges cannot be relied on to
make accurate predictions.
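The Domain and Range limits can be written as a simple check. The sketch below is illustrative only: the function name `out_of_range_endpoints` and the sample model y = 1.1x are assumptions made for the example and are not part of the original task. It evaluates a candidate model at both ends of the Domain and reports any predicted scaled score that falls outside the Range.

```python
from typing import Callable, Dict

RAW_MIN, RAW_MAX = 0.0, 100.0        # Domain {0 <= x <= 100}
SCALED_MIN, SCALED_MAX = 0.0, 100.0  # Range  {0 <= y <= 100}


def out_of_range_endpoints(model: Callable[[float], float]) -> Dict[float, float]:
    """Return {raw: predicted} for each Domain endpoint whose prediction leaves the Range."""
    violations = {}
    for raw in (RAW_MIN, RAW_MAX):
        predicted = model(raw)
        if not SCALED_MIN <= predicted <= SCALED_MAX:
            violations[raw] = predicted
    return violations


# Hypothetical model used only to exercise the check: y = 1.1x
print(out_of_range_endpoints(lambda x: 1.1 * x))
# A model that stays inside the limits returns an empty dictionary
print(out_of_range_endpoints(lambda x: x))
```

Checking only the two endpoints is enough for functions that rise steadily across the Domain; a function with a turning point inside the Domain would need more test values.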
The Desmos calculator XXXXXX
REGRESSION ANALYSIS FOR ECONOMICS
1. Linear Function (y=mx+c)
The first function regressed against the data set was a linear function in the standard form
y=mx+c, written as y_1 ~ ax_1 + b in the Desmos calculator. The Domain was set to
{0 ≤ x ≤ 100} to show the minimum and maximum scores possible. The data points and line
of best fit can be seen as green and blue respectively. The function equation was given as
y=1.01399x+5.86219, where the gradient a = 1.01399 and the y-intercept b = 5.86219. Even
though the Correlation (r=0.9842) and the Coefficient of Determination (r^2=0.9686) were
relatively strong, there were several problems. Firstly, the line intersects the y-axis
(Scaled Scores) at 5.86219, which means that a raw score of 0 converts to a scaled score of
5.86219. In addition, a raw score of 100 converts to a scaled score of 107.261. These are
discrepancies of 5.86 and 7.26 points respectively (5.86% and 7.26% of the 100-point scale)
and would be significant issues in practical application.
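The endpoint values quoted for the linear model can be reproduced directly from the reported equation. The following is a minimal sketch; the coefficients and correlation are the values reported from Desmos in the text, while the variable and function names are chosen here for illustration.

```python
A = 1.01399  # gradient reported in the text
B = 5.86219  # y-intercept reported in the text
R = 0.9842   # correlation reported in the text


def linear_model(raw: float) -> float:
    """Scaled score predicted by y = ax + b."""
    return A * raw + B


at_zero = linear_model(0)
at_hundred = linear_model(100)
overshoot = at_hundred - 100  # amount by which the top prediction exceeds the maximum
r_squared = R ** 2            # coefficient of determination implied by r

print(f"raw 0   -> {at_zero:.5f}")
print(f"raw 100 -> {at_hundred:.3f} (overshoot {overshoot:.2f})")
print(f"r^2     -> {r_squared:.4f}")
```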
Secondly, the residuals, shown in red on the graph, follow a linear pattern instead of being
randomly scattered, especially for raw scores above 80.
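A residual is the observed scaled score minus the score predicted by the line. The sketch below shows how the pattern above a raw score of 80 could be checked numerically. It rests on an assumption: the raw and scaled pairs are copied from the score data in Table 1 that are taken here to be the Economics figures, so the exact residuals may differ from those plotted in Desmos.

```python
A, B = 1.01399, 5.86219  # linear fit reported in the text

# Assumption: (raw, scaled) pairs with raw score above 80, taken to be Economics data
points = [
    (81.00, 93.16), (84.08, 93.84), (86.00, 96.24),
    (88.00, 96.13), (90.00, 96.16), (92.00, 95.75),
    (95.00, 97.83), (96.00, 97.67), (97.00, 97.14),
]


def residual(raw: float, scaled: float) -> float:
    """Observed scaled score minus the value predicted by the line."""
    return scaled - (A * raw + B)


residuals = [residual(x, y) for x, y in points]
signs = ["+" if e > 0 else "-" for e in residuals]
# Count how often consecutive residuals switch sign
sign_changes = sum(1 for s, t in zip(signs, signs[1:]) if s != t)

for (x, _), e in zip(points, residuals):
    print(f"raw {x:6.2f}  residual {e:+.2f}")
print("sign changes:", sign_changes)
```

Randomly scattered residuals would be expected to switch sign often. A single switch from positive to negative as the raw score increases is the kind of pattern described above.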
It was interesting to note that when I attempted to set the Range to {0 ≤ y ≤ 100}, the
Desmos calculator would not generate a linear function. This suggests that a line of best fit
is not possible under that restriction.
1a. Linear function regression y=ax+c
1b. Linear function regression (adjusted) y=ax
To address these issues, the regression formula was adjusted so that the line passes through
the origin (0,0). This was written as y_1 ~ ax_1. The resulting function equation was given as
y=1.08475x. This improved the accuracy of estimates at the low end of the raw score scale;
however, estimates for higher raw scores were still inaccurate because a raw score of 100
converts to a scaled score of 108.475, which is impossible. Even though the Coefficient of
Determination (r^2=0.9635) and Correlation (r=0.98158) were still strong, the residual plot
still shows a pattern. By manually fixing the gradient 'a' at 1, so that the formula became
y_1 ~ 1.0x_1, I was able to eliminate the problem at the upper end of the scale. However, this
reduced the Correlation (r=0.9392). Equally, the residuals are not randomly scattered, as
seen in Graph 1c.
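For a line forced through the origin, the least-squares gradient has a closed form: a = Σxy / Σx^2. The sketch below illustrates that formula on made-up points that lie exactly on y = 2x; the data and the function name `slope_through_origin` are assumptions for the example, and Desmos may use a different numerical method internally.

```python
def slope_through_origin(xs, ys):
    """Least-squares gradient for y = a*x: a = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Illustrative points only (not the task data): they lie exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
a = slope_through_origin(xs, ys)
print(a)

# With the gradient fixed at 1 (the model y = x), both ends of the scale map onto themselves
fixed_gradient = 1.0
print(fixed_gradient * 0, fixed_gradient * 100)
```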
1c. Linear function regression (adjusted) y=x
When I attempted to set the Range values to {0 ≤ y ≤ 100} the Desmos calculator would not
generate a linear function because too many data points would be above a possible line of
best fit.
2. Quadratic Function (y=ax^2+bx+c)
The second function regressed against the data set was a parabolic function in the standard
form y=ax^2+bx+c, written as y_1 ~ ax_1^2 + bx_1 + c in the Desmos calculator with the limits
set as Domain {0 ≤ x ≤ 100} and Range {0 ≤ y ≤ 100}. The results can be seen in Graph 2a
below. The function equation was given as y=-0.004305x^2+1.4563x-0.644546. The Correlation
(r=0.99408) and the Coefficient of Determination (r^2=0.9882) were relatively strong.
2a. Quadratic function regression y=ax^2+bx+c
2b. Quadratic regression for residuals
The residual plot was more encouraging than that of the linear regression; however, it
would have been better for more points to fall below the line between raw scores 70 and 90.
There still appears to be a pattern for raw scores between 81 and 97.
Once again, as with the previous linear regression, there are predicted values outside the
set limits. For example, a raw score of zero (0) converts to a scaled score of -0.644546 and a
raw score of 100 converts to a scaled score of 101.935. These do not appear to be major
discrepancies; however, it was important to look for improvements. Since the function
should pass through the origin (0,0), so that x = 0 is one of its roots, the formula was
adjusted to y_1 ~ ax_1^2 + bx_1.
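The two endpoint values can be checked by substituting into the reported quadratic. The sketch below does this and also locates the turning point at x = -b/(2a), a detail not discussed in the original, to see whether the parabola is still rising across the whole Domain. The coefficients are those reported in the text; the names are illustrative.

```python
A, B, C = -0.004305, 1.4563, -0.644546  # coefficients reported in the text


def quadratic_model(raw: float) -> float:
    """Scaled score predicted by y = ax^2 + bx + c."""
    return A * raw ** 2 + B * raw + C


at_zero = quadratic_model(0)
at_hundred = quadratic_model(100)
turning_point = -B / (2 * A)  # x-coordinate of the vertex of the parabola

print(f"raw 0   -> {at_zero:.6f}")
print(f"raw 100 -> {at_hundred:.3f}")
print(f"turning point at raw = {turning_point:.1f}")
```

Because the turning point lies beyond a raw score of 100, the fitted curve increases over the whole Domain, so a higher raw score always converts to a higher scaled score.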
2c. Quadratic function regression y=ax^2+bx
2d. Quadratic function regression y=ax^2
The resulting equation did not change significantly: y=-0.00420066x^2+1.4397x. This
addressed the issue at the lower end of the scale, where a raw score of 0 now converts to a
scaled score of 0. However, it did not solve the issue at the higher end of the scale, where a
raw score of 100 converts to 101.963. This could be problematic because an increase of
nearly 2 points could be significant when making comparisons. Even though the data points
in green appear to be scattered above and below the line of best fit, the residual plot still
shows a slight pattern.
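Removing the constant term leaves two unknowns, a and b, which least squares finds by solving a pair of simultaneous (normal) equations. The sketch below is one way this could be done by hand in Python; it is not a description of how Desmos works internally. The test points are generated from the reported equation rather than taken from the task data, so the fit should simply recover the reported coefficients.

```python
def fit_quadratic_through_origin(xs, ys):
    """Least-squares a and b for y = a*x^2 + b*x (no constant term)."""
    s4 = sum(x ** 4 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s2 = sum(x ** 2 for x in xs)
    t2 = sum(x ** 2 * y for x, y in zip(xs, ys))
    t1 = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: a*s4 + b*s3 = t2 and a*s3 + b*s2 = t1 (solved by Cramer's rule)
    det = s4 * s2 - s3 * s3
    a = (t2 * s2 - s3 * t1) / det
    b = (s4 * t1 - s3 * t2) / det
    return a, b


A_REPORTED, B_REPORTED = -0.00420066, 1.4397  # coefficients reported in the text


def reported_model(raw: float) -> float:
    return A_REPORTED * raw ** 2 + B_REPORTED * raw


# Illustrative points generated from the reported curve (not the task data)
xs = [20.0, 40.0, 60.0, 80.0, 100.0]
ys = [reported_model(x) for x in xs]
a, b = fit_quadratic_through_origin(xs, ys)

print(a, b)
print(reported_model(0), reported_model(100))
```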
As an experiment, I made a further adjustment to the regression formula by removing the b
coefficient, so that it read y_1 ~ ax_1^2. The results can be seen in Graph 2d, where several
of the parameters changed significantly, especially the Coefficient of Determination
(r^2=0.5893) and Correlation (r=0.7676).
3. Exponential Function (y=ax^n)
The third function regressed against the data set was a function of the form y=ax^n
(strictly a power function, referred to here as exponential), written as y_1 ~ ax_1^n in the
Desmos calculator with the limits set as Domain {0 ≤ x ≤ 100} and Range {0 ≤ y ≤ 100}. The
results can be seen in Graph 3a below. The function equation was given as
y=3.75652x^0.71926. The Correlation (r=0.99276) and the Coefficient of Determination
(r^2=0.9856) were very strong.
3a Exponential function regression y=ax^n
3b Exponential function residual plot
A closer look at the residual plot shown in Graph 3b reveals an uneven scattering of points.
This adds weight to the possibility that this function may not be the best fit.
There is a similar issue to the previous regressions. At the upper end of the scale, a raw
score of 100 converts to a scaled score of 103.111. This represents an error of about 3%,
which may be significant in practical application.
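The value quoted for a raw score of 100 follows from substituting into the reported equation. This is a minimal sketch using the coefficients from the text; the function and variable names are illustrative.

```python
A, N = 3.75652, 0.71926  # coefficient and exponent reported in the text


def power_model(raw: float) -> float:
    """Scaled score predicted by y = a * x^n."""
    return A * raw ** N


at_zero = power_model(0.0)
at_hundred = power_model(100.0)
percent_over = (at_hundred - 100.0) / 100.0 * 100.0  # overshoot as a percentage of the maximum

print(f"raw 0   -> {at_zero:.3f}")
print(f"raw 100 -> {at_hundred:.3f} ({percent_over:.1f}% above the maximum)")
```

Unlike the first linear and quadratic fits, this form passes through the origin without any adjustment, because zero raised to a positive power is zero.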
4. Trigonometric Function (y=acosx)
The fourth function regressed against the data set was a trigonometric function based on
y=acosx, written as y_1 ~ acos(2πx_1/b)+c. The results can be seen in Graph 4a below. The
function equation was given as y=-48.6182cos(-2πx/3.34085)+48.6088. The Correlation
(r=0.9979) and the Coefficient of Determination (r^2=0.996) were very strong.
A similar issue arises at the upper end of the scale: a raw score of 100 converts to a scaled
score of 96.745, which is about 3.3% below the maximum.
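The quoted value of 96.745 can be checked by substitution, with one caveat. The sketch below assumes the calculator was working in degree mode, so that the number 2πx/b is read as an angle in degrees; this is an assumption, not something stated in the original, but it is the reading under which the reported equation returns the quoted value. Evaluated in radians, the same equation gives a very different result.

```python
import math

A, B, C = -48.6182, 3.34085, 48.6088  # parameters reported in the text


def trig_model_degrees(raw: float) -> float:
    """y = a*cos(-2*pi*x/b) + c with the cosine argument read in degrees (assumption)."""
    angle_deg = -2 * math.pi * raw / B
    return A * math.cos(math.radians(angle_deg)) + C


def trig_model_radians(raw: float) -> float:
    """The same equation with the cosine argument read in radians, for comparison."""
    return A * math.cos(-2 * math.pi * raw / B) + C


at_hundred_deg = trig_model_degrees(100)
at_hundred_rad = trig_model_radians(100)
at_zero = trig_model_degrees(0)

print(f"degrees: raw 100 -> {at_hundred_deg:.3f}")
print(f"radians: raw 100 -> {at_hundred_rad:.3f}")
print(f"raw 0 -> {at_zero:.4f}")
```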
4a Trigonometric function regression
4b Trigonometric function residual plot
By adding the maximum raw and scaled score (100,100) to the table of values, the equation
changed to y=-48.8005cos(-2πx/3.37069)+49.0496. The Correlation (r=0.9976) and the
Coefficient of Determination (r^2=0.9953) remained very strong. This looks like a good fit.
REGRESSION ANALYSIS FOR MATHEMATICAL METHODS
The same process used for the Economics data was applied to the Mathematical Methods
data. A minimum score (0,0) and a maximum score (100,100) were added to the table of
values. The result for each regression can be seen in the graphs below.
5a Linear regression
y = x
R^2 = 0.7116
Correlation: R = 0.8835
Residual plot: Poor fit because a pattern is obvious from the data points.

5b Quadratic regression
y = ax^2 + bx
R^2 = 0.9976
Correlation: R = 0.9987
Residual plot: Good fit with an even spread of data points.

5c Exponential regression
y = ax^b
R^2 = 0.9934
Correlation: R = 0.9966
Residual plot: Not an even spread of data points.

5d Trigonometric function regression
y = a cos (2πx/b) + c
R^2 = 0.9864
Correlation: R = 0.9931
Residual plot: Not an even spread of data points.
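The four Mathematical Methods results can be placed side by side by ranking the reported R^2 values. The sketch below is only a convenience for comparison: it uses the figures listed above and ignores the residual plots, which the analysis treats as equally important when judging fit.

```python
# R^2 values as reported for the Mathematical Methods regressions
reported_r_squared = {
    "quadratic (y = ax^2 + bx)": 0.9976,
    "exponential (y = ax^b)": 0.9934,
    "trigonometric (y = a cos(2πx/b) + c)": 0.9864,
    "linear (y = x)": 0.7116,
}

# Sort from the strongest to the weakest coefficient of determination
ranked = sorted(reported_r_squared.items(), key=lambda item: item[1], reverse=True)

for name, r2 in ranked:
    print(f"{r2:.4f}  {name}")
```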