PMST MODELLING TASK
USING REGRESSION ANALYSIS TO ESTIMATE THE SCALED SCORE FROM THE RAW SCORE IN TWO SUBJECTS

BACKGROUND

Scaling is a statistical and mathematical process that converts a raw score, derived from cumulative assignments and exams, into a scaled score. This allows a more direct and fairer comparison between students with different subject combinations and degrees of difficulty. For example, a student may achieve a lower raw score in a subject with a higher level of difficulty than a student who achieves a higher raw score in a subject with a lower level of difficulty.

By using regression analysis, two data sets (here, raw and scaled scores) can be compared to find the function that models their relationship with the highest correlation (r). One of the simplest methods is to use a linear function (straight line) as the 'line of best fit' on a scatter plot. The aim is to find a correlation as close as possible to +1 or -1. This is usually achieved when the data points are evenly distributed around the line of best fit.

Correlations XXXXX

This task will use regression modelling to compare the raw and scaled scores for two subjects, Mathematical Methods and Economics, for the years 2020, 2021 and 2022. These scores are summarised in Table 1 below.
Table 1. Raw and scaled scores by percentile

Mathematical Methods
Percentile   2020 Raw   2020 Scaled   2021 Raw   2021 Scaled   2022 Raw   2022 Scaled
25%          57.00      77.12         60.00      78.45         63.00      79.42
50%          69.00      88.78         72.00      89.40         73.00      89.00
75%          80.00      94.53         84.00      95.13         84.00      94.81
90%          88.00      96.83         92.00      97.13         92.00      97.06
99%          97.00      98.30         99.00      98.24         98.00      98.09

Economics
Percentile   2020 Raw   2020 Scaled   2021 Raw   2021 Scaled   2022 Raw   2022 Scaled
25%          59.00      67.49         63.00      71.10         65.00      71.12
50%          70.00      81.00         74.00      86.37         75.00      84.88
75%          81.00      93.16         84.08      93.84         86.00      96.24
90%          88.00      96.13         90.00      96.16         92.00      95.75
99%          95.00      97.83         96.00      97.67         97.00      97.14

From the data sets it can be seen that there are 'natural' limits: zero (0) is the lowest possible raw and scaled score and one hundred (100) is the highest. These limits set the Domain and Range for each regression function: Domain {0 ≤ x ≤ 100} and Range {0 ≤ y ≤ 100}. They are also useful for testing the validity of each function when estimating values by interpolation and extrapolation. For example, a function that returns values outside these limits cannot be relied on for accurate predictions.

The Desmos calculator XXXXXX

REGRESSION ANALYSIS FOR ECONOMICS

1. Linear Function (y = mx + c)

The first function regressed against the data set was a linear function in the standard form y = mx + c, written as y₁ ~ ax₁ + b in the Desmos calculator. The Domain was set to {0 ≤ x ≤ 100} to show the minimum and maximum scores possible. The data points and line of best fit are shown in green and blue respectively. The function equation was given as y = 1.01399x + 5.86219, where the gradient a = 1.01399 and the y-intercept b = 5.86219. Even though the Correlation (r = 0.9842) and the Coefficient of Determination (r² = 0.9686) were relatively strong, there were several problems. Firstly, the line intersects the y-axis (Scaled Scores) at 5.86219.
This means that a raw score of 0 converts to a scaled score of 5.86219. In addition, a raw score of 100 converts to a scaled score of 107.261. These are discrepancies of 5.86% and 7.26% of the scale respectively, which would be significant issues in practical application. Secondly, the residual points, seen in red on the graph, show a linear pattern instead of a random scattering, especially for raw scores above 80. It was interesting to note that when I attempted to set the Range to {0 ≤ y ≤ 100}, the Desmos calculator would not generate a linear function. This suggests that no line of best fit satisfies that restriction.

1a. Linear function regression y = ax + b
1b. Linear function regression (adjusted) y = ax

To address these issues, the function formula was adjusted so that the line passes through the origin (0,0). This was written as y₁ ~ ax₁. The resulting function equation was given as y = 1.08475x. This improved the accuracy of estimates at the low end of the raw score scale; however, the estimates for higher raw scores were still inaccurate, because a raw score of 100 converts to a scaled score of about 108, which is impossible. Even though the Coefficient of Determination (r² = 0.9635) and Correlation (r = 0.98158) were still strong, the residual plot still shows a pattern.

By manually fixing the gradient 'a' at 1, so that the formula read y₁ ~ 1.0x₁, I was able to eliminate the problem at the upper end of the scale. However, this reduced the Correlation (r = 0.9392). Equally, the residual plot is not randomly scattered, as seen in Graph 1c.

1c. Linear function regression (adjusted) y = x

When I attempted to set the Range to {0 ≤ y ≤ 100}, the Desmos calculator would not generate a linear function because too many data points would lie above any possible line of best fit.

2. Quadratic Function (y = ax² + bx + c)

The second function regressed against the data set was a parabolic function in the standard form y = ax² + bx + c, written as y₁ ~ ax₁² + bx₁ + c in the Desmos calculator, with the limits set as Domain {0 ≤ x ≤ 100} and Range {0 ≤ y ≤ 100}. The results can be seen in Graph 2a below. The function equation was given as y = -0.004305x² + 1.4563x - 0.644546. The Correlation (r = 0.99408) and the Coefficient of Determination (r² = 0.9882) were relatively strong.

2a. Quadratic function regression y = ax² + bx + c
2b. Quadratic regression for residuals

The residual plot was more encouraging than for the linear regression; however, it would have been better if more points lay below the line between raw scores 70 and 90. There still appears to be a pattern for raw scores between 81 and 97. Once again, as with the linear regression, the function gives values outside the set limits. For example, a raw score of zero (0) converts to a scaled score of -0.644546 and a raw score of 100 converts to a scaled score of 101.935. These don't appear to be major discrepancies; however, it was important to look for improvements.

Since the curve should pass through the origin (0,0), the formula was adjusted to y₁ ~ ax₁² + bx₁. Removing the constant term guarantees that x = 0 is a root, so the curve passes through the origin.

2c. Quadratic function regression y = ax² + bx
2d. Quadratic function regression y = ax²

The resulting equation did not change significantly: y = -0.00420066x² + 1.4397x. This addressed the issue at the lower end of the scale, where a raw score of 0 now converts to a scaled score of 0. However, it did not solve the issue at the higher end of the scale, where a raw score of 100 converts to 101.963. This could be problematic because an increase of nearly 2 points could be significant when making comparisons.
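The endpoint checks for the two quadratic equations can be reproduced with a short calculation. The Python sketch below is only an illustration: it reads 1.4563 and 1.4397 as the coefficients of x (the reading that reproduces the quoted values at a raw score of 100), and the helper names `make_quadratic` and `first_raw_score_reaching` are hypothetical, not part of Desmos.

```python
import math


def make_quadratic(a, b, c=0.0):
    """Return the function f(x) = a*x^2 + b*x + c."""
    def f(x):
        return a * x * x + b * x + c
    return f


# Coefficients as quoted in the text for the Economics data.
graph_2a = make_quadratic(-0.004305, 1.4563, -0.644546)  # y = ax^2 + bx + c
graph_2c = make_quadratic(-0.00420066, 1.4397)           # y = ax^2 + bx


def first_raw_score_reaching(a, b, target=100.0):
    """Smallest x >= 0 where a*x^2 + b*x equals target.

    Assumes a < 0 and that the parabola actually reaches the target.
    """
    disc = b * b + 4 * a * target
    return (-b + math.sqrt(disc)) / (2 * a)


for name, f in (("2a", graph_2a), ("2c", graph_2c)):
    print(name, "f(0) =", round(f(0), 3), " f(100) =", round(f(100), 3))

# Raw score above which model 2c predicts a scaled score over 100.
print("2c reaches 100 at x =", round(first_raw_score_reaching(-0.00420066, 1.4397), 1))
```

Under these assumptions the values at 100 agree with the 101.935 and 101.963 quoted in the text, and model 2c exceeds the Range limit for raw scores above roughly 96.8, so the problem is confined to the very top of the scale.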
Even though the data points in green appear to be scattered above and below the line of best fit, the residual plot still shows a slight pattern. As an experiment, I adjusted the regression formula further by removing the b coefficient, so that it read y₁ ~ ax₁². The results can be seen in Graph 2d, where several of the parameters changed significantly, especially the Coefficient of Determination (r² = 0.5893) and Correlation (r = 0.7676).

3. Exponential Function (y = axⁿ)

The third function regressed against the data set was an exponential function in the standard form y = axⁿ (strictly a power function, since the variable is the base rather than the exponent), written as y₁ ~ ax₁ⁿ in the Desmos calculator, with the limits set as Domain {0 ≤ x ≤ 100} and Range {0 ≤ y ≤ 100}. The results can be seen in Graph 3a below. The function equation was given as y = 3.75652x^0.71926. The Correlation (r = 0.99276) and the Coefficient of Determination (r² = 0.9856) were very strong.

3a. Exponential function regression y = axⁿ
3b. Exponential function residual plot

A closer look at the residual plot in Graph 3b shows an uneven scattering of points. This adds weight to the possibility that this function may not be the best fit. The issue seen in the previous regressions also appears here: at the upper end of the scale, a raw score of 100 converts to a scaled score of 103.111. This represents an error of about 3%, which may be significant in practical application.

4. Trigonometric Function (y = cos x)

The fourth function regressed against the data set was a trigonometric function in the standard form y = a cos x, written as y₁ ~ a cos(2πx₁/b) + c. The results can be seen in Graph 4a below. The function equation was given as y = -48.6182cos(-2πx/3.34085) + 48.6088. The Correlation (r = 0.9979) and the Coefficient of Determination (r² = 0.996) were very strong. A similar issue arises at the top of the scale, although in the opposite direction: a raw score of 100 converts to a scaled score of 96.745 (-3.3%).
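One detail worth checking is how the quoted cosine equation has to be evaluated to give 96.745. The Python sketch below evaluates it at a raw score of 100 in two ways, treating the cosine argument first as degrees and then as radians. The degree-mode reading is an assumption inferred from the arithmetic, since the text does not say which angle mode the calculator was in, and `trig_model` and `degree_mode` are hypothetical names used only for this illustration.

```python
import math

# Coefficients as quoted in the text for the Economics data.
A, B, C = -48.6182, 3.34085, 48.6088


def trig_model(x, degree_mode):
    """Evaluate y = A*cos(-2*pi*x/B) + C.

    If degree_mode is True, the number -2*pi*x/B is treated as an angle in
    degrees (an assumption about the calculator setting); otherwise radians.
    """
    angle = -2 * math.pi * x / B
    if degree_mode:
        angle = math.radians(angle)
    return A * math.cos(angle) + C


print("degrees:", round(trig_model(100, degree_mode=True), 3))
print("radians:", round(trig_model(100, degree_mode=False), 3))
```

With the argument taken in degrees, the result agrees with the 96.745 quoted in the text, and the curve completes only about half a cycle over the Domain {0 ≤ x ≤ 100}. Taken in radians, the same equation oscillates many times over the Domain and gives a value far from 96.745, so anyone re-using the equation outside the calculator would need to match the angle mode.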
4a. Trigonometric function regression
4b. Trigonometric function residual plot

By adding a maximum raw and scaled score (100,100) to the table of values, the equation adjusted to y = -48.8005cos(-2πx/3.37069) + 49.0496. The Correlation (r = 0.9976) and the Coefficient of Determination (r² = 0.9953) remained very strong. This looks like a good fit.

REGRESSION ANALYSIS FOR MATHEMATICAL METHODS

The same process used for the Economics data was applied to the Mathematical Methods data. A minimum score (0,0) and a maximum score (100,100) were added to the table of values. The result for each regression can be seen in the graphs below.

5a. Linear regression
y = x
R² = 0.7116
Correlation: R = 0.8835
Residual plot: Poor fit because a pattern is obvious from the data points.

5b. Quadratic regression
y = ax² + bx
R² = 0.9976
Correlation: R = 0.9987
Residual plot: Good fit with an even spread of data points.

5c. Exponential regression
y = ax^b
R² = 0.9934
Correlation: R = 0.9966
Residual plot: Not an even spread of data points.

5d. Trigonometric function regression
y = a cos(2πx/b) + c
R² = 0.9864
Correlation: R = 0.9931
Residual plot: Not an even spread of data points.
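The quadratic regression y = ax² + bx for Mathematical Methods can also be approximated outside Desmos. The Python sketch below fits that form by ordinary least squares to the Table 1 values plus the added points (0,0) and (100,100), then reports R² and the residuals. It is a sketch under stated assumptions: Desmos may use a different fitting procedure, R² is taken here as 1 - SS_res/SS_tot, and the function names are hypothetical, so the output may differ slightly from the figures quoted above.

```python
# Mathematical Methods raw/scaled pairs from Table 1 (2020, 2021, 2022),
# plus the anchor points (0, 0) and (100, 100) described in the text.
RAW = [57, 69, 80, 88, 97,
       60, 72, 84, 92, 99,
       63, 73, 84, 92, 98,
       0, 100]
SCALED = [77.12, 88.78, 94.53, 96.83, 98.30,
          78.45, 89.40, 95.13, 97.13, 98.24,
          79.42, 89.00, 94.81, 97.06, 98.09,
          0, 100]


def fit_quadratic_through_origin(xs, ys):
    """Least-squares a and b for y = a*x^2 + b*x (no constant term)."""
    sx4 = sum(x ** 4 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx2 = sum(x ** 2 for x in xs)
    sx2y = sum(x ** 2 * y for x, y in zip(xs, ys))
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = sx4 * sx2 - sx3 * sx3
    a = (sx2y * sx2 - sx3 * sxy) / det
    b = (sx4 * sxy - sx3 * sx2y) / det
    return a, b


def r_squared(xs, ys, predict):
    """Coefficient of determination, taken here as 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


a, b = fit_quadratic_through_origin(RAW, SCALED)


def model(x):
    """Fitted curve y = a*x^2 + b*x."""
    return a * x ** 2 + b * x


# Residual = observed scaled score minus predicted scaled score.
residuals = [y - model(x) for x, y in zip(RAW, SCALED)]

print(f"a = {a:.6f}, b = {b:.4f}")
print(f"R^2 = {r_squared(RAW, SCALED, model):.4f}")
print(f"model(100) = {model(100):.3f}")
for x, res in sorted(zip(RAW, residuals)):
    print(f"raw {x:>3}: residual {res:+.2f}")
```

Printing the residuals in order of raw score gives a numerical version of the residual plot: an even mix of positive and negative values with no run of one sign supports the 'even spread' judgement, while a long run of one sign indicates a pattern.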