Linear Regression and Correlation
Topic 18

Linear Regression
Linear regression describes the link between two factors, i.e. how one value depends on the other. For example:
  Driver's age – risk of accident
  Gender – time spent shopping
  Car price – depends on the age of the car
  Sales – depend on marketing

Crickets and Temperature
Crickets make their chirping sounds by rapidly sliding one wing over the other. The faster they move their wings, the higher the chirping sound that is produced.

Analysing the data
1. Graph the data using the XY (Scatter) option.
2. Right-click on one of the data points and select Add Trendline.
3. Select the Linear regression type.
4. Right-click on the trendline and select Format Trendline, then select Options, and finally select Display equation on chart.
With the equation displayed, we can now predict the temperature.

Line of Best Fit
You can see differences between the measured values and the calculated values. Why?

Mean Squared Error (MSE)
The mean squared error (MSE) of an estimator is the expected value of the square of the "error". The error is the amount by which the estimator differs from the quantity to be estimated. The difference occurs because of randomness, or because the estimator does not account for information that could produce a more accurate estimate.

Root Mean Square Error (RMSE)
The root mean square error (RMSE) is a frequently used measure of the difference between the values predicted by a model and the values actually observed from the thing being modelled or estimated. It is the square root of the MSE:
  $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
where $y_i$ are the observed values, $\hat{y}_i$ are the values calculated from the model, and $n$ is the number of observations. The lower the RMSE, the better the fit of the calculated data to the observed data.

Stating the Error
For our crickets we could then say:
  Temperature $Y = 1.8635X - 3.7532$
where X is the recorded beats per second of the cricket's wings, accurate to ± 2.07 °C (the RMSE of the fit).

Correlation Coefficient
The correlation coefficient is a measure of how well trends in the predicted values follow trends in the actual values.
It is a measure of how well the predicted values from a forecast model "fit" the real-life data.

The correlation coefficient is a number between -1 and +1. If there is no relationship between the predicted values and the actual values, the correlation coefficient is 0 or very close to 0 (the predicted values are no better than random numbers). As the strength of the relationship between the predicted values and the actual values increases, the coefficient moves away from 0 towards +1 or -1. A perfect fit gives a coefficient of +1.0 or -1.0. Thus the closer the correlation coefficient is to +1 or -1, the better.

[Figure: a demonstration correlation]

Correlation
Two main methods of calculating correlations are:
  Spearman's Rank Correlation Coefficient
  Pearson's or the Product-Moment Correlation Coefficient

Spearman's Rank Correlation Coefficient
In calculating this coefficient, we use the Greek letter rho ($\rho$). The formula used to calculate this coefficient is:
  $\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
where $d$ is the difference between the two ranks of each observation and $n$ is the number of observations.

Pearson's or Product-Moment Correlation Coefficient
The Pearson Correlation Coefficient is denoted by the symbol $r$. Its formula is based on the standard deviations of the x-values and the y-values:
  $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$
where $\bar{x}$ and $\bar{y}$ are the means and $s_x$ and $s_y$ are the sample standard deviations of the x-values and the y-values.

Coefficient of Determination (R Squared)
R squared shows the proportion of the variation in y that is explained by x. The version most common in statistics texts is based on an analysis of variance decomposition as follows:
  $SST = SSR + SSE$, so $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
where SST is the total sum of squares, SSR is the explained sum of squares, and SSE is the residual sum of squares.

Thankfully, Excel calculates this for you, for example through the trendline option Display R-squared value on chart.
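
Outside Excel, the quantities covered in this topic can also be computed directly. The following is a minimal Python sketch, not the method used in the slides: it fits the least-squares line, then computes the RMSE, Pearson's r, Spearman's rho and R squared. The `beats` and `temps` values are made-up illustrative numbers (the original cricket measurements are not reproduced here), all function names are hypothetical, and the Spearman helper assumes there are no tied values.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(observed, predicted):
    """Root mean square error: square root of the mean squared difference."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_o = sum(observed) / len(observed)
    sst = sum((o - mean_o) ** 2 for o in observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - sse / sst


# Made-up illustrative data (NOT the cricket data from the slides):
# wing beats per second and temperature in degrees Celsius.
beats = [14.0, 15.0, 16.0, 17.0, 18.0, 20.0]
temps = [22.0, 25.0, 25.5, 28.0, 30.5, 33.0]

slope, intercept = linear_fit(beats, temps)
predicted = [slope * x + intercept for x in beats]

print(f"Y = {slope:.4f}X + {intercept:.4f}")
print(f"RMSE         = {rmse(temps, predicted):.3f}")
print(f"Pearson r    = {pearson_r(beats, temps):.4f}")
print(f"Spearman rho = {spearman_rho(beats, temps):.4f}")
print(f"R squared    = {r_squared(temps, predicted):.4f}")
```

For a straight-line fit with an intercept, R squared equals the square of Pearson's r, which gives a quick consistency check on the two results.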