Methodology Glossary Tier 1

Correlation and Regression: Predicting the Unknown

Relationships between Variables

Correlation and regression are widely used methods for investigating the association and relationships between variables. A correlation measures the strength of the relationship between variables, while regression analysis helps determine the nature of that relationship, or how it behaves, and so allows predictions to be made. These methods are very useful, but easily misused.

Correlation

A correlation is expressed by a value, called a correlation coefficient, which lies between -1 and +1. The further the correlation coefficient is from 0, the stronger the relationship between the variables; a coefficient of 0 means there is no relationship between the variables being measured. It should be stressed that a correlation does not measure cause and effect, simply the relationship between variables.

Perfect correlation is where the data points all lie on a perfectly straight line when the two variables are plotted against each other on a chart called a scatter plot. An example of two variables that could produce a correlation coefficient of 0 is illustrated below. In reality the correlation coefficient of these data points will be very close to zero, as it is very rare for there to be absolutely no relationship between variables.

A perfect negative correlation has a coefficient of -1. A negative correlation means that the high values of one data set are matched with the low values of the other; in other words, as the values of one variable increase, the values of the other decrease, as with the age and value of a car.

A perfect positive correlation has a coefficient of +1. A positive correlation means that the high values of one data set are matched with the high values of the other; as the values of one variable increase, so do the values of the other. It is extremely unusual to obtain correlation coefficients of exactly +1 or -1; in practice they fall somewhere between these two extremes. Similarly, a correlation coefficient of exactly zero, meaning no link at all between the two data sets, is very unusual.

When interpreting correlation coefficients it is important to remember two points:

(a) The measurement is correlation, not causation. Just because two variables are correlated does not mean that one causes the other. For example, the occurrence of snakebites and the volume of ice cream eaten are highly positively correlated. Clearly, eating ice cream does not cause snakebites; there is a third explanatory factor, warm weather. As the weather gets warmer, snakes become more active and our appetite for cool refreshments also increases. The variables are simply correlated, and any explanation of causation must be considered carefully.

(b) It is worth examining the pattern of the scatter plot to see if there is anything odd about the data sets. The following chart shows a correlation coefficient of +0.5, but if the last data point were removed the correlation coefficient would increase to +1. It would seem from the chart that this last point is suspect and either very atypical or even a mistake, i.e. it is an outlier.
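As a minimal sketch of how a correlation coefficient is computed in practice, the Python example below calculates Pearson's r for two small data sets loosely based on the car age and value illustration above. The figures are invented for demonstration and do not come from this glossary.

```python
import numpy as np

# Illustrative data only: car age in years and resale value in thousands of pounds.
# These figures are invented for demonstration and are not taken from the glossary.
age = np.array([1, 2, 3, 5, 7, 9, 12])
value = np.array([18.0, 15.5, 13.2, 10.1, 7.4, 5.0, 2.8])

# Pearson correlation coefficient: the covariance of the two variables
# divided by the product of their standard deviations.
r = np.corrcoef(age, value)[0, 1]

print(f"Correlation coefficient r = {r:.2f}")  # close to -1: a strong negative correlation
```

A value near -1, as here, indicates a strong negative correlation; values near +1 or 0 would be read in the same way as described above.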
Regression: predicting from linear relationships

The simplest type of statistical prediction uses the linear correlation between a variable of interest and another variable that either directly affects it, or is at least correlated with it (the explanatory variable). Imagine a scatter plot that shows the size of ten towns along the bottom (x-axis) and the number of dentists in each of those towns up the side (y-axis). In the first town, for example, there is a population of 50,000 people and a total of 6 dentists. If these two variables were perfectly correlated it would be very easy to predict how many dentists would be found in a town of any size: it would simply be a matter of tracing a vertical line from the town size on the x-axis up to the line passing through all the points, and then taking a horizontal line from there across to the y-axis.

Unfortunately our variables are not perfectly correlated, so instead of using a line which joins all the points we use what is called a least squares regression line. This is a line drawn through the data points in such a way that the overall distance between the points and the line is as small as possible. This line is also called the line of best fit, and is illustrated using our original dentist/population chart below.

This line can also be represented by an equation, known as the regression equation or model. This is a way of describing numerically the relationship between population size and number of dentists. Provided the model is reasonably accurate when its predictions are compared to the actual values, the unknown number of dentists in any similar town can be predicted (or projected, or forecast) from its population size in this way.

Many relationships that one would wish to model in order to generate useful predictions are, however, more complex than in this simple illustration. The number of dentists that set up practices in different towns will, in reality, be governed by a whole suite of factors, such as the local economy or proximity to a neighbouring town, so the resulting regression equation would contain an additional term for each of these variables; the numbers attached to each variable in the equation are known as regression coefficients.

Evaluating the model

The computer output for a regression model will include confidence intervals around the estimates for the prediction and each regression coefficient, which allow us to judge the reliability of these statistics. The success of the regression model in describing the relationship, or system, can also be judged by means of the R² statistic, which describes the proportion of the total variability in the system that is explained by the model. This is expressed either as a proportion or as a percentage of the variation explained by the model (68% in the example above).

More of the variability in the data will be explained as more and more variables that truly influence it are added to the analysis. However, increasing the complexity of the analysis often comes at the expense of its usefulness: complex models are harder to understand and more difficult to apply to generate useful predictions, either because the calculations are difficult or because some of the explanatory variables are difficult to measure. So the goal when using regression modelling to generate useful predictions is to achieve the best compromise between maximising the reliability and explanatory power of the model and minimising the number of variables needed to achieve a reliable model.
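To make the least squares idea concrete, the sketch below fits a regression line to a hypothetical dentists/population data set using Python and NumPy. Apart from the first town (50,000 people and 6 dentists, taken from the text above), all of the values are invented for illustration, so the R² it prints will not match the 68% mentioned in this glossary.

```python
import numpy as np

# Hypothetical town populations (thousands of people) and dentist counts.
# Only the first pair (50,000 people, 6 dentists) comes from the text above;
# the remaining values are invented for illustration.
population = np.array([50, 65, 80, 95, 110, 130, 150, 175, 200, 230])
dentists = np.array([6, 7, 10, 9, 13, 14, 18, 17, 23, 25])

# Least squares regression line: dentists = intercept + slope * population.
slope, intercept = np.polyfit(population, dentists, 1)

# R-squared: the proportion of the variability in dentist numbers
# that is explained by the fitted line.
predicted = intercept + slope * population
ss_res = np.sum((dentists - predicted) ** 2)
ss_tot = np.sum((dentists - dentists.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"Regression equation: dentists = {intercept:.2f} + {slope:.3f} * population (thousands)")
print(f"R-squared = {r_squared:.2f}")

# Use the fitted line to predict the number of dentists in a town of 120,000 people.
print(f"Predicted dentists for a town of 120,000 people: {intercept + slope * 120:.1f}")
```

In practice, statistical packages report these quantities directly, together with confidence intervals for the prediction and for each regression coefficient.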
The success of the predictions generated by any model can only be determined empirically, in other words by seeing how accurate they are in real life. In the case of a spatial prediction concerning a statistic for locations where it would not normally be measured, we can go to these locations and measure its actual value to test the prediction. A prediction for some point in the future can be tested when it is eventually measured at that time, but a backwards (retrospective) projection in time is only testable if there is independent evidence of its value in the past. A simple sketch of such an empirical check is given at the end of this note.

Further Information

Tier 1: Confidence Intervals
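As a minimal sketch of checking predictions against reality, the example below continues the hypothetical dentist illustration, with invented coefficients and invented "observed" values. It compares the numbers of dentists predicted by a fitted regression equation with values measured later in three further towns, and summarises the accuracy as a mean absolute error.

```python
import numpy as np

# Hypothetical regression equation: dentists = intercept + slope * population,
# with population measured in thousands. The coefficients and the "observed"
# values below are invented for illustration only.
intercept, slope = 0.5, 0.11

new_populations = np.array([60, 120, 180])    # thousands of people
observed_dentists = np.array([7, 14, 21])     # numbers measured later in each town

predicted = intercept + slope * new_populations

# Mean absolute error: the average gap between prediction and reality.
mae = np.mean(np.abs(observed_dentists - predicted))

for pop, pred, obs in zip(new_populations, predicted, observed_dentists):
    print(f"Town of {pop},000 people: predicted {pred:.1f}, observed {obs}")
print(f"Mean absolute error: {mae:.2f} dentists")
```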