Correlation and Regression

Methodology Glossary Tier 1
Predicting the unknown
Relationships between Variables
Correlation and regression are widely used methods for examining the association
and relationships between variables. Correlation measures the strength of the
relationship between variables, while regression analysis helps determine the
nature of the relationship, or how it behaves. This allows predictions to be made.
These methods are very useful, but easily misused.
Correlation
A correlation is expressed by a value, called a correlation coefficient, which is
between -1 and +1. The further away the correlation coefficient is from 0 the
stronger the relationship between the variables. If the correlation coefficient is 0 it
means there is no linear relationship between the variables being measured. It should
be stressed that a correlation does not measure cause and effect, simply the
relationship between variables. Perfect correlation is where the data points all lie
in a perfectly straight line when the two variables are plotted against each other
on a chart called a scatter plot.
An example of two variables that could produce a correlation coefficient of 0 is
illustrated below. In reality the correlation coefficient of these data points will be
very close to zero, as it is very rare that there is absolutely no relationship
between variables.
A perfect negative correlation has a coefficient of -1. A negative correlation
means that the high values of one of the data sets are matched with the small
values of the other data set, or as the values for one variable increase the values
for the other decrease, similar to the age and value of a car.
A perfect positive correlation has a coefficient of +1. A positive correlation
means that the high values of one data set are matched with the high values
of the other data set, or as the values for one variable increase so do the
values for the other variable.
It is extremely unusual to have correlation coefficients of exactly +1 or -1;
they generally fall somewhere between these two extremes. If the correlation
coefficient is calculated as zero, then this means that there is no linear link
between the two data sets, although this is also very unusual.
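As a rough illustration of how such a coefficient is calculated, the sketch below
computes the Pearson correlation coefficient (the usual choice, assumed here since
the text does not name a specific coefficient) using only the Python standard
library. All data values are made up for illustration. The last example shows a
curved pattern whose coefficient is zero even though the variables are clearly
related, because the coefficient only measures straight-line association.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (how the variables move together).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable on its own.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: car age in years and car value in thousands.
age = [1, 2, 3, 4, 5]
value = [20, 17, 15, 12, 10]

r_car = pearson_r(age, value)                 # strongly negative, close to -1
r_up = pearson_r(age, [2, 4, 6, 8, 10])       # points exactly on a rising line
r_down = pearson_r(age, [10, 8, 6, 4, 2])     # points exactly on a falling line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # symmetric curve

print(f"car: {r_car:.3f}  rising: {r_up:.3f}  "
      f"falling: {r_down:.3f}  curve: {r_curve:.3f}")
```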
When interpreting correlation coefficients it is important to remember two
points:
(a) The measurement is correlation not causation. Just because two
variables are correlated does not mean that one causes the other. For
example the occurrence of snakebites and the volume of ice cream eaten
are highly positively correlated. It is clear that eating ice cream does not
cause snakebites; there is a third explanatory factor: warm weather. As
the weather gets warmer, snakes are more active and our appetite for cool
refreshments also increases. It is important to remember that the variables
are simply correlated, and any explanation of causation must be
considered carefully.
(b) It is worth examining the pattern of the scatter plot to see if there is
anything odd with the data sets. The following chart shows a correlation
coefficient of +0.5, but if the last data point was removed the correlation
coefficient would increase to +1.
It would seem from the chart that the last data point is suspect and either
very atypical or even a mistake, i.e. it is an outlier.
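The effect of a single outlier can be reproduced with a small sketch. The data
below are hypothetical and are not the values behind the chart: seven points lie
exactly on a straight line and an eighth point falls far below it. With the stray
point included the coefficient comes out at roughly +0.5, and with it removed the
coefficient is +1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: the first seven points follow y = 2x exactly,
# the last point (x = 8, y = 3) is the suspected outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 3]

r_all = pearson_r(xs, ys)                # outlier included
r_trimmed = pearson_r(xs[:-1], ys[:-1])  # outlier removed

print(f"with last point: {r_all:.2f}  without last point: {r_trimmed:.2f}")
```

Removing a point should be a deliberate decision based on checking the data, not
simply a way of improving the coefficient.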
Regression: predicting from linear relationships
The simplest type of statistical prediction uses linear correlation between a
variable of interest and another variable that either directly affects it, or is at
least correlated with it (the explanatory variable).
Consider a scatter plot that shows the size of ten towns along the bottom
(x-axis) and the number of dentists in each of those towns up the side (y-axis).
In the first town there is a population of 50,000 people and a total of 6
dentists. If these two variables were perfectly correlated it would be very easy
to predict how many dentists would be in a town of any size. It would simply be a
matter of tracing a vertical line from the town size on the x-axis up to the line
passing through all the points and then taking a horizontal line from there
over to the y-axis.
Unfortunately our variables are not perfectly correlated. So instead of using a
line which joins all the points we use what is called a least squares regression
line. This is the line which passes among the data points in such a way that
the sum of the squared vertical distances between the points and the line is as
small as possible. This line is also called the line of best fit and is illustrated
using our original dentist/population chart below.
This line can also be represented by an equation known as a regression
equation, or model. This is a way of describing numerically the relationship
between population size and number of dentists.
Provided this model is reasonably accurate when the predictions are
compared to the actual values, then the unknown number of dentists in any
similar town can be predicted (or projected, or forecast) from its population
size in this way.
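A minimal sketch of fitting and using such a line is given below. Only the first
town (population 50,000, 6 dentists) comes from the text; the other nine towns are
invented for illustration, and the function names are hypothetical. The slope and
intercept are obtained from the standard least squares formulas, and the sketch
also checks that nudging the slope makes the total squared error worse.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Total of squared vertical distances between the points and a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Population in thousands and number of dentists for ten towns.
# Only the first pair (50, 6) is taken from the text; the rest are made up.
population = [50, 20, 35, 80, 65, 10, 45, 90, 30, 70]
dentists = [6, 3, 4, 9, 8, 1, 6, 10, 3, 8]

slope, intercept = fit_line(population, dentists)
best = sum_squared_error(population, dentists, slope, intercept)
nudged = sum_squared_error(population, dentists, slope * 1.1, intercept)

# Predicted number of dentists for a similar town of 60,000 people.
predicted = intercept + slope * 60

print(f"dentists = {intercept:.3f} + {slope:.4f} * population (thousands)")
print(f"squared error: fitted line {best:.3f}, nudged line {nudged:.3f}")
print(f"prediction for 60,000 people: {predicted:.1f} dentists")
```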
Many relationships that one would wish to model in order to generate useful
predictions are, however, more complex than in this simple illustration. The
number of dentists that set up practices in different towns will, in reality, be
governed by a whole suite of factors, such as the local economy and proximity to
a neighbouring town. The resulting regression equation would therefore contain
an additional term for each of these variables. The multipliers applied to the
variables in these terms are known as regression coefficients.
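To show what a regression equation with more than one explanatory variable looks
like, the sketch below fits two hypothetical variables (population in thousands
and distance to the neighbouring town in kilometres) by solving the normal
equations with plain Gaussian elimination. This is one of several ways to obtain
the coefficients and statistical software may use a different method. The data are
invented, and the response values are generated from a known rule so that the
recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution, from the last unknown up to the first.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_multiple(rows, ys):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical towns: (population in thousands, distance to neighbouring town in km).
towns = [(50, 5), (20, 12), (35, 30), (80, 8), (65, 22), (10, 40)]
# Made-up rule used to generate the response: 1 + 0.1 * population + 0.05 * distance.
dentists = [1 + 0.1 * pop + 0.05 * dist for pop, dist in towns]

coefficients = fit_multiple(towns, dentists)
print("intercept, population coefficient, distance coefficient:")
print([round(c, 4) for c in coefficients])
```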
Evaluating the model
The computer output for a regression model will include confidence intervals
around the estimates for the prediction and each regression coefficient, which
allow us to judge the reliability of these statistics.
The success of the regression model in describing the relationship or system
can also be judged by means of the R² statistic, which describes the
proportion of the total variability in the system that is explained by the model.
This is expressed either as a proportion or a percentage of the variation that
is explained by the model (68% in the example above).
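The calculation behind this statistic can be sketched as follows: the variation
left unexplained by the fitted line is compared with the total variation around
the mean. The town data are hypothetical (only the first pair, 50,000 people and
6 dentists, is taken from the text), so the resulting figure is not the 68%
quoted for the original example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, predicted):
    """Proportion of the total variation in ys that the predictions account for."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)                      # total variation
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # residual variation
    return 1 - unexplained / total

# Population in thousands and number of dentists; mostly made-up values.
population = [50, 20, 35, 80, 65, 10, 45, 90, 30, 70]
dentists = [6, 3, 4, 9, 8, 1, 6, 10, 3, 8]

slope, intercept = fit_line(population, dentists)
predicted = [intercept + slope * x for x in population]
r2 = r_squared(dentists, predicted)

print(f"R squared: {r2:.3f} ({r2:.0%} of the variation explained)")
```

For a model with a single explanatory variable, this value equals the square of
the correlation coefficient between the two variables.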
More of the variability in the data will be explained as more and more
variables that truly influence it are added to the analysis. However, increasing
the complexity of the analysis is often at the expense of its usefulness:
complex models are harder to understand, and more difficult to apply to
generate useful predictions, because the calculations are difficult, or because
some of the explanatory variables are difficult to measure.
So the goal when using regression modelling to generate useful predictions
is to achieve the best compromise between maximising the reliability and
explanatory power of the model, and minimising the number of variables that
need to be used to achieve a reliable model.
The success of the predictions generated by any model can only be
determined empirically, in other words by seeing how accurate they are in
real life. In the case of a spatial prediction concerning a statistic for locations
where it would not normally be measured, we can go to these locations and
measure its actual value to test the prediction. A prediction for some point in
the future can be tested when it is eventually measured at that time, but a
backwards (retrospective) projection in time is only testable if there is
independent evidence of its value in the past.
Further Information
Tier 1 Confidence Intervals