Microbiology 3053
Microbiological Procedures
Correlation analysis is used when you have measured two continuous variables and want to quantify how consistently they vary together
The stronger the correlation, the more accurately the value of one variable can be estimated from the other
The direction and magnitude of a correlation are quantified by Pearson's correlation coefficient, r
r ranges from perfectly negative (-1.00) to perfectly positive (1.00)
No relationship (0.00)
The closer |r| is to 1, the stronger the relationship
r = 0 means that knowing the value of one variable tells us nothing about the value of the other
Correlation analysis uses data that has already been collected
Archival
Data not produced by experimentation
Correlation does not show cause and effect but may suggest such a relationship
There is a strong, positive correlation between:
the number of churches and the number of bars in a town
smoking and alcoholism (consider the relationship between smoking and lung cancer)
eating breakfast and school performance
marijuana usage and heroin addiction (vs. heroin addiction and marijuana usage)
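As an illustrative sketch (not part of the original notes, with invented example data), Pearson's r can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# A perfectly linear, increasing relationship gives r = 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Note that the assignment of x and y does not matter here: swapping the two sequences gives the same r, in keeping with correlation having no independent or dependent variable.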
Scatterplots are used to illustrate correlation analysis
Assignment of axes does not matter (no independent and dependent variables)
Order in which data pairs are plotted does not matter
In strict usage, lines are not drawn through correlation scatterplots
[Figure: example scatterplots: Weak Positive Correlation (r = 0.266); Strong Negative Correlation (r = -0.9960); No Correlation (r = 0.00)]
Linear regression is used to measure the relationship between two variables
Allows prediction and implies a cause-and-effect relationship
Does one variable change in a consistent manner with another variable?
x = independent variable (cause)
y = dependent variable (effect)
If it is not clear which variable is the cause and which is the effect, linear regression is probably an inappropriate test
Calculated from experimental data
Independent variable is under the control of the investigator (exact value)
Dependent variable is normally distributed
Differs from correlation, where both variables are normally distributed and selected at random by the investigator
Regression analysis with more than one independent variable is termed multiple (linear) regression
Best-fit line is based on the sum of the squares of the distances of the data points from the predicted values (on the line)
[Figure: scatterplot of a dependent variable against the independent variable, with best-fit line y = 1.0092x + 8.6509, R^2 = 0.8863]
y = a + bx, where
a = y-intercept (the value of y where x = 0 and the line crosses the y-axis)
b = slope of the line ((y2 - y1)/(x2 - x1))
The slope indicates the nature of the correlation
Positive = y increases as x increases
Negative = y decreases as x increases
0 = no correlation
Same as Pearson’s correlation
No relationship between the variables
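As a minimal sketch (not from the original notes; the data are invented), the intercept a and slope b of the least-squares line can be computed from the sums of squares described earlier:

```python
def least_squares(x, y):
    """Fit y = a + b*x by least squares; returns (intercept a, slope b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: covariance of x and y divided by the variance of x
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx  # the fitted line passes through (mean x, mean y)
    return a, b

a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])  # exact line y = 1 + 2x
```

The sign of b indicates the nature of the relationship, as above: positive if y increases with x, negative if it decreases.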
Shows the strength of the linear relationship between two variables, symbolized by r
The closer the data points are to the line, the closer r is to 1 or -1
r varies between -1 (perfect negative correlation) and 1 (perfect positive correlation)
0 - 0.2: no or very weak association
0.2 - 0.4: weak association
0.4 - 0.6: moderate association
0.6 - 0.8: strong association
0.8 - 1.0: very strong to perfect association
The null hypothesis is no association (r = 0)
Salkind, N. J. 2000. Statistics for People Who Think They Hate Statistics. Thousand Oaks, CA: Sage.
r2 (the square of the correlation coefficient) is used to estimate the extent to which the dependent variable (y) is under the influence of the independent variable (x)
r2 varies from 0 to 1
r2 = 1 means that the value of y is completely dependent on x (no error or other contributing factors)
r2 < 1 indicates that the value of y is influenced by more than the value of x
A measurement of the proportion of the variance of y explained by its dependence on x
The remainder (1 - r2) is the variance of y that is not explained by x (i.e., error or other factors)
e.g., if r2 = 0.84, there is a strong, positive relationship between the variables, and the value of x predicts 84% of the variability of y (the other 16% is due to other factors)
r2 can also be calculated for correlation analysis by squaring r, but there it is not a measure of the variation of y explained by variation in x; rather, variation in y is associated with the variance of x (and vice versa)
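As a sketch of the arithmetic (not from the original notes; data invented), r2 can be computed as the proportion of the total variance of y removed by the least-squares fit:

```python
def r_squared(x, y):
    """Coefficient of determination: proportion of the variance of y
    explained by the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ss_tot = sum((yi - my) ** 2 for yi in y)                        # total
    return 1 - ss_res / ss_tot

print(r_squared([0, 1, 2, 3], [1, 3, 5, 7]))  # exact line, so r2 = 1.0
```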
Independent variable (x) is selected by investigator (not random) and has no associated variance
For every value of x, values of y have a normal distribution
Observed values of y differ from the mean value of y by an amount called a residual. (Residuals are normally distributed.)
The variances of y for all values of x are equal
(homoscedasticity)
Observations are independent (Each individual in the sample is only measured once.)
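The residuals named in the assumptions above can be sketched as follows (not from the original notes; example data invented): each residual is the observed y minus the y predicted by the fitted line, and for a least-squares fit the residuals always sum to zero.

```python
def residuals(x, y):
    """Residuals (observed minus predicted y) from the least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

res = residuals([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
# Least-squares residuals sum to zero (up to floating-point error)
```

Plotting these residuals against x is the homoscedasticity check discussed below.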
The numbers alone do not guarantee that the data have been fitted well!
Anscombe, F. J. 1973. Graphs in Statistical Analysis. The American Statistician 27(1):17-21.
Figure 1: Acceptable regression model with observations distributed evenly around the regression line
Figure 2: Strong curvature suggests that linear regression may not be appropriate (an additional variable may be required)
Figure 3: A single outlier alters the slope of the line.
The point may be erroneous, but if not, a different test may be necessary
Figure 4: Actually a regression line connecting only two points. If the rightmost point were different, the regression line would shift.
Homoscedastic
• Variance appears random
• Good regression model
Heteroscedastic
• "Funnel" shaped and may be bowed
• Suggests that a transformation and inclusion of additional variables may be warranted
Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)
[Figure: Anscombe's quartet: Data Sets 1-4, each plotted against X Variable 1]
Values that appear very different from others in the data set
Rule of thumb: an outlier is more than three standard deviations from the mean
Three causes
Measurement or recording error
Observation from a different population
A rare event from within the population
Outliers need to be considered and not simply dismissed
May indicate an important phenomenon, e.g., the ozone hole data (outliers were removed automatically by the analysis program, delaying the observation by about 10 years)
Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)
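The three-standard-deviation rule of thumb can be sketched in a few lines (not from the original notes; the values are invented, and, as stressed above, flagged points should be inspected, not simply removed):

```python
def flag_outliers(values, k=3.0):
    """Return values more than k sample standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [v for v in values if abs(v - mean) > k * sd]

data = [10] * 10 + [9] * 5 + [11] * 5 + [100]
print(flag_outliers(data))  # only the extreme value is flagged
```

One caveat with this rule: a single extreme value inflates the standard deviation it is compared against, so in very small samples even a gross outlier may escape the three-SD threshold.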
Data should be interval or ratio
The dependent and independent variables should be identifiable
The relationship between variables should be linear (if not, a transformation might be appropriate)
Have you chosen the values of the independent variable?
Does the residual plot show a random spread (homoscedastic), and does the normal probability plot display a straight line (or does a histogram of residuals show a normal distribution)?
The normal probability plot indicates whether the residuals follow a normal distribution, in which case the points will follow a straight line.
Expect some moderate scatter even with normal data. Look only for definite patterns like an "S-shaped" curve, which indicates that a transformation of the response may provide a better analysis. (from Design-Expert 7.0, Stat-Ease)
The Michaelis-Menten equation describes enzyme activity:
v0 = Vmax[S] / (Km + [S])
It is linearized by taking its reciprocal:
1/v0 = (Km/Vmax)(1/[S]) + 1/Vmax
where: y = 1/v0, x = 1/[S], a = 1/Vmax, b = Km/Vmax
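As a sketch of this transformation (not from the original notes; the kinetic constants are invented test values), fitting a straight line to the reciprocal data recovers Vmax from the intercept and Km from the slope:

```python
def lineweaver_burk(S, v):
    """Estimate (Vmax, Km) by least squares on the double-reciprocal plot:
    1/v = (Km/Vmax)(1/S) + 1/Vmax."""
    x = [1.0 / s for s in S]
    y = [1.0 / vi for vi in v]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    Vmax = 1.0 / a   # intercept a = 1/Vmax
    Km = b * Vmax    # slope b = Km/Vmax
    return Vmax, Km

# Noise-free data generated with Vmax = 100, Km = 25 are recovered exactly
S = [5, 10, 25, 50, 100]
v = [100 * s / (25 + s) for s in S]
```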
[Figure: Michaelis-Menten plot of v against S (pennies/m^2)]
[Figure: Lineweaver-Burk plot of 1/v against 1/S (pennies/m^2)^-1, with fitted line y = 0.7053x + 0.0076, R^2 = 0.9785]
[Figure: Eadie-Hofstee plot of v against v/S (m^2/min), with fitted line y = -85.671x + 124.48, R^2 = 0.8543]
[Figure: residual plot of residuals against the X variable]
[Figure: normal probability plot of residuals against sample percentile]