Chapters 10 and 11
Terminology:
- Measurement or Quantitative Variables: variables that involve measurement or
counting: e.g. number of pages in a book (count), number of students in class
(count), height (measured), weight (measured), exam score (measured).
- Deterministic Relationship: one variable is determined exactly by another. For
example, if we know your height in inches we can convert it exactly to
centimeters by multiplying by 2.54.
- Statistical Relationship: a relationship in which natural variability exists.
Consider the first exam scores and mean quiz scores. Although there is a
relationship between the performance on these two variables, students with
the same mean quiz average did not achieve the same exam score. For instance, 4
students had a quiz average of 92.22 but had the following exam scores: 73.3,
86.7, 86.7, and 96.7.
- Linear Relationship and Regression: if a statistical relationship exists in
which one measurement variable changes in a linear manner with a change in the
second measurement variable, then we can employ what are called regression
methods in statistics to explore and explain this linear relationship. For instance,
if we can show that a linear relationship exists between mean quiz scores and
exam scores, then we can use regression methods to explain and predict one
variable based on the other.
- Response, Outcome, Dependent variable versus Explanatory, Independent,
Predictor variable: in regression the response (or outcome, or dependent) variable
is the variable we are interested in predicting or explaining using a second
variable called the explanatory (or independent, or predictor) variable. E.g. we want
to predict exam scores (response/outcome) based on mean quiz scores
(predictor/explanatory), or explain the variation in weight (response/outcome) using
height (predictor/explanatory).
Determining Linear Relationship
- Scatterplot: a plot of the paired observations of the response and predictor
variables, with the response on the Y (vertical) axis and the predictor on the
X (horizontal) axis.
- For the scatterplot below what can you see?
i. Does the plot indicate any relationship?
ii. If yes, is it linear?
iii. If linear what direction?
- From this plot you can see that there appears to be a linear relationship that is positive.
That is, as Quiz Average increases so does Exam score. A negative relationship occurs
when the response decreases as the predictor variable increases (e.g. consider driving
speed and travel time: the faster you drive, i.e. as speed increases, travel time decreases).
A short plotting sketch follows the figure below.
[Figure: Scatterplot of Exam 1 vs Quizzes Average, with Exam 1 on the vertical axis and Quizzes Average on the horizontal axis.]
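A minimal sketch of how such a scatterplot could be produced in Python; the small data set below is hypothetical, not the actual class data:

    import matplotlib.pyplot as plt

    # Hypothetical quiz averages (predictor, X) and exam scores (response, Y)
    quiz_avg = [55, 62, 70, 75, 80, 85, 90, 92, 95, 100]
    exam1    = [58, 65, 70, 72, 78, 82, 88, 87, 93, 96]

    # Response on the vertical (Y) axis, predictor on the horizontal (X) axis
    plt.scatter(quiz_avg, exam1)
    plt.xlabel("Quizzes Average")
    plt.ylabel("Exam 1")
    plt.title("Scatterplot of Exam 1 vs Quizzes Average")
    plt.show()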
- Measuring this relationship: the scatterplot is helpful as it provides a picture of
what is happening, but it does leave room for various interpretations. To quantify a
linear relationship between two measurement variables we use a statistical
measure called correlation, symbolized by r.
- Correlation: the measure of the strength and direction of a linear relationship
between two measurement variables. A perfect positive linear relationship (i.e.
the points of the scatterplot fall exactly on a line with an increasing pattern) has a
correlation of 1. Conversely, a perfect negative linear relationship has a correlation of
negative one. No linear relationship has a correlation value of 0. Therefore the
range of possible correlation values is -1 ≤ r ≤ 1. Keep in mind that the sign of
the correlation has nothing to do with the strength of the linear relationship,
only the direction. That is, -1 and 1 indicate the same strength of linear
relationship, just in opposite directions. Thus a correlation of -0.9 would indicate
a stronger linear relationship than a correlation of 0.2.
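As a sketch, the correlation r can be computed in Python with numpy; the arrays below are hypothetical:

    import numpy as np

    x = np.array([55, 62, 70, 75, 80, 85, 90, 92, 95, 100])  # predictor (hypothetical)
    y = np.array([58, 65, 70, 72, 78, 82, 88, 87, 93, 96])   # response (hypothetical)

    # Pearson correlation: always falls between -1 and 1; the sign gives the
    # direction of the linear relationship and the magnitude gives its strength
    r = np.corrcoef(x, y)[0, 1]
    print(round(r, 3))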
- Statistical Significance: the scatterplot and correlation can lead to varying
interpretations; i.e. people may have differing opinions on how to interpret the plot
and/or whether the correlation value indicates a strong or weak linear relationship.
In an effort to remove these differences, researchers use the concept of statistical
significance. This concept, applied to the same set of data, should direct
researchers to the same conclusion about the data. Once a correlation is found,
one can compare this value to zero to see whether the difference between the correlation
and zero is statistically meaningful, or whether the difference occurs simply due
to chance.
- P-value: a p-value or “probability” value is calculated for a correlation. The
methods for calculating this p-value are beyond the scope of this course, but the
interpretation of a p-value is fairly straightforward. The p-value for a correlation is the
probability that the data would produce a correlation at least that large if in reality the
correlation were zero. For example, the correlation between exam and quiz
average was 0.856 (strong and positive) and had a p-value of 0.0001, which is very
small. What this means is this:
If the correlation between exam score and quiz average were zero, then
the probability that our sample data would produce a correlation of 0.856 or
higher is 0.0001. Very unlikely!
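A sketch of how both the correlation and its p-value could be obtained with scipy. The arrays are hypothetical, so the printed numbers will differ from the r = 0.856 and p-value = 0.0001 quoted above:

    from scipy.stats import pearsonr

    quiz_avg = [55, 62, 70, 75, 80, 85, 90, 92, 95, 100]  # hypothetical
    exam1    = [58, 65, 70, 72, 78, 82, 88, 87, 93, 96]   # hypothetical

    # pearsonr returns the correlation r and the p-value for testing
    # whether the true correlation is zero
    r, p_value = pearsonr(quiz_avg, exam1)
    print(f"r = {r:.3f}, p-value = {p_value:.4f}")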
- Interpreting this p-value: again we have another number, the p-value, so how is
one to judge whether this p-value is “small enough”? To make such a conclusion
we compare this p-value to some standard called a “level of significance”: if the
p-value is below, i.e. less than, this standard, the relationship is deemed
statistically significant. That is, the correlation is statistically different
from zero. A value of 0.05 or 5% is a common choice for this level of
significance. In our example, with the p-value of 0.0001 being less than 0.05, we
would claim that the correlation of 0.856 is significantly different from zero.
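The decision rule described above as a short sketch, using the conventional 0.05 level of significance:

    def is_significant(p_value, alpha=0.05):
        # Statistically significant if the p-value is below the level of significance
        return p_value < alpha

    # With the values quoted above (p-value = 0.0001, alpha = 0.05):
    print(is_significant(0.0001))  # True: the correlation differs significantly from zero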
- What is going on here? To summarize, when we have two measurement
variables and we want to see if one is linearly related to the other, and therefore
whether one variable can be used to predict the other, we now have some statistical
tools to analyze this. Since initially we do not know whether a linear relationship
exists, we start by assuming the correlation between the two variables is zero until
we can show otherwise. Next we calculate the correlation based on our sample data.
But in truth, would you really expect the sample correlation to be exactly zero even
if there is no linear relationship? Probably not. Since we don’t expect the
correlation to be exactly zero, how can we tell whether this difference is based on chance
(i.e. because we have sample data we already expect the correlation not to be
exactly zero, so is this difference simply due to sampling?) or whether this difference
is statistically different from zero? To make this
determination we compare the p-value to 0.05: if the p-value is smaller, we
say the difference is a statistical one and thus we have a statistically significant result.
If the p-value is greater than 0.05, then we decide that this difference from zero is due
to chance and therefore we cannot say that the two variables are statistically linearly
related.
- Regression: once a linear relationship between two measurement variables has
been determined, we can express this relationship using the equation of a line.
o Recall, possibly!, from algebra the following: y = b + mx where:
 b is the y-intercept: where the line crosses the y-axis
 m is the slope of the line: rise over run?
o For example, since one inch is equal to 2.54 centimeters, if I gathered
everyone’s height in inches and then converted these to centimeters the
resulting line would be: cm = 0 + 2.54*inches. Here the slope means that for
every increase of one inch, the predicted height in centimeters increases by 2.54 cm.
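A minimal sketch of this deterministic line y = b + mx, with intercept b = 0 and slope m = 2.54:

    def inches_to_cm(inches):
        # Deterministic line: cm = 0 + 2.54 * inches
        # Intercept b = 0, slope m = 2.54 (each extra inch adds exactly 2.54 cm)
        return 0 + 2.54 * inches

    print(inches_to_cm(70))  # a 70-inch height is exactly 177.8 cm, with no scatter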
- Regression Line: in statistics, our data very rarely fall in a deterministic
relationship. Instead of falling on a straight line, the data are “scattered” about.
However, if we can demonstrate that a significant linear relationship exists (via
the scatterplot and correlation), then we can fit a line to this data. How the
y-intercept and slope are calculated is beyond the scope of this course; you will simply be
given these values, i.e. the line equation (a sketch of how software could produce
such values follows below). However, you are expected to know how
to interpret the line and make predictions. From the entire data set of our exam 1
and average quiz scores, the correlation between these two variables was 0.856 with
a p-value of 0.0001, indicating a strong, positive linear relationship. The
regression equation is:
o Exam01 = 33.1 + 0.6 QuizAverage
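Although the calculation is beyond this course, the sketch below shows how statistical software could produce such estimates, here using scipy’s linregress on hypothetical data (the actual class data would be needed to reproduce 33.1 and 0.6):

    from scipy.stats import linregress

    quiz_avg = [55, 62, 70, 75, 80, 85, 90, 92, 95, 100]  # hypothetical
    exam1    = [58, 65, 70, 72, 78, 82, 88, 87, 93, 96]   # hypothetical

    # linregress returns the slope, intercept, correlation r, p-value, and
    # the standard error of the slope
    result = linregress(quiz_avg, exam1)
    print(f"Exam01 = {result.intercept:.1f} + {result.slope:.2f} QuizAverage")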
- Interpretations of this line: the y-intercept means that for a quiz average
of 0, the line would intersect the y-axis at 33.1% for exam 1. A quiz average of 0
is possible if the student did not take any of the quizzes, or scored a 0 on
each. By the way, this could also serve as a red flag for a student’s
performance: imagine a student who did not take any of the quizzes but scored very
well on the exam (say 70% or better). This could result in the student being
questioned. As to the interpretation of the slope of 0.60: for each
1-point increase in quiz average, the expected or predicted exam score
increases by 0.60 points, or roughly half a percentage point.
- Predicting a response value: with the regression line we can predict an expected
or mean response for a specific value of the explanatory variable by plugging that
value of “X” into the equation. For instance, from our equation, for a quiz average
of 100% we would predict an exam 1 score of: 33.1 + 0.60*100 = 93.1%.
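The prediction above as a short sketch, using the regression equation given for the class data:

    def predict_exam(quiz_average):
        # Regression equation from the notes: Exam01 = 33.1 + 0.6 * QuizAverage
        return 33.1 + 0.6 * quiz_average

    print(predict_exam(100))  # 93.1, the predicted exam score for a 100% quiz average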
- Extrapolation: one important concept to keep in mind is that the regression
equation is based on the sample data, specifically the range of those values.
Using the equation for values of the predictor variable outside the range of the sample
values is called extrapolation. Extrapolation can lead to erroneous results.
Consider, for example, a study done to predict the response (outcome) Baby
Weight based on the predictor variable of age in months, from age 0 (birth
weight) to 22 months. Say such data produced a regression equation of: Weight =
6 + 0.9 Age. Now suppose I were to plug in my age in months (552). My predicted
weight would be about 503 pounds!
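A sketch of the extrapolation example, with a guard for predictor values outside the 0 to 22 month range the equation was built on:

    def predict_weight(age_months):
        # Regression equation from the example: Weight = 6 + 0.9 * Age,
        # built from babies aged 0 to 22 months
        if not 0 <= age_months <= 22:
            print("Warning: extrapolating outside the 0-22 month range of the data")
        return 6 + 0.9 * age_months

    print(predict_weight(12))   # 16.8 pounds: within the data range
    print(predict_weight(552))  # about 503 pounds: a nonsensical extrapolation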
Notes about Correlation
- Correlation is unit free. E.g. if we had the correlation between weight in pounds
and height in inches and converted the heights to centimeters, the correlation
would be unchanged (see the sketch after these notes).
- Correlation can only range from negative one to positive one, and the sign only
indicates the direction of the relationship. The magnitude indicates the strength. For
example, a correlation of negative 0.9 would indicate a stronger linear relationship
than a correlation of positive 0.8.
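A sketch illustrating that correlation is unit free: rescaling heights from inches to centimeters leaves r unchanged (the data below are hypothetical):

    import numpy as np

    height_in = np.array([60, 63, 66, 68, 70, 72, 75])         # inches (hypothetical)
    weight_lb = np.array([110, 125, 140, 155, 160, 175, 190])  # pounds (hypothetical)

    r_inches = np.corrcoef(height_in, weight_lb)[0, 1]
    r_cm = np.corrcoef(height_in * 2.54, weight_lb)[0, 1]  # convert inches to cm

    print(round(r_inches, 6) == round(r_cm, 6))  # True: the correlation is unchanged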
Outliers: There are generally two types of outliers: observations that “stray” from
the general distribution of the data, and observations that influence the relationship
between the two variables. The latter is called an “influential outlier”. From the
graphic below, both Points A and B would be outliers, but Point B would be influential.
Why? Because if you remove Point B, the correlation between X and Y would be greatly
affected, as would the regression of Y on X. However, removing Point A would simply
improve both. For Point B: with the point included, the regression equation has a positive
slope (i.e. positive correlation) and the line would go straight through B. If B is removed,
the slope would be negative. In general, influential outliers are observations that are
outside the range of the remaining X-observations (i.e. on the horizontal axis).
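A sketch of an influential outlier using hypothetical data: a single point far outside the range of the other X values can flip the sign of both the correlation and the regression slope.

    import numpy as np
    from scipy.stats import linregress

    # Hypothetical data with a mild negative relationship
    x = np.array([1, 2, 3, 4, 5, 6])
    y = np.array([6.0, 5.5, 5.8, 5.0, 4.8, 4.5])

    # Add an influential point far outside the range of the other X values
    x_b = np.append(x, 20)
    y_b = np.append(y, 15.0)

    print(linregress(x, y).slope)      # negative slope without the outlier
    print(linregress(x_b, y_b).slope)  # positive slope once the outlier is included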