Introduction to Methods for Two Numeric Variables
In this set of notes we will discuss methods used for comparing two ________________ variables. We
will introduce the concept of correlation and the basic idea behind simple linear regression.
Example: Lions’ age and proportion of nose blackness
It has been suggested that the amount of black pigmentation on the nose of a male lion increases with
age. Whitman et al. (2004) measured the proportion of black on the noses of lions of known ages in
Tanzania, East Africa. The proportions were obtained by digitally analyzing photographs; a subset
of the original data is given in the file Lions.jmp on the course website.
Close-up colour photographs were taken of known-aged lions from the Serengeti National Park and Ngorongoro Crater, Tanzania,
between 1999 and 2002. Each photograph was first digitized at high resolution into a .tif file, and the fleshy part of the nose (‘nose tip’)
from each image was excised using Adobe Photoshop 4.01 LE. Then, the Spatial Analyst extension of ESRI Arcview 3.2 was used to
rasterize each cut-out nose tip and assign each newly created ‘grid’ a range of colour values. By limiting the colour values to either ‘black’
or ‘not black’, the nasal pigmentation pattern was ‘mapped’ and quantified for the percentage of readable pixels that contained ‘black’.
Excised photo of nose tip
Identification photograph of a 3-yr-old Serengeti male
GIS rendering of nose colouration
A portion of the data is given below:
Pearson Product-moment Correlation Coefficient
The Pearson product-moment correlation coefficient (denoted by _____) is used to _________________
and ______________ the relationship between two numeric variables. It is appropriate to use when the
following are true:

The relationship between the two variables is a __________________ relationship.

Both variables are measured on _______________ or _________ scales.

Both variables are _____________ distributed.
The formula for the Pearson product-moment correlation coefficient is given below.
$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \,/\, (n - 1)}{\text{s.d.}(x)\,\text{s.d.}(y)}
$$
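As a rough illustration of the formula, the sketch below computes r step by step in Python: the sample covariance (the sum of cross-products divided by n - 1) is divided by the product of the two sample standard deviations. The function name `pearson_r` and the five data points are made up for illustration; they are not the values in Lions.jmp.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson r: sample covariance divided by the product of the sample s.d.s."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Numerator of the formula: sum of cross-products, divided by n - 1.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    # stdev() uses the n - 1 divisor, matching s.d.(x) and s.d.(y).
    return cov_xy / (stdev(x) * stdev(y))


# Hypothetical values for illustration only; NOT the Lions.jmp data.
ages = [1.0, 2.0, 3.0, 4.0, 5.0]
proportions = [0.10, 0.20, 0.25, 0.45, 0.50]

r = pearson_r(ages, proportions)
print(f"r = {r:.3f}")
```

Using `stdev` (which divides by n - 1) keeps the denominator consistent with the n - 1 in the numerator.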
Back to the example:
The mean and standard deviation of both Age and Proportion of nose blackness can be calculated in
JMP. Choose Analyze > Distribution and put both Age and Proprotion_Nose_Blackness in the Y,
Columns box. JMP will return the following output.
Calculations behind the correlation coefficient:
We can also use JMP to directly calculate the Pearson correlation coefficient. Choose Analyze >
Multivariate Methods > Multivariate. Then put both Age and Proportion of Nose Blackness in the Y,
Columns box and click OK. You should get the following output.
Interpreting the Pearson correlation coefficient:
1. A __________________ correlation indicates a ___________________ association between the
two numeric variables and a ___________________ correlation indicates a _________________
association.
2. The correlation coefficient is ALWAYS between _____ and _____ (-1 ≤ r ≤ 1).
a. Values near _____ indicate that a very __________ relationship exists.
b. Values close to _____ indicate a very strong _______________ relationship exists.
c. Values close to _____ indicate a very strong _______________ relationship exists.
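The sign and range of r can be checked numerically. The sketch below is an illustration on made-up numbers, not course data: it evaluates r for points lying exactly on an increasing line, exactly on a decreasing line, and in a symmetric pattern with no linear trend. The helper `pearson_r` is a hypothetical name that follows the formula given earlier.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson r: sample covariance over the product of sample standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov_xy / (stdev(x) * stdev(y))


# Made-up x values, used only to illustrate the extremes of r.
x = [1, 2, 3, 4, 5, 6]

r_up = pearson_r(x, [2 * xi + 1 for xi in x])     # points exactly on an increasing line
r_down = pearson_r(x, [10 - 3 * xi for xi in x])  # points exactly on a decreasing line
r_flat = pearson_r(x, [3, 1, 2, 2, 1, 3])         # symmetric pattern, no linear trend

for label, value in [("increasing", r_up), ("decreasing", r_down), ("no trend", r_flat)]:
    print(f"{label:>10}: r = {value:+.3f}")
```

Because of floating-point rounding, the first two values may differ from the exact limits in the last decimal places.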
Questions:
1. What does the correlation coefficient say about the direction of the relationship between a
lion’s age and the proportion of nose blackness?
2. What does the correlation coefficient say about the strength of the relationship between a lion’s
age and the proportion of nose blackness?
Caution regarding Pearson’s correlation coefficient
You should never use a correlation coefficient without also looking at the scatterplot of the data. Why?
Consider the data in the file Anscombe_Example.jmp on the course website.
Variables    Pearson Correlation Coefficient (r)    Scatterplot
X1, Y1
X2, Y2
X3, Y3
X4, Y4
To avoid misinterpreting a correlation, always accompany the correlation coefficient with a scatterplot
of the data and make sure the assumptions behind the Pearson product-moment correlation are met!
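The point can be illustrated outside JMP with a short Python sketch. It assumes that Anscombe_Example.jmp contains the classic Anscombe (1973) quartet; the values below are typed in from that published dataset, not read from the course file. The four correlations come out nearly identical, which is exactly why r alone cannot distinguish the four patterns and a scatterplot of each pair is still needed.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson r: sample covariance over the product of sample standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov_xy / (stdev(x) * stdev(y))


# Anscombe's quartet (assumed to match the contents of Anscombe_Example.jmp).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "X1, Y1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "X2, Y2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "X3, Y3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "X4, Y4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
               [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Compute r for each pair of variables.
correlations = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```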
Introduction to Simple Linear Regression
Next, we will explore the basic idea behind regression analysis. A simple linear regression model
describes the relationship between a numeric ________________ variable (y) and a single
_______________ variable (x).
Response Variable: The ____________ variable, or the variable to be modeled.
Predictor Variable: The ______________ variable used as a predictor of the response.
The concept of conditioning
Once again, consider the data in the file Lions.jmp from the course website. We have already
investigated the _____________________ distribution of the Proportion of Nose Blackness (i.e. without
regard to age).
However, the population of the Proportion of Nose Blackness actually consists of several
subpopulations. For example, there is one subpopulation for each level of Age. The distribution of the
Proportion of Nose Blackness in each subpopulation is called the _____________________ distribution.
Goal of Regression: To understand how the conditional distribution of the response variable (y) varies
across the subpopulations determined by the possible values of the predictor(s).
Regression Notation: y | (x = x )
This represents the response in the subpopulation where the predictor is fixed at _____. For example,
the Proportion of Nose Blackness | (Age = 2) refers only to the subpopulation of two-year-old lions.
When the particular value of Age is not an issue, we may more generally refer to the distribution of
Proportion of Nose Blackness | Age.
Mean and variance functions
Just like any distribution, the conditional distribution for each subpopulation has a mean and a variance.
Moreover, since the researchers expect the Proportion of Nose Blackness to change with Age, we expect
the means and variances of the conditional distributions to __________ with Age as well.

Mean Function –

Variance Function –
Calculating the summary statistics for each group
To calculate the summary statistics for each group using JMP, choose Analyze > Distribution, and enter
the following:
Click OK and JMP will return the following.
Questions:
3. What is Ê (Proportion of Nose Blackness | Age = 2)?
4. What is Ê (Proportion of Nose Blackness | Age = 3)?
5. What is Ê (Proportion of Nose Blackness | Age = 4)?
6. What is Ê (Proportion of Nose Blackness | Age = 7)?
7. What is Vâr(Proportion of Nose Blackness | Age = 3)?
8. What is Vâr(Proportion of Nose Blackness | Age = 7)?
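The estimates asked for above are simply the sample mean and sample variance of the response within each age group. A minimal Python sketch of that calculation follows; the (age, proportion) pairs are hypothetical and are not the Lions.jmp data, and the names `observations` and `summary` are chosen only for this example.

```python
from collections import defaultdict
from statistics import mean, variance

# Hypothetical (age, proportion of nose blackness) pairs; NOT the Lions.jmp data.
observations = [
    (2, 0.10), (2, 0.14), (2, 0.12),
    (3, 0.20), (3, 0.26), (3, 0.23),
    (4, 0.30), (4, 0.40),
]

# Split the response values into subpopulations, one per observed age.
by_age = defaultdict(list)
for age, proportion in observations:
    by_age[age].append(proportion)

# Estimated conditional mean and variance of the response given each age.
summary = {}
for age in sorted(by_age):
    values = by_age[age]
    cond_mean = mean(values)
    # The sample variance needs at least two observations in the group.
    cond_var = variance(values) if len(values) > 1 else float("nan")
    summary[age] = (cond_mean, cond_var)
    print(f"Age = {age}: mean = {cond_mean:.3f}, variance = {cond_var:.4f}")
```

Reading the printed means in order of age gives a numerical view of the mean function; the variances do the same for the variance function.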
To identify trends in the mean function, we can plot and connect the conditional means on the
scatterplot. Note this scatterplot also displays trends in the variance function.
Though the above plot gives insight into the conditional distributions, a straight-line model is typically
used instead to describe this trend.
In the next set of notes we will discuss the method of fitting a straight line in detail.