Introduction to Methods for Two Numeric Variables

In this set of notes we will discuss methods used for comparing two ________________ variables. We will introduce the concept of correlation and the basic idea behind simple linear regression.

Example: Lions’ age and proportion of nose blackness

It has been suggested that the amount of black pigmentation on the nose of a male lion increases with age. Whitman et al. (2004) measured the proportion of black on the nose of lions of known ages in Tanzania, East Africa. These proportions were obtained from photos and digitally analyzed, and a subset of the original data is given in the file Lions.jmp on the course website.

Close-up colour photographs were taken of known-aged lions from the Serengeti National Park and Ngorongoro Crater, Tanzania, between 1999 and 2002. Each photograph was first digitized at high resolution into a .tif file, and the fleshy part of the nose (‘nose tip’) from each image was excised using Adobe Photoshop 4.01 LE. Then, the Spatial Analyst extension of ESRI Arcview 3.2 was used to rasterize each cut-out nose tip and assign each newly created ‘grid’ a range of colour values. By limiting the colour values to either ‘black’ or ‘not black’, the nasal pigmentation pattern was ‘mapped’ and quantified for the percentage of readable pixels that contained ‘black’.

[Figures: Excised photo of nose tip; Identification photograph of a 3-yr-old Serengeti male; GIS rendering of nose colouration]

A portion of the data is given below:

Pearson Product-moment Correlation Coefficient

The Pearson product-moment correlation coefficient (denoted by _____) is used to _________________ and ______________ the relationship between two numeric variables. It is appropriate to use when the following are true:

• The relationship between the two variables is a __________________ relationship.
• Both variables are measured on _______________ or _________ scales.
• Both variables are _____________ distributed.
The formula for the Pearson product-moment correlation coefficient is given below.

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\text{s.d.}(x)\,\text{s.d.}(y)} $$

Back to the example: The mean and standard deviation of both Age and Proportion of nose blackness can be calculated in JMP. Choose Analyze > Distribution and put both Age and Proprotion_Nose_Blackness in the Y, Columns box. JMP will return the following output.

Calculations behind the correlation coefficient:

We can also use JMP to directly calculate the Pearson correlation coefficient. Choose Analyze > Multivariate Methods > Multivariate. Then put both Age and Proportion of Nose Blackness in the Y, Columns box and click OK. You should get the following output.

Interpreting the Pearson correlation coefficient:

1. A __________________ correlation indicates a ___________________ association between the two numeric variables and a ___________________ correlation indicates a _________________ association.
2. The correlation coefficient is ALWAYS between _____ and _____ (-1 ≤ r ≤ 1).
   a. Values near _____ indicate that a very __________ relationship exists.
   b. Values close to _____ indicate a very strong _______________ relationship exists.
   c. Values close to _____ indicate a very strong _______________ relationship exists.

Questions:

1. What does the correlation coefficient say about the direction of the relationship between a lion’s age and the proportion of nose blackness?
2. What does the correlation coefficient say about the strength of the relationship between a lion’s age and the proportion of nose blackness?

Caution regarding Pearson’s correlation coefficient

You should never use a correlation coefficient without also looking at the scatterplot of the data. Why? Consider the data in the file Anscombe_Example.jmp on the course website.
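For readers who want to check the arithmetic outside JMP, the following is a minimal Python sketch of the correlation formula given earlier in these notes (sum of cross-products divided by n - 1, then divided by the product of the two sample standard deviations). The function name `pearson_r` is hypothetical, and the data are an assumption: the sketch uses the classic Anscombe (1973) quartet, on the guess that Anscombe_Example.jmp contains those same four X, Y pairs. If the course file differs, substitute its values.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Sample Pearson correlation, following the formula in the notes:
    [sum of (x_i - x_bar)(y_i - y_bar) / (n - 1)] / [s.d.(x) * s.d.(y)]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # stdev() is the sample standard deviation (divides by n - 1)
    return (cross_products / (n - 1)) / (stdev(x) * stdev(y))


# ASSUMPTION: classic Anscombe quartet values, not read from Anscombe_Example.jmp
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {
    "X1, Y1": (x123, y1),
    "X2, Y2": (x123, y2),
    "X3, Y3": (x123, y3),
    "X4, Y4": (x4, y4),
}

for label, (x, y) in quartet.items():
    print(f"{label}: r = {pearson_r(x, y):.3f}")
```

Under that assumption, all four pairs give r of roughly 0.82, even though their scatterplots look very different. The number alone cannot tell you whether the relationship is linear, curved, or driven by a single unusual point, which is why the scatterplot is needed.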
Variables     Pearson Correlation Coefficient (r)     Scatterplot
X1, Y1
X2, Y2
X3, Y3
X4, Y4

To avoid misinterpreting a correlation, always accompany the correlation coefficient with a scatterplot of the data and make sure the assumptions behind the Pearson product-moment correlation are met!

Introduction to Simple Linear Regression

Next, we will explore the basic idea behind regression analysis. A simple linear regression model describes the relationship between a numeric ________________ variable (y) and a single _______________ variable (x).

Response Variable: The ____________ variable, or the variable to be modeled.
Predictor Variable: The ______________ variable used as a predictor of the response.

The concept of conditioning

Once again, consider the data in the file Lions.jmp from the course website. We have already investigated the _____________________ distribution of the Proportion of Nose Blackness (i.e., without regard to age). However, the population of the Proportion of Nose Blackness actually consists of several subpopulations. For example, there is one subpopulation for each level of Age. The distribution of the Proportion of Nose Blackness in each subpopulation is called the _____________________ distribution.

Goal of Regression: To understand how the conditional distribution of the response variable (y) varies across the subpopulations determined by the possible values of the predictor(s).

Regression Notation: y | (x = x )

This represents the response in the subpopulation where the predictor is fixed at _____. For example, the Proportion of Nose Blackness | (Age = 2) refers to only the subpopulation of two-year-old lions. When the particular value of Age is not an issue, we may more generally refer to the distribution of Proportion of Nose Blackness | Age.

Mean and variance functions

Just like any distribution, the conditional distributions of the subpopulations have a mean and a variance.
Moreover, since the researchers expect the Proportion of Nose Blackness to change with Age, we expect the means and variances of the conditional distributions to __________ with Age as well.

Mean Function –

Variance Function –

Calculating the summary statistics for each group

To calculate the summary statistics for each group using JMP, choose Analyze > Distribution, and enter the following:

Click OK and JMP will return the following.

Questions:

3. What is Ê(Proportion of Nose Blackness | Age = 2)?
4. What is Ê(Proportion of Nose Blackness | Age = 3)?
5. What is Ê(Proportion of Nose Blackness | Age = 4)?
6. What is Ê(Proportion of Nose Blackness | Age = 7)?
7. What is Vâr(Proportion of Nose Blackness | Age = 3)?
8. What is Vâr(Proportion of Nose Blackness | Age = 7)?

To identify trends in the mean function, we can plot and connect the conditional means on the scatterplot. Note that this scatterplot also displays trends in the variance function.

Though the above plot gives insight into the conditional distributions, a straight-line model is typically used instead to display this trend. In the next set of notes we will discuss the method of fitting a straight line in detail.
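The group-by-group summaries used in this section (an estimated conditional mean and variance of the response at each value of the predictor) can also be sketched in a few lines of Python. This is only an illustration of the idea, not the JMP procedure. The function name `conditional_summaries` is hypothetical, and the small data set below is invented for illustration; it is NOT taken from Lions.jmp.

```python
from collections import defaultdict
from statistics import mean, variance


def conditional_summaries(x, y):
    """Group the responses y by the value of the predictor x and return
    {x value: (sample mean of y, sample variance of y)} for each group.
    The variance is None when a group has only one observation."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    summaries = {}
    for level in sorted(groups):
        values = groups[level]
        # variance() is the sample variance (divides by n - 1)
        var = variance(values) if len(values) > 1 else None
        summaries[level] = (mean(values), var)
    return summaries


# TOY DATA, invented for illustration only (not the Lions.jmp values)
ages = [2, 2, 2, 3, 3, 4, 4, 4]
proportions = [0.10, 0.14, 0.12, 0.20, 0.30, 0.40, 0.50, 0.60]

summaries = conditional_summaries(ages, proportions)
for age, (m, v) in summaries.items():
    print(f"Age = {age}: estimated mean = {m:.3f}, estimated variance = {v}")
```

Plotting the estimated means against the predictor values and connecting them gives the kind of mean-function display described above; how the estimated variances change from group to group hints at the variance function.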