Psy 1191 Intro to Research Methods
Dr. Launius
Psy 1191 Statistical Methods Workshop: Correlation
From Wadsworth Publishing: http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/res_methd/

Introduction:

At some point in your career as a data analyst you may be asked to describe the association between variables or to predict one variable from another. Statistical techniques called correlation and linear regression allow us to do this.

The purpose of Pearson's Correlation Coefficient is to indicate a linear relationship between two measurement variables. This means that if you have two sets of scores, you want to know: Does one score predict another? For example:

Do your combined SAT scores predict your college GPA? If not, why bother to take the SATs?
Does stress predict how well you will do on an exam or other cognitive task? This might be good to know for people who have stressful jobs.
Does a baby's birth weight predict how many colds it will have in infancy? Doctors and parents might want to know this.

In all these cases, you want to know: if one score is high, is the other also high? If one is low, is the other also low? That's why you take the SATs - they are supposed to predict college performance. Is that always the case? Not really - we'll show you how to evaluate such relationships!

Looking at the pattern of numbers, how would you describe the relationship between SAT (X) and GPA (Y)?

What happens when you have a lot more information? It is much harder to see a clear pattern when you have many data points. So we need a way to summarize the information and describe the relationship. Here's how we do this.

You want to make a simple graph that shows if there is a pattern to the two sets of scores.

First, you arrange your data in columns - like this:

You pick one column for X (for the X axis). X is usually the score used to predict (SAT).
You pick one column for Y. Y is usually the score you want to predict (GPA).

Draw a graph (really - we use computers now!):
Draw an X and Y set of axes.
Plot the points. Each SAT and GPA pair makes an (X,Y) pair.

If SAT and GPA are related, we would expect:

Note: People above the mean of the GPA distribution are usually above the mean of the SAT distribution, and vice versa. The scattergram indicates a correlation. It looks like the major axis of the ellipse (the line) would be a good one to use to predict GPA from SAT.

Let's say we made a scattergram of Height versus GPA. There's no relation there that we know of. So for any Height, people can have good or bad GPAs. For any Height, our best guess would be the mean of the GPAs, hence the horizontal line. I'd guess 2.0 for your GPA - whether you are tall or short.

So in one case (GPA and SAT) we have a positive correlation, and in the other example (GPA and Height) we have zero correlation. What about the values? We haven't calculated them yet - I just ballparked them.

Let's look at GPA and amount of drinking (we survey students and measure the number of drinks per day vs. GPA - these are made-up data). The more you drink, the worse your GPA. This is a negative correlation.

Big Idea: If you get a tight, elongated ellipse like this scattergram, you have a good predictive relationship and a strong correlation!

Calculating the Pearson's Product Moment Correlation Coefficient (r)

1. Think about Z scores.

How do you know if you are doing well in one distribution as compared to another distribution? If you have a high Z score on the SATs and it predicts your GPA, you should also have a high Z score on GPA. You would be at the top of each distribution. If your GPA stinks, I would expect your SATs to be not so hot.
You would be below the mean on both and have negative Z scores.

2. The Formula

The correlation coefficient is calculated based on the following formula that uses Z scores:

r = Σ(Zx × Zy) / N

This means:
Calculate everybody's Z score.
Multiply each Zx by its corresponding Zy.
Add up the results (the products) for everyone.
Divide by the number of people or observations (N).

So if we multiply each person's Z scores together, sum all these products, and get a positive sum, we have a positive correlation. If you have a negative relationship, you will mostly be summing positive Zxs times negative Zys, and vice versa. Add these up for a negative sum. If you have no correlation, then you get roughly equal amounts of positive Zx times negative Zy, positive Zx times positive Zy, negative Zx times negative Zy, and negative Zx times positive Zy. The positive and negative products cancel, so when you add these up you get a sum near zero.

For each of the scattergrams, determine whether the Pearson r will be positive, negative or zero. Explain why.

The Values and Limits of the Pearson's Correlation Coefficient:

Pearson's correlation coefficient (or r) can range from -1 to +1. No value outside this range is possible.
A value of zero (0.0) indicates that the variables are not linearly related; they may be unrelated or may have a more complex, nonlinear relationship.
Values close to -1 or +1 indicate strong predictive relationships.
The sign indicates the direction of the relationship (or its slope). Negative correlations come from relationships with a negative slope. Positive correlations represent a positive slope.

Significance:

You need to test the value you get for significance to see whether it could be due to chance. It is possible to get what looks like a line by chance. For example, there is no relationship between Height and GPA. But in your sample, you choose by sheer luck smart short kids and dumb tall kids.
Look below, the significance test should tell you if this is happening.

Nuance: Why is the maximum r = 1?

Consider a correlation coefficient calculated between your height in inches and your height in feet. This is an incredibly stupid thing to correlate (see our graph below). Obviously the relationship will be perfect. All the points are on the line. There is no spread to the scattergram. Thus your Zx will equal your Zy. That's because you stand in the same relative location in the height distribution no matter whether it is measured in feet or inches. Such a relationship gives you an r = 1.0 in the simple derivation below.

Calculate r²:

Squaring the correlation coefficient results in what is called the coefficient of determination, or proportion of explained variance. For example, if r = 0.6, the proportion of explained variance = 0.36. If r = -0.7, r² = 0.49. Note that these can be multiplied by 100 to produce the percent explained variance (36% and 49% respectively).

What does this mean? It is assumed that someone's score (the one we want to predict) is made up of an explained component and an unexplained component. Thus the total variance, which summarizes how everyone varies from the mean, is made up of the predicted deviations from the mean and the unpredicted deviations. r² equals: [Explained Variance / Total Variance].

Pearson's Correlation Coefficient:
Tells you if there is a linear relationship between two variables.
Tells you how good the relationship is by seeing if r is close to 1 or -1, or if r² is close to 1.0.
Tells you if the relationship is positive or negative by whether r is positive or negative.
Can be used to calculate a straight line so you can predict one score from another.

Every time you take a standardized test - someone is doing this to you.
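The derivation referred to in the height example (inches versus feet) is not reproduced in this copy. A sketch of how it might go, assuming Z scores are computed with the population standard deviation (dividing by N, which matches the "divide by the number of people" step of the formula); the symbols \bar{X} and \sigma_X for the mean and standard deviation of X are notation introduced here for illustration:

```latex
\begin{align*}
r &= \frac{\sum Z_X Z_Y}{N}
   = \frac{\sum Z_X^2}{N}
   && \text{since } Z_Y = Z_X \text{ for every person} \\
  &= \frac{1}{N} \sum \frac{(X - \bar{X})^2}{\sigma_X^2}
   = \frac{1}{\sigma_X^2} \cdot \frac{\sum (X - \bar{X})^2}{N}
   = \frac{\sigma_X^2}{\sigma_X^2}
   = 1
\end{align*}
```

By the same steps, if every Zy were the exact opposite of its Zx (Zy = -Zx), the result would be r = -1.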
The "r" is out there!
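The Z score recipe from the workshop (calculate everybody's Z score, multiply each Zx by its Zy, add up, divide by N) can be sketched in Python. This is a minimal illustration under stated assumptions, not the workshop's own code: the function names `z_scores` and `pearson_r` and all data values are made up here, and the Z scores use the population standard deviation (`statistics.pstdev`) so that dividing by N keeps r between -1 and +1.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to Z scores using the population SD (divide by N)."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


def pearson_r(xs, ys):
    """Pearson r as the average product of paired Z scores."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    zx = z_scores(xs)
    zy = z_scores(ys)
    # Multiply each Zx by its Zy, add up the products, divide by N.
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical, made-up scores for illustration only.
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.4, 2.9, 2.7, 3.4, 3.6]

r = pearson_r(sat, gpa)
r_squared = r ** 2  # proportion of explained variance
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")

# Height in inches vs. the same height in feet: every Zx equals its Zy.
inches = [60, 64, 66, 70, 75]
feet = [h / 12 for h in inches]
print(f"r (inches vs. feet) = {pearson_r(inches, feet):.3f}")
```

The sketch only describes the sample; it does not test significance, which still has to be checked separately as the workshop notes.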