Statistics – Spring 2008
Lab #3 – Correlation
Defined: The measure of the strength and direction of the linear relationship between two variables.
Variables: IV is continuous, DV is continuous.
Relationship: Relationship amongst variables.
Example: Relationship between height and weight.
Assumptions: Normality. Linearity.
1. Graphing - Scatterplot
 The first step of any statistical analysis is to plot the data graphically. For correlation, graphical plots are called scatterplots. Scatterplots can show you visually the strength of the relationship between the variables, the direction of the relationship between the variables, and whether outliers exist.
 Correlation is the measure of the strength and direction of the relationship between the variables.
a. Correlations can vary between -1 and 1.
b. Direction of the relationship can be either positive or negative. A positive relationship is indicated by a
positive value (e.g., ranging from 0 to 1). A negative relationship is indicated by a negative value (e.g.,
ranging from 0 to -1). An example of a positive relationship is the relationship between height and
weight. The higher the outcome on one variable, the higher the outcome on the other variable. An
example of a negative relationship is the relationship between exercise and weight. The higher the
outcome on one variable, the lower the outcome on the other variable.
c. Strength of the relationship is measured from 0 to 1/-1. The farther the value is from 0, the
stronger the relationship. The approximate criteria for strength are 0 for no effect, .1 for a small effect,
.3 for a medium effect, and .5 for a large effect. Notice those values can be either positive or negative,
depending upon the direction of the relationship, so a .2 and a -.2 relationship indicate the same strength,
but different directions.
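FYI – If you want to check these numbers outside of SPSS, a Pearson correlation is easy to compute in Python with SciPy. This is just a minimal sketch; the height and weight numbers below are made up for illustration.

    # Minimal sketch: Pearson correlation in Python (illustrative data)
    from scipy.stats import pearsonr

    height = [61, 64, 66, 68, 70, 72, 75]           # hypothetical heights (inches)
    weight = [120, 135, 140, 155, 160, 175, 190]    # hypothetical weights (pounds)

    r, p = pearsonr(height, weight)   # r = strength/direction, p = significance
    print(f"r = {r:.2f}, p = {p:.3f}")   # r near +1 indicates a strong positive relationship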
 What does a scatterplot look like? Below are 9 scatterplots that show three examples of a positive relationship
in the top row (perfect, strong, weak), three examples of a negative relationship in the middle row (perfect,
strong, weak), and three examples of no relationship in the bottom row.
(graph taken from wikipedia)



How do I graph a scatterplot?
1. Select Graphs --> Legacy Dialogs --> Scatter
2. Click “Simple”, and “Define”
3. Move appropriate variables into the “Y axis” and “X axis” boxes
4. Click OK.
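FYI – The same scatterplot can be drawn outside of SPSS. The sketch below uses Python with pandas and matplotlib, and assumes the data have been exported from SPSS to a CSV file (the file name “lab3.csv” is hypothetical).

    # Minimal sketch: scatterplot of commit1 vs. commit3 (assumed CSV export)
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("lab3.csv")           # hypothetical export of the data file
    plt.scatter(df["commit1"], df["commit3"])
    plt.xlabel("commit1")
    plt.ylabel("commit3")
    plt.show()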
Output below is for two questions: “commit1” and “commit3”. Notice that there is a positive relationship.
From this scatterplot, I would anticipate that the correlation between the two variables is positive and medium
in size.
What is the purpose of graphing the scatterplot? The purpose of graphing the scatterplot is to look at the
relationship between the variables and determine if there are any problems/issues with the data or if the
scatterplot indicates anything unique or interesting about the data, such as:
a. How is the data dispersed? For example, in the scatterplot above, it appears all the scores are grouped
in the top right quadrant. What does this imply about the questions and/or data in your study? It
appears that subjects answered both commit1 and commit3 on the higher part of the scale. In other
words, most subjects feel that most people brought to trial did in fact commit the crime (commit1) and
that most people convicted by juries did in fact commit the crime (commit3). Thus, when discussing
these variables in your paper, just talking about the size and direction of the correlation does not tell
the whole story. If you want to also discuss descriptive analysis of the data, you could talk about how
the data are distributed at the high end of the scale. In other words, just presenting the correlational
analysis (e.g., r = .35, p < .001) may mislead the reader about an interesting distribution of the data.
b. Are there outliers? A scatterplot is useful for “eyeballing” the presence of outliers. Just as a histogram
is useful for “eyeballing” univariate outliers, the scatterplot is useful for “eyeballing” bivariate outliers.
In a later section I describe how to statistically analyze whether or not bivariate outliers exist.
c. Is the relationship linear?
What is linearity? Linearity is a straight-line relationship between variables.
Why is linearity important? Correlation and regression tests rest upon the assumption of linearity because
they only capture linear relationships. Not all relationships are linear. Just as not all variables are normally
distributed in the real world, not all relationships are linear. For example, there could be a non-linear
relationship for USA immigrants between length of residence in the USA and depression. It is a U-shaped
relationship: depression levels start high during the first few years of initial resettlement, then decrease for a
while as immigrants adjust to the new environment, and then increase again later in life. Another example of a
non-linear relationship is mortality and water consumption: absence of water increases mortality, middle levels
of water decrease mortality, but too much water increases mortality.
Correlation and regression only capture linear relationships. For example, all correlations below have the same
size and direction, r = .81. BUT only the top-left graph is appropriate for correlational analysis because the
other three graphs depict data that can NOT be captured by the formulas for correlation and regression.
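FYI – You can demonstrate this point with a quick Python sketch: a perfect U-shaped relationship produces a Pearson r of approximately zero, even though the two variables are perfectly related.

    # Minimal sketch: Pearson r misses a perfect U-shaped relationship
    import numpy as np
    from scipy.stats import pearsonr

    x = np.linspace(0, 10, 101)
    y = (x - 5) ** 2             # deterministic U-shaped relationship
    r, p = pearsonr(x, y)
    print(f"r = {r:.3f}")        # approximately 0 despite a perfect relationship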
(graph taken from wikipedia)
2. Bivariate (and Multivariate) Outliers
 How do I identify bivariate and multivariate outliers? The procedure for identifying bivariate outliers is the
same as for identifying multivariate outliers. The procedure is called Mahalanobis Distances, and it calculates
the distance of particular scores from the center cluster of remaining cases. The procedure creates a new
column at the end of the data file containing a calculated score for each subject. The newly calculated score is
based upon the specific variables entered into the analysis. Thus, you could calculate many different
Mahalanobis Distances where you enter different sets of variables into the analysis. For each analysis, a
separate score for each subject is created in a new column at the end of the data file. The Mahalanobis
Distances score for each subject is considered an outlier if it exceeds a “critical value”.
a. The critical value is determined by a table at the back of most textbooks. You can also find the table at
this webpage - http://www.ento.vt.edu/~sharov/PopEcol/tables/chisq.html
b. The table involves the “Rejection Regions” for a Chi-Square test. Remember back to the first day of
class when we talked about probability distributions and “Rejection Regions”. The Rejection Regions
for the chi-square test are based upon two factors: the probability level you set and the degrees of
freedom. We will talk later in-depth about these concepts, but for right now what you need to know is
that:
c. the degrees of freedom for this test equal the number of variables under investigation. Thus, if you are
analyzing a bivariate relationship, then degrees of freedom = 2. If you are analyzing 3 variables, then
degrees of freedom = 3, and so forth.
d. the probability level we set for this test is p < .001.
e. so, if you look at the table, you find the degrees of freedom, then scan to the right until you get to the
column associated with 0.001. That is your critical value. For example, the critical value for a bivariate
relationship is 13.82.
f. Any Mahalanobis Distances score above that critical value is an outlier.
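FYI – If you do not have the table handy, the same critical values can be reproduced in Python with SciPy. This is just a sketch; ppf is the inverse of the chi-square cumulative distribution, so we ask for the 1 − .001 quantile.

    # Minimal sketch: chi-square critical values at p = .001
    from scipy.stats import chi2

    print(chi2.ppf(1 - 0.001, df=2))   # 13.816 -> critical value for 2 variables
    print(chi2.ppf(1 - 0.001, df=3))   # 16.266 -> critical value for 3 variables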
Here is how to calculate Mahalanobis Distances scores:
1. Select Analyze --> Regression --> Linear
2. Move all the variables under investigation into the “Independent(s)” box, and move the subject number
variable (numb) into the “Dependent” box. For example, if you are interested in the bivariate outlier analysis
for “commit1” and “commit3”, you move both those variables into the “Independent(s)” box, and move
“numb” into the “Dependent” box.
3. Click “Save”, and click “Mahalanobis”
4. Click OK.
The newly created variable is saved as “MAH_1”.
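FYI – The sketch below shows the same Mahalanobis Distances computation in Python, for readers who want to verify the SPSS output. It assumes the data have been exported to a CSV file (“lab3.csv” is a hypothetical name) with columns “commit1” and “commit3”. Note that SPSS saves the squared distance, which is what gets compared to the chi-square critical value.

    # Minimal sketch: squared Mahalanobis distances for two variables
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2

    df = pd.read_csv("lab3.csv")                        # hypothetical export
    X = df[["commit1", "commit3"]].dropna().to_numpy(dtype=float)

    center = X.mean(axis=0)                             # centroid of the cluster
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance matrix
    diff = X - center
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distance per case

    critical = chi2.ppf(1 - 0.001, df=2)                # 13.816 for two variables
    print("number of outliers:", (d2 > critical).sum())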
Output below is for “commit1” and “commit3”. In “Residuals Statistics” box, look for “Mahal. Distance”.
Look at the “Maximum” score. If that number exceeds your critical value, then an outlier exists. In this case,
with 2 variables, the critical value is 13.82. The “Maximum” is listed as 35.083. Thus, we have at least one
outlier.

You identify the outlier(s) by sorting the data by this new variable “MAH_1” and then scrolling to the bottom of
the list to find the highest scores. You can sort by: Data --> Sort Cases. In this case, we find 11
cases that have scores above 13.82.
Notice, however, that multivariate outlier analysis is just as arbitrary as univariate outlier analysis. The
threshold level is arbitrarily determined, just as the 1.5*IQR and 3*IQR threshold levels for univariate outliers
are arbitrarily determined. Plus, the “eyeball” method of the scatterplot does show some
differences when compared to the statistical method of using Mahalanobis Distances scores. For example, if
you look at the scatterplot for our two variables (see above), can you identify which 11 subjects are the ones
deemed outliers by the Mahalanobis Distances analysis?
3. Correlation
 A correlation is easy to conduct:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the “Variable(s)” window.
3. Click OK.
 Output below is for two questions, “commit1” and “commit3”. The “Correlations” box tells you three pieces
of information: N = sample size, Pearson Correlation = size and direction of the relationship, and Sig. =
significance level. In essence, the “Pearson Correlation” tells you the size and direction of the hypothetical
line that can be drawn through the data, and the “Significance” tells you the probability that the line is due to
chance. More specifically, the “Significance” represents a test of whether the line is different from a flat line
(a flat line would be represented by a Pearson correlation of 0). For the data below, there is a positive and
medium relationship between the variables, and there is a p < .001 probability that the line is due to chance.



Another useful piece of information is R² – the coefficient of determination. This is the amount of variance in
one variable explained by another variable. R² is not provided in the output, but you can calculate it by
squaring the Pearson Correlation. In our example, therefore, .352 × .352 = .124. If you multiply this by 100,
you convert the value into a percentage. Thus, in our example, commit1 explains 12.4% of the variance in
commit3, and vice versa. This also means that 87.6% of the variance is unaccounted for, because
100 − 12.4 = 87.6.
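The arithmetic is simple enough to check by hand or with a few lines of Python:

    # Minimal sketch: converting r to R-squared and percentages
    r = 0.352
    r_squared = r ** 2                                   # 0.1239..., about .124
    print(f"{r_squared * 100:.1f}% explained")           # 12.4%
    print(f"{(1 - r_squared) * 100:.1f}% unexplained")   # 87.6%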
WRITE-UP: The report of a correlational study should include the strength of the relationship and the
significance level. Some researchers also include the descriptive statistics of each variable. Some researchers
also include the R².
a. “There was a positive correlation between the two variables, r = .35, p < .001.”
b. “There was a positive correlation between the belief about what percent of people brought to trial did
in fact commit the crime (M = 78.39%, SD = 16.33) and the belief about what percent of people
convicted by juries did in fact commit the crime (M = 83.22%, SD = 15.54), r = .35, p < .001.”
c. “There was a positive correlation between the two variables, r = .35, p < .001, with R² = .124.”
EVALUATION:
a. You evaluate correlational analysis by looking at the direction of the relationship between the
variables. Is it in the same direction as the research hypothesis?
b. You then look at the significance level. Is the relationship significant? Remember that significance is
related to sample size. In a small sample (n = 30) you may have correlations that do not reach
significance, but the same correlations might reach significance if the sample size were larger (n = 100).
Also, remember that sample size does not typically affect the strength of the relationship, only the
probability that the result was due to chance.
c. You then look at the size of the relationship. Is it strong or weak? Just because the hypothesis is
confirmed in the predicted direction does not indicate whether the relationship between the variables is
strong or important. Strength of the relationship is measured from 0 to 1/-1. The farther the value is
from 0, the stronger the relationship. The approximate criteria for strength are 0 for no effect, .1 for a
small effect, .3 for a medium effect, and .5 for a large effect. Notice those values can be either positive
or negative, depending upon the direction of the relationship, so a .2 and a -.2 relationship indicate the
same strength, but different directions.
d. You can also look at R². In terms of percentage of variance explained, small is 1%, medium is 9%, and
large is 25%.
4. Correlation - Multiple
 When you conduct correlations, you typically enter MANY variables simultaneously into the analysis, and the
output provides all possible bivariate relationships. For example:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the “Variable(s)” window.
3. Click OK.
 Output below is for the “forensic” items and “innocence” items. Notice the diagonal is always “1” because
each variable correlates perfectly with itself. Also notice that the sample size is different for each
bivariate relationship because the default in correlation is “pairwise” deletion. Also, notice that the matrix
mirrors itself along the diagonal, so the information is depicted twice for each bivariate combination.
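FYI – A full correlation matrix can also be produced outside of SPSS. The Python sketch below uses pandas, whose corr() method likewise drops missing values pairwise; the CSV file name and column names are hypothetical.

    # Minimal sketch: correlation matrix with pairwise deletion
    import pandas as pd

    df = pd.read_csv("lab3.csv")                      # hypothetical export
    items = ["forensic1", "forensic2", "forensic3"]   # hypothetical column names
    print(df[items].corr())   # diagonal is 1; matrix mirrors itself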


Just as you can have correlational output of multiple variables simultaneously, you can have scatterplots of
multiple variables simultaneously. The only limitation is that with more than 3 variables the scatterplots get so
small as to be relatively useless. You conduct multiple scatterplots simultaneously by:
1. Select Graphs --> Legacy Dialogs --> Scatter
2. Click “Matrix”, and “Define”
3. Move appropriate variables into the “Matrix variables” box
4. Click “Options” and “exclude cases variable by variable”
5. Click OK.
Output below is for the first three “forensic items”.
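FYI – The equivalent scatterplot matrix in Python (a sketch; the file and column names are hypothetical):

    # Minimal sketch: scatterplot matrix with pandas
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    df = pd.read_csv("lab3.csv")   # hypothetical export
    scatter_matrix(df[["forensic1", "forensic2", "forensic3"]])
    plt.show()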
5. Correlation - Partial
 Partial correlation is the relationship between two variables while controlling for a third variable. The purpose
is to find the unique variance between two variables while eliminating the variance from a third variable. The
diagram below from your textbook, page 135, graphically represents the purpose of partial correlation.



You typically only conduct partial correlation when the third variable has shown a relationship to one or both
of the primary variables. In other words, you typically first conduct correlational analysis on all variables so
that you can see whether there are significant relationships amongst the variables, including any “third
variables” that may have a significant relationship to the variables under investigation. In addition to this
statistical pre-requisite, you also want some theoretical reason why the third variable would be impacting the
results.
How to conduct partial correlation:
1. Select Analyze --> Correlate --> Partial
2. Move variables into the “Variable(s)” window.
3. Move the variable you want to control for into the “Controlling” box
4. Click “Options” and click “Zero Order correlations” and click “Exclude cases pairwise”
(by clicking “zero order correlations”, the output will show both the relationships amongst the variables
while controlling for the third variable, and ALSO the relationships amongst the variables without
controlling for the third variable. This is useful so that you can easily see the difference between
controlling for the variable and not controlling for the variable.)
5. Click OK.
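FYI – Partial correlation can also be computed by hand: regress each primary variable on the control variable, then correlate the residuals. The Python sketch below does exactly that (the file name is hypothetical; the p-value from this shortcut is approximate because it does not adjust the degrees of freedom for the control variable).

    # Minimal sketch: partial correlation via residuals
    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    df = pd.read_csv("lab3.csv").dropna(subset=["commit1", "commit3", "prosecutor1"])

    def residuals(y, control):
        slope, intercept = np.polyfit(control, y, deg=1)   # simple regression
        return y - (slope * control + intercept)

    z = df["prosecutor1"].to_numpy(dtype=float)
    res1 = residuals(df["commit1"].to_numpy(dtype=float), z)
    res3 = residuals(df["commit3"].to_numpy(dtype=float), z)

    r, p = pearsonr(res1, res3)   # the partial correlation
    print(f"partial r = {r:.3f}")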
Output below is for the relationship between “commit1” and “commit3” while controlling for “prosecutor1”.



I included “prosecutor1” as the controlling variable because: (1) statistically, it shows a significant relationship
to both commit1 and commit3. You can see that significant relationship in the top part of the “Correlations”
box below, which presents the correlations without controlling for a third variable. (2) Theoretically, it is
possible that the reason why there is a positive correlation between commit1 and commit3 is because
prosecutor1 asks “whom do you trust more, defense attorneys or prosecutors”, so it is possible that the reason
subjects believe defendants brought to trial and convicted at trial are guilty (commit1 and commit3) is because
they trust the prosecutor over the defense attorney.
Thus, given this plausible (statistical and theoretical) third-variable relationship, it is interesting to note that
controlling for “prosecutor1” did not lower the strength of the relationship between commit1 and commit3 by
much: the outcome while controlling for prosecutor1 was r = .341, p < .001. In other words, the
relationship between commit1 and commit3 is NOT due to subjects trusting the prosecutor.
You can conduct Partial Correlation with more than just one third variable. You can include as many third
variables as you wish.
6. Correlation – Point-biserial Correlation
 Point-biserial Correlations are conducted when one of the variables is dichotomous, which means it is a
categorical variable with only two categories, such as gender: male, female.
 FYI – The Point-biserial Correlation is analogous to a “t-test”, which we will cover in later weeks. A “t-test”
is conducted when you are interested in the relationship between a categorical independent variable (such as
gender: male, female) and a continuous dependent variable (such as belief in the death penalty on 1-7 scale).
 You conduct a Point-biserial Correlation the same way that you conduct a regular correlation:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the “Variable(s)” window.
3. Click OK.
 The output, Write-up, and interpretation are the same as for a regular correlation.
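FYI – Outside of SPSS, SciPy provides a direct point-biserial function, which is mathematically equivalent to running a Pearson correlation with a dichotomous variable. The data below are made up for illustration.

    # Minimal sketch: point-biserial correlation (illustrative data)
    from scipy.stats import pointbiserialr

    gender = [0, 0, 0, 1, 1, 1, 1]                  # dichotomous variable
    belief = [2.0, 3.5, 3.0, 5.0, 4.5, 6.0, 5.5]    # continuous 1-7 scale
    r, p = pointbiserialr(gender, belief)
    print(f"r = {r:.2f}, p = {p:.3f}")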
 FYI - If you want to analyze the tests from your classes using the Point-biserial Correlation, you would need
to first create a new dichotomous variable (e.g., 1=answered correctly, 2=answered incorrectly).
See “Lab2 – Descriptives” for “6. Transforming categorical variables into other categorical variables”.