Chapters 10 and 11 Terminology:

- Measurement or Quantitative Variables: variables that involve measurement or counting, e.g. number of pages in a book (count), number of students in class (count), height (measured), weight (measured), exam score (measured).
- Deterministic Relationship: one variable is directly determined by another. For example, if we know your height in inches we can convert it exactly to centimeters by multiplying by 2.54.
- Statistical Relationship: a relationship in which natural variability exists. Consider the first exam scores and the mean quiz scores. Although there is a relationship between performance on these two variables, students with the same mean quiz average did not achieve the same exam score. For instance, 4 students had a quiz average of 92.22 but had the following exam scores: 73.3, 86.7, 86.7, and 96.7.
- Linear Relationship and Regression: if a statistical relationship exists in which one measurement variable changes in a linear manner with a change in a second measurement variable, then we can employ what are called regression methods in statistics to explore and explain this linear relationship. For instance, if we can show that a linear relationship exists between mean quiz scores and exam scores, then we can use regression methods to explain and predict one variable based on the other.
- Response, Outcome, or Dependent variable versus Explanatory, Independent, or Predictor variable: in regression the response (outcome, dependent) variable is the variable we are interested in predicting or explaining using a second variable called the explanatory (independent, predictor) variable. E.g. we want to predict exam scores (response/outcome) based on mean quiz scores (predictor/explanatory), or explain the variation in weight (response/outcome) using height (predictor/explanatory).

Determining Linear Relationship

- Scatterplot: a plot of the points formed by pairing each observation of the response and predictor variables, with the response on the Y (vertical) axis and the predictor on the X (horizontal) axis.
- For the scatterplot below, what can you see?
  i. Does the plot indicate any relationship?
  ii. If yes, is it linear?
  iii. If linear, what direction?

[Figure: Scatterplot of Exam 1 vs Quizzes Average, with Exam 1 on the vertical axis and Quizzes Average on the horizontal axis.]

- From this plot you can see that there appears to be a positive linear relationship. That is, as Quiz Average increases so does Exam score. A negative relationship occurs when the response decreases as the predictor variable increases (e.g. consider driving speed and travel time: the faster you drive, i.e. as speed increases, travel time decreases).
- Measuring this relationship: the scatterplot is helpful because it provides a picture of what is happening, but it leaves room for varying interpretations. To quantify a linear relationship between two measurement variables we use a statistical measure called correlation, symbolized by r (see the short sketch after this list).
- Correlation: the measure of the strength and direction of a linear relationship between two measurement variables. A perfect positive linear relationship (i.e. the points of the scatterplot fall exactly on a line with an increasing pattern) has a correlation of 1. Conversely, a perfect negative linear relationship has a correlation of negative one. No linear relationship at all has a correlation of 0. Therefore the range of possible correlation values is: -1 ≤ r ≤ 1.
- Keep in mind that the sign of the correlation has nothing to do with the strength of the linear relationship, only the direction. That is, -1 and 1 indicate the same strength of linear relationship, just in opposite directions. Thus a correlation of -0.9 would indicate a stronger linear relationship than a correlation of 0.2.
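To make the scatterplot and the correlation r concrete, here is a minimal Python sketch. The quiz and exam numbers in it are invented for illustration (they are not the course data set), and the variable names quiz_avg and exam1 are only placeholders.

```python
# A minimal sketch of building a scatterplot and computing the correlation r.
# The data values below are made up for illustration only.
import numpy as np
import matplotlib.pyplot as plt

quiz_avg = np.array([55, 62, 70, 75, 80, 85, 90, 92, 95, 100])   # predictor (X)
exam1    = np.array([48, 60, 66, 72, 75, 80, 88, 87, 90, 96])    # response (Y)

# Scatterplot: predictor on the horizontal axis, response on the vertical axis
plt.scatter(quiz_avg, exam1)
plt.xlabel("Quizzes Average")
plt.ylabel("Exam 1")
plt.title("Scatterplot of Exam 1 vs Quizzes Average")
plt.show()

# Correlation r: np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(quiz_avg, exam1)[0, 1]
print(f"r = {r:.3f}")   # always falls between -1 and 1
```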
Statistical Significance: The scatterplot and correlation can lead to varying interpretations; i.e. people may have differing opinions when interpreting the plot and/or deciding whether the correlation value indicates a strong or weak linear relationship. In an effort to remove these differences, researchers use the concept of statistical significance. This concept, applied to the same set of data, should direct researchers to the same conclusion about the data. Once a correlation is found, we compare this value to zero to see whether the difference between the correlation and zero is a statistical one, or whether the difference occurs just due to chance.

- P-value: a p-value or "probability" value is calculated for a correlation. The methods for calculating this p-value are beyond this course, but the interpretation of a p-value is fairly straightforward. The p-value for a correlation is the probability that the data would produce that correlation value if in reality the correlation were zero. For example, the correlation between exam score and quiz average was 0.856 (strong and positive) and had a p-value of 0.0001, which is very small. What this means is: if the correlation between exam score and quiz average were really zero, then the probability that our sample data would produce a correlation of 0.856 or higher is 0.0001. Very unlikely!
- Interpreting this p-value: again we have another number, the p-value, so how is one to judge whether it is "small enough"? To make such a conclusion we compare the p-value to a standard called the "level of significance": if the p-value is below, i.e. less than, this standard, the relationship is deemed statistically significant. That is, the correlation is statistically different from zero. A value of 0.05, or 5%, is commonly used for this level of significance. In our example, with the p-value of 0.0001 being less than 0.05, we would claim that the correlation of 0.856 is significantly different from zero.
- What is going on here? To summarize, when we have two measurement variables and we want to see whether one is linearly related to the other, and therefore whether one variable can be used to predict the other, we now have some statistical tools for the analysis. Since initially we do not know whether a linear relationship exists, we start by assuming the correlation between the two variables is zero until we can show otherwise. Next we calculate the correlation from our sample data; but in truth, would you really expect the sample correlation to be exactly zero even if there is no linear relationship? Probably not. Since we don't expect the correlation to be exactly zero, how can we tell whether this difference is simply due to sampling (because we have sample data, we already expect the correlation not to be exactly zero) or whether the correlation is statistically different from zero? To make this determination we compare the p-value to 0.05: if the p-value is smaller, we say the difference is a statistical one and thus we have a statistically significant result. If the p-value is greater than 0.05, we decide that the difference from zero is due to chance and therefore cannot say that the two variables are statistically linearly related. A short sketch of this check appears below.
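Here is a minimal sketch, using the same invented data as above, of how one might obtain both r and its p-value with scipy.stats.pearsonr and compare the p-value to the 0.05 level of significance; the printed messages are only illustrative wording.

```python
# A minimal sketch of checking the statistical significance of a correlation.
# pearsonr returns both the correlation r and its p-value.
import numpy as np
from scipy.stats import pearsonr

quiz_avg = np.array([55, 62, 70, 75, 80, 85, 90, 92, 95, 100])
exam1    = np.array([48, 60, 66, 72, 75, 80, 88, 87, 90, 96])

r, p_value = pearsonr(quiz_avg, exam1)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

# Compare the p-value to the 0.05 level of significance
if p_value < 0.05:
    print("Statistically significant: the correlation differs from zero.")
else:
    print("Not significant: the difference from zero could be due to chance.")
```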
Regression: Once a linear relationship between two measurement variables has been determined, we can describe this relationship with the equation of a line.
  o Recall, possibly!, from algebra the following: y = b + mx, where b is the y-intercept (where the line crosses the y-axis) and m is the slope of the line (rise over run).
  o For example, since one inch is equal to 2.54 centimeters, if I gathered everyone's height in inches and then converted these to centimeters, the resulting line would be: cm = 0 + 2.54(inches). Here the slope means that for every increase of one inch, the height in centimeters increases by 2.54 cm.

- Regression Line: in statistics, our data very rarely fall in a deterministic relationship. Instead of falling on a straight line, the data are "scattered" about. However, if we can demonstrate that a significant linear relationship exists (via the scatterplot and correlation), then we can fit a line to the data. How the y-intercept and slope are calculated is beyond this course; you will simply be given these values, i.e. the line equation. However, you are expected to know how to interpret the line and make predictions. From the entire data set of our exam 1 and average quiz scores, the correlation between these two variables was 0.856 with a p-value of 0.0001, indicating a strong, positive linear relationship. The regression equation is:
  o Exam01 = 33.1 + 0.6 QuizAverage
- Interpretations of this line: the y-intercept means that for a quiz average of 0, the line would intersect the y-axis at 33.1% for exam 1. A quiz average of 0 is possible if the student did not take any of the quizzes, or scored a 0 on each. By the way, this could also serve as a red flag for a student's performance: imagine a student who did not take any of the quizzes but scored very well on the exam (say 70% or better). This could result in the student being questioned. As to the interpretation of the slope of 0.60: for an increase of 1 percentage point in quiz average, the expected or predicted exam score would increase by 0.60 percentage points, or roughly half a percentage point.
- Predicting a response value: with the regression line we can predict an expected or mean response for a specific value of the explanatory variable by plugging that value of "X" into the equation. For instance, from our equation, for a quiz average of 100% we would predict an exam 1 score of: 33.1 + 0.60*100 = 93.1%.
- Extrapolation: one important concept to keep in mind is that the regression equation is based on the sample data, specifically the range of those values. Using the equation for values of the predictor variable outside the range of the sample values is called extrapolation, and extrapolation can lead to erroneous results. Consider, for example, a study done to predict the response (outcome) baby weight based on the predictor variable age in months, from age 0 (birth weight) to 22 months. Say such data produced a regression equation of: Weight = 6 + 0.9 Age. Now what if I were to plug in my age in months (552)? My predicted weight would be about 503 pounds! (A small sketch of prediction and an extrapolation check follows.)
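Here is a minimal sketch of using the regression equation above to predict exam scores and to flag extrapolation. The assumed quiz-average range of 0 to 100 and the function name predict_exam are illustrative assumptions, not part of the course materials.

```python
# A minimal sketch of prediction from the fitted line Exam01 = 33.1 + 0.6 * QuizAverage,
# with a warning when the input falls outside the assumed sample range (extrapolation).
INTERCEPT = 33.1
SLOPE = 0.6
QUIZ_MIN, QUIZ_MAX = 0.0, 100.0   # assumed range of the sample predictor values

def predict_exam(quiz_average: float) -> float:
    """Predicted (mean) Exam 1 score for a given quiz average."""
    if not (QUIZ_MIN <= quiz_average <= QUIZ_MAX):
        # Outside the sample range: this is extrapolation and may be unreliable
        print(f"Warning: {quiz_average} is outside [{QUIZ_MIN}, {QUIZ_MAX}] (extrapolation).")
    return INTERCEPT + SLOPE * quiz_average

print(predict_exam(100))   # 33.1 + 0.6*100 = 93.1, as in the notes
print(predict_exam(150))   # triggers the extrapolation warning
```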
Notes about Correlation

- Correlation is unit free. E.g. if we had the correlation between weight in pounds and height in inches and converted the heights to centimeters, the correlation would be unchanged.
- Correlation can only range from negative one to positive one, and the sign only indicates the direction of the relationship; the magnitude indicates the strength. For example, a correlation of negative 0.9 would indicate a stronger linear relationship than a correlation of positive 0.8.

Outliers: There are generally two types of outliers: observations that "stray" from the general distribution of the data, and observations that influence the relationship between the two variables. The latter are called "influential outliers". From the graphic below, both Points A and B would be outliers, but Point B would be influential. Why? Because if you remove Point B, the correlation between X and Y would be greatly affected, as would the regression of Y on X; for Point A, however, the removal would simply improve both. For Point B, with the point included the regression line has a positive slope (i.e. a positive correlation) and would go nearly straight through B; if B is removed, the slope becomes negative. In general, influential outliers are observations that fall outside the range of the remaining X-observations (i.e. along the horizontal axis). A small sketch of this effect follows.
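Here is a minimal sketch of what "influential" means in practice: refitting the line with and without a single extreme-X point (in the spirit of Point B) can flip the sign of both the slope and r. The data are invented to mimic the described picture, not taken from any real figure.

```python
# A minimal sketch of an influential outlier: one point far outside the X-range
# of the rest of the data changes both the correlation and the fitted slope.
import numpy as np

# Main cloud of points with a mildly negative trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([6.0, 5.5, 5.8, 5.0, 4.8, 4.5])

# One point far outside the range of the other X values (like Point B)
x_b = np.append(x, 20.0)
y_b = np.append(y, 15.0)

for label, xs, ys in [("without Point B", x, y), ("with Point B", x_b, y_b)]:
    slope, intercept = np.polyfit(xs, ys, 1)      # least-squares line
    r = np.corrcoef(xs, ys)[0, 1]
    print(f"{label}: slope = {slope:.2f}, r = {r:.2f}")
# With the extreme point included, the slope and r are positive;
# without it, both are negative.
```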