** STA 1020 - Part 2 (24/Oct/13) ** MATERIAL FOR EXAM #2

STA 1020, Fall 2013, Section 09, MWF 10:40-11:35, 0035 State
Instructor: Dr. J.L. Menaldi
Textbook - Statistics: Concepts and Controversies, by David S. Moore and William I. Notz, 2013, W.H. Freeman & Company [8th ed]
Class Link: http://www.math.wayne.edu/~menaldi/teach/13f1020.htm

Exam 2 of 3: Organizing Data
Quizzes every chapter and then Second Partial Exam

Contents
Chapter 10 - Graphs, Good and Bad
Chapter 11 - Displaying Distributions with Graphs
Chapter 12 - Describing Distributions with Numbers
Chapter 13 - Normal Distributions
Chapter 14 - Describing Relationships: Scatterplots and Correlations
Chapter 15 - Relationships: Regression, Predictions and Causation
Chapter 16 - The Consumer Price Index and Government Statistics (skipped!)

“Statistics” is the science of collecting, describing and interpreting data...
It is said that “Probability” is the vehicle of Statistics, i.e., if it were not for the laws of probability, the theory of statistics would not be possible.

Part 2: Figures won’t lie, but liars will figure . . . , beware!

Chapter 10 - Graphs, Good and Bad

Thought Questions. . .
What is confusing or misleading about the following graph?
Our eyes react to the area of the pictures!

Data Tables . . .
The table summarizes data.

Table 10.1 Education of people 25 years and over, 2006

Level of education           Number of persons (thousands)   Percent
Less than high school                 27,896                   14.5
High school graduate                  60,898                   31.7
Some college, no degree               32,611                   17.0
Associate’s degree                    16,760                    8.7
Bachelor’s degree                     35,153                   18.3
Advanced degree                       18,567                    9.7
Total                                191,884                  100.0

Source: Census Bureau, Educational Attainment in the United States: 2006

Ex1: How well educated are adults? Attention to details! Labels clear and everywhere. Do not forget the source.
Ex2: Roundoff errors. . . 27896 + · · · + 18567 = 191885, one more than the printed total of 191,884, because each entry is rounded to the nearest thousand.
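The roundoff remark in Ex2 can be checked directly. The following is only a sketch: it assumes the high school graduate count is 60,898 (the value consistent with the printed 31.7% and with the quoted sum 191,885), and the variable names are illustrative.

```python
# Table 10.1 counts, in thousands of persons.
# Assumption: high school graduates = 60,898, the value that agrees with
# the printed 31.7% and with the sum 191,885 quoted in Ex2.
counts = {
    "Less than high school": 27_896,
    "High school graduate": 60_898,
    "Some college, no degree": 32_611,
    "Associate's degree": 16_760,
    "Bachelor's degree": 35_153,
    "Advanced degree": 18_567,
}
reported_total = 191_884  # total printed in the table

# Add the rows ourselves and recompute each percent from the printed total.
computed_total = sum(counts.values())
percents = {level: round(100 * n / reported_total, 1) for level, n in counts.items()}
percent_sum = round(sum(percents.values()), 1)

for level, n in counts.items():
    print(f"{level:<26}{n:>8,}{percents[level]:>7.1f}")
print(f"{'Sum of the rows':<26}{computed_total:>8,}{percent_sum:>7.1f}")
print("Roundoff gap in the counts:", computed_total - reported_total)
```

Both the counts and the rounded percents miss their printed totals slightly, which is the roundoff effect the example points at.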
Pie charts
Pie charts show how a whole is divided into parts. Wedges within the circle represent the parts, with the angle spanned by each wedge in proportion to the size of that part. E.g., 18.3% of those in this age group have a bachelor’s degree but not an advanced degree, so that wedge spans 0.183 × 360 ≈ 66 degrees.
Pie charts can compare quantities that are parts of a whole.
Figure: Pie chart of the distribution of level of education among persons aged 25 years and over.

Bar graphs
Figure: Bar graph of the distribution of level of education among persons aged 25 years and over.
The distribution of a variable tells us what values it takes and how often it takes these values.
Bar graphs compare quantities, not necessarily parts of a whole.

Ex3 High taxes?
Figure 10.4 Percentage of gross wage earnings paid in income tax and employee Social Security contributions in eight countries in 2006, for Example 3. These percentages are for single individuals without children at the income level of the average worker. (Data from the Organization for Economic Cooperation and Development)

Ex4 Beware of pictograms
Recall: Our eyes react to the area of the pictures! To magnify a picture, the artist must increase both height and width to avoid distortion. This creates a misleading graph, because doubling both height and width makes the area four times as large.
Figure 10.5 A pictogram, for Example 4.

Another misleading graph
This variation of a bar graph is attractive but misleading.

A categorical variable places an individual into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

Changes over time
A line graph of a variable plots each observation against the time at which it was measured.
Always, time goes on the horizontal scale! Connect the data points by lines to display the change over time.
Figure 10.6 A line graph of the average cost of regular unleaded gasoline each week from January 3, 2000, to January 21, 2008, for Example 5. (Bureau of Labor Statistics)

Line graphs
Changes over time:
* Look for an overall pattern (trend)
* Look for patterns that repeat at known regular intervals (seasonal variations)
* Look for any striking deviations that might indicate unusual occurrences

A pattern that repeats itself at known regular intervals of time is called seasonal variation.
Many series of regular measurements over time are seasonally adjusted, i.e., the expected seasonal variation is removed before the data are published.

Scales
Figure 10.7 The effect of changing the scales in a line graph, for Example 6. Both graphs plot the same data, but the right-hand graph makes the increase appear much more rapid.

Ex7 Getting rich A
Figure 10.8 Percentage increase or decrease in the Standard & Poor’s 500 index of common stock prices, 1971 to 2003, for Example 7.

Ex7 Getting rich B
Figure 10.9 Value at the end of each year, 1970 to 2003, of $1000 invested in the Standard & Poor’s 500 index at the end of 1970, for Example 7.

Ex8 Rise in college Education
Figure 10.10 Chart junk: this graph is so cluttered with unnecessary ink that it is hard to see the data.

Ex9 High Taxes?
Changing the order of the bars has improved the graph in Figure 10.4.
Figure 10.11 Percentage of gross wage earnings paid in income tax and employee Social Security contributions in eight countries in 2006, for Example 9.

Making Good Graphs
* Title your graph
* Make sure labels and legends describe variables and their measurement units. Be careful with the scales used
* Make the data stand out. Avoid distracting grids, artwork, etc.
* Pay attention to what the eye sees. Avoid pictograms and tacky effects

Categorical and Quantitative Variables, Distributions, Pie Charts, Bar Graphs, Line Graphs, Techniques for Making Good Graphs.
Recall that descriptive statistics consists of procedures used to summarize and describe the important characteristics of a set of measurements.
Now it’s your turn. Read Case Study Evaluated.

Exercise Ch10
10.10 College freshmen. A survey of college freshmen in 2001 asked what field they planned to study. The results: 12.6%, arts and humanities; 16.6%, business; 10.1%, education; 18.6%, engineering and science; 12.0%, professional; and 10.3%, social science.
(a) What percentage of college freshmen plan to study fields other than those listed?
(b) Make a graph that compares the percentages of college freshmen planning to study various fields.

Exercise (answer) Ch10
**Answers
(a) The given percents add to 80.2%, so 19.8% were in other fields.
(b) Either a bar chart or a pie chart would be appropriate; both are shown below.
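The arithmetic behind answer (a), and the wedge angles a pie chart for answer (b) would need, can be sketched as follows. The dictionary and variable names are illustrative; the percentages are those given in Exercise 10.10, and the angle rule (share × 360 degrees) is the one stated for pie charts earlier in the chapter.

```python
# Percentages from Exercise 10.10 (college freshmen, 2001).
fields = {
    "arts and humanities": 12.6,
    "business": 16.6,
    "education": 10.1,
    "engineering and science": 18.6,
    "professional": 12.0,
    "social science": 10.3,
}

# (a) Whatever is not listed must belong to "other" fields.
listed = round(sum(fields.values()), 1)
other = round(100 - listed, 1)
print(f"listed fields: {listed}%, other fields: {other}%")

# (b) For a pie chart, each wedge angle is the share of the whole times 360 degrees.
shares = dict(fields, other=other)
raw_angles = {name: pct * 360 / 100 for name, pct in shares.items()}
for name, angle in raw_angles.items():
    print(f"{name:<24}{shares[name]:>6.1f}%{angle:>8.1f} degrees")

# The bachelor's degree wedge from the pie chart of Table 10.1.
bachelor_angle = 0.183 * 360
print(f"bachelor's degree wedge: about {bachelor_angle:.0f} degrees")
```

A pie chart is only legitimate here once the "other" category is included, since the listed fields alone do not make up the whole.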
Multiple choice Ch10
A company database contains the following information about each employee: age, date hired, sex (male or female), ethnic group (Asian, black, Hispanic, etc.), job category (clerical, management, technical, etc.), and yearly salary. Which of the following lists of variables are all categorical?
(a) age, sex, ethnic group.
(b) sex, ethnic group, job category.
(c) ethnic group, job category, yearly salary.
(d) yearly salary, age, date hired.
Answer: (b)

Here is a table of the undergraduate enrollment at a large state university, broken down by class:

Class         Count of students   Percent of students
Freshman            8,248               26.8%
Sophomore           8,073               26.2%
Junior              7,001               22.8%
Senior              6,904               22.4%
Non-degree            535                1.7%
Total              30,761              100%

To make a correct graph of the distribution of students by class, you could use
(a) a bar graph.
(b) a pie chart.
(c) a line graph.
(d) (a) or (b), but not (c).
Answer: (d)

Chapter 11 - Displaying Distributions with Graphs

Data Tables
Table 11.1 Percentage of residents aged 65 and over in the 50 states, 2006

State          Percent   State            Percent   State            Percent
Alabama         13.4     Louisiana         12.2     Ohio              13.4
Alaska           6.8     Maine             14.6     Oklahoma          13.2
Arizona         12.8     Maryland          11.6     Oregon            12.9
Arkansas        13.9     Massachusetts     13.3     Pennsylvania      15.2
California      10.8     Michigan          12.5     Rhode Island      13.9
Colorado        10.0     Minnesota         12.1     South Carolina    12.8
Connecticut     13.4     Mississippi       12.4     South Dakota      14.2
Delaware        13.4     Missouri          13.3     Tennessee         12.7
Florida         16.8     Montana           13.8     Texas              9.9
Georgia          9.8     Nebraska          13.3     Utah               8.8
Hawaii          14.0     Nevada            11.1     Vermont           13.3
Idaho           11.5     New Hampshire     12.4     Virginia          11.6
Illinois        11.9     New Jersey        12.9     Washington        11.5
Indiana         12.4     New Mexico        12.4     West Virginia     15.3
Iowa            14.6     New York          13.1     Wisconsin         13.0
Kansas          12.9     North Carolina    12.2     Wyoming           12.2
Kentucky        12.4     North Dakota      14.6

Source: 2008 Statistical Abstract of the United States

Ex1 Making histograms
1. Divide the range of the data into classes of equal width. Be sure to specify the classes precisely so that each individual falls into exactly one class, i.e., classes are exclusive.
2. Count the number of individuals in each class.

Class          Count   Class           Count   Class           Count
6.0 to 6.9       1     10.0 to 10.9      2     14.0 to 14.9      5
7.0 to 7.9       0     11.0 to 11.9      6     15.0 to 15.9      2
8.0 to 8.9       1     12.0 to 12.9     16     16.0 to 16.9      1
9.0 to 9.9       2     13.0 to 13.9     14

3. Draw the histogram. Mark on the horizontal axis the scale for the variable whose distribution you are displaying (e.g., “percentage of residents aged 65 and over”). The vertical axis contains the scale of counts (each bar represents a class).

Histogram
Be sure that the classes for a histogram have equal widths.
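Steps 1 and 2 of Example 1 can be sketched in a few lines. This is only an illustration: the list holds the 50 percentages of Table 11.1 in table order, the classes are the one-percent-wide classes 6.0 to 6.9, ..., 16.0 to 16.9 used above, and the names are hypothetical.

```python
import math
from collections import Counter

# Table 11.1: percent of residents aged 65 and over, 50 states, in table order.
percent_65_plus = [
    13.4, 6.8, 12.8, 13.9, 10.8, 10.0, 13.4, 13.4, 16.8, 9.8,
    14.0, 11.5, 11.9, 12.4, 14.6, 12.9, 12.4, 12.2, 14.6, 11.6,
    13.3, 12.5, 12.1, 12.4, 13.3, 13.8, 13.3, 11.1, 12.4, 12.9,
    12.4, 13.1, 12.2, 14.6, 13.4, 13.2, 12.9, 15.2, 13.9, 12.8,
    14.2, 12.7, 9.9, 8.8, 13.3, 11.6, 11.5, 15.3, 13.0, 12.2,
]

# Step 1: classes of equal width 1.0; the class of x starts at floor(x),
# so every observation falls into exactly one class.
class_of = Counter(math.floor(x) for x in percent_65_plus)

# Step 2: count the individuals in each class, keeping empty classes.
lowest, highest = min(class_of), max(class_of)
counts = [class_of.get(start, 0) for start in range(lowest, highest + 1)]

# Step 3 (text version): one bar per class, bar length = count.
for start, count in zip(range(lowest, highest + 1), counts):
    print(f"{start:>2}.0 to {start:>2}.9 | {'#' * count} {count}")
```

The empty class 7.0 to 7.9 is kept on purpose, so the gap that isolates the low outlier stays visible in the display.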
There is no single right choice for the number of classes or the class width.
Figure 11.1 Histogram of the percentages of residents aged 65 and older in the 50 states, for Example 1. Note the outlier.

Interpreting histograms
* In any graph of data, look for an overall pattern and also for striking deviations from that pattern
* An outlier in any graph of data is an individual observation that falls outside the overall pattern of the graph
* Ex2: shape (the distribution has a single peak? roughly symmetric?), center (the midpoint of the distribution is close to the peak?), spread (how?), outliers (how many?)

Overall pattern of a distribution: center, spread and shape
* A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other
* A distribution is skewed to the right (or left) if the right (or left) side of the histogram (containing the half of the observations with larger (or smaller) values) extends much farther out than the left (or right) side

Ex3 Tuition & Fees
Figure 11.2 Histogram of the tuition and fees charged by 121 Illinois colleges and universities in the 2004-2005 academic year, for Example 3. Overall description: Roughly symmetric and skewed to the right.

Ex4 Sampling again
Figure 11.3 Histogram of the sample proportion p̂ for 1000 simple random samples from the same population, for Example 4. This is a symmetric distribution.

Ex5 Shakespeare’s words
Figure 11.4 The distribution of word lengths used by Shakespeare in his plays, for Example 5. This distribution is skewed to the right.

Stemplot
Histograms are not the only graphical display of distributions. For small data sets, a stemplot is quicker to make and presents more detailed information.

Stem-and-Leaf Plots (for quantitative variables)
1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
Stemplots look like histograms turned on end.

Ex6 “65 and over”
From Table 11.1, the whole-number part of the observation is the stem, and the final digit (tenths) is the leaf, i.e., the Alabama entry, 13.4, has stem 13 and leaf 4. Remember to sort the leaves at the very end.
Figure 11.6 Making a stemplot of the data in Table 11.1. Whole percents form the stems, and tenths of a percent form the leaves.

Tuition & Fees
Figure 11.7 Stemplot of the Illinois tuition and fee data.
Data can be found at http://www.collegeillinois.com/en/collegefunding/costs.htm

Example Weight Data 1 to 5 (figures not reproduced)
Choose the stems and the leaves. After sorting the leaves. Choose the classes. This is how you do it.
** Now it’s your turn. Read Case Study Evaluated **

Exercise Ch11
11.4 Where do the young live? Figure 11.10 is a stemplot of the percentage of residents aged under 18 in each of the 50 states in 2006. As in Figure 11.6 (page 227) for older residents, the stems are whole percents and the leaves are tenths of a percent.
(a) Utah has the largest percentage of young adults. What is the percentage for this state?
(b) Ignoring Utah, describe the shape, center, and spread of this distribution.
(c) Is the distribution for young adults more or less spread out than the distribution in Figure 11.6 for older adults?
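The three stemplot steps can be sketched for the Table 11.1 data (the "65 and over" percentages of Example 6). This is a minimal illustration, not the textbook's figure: the list repeats the 50 table values, and the names are hypothetical.

```python
# Table 11.1: percent of residents aged 65 and over, 50 states, in table order.
percent_65_plus = [
    13.4, 6.8, 12.8, 13.9, 10.8, 10.0, 13.4, 13.4, 16.8, 9.8,
    14.0, 11.5, 11.9, 12.4, 14.6, 12.9, 12.4, 12.2, 14.6, 11.6,
    13.3, 12.5, 12.1, 12.4, 13.3, 13.8, 13.3, 11.1, 12.4, 12.9,
    12.4, 13.1, 12.2, 14.6, 13.4, 13.2, 12.9, 15.2, 13.9, 12.8,
    14.2, 12.7, 9.9, 8.8, 13.3, 11.6, 11.5, 15.3, 13.0, 12.2,
]

# Step 1: split each observation into stem (whole percent) and leaf (tenths).
# Working in whole tenths avoids floating-point surprises.
pairs = [divmod(round(x * 10), 10) for x in percent_65_plus]

# Step 2: list every stem from smallest to largest, including empty ones.
stems = {stem: [] for stem in range(min(s for s, _ in pairs),
                                    max(s for s, _ in pairs) + 1)}
for stem, leaf in pairs:
    stems[stem].append(leaf)

# Step 3: write the leaves next to their stem, sorted at the very end.
rows = {stem: "".join(str(leaf) for leaf in sorted(leaves))
        for stem, leaves in stems.items()}
for stem, leaves in rows.items():
    print(f"{stem:>2} | {leaves}")
```

Turned on its side, the output has the same profile as the histogram of Example 1, while still showing every individual value.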
Distributions (figure panels), from left to right and from top to bottom: Symmetric Distributions, Bell-Shaped; Symmetric Distributions, Uniform; Asymmetric Distributions, Skewed to the Left; and Skewed to the Right.

Exercise (answer) Ch11
**Answers
(a) Utah has 31.0% young adults.
(b) Without Utah, the distribution is roughly symmetric, centered at about 24.2%, spread from 21.2% to 27%.
(c) The distribution of young adults is less spread out than the distribution of older adults.
Figure 11.6 (older adults) and Figure 11.10 (young adults)

Multiple choice Ch11
To make a correct graph of the distribution of students by class, you could use
(a) a bar graph.
(b) a pie chart.
(c) a line graph.
(d) (a) or (b), but not (c).
Answer: (d)

A well-drawn histogram should have
(a) bars all the same size.
(b) no space between bars (unless a class has no observations).
(c) a clearly marked vertical scale.
(d) all of these.
(e) (a) and (c), but not necessarily (b).
Answer: (d)

You want to make a graph to display the distribution of salaries of the 1,500 professors at a very large university. The best choice is:
(a) a histogram.
(b) a line graph.
(c) a pie chart.
(d) a stemplot
Answer: (a)

Chapter 12 - Describing Distributions with Numbers

Describing . . . center & spread
Number of home runs hit by Barry Bonds in his first 22 seasons

Season  1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
Runs      16   25   24   19   33   25   34   46   37   33   42
Season  1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
Runs      40   37   34   49   73   46   45   45    5   26   28

A graph and a few words give a good description of Barry Bonds’s home run career. We need numbers that summarize the center and the spread of the distribution.

Median
Data Set: Number of home runs hit by Barry Bonds in his first 22 seasons
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 5 26 28

Median M
1. Arrange all observations in order, from the smallest to the largest:
   5 16 19 24 25 25 26 28 33 33 34 34 37 37 40 42 45 45 46 46 49 73
2. If the number of observations n is odd, the median is the center observation in the ordered list.
3. If the number of observations n (= 22) is even, the median is the average of the two center observations in the ordered list: (34 + 34)/2 = 34.
The location of the median is the (n + 1)/2 (= 11.5) “position”.

Quartiles Q1 and Q3
Arrange all observations in increasing order and locate the median M in the ordered list of observations:
   5 16 19 24 25 25 26 28 33 33 34 | 34 37 37 40 42 45 45 46 46 49 73
Here M = 34 falls between the two halves, Q1 = 25 and Q3 = 45.
The first (or third) quartile Q1 (or Q3) is the median of the observations whose position in the ordered list is to the left (or right) of the location of the overall median, i.e., Q1 (or Q3) is a number such that at most 25% (or 75%) of the data are smaller in value and at most 75% (or 25%) are larger.
If n is odd, e.g., suppose only 21 seasons (no season 1986 = 16 runs):
   5 19 24 25 25 26 28 33 33 34 [34] 37 37 40 42 45 45 46 46 49 73
so M = 34, Q1 = (25 + 26)/2 = 25.5 and Q3 = 45.

Another case
Data Set: Number of home runs hit by Hank Aaron in his first 23 seasons
13 27 26 44 30 39 40 34 45 44 24 32 44 39 29 44 38 47 34 40 20 12 10
Arrange all observations in increasing order and locate the median M in the ordered list of observations:
   10 12 13 20 24 26 27 29 30 32 34 [34] 38 39 39 40 40 44 44 44 44 45 47
For n = 23 we have (n + 1)/2 = 12, so M = 34. To the left (or to the right) of the median there are 11 numbers, so (11 + 1)/2 = 6, i.e., Q1 = 26 and Q3 = 44.
In the ordered list, the position of the median is (n + 1)/2, and the position of the first (third) quartile is about (n + 1)/4 from the first (last) observation (if the position is not an integer, take the average of the two adjacent places; this could be a weighted average).

Summary Numbers
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is
   min  Q1  M  Q3  max,
i.e., for Bonds 5 25 34 45 73 and for Aaron 10 26 34 44 47.

Boxplot
A boxplot is a graph of the five-number summary. A central box spans the quartiles. A line in the box marks the median.
Lines extend from the box out to the smallest and largest observations.
Figure 12.2 Boxplots comparing the yearly home run production of Barry Bonds (5 25 34 45 73) and Hank Aaron (10 26 34 44 47).

Ex3 Income inequality
The Census Bureau Web site provides information on income distribution by race.
Figure 12.3 Boxplots comparing the distributions of income among Hispanics, blacks, and whites. The ends of each plot are at 0 and at the 95% points in the distribution.
Now it’s your turn: 12.2 Babe Ruth
* Check Statistical Controversies: “Income Inequality”

Mean and standard deviation
The mean x̄ (pronounced “x-bar”) of a set of observations is their average. To find the mean of n observations, add the values and divide by n, i.e., (sum of observations)/n.
The standard deviation s measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. To find the standard deviation of n observations:
1. Find the distance of each observation from the mean and square each of these distances.
2. Average the distances by dividing their sum by n − 1. This average squared distance is called the variance.
3. The standard deviation s is the square root of this average squared distance.

In Formulae. . .
If x1, . . . , xn are the observed numerical values then

Mean
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Variance
$$ s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

Standard Deviation
$$ s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$

Ex4 Finding x̄ and s
For the (Data Set) number of home runs hit by Barry Bonds in his first 22 seasons
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 5 26 28
we have n = 22 and
$$ \bar{x} = \frac{16 + 25 + \cdots + 28}{22} = \frac{762}{22} = 34.6, $$
$$ s^2 = \frac{(16 - 34.6)^2 + (25 - 34.6)^2 + \cdots + (28 - 34.6)^2}{22 - 1} = \frac{4139.12}{21} = 197.1, $$
and finally
$$ s = \sqrt{s^2} = \sqrt{197.1} = 14.04. $$

Ex4 Finding x̄ and s (cont)
Figure 12.6 Barry Bonds’s home run counts, for Example 4, with their mean and the distance of one observation from the mean indicated. Think of the standard deviation as an average of these distances.
* The standard deviation s measures spread about the mean x̄. Use s to describe the spread of a distribution only when you use x̄ to describe the center.
* s = 0 only when there is no spread. This happens only when all observations have the same value. So standard deviation zero means no spread at all. Otherwise s > 0. As the observations become more spread out about their mean, s gets larger.
Now it’s your turn! Hank Aaron’s home run x̄ and s

Ex5 Investing 101
Investors should think statistically (or not?). You can assess an investment by thinking about the distribution of (say) yearly return.
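Returning to the home run data, the summary numbers of this chapter (five-number summary, mean, variance, standard deviation) can be reproduced with Python's standard `statistics` module. This is a sketch under two assumptions: the quartiles follow the slides' rule (median of the observations on each side of the overall median's position), and Aaron's list is the 23-season data set as it appears in the slides. The function name `five_number_summary` is illustrative.

```python
from statistics import mean, median, stdev, variance

# Home runs per season, as listed in the slides.
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]
aaron = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
         44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]


def five_number_summary(data):
    """Return (min, Q1, M, Q3, max).

    Q1 and Q3 are the medians of the observations to the left and to the
    right of the overall median's position; for odd n the median itself
    belongs to neither half.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


print("Bonds:", five_number_summary(bonds))   # min 5, Q1 25, M 34, Q3 45, max 73
print("Aaron:", five_number_summary(aaron))   # min 10, Q1 26, M 34, Q3 44, max 47

# Example 4: statistics.variance and statistics.stdev divide by n - 1.
x_bar = mean(bonds)
s_squared = variance(bonds)
s = stdev(bonds)
print(f"n = {len(bonds)}, mean = {x_bar:.1f}, variance = {s_squared:.1f}, s = {s:.2f}")
```

Other software may use a different quartile convention and report slightly different Q1 and Q3 for the same data, which is why the rule is spelled out in the function.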
Risk (or variability): Treasury bills are riskier than treasury bonds. Stocks are even riskier (and you know why, right?)
Figure 12.7 Stemplot of the yearly returns on common stocks for the 50 years 1950 to 1999, for Example 5. The returns are rounded to the nearest whole percent. The stems are 10s of percents and the leaves are single percents.

Ex6 Mean vs Median
Figure 12.8 Stemplot of the salaries (in millions of dollars) of Los Angeles Lakers players, with median M = 2.7 and mean x̄ = 5.5.
The distribution is skewed to the right and there are three outliers. If we drop the outliers, the mean for the other 10 players is only x̄ = 2.5 and the median decreases to M = 2.2. For instance, moving the highest salary from 19.5 to 195 would not change the median, but the mean would increase considerably.
TRY Textbook Online / Quizzes / Statistical Applets (http://bcs.whfreeman.com/scc7e)

Choosing a summary
* The mean and standard deviation are strongly affected by outliers or by the long tail of a skewed distribution
* The median and quartiles are less affected
* If the distribution is exactly symmetric then the mean x̄ and the median M are exactly equal
* The five-number summary (and its graph, a boxplot) is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers
* Use x̄ and s only for reasonably symmetric distributions that are free of outliers
The variance is the square of the standard deviation s.
** Read Case Study Evaluated **

Exercise Ch12
12.30 Mean x̄ and standard deviation s are not enough. The mean x̄ and standard deviation s measure center and spread but are not a complete description of a distribution. Data sets with different shapes can have the same mean and standard deviation. To demonstrate this fact, use your calculator to find x̄ and s for these two small data sets. Then make a stemplot of each and comment on the shape of each distribution.
Data A: 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74
Data B: 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.50

Exercise (answer) Ch12
**Answers
Both sets of data have the same mean and standard deviation (x̄ = 7.50 and s = 2.03). However, the two distributions are quite different: Set A is left-skewed, while set B is roughly uniform with a high outlier.

   Data A          Data B
 3 | 1           3 |
 4 | 7           4 |
 5 |             5 | 257
 6 | 1           6 | 58
 7 | 2           7 | 079
 8 | 1177        8 | 48
 9 | 112         9 |
10 |            10 |
11 |            11 |
12 |            12 | 5

Check Textbook Portal: Statistical Applets . . . One Variable Statistical Calculator

Multiple choice Ch12
Here are boxplots of the number of calories in 20 brands of beef hot dogs, 17 brands of meat hot dogs, and 17 brands of poultry hot dogs.

1. The main advantage of boxplots over stemplots and histograms is:
(a) boxplots make it easy to compare several distributions, as in this example.
(b) boxplots show more detail about the shape of the distribution.
(c) boxplots use the five-number summary, whereas stemplots and histograms use the mean and standard deviation.
(d) boxplots show skewed distributions, whereas stemplots and histograms show only symmetric distributions.
Answer: (a)

2. This plot shows that:
(a) all poultry hot dogs have fewer calories than the median for beef and meat hot dogs.
(b) about half of poultry hot dog brands have fewer calories than the median for beef and meat hot dogs.
(c) hot dog type is not helpful in predicting calories, because some hot dogs of each type are high and some of each type are low.
(d) most poultry hot dog brands have fewer calories than most beef and meat hot dogs, but a few poultry hot dogs have more calories than the median beef and meat hot dog.
Answer: (d)

Chapter 13 - Normal Distributions

Thought Questions. . .
* Birth weights of babies born in the United States follow, at least approximately, a bell-shaped curve. What does that mean?
* What does it mean if a person’s SAT score falls at the 20th percentile for all people who took the test?
* Many measurements in nature tend to follow a similar pattern. The pattern is that most of the individual measurements take on values that are near the average, with fewer and fewer measurements taking on values that are farther from the average in either direction. Describe what shape the distribution of such measurements would have.

Frequency Histogram
How to compare these two graphs? [relative frequency!]
Figure 3.1 Draw 1000 SRSs of size 100 from the same population.
Figure 3.2 Draw 1000 SRSs of size 2527 from the same population as in Figure 3.1.

Histogram and. . .
Figure 13.1 A histogram and a computer-drawn curve. Both picture the distribution of the number of engineering doctorates earned by members of minority groups at 152 universities. This distribution is skewed to the right.

. . . Computer-drawn curves
Figure 13.2 A histogram and a computer-drawn curve. Both picture the distribution of the sample proportion in 1000 simple random samples from the same population. This distribution is quite symmetric. Almost a normal curve!

Analyzing Data
* Always plot your data: make a graph, usually a histogram or a stemplot
* Look for the overall pattern (shape, center, spread) and for striking deviations such as outliers
* Choose either the five-number summary or the mean and standard deviation to briefly describe center and spread in numbers
* Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve

Density Curves versus (relative frequency) Histograms
Bell-Shaped Curve? Asymmetric Distributions? Normal Distribution?

Center and Spread
Figure 13.5 A perfectly symmetric Normal curve (distribution of sample proportions ...)

Ex1 Density Curves
Figure 13.4 A histogram and a Normal Density Curve, for Example 1. (a) The area of the shaded bars in the histogram represents observations greater than 0.51. These make up 171 of the 1000 observations. (b) The shaded area under the Normal curve represents the proportion of observations greater than 0.51.
This area is 0.1667.
Figure 13.6 The mean of a density curve is the point at which it would balance.

Median and Mean
• The median of a density curve is the equal-areas point, the point that divides the area under the curve in half.
• The mean of a density curve is the balance point, or center of gravity, at which the curve would balance if made of solid material.
• The median and mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.
• Both median and mean are measures of the central tendency of distributions.
• Distributions (density curves): the five-number summary helps to understand the shape, while the standard deviation measures spread.
• Recall that histograms are made from a table of a frequency distribution or a relative frequency distribution.

Normal Distributions
Figure 13.7 Two Normal curves. The standard deviation fixes the spread of a Normal curve.

Normal Density Curves
The normal curves are symmetric, bell-shaped curves that have these properties:
• A specific (theoretical) normal curve is completely described by giving its mean µ (mu) and its standard deviation σ (sigma). The density is given by
  $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
  where exp means “exponential”, i.e., $\exp(x) = e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots$
• The mean determines the center of the distribution. It is located at the center of symmetry of the curve.
• The standard deviation determines the shape of the curve. It is the distance from the mean to the change-of-curvature points on either side.
Usually, x̄ and s are used for the sample mean and standard deviation, while µ and σ denote the population mean and standard deviation.

The 68-95-99.7 rule
If the observations follow (approximately) a normal distribution then, approximately,
• 68% of the observations fall within one standard deviation of the mean
• 95% of the observations fall within two standard deviations of the mean
• 99.7% of the observations fall within three standard deviations of the mean
This is known as the “68-95-99.7 rule” (or the Empirical Rule) for the normal distribution.

The 68-95-99.7 rule (cont)
Figure 13.8 The 68-95-99.7 rule for Normal distributions.

Ex2 The 68-95-99.7 rule
Figure 13.9 The 68-95-99.7 rule applied to the heights of women aged 18 to 24, which are approximately normal with mean 65 inches and standard deviation 2.5 inches.
Now it’s your turn: Heights of young men

Ex3/4 ACT vs SAT scores
* Jennie scored 600 on the SAT Math exam, and Gerald scored 21 on the ACT math part. SAT and ACT scores are approximately normal, with means 500 and 18 and standard deviations 100 and 6, respectively. Who did better?
The standard score for an observation x is z = (x − x̄)/s; it measures the relative standing of a measurement in a data set (check Wikipedia http://en.wikipedia.org/wiki/Standard_score)
* Jennie’s standard score is (600 − 500)/100 = 1.0 and Gerald’s is (21 − 18)/6 = 0.5. The larger standard score means the better result, so Jennie did better.

Ex5 Reverse Search
In any Table B, we can find the percentile c for a given standard score z or, in reverse, we can find the standard score z for a given percentile c.
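As a rough check of the standard-score comparison and of the two directions of a Table B lookup, here is a minimal Python sketch. It assumes Python 3.8 or later for `statistics.NormalDist`; the helper name `standard_score` and the variable names are illustrative and not part of the course material. The computed normal curve stands in for Table B, so results can differ slightly from rounded table entries.

```python
from statistics import NormalDist


def standard_score(x, mean, sd):
    """Standard score z = (x - mean) / sd (hypothetical helper name)."""
    return (x - mean) / sd


# Standard normal curve (mean 0, standard deviation 1), used in place of Table B.
std_normal = NormalDist()

# Jennie: SAT 600 with mean 500, sd 100. Gerald: ACT 21 with mean 18, sd 6.
jennie_z = standard_score(600, 500, 100)
gerald_z = standard_score(21, 18, 6)

# Forward lookup: percentile c (in percent) for a given standard score z.
jennie_c = 100 * std_normal.cdf(jennie_z)
gerald_c = 100 * std_normal.cdf(gerald_z)

# Reverse lookup: standard score z for a given percentile (here the 90th),
# then converted back to the SAT scale with x = mean + z * sd.
z_90 = std_normal.inv_cdf(0.90)
sat_cutoff = 500 + z_90 * 100

print(f"Jennie: z = {jennie_z:.1f}, percentile = {jennie_c:.2f}")
print(f"Gerald: z = {gerald_z:.1f}, percentile = {gerald_c:.2f}")
print(f"90th percentile: z = {z_90:.2f}, SAT score = {sat_cutoff:.0f}")
```

The unrounded 90th-percentile standard score is about 1.28, so the computed cutoff comes out a little lower than one obtained from a table entry rounded to z = 1.3.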
[Jennie z = 1, c = 84.13 ≈ 68/2 + 50 = 84]
For instance, how high must a student score on the SAT to fall in the top 10% of all scores? That requires a score at or above the 90th percentile, i.e., for c = 0.9032 we get z = 1.3 and for c = 0.8849 we get z = 1.2, so say we take z = 1.3. This yields x = x̄ + z s = 500 + (1.3)(100) = 630.

The c-th percentile of a distribution is a value such that c percent of the observations lie below it and the rest lie above, i.e., $F^{-1}(c)$, where F(z) is the area under the curve up to z. [“symmetry”]
* Jennie 84.13 and Gerald 69.15
• Table B   • Table B w/2 digits   • Another Table B

Another Example
Health and Nutrition Examination Study of 1976-1980 (HANES).
• From Data: Heights of adults, ages 18-24
  women: mean 65.0 in & standard deviation 2.5 in
  men: mean 70.0 in & standard deviation 2.8 in
Empirical Rule (68-95-99.7)
  women: 68% are between 62.5 and 67.5 inches
         95% are between 60.0 and 70.0 inches
         99.7% are between 57.5 and 72.5 inches
  men:   68% are between 67.2 and 72.8 inches
         95% are between 64.4 and 75.6 inches
         99.7% are between 61.6 and 78.4 inches

More questions
What proportion of men are less than 72.8 inches tall?
By the rule, 72.8 inches is one standard deviation above the mean: (100 − 68)/2 + 68 or 50 + 68/2 = 84.
Ans: 84% or 84.13% (Table B)
What proportion of men are less than 68 inches tall?
Observation x = 68, standard score z = (68 − 70.0)/2.8 = −0.71. In Table B, we find c = 24.20 for z = −0.7.
Ans: 24% or 23.89% (2-digit table)

Table B http://www.math.wayne.edu/˜menaldi/teach/others/Sta1020/table-percentile.pdf
Two-digit tables: . . . /p-values-table.pdf and . . . /p-values-table-alt.pdf, or even this table with comments: . . . /p-values-table-triola.pdf

** Now it’s your turn.
Read Case Study Evaluated **

Exercise Ch13
13.6 Random numbers. If you ask a computer to generate “random numbers” between 0 and 1, you will get observations from a uniform distribution. Figure 13.12 shows the density curve for a uniform distribution. This curve takes the constant value 1 between 0 and 1 and is zero outside that range. Use this density curve to answer these questions.
(a) Why is the total area under the curve equal to 1?
(b) The curve is symmetric. What is the value of the mean and median?
(c) What percentage of the observations lie between 0 and 0.1?
(d) What percentage of the observations lie between 0.6 and 0.9?

Exercise (answer) Ch13
**Answers
(a) The curve forms a 1 × 1 square, which has area 1.
(b) The mean and median are both 0.5.
(c) 10% (the region is a rectangle with height 1 and base width 0.1; hence the area is 0.1).
(d) 30% (the region is a rectangle with height 1 and base width 0.9 − 0.6 = 0.3).

Multiple choice Ch13
Suppose that the Blood Alcohol Content (BAC) of students who drink five beers varies from student to student according to a normal distribution with mean 0.07 and standard deviation 0.01.
1 The middle 95% of students who drink five beers have BAC between: (a) 0.06 and 0.08. (b) 0.05 and 0.09. (c) 0.04 and 0.10. (d) 0.03 and 0.11. Answer: (b)
2 What percent of students who drink five beers have BAC above 0.08 (the legal limit for driving in most states)? (a) 0.15%. (b) 2.5%. (c) 5%. (d) 16%. (e) 32%. Answer: (d)
3 What percent of students who drink five beers have BAC above 0.10 (the legal limit for driving in other states)? (a) 0.15%. (b) 2.5%. (c) 5%. (d) 1.5%. (e) 32%.
Answer: (a)

Chapter 14 - Describing Relationships: Scatterplots and Correlation

Bivariate Data
These are the values of two different variables that are obtained from the same population:
• Both variables are qualitative (attribute)
• Both variables are quantitative (numerical)
• One variable is qualitative and the other is quantitative
Two quantitative variables are seen as ordered pairs, sometimes called explanatory (or input, or independent) variable and response (or output, or dependent) variable.

Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.
Always plot the explanatory variable on the horizontal (x) axis of the scatterplot.
Figure 14.2 Scatterplot of recession velocity against distance from the earth.

Example 1: Hubble’s law and the Big Bang
Investigate the relationship between “distance from the earth” and “recession velocity” (moving away from the observer).
Key evidence for the idea of the expanding universe and, rewinding, the “Big Bang” appears!
Ex2 Health and Wealth
Data from the World Bank. The explanatory variable is the GDP per person, the response variable is the life expectancy at birth. Three African nations are outliers.
Figure 14.3 Scatterplot of the life expectancy of people in many nations against each nation’s gross domestic product per person.
The overall pattern does show that people in richer countries live longer, but life expectancy tends to rise very quickly as GDP increases, then levels off.

Examining a scatterplot
In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.
Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together; the scatterplot slopes upward as we move from left to right. Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other; the scatterplot slopes downward as we move from left to right.

Ex3 Classifying fossils
Archaeopteryx fossils: length in centimeters of . . .
Femur     38  56  59  64  74
Humerus   41  63  70  72  84
Figure 14.5 Scatterplot of the lengths of two bones in 5 fossil specimens of the extinct beast Archaeopteryx, for Example 3.
The plot shows a strong, positive, straight-line association.
Actually, six archaeopteryx fossil specimens are known, but the humerus of the last fossil is missing. To continue . . .

Recall x̄ and s
If $x_1, \dots, x_n$ are the observed numerical values then
Mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Variance
$$s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Ex4 Linear correlation
The correlation describes the direction and strength of a straight-line relationship between two quantitative variables. The (coefficient of linear) correlation is usually written as r,
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right),$$
where the roles of the explanatory variable x and response variable y are symmetric and unit independent (i.e., r does not change if x and y are exchanged, or the unit is changed).
* Like the mean and the standard deviation, the correlation is strongly affected by a few outlying observations.

Calculation for the Archaeopteryx fossils:
i =        1   2   3   4   5
Femur     38  56  59  64  74
Humerus   41  63  70  72  84
1 First we calculate the mean and the standard deviation for the explanatory variable x and the response variable y, i.e., $\bar{x} = 58.2$, $s_x = 13.20$, $\bar{y} = 66.0$, $s_y = 15.89$.
2 Next we calculate the standard scores $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$ for each observation, i.e., for i = 1 we have (38 − 58.2)/13.20 = −1.530 and (41 − 66.0)/15.89 = −1.573, and we continue up to the last one, i = 5, to get (74 − 58.2)/13.20 = 1.197 and (84 − 66.0)/15.89 = 1.133.
3 Finally we add all the products, i.e., with n = 5, r = [(−1.530)(−1.573) + · · · + (1.197)(1.133)]/4 = 0.994.

Other Scatterplots
Figure 14.8 Moving one point reduces the correlation from r = 0.994 to r = 0.640.
Figure 14.7 Patterns closer to a straight line have correlations closer to 1 or −1.
Note that r is always a number between −1 and 1.
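The three-step calculation of r for the Archaeopteryx fossils can be sketched in Python as follows. This is only an illustration using the standard library; the function name `correlation` and the variable names are assumptions made for the example. It also checks the two properties stated for r: exchanging x and y, or changing the unit, leaves r unchanged.

```python
from statistics import mean, stdev

# Lengths in centimeters of the two bones in the five complete fossils.
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]


def correlation(xs, ys):
    """Correlation r as the sum of products of standard scores over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


r = correlation(femur, humerus)

# r is symmetric in x and y and does not depend on the unit of measurement.
r_swapped = correlation(humerus, femur)
femur_mm = [10 * x for x in femur]  # same lengths in millimeters
r_mm = correlation(femur_mm, humerus)

print(f"x-bar = {mean(femur):.1f}, s_x = {stdev(femur):.2f}")
print(f"y-bar = {mean(humerus):.1f}, s_y = {stdev(humerus):.2f}")
print(f"r = {r:.3f}")
```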
* Correlation does not describe curved relationships between variables, no matter how strong they are.

Relationships
Statistical versus Deterministic Relationships
Deterministic: (Distance) = (Time) × (Speed)
Statistical: (Income) ≈ a + b × (Assets)

Statistical Significance
• A strong relationship seen in the sample may indicate a strong relationship in the population.
• The sample may exhibit a strong relationship simply by chance, while the relationship in the population is not strong or is zero.
• The observed relationship is considered to be statistically significant if it is stronger than a large proportion of the relationships we could expect to see just by chance.
• “Statistical significance” does not imply the relationship is strong enough to be considered “practically important”.
• Even weak relationships may be labeled statistically significant if the sample size is very large, and even strong relationships may not be labeled statistically significant if the sample size is very small.

Thought Questions. . .
Assume you are doing a study . . . and you find that . . .
• For all cars manufactured in the U.S., there is a positive correlation between the size of the engine and horsepower.
• There is a negative correlation between the size of the engine and gas mileage.
Is this what you expected? What does it mean for two variables to have a positive correlation or a negative correlation? Do you expect a correlation between quality and price? Outliers?

Thought Questions. . .
What type of correlation would the following pairs of variables have: positive, negative, or none?
1 Temperature during the summer and electricity bills
2 Temperature during the winter and heating costs
3 Number of years of education and height
4 Frequency of brushing and number of cavities
5 Number of churches and number of bars in cities in your state
6 Height of husband and height of wife
** Now it’s your turn. Read Case Study Evaluated **

Exercise Ch14
14.8 & 14.10 Calories and salt in hot dogs.
(14.8) Figure 14.11 shows the calories and sodium content in 17 brands of meat hot dogs. Describe the overall pattern of these data. In what way is the point marked A unusual?
(14.10) Is the correlation r for the data in Figure 14.11 near −1, clearly negative but not near −1, near 0, clearly positive but not near 1, or near 1? Explain your answer.
14.12 Outliers and correlation. Figure 14.10 contains outliers marked A, B, and C. In Figure 14.11 the point marked A is an outlier. Removing the outliers will increase the correlation r in one figure and decrease r in the other figure. What happens in each figure, and why?

Exercise (answer) Ch14
**Answers
(14.8) The association is roughly linear and positive (high calories tend to go with high sodium, and low tends to go with low). Point A is a hot dog brand that is well below average in both calories and sodium.
(14.10) This shows a fairly strong positive association, so r should be reasonably close to 1. Note: In fact, r = 0.863. In this case, point A makes the correlation higher, because its presence makes the scatterplot appear more linear. (With point A removed, the correlation drops slightly to 0.834.)

Exercise Ch14 (cont.)
14.12 Outliers and correlation.
Figure 14.10   Figure 14.11
Q: Figure 14.10 contains outliers marked A, B, and C.
In Figure 14.11 the point marked A is an outlier. Removing the outliers will increase the correlation r in one figure and decrease r in the other figure. What happens in each figure, and why?
(14.12) The correlation increases when A, B, and C are removed from Figure 14.10, because their presence makes the plot look less linear. The correlation decreases when A is removed from Figure 14.11, because that plot looks more linear with A. (That is, if we drew a line through that scatterplot, there is less relative scatter about that line with point A than without.)

Multiple choice Ch14
The stock market did well during the 1990s. Here are the percent total returns (change in price plus dividends paid) for the Standard & Poor’s 500 stock index:
Year     1990  1991  1992  1993  1994  1995  1996  1997  1998  1999
Return   -3.1  30.5   7.6  10.1   1.3  37.6  23.0  33.4  28.6  21.0
1 The correlation of U.S. stock returns with overseas stock returns during these years was about r = 0.4. This tells you that: (a) when U.S. stocks rose, overseas stocks also tended to rise, but the connection was not very strong. (b) when U.S. stocks rose, overseas stocks rose by almost exactly the same amount. (c) when U.S. stocks rose, overseas stocks tended to fall, but the connection was not very strong. (d) nothing, because this is not a possible value of r. Answer: (a)
2 Stock returns are measured in percent. What are the units of the mean, the median, the quartiles, the standard deviation, and the correlation between U.S. and overseas returns? (a) all are measured in percent. (b) all are measured in percent except the standard deviation, which is measured in squared percent. (c) all are measured in percent except the correlation, which is a number that has no units. (d) all are measured in percent except the correlation, which is measured in squared percent. Answer: (c)

Chapter 15 - Relationships: Regression, Prediction and Causation

Regression
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
With the help of calculus, we obtain the equation of the least-squares regression line, namely, y = a + bx, where the slope is $b = r\, s_y/s_x$ and the intercept is $a = \bar{y} - b\bar{x}$.
Usually, with the help of a computer (or calculator) we find the means x̄ and ȳ, the standard deviations s_x and s_y, the correlation coefficient r, and a, b.

Ex1 & 3 Regression Equation
Archaeopteryx (cont.)
i =        1   2   3   4   5   6
Femur     38  56  59  64  74  50
Humerus   41  63  70  72  84   ?
(humerus) = (−3.66) + (1.197) × (femur),
(−3.66) + (1.197)(50) = 56.2.
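The slope and intercept formulas b = r s_y/s_x and a = ȳ − b x̄ can be checked on the five complete Archaeopteryx fossils with a short Python sketch. The function name `least_squares_line` and the variable names are illustrative assumptions; the code follows the formulas quoted in these notes rather than calling a regression library.

```python
from statistics import mean, stdev

# Five complete fossils (centimeters); the sixth has femur 50 and no humerus.
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]


def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation: average product of standard scores, dividing by n - 1.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept
    return a, b, r


a, b, r = least_squares_line(femur, humerus)

# Predicted humerus length for the sixth fossil, whose femur is 50 cm.
predicted_humerus = a + b * 50

print(f"humerus = {a:.2f} + {b:.3f} * femur")
print(f"predicted humerus for femur 50: {predicted_humerus:.1f} cm")
print(f"r^2 = {r * r:.3f}")
```

A femur of 50 cm lies inside the observed range (38 to 74 cm), so this prediction is an interpolation rather than an extrapolation.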
Ans: 56.2 cm

Understanding Prediction
Prediction is based on fitting some “model” to a set of data (prophecy?). It works best when the model fits the data closely; predicting outside the range of the available data is risky.
The square of the correlation, r², is the proportion of the variation in the values of y that is explained by the least-squares regression of y on x.
Ex 5 Using r²: For Ex 1 (5 fossils) we have r = 0.994, so r² = (0.994)² = 0.988, i.e., only 1.2% of the variation of y is not explained by the variation of x.
Figure 15.2 A weaker straight-line pattern. The data are the percentage in each state who voted Democratic in the two Reagan presidential elections. r = 0.704, r² = 0.496, i.e., 50.4% not explained!

Ex6 Causation
Statistics and causation
• A strong relationship between two variables does not always mean that changes in one variable cause changes in the other.
• The relationship between two variables is often influenced by other variables lurking in the background.
• The best evidence for causation comes from randomized comparative experiments.
Ex 6: Does TV extend life? Measure the number of TV sets per person x and the life expectancy y for the world’s nations. There is a high positive correlation: nations with many TV sets have higher life expectancies. A lurking variable: national wealth.
Read Ex 7 Obesity in mothers and daughters

Causation
Figure 15.5 Some explanations for an observed association. A dashed line shows an association. An arrow shows a cause-and-effect link. Variable x is explanatory, y is a response variable, and z is a lurking variable.
Ex8 SAT scores. . .
High scores “x” on the SAT exams in high school certainly do not cause high grades “y” in college. A moderate association (say r² about 27%) is no doubt explained by common response to variables such as academic ability, study habits and staying sober (any of these are lurking variables “z”).
* Prediction does not require causation.
The observed relationship between two variables may be due to direct causation, common response, or confounding. Two or more of these factors may be present together. An observed relationship can, however, be used for prediction without worrying about causation as long as the patterns found in past data continue to hold true.
** Now it’s your turn. Read Case Study Evaluated **

Thought Questions. . .
• From past natural disasters, a strong positive correlation has been found between the amount of aid sent and the number of deaths. Would you interpret this to mean that sending more aid causes more people to die? Explain.
• From a long-term study on several families, researchers constructed a scatterplot of the cholesterol level of a child at age 50 versus the cholesterol level of the father at age 50. You know the cholesterol level of your best friend’s father at age 50. How could you use this scatterplot to predict what your best friend’s cholesterol level will be at age 50?
Evidence of causation: the association is strong (the association between smoking and lung cancer is very strong); the association is consistent (many studies of different kinds of people in many countries link smoking to lung cancer); higher doses yield stronger responses (people who smoke more cigarettes per day or who smoke over a longer period get lung cancer more often); the alleged cause is plausible (experiments with animals show that tars from cigarette smoke do cause cancer), etc.

Thought Questions (cont). . .
• Studies have shown a negative correlation between the amount of food consumed that is rich in beta carotene and the incidence of lung cancer in adults. Does this correlation provide evidence that beta carotene is a contributing factor in the prevention of lung cancer? Explain.
• A scatterplot of number of bicycles sold versus number of bank robberies in the United States for each year over the past century would show a very strong positive correlation. Why would this be true? Does an increase in one cause an increase in the other?

More Examples
Prediction via Regression Line
The regression equation is y = 3.6 + 0.97x (where y is the average age of all husbands who have wives of age x).
For all women aged 30, we predict the average husband age to be 32.7 years: 3.6 + (0.97)(30) = 32.7
Suppose we know that an individual wife’s age is 30. What would we predict her husband’s age to be?
The square of the correlation, r², measures the usefulness of regression prediction, e.g.,
• if r = ±1, so r² = 1, then the regression line explains all (100%) of the variation in y
• if r = 0.7, so r² = 0.49, then the regression line explains almost half (50%) of the variation in y

A Caution
Beware of Extrapolation: Sarah’s height was plotted against her age (Hand, et al., A Handbook of Small Data Sets, London: Chapman and Hall)
Regression line: y = 71.95 + 0.383 x (x is age in months, y is height in cm)
Can you predict her height at age 42 months? y = (71.95) + (0.383)(42) = 88 cm.
Can you predict her height at age 30 years (360 months)? y = (71.95) + (0.383)(360) = 209.8 cm. She is predicted to be 6’ 10.5” at age 30. [Could that be possible?]

Again. . .
Correlation does not imply causation, and two variables may be related if
• Explanatory variable causes change in response variable
• Response variable causes change in explanatory variable
• Explanatory variable may have some cause, but is not the sole cause of changes in the response variable
• Confounding variables may exist
• Both variables may result from a common cause (such as both variables changing over time)

Both Variables are Changing Over Time: [both divorces and suicides have increased dramatically since 1900. (explanatory)] Are divorces causing suicides or are suicides causing divorces? The population has increased dramatically since 1900 (causing both to increase). Better to investigate: Has the rate of divorce or the rate of suicide changed over time?
Imagining Examples. . .
Explanatory causes Response: [pollen count from grasses (explanatory)] and [percentage of people suffering from allergy symptoms (response)]; or [amount of food eaten (explanatory)] and [hunger level (response)]
Response causes Explanatory: [Hotel advertising dollars (explanatory)] and [occupancy rate (response)] Positive correlation? (more advertising leads to increased occupancy rate?) No, lower occupancy leads to more advertising.
Explanatory is not Sole Contributor: [Consumption of barbecued foods (explanatory)] and [Incidence of stomach cancer (response)] Barbecued foods are known to contain carcinogens, but other lifestyle choices may also contribute.

Imagining Examples. . . (cont)
Confounding Variables: [meditation (explanatory)] and [aging (measurable aging factor) (response)] General concern for one’s well being may be confounded with the decision to try meditation.
Common Response (both variables change due to common cause): [divorce among men (explanatory)] and [percent abusing alcohol (response)] Both may result from an unhappy marriage.
The correlation may be merely a coincidence.

Exercise Ch15
15.6 & 15.8 IQ and the school GPA. Figure 14.10 (page 302) plots school grade point average (GPA) against IQ test score for 78 seventh-grade students. There is a roughly straight-line pattern with quite a bit of scatter. The correlation between these variables is r = 0.634. What percentage of the observed variation among the GPAs of these 78 students is explained by the straight-line relationship between GPA and IQ score? What percentage of the variation is explained by differences in GPA among students with similar IQ scores?
15.8.
The least-squares line for predicting school GPA from IQ score, based on the 78 students plotted in Figure 14.10, is GPA = −3.56 + (0.101)(IQ). Explain in words the meaning of the slope b = 0.101. Then predict the GPA of a student whose IQ score is 115.

Exercise (answer) Ch15
**Answers
15.6. Of the observed variation among the GPAs of these 78 students, the percent explained by the straight-line relationship between GPA and IQ score is r² = (0.634)² = 0.402 = 40.2%. The rest of the variation (59.8%) is due to differences in GPA among students with similar IQ scores.
15.8. The slope b = 0.101 means that we expect GPA to increase by about 0.101 points for every one-point increase in IQ (and GPA drops by about 0.101 for every one-point decrease in IQ). For an IQ of 115, we predict a GPA of −3.56 + (0.101)(115) = 8.055.

Multiple choice Ch15
Consider a large number of countries around the world. There is a positive correlation between the number of Nintendo games per person x and the average life expectancy y. Does this mean that we could increase the life expectancy in Rwanda by shipping Nintendo games to that country? (a) Yes: the correlation says that as the number of Nintendo games per person goes up, so does life expectancy. (b) No: if the correlation were negative we could accept that conclusion, but this correlation is positive. (c) Yes: positive correlation means that if we increase x, then y will also increase. (d) No: the positive correlation just shows that richer countries have both more Nintendo games per person and higher life expectancies. Answer: (d)

Suppose that the correlation between the scores of students on Exam 1 and Exam 2 in a statistics class is r = 0.7.
One way to interpret r is to say what percent of the variation in Exam 2 scores can be explained by the straight-line relationship between Exam 2 scores and Exam 1 scores. This percent is about (a) 0.49%. (b) 70%. (c) 49%. (d) 30%. Answer: (c)
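Several of the r² figures quoted in Chapter 15 can be reproduced with a one-line formula. The Python sketch below is only a convenience check; the function name `percent_explained` is an assumption made for this example.

```python
def percent_explained(r):
    """Percent of the variation in y explained by the least-squares line: 100 * r^2."""
    return 100 * r ** 2


# Correlations quoted in the notes: fossils, GPA versus IQ, Exam 1 versus Exam 2.
for label, r in [("fossils", 0.994), ("GPA vs IQ", 0.634), ("exams", 0.7)]:
    explained = percent_explained(r)
    print(f"{label}: r = {r}, explained = {explained:.1f}%, "
          f"not explained = {100 - explained:.1f}%")
```

Because r is squared, the sign of the correlation does not matter here: r = −0.7 and r = 0.7 both correspond to about 49% of the variation explained.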