Chapter 3: Data Description – Numerical Methods

Learning Objectives
Upon successful completion of Chapter 3, you will be able to:
• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
• Describe data using measures of variation, such as the range, variance, and standard deviation.
• Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.
• Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.

I. Basic Vocabulary
A. Statistic vs. Parameter
• A statistic is a numerical characteristic or numerical summary obtained by using the data values from a sample.
• A parameter is a numerical characteristic or numerical summary obtained by using all the data values for the entire population.
B. Numerical Summaries of Quantitative Data
1. Measures of the average or center: mean, median, mode, and midrange.
2. Measures of variation (spread, variability, or dispersion): range, variance, and standard deviation.
THE GENERAL ROUNDING RULE: When the final answer is computed, always round it to one more decimal place than the data.
C. Notation: the symbol for a numerical summary indicates whether it is a parameter or a statistic
• N: population size
• n: sample size
• $\mu$: population mean
• $\bar{x}$: sample mean
Note: The mean is found the same way for the sample or population.
• $\sigma^2$: population variance
• $\sigma$: population standard deviation
• $s^2$: sample variance
• $s$: sample standard deviation
• $p$: population proportion
• $\hat{p}$: sample proportion
• $x$: a value of the variable or an answer to the question asked
Dr. Janet Winter, jmw11@psu.edu Stat 200 Page 1

II. Measures of the Center
A. Mean
I. Mean (ungrouped data) – To calculate the mean, take the sum of all data values, and then divide by the number of values:
Sample Mean: $\bar{x} = \frac{\sum x}{n}$
Population Mean: $\mu = \frac{\sum x}{N}$
Note: The mean is found the same way for the sample and population.
Example: Mean I 3.
5 10 15 20 25 22
Note: The answer is rounded to one more place than the data.
a) Example: Mean II
5.0 10.1 15.3 20.2 57.3
Note: The answer is rounded to one more place than the data.

II. Mean for Grouped Data – To calculate the mean of grouped data:
1. Use the class midpoint ($x_i$) for each class.
2. Use the class frequency ($f_i$) for each class with the formula
$\bar{x} = \frac{\sum x_i f_i}{n}$, where $n = \sum f_i$
a) Example: Mean for Grouped Data
Class | Midpoint (x) | Frequency (f) | xf
7 – 21 | 14 | 3 | 42
22 – 36 | 29 | 5 | 145
37 – 51 | 44 | 4 | 176
Totals | | 12 | 363
$\bar{x} = \frac{363}{12} = 30.25 \approx 30.3$
Note: The answer is rounded to one more place than the data.

III. Weighted Mean – A weighted mean has an additional factor, or weight ($w$), for each value:
$\bar{x} = \frac{\sum x w}{\sum w}$
a) Example: Weighted Mean
PSU Grade Point Average (GPA) – grades are weighted by their quality points.
Course | Credit (w) | Grade (x) | xw
English | 3.0 | B+ 3.5 | 10.5
Stat | 4.0 | A 4.0 | 16.0
History | 3.0 | C 2.0 | 6.0
Totals | 10.0 | | 32.5
$\text{GPA} = \frac{32.5}{10.0} = 3.25$

B. Median (MD, $\tilde{x}$)
I. Median – To calculate the median, place the data in increasing order and find the value in the center of the ordered list.
II. Median:
• The middle value in an ordered list of data.
• It is the value with the same number of data values above and below it.
• Used for data sets with outliers.
• In the absence of outliers, use the mean.
Process:
1. Order the data values from the smallest to the largest.
2. When the sample size n is odd, the median is the data value located in the exact middle.
3. When the sample size n is even, there are two data values in the middle. The median is the average of the two data values in the middle.
Example: Median (n odd)
Example data: 14 23 13 54 67 12 4
Ordered data: 4 12 13 14 23 54 67
Answer: 14 is the data value in the center, so MD = 14.
Example: Median (n even)
Example data: 14 23 13 54 67 12 4 90
Ordered data: 4 12 13 14 23 54 67 90
The middle value is between 14 and 23. Solve for the average:
(14 + 23)/2 = 37/2 = 18.5
MD = 18.5 (round to one more place or tenths)
C.
Mode
• For ungrouped data, the mode is the value that occurs most often in a data set.
• For grouped data, the mode is the class with the highest frequency and is called the modal class.
• Bimodal: two classes, or two data values, share the largest frequency.
• No mode: no value is repeated.
• Multimodal: more than two data values, or more than two classes, share the greatest frequency.
• There is no symbol for the mode.

Question 1
When a person says that the average age of a group of workers is 35, the average
a) is the mean of the ages.
b) is the median of the ages.
c) could be either the mean or the median of the ages.
d) do not know.

Question 2
If we are taking a test and we wish to score in the upper half of the students, then we wish to be higher than the
a) mean of the test scores.
b) median of the test scores.
c) either the mean or the median of the test scores.
d) do not know.

D. Midrange (MR)
I. Midrange (MR):
• The value in the middle of the range
• The value midway between the lowest and highest data values
II. Example: Midrange
Find the midrange for: 2, 13, 1, 25, 45, 67, 90
1. Order the data: 1 2 13 25 45 67 90
2. Take the average of 1 (lowest) and 90 (highest).
3. MR = (1 + 90)/2 = 45.5 (round to one more place or tenths)

E. Comparison of Mean, Median, Mode, and Midrange
Measure of Center | Definition | How Common? | Existence | Takes Every Value into Account? | Affected by Extreme Values? | Advantages and Disadvantages
Mean | sum of the values divided by the number of values | most familiar "average" | always exists | yes | yes | used throughout this book; works well with many statistical methods
Median | middle value | commonly used | always exists | no | no | often a good choice if there are some extreme values
Mode | most frequent data value | sometimes used | might not exist; may be more than one mode | no | no | appropriate for data at the nominal level
Midrange | (lowest + highest)/2 | rarely used | always exists | no | yes | very sensitive to extreme values

General Comments:
• For a data collection that is approximately symmetric with one mode, the mean, median, mode, and midrange tend to be about the same.
• For a data collection that is obviously asymmetric, it would be good to report both the mean and median.
• The mean is relatively reliable. That is, when samples are drawn from the same population, the sample means tend to be more consistent than the other measures of center (consistent in the sense that the means of samples drawn from the same population don't vary as much as the other measures of center). (Triola & Triola, 2006)

Question 3
Which measures of the center are influenced by outliers?
a) Mean
b) Median
c) Mode
d) Midrange
e) Both A & D

Question 4
If we tally the votes in an election, then the winner would be the candidate corresponding to
a) the mean of the number of votes.
b) the median of the number of votes.
c) the mode of the number of votes.
d) do not know.

III. Shapes of Distributions
A. Symmetrical
• Symmetrical shapes have evenly distributed data values on both sides of the mean.
• The mean, median, and mode are all equal.
B. Positively skewed or Right skewed
• Positively skewed or right skewed shapes have the majority of data values fall to the left of the mean and cluster at the lower end of the distribution, with the tail to the right.
• The mean and median are to the right of the mode.
C.
Negatively skewed or Left skewed
• Negatively skewed or left skewed shapes have the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left.
• The mean and median are to the left of the mode.

IV. Measures of Spread (dispersion or variability)
A. Types
1. Range (R) – the highest value minus the lowest value in a data set
2. Variance
3. Standard Deviation

Question 5
An entertainment event advertises that people ages 1 to 100 would enjoy the event. The advertisement specifically describes a set of people with
a) a large number of ages.
b) a large range of ages.
c) a large mean of ages.
d) do not know.

B. Variance
I. Population Variance ($\sigma^2$) - To calculate population variance:
1. Find the mean.
2. Subtract the mean from each data value.
3. Square each difference.
4. Add the squared differences and divide the sum by the number of data values.
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
II. Sample Variance ($s^2$) – To calculate sample variance:
1. Find the mean.
2. Subtract the mean from each data value.
3. Square each difference.
4. Add the squared differences and divide the sum by the number of data values minus one.
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
III. Example: Sample Variance
Data: 3 7 6 4 (mean = 20/4 = 5)
x | x − mean | (x − mean)²
3 | 3 – 5 = -2 | 4
7 | 7 – 5 = 2 | 4
6 | 6 – 5 = 1 | 1
4 | 4 – 5 = -1 | 1
Total | | 10
$s^2 = \frac{10}{4 - 1} \approx 3.3$
Note: Round to one more place than the original data.

C. Standard Deviation
I. Population Standard Deviation ($\sigma$) – The population standard deviation ($\sigma$) is the square root of the population variance ($\sigma^2$).
Same rounding rule: Round the final answer to one more decimal place than the original data.
II. Sample Standard Deviation (s)
a) Deviation Formula – the sample standard deviation ($s$) is the square root of the sample variance ($s^2$).
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Same rounding rule: Round the final answer to one more decimal place than the original data.
b) Computational Formula
Note: This formula gives better accuracy when the mean has several decimal places that are likely to be rounded off or ignored.
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$
Process:
1. Find the sum of all of the data values.
2. Find the sum of the squared data values.
3. Multiply the sum of the squared data values by the number of data values.
4. Square the sum of the data values from step 1.
5. Subtract the step 4 answer from the step 3 answer.
6. Divide the difference in step 5 by n times (n – 1).
7. Take the square root of the quotient.
c) Example: Standard Deviation (again with the computational formula)
Data: 3 7 6 4
x | x²
3 | 9
7 | 49
6 | 36
4 | 16
Total 20 | 110
$s = \sqrt{\frac{4(110) - 20^2}{4(3)}} = \sqrt{\frac{40}{12}} \approx 1.8$

D. Range Rule of Thumb
A rough estimate of the standard deviation is:
$s \approx \frac{\text{range}}{4}$
where range is the highest data value minus the lowest data value.

E. Standard Deviation for Grouped Data: Use the class midpoints and frequencies
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{n - 1}}$
I. Example: Standard Deviation for Grouped Data
Midpoint (x) | Frequency (f) | xf | x²f
14 | 3 | 42 | 588
29 | 5 | 145 | 4205
44 | 4 | 176 | 7744
Totals | 12 | 363 | 12537
With the totals from the table, the equivalent computational form gives
$s = \sqrt{\frac{12(12537) - 363^2}{12(11)}} = \sqrt{\frac{18675}{132}} \approx 11.9$

Question 6
If we know the variance of a set of data, then to calculate the standard deviation of this data
a) is a long process because of the many operations needed.
b) is a short process because the standard deviation is equal to the variance.
c) is a short process because the standard deviation is the square root of the variance.
d) do not know.

F. Uses for Variance and Standard Deviation
1. Measures of spread, variability, and consistency.
2. To complete inferential statistics.
3. To understand data distributions using Chebyshev's theorem and the Empirical Rule.

G. Coefficient of variation (cvar): Comparing Standard Deviations for Different Distributions
• To compare standard deviations for different distributions, use the coefficient of variation.
• The coefficient of variation is the standard deviation divided by the mean and multiplied by 100%.
• It is free of measurement units.
$\text{cvar} = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$
I. Example: Comparing Standard Deviations for Different Distributions I
The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
Sales: cvar = (5/87) × 100% ≈ 5.7%
Commissions: cvar = (773/5225) × 100% ≈ 14.8%
The commissions are more variable than the sales.
II. Example: Comparing Standard Deviations for Different Distributions II
John took two tests last week. The average for the history test was 61.3 and the standard deviation was 2.94. The average for the math test was 81.5 and the standard deviation was 3.14. Compare the variation for the two tests.
History Test: cvar = (2.94/61.3) × 100% ≈ 4.8%
Math Test: cvar = (3.14/81.5) × 100% ≈ 3.9%
The history test is more variable than the math test.

V. Calculator
A. TI-83 Key Strokes to Clear Lists
ALWAYS clear out lists before entering data.
1. STAT
2. CLRLIST (L1, L2, L3, …) Use second function 1 for L1, second function 2 for L2, etc. Be sure to include commas and end with a closing parenthesis.
3. ENTER
B. TI-83 Key Strokes to Enter Data
Enter data into a cleared list.
1. STAT
2. EDIT
3. Enter the data in the lists as needed, pressing ENTER after each data value.
C. TI-83 Basic Statistics for Ungrouped Data
1. Clear L1 and enter the data in L1
2. STAT
3. CALC
4. 1-VARIABLE STATS L1
5. ENTER
D. TI-83 Basic Statistics for Grouped Data
1. Clear L1, L2
2. STAT
3. EDIT
4. Enter midpoints in L1 and enter their corresponding frequencies in L2
5. STAT
6. CALC
7. 1-VARIABLE STATS L1, L2
8. Check that n is the sum of the frequencies

VI. Rules For Data Distribution
• For all data sets, use Chebyshev's Theorem.
• For bell-shaped or approximately normally distributed data sets, use the Empirical Rule (68-95-99.7 Rule).
A.
Chebyshev's Theorem for All Distributions
For any distribution, the proportion of values from a data set that will fall within k standard deviations of the mean will be at least $1 - \frac{1}{k^2}$, where k is a number greater than 1.
I. Process
Select values for k and compute $1 - \frac{1}{k^2}$
k | 1 – 1/k² | Interpretation
2.1 | 1 – 1/2.1² = .7732 | At least 77.32% of the data is within 2.1 standard deviations of the mean, or in the interval ($\bar{X} - 2.1S$, $\bar{X} + 2.1S$)
2 | 1 – 1/2² = .75 | At least 75% of the data is within 2 standard deviations of the mean, or in the interval ($\bar{X} - 2S$, $\bar{X} + 2S$)
3.3 | 1 – 1/3.3² = .90817 | At least 90.82% of the data is within 3.3 standard deviations of the mean, or in the interval ($\bar{X} - 3.3S$, $\bar{X} + 3.3S$)
II. Example: Chebyshev's Theorem for All Distributions - with k = 2
The mean price of houses in a certain neighborhood is $150,000, and the standard deviation is $10,000. Find the price range for which at least 75% of the houses will sell.
Using the table from the previous example, k = 2.
$150,000 + 2($10,000) = $150,000 + $20,000 = $170,000
$150,000 - 2($10,000) = $150,000 - $20,000 = $130,000
At least 75% of the houses cost between $130,000 and $170,000.

B. Empirical Rule for Bell Shaped Distributions
• Approximately 68% of the data values fall within one standard deviation of the mean.
• Approximately 95% of the data values fall within two standard deviations of the mean.
• Approximately 99.7% of the data values fall within three standard deviations of the mean.

VII. Measures of Position or Relative Standing
Measures of position describe the relative position of one data value in comparison with the entire set of data values.
• Z-score
• Percentiles
• Quartiles
• Deciles
A. Standard Scores (used to compare data values between two groups)
To compare data values, subtract the mean from the data value and divide by the standard deviation.
I. Z-score Formulas
For samples: $z = \frac{x - \bar{x}}{s}$
For populations: $z = \frac{x - \mu}{\sigma}$
II.
Example: Z-score I
A student scored 75 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 80 on a history test with a mean of 75 and a standard deviation of 6.1. Compare her relative positions on the two tests.
Calculus: z = (75 – 50)/10 = 2.50
History: z = (80 – 75)/6.1 = 0.82
The calculus z-score is larger. Thus, as a standard score (that is, compared to her classmates), the 75 in calculus is a better grade than the 80 on the history test.
Note: z-scores are always given to two-place accuracy.
III. Understanding Z-score
a) Z-scores have a mean of 0 and a standard deviation of 1.
b) A z-score is the number of standard deviations a value is away from the mean for a specific distribution.
d) Ordinary and Unusual z-scores
• Ordinary values: -2 < z < 2
• Unusual values: z < -2 or z > 2
e) Whenever a value is less than the mean, its corresponding z-score is negative.
f) Example: Z-score II
Using the information below, compare Joe's height of 78 inches to Susan's height of 73 inches.
Men have heights with a mean of 69.0 inches and a standard deviation of 2.8 inches.
Women have heights with a mean of 63.6 inches and a standard deviation of 2.5 inches.
Joe: z = (78 – 69)/2.8 = 3.21
Susan: z = (73 – 63.6)/2.5 = 3.76
Susan is taller compared to other women than Joe is compared to other men.

B. Percentiles (the position of a data value within its group)
A percentile, P, is an integer between 1 and 99 such that P% of the data values are less than or equal to the value and (100 – P)% of the data values are greater than or equal to the value.
I. Given a data value x, find the percentile P
1. Count the number of data values below x
2. Add .5
3. Divide the sum by the number of data values n
4. Multiply by 100%
5. Round to an integer using regular rounding rules
$P = \frac{(\text{number of values below } x) + 0.5}{n} \cdot 100\%$
II. Given the percentile P, find the data value x
• n: the total number of data values
• p: the percentile
• c: used to find the position of the data value
1. Order the data lowest to highest
2.
To find the position of the data value x, let: c = (n p)/100
3. To find the data value, use the position value c
• If c is not a whole number, round up to the next larger whole number. Starting at the lowest data value, count to the position that corresponds to the rounded-up value of c.
• If c is a whole number, use the value halfway between the c-th and (c + 1)-st values when counting up from the lowest value.
III. Example: Percentiles I
Find the value corresponding to the 13th percentile.
Unordered data: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Ordered data: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
$c = \frac{n \cdot p}{100} = \frac{10 \cdot 13}{100} = 1.3$
• Since c is not a whole number, round up to 2.
• Start at the lowest score and count to the second value, which is 3.
• 3 is the 13th percentile value.
IV. Example: Percentiles II
A teacher gives a 20-point test to 10 students. The scores are shown below. Find the percentile rank of a score of 12.
Unordered data: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Ordered data: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Six scores are below 12, so
$\text{Percentile} = \frac{6 + 0.5}{10} \cdot 100\% = 65\text{th percentile}$

C. Quartiles
Quartiles divide the ordered list of data values into four groups.
• Q1 is the same as the 25th percentile
• Q2 is the 50th percentile or the median
• Q3 is the 75th percentile

Question 7
If a botanist measures the length of flower petals and finds that 75% of the lengths are 1.5 cm or longer, then 1.5 is
a) the first quartile of lengths of petals.
b) the 25th percentile of lengths of petals.
c) both of the above.
d) do not know.

D. Deciles
Deciles divide the distribution into 10 groups. They are denoted by D1, D2, …, D10.
How do deciles relate to percentiles?

Question 8
The percentile that corresponds to the mean is
a) the 50th percentile.
b) the 100th percentile.
c) no particular percentile corresponds to the mean.
d) do not know.

VIII. Exploratory Data Analysis
A. Introduction
I.
Purpose:
• Examine data patterns when the mean is affected by outliers.
• Find gaps in the data.
• Find patterns.
• Compare data sets.
• Identify outliers (values located far away from other values).
II. Exploratory Data Analysis includes:
1. Five-number summary
2. Box plot
B. Five-Number Summary
A five-number summary is a list of:
• The lowest value of the data set (L or minimum)
• Q1 (25th percentile)
• The median (MD or 50th percentile)
• Q3 (75th percentile)
• The highest value of the data set (H or maximum)
A box plot is a graphical representation of a five-number summary on a scaled axis. Be sure the box is above the scaled line and drawn to scale (see the example in the text and Section X).

Question 9
A box plot can be drawn from data in a stem and leaf plot by
a) counting the values in the stem and leaf plot to determine the five number summary.
b) adding the values in the stem and leaf plot to determine the five number summary.
c) graphing only the stems and not the leaves from the stem and leaf plot.
d) do not know.

IX. Outliers
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values. An outlier:
• Can be the result of measurement or observational error.
• Can also indicate something else in the data.
• Can have a dramatic effect on the mean.
• Can have a dramatic effect on the standard deviation.
• Can have a dramatic effect on the scale of the histogram, so that the shape of the distribution is obscured.
A. Outliers for Normally Distributed Data
Any data value more than three standard deviations away from the mean is considered an outlier.
B. Outliers for Other Distributions
1. Arrange the data in order
2. Find Quartile 1 and Quartile 3
3. Find the inter-quartile range: IQR = Q3 – Q1
4. Outliers are:
• Any data value larger than Q3 + 1.5 (IQR)
• Any data value smaller than Q1 - 1.5 (IQR)

X.
Box Plots (Box and Whisker Plots)
A box plot is a scaled graph of the five-number summary.
Process:
1. Find the five-number summary (minimum, Q1, Q2, Q3, and maximum).
2. Construct a horizontal scale that includes the minimum and the maximum data values. Start the scale at or below the lowest data value and end it slightly above the largest data value.
3. Construct a rectangle "floating" above the line with the left end at Quartile 1 and the right end at Quartile 3.
4. Construct a vertical line segment inside the box at the median.
5. Construct a horizontal line segment from the center of the lower vertical box edge to the lowest data value that is not an outlier. Construct a second horizontal line segment from the center of the upper vertical box edge to the highest data value that is not an outlier.
6. Graph mild outliers with a solid dot. Graph extreme outliers with an open dot.

XI. Summary
• Histograms, frequency polygons, and ogives are used for quantitative data organized in a grouped frequency distribution.
• Pareto charts and bar graphs are frequency graphs for qualitative variables.
• Time series graphs are used to show a pattern or trend that occurs over time.
• Pie graphs are used to show the relationship between the parts and the whole for qualitative or categorical data.
• Data can be organized in meaningful ways using frequency distributions and graphs.
• In descriptive statistics, we use all of these numerical and graphical techniques with sampling methods to collect, organize, summarize, and present data.
• Data is organized for interpretation and inference.

Answer: Question 1
When a person says that the average age of a group of workers is 35, the average
C – could be either the mean or the median of the ages.

Answer: Question 2
If we are taking a test and we wish to score in the upper half of the students, then we wish to be higher than the
B – the median of the test scores.
Answer: Question 3
Which measures of the center are influenced by outliers?
E – both A & D.

Answer: Question 4
If we tally the votes in an election, then the winner would be the candidate corresponding to
C – the mode of the number of votes.

Answer: Question 5
An entertainment event advertises that people ages 1 to 100 would enjoy the event. The advertisement specifically describes a set of people with
B – a large range of ages.

Answer: Question 6
If we know the variance of a set of data, then to calculate the standard deviation of this data
C – is a short process because the standard deviation is the square root of the variance.

Answer: Question 7
If a botanist measures the length of flower petals and finds that 75% of the lengths are 1.5 cm or longer, then 1.5 is
C – both of the above (A & B).

Answer: Question 8
The percentile that corresponds to the mean is
C – no particular percentile corresponds to the mean.

Answer: Question 9
A box plot can be drawn from data in a stem and leaf plot by
A – counting the values in the stem and leaf plot to determine the five number summary.
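
Checking the worked examples with software
The chapter's hand calculations (mean, median, midrange, sample variance and standard deviation, grouped data, coefficient of variation, z-scores, Chebyshev's bound, percentiles, and the IQR outlier rule) can be cross-checked with a short script. The following is only a minimal sketch in Python, which these notes do not use (they rely on the TI-83). The function names are illustrative assumptions rather than part of the course materials; the data values are copied from the examples in this chapter. Python's statistics.variance and statistics.stdev divide by n – 1, so they correspond to the sample formulas.

```python
import math
from statistics import mean, median, stdev, variance


def midrange(values):
    """Midrange (MR): average of the lowest and highest data values."""
    return (min(values) + max(values)) / 2


def grouped_mean_stdev(midpoints, freqs):
    """Sample mean and standard deviation from class midpoints and frequencies."""
    n = sum(freqs)
    xbar = sum(x * f for x, f in zip(midpoints, freqs)) / n
    squared_dev = sum(f * (x - xbar) ** 2 for x, f in zip(midpoints, freqs))
    return xbar, math.sqrt(squared_dev / (n - 1))


def coefficient_of_variation(sd, center):
    """Standard deviation divided by the mean, as a percentage."""
    return sd / center * 100


def z_score(x, center, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - center) / sd


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations (k > 1)."""
    return 1 - 1 / k**2


def percentile_rank(values, x):
    """Percentile of x: (count below x + 0.5) / n * 100, rounded half up."""
    below = sum(1 for v in values if v < x)
    return math.floor((below + 0.5) * 100 / len(values) + 0.5)


def value_at_percentile(values, p):
    """Data value at integer percentile p, using the position c = n*p/100."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder:  # c is not whole: round up and count from the lowest value
        return ordered[whole]
    # c is whole: halfway between the c-th and (c + 1)-st values
    return (ordered[whole - 1] + ordered[whole]) / 2


def outlier_fences(values):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1 = value_at_percentile(values, 25)
    q3 = value_at_percentile(values, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Data taken from the chapter's worked examples.
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(mean([3, 7, 6, 4]), variance([3, 7, 6, 4]), stdev([3, 7, 6, 4]))
print(median([14, 23, 13, 54, 67, 12, 4, 90]))
print(midrange([2, 13, 1, 25, 45, 67, 90]))
print(grouped_mean_stdev([14, 29, 44], [3, 5, 4]))
print(z_score(75, 50, 10), z_score(80, 75, 6.1))
print(chebyshev_bound(2))
print(percentile_rank(scores, 12), value_at_percentile(scores, 13))
print(outlier_fences(scores))
```

The script prints unrounded values, so apply the general rounding rule (one more decimal place than the data) before comparing them with the answers in the notes. Data values falling outside the two fences returned by outlier_fences would be flagged as outliers under the IQR rule in Section IX.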