Descriptive Statistics: An Overview of Statistics and Probability
STAT 211 Statistics, Texas A&M University (29 pages)
Chapter 1: Descriptive Statistics. Copyright © 1998-2004 by Henrik Schmiediche.
Statistics 211 [Student]

1 An Overview of Probability and Statistics

1.1 What is Statistics?

In short: Analysis of data (all stages).

Common perceptions of statistics: ______

The above are "descriptions of the world" using numbers. This is a (very visible) part of statistics, but statistics deals with more than "describing phenomena."

Examples of Statistics:

Polio Vaccine
In the 1950s polio was a serious disease that affected countless people (mostly children). In 1954:
• 401,974 children took part in a vaccine trial:
• 201,229 received a trial vaccine and
• 200,745 received a placebo.
There were a total of ______ polio cases: ______ for placebo versus ______ for vaccine.
Question: ______

Unemployment
We desire to know the unemployment rate.
Problem: How do you find the answer? How accurate are the results?

Stress
Traffic lights are installed to aid in merging onto the Interstate (I-75 in Tampa, FL).
• Stress level of drivers measured before the lights: ______
• After the lights it was ______.
Question: ______

1.2 Branches of Statistics

Descriptive (deductive) statistics: Statistical methods that summarize and describe the prominent features of data.
Inferential (inductive) statistics: Statistical methods that generalize from a sample to a population.
Population: The entire collection of individuals, objects, or measurements about which information is desired.
Sample: A part or subset of the population.
Examples:
• Polio: ______
• Unemployment: ______
• Stress: ______

A population or sample is not static, but depends upon the definition of the problem.

Most of the samples in this course will be ______: a sample that is randomly chosen from the population.

Historically ______ statistics were far more important than ______ statistics. Nowadays the reverse is true.

1.3 Data Definitions

Categorical vs. Numerical
Categorical: Observations that are only classified into groups. Examples: ______
Numerical: Observations that have a numerical quality about them. Examples: ______

Classify the following:
• State of birth
• Weight at birth
• Date of birth
• Zip code

Discrete vs. Continuous
Discrete: A variable is discrete if it can assume only a countable number of possible values. Examples: ______
Continuous: A variable is continuous if it can assume an uncountable number of values. Examples: ______
There will usually be practical limitations on the accuracy with which any continuous variable can be recorded.

Data Sets
Univariate: Data set consists of ______ variable.
Bivariate: ... ______ variables.
Multivariate: ... more than ______ variables.
Example of a multivariate data set: ______

2 Pictorial and Tabular Methods in Descriptive Statistics

Consider the Following Data Set: (Chp 1, #10)
The concentration of suspended solids in river water is an important environmental characteristic.
The paper "Water Quality in Agricultural Watershed: Impact of Riparian Vegetation During Base Flow" (Water Resources Bull., 1981, pp. 233-239) reported on concentrations (in parts per million, or ppm) for several different rivers. Suppose the following 50 observations had been obtained for a particular river.

55.8 45.9 83.2 75.3 60.7 60.9 39.1 40.0 71.4 77.1
37.0 35.5 31.7 65.2 59.1 91.3 56.0 36.7 52.6 49.5
65.8 44.6 62.3 58.2 69.3 42.3 71.7 47.3 48.0 69.8
33.8 61.2 94.6 61.8 64.9 60.6 61.5 56.3 78.8 27.1
76.0 47.2 30.0 39.8 87.1 69.0 74.5 68.2 65.0 66.3

Question: What do these data tell us about the concentration of suspended solids?

First few steps in analyzing a data set:
1. Organize and summarize the data.
2. Find the center of the data.
3. Examine the spread of the data.

2.1 Stem and Leaf Display

A compact and descriptive method of organizing data without losing any information in the data.
• Leading digits are stems.
• Trailing digits are leaves.
• Indicate units somewhere on the display.
• Option: Sort the leaves.
• Comparative stem & leaf.
• Repeat stems if need be.

Advantages:
• No loss of information.
• Easy to do for small data sets.

Disadvantages:
• Time consuming for large data sets (by hand).
• Cannot be used for categorical data.
• Very space consuming for large data sets.
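The stem-and-leaf rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf` is made up here, values are assumed non-negative, and the decimal part is simply truncated before splitting into stem and leaf (rounding is another common choice). The sample is the first ten solids observations.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a sorted stem-and-leaf display as strings.

    Assumption: values are non-negative; the decimal part is truncated,
    the last remaining digit is the leaf and the leading digits the stem.
    """
    leaves_by_stem = defaultdict(list)
    for v in values:
        stem, leaf = divmod(int(v), 10)   # e.g. 55.8 -> 55 -> stem 5, leaf 5
        leaves_by_stem[stem].append(leaf)

    rows = []
    # List every stem between the smallest and largest, even if it is empty,
    # so that gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(d) for d in sorted(leaves_by_stem[stem]))
        rows.append(f"{stem} : {leaves}")
    return rows


# First ten observations of the solids data set (units: ppm).
sample = [55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1]
for row in stem_and_leaf(sample):
    print(row)
```

Sorting the leaves inside the function corresponds to the "Option: Sort the leaves" bullet; dropping the `sorted` call keeps the leaves in data order instead.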
A stem-and-leaf display of the solids data set (data as listed at the start of Section 2): ______

Stem-and-leaf display of the solids data set with sorted leaves:

  2 : 7
  3 : 0245779
  4 : 002567789
  5 : 366689
  6 : 111112255566899
  7 : 01245679
  8 : 37
  9 : 15
  units: ppm

Stem-and-leaf display with multiple leaf values on a stem:

  2 : 7
  3 : 024
  3 : 5779
  4 : 002
  4 : 567789
  5 : 3
  5 : 66689
  6 : 1111122
  6 : 55566899
  7 : 0124
  7 : 5679
  8 : 3
  8 : 7
  9 : 1
  9 : 5
  units: ppm

Comparative stem-and-leaf display of the solids data set and a data set taken two years earlier:

  Two Years Earlier       Current
  -------------------------------------
                8 : 1 :
             9851 : 2 : 7
          9887640 : 3 : 0245779
    9997765322111 : 4 : 002567789
        877554200 : 5 : 366689
       9887653221 : 6 : 111112255566899
            72210 : 7 : 01245679
               95 : 8 : 37
                  : 9 : 15
  units: ppm

Sometimes we redefine the leaves for low-numbered or narrow data sets:

58, 58, 57, 54, 54, 54, 57, 57, 56, 56, 57, 51, 58, 54, 52, . . .
, 52, 54

  60 : 0
  59 : 00
  58 : 00000000000
  57 : 0000000000
  56 : 0000000000
  55 : 0000000000000
  54 : 0000000000000
  53 : 0000
  52 : 000
  51 : 0

2.2 Frequency Distributions for Quantitative Data

A very popular way to summarize data is with a frequency distribution. A frequency distribution is a compact summary of a data set using a table with 3 or 4 columns:

Class interval (or category): Disjoint intervals that together contain every observation in the data set.
Frequency: Number of observations in a class interval = $f$.
Relative frequency: Proportion of observations in the interval = $f/n$.
Cumulative frequency: Sum of the relative frequencies up to and including the class, $\sum_{i=1}^{\text{class}} f_i/n$.

Number of classes: 5 to 20. Use $\sqrt{n}$ for a rough idea.

A Frequency Distribution for the solids data set (50 observations; data as listed at the start of Section 2):

Approximate number of classes: ______

Class Interval | [Tally] | Frequency | Relative f | Cumulative f

2.3 Histogram

A histogram is a pictorial representation of a frequency distribution.
1. Draw an x-axis and mark the class intervals.
2. Draw a rectangle whose area is proportional to the frequency of that interval.
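The frequency distribution of Section 2.2, on which the histogram is built, can be tabulated with a short script. This is a sketch under stated assumptions: the function name `frequency_distribution` is made up here, and the choice of 8 classes of width 10 starting at 20 ppm is only one convenient choice near the $\sqrt{n}$ guideline, not necessarily the one intended on the slide.

```python
import math

# The 50 solids observations (ppm) from Section 2.
solids = [
    55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1,
    37.0, 35.5, 31.7, 65.2, 59.1, 91.3, 56.0, 36.7, 52.6, 49.5,
    65.8, 44.6, 62.3, 58.2, 69.3, 42.3, 71.7, 47.3, 48.0, 69.8,
    33.8, 61.2, 94.6, 61.8, 64.9, 60.6, 61.5, 56.3, 78.8, 27.1,
    76.0, 47.2, 30.0, 39.8, 87.1, 69.0, 74.5, 68.2, 65.0, 66.3,
]


def frequency_distribution(data, lower, width, n_classes):
    """Return rows (left, right, f, f/n, cumulative f/n) for classes [left, right)."""
    n = len(data)
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)          # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1

    rows, cumulative = [], 0.0
    for k, f in enumerate(counts):
        relative = f / n
        cumulative += relative
        rows.append((lower + k * width, lower + (k + 1) * width, f, relative, cumulative))
    return rows


print("rough number of classes:", round(math.sqrt(len(solids))))
for left, right, f, rel, cum in frequency_distribution(solids, lower=20, width=10, n_classes=8):
    print(f"[{left:3d}, {right:3d})  f={f:2d}  rel={rel:.2f}  cum={cum:.2f}")
```

Half-open intervals [left, right) keep the classes disjoint, so an observation such as 40.0 is counted in exactly one class.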
[Figure: histogram of the solids data; x-axis: solids, 20 to 100]

A true histogram, i.e. one drawn on a density scale, has a total area equal to 1.0. In that case we make the:

$\text{Rectangle Height} = \dfrac{\text{Relative Frequency}}{\text{Base Length}}$

In the case where all the intervals are of equal length, all we need to do is add the appropriately labeled y-axis.

[Figure: density-scale histogram of the solids data; x-axis: solids, 20 to 100; y-axis: 0.0 to 0.030]

Histograms often exhibit particular shapes:
• unimodal
• bimodal
• multimodal
• symmetric
• positively skewed
• negatively skewed

2.4 The M&M Data Set

Some important questions: How many M&M's are there in a regular-size plain M&M bag? More importantly, how many red M&M's are there?

HW assignment: Buy an M&M bag (small, plain) and count the number of M&M's and the number of red M&M's. Email them to me at "henrik@stat.tamu.edu".

Part of homework #1 assignment (red M&M's for 1-3):
1. Create a Stem-and-Leaf plot.
2. Create a frequency distribution.
3. Plot a histogram (density scale).
4. Create a Comparative Stem-and-Leaf plot of the total number of M&M's (I will post the data on the web) to that of Spring 1998 (Section 2.4.1). Do you think the total number of M&M's per bag has changed?
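The density-scale rule of Section 2.3 (rectangle height = relative frequency / base length), which homework item 3 relies on, can be sketched as follows. The function name `density_heights` and the class boundaries and frequencies in the example are hypothetical, chosen only to show that unequal class widths still give a total area of 1.

```python
def density_heights(edges, frequencies):
    """Return the rectangle height of each class on a density scale.

    edges       -- class boundaries, one more than the number of classes
    frequencies -- number of observations in each class
    """
    if len(edges) != len(frequencies) + 1:
        raise ValueError("need exactly one more edge than frequencies")
    n = sum(frequencies)
    heights = []
    for left, right, f in zip(edges, edges[1:], frequencies):
        relative_frequency = f / n
        heights.append(relative_frequency / (right - left))
    return heights


# Hypothetical example with unequal class widths: [0,10), [10,20), [20,40).
edges = [0, 10, 20, 40]
frequencies = [5, 10, 5]
heights = density_heights(edges, frequencies)

# Total area = sum of height * base length; it should come out as 1.
area = sum(h * (right - left) for h, left, right in zip(heights, edges, edges[1:]))
print(heights, area)
```

With equal class widths every height is just the relative frequency divided by the same constant, which is why only the y-axis labels change in that case.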
2.4.1 The M&M Data Set for Spring 1998

The following data were collected by the Spring 1998 Stat 211 class:

Total M&M's (n = 68):
58 58 57 54 54 54 57 57 56 56
57 51 58 54 52 54 55 54 56 56
58 55 56 52 57 55 57 58 59 58
56 55 55 57 58 54 53 53 55 57
58 55 56 55 55 57 56 54 54 58
60 55 58 54 55 54 56 57 53 55
55 56 58 59 54 53 52 54

Red M&M's (n = 66):
18 16 13  6  6  7 15 14 12 12
14  2 15  9  5  8 10  8 11 13
18 11 12  4 15  9 14 16 20 15
12 10 11 14 17  7  6  6  9 13
15 10 11 10 10 14 12  9  8 19
11 15  7 10  7 12 13  5 11 11
13 19  8  5  3  9

3 Measures of Location

Another step in gaining understanding of our data is to find the "center" of our data. What is the center?

3.1 Mean / Average

Average: If we consider each number to have a "weight" equal to its value, then the average is the value which equally divides the data by weight. Think of a seesaw: ______

We calculate the average as follows:

Sample Average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population Average: $\mu = \frac{1}{N}\sum_{j=1}^{N} y_j$

$x_i$: The i'th observation in the sample.
$y_j$: The j'th value in the population.
$n$: Sample size.
$N$: Population size.
Note: $x_1, \ldots, x_n$ is a sample from population $y_1, \ldots, y_N$.

Example: Calculate the average number of red M&M's for the Spring 1998 M&M data set.
Red M&M's (n = 66): data as listed in Section 2.4.1.

[Figure: dot plot of the 66 red M&M counts; horizontal axis from 2 to 20]

3.2 Median

Median: The middle observation of the sorted data set.

Sample Median = $\tilde{x}$
Population Median = $\tilde{\mu}$

We calculate the median (subscripts index the sorted data):

n odd: $\tilde{x} = x_{[(n+1)/2]}$
n even: $\tilde{x} = \left(x_{[n/2]} + x_{[(n+2)/2]}\right)/2$

Example: Calculate the median number of red M&M's for the Spring 1998 M&M data set.

Red M&M's, sorted (n = 66):
 2  3  4  5  5  5  6  6  6  6
 7  7  7  7  8  8  8  8  9  9
 9  9  9 10 10 10 10 10 10 11
11 11 11 11 11 11 12 12 12 12
12 12 13 13 13 13 13 14 14 14
14 14 15 15 15 15 15 15 16 16
17 18 18 19 19 20

Discussion:
• Mean and ______ for different types of "smoothed histograms" (distributions); see the histogram shapes listed in Section 2.3.
• How do outliers affect the mean and median?

3.3 Other Measures of Location

3.3.1 Trimmed Mean

A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$ in that outliers will have some effect on the trimmed mean, but not as much as they have on the mean. It is calculated by eliminating a certain percentage of the observations from both ends and calculating the average of the remaining data. For example, a 10% trimmed mean would eliminate 10% of the observations from each end of the data (20% total) and average the remaining 80% of the observations.
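The three measures of location introduced so far (mean, median, trimmed mean) can be sketched in a few lines of Python. The function names and the ten-value data set below are hypothetical, chosen so the results are easy to check by hand. One assumption to flag: a fractional number of observations to trim is rounded half up, which is only one of several possible conventions.

```python
def mean(xs):
    """Sample average: sum of the observations divided by n."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle observation of the sorted data (average of the two middle ones if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def trim_count(n, percent):
    """Number of observations removed from EACH end, rounded half up (an assumption)."""
    return int(n * percent / 100 + 0.5)


def trimmed_mean(xs, percent):
    """Average of what is left after trimming `percent` of the data from each end."""
    s = sorted(xs)
    k = trim_count(len(s), percent)
    if 2 * k >= len(s):
        raise ValueError("trimming would remove every observation")
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one large outlier (95).
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]
print(mean(data), median(data), trimmed_mean(data, 10))
```

On this data the outlier pulls the mean far above the median, while the 10% trimmed mean stays close to the median.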
Worked example (trimmed mean): If we have a sample of 100 observations and we want to find $\bar{x}_{12\%}$ (the 12% trimmed mean), how many observations must we eliminate from each end?

Solution: We have n = 100 observations. 12% of this is 100 × 0.12 = 12. Therefore we eliminate 12 observations from each end, for a total of 24 observations.

There are a variety of ways to handle the case where we need to chop off a fractional number of data points. In this course we avoid the issue and simply round the number of observations that are removed.

Example: Calculate the 10% trimmed mean of red M&M's for the Spring 1998 M&M data set.

Red M&M's (n = 66): sorted data as listed in Section 3.2.

3.3.2 Percentiles and Quartiles

The p'th percentile is the value such that p% of the observations in our data set are less than or equal to it. The median is the 50'th percentile.

To calculate the p'th percentile $x_{[p]}$:
1. Let $x_{(i)}$ refer to our data set in ascending order.
2. Let $i_p = np/100$.
3. Find the first index $i$ such that $i > i_p$.
4. The p'th percentile is then:

$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } i - 1 = i_p \\ x_{(i)} & \text{otherwise} \end{cases}$

$Q_1$: First Quartile = 25'th percentile = lower fourth.
$Q_2$: Second Quartile = 50'th percentile = median.
$Q_3$: Third Quartile = 75'th percentile = upper fourth.
IQR = $f_s$ = $Q_3 - Q_1$ = "Interquartile Range" or "Fourth Spread."

Example: Calculate $Q_1$ and $Q_3$ for our Spring 1998 M&M data set.

Red M&M's (n = 66): sorted data as listed in Section 3.2.

3.3.3 Boxplots

Box plots are useful in summarizing various aspects of the data. Side-by-side box plots provide useful comparisons of two or more sets of data.
1. Form an axis that includes all possible values of the data.
2. Draw a box extending from $Q_1$ to $Q_3$.
3. Draw a vertical bar at the median.
4. Draw whiskers (horizontal lines) out 1.5 IQR from each end of the box.
5. Indicate mild outliers with a "◦" (1.5 to 3.0 IQR from each end of the box).
6. Indicate extreme outliers with a "∗" (more than 3.0 IQR from each end of the box).

Example: Calculate the summary statistics $\bar{x}$, $\tilde{x}$, $Q_1$, $Q_3$ for the water quality data set. Then construct a box plot.
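The calculations this example calls for can be sketched in Python, following the four-step percentile rule of Section 3.3.2 and the 1.5 IQR and 3.0 IQR cutoffs of Section 3.3.3. The function names are made up for this sketch, and the demonstration uses a small hypothetical sample so the results can be checked by hand; the same functions can be applied to the 50 solids observations.

```python
import math


def percentile(data, p):
    """p'th percentile by the four-step rule (valid for 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)                 # step 1: x_(1) <= ... <= x_(n)
    ip = len(xs) * p / 100            # step 2: i_p = n p / 100
    i = math.floor(ip) + 1            # step 3: first index i with i > i_p (1-based)
    if i - 1 == ip:                   # step 4: average two neighbours when i_p is a whole number
        return (xs[i - 2] + xs[i - 1]) / 2
    return xs[i - 1]


def boxplot_summary(data):
    """Quartiles, IQR, and the cutoffs beyond which points count as mild/extreme outliers."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    return {
        "Q1": q1,
        "median": percentile(data, 50),
        "Q3": q3,
        "IQR": iqr,
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),      # beyond these: at least a mild outlier
        "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),   # beyond these: extreme outlier
    }


# Hypothetical sample, small enough to verify by hand.
example = [1, 2, 3, 4, 5, 6, 7, 8]
summary = boxplot_summary(example)
print(summary)
```

For n = 8 and p = 25, $i_p = 2$ is a whole number, so $Q_1$ averages the 2nd and 3rd sorted values; this is the same averaging that the median rule uses when n is even.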
$\bar{x}$ = ______   Min = ______   $Q_1$ = ______   $\tilde{x}$ = ______   $Q_3$ = ______   Max = ______

[Figure: box plot titled "Particulate Matter"; axis: solids (ppm), ticks at 40, 60, 80]

3.3.4 Categorical Data and Sample Proportions

We cannot calculate the mean and median for categorical data. However, we can calculate a sample proportion:

$\text{Proportion} = p = \dfrac{\text{Count}}{\text{Sample Size}}$

For example: What proportion of M&M's, on average, are red in the Spring 1998 M&M data set?

$\bar{x}_{\text{red}} = 11.06$
$\bar{x}_{\text{total}} = 55.61$

4 Measures of Variability

The mean, median, etc. do not give us a complete overview (summary) of our data.

For example, consider the following three data sets:

  Data   Observations         Range   Variance   Std. Dev.
  1:     20 30 40 50 60 70    50      350        18.71
  2:     20 43 44 46 47 70    50      252        15.87
  3:     40 43 44 46 47 50    10      12         3.46

– The mean and median are 45 for all three data sets.
– These data sets have very different spreads.

Ways to measure spread:

Range: range = maximum observation − minimum observation

Average the Deviations from the Mean: We define the i'th deviation to be $x_i - \bar{x}$.
Intuitive: We average the deviations:

$\frac{1}{n}\sum (x_i - \bar{x})$

Problem: this does not give us anything useful, because the deviations always sum to zero!

Variance: When we average the squared deviations from the mean and divide by $n - 1$ instead of $n$, we get a measure of spread we call the variance:

$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$

Calculation formula:

$s^2 = \frac{1}{n-1}\left(\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right)$

– The population variance is represented as: $\sigma^2$.
– We will learn later why we divide by $n - 1$ instead of $n$.

Standard Deviation: The units of the variance are the units of the data squared. To make the units the same as those of the data set, we take the square root of the variance. This is called the standard deviation:

$s = \sqrt{s^2}$

$s$ is translation invariant: $s(x_1, \ldots, x_n) = s(x_1 + a, \ldots, x_n + a)$ for all $a$.
$s$ is scale equivariant: $s(a x_1, \ldots, a x_n) = |a|\, s(x_1, \ldots, x_n)$ for all $a$.

The population standard deviation is: $\sigma$.

Example: Calculate the range, variance and standard deviation of red M&M's for the Spring 1998 M&M data set.

Remember: $\bar{x} = 11.06$

Red M&M's (n = 66): data as listed in Section 2.4.1.
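The range, variance, and standard deviation formulas of Section 4 can be checked against the three small data sets from the start of that section. The function names below are made up for this sketch; both the definition and the calculation formula for $s^2$ are implemented so they can be compared.

```python
import math


def data_range(xs):
    """Maximum observation minus minimum observation."""
    return max(xs) - min(xs)


def sample_variance(xs):
    """Definition: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Calculation formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def sample_sd(xs):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# The three data sets from the start of Section 4.
data_sets = {
    1: [20, 30, 40, 50, 60, 70],
    2: [20, 43, 44, 46, 47, 70],
    3: [40, 43, 44, 46, 47, 50],
}
for label, xs in data_sets.items():
    print(label, data_range(xs), sample_variance(xs), round(sample_sd(xs), 2))

# Translation invariance and scale equivariance of s, checked on data set 1.
d1 = data_sets[1]
shifted_sd = sample_sd([x + 100 for x in d1])   # same spread as d1
scaled_sd = sample_sd([-2 * x for x in d1])     # |a| = 2 times the spread of d1
```

The calculation formula avoids computing each deviation, which is convenient by hand; in floating-point arithmetic the definition is usually the numerically safer of the two when the values are large relative to their spread.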