STAT 211 Handout 1 (Chapter 1)
Overview and Descriptive Statistics

Statistics: The branch of scientific inquiry that provides methods for organizing and summarizing data and for using information in the data to draw various conclusions.

Population: All individuals or objects of a particular type. We will denote the population size by N.

Sample: A portion or subset of the population. We will denote the sample size by n.

Variable: Any characteristic whose value may change from one object to another in the population.

Example 1:
Population: engineering students at Texas A&M University
Sample: graduating engineering students at Texas A&M University
Variable: gender of graduating engineering students at Texas A&M University, or GPA of graduating engineering students at Texas A&M University

Question: Is the sample or the variable unique for the same population?

Question:
Population: all daily newspapers published in the United States
Sample:
Variable:

Question: Published papers propose that consumption of vitamin A prevents cancer. How would you show whether or not their proposal is supported by the data?

Question: Studies show that smoking cigarettes causes cancer and yellow fingers. How would you show whether the data support this?

Descriptive statistics: Organizing and summarizing the data.

Inferential statistics: Drawing conclusions about the population based on sample information.

DATA
   Univariate
      Qualitative: Categorical
      Quantitative: Numerical
         Discrete
         Continuous

Example 2: Identify the following as categorical or numeric (if numeric, discrete or continuous): color of eyes, number of students who play baseball in different schools, price of your textbook, type of car each student drives, your height (in inches) or weight (in pounds), zip code, actual weight of tea leaves in a 1-lb package, number of customers waiting in different banks.
   Bivariate: observations on two variables
   Multivariate: observations on more than two variables

Tabular and Pictorial Methods for Describing Data

Given a data set of n observations on some variable X, the individual observations are $x_1, x_2, \ldots, x_n$. The ordered observations (if numeric, from smallest to largest) are written $x_{(i)}$, $i = 1, 2, \ldots, n$, where $x_{(i)}$ is the ith ordered value. n is the sample size and N is the population size.

Stem-and-Leaf Display

Stem-and-leaf plots are an easy way to display numeric data. An advantage of this type of plot is that you can still see the actual data values.

How to make a stem-and-leaf display:
1. Look at the range of your data.
2. Choose your stem: this is the leading digit(s), usually the 1's, 10's, 100's, etc. place.
3. Add your leaf: this is the trailing digit(s). Some plot just the next digit while others may plot the next few digits.

Example 3 (Exercise 1.14): The data set consists of observations on shower-flow rate, X (L/min), for 129 houses in Perth, Australia. The unordered data ($x_1 = 4.6$, $x_2 = 12.3$, $x_3 = 7.1$, ..., $x_{127} = 6.3$, $x_{128} = 3.8$, $x_{129} = 6.0$) are listed in the textbook. The data range from 2.2 to 18.9. Thus I will use the digits before the decimal point (the 1's and 10's places) as my stem and attach the leaf, which is the next digit (the tenths place). The following Minitab output summarizes the final result for the complete data. The first column is the cumulative count from the nearer end of the data; the count in parentheses is the number of leaves on the row containing the median.

Stem-and-leaf of Rate    n = 129
Leaf Unit = 0.10

    2    2  23
   12    3  2344567789
   20    4  01356889
   37    5  00001114455666789
   62    6  0000122223344456667789999
  (17)   7  00012233455555668
   50    8  02233448
   42    9  012233335666788
   27   10  2344455688
   17   11  2335999
   10   12  37
    8   13  8
    7   14  36
    5   15  0035
    1   16
    1   17
    1   18  9

The stem-and-leaf display for the 6 data points (4.6, 12.3, 7.1, 6.3, 3.8, 6.0) selected from the 129 observations is tabulated as follows:

    3 | 8
    4 | 6
    5 |
    6 | 03
    7 | 1
    8 |
    9 |
   10 |
   11 |
   12 | 3

Example 4 (Exercise 9.23): Fusible interlinings are being used with increasing frequency to support outer fabrics and improve the shape and drape of various pieces of clothing.
The data on extensibility (%) at 100 gm/cm for both high-quality fabric (H) and poor-quality fabric (P) specimens are as follows.

H:  1.2  0.9  0.7  1.0  1.7  1.7  1.1  0.9  1.7  1.9  1.3  2.1
    1.6  1.8  1.4  1.3  1.9  1.6  0.8  2.0  1.7  1.6  2.3  2.0
P:  1.6  1.5  1.1  2.1  1.5  1.3  1.0  2.6

The following Minitab output summarizes the final result for the complete data using different appearances of stem-and-leaf displays.

Stem-and-leaf of H:    n = 24
Leaf Unit = 0.10

    1    0  7
    4    0  899
    6    1  01
    9    1  233
   10    1  4
   (7)   1  6667777
    7    1  899
    4    2  001
    1    2  3

Stem-and-leaf of P:    n = 8
Leaf Unit = 0.10

    2    1  01
    3    1  3
   (2)   1  55
    3    1  6
    2    1
    2    2  1
    1    2
    1    2
    1    2  6

Stem-and-leaf of H:    n = 24
Leaf Unit = 0.10

    4    0  7899
   10    1  012334
  (10)   1  6667777899
    4    2  0013

Stem-and-leaf of P:    n = 8
Leaf Unit = 0.10

    3    1  013
   (3)   1  556
    2    2  1
    1    2  6

Frequency Distributions

For numeric continuous data, create class intervals and count the number of observations that fall into each interval. For numeric discrete data, list the distinct values and count how many times each occurs. This count is called the frequency. Relative frequencies are obtained by dividing each frequency by the total number of observations; a relative frequency is the fraction or proportion of the time the interval is observed (the value occurs). For categorical data, the frequency is the number of observations that fall into each category.

Example 5 (Exercise 1.21): The number of intersections, Z, is listed as one of the characteristics of subdivisions.

 z   Count   Relative frequency   Cumulative relative frequency
 0    13     13/47 = 0.2766       13/47 = 0.2766
 1    11     11/47 = 0.2340       24/47 = 0.5106
 2     3      3/47 = 0.0638       27/47 = 0.5745
 3     7      7/47 = 0.1489       34/47 = 0.7234
 4     5      5/47 = 0.1064       39/47 = 0.8298
 5     3      3/47 = 0.0638       42/47 = 0.8936
 6     3      3/47 = 0.0638       45/47 = 0.9574
 7     0      0/47 = 0            45/47 = 0.9574
 8     2      2/47 = 0.0426       47/47 = 1
     n = 47

What percentage of these subdivisions had at most 3 intersections?
What percentage of these subdivisions had fewer than 3 intersections?
What percentage of these subdivisions had between 2 and 5 (inclusive) intersections?
What percentage of these subdivisions had fewer than 2 or more than 5 intersections?
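The relative and cumulative relative frequency columns of the Example 5 table can be rebuilt from the counts alone. The following is a minimal Python sketch of that arithmetic; the variable names (`counts`, `rows`) are illustrative choices and not part of the handout. Questions of the "at most k" type are read from the cumulative column at z = k, and other ranges follow by adding or subtracting entries.

```python
# Counts of subdivisions by number of intersections z (Example 5).
counts = {0: 13, 1: 11, 2: 3, 3: 7, 4: 5, 5: 3, 6: 3, 7: 0, 8: 2}

n = sum(counts.values())  # total number of observations

rows = []        # (z, count, relative frequency, cumulative relative frequency)
running = 0      # running total of counts, used for the cumulative column
for z in sorted(counts):
    running += counts[z]
    rows.append((z, counts[z], counts[z] / n, running / n))

print(f"{'z':>2} {'count':>6} {'rel.freq':>9} {'cum.rel.freq':>13}")
for z, count, rel, cum_rel in rows:
    print(f"{z:>2} {count:>6} {rel:>9.4f} {cum_rel:>13.4f}")
print(f"n = {n}")
```

Keeping a running total of the counts, rather than summing rounded relative frequencies, avoids accumulating rounding error in the cumulative column.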
Histogram

A pictorial representation of a frequency distribution can be obtained by constructing a histogram. The histogram is a much better way of visualizing a data set than the stem-and-leaf display. The following is the histogram for Example 5.

[Figure: histogram of z (number of intersections, 0 to 8); vertical axis: Frequency]

How to construct a histogram for continuous data:
a) Divide the range of observations into intervals (plotted on the x axis).
b) Count the number of observations that fall in each interval: this is the frequency.
c) Compute the relative frequency = (# in interval) / (size of data, n), the proportion of the data that falls into the interval.
d) Plot a rectangle above each interval whose height is proportional to the relative frequency or frequency.

If the intervals for the continuous data do not all have the same width, density (relative frequency / interval width) is a better measure to use for the histogram.

The following is the histogram and the frequency distribution for Example 3 (flow rate).

[Figure: histogram of Rate (0 to 20); vertical axis: Frequency (0 to 40)]

 Rate      Count   Relative    Cumulative relative    Density
                   frequency   frequency
 [1,3)       2      2/129        2/129 = 0.0155        1/129
 [3,5)      18     18/129       20/129 = 0.1550        9/129
 [5,7)      42     42/129       62/129 = 0.4806       21/129
 [7,9)      25     25/129       87/129 = 0.6744       12.5/129
 [9,11)     25     25/129      112/129 = 0.8682       12.5/129
 [11,13)     9      9/129      121/129 = 0.9380        4.5/129
 [13,15)     3      3/129      124/129 = 0.9612        1.5/129
 [15,17)     4      4/129      128/129 = 0.9922        2/129
 [17,19)     1      1/129      129/129 = 1             0.5/129

Rule of thumb: $\text{number of classes} \approx \sqrt{\text{number of observations}}$

What to Look For in Your Graph (use with stem-and-leaf displays and histograms):
1. The center of the distribution.
2. The overall shape of the distribution.
   Unimodal
      Symmetric: portions on each side of the center value are mirror images of each other.
      Skewed left (negatively skewed): the left tail (lower values) is stretched out longer than the right tail (higher values).
      Skewed right (positively skewed): the right tail (higher values) is stretched out longer than the left tail (lower values).
      Thus, whichever direction the tail of the curve is pulled, that is the direction in which it is skewed.
   Bimodal
   Multimodal
3. Marked deviations from the overall shape of the distribution.
   Outliers: individual observations that fall well outside the overall pattern of the graph.
   Gaps in the distribution.

For the intersections data, we see that the center of the distribution of intersections is in the 2's. The graph is skewed to the right with one major distinct peak (unimodal). The observations at 8 intersections may be outliers. Note one major gap.

For the flow rate data, we see that the center of the distribution of flow rates is in the 7's. The graph is skewed to the right with one major distinct peak (unimodal). There are no major gaps or outliers.

For categorical data (for example, the cars students drive), we can count the number of students in each category of car. These counts go on the vertical axis and the categories on the horizontal axis. Placing a bar as high as the frequency above each category creates the histogram.

Measures of location:

The sample mean, $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, is the arithmetic average.

The population mean is $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$.

There is only one mean for a quantitative data set. Its value is influenced by extreme measurements. Note that the sample mean is a statistic, whereas the population mean is a parameter.

The sample median, $\tilde{x}$, is the middle value when the measurements are arranged from lowest to highest. If n is odd, the median is the observation for which exactly (n-1)/2 values are greater than and (n-1)/2 values are less than the median.
If n is even, the median is the average of the two middle values, and n/2 values are greater than and n/2 values are less than the median. There is only one median for a quantitative data set, and its value is not likely to be influenced by a few extreme measurements.

The mode is the most frequently occurring value. This measure may not be unique, in that two (or more) values may occur with the same greatest frequency, so there can be more than one mode for a data set. It is applicable to both quantitative and qualitative data. Its value is not likely to be influenced by a few extreme measurements.

Note that, as a rule of thumb, the distribution is negatively skewed if mean < median, positively skewed if median < mean, and symmetric if mean = median.

Quartiles divide the data set into four equal parts.

Lower Quartile ($Q_1$): The value that separates the smallest 25% of the data from the rest. It can be computed as the median of the smallest n/2 observations if n is even and the median of the smallest (n+1)/2 observations if n is odd. Your textbook calls this the lower fourth.

Upper Quartile ($Q_3$): The value that separates the smallest 75% of the data from the rest. It can be computed as the median of the largest n/2 observations if n is even and the median of the largest (n+1)/2 observations if n is odd. Your textbook calls this the upper fourth.

Percentiles divide the data set into 100 equal parts. The pth percentile is the value in the data set such that p% of the observations are equal to or less than it.

To calculate the pth percentile, $x_{[p]}$:
- order the data from smallest to largest
- let $i_p = np/100$
- find the smallest integer index $i$ such that $i > i_p$
- the pth percentile is

$$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2}, & \text{if } i - 1 = i_p \\[1ex] x_{(i)}, & \text{otherwise} \end{cases}$$

Trimmed mean is a compromise between the mean and the median. A 5% trimmed mean is computed by eliminating the smallest 5% and the largest 5% of the sample and averaging what is left over.

Sample proportion is the number of successes divided by the total number of observations.
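The percentile rule and the trimmed mean above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `percentile` and `trimmed_mean` are hypothetical, the percentile is restricted to 0 < p < 100, and the number of observations trimmed from each end is assumed to be n·percent/100 rounded to the nearest whole number (the handout does not say how a fractional count is handled). The data are the fabric specimens H and P from Example 4.

```python
import math


def percentile(data, p):
    """p-th percentile (0 < p < 100) using the handout's ordered-index rule."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    x = sorted(data)                  # x[0] is x_(1), ..., x[n-1] is x_(n)
    n = len(x)
    i_p = n * p / 100
    i = math.floor(i_p) + 1           # smallest integer index with i > i_p
    if i - 1 == i_p:                  # i_p is a whole number: average neighbours
        return (x[i - 2] + x[i - 1]) / 2
    return x[i - 1]


def trimmed_mean(data, percent=5):
    """Mean after dropping `percent`% of the observations from each end."""
    x = sorted(data)
    n = len(x)
    # Assumption: a fractional trim count is rounded to the nearest integer.
    k = math.floor(n * percent / 100 + 0.5)
    kept = x[k:n - k]
    return sum(kept) / len(kept)


H = [1.2, 0.9, 0.7, 1.0, 1.7, 1.7, 1.1, 0.9, 1.7, 1.9, 1.3, 2.1,
     1.6, 1.8, 1.4, 1.3, 1.9, 1.6, 0.8, 2.0, 1.7, 1.6, 2.3, 2.0]
P = [1.6, 1.5, 1.1, 2.1, 1.5, 1.3, 1.0, 2.6]

for name, sample in (("H", H), ("P", P)):
    q1, med, q3 = (percentile(sample, p) for p in (25, 50, 75))
    print(f"{name}: Q1={q1:.3f} median={med:.3f} Q3={q3:.3f} "
          f"5% trimmed mean={trimmed_mean(sample):.4f}")
```

With this rounding assumption, one observation is trimmed from each end of H and none from P. Statistical software may use a different interpolation convention for quartiles, so quartiles from this rule need not match software output exactly even when the median does.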
Measures of variability:

The sample range measures the distance between the largest and smallest observations: $R = x_{(n)} - x_{(1)}$. It is sensitive to outliers and provides no information on patterns of variability.

Interquartile range (IQR = Upper Quartile - Lower Quartile = $Q_3 - Q_1$): It is the range of the middle half of the distribution. Your textbook calls it the fourth spread. It is not sensitive to outliers.

Deviations from the mean: $x_i - \bar{x}$, where $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.

The sample variance and standard deviation are

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1} \quad \text{and} \quad s = \sqrt{s^2},$$

respectively, for a given sample size n, whereas the population variance and standard deviation are

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \quad \text{and} \quad \sigma = \sqrt{\sigma^2},$$

respectively, for a given population size N. The standard deviation is the most commonly used measure of variability and is sensitive to outliers. Note that the sample variance and standard deviation are statistics, whereas the population variance and standard deviation are parameters.

Coefficient of variation (CV): Unit-free variation (the amount of variability relative to the value of the mean), whereas the variance and standard deviation measure variability in a way that depends on the units of measurement. $CV = 100\,(s / \bar{x})$.

Example 6: When the heights of students (in inches) and their weights (in pounds) are recorded, the CV can be used to decide which data set has more variation.

The following Minitab output summarizes some of the measures of location and variability for the flow rate data.

Variable     n     Mean    Median   TrMean   StDev   SE Mean
Rate       129    7.708     7.000    7.540   3.077     0.271

Variable   Minimum   Maximum      Q1       Q3
Rate         2.200    18.900   5.600    9.600

The following Minitab output summarizes some of the measures of location and variability for the extensibility of high-quality versus low-quality fabric.
Variable     n      Mean    Median    TrMean    StDev   SE Mean
H:          24    1.5083    1.6000    1.5091   0.4442    0.0907
P:           8    1.588     1.500     1.588    0.530     0.188

Variable   Minimum   Maximum       Q1       Q3
H:          0.7000    2.3000   1.1250   1.8750
P:          1.000     2.600    1.150    1.975

The following Minitab output summarizes some of the measures of location and variability for the intersections data.

Variable     n     Mean    Median   TrMean   StDev   SE Mean
z:          47    2.277     1.000    2.116   2.253     0.329

Variable   Minimum   Maximum      Q1       Q3
z:           0.000     8.000   0.000    4.000

Example 7:
- Suppose X is a random variable with the values -100, -50, 0, 50, 100. Find some of the measures of location and variability.
- Suppose X is a random variable with the values -200, -100, -50, 0, 50, 100. Find some of the measures of location and variability.
- If the sample mean is 50 for 10 observations and the 11th observation is 50, what would be the sample mean of the 11 observations?
- If the sample mean and variance are 50 and 3.25 for 10 observations and the 11th observation is 50, what would be the sample variance of the 11 observations?
- If the deviations from the mean for 5 observations are -0.3, 0.1, 2, 1.4, -1.7, what would be the sum of the remaining 5 deviations from the mean, given that the data set has 10 observations?

Question: Let c be a constant and X and Y be random variables. How would the mean and variance change if you
- add the same constant to each observation ($y_i = x_i + c$, $i = 1, 2, \ldots, n$)?
- multiply each observation by the same constant ($y_i = c\,x_i$, $i = 1, 2, \ldots, n$)?

Boxplots

Boxplots are formed using what is called the five-number summary:
1. minimum
2. first (lower) quartile, 25th percentile, $Q_1$
3. median, 50th percentile, $Q_2$
4. third (upper) quartile, 75th percentile, $Q_3$
5. maximum

Boxplots are ideal for comparing two populations (samples) when measuring a continuous random variable.
1. The ends of the box are at the quartiles. The length of the box is $Q_3 - Q_1$. This box contains 50% of the data values.
2. The median is marked by a line within the box.
3.
The two vertical lines (called whiskers) outside the box extend to the smallest and largest observations within 1.5 x IQR of the edges of the box.
4. Observations outside of these whiskers (that is, farther than 1.5 x IQR beyond an edge of the box) are called outliers.

In general, an outlier is an observation that is much larger or smaller than the rest of the data. Observations that fall between 1.5 IQR and 3 IQR from the edge to which they are closest are called mild outliers. Observations that fall more than 3 IQR from the edge to which they are closest are called extreme outliers.

Comparative boxplot for the extensibility of high-quality versus low-quality fabric:

[Figure: comparative boxplots of Extensibility (0.5 to 2.5) for H and P]

Boxplot for flow rate:

[Figure: boxplot of Rate (0 to 20)]

Boxplot for intersections:

[Figure: boxplot of z (-2 to 10)]

For the boxplots above:

         1.5·IQR                       Q1 - 1.5·IQR   Q3 + 1.5·IQR
 H       1.5(1.875 - 1.125) = 1.125         0              3
 P       1.5(1.975 - 1.15) = 1.2375        -0.0875         3.2125
 Rate    1.5(9.6 - 5.6) = 6                -0.4           15.6
 Z       1.5(4 - 0) = 6                    -6             10

INTERPRETING BOXPLOTS

Note the position of the median. A median that is not in the middle of the box can indicate skewness in the middle 50% of the data as well as in the whole data set. Recall that the mean gets drawn in the direction of the skewness. Thus the box will be a lot longer in the direction of the skewness.

Note the length of the whiskers and the outliers. If the data are symmetric, the whiskers will be of approximately equal length.

Coefficient of Skewness (SK): The direction in which, and the degree to which, a frequency distribution is skewed (SK < 0 negatively skewed, SK = 0 symmetric, SK > 0 positively skewed).

Example 8: A computer scientist is investigating the usefulness of two different design languages in improving programming tasks. Twelve expert programmers, familiar with both languages, are asked to code a standard function in both languages, and the time in minutes is recorded.
Programmer           1   2   3   4   5   6   7   8   9  10  11  12
Design language 1   17  16  21  14  18  24  16  14  21  23  13  18
Design language 2   18  14  19  11  23  21  10  13  19  24  15  20

The following are the descriptive statistics and the comparative boxplots obtained by MINITAB.

Variable     n     Mean   Median   TrMean   StDev   SE Mean
Design1     12    17.92    17.50    17.80    3.63      1.05
Design2     12    17.25    18.50    17.30    4.59      1.33

Variable   Minimum   Maximum      Q1      Q3
Design1      13.00     24.00   14.50   21.00
Design2      10.00     24.00   13.25   20.75

[Figure: comparative boxplots of DESIGN1 and DESIGN2; vertical axis from 8 to 26]

Example 9 (Exercise 1.60): Observations on burst strength (lb/in^2) were obtained both for test nozzle closure welds and for production canister nozzle welds. The following are the descriptive statistics and the comparative boxplots obtained by MINITAB.

Variable      n   n*     Mean   Median   TrMean   StDev   SE Mean
Test         11    1     7355     7300     7389     614       185
Cannister    12    0   5887.5   5887.5   5880.0   317.9      91.8

Variable    Minimum   Maximum       Q1       Q3
Test           6100      8300     7200     8000
Cannister    5250.0    6600.0   5725.0   6037.5

[Figure: comparative boxplots of Strength (5000 to 8000) for Test and Cannister]
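As a cross-check on the Example 8 output, the measures of variability defined in this handout can be recomputed from the raw times. The sketch below is illustrative only: the function names are hypothetical, SE Mean is taken to be $s/\sqrt{n}$ (an assumption, since the handout reports it but does not define it), and the boxplot limits use the Q1 and Q3 values reported by MINITAB rather than recomputing quartiles.

```python
import math

# Coding times in minutes for the twelve programmers (Example 8).
design1 = [17, 16, 21, 14, 18, 24, 16, 14, 21, 23, 13, 18]
design2 = [18, 14, 19, 11, 23, 21, 10, 13, 19, 24, 15, 20]


def sample_variance(x):
    """Definition form: sum of squared deviations from the mean over n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)


def sample_variance_shortcut(x):
    """Computing form: (sum of x_i^2 - (sum of x_i)^2 / n) over n - 1."""
    n = len(x)
    return (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)


def summarize(x):
    """Mean, standard deviation, SE of the mean (assumed s/sqrt(n)), and CV."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sample_variance(x))
    return {"mean": mean, "s": s, "se_mean": s / math.sqrt(n),
            "cv": 100 * s / mean}


def fences(q1, q3):
    """Lower and upper 1.5*IQR limits used to flag outliers on a boxplot."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


# Q1 and Q3 are copied from the MINITAB output, not recomputed here.
for name, data, q1, q3 in (("Design1", design1, 14.50, 21.00),
                           ("Design2", design2, 13.25, 20.75)):
    stats = summarize(data)
    low, high = fences(q1, q3)
    outliers = [v for v in data if v < low or v > high]
    print(f"{name}: mean={stats['mean']:.2f} s={stats['s']:.2f} "
          f"SE={stats['se_mean']:.2f} CV={stats['cv']:.1f}% "
          f"limits=({low}, {high}) outliers={outliers}")
```

Both variance formulas give the same value up to floating-point rounding; the shortcut form is convenient for hand calculation, while the definition form is usually preferred in software because it is less prone to cancellation error.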