Chapter 1: Looking at Data - Distributions I. Introduction (IPS page 4-5) A. Individuals – Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things. B. Variable – A variable is any characteristic of an individual. A variable can take different values for different individuals. C. Categorical Variable – A categorical variable places an individual into one of several groups or categories. D. Quantitative Variable – A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense. E. Distribution – The distribution of a variable tells us what values it takes and how often it takes these variables. II. Displaying Distributions with Graphs (IPS Section 1.1 page 5-38) A. Exploratory Data Analysis – Exploratory data analysis uses graphs and numerical summaries to describe the variables in data set and the relations among them. B. A large number of observations on a single variable can be summarized in a table of frequencies (counts) or relative frequencies (percents or fractions). C. Bar graphs and pie charts display the distributions of categorical variables. These graphs use the frequencies or relative frequencies of the categories. D. Stemplots and histograms display the distributions of quantitative variables. Stemplots separate each observation into a stem and a onedigit leaf. Histograms plot the frequencies or relative frequencies of classes of variables. E. To make a stemplot: 1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit. 2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column. 3. Write each leaf in the row to the right of its stem, in increasing order out from the stem. F. Examining a Distribution – In any graph of data, look for overall pattern and for striking deviations from that pattern. You can describe the overall pattern of a distribution by its shape, center and spread. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern. Always look for outliers and try to explain them. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. G. Some distributions have simple shapes, such as symmetric and skewed. A distribution is symmetric if the values smaller and larger than its midpoint are mirror images of each other. It is skewed to the right if the right tail (larger values) is much longer than the left tail (smaller values). The number of modes (major peaks) is another aspect of overall shape. A distribution with one major peak is called unimodal. Not all distributions have a simple overall shape, especially when there are few observations. H. Time Plot – A time plot of a variable plots each observation against the time at which it was measured. Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale. Connecting the data points by lines helps emphasize any change over time. I. Many interesting data sets are time series, measurements of a variable taken at regular intervals over time. J. Seasonal Variation and Trend – A pattern in a time series that repeats itself at known regular intervals of time is called seasonal variation. A trend in a time series is a persistent, long-term rise or fall. III. Describing Distributions with Numbers (IPS section 1.2 pages 38-63) A. Mean – To find the mean x of a set of observations, add their values and divide by the number of observations. If the n observations are x1 , x2 ,..., xn , their mean is x 1 x1 x2 ... xn x or, in more compact notation, n n x i B. Median – The median M or x is the midpoint of a distribution, the number such that half the observation are smaller and the other half are larger. To find the median of a distribution: 1. Arrange all observations in order of size, from smallest to largest. 2. If the number of observations n is odd, the median is the center observation in the ordered list. Find the location of the median by counting (n+1)/2 observations up from the bottom of the list. 3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list. The location of the median is again (n+1)/2 from the bottom of the list. C. Mean versus Median– The median and mean are the most common measures of the center of a distribution. The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, the mean and the median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. D. The simplest useful numerical description of a distribution consists of both measure of center and a measure of spread. E. We can describe the spread or variability of a distribution by giving several percentiles. The pth percentile of a distribution is the value such that a p percent of the observations fall at or below it. The most commonly used percentiles other than the median are the quartiles. The first quartile is the 25th percentile, and the third quartile is the 75th percentile. The second quartile is the median. F. To Calculate the Quartiles: 1. Arrange the observations in increasing order and locate the median in the ordered list of observations. 2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median. 3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median. G. The Five-Number Summary – The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is Minimum Q1 M (or x ) Q3 Maximum H. Boxplot – A boxplot is a graph of the five-number summary. A central box spans the quartiles Q1 and Q3. A line in the box marks the median. Lines extend from the box out to the smallest and largest observations. I. The Interquartile Range IQR – The interquartile range IQR is the distance between the first and third quartiles IQR Q3 Q1 I. The 1.5 X IQR Criterion for Outliers – Call an observation a suspected outlier if it falls more than 1.5 X IQR above the third quartile or below the first quartile. J. Modified Boxplot – In a modified boxplot, the lines extend out from the central box only to the smallest and largest observations that are not suspected outliers. Observations more than 1.5 X IQR outside the box are plotted as individual points. K. The Variance s2 and the Standard Deviation s – The variance s2 of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations x1 , x2 ,..., xn is ( x1 x ) 2 ( x2 x ) 2 ... ( xn x ) 2 s n 1 2 or, in more compact notation, s2 1 ( xi x ) 2 n 1 Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. The standard deviation s is the square root of the variance s2: s 1 ( xi x )2 n 1 L. Properties of the Standard Deviation – 1. s measures spread about the mean and should be used only when the mean is chosen as the measure of center. 2. s=0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger. 3. s, like the mean, is not resistant. A few outliers can make s very large. M. Resistant Measures – a resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large there changes are. The median and quartiles are resistant, but the mean and the standard deviation are not. The mean and standard deviation are good descriptions for symmetric distributions without outliers. They are most useful for the normal distributions introduced in the following section. The five-number summary is a better exploratory summary for skewed distributions. N. Linear Transformations – A linear transformation changes the original variable x into the new variable xnew given by an equation of the form xnew a bx Adding the constant a shifts all values of x upward or downward by the same amount. In particular, such a shift changes the origin (zero point) of the variable. Multiplying by the positive constant b changes the size of the unit of measurement. Linear transformations do not change the shape of a distribution. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. IV. The Normal Distribution (IPS section 1.3 pages 63-101) A. Mathematical Model – A mathematical model is an idealized description. It gives a compact picture of the overall pattern of the data but ignores minor irregularities as well as any outliers. B. Density Curve – A density curve is a curve that is always on or above the horizontal axis and has area exactly 1 underneath it. A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the relative frequency of all observations that fall in that range. Examples of density curves include the following: Figure 1.22(a) A symmetric density curve with its mean and median marked. Figure 1.22(b) A right-skewed density curve with its mean and median marked. C. Mode – A mode of a distribution described by a density curve is a peak point of the curve, the location where the curve is highest. D. Median – Because areas under the curve represent proportions of the observations, the median is the point with half the total area on each side. E. Quartiles – You can roughly locate the quartiles by dividing the area under the curve into quarter as accurately as possible by eye. F. IQR – The IQR is the distance between the first and third quartiles. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. G. Median and Mean of a Density Curve – The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. The mean of a density curve is the balancing point, at which the curve would balance if made of solid material. The median and mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail. Figure 1.23 The mean of a density curve is the point at which it would balance. H. Normal Distributions – the normal distributions are described by bellshaped, symmetric, unimodal density curves. The mean μ and standard deviation σ completely specify the normal distribution N(μ, σ 2). The mean is the center of symmetry, and σ is the distance from μ to the change-of-curvature points on either side. I. The 68-95-99.7 Rule – In the normal distribution with mean μ and standard deviation σ: Approximately 68% of the observations fall within σ of the mean μ. Approximately 95% of the observations fall within σ of the mean μ. Approximately 99.7% of the observations fall within σ of the mean μ. Figure 1.25 The 68-95-99.7 rule for normal distributions. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. J. Standardizing and z-Scores – If x is an observation from a distribution that has mean μ and standard deviation σ, the standard value of x is z x A standardized value is often called a z-score. K. The Standard Normal Distribution –The standard normal distribution is the normal distribution N(0,1) with mean 0 and standard deviation 1. If a variable X has any normal distribution N(μ, σ2) with mean μ and standard deviation σ, then the standardized variable X Z has the standard normal distribution. L. The Standard Normal Table – Table A in the back of your book is a table of areas under the standard normal curve. The table entry for each value z is the area under the curve to the left of z. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101. The following further illustrate of this concept: Figure 1.27(a) The area under a standard normal curve to the left of the point z = 1.40 is 0.9192. Table A gives areas under the standard normal curve. Figure 1.27(b) Areas under the standard normal curve to the right and left of z = -2.15. Table A gives areas to the left. M. Normal Quantile Plot – The adequacy of a normal model for describing a distribution of data is best assessed by a normal quantile plot, which is available in most statistical packages. N. Use of Normal Quantile Plots – If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are normal. Systematic deviations from a straight line indicate a nonnormal distribution. Outliers appear as points that are far away from the overall pattern of the plot. Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman and Company, New York. 2 -101.