S05 The Shape characteristic of the Distribution of populations arising from mathematical theory, and random samples from any population. (See textbook Chapter 3)

Your textbook, in Sections 3.1 thru 3.3, deals with “sets of numerical data” without worrying about the possible source(s). Let us use a more limited view and say the dataset in question is either a population or a random sample from a population. Chapter 3 gives methods for summarizing data to describe important characteristics of the data's distribution.

What do we mean by “distribution of data”? It is a very general term meaning the pattern of the numbers. (Where are they centered, how do they spread out around the center, what are the quantiles, deciles, etc.?) One important characteristic of the distribution of a dataset is its shape. By shape we mean the way in which proportions of the numbers in the dataset change as we move across subintervals of that interval of the real line which contains all numbers in the dataset.

Example: A tiny population.

Consider a dataset which is (say) the population consisting of elements {1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5}. The proportions of unique values are

A tiny population - Distribution

Value         1      2      3      4      5
Proportion    3/20   5/20   6/20   4/20   2/20

[Data table "A tiny population": column "Population elements", rows 1 through 20, listing the 20 elements above in order.]

[Distributions report for "Population elements": histogram, axis ticks 0 through 6.]

Moments

Mean              2.85
Std Dev           1.2258187
Std Err Mean      0.2741014
Upper 95% Mean    3.4237008
Lower 95% Mean    2.2762992
N                 20

A histogram of relative frequencies (proportions) in this case looks as

[Histogram of relative frequencies: values 1 through 5 on one axis, proportions 1/20 through 6/20 on the other.]

The configuration across the tops of the bars shows the shape of this distribution. Note that every population whose unique elements are 1, 2, 3, 4 and 5, occurring with proportions 3/20, 5/20, 6/20, 4/20, and 2/20 will have this shape regardless of population size.
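As a small illustration of the tiny population example, the following Python sketch tallies the unique values and expresses each count as a proportion of the population size. The helper name `shape_of` is hypothetical, introduced only for this sketch. It also checks the remark about population size by building an assumed larger population (every element repeated five times) and confirming that the proportions do not change.

```python
from collections import Counter
from fractions import Fraction

def shape_of(population):
    """Return {unique value: proportion of the population equal to it}."""
    counts = Counter(population)
    size = len(population)
    return {value: Fraction(count, size) for value, count in sorted(counts.items())}

tiny = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5]
tiny_shape = shape_of(tiny)

# An assumed population five times as large with the same mix of values,
# built only to illustrate that shape does not depend on population size.
bigger = tiny * 5
bigger_shape = shape_of(bigger)

for value, proportion in tiny_shape.items():
    # Show each proportion over the common denominator 20, as in the text.
    print(value, f"{proportion * 20}/20", float(proportion))

print("same shape at both sizes:", tiny_shape == bigger_shape)
```

Exact fractions are used instead of floating-point numbers so that the comparison between the two populations is an exact equality rather than an approximate one.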
Some other graphs describing various common shapes of distributions, and the name of each, are given in Figure 3.7 on page 73 of your textbook. Let us next review some of the theoretical families of populations and look at the shapes of their distributions as examples of what population and random sample shapes may look like.

Example: The Binomial, a family of populations arising in mathematical theory.

Consider a collection of populations indexed by a positive integer n and a real number 0 < p < 1. These were also mentioned in a previous handout. Thus, the pair (n, p), for specific n and p, identifies what we will call a Binomial (n, p) population. The unique elements in such a population are the integers 0, 1, 2, ..., n. These occur in proportions which are given by the value of the density function f(x) defined as

\[
f(x) =
\begin{cases}
\dfrac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x} & \text{for any } x = 0, 1, 2, \ldots, n \\[1ex]
0 & \text{otherwise}
\end{cases}
\]

(Recall that $n! = 1 \cdot 2 \cdot 3 \cdot 4 \cdots n$ and $0! \equiv 1$.)

The shapes of the distributions of Binomial (5, 0.2), (5, 0.5), (5, 0.8), and (10, 0.2) are shown by histograms in Figure 5.3 on page 233. (Don't worry about Chapter 5 now.) Note that population size is not a consideration.

For the case of Binomial (2, 0.66), the proportions are

\[
\begin{array}{c|l}
x & f(x) \\ \hline
0 & \dfrac{2!}{0!\,2!}\,(0.66)^{0}(0.34)^{2} = 0.1156 \\[1ex]
1 & \dfrac{2!}{1!\,1!}\,(0.66)^{1}(0.34)^{1} = 0.4488 \\[1ex]
2 & \dfrac{2!}{2!\,0!}\,(0.66)^{2}(0.34)^{0} = 0.4356
\end{array}
\]

Also, the cumulative distribution function value F(1) = f(0) + f(1) = 0.1156 + 0.4488 = 0.5644, for example, is the proportion in the interval [0, 1].

The Binomial (n, p) is what we call a family of populations. Each population, of course, has its own distribution. These populations arise from mathematical theory, and so their distributions are called theoretical distributions. Certainly they are figments of our imagination. These are examples of Discrete Populations.

Example: The standard Exponential population arising in theory.

Populations like this one require some imagination to envision.
The unique elements in this population are the totality of positive real numbers, and each number occurs one or more times in the population. The proportion of population elements in any given interval (a, b) is found by integrating the following density function h(x) over (a, b).

\[
h(x) =
\begin{cases}
e^{-x} & x > 0 \\
0 & \text{otherwise}
\end{cases}
\]

The cumulative distribution function is $F(x) = \int_{0}^{x} e^{-y}\,dy = 1 - e^{-x}$ for x > 0. Thus the proportion of population elements in the interval [2, 3] is $F(3) - F(2) = \int_{2}^{3} e^{-y}\,dy = e^{-2} - e^{-3} \approx 0.0855$, and in (0, ∞) is $\int_{0}^{\infty} e^{-y}\,dy = 1$.

Since the unique elements in this population form a continuum over (0, ∞), the shape of the distribution can be shown as the graph of h(x) over (0, ∞), which is the continuous curve shown on page 258, Figure 5.14, with α = 1. Think of this as the limiting histogram as the interval width of the histogram rectangles goes to zero. This is a Continuous Population. It is, of course, one member of the Exponential E(α) family of populations.

Example: The Standard Normal population, a Continuous Population arising in theory.

The unique elements in this theoretical population are all the real numbers. Each occurs one or more times. The proportion of population elements in any given interval (a, b) is the integral of the following density function g(x) over (a, b).

\[
g(x) = \frac{e^{-x^{2}/2}}{\sqrt{2\pi}}, \qquad -\infty < x < \infty
\]

The Cumulative Distribution Function (CDF) is $F(x) = \int_{-\infty}^{x} g(y)\,dy$.

The shape of the standard Normal distribution is the bell shape. The graphs in Figure 5.11 on page 254 show this shape. These are, of course, graphs of g(x). Note the symmetry of the distribution about zero. Thus half of the population elements are less than zero and half are greater than zero. The integral of g(x) does not have closed form. Thus the integral of g(x) over any finite interval must be approximated numerically. This is difficult and we won't attempt it. Table 3.10 on page 89 can be used to find approximate population proportions over some intervals.
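For readers curious what a numerical approximation of these integrals might look like, here is a minimal Python sketch using the midpoint rule, which adds up the areas of many narrow rectangles under the density curve. The helper names (`midpoint_integral`, `h`, `g`, `normal_cdf`) and the choice of 10,000 rectangles are assumptions made for illustration; the textbook tables and JMP may well use other methods. The Exponential result is compared with the exact value e^(-2) − e^(-3), and the Normal result with a CDF written in terms of the error function `math.erf` from the Python standard library.

```python
import math

def h(x):
    """Standard Exponential density."""
    return math.exp(-x) if x > 0 else 0.0

def g(x):
    """Standard Normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def midpoint_integral(density, a, b, rectangles=10_000):
    """Approximate the integral of density over (a, b) by the midpoint rule."""
    width = (b - a) / rectangles
    return width * sum(density(a + (i + 0.5) * width) for i in range(rectangles))

def normal_cdf(x):
    """Standard Normal CDF expressed through the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Proportion of the standard Exponential population in [2, 3].
exp_approx = midpoint_integral(h, 2, 3)
exp_exact = math.exp(-2) - math.exp(-3)

# Proportion of the standard Normal population in (-0.39, -0.08).
norm_approx = midpoint_integral(g, -0.39, -0.08)
norm_check = normal_cdf(-0.08) - normal_cdf(-0.39)

print(round(exp_approx, 4), round(exp_exact, 4))
print(round(norm_approx, 4), round(norm_check, 4))
```

Raising `rectangles` narrows each rectangle, which is the same limiting-histogram idea described for the Exponential population.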
Using the table we can find, for example,

Interval            Proportion
(−∞, −1.88)         0.03
(−∞, −0.39)         0.35
(−∞, −0.08)         0.47
(−∞, 1.65)          0.95
(−∞, 1.88)          0.97
(−0.39, −0.08)      0.47 − 0.35 = 0.12

JMP will evaluate g(x) and the CDF. Table B.3 on pages 788 and 789 inverts Table 3.10 by giving the proportion in the body of the table and the interval endpoint in the margin.

Why do we care about theoretical populations?

Mathematical theory provides facts and figures about theoretical populations: things like the population mean, variance, quantiles, deciles, shape of distribution, etc. Basically, we can assume that anything we might wish to know about the characteristics of a population is known about each theoretical population. Now, in a real world situation, wherein we imagine a population of interest (like thrust face runouts), if evidence points to the distribution of our population being similar to that of theoretical population Z (say), then the characteristics of our population of interest will be similar to the known characteristics of population Z.

How can we acquire evidence about the distribution of elements in our population of interest? The answer is to take a random sample of population elements and look at the distribution (especially the shape). If it approximates that of population Z, then this is evidence that the distribution of our population might be similar to that of population Z. Chapter 3 in your textbook discusses ways to get evidence about the distribution of elements in a random sample (or any dataset we might have).
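The idea of comparing a random sample's shape with a theoretical population's shape can be sketched in Python. The example below is only an illustration under stated assumptions: it takes the Binomial (2, 0.66) population from earlier in this handout as a stand-in for "population Z", uses a sample size of 10,000, and fixes an arbitrary seed so the run is repeatable. The names `binomial_f` and `sample_proportions` are invented here. Each sampled value x is drawn with probability f(x), which mimics picking elements at random from a very large population with those proportions.

```python
import math
import random
from collections import Counter

def binomial_f(x, n, p):
    """Binomial (n, p) density: proportion of population elements equal to x."""
    if isinstance(x, int) and 0 <= x <= n:
        return math.comb(n, x) * p**x * (1 - p) ** (n - x)
    return 0.0

n, p = 2, 0.66
values = list(range(n + 1))
theory = {x: binomial_f(x, n, p) for x in values}

rng = random.Random(2024)   # arbitrary fixed seed (assumption, for repeatability)
sample_size = 10_000        # assumed sample size
sample = rng.choices(values, weights=[theory[x] for x in values], k=sample_size)

counts = Counter(sample)
sample_proportions = {x: counts[x] / sample_size for x in values}

# Theoretical proportion next to sample proportion for each unique value.
for x in values:
    print(x, round(theory[x], 4), round(sample_proportions[x], 4))
```

If the two columns printed for each value are close, the sample's shape approximates the population's shape, which is the kind of evidence described above. With a much smaller `sample_size`, the agreement would typically be rougher.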