S05 Statistics 305

Parameters Used to Describe Characteristics of Population Elements

Every population consists of a set of numbers. To describe the characteristics of a population, we use scalar numerical summary measures called parameters. The mean of the population elements is the parameter named the population mean. A measure of the amount of variability in the population elements is the parameter named the population variance.

A number Q(p) such that the proportion p (for any 0 < p < 1) of population elements is less than or equal to Q(p) and the proportion 1 − p is greater than or equal to Q(p) is a parameter called the p quantile of the population. Thus, for example, Q(1/2) is the number such that half of the population elements are less than or equal to Q(1/2) and half are greater than or equal to Q(1/2). One-half is a rather special proportion, and the name Median is given to this parameter. Other special proportions, tenths and quarters, also have names: the first, second, ..., ninth Deciles are the quantiles Q(0.1), Q(0.2), ..., Q(0.9), and the first, second and third Quartiles are Q(0.25), Q(0.5) and Q(0.75). Of course, the second quartile is also the Median.

Quantiles in theoretical continuous populations are found, for any given p, by solving the equation

$$F(Q(p)) = p \qquad (1)$$

for Q(p), where F(x) is the cumulative distribution function of the population. For example, the p = 0.5 quantile in the Exponential E(1) population is computed by solving

$$\int_0^{Q(0.5)} e^{-y}\,dy = 0.5 \quad\text{or}\quad 1 - e^{-Q(0.5)} = 0.5 .$$

Thus Q(0.5) = −ln(0.5) = ln 2 ≈ 0.693 is the Median of the E(1) population.

Quantiles in discrete populations are a nuisance to deal with. To handle them we normally modify the definition of Q(p) to say "the proportion 1 − p is greater than Q(p)" rather than "greater than or equal to Q(p)". Consider, for example, the Binomial B(3, 0.1) population considered in a previous handout.
The cumulative distribution function values at unique population elements are the following:

    x    F(x)
    0    0.729
    1    0.729 + 0.243 = 0.972
    2    0.729 + 0.243 + 0.027 = 0.999
    3    1.0

By the modified definition, the p = 0.729 quantile, Q(0.729), is zero. The p = 0.972 quantile is Q(0.972) = 1. In other words, the proportion 0.972 of population elements is less than or equal to 1 (i.e., they are either 0 or 1). But what about the p = 0.96 quantile or the p = 0.97 quantile? A correct but not very pleasing answer, which satisfies the definition of quantile, will be found by interpolating between x = 0 and x = 1. So we might find, for example (maybe), that Q(0.96) = 0.94. Obviously, x = 0.94 is not a population element. Note also that x = 0.94 satisfies the definition as a p = 0.729 quantile, as do other x values (hence the nuisance of dealing with discrete population quantiles).

Our solution for discrete populations will be to ignore p values for which the solution Q(p) of Equation (1) is not a population element. For the remaining p values, the solution to Equation (1) will be taken as the p quantile.

When we consider computing quantiles in finite datasets (not theoretical populations), such as random samples from continuous populations, we will interpolate when necessary and thus find a quantile value which is not one of the dataset numbers. This will be described later in the handout and on page 80 in your textbook.

In the following families of theoretical populations, the parameters Mean, Variance and p quantiles are given as derived from mathematical theory. For example, you can see that the mean of the Binomial B(4, 1/3) population is µ = 4(1/3) and the variance is σ² = 4(1/3)(2/3). (It is customary to use the notation µ and σ² for these parameters.) For the p value

$$p_0 = \binom{4}{0}(1/3)^0(2/3)^4 = (2/3)^4 = 16/81$$

the $p_0$ quantile is taken as Q(16/81) = 0. Also, for example, the Standard Normal population, N(0, 1), has mean µ = 0 and variance σ² = 1.
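As a numerical check on the binomial figures quoted above, the following minimal Python sketch rebuilds the B(3, 0.1) cumulative distribution values and the B(4, 1/3) mean, variance and p0 using exact fractions. The function names `binomial_pmf` and `binomial_cdf_table` are illustrative choices made for this sketch; they are not part of the handout, the textbook or any library.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, r, x):
    """Proportion of B(n, r) population elements equal to x (hypothetical helper)."""
    return comb(n, x) * r**x * (1 - r) ** (n - x)


def binomial_cdf_table(n, r):
    """Map each unique element x = 0..n to F(x), the proportion of elements <= x."""
    table, running = {}, Fraction(0)
    for x in range(n + 1):
        running += binomial_pmf(n, r, x)
        table[x] = running
    return table


# B(3, 0.1): cumulative distribution function at the unique elements.
cdf_3 = binomial_cdf_table(3, Fraction(1, 10))
for x, F in cdf_3.items():
    print(x, float(F))

# B(4, 1/3): mean, variance and p0 = F(0), computed directly from the pmf.
n, r = 4, Fraction(1, 3)
mean = sum(x * binomial_pmf(n, r, x) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(n, r, x) for x in range(n + 1))
p0 = binomial_cdf_table(n, r)[0]
print(mean, variance, p0)
```

Working with `Fraction` rather than floating point keeps the comparison exact: the computed mean 4/3, variance 8/9 and p0 = 16/81 agree with 4(1/3), 4(1/3)(2/3) and (2/3)^4.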
Quantiles (z) in N(0, 1) are given in the margins of Table B.3 on pages 788-789 for corresponding (rounded) p values in the body of the table. Thus (approximately) Q(0.0003) = −3.40, Q(0.0002) = −3.49, Q(0.5398) = 0.10 and Q(0.5753) = 0.19 for the N(0, 1) population.

1. Binomial B(n, r). The unique elements in these populations are the nonnegative integers zero through n, and 0 < r < 1. Denote these as $x_0 = 0, x_1 = 1, \ldots, x_n = n$. The mean and variance parameters of B(n, r) can be shown to have the values
   Mean: µ = nr
   Variance: σ² = nr(1 − r)
   Quantiles: Only quantiles for the p values $p_0, p_1, \ldots, p_n$, defined as follows, are considered in this population. The $p_i$ quantile ($0 < p_i < 1$) is $x_i \equiv Q(p_i)$, where $p_i$ and $x_i$ satisfy $F(x_i) = p_i$ (F(x) is the CDF).

2. Geometric G(r). The unique elements in these populations are all the positive integers. Denote these as $x_1 = 1, x_2 = 2, \ldots$. The mean and variance of these populations can be shown to have the values
   Mean: µ = 1/r
   Variance: σ² = (1 − r)/r²
   Quantiles: Only quantiles for the p values $p_1, p_2, \ldots$, defined as follows, are considered in this population. The $p_i$ quantile ($0 < p_i < 1$) is $x_i \equiv Q(p_i)$, where $p_i$ and $x_i$ satisfy $F(x_i) = p_i$ (F(x) is the CDF).

3. Normal N(µ, σ²). The quantile for every p value (0 < p < 1) is the number $Q(p) \equiv x_p$ which satisfies the equation
   $$\Phi(x_p) = p \quad\text{or}\quad \int_{-\infty}^{x_p} \varphi(y)\,dy = p ,$$
   where Φ(x) is the CDF and φ(x) is the density function. The mean and variance of the N(µ, σ²) population are, respectively, µ and σ². Quantiles for certain p values in N(0, 1) are given in Table B.3.

4. Exponential E(α).
   Mean: µ = α
   Variance: σ² = α²
   Quantiles: (0 < p < 1) The p quantile for a given p value is the real number $x_p$ satisfying $F(x_p) = p$ (F(x) is the CDF).

Quantiles in Finite Datasets as Defined by Textbook and JMP

The general definition of the p quantile of a random sample (or any dataset), for a given real number 0 < p < 1, is a number Q(p) such that the fraction p of the elements are less than or equal to Q(p) and the fraction 1 − p of the elements are greater than or equal to Q(p). For finite datasets this general definition cannot be satisfied for many given p values.

To exhibit the problem with the general definition for finite datasets, consider the set containing the numbers {1, 5, 9}. The quantile Q(1/3) is a number such that p = 1/3 of the numbers are less than or equal to Q(1/3) and 1 − p = 2/3 of the elements are greater than Q(1/3). What is the value of Q(1/3)? Well, Q(1/3) = 2 satisfies the definition. Q(1/3) = 3 also satisfies the definition, as do 2.5, 3.5, 3.7, etc.; any number from 1 up to, but not including, 5 works. There is not a unique answer, and for other p values such as p = 1/4 the choice is not obvious.

Thus, alternative definitions which produce approximate p quantiles are used. Your textbook gives one on page 78, and JMP uses a slightly different one. This permits us to compute approximations to population p quantiles in a random sample from a continuous population.

For any ordered dataset $y_1 \le y_2 \le \cdots \le y_n$, the p quantile (0 < p < 1) is the number Q(p) which satisfies:

Textbook definition (see pages 78-81).

1. If i = np + 0.5 is an integer (i ≤ n), then $Q(p) = y_i$.
2. If i = np + 0.5 is not an integer and if 0.5/n < p < (n − 0.5)/n, then let j = [i] (the largest integer less than i). The p quantile is computed by first computing
   $$d = \frac{p - (j - 0.5)/n}{(j + 0.5)/n - (j - 0.5)/n} ,$$
   then
   $$Q(p) = (1 - d)\,Q((j - 0.5)/n) + d\,Q((j + 0.5)/n) .$$
   By rule 1, $Q((j - 0.5)/n) = y_j$ and $Q((j + 0.5)/n) = y_{j+1}$, so this is a linear interpolation between two adjacent ordered values.

If p ≤ 0.5/n then $Q(p) = y_1$, and if p ≥ (n − 0.5)/n then $Q(p) = y_n$.

JMP definition

1. If i = np + p is an integer (i ≤ n), then $Q(p) = y_i$.
2. If i = np + p is not an integer, then let j = [i] (integer part of i) and f = i − j (fractional part of i). Then if 0 < j < n, compute the p quantile as
   $$Q(p) = (1 - f)\,y_j + f\,y_{j+1} .$$
   If j = 0 or j = n, use $y_1$ for $y_0$ or $y_n$ for $y_{n+1}$, so we have $Q(p) = y_1$ if j = 0 and $Q(p) = y_n$ if j = n.

For example, consider the dataset given in Example 5 on page 79. For p = 0.93, your textbook definition gives

$$i = 10(0.93) + 0.5 = 9.8 .$$

Since this is not an integer, compute j = [i] = 9. Then

$$d = \frac{0.93 - (9 - 0.5)/10}{(9 + 0.5)/10 - (9 - 0.5)/10} = \frac{0.93 - 0.85}{0.95 - 0.85} = 0.8$$

$$Q(0.93) = (1 - 0.8)\,Q(0.85) + 0.8\,Q(0.95) = (0.2)(9614) + (0.8)(10688) = 10{,}473.2$$

The JMP definition gives i = (10)(0.93) + 0.93 = 10.23. Thus j = 10 and f = 0.23. Since j = n, Q(0.93) = 10,688. Both numbers are p = 0.93 quantiles of the dataset.
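The two definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `textbook_quantile`, `jmp_quantile` and `_is_integer` are hypothetical, the small tolerance used to decide whether i is an integer is an implementation choice to cope with floating-point rounding, and the eight smallest values in `sample` are placeholders invented for the sketch. Only the two largest values, 9614 and 10688, come from the worked example above, and Q(0.93) depends only on those two.

```python
import math


def _is_integer(value, tol=1e-9):
    """Treat value as an integer if it is within tol of one (assumed tolerance)."""
    return abs(value - round(value)) < tol


def textbook_quantile(data, p):
    """p quantile by the textbook rule, based on i = n*p + 0.5."""
    y = sorted(data)
    n = len(y)
    if p <= 0.5 / n:
        return y[0]                      # Q(p) = y_1
    if p >= (n - 0.5) / n:
        return y[-1]                     # Q(p) = y_n
    i = n * p + 0.5
    if _is_integer(i):
        return y[round(i) - 1]           # y_i in the handout's 1-based notation
    j = math.floor(i)
    d = i - j                            # same as (p - (j-0.5)/n) / (1/n)
    return (1 - d) * y[j - 1] + d * y[j]


def jmp_quantile(data, p):
    """p quantile by the JMP rule, based on i = n*p + p."""
    y = sorted(data)
    n = len(y)
    i = n * p + p
    if _is_integer(i):
        j, f = round(i), 0.0
    else:
        j = math.floor(i)
        f = i - j
    if j <= 0:
        return y[0]                      # use y_1 in place of y_0
    if j >= n:
        return y[-1]                     # use y_n in place of y_{n+1}
    return (1 - f) * y[j - 1] + f * y[j]


# Placeholder data: only the last two values are taken from the handout.
sample = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9614, 10688]
print(textbook_quantile(sample, 0.93))
print(jmp_quantile(sample, 0.93))
```

Applied to the small set {1, 5, 9} with p = 1/3, the textbook rule gives 3 and the JMP rule gives 7/3, both of which lie in the range of values that satisfy the general definition.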