Fundamental Sampling Distributions and Data Descriptions
ENGSTAT Notes of AM Fillone, De La Salle University-Manila

Population: the totality of observations with which we are concerned, whether their number is finite or infinite.
- Statisticians use the term to refer to observations relevant to anything of interest, whether it be groups of people, animals, or all possible outcomes from some complicated biological or engineering system.

Definition 8.1: A population consists of the totality of the observations with which we are concerned.

Definition 8.2: A sample is a subset of a population.

Definition 8.3: Let $X_1, X_2, \ldots, X_n$ be n independent random variables, each having the same probability distribution f(x). Define $X_1, X_2, \ldots, X_n$ to be a random sample of size n from the population f(x) and write its joint probability distribution as
$$f(x_1, x_2, \ldots, x_n) = f(x_1)\, f(x_2) \cdots f(x_n).$$

Some Important Statistics

Definition 8.4: Any function of the random variables constituting a random sample is called a statistic.

Definition 8.5: If $X_1, X_2, \ldots, X_n$ represent a random sample of size n, then the sample mean is defined by the statistic
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Definition 8.6: If $X_1, X_2, \ldots, X_n$ represent a random sample of size n, then the sample variance is defined by the statistic
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$

Theorem 8.1: If $S^2$ is the variance of a random sample of size n, we may write
$$S^2 = \frac{1}{n(n-1)}\left[\, n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^{2} \right].$$

Definition 8.7: The sample standard deviation, denoted by S, is the positive square root of the sample variance.

Other Statistics

The sample median reflects the central tendency of the sample in such a way that it is uninfluenced by extreme values or outliers.
Given that the observations in a sample are $x_1, x_2, \ldots, x_n$, arranged in increasing order of magnitude, the sample median is
$$\tilde{x} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd}, \\ \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & \text{if } n \text{ is even}. \end{cases}$$

Example: Mean, median, mode, and standard deviation

According to ecology writer Jacqueline Killeen, phosphates contained in household detergents pass right through our sewer systems, causing lakes to turn into swamps that eventually dry up into deserts. The following data show the amount of phosphates per load of laundry, in grams, for a random sample of various types of detergents used according to the prescribed directions:

Laundry Detergent      Phosphates per Load (grams)
A & P Blue Sail        48
Dash                   47
Concentrated All       42
Cold Water All         42
Breeze                 41
Oxydol                 34
Ajax                   31
Sears                  30
Fab                    29
Cold Power             29
Bold                   29
Rinso                  26

For the given phosphate data, find: (a) the mean; (b) the median; (c) the mode; and (d) the standard deviation.

Solution:
(a) Mean: $\bar{x} = \frac{1}{12}\sum x_i = 428/12 \approx 35.7$ grams.
(b) Arrange the data in increasing order: 26, 29, 29, 29, 30, 31, 34, 41, 42, 42, 47, 48. Since n = 12 is even, the median is the average of the 6th and 7th values: $\tilde{x} = \tfrac{1}{2}(31 + 34) = 32.5$ grams.
(c) Mode = 29 grams (it occurs three times).
(d) Standard deviation: with $\sum x_i = 428$ and $\sum x_i^2 = 15938$, Theorem 8.1 gives $s^2 = \frac{(12)(15938) - (428)^2}{(12)(11)} \approx 61.15$, so $s \approx 7.8$ grams.

Data Displays and Graphical Methods

Box-and-Whisker Plot or Box Plot
• This plot encloses the interquartile range of the data in a box that has the median displayed within.
• The interquartile range has as its extremes the 75th percentile (upper quartile) and the 25th percentile (lower quartile).
• "Whiskers" extend from the box, showing extreme observations in the sample.
• A variation called a box plot can provide the viewer with information regarding which observations may be outliers.
• Outliers are observations that are considered to be unusually far from the bulk of the data.

Example: Consider the data in Table 8.1 about the nicotine content in a random sample of 40 cigarettes. Develop a box-and-whisker plot of the data.
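As a check on the phosphate example, the following is a minimal Python sketch that recomputes the four requested statistics with the standard-library `statistics` module and compares the sample variance with the computing form of Theorem 8.1. The variable names are illustrative choices, not part of the notes.

```python
import statistics

# Phosphates per load (grams), copied from the example table.
phosphates = [48, 47, 42, 42, 41, 34, 31, 30, 29, 29, 29, 26]
n = len(phosphates)

mean = statistics.mean(phosphates)      # Definition 8.5
median = statistics.median(phosphates)  # n is even: average of the two middle values
mode = statistics.mode(phosphates)      # most frequent observation
s = statistics.stdev(phosphates)        # Definition 8.7 (divisor n - 1)

# Theorem 8.1: computing form of the sample variance.
sum_x = sum(phosphates)
sum_x2 = sum(x * x for x in phosphates)
s2_shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

print(f"mean = {mean:.2f} g, median = {median} g, mode = {mode} g, s = {s:.2f} g")
print(f"s^2 from Theorem 8.1 = {s2_shortcut:.4f}, s^2 from definition = {s ** 2:.4f}")
```

The two variance values agree up to floating-point rounding, which is the content of Theorem 8.1.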
Example: Constructing a Stem-and-Leaf Plot

Consider the data of Table 1.4, which specifies the "life" of 40 similar car batteries recorded to the nearest tenth of a year. The batteries are guaranteed to last 3 years.

Table 1.4: Car Battery Life
2.2  4.1  3.5  3.4  1.6  3.1  2.5  4.3
3.4  3.3  3.1  3.7  4.7  3.8  3.2  4.5
3.3  3.6  4.4  2.6  3.2  3.8  2.9  3.2
3.9  3.7  3.1  3.3  4.1  3.0  3.0  4.7
3.9  1.9  4.2  2.6  3.7  3.1  3.4  3.5

Process:
1. Split each observation into two parts consisting of a stem and a leaf, such that the stem represents the digit preceding the decimal point and the leaf corresponds to the decimal part of the number.
2. For example, for the number 3.7, the digit 3 is designated the stem and the digit 7 is the leaf.
3. The four stems 1, 2, 3, and 4 are listed vertically on the left side in Table 1.5; the leaves are recorded on the right side opposite the appropriate stem value.

Table 1.5: Stem-and-Leaf Plot of Battery Life
Stem   Leaf                         Frequency
1      69                           2
2      25669                        5
3      0011112223334445567778899    25
4      11234577                     8

Stem-and-Leaf Plot
1. The stem-and-leaf plot of Table 1.5 contains only four stems and consequently does not provide an adequate picture of the distribution.
2. To remedy the problem, the number of stems in the plot can be increased.
3. One way to accomplish this is to write each stem value twice, recording the leaves 0, 1, 2, 3, and 4 opposite the stem value where it appears for the first time (marked with *), and the leaves 5, 6, 7, 8, and 9 opposite the same stem value where it appears for the second time.

Table 1.6: Double-Stem-and-Leaf Plot of Battery Life
Stem   Leaf               Frequency
1      69                 2
2*     2                  1
2      5669               4
3*     001111222333444    15
3      5567778899         10
4*     11234              5
4      577                3

Frequency Distribution
- A frequency distribution groups the data into classes or intervals. It can be constructed by counting the leaves belonging to each stem, noting that each stem defines a class interval.
- A table listing relative frequencies is called a relative frequency distribution.
- The relative frequency distribution of battery life is given in Table 1.7 below.

Table 1.7: Relative Frequency Distribution of Battery Life
Class Interval   Class Midpoint   Frequency, f   Relative Frequency
1.5-1.9          1.7              2              0.050
2.0-2.4          2.2              1              0.025
2.5-2.9          2.7              4              0.100
3.0-3.4          3.2              15             0.375
3.5-3.9          3.7              10             0.250
4.0-4.4          4.2              5              0.125
4.5-4.9          4.7              3              0.075
Total                                            1.000

Figure 1.6: Relative frequency histogram (horizontal axis: Battery Life (years), at the class midpoints 1.7 to 4.7; vertical axis: Relative Frequency)

Quantile Plot

Definition 8.8: A quantile of a sample, q(f), is a value for which a specified fraction f of the data values is less than or equal to q(f).

Detection of Deviations from Normality: Normal Quantile-Quantile Plot

Definition 8.9: The normal quantile-quantile plot is a plot of $y_{(i)}$ (the ordered observations) against $q_{0,1}(f_i)$, where $f_i = \dfrac{i - 3/8}{n + 1/4}$.
- A good approximation of the quantile for the N(0,1) random variable is
$$q_{0,1}(f) = 4.91\left[f^{0.14} - (1 - f)^{0.14}\right].$$

Sampling Distributions

Definition 8.10: The probability distribution of a statistic is called a sampling distribution.
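The battery-life tables and the quantile definitions above can be reproduced with a short script. The sketch below, in plain Python, counts the class frequencies of Table 1.7 and computes the coordinates of a normal quantile-quantile plot per Definition 8.9. It assumes the widely quoted approximation $q_{0,1}(f) = 4.91[f^{0.14} - (1-f)^{0.14}]$ for the standard normal quantile; the function and variable names are illustrative.

```python
# Battery life in years, copied from Table 1.4.
battery_life = [
    2.2, 4.1, 3.5, 3.4, 1.6, 3.1, 2.5, 4.3,
    3.4, 3.3, 3.1, 3.7, 4.7, 3.8, 3.2, 4.5,
    3.3, 3.6, 4.4, 2.6, 3.2, 3.8, 2.9, 3.2,
    3.9, 3.7, 3.1, 3.3, 4.1, 3.0, 3.0, 4.7,
    3.9, 1.9, 4.2, 2.6, 3.7, 3.1, 3.4, 3.5,
]
n = len(battery_life)

# Class intervals 1.5-1.9, 2.0-2.4, ..., 4.5-4.9 (width 0.5 years).
# Work in integer tenths of a year to avoid floating-point edge effects.
frequencies = [0] * 7
for x in battery_life:
    tenths = round(x * 10)
    frequencies[(tenths - 15) // 5] += 1
rel_freq = [f / n for f in frequencies]

def normal_quantile_approx(f):
    """Approximate N(0,1) quantile (assumed formula, see the lead-in)."""
    return 4.91 * (f ** 0.14 - (1.0 - f) ** 0.14)

# Definition 8.9: pair each ordered observation y_(i) with q_{0,1}(f_i).
ordered = sorted(battery_life)
qq_points = []
for i, y in enumerate(ordered, start=1):
    f_i = (i - 3 / 8) / (n + 1 / 4)
    qq_points.append((normal_quantile_approx(f_i), y))

print("frequencies:         ", frequencies)
print("relative frequencies:", rel_freq)
print("first and last Q-Q points:", qq_points[0], qq_points[-1])
```

Plotting the second coordinate of `qq_points` against the first gives the normal quantile-quantile plot; points falling close to a straight line suggest the data are roughly normal.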
Sampling Distribution of $\bar{X}$

Theorem 8.2 (Central Limit Theorem): If $\bar{X}$ is the mean of a random sample of size n taken from a population with mean $\mu$ and finite variance $\sigma^2$, then the limiting form of the distribution of
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}},$$
as $n \to \infty$, is the standard normal distribution n(z; 0, 1).

Sampling Distribution of the Difference between Two Averages

Theorem 8.3: If independent samples of size $n_1$ and $n_2$ are drawn at random from two populations, discrete or continuous, with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, then the sampling distribution of the difference of means, $\bar{X}_1 - \bar{X}_2$, is approximately normal with mean and variance given by
$$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2 \quad\text{and}\quad \sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
Hence,
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$
is approximately a standard normal variable.

Sampling Distribution of $S^2$

Theorem 8.4: If $S^2$ is the variance of a random sample of size n taken from a normal population having the variance $\sigma^2$, then the statistic
$$\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}$$
has a chi-squared distribution with $\nu = n - 1$ degrees of freedom.

Degrees of Freedom
- When $\mu$ is not known and one considers the distribution of $\sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma^2$, there is 1 less degree of freedom; that is, a degree of freedom is lost in the estimation of $\mu$ (i.e., when $\mu$ is replaced by $\bar{x}$).
- In other words, there are n degrees of freedom, or independent pieces of information, in the random sample from the normal distribution.
- When the data (the values in the sample) are used to compute the mean, there is 1 less degree of freedom in the information used to estimate $\sigma^2$.

Examples: Chi-squared Distribution

Ex. For the chi-squared distribution, find:
1. Answer: 27.488 (Table A.5)
2. Answer: 18.475
3. Answer: 36.415

t-Distribution

Theorem 8.5: Let Z be a standard normal random variable and V a chi-squared random variable with $\nu$ degrees of freedom. If Z and V are independent, then the distribution of the random variable T, where
$$T = \frac{Z}{\sqrt{V/\nu}},$$
is given by the density function
$$h(t) = \frac{\Gamma[(\nu+1)/2]}{\Gamma(\nu/2)\sqrt{\pi\nu}} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}, \quad -\infty < t < \infty.$$
This is known as the t-distribution with $\nu$ degrees of freedom.

Shape of t-Distribution
• The distribution of T is similar to the distribution of Z in that both are symmetric about a mean of zero.
• Both distributions are bell shaped, but the t-distribution is more variable, owing to the fact that the T-values depend on the fluctuations of two quantities, $\bar{X}$ and $S^2$, whereas the Z-values depend only on the changes of $\bar{X}$ from sample to sample.
• The distribution of T differs from that of Z in that the variance of T depends on the sample size n and is always greater than 1.
• Only when the sample size $n \to \infty$ will the two distributions become the same.

Figure 8.14: Symmetry property of the t-distribution

Example: t-Distribution

Solution: The critical values $\pm t_{0.025}$ are read from the t-distribution table (Table A.4). The claim is supported by the data obtained, since the computed T-value lies between $-t_{0.025}$ and $t_{0.025}$.

Corollary 8.1: Let $X_1, X_2, \ldots, X_n$ be independent random variables that are all normal with mean $\mu$ and standard deviation $\sigma$. Let
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\text{and}\quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$
Then the random variable
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has a t-distribution with $\nu = n - 1$ degrees of freedom.

F-Distribution

Theorem 8.6: Let U and V be two independent random variables having chi-squared distributions with $\nu_1$ and $\nu_2$ degrees of freedom, respectively. Then the distribution of the random variable
$$F = \frac{U/\nu_1}{V/\nu_2}$$
is given by the density
$$h(f) = \begin{cases} \dfrac{\Gamma[(\nu_1+\nu_2)/2]\,(\nu_1/\nu_2)^{\nu_1/2}}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \cdot \dfrac{f^{\nu_1/2 - 1}}{(1 + \nu_1 f/\nu_2)^{(\nu_1+\nu_2)/2}}, & f > 0, \\ 0, & f \le 0. \end{cases}$$
This is known as the F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom (d.f.).
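The sampling-distribution results above (Theorem 8.2, Theorem 8.4, and Corollary 8.1) can be illustrated by simulation. The following is a rough sketch using only the Python standard library; the population parameters, sample sizes, seed, and number of replications are all hypothetical choices made for illustration. It relies on two standard facts not stated in the notes: a chi-squared variable with $\nu$ degrees of freedom has mean $\nu$, and a t variable with $\nu > 2$ degrees of freedom has variance $\nu/(\nu - 2)$.

```python
import math
import random

rng = random.Random(2024)   # fixed seed so the run is repeatable
REPS = 20000                # number of simulated samples (illustrative)
n = 10                      # size of each normal sample (illustrative)
mu, sigma = 50.0, 4.0       # hypothetical normal population parameters

def sample_mean_var(xs):
    """Return the sample mean and the sample variance (divisor n - 1)."""
    m = sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, s2

z_vals, chi2_vals, t_vals = [], [], []
for _ in range(REPS):
    # Theorem 8.2: standardized mean of 30 draws from a skewed population.
    # An exponential population with rate 1 has mean 1 and variance 1.
    expo = [rng.expovariate(1.0) for _ in range(30)]
    z_vals.append((sum(expo) / 30 - 1.0) / (1.0 / math.sqrt(30)))

    # Theorem 8.4 and Corollary 8.1: statistics from a normal sample.
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar, s2 = sample_mean_var(xs)
    chi2_vals.append((n - 1) * s2 / sigma ** 2)
    t_vals.append((xbar - mu) / math.sqrt(s2 / n))

z_mean, z_var = sample_mean_var(z_vals)
chi2_mean, _ = sample_mean_var(chi2_vals)
_, t_var = sample_mean_var(t_vals)

print(f"Z:     mean {z_mean:.3f}, variance {z_var:.3f} (limit: 0 and 1)")
print(f"chi^2: mean {chi2_mean:.3f} (theory: n - 1 = {n - 1})")
print(f"T:     variance {t_var:.3f} (theory: v/(v - 2) = {(n - 1) / (n - 3):.3f})")
```

The simulated variance of T comes out noticeably above 1, consistent with the remark that the t-distribution is more variable than the standard normal.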
Theorem 8.7: Writing $f_\alpha(\nu_1, \nu_2)$ for $f_\alpha$ with $\nu_1$ and $\nu_2$ degrees of freedom, we obtain
$$f_{1-\alpha}(\nu_1, \nu_2) = \frac{1}{f_\alpha(\nu_2, \nu_1)}.$$

Theorem 8.8: If $S_1^2$ and $S_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$ taken from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, respectively, then
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2}$$
has an F-distribution with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom.

Use of the F-Distribution
The F-distribution is used in two-sample situations to draw inferences about the population variances. For this reason it is also called the variance ratio distribution.

Solution: (a) 2.71 (b) 2.92 (c) 0.345
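Theorems 8.7 and 8.8 can be illustrated with a small simulation. The sketch below, written with the Python standard library only, simulates the variance ratio $S_1^2/S_2^2$ for two normal samples with equal population variances and compares empirical critical values. The sample sizes (giving 7 and 15 degrees of freedom), the seed, and the number of replications are hypothetical choices for illustration and are not taken from the notes.

```python
import random

rng = random.Random(7)   # fixed seed so the run is repeatable
REPS = 20000             # number of simulated ratios (illustrative)
ALPHA = 0.05
n1, n2 = 8, 16           # hypothetical sample sizes -> v1 = 7, v2 = 15

def sample_variance(xs):
    """Sample variance with divisor n - 1 (Definition 8.6)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate_f(size_num, size_den):
    """Sorted simulated values of S1^2/S2^2 for two N(0,1) samples.

    With equal population variances this ratio is the F statistic of
    Theorem 8.8 with (size_num - 1, size_den - 1) degrees of freedom.
    """
    ratios = []
    for _ in range(REPS):
        s2_num = sample_variance([rng.gauss(0.0, 1.0) for _ in range(size_num)])
        s2_den = sample_variance([rng.gauss(0.0, 1.0) for _ in range(size_den)])
        ratios.append(s2_num / s2_den)
    ratios.sort()
    return ratios

def upper_point(sorted_vals, area_right):
    """Empirical value leaving about `area_right` of the draws to its right."""
    idx = int(round((1.0 - area_right) * len(sorted_vals))) - 1
    return sorted_vals[idx]

f_v1_v2 = simulate_f(n1, n2)   # F with (7, 15) d.f.
f_v2_v1 = simulate_f(n2, n1)   # independent draws of F with (15, 7) d.f.

f_alpha = upper_point(f_v1_v2, ALPHA)                # estimates f_0.05(7, 15)
f_one_minus_alpha = upper_point(f_v2_v1, 1 - ALPHA)  # estimates f_0.95(15, 7)

print(f"estimated f_0.05(7, 15)   = {f_alpha:.3f}")
print(f"estimated f_0.95(15, 7)   = {f_one_minus_alpha:.3f}")
print(f"1 / estimated f_0.05(7,15) = {1 / f_alpha:.3f}")
```

Up to simulation noise, the last two printed values should agree, which is the reciprocal relation of Theorem 8.7. This is also why F tables list only upper-tail critical values.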