Elementary Statistics for the Biological and Life Sciences
STAT 205
University of South Carolina, Columbia, SC

© 2005, University of South Carolina. All rights reserved, except where previous rights exist. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photoreproduction, recording, or scanning) without the prior written consent of the University of South Carolina.

DoStat Sign-up
Go to http://www.dostat.com
• “SIGN UP” as a “student”, using your VIP login name as your DOSTAT login name
• Use course reference DS- _____
• Submit an e-mail address you read often (this is how you will receive info. on course announcements)

DoStat and StatCrunch
We will use the StatCrunch online statistical system for online statistical computations and graphics: http://www.statcrunch.com
We will also use the DoStat course management system for homework and example online calculations.

Motivation: why analyze data?
• Clinical trials/drug development: compare existing treatments with new methods to cure disease.
• Agriculture: enhance crop yields, improve pest resistance.
• Ecology: study how ecosystems develop/respond to environmental impacts.
• Lab studies: learn more about biological tissue/cellular activity.

Chapter 2: Description of Populations and Samples
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.

Statistics is:
Statistics is the science of
• collecting,
• summarizing,
• analyzing, and
• interpreting data.
Goal: to understand the underlying biological phenomena that generate the data.

Random Variables
Data are generated by some random process or phenomenon. Any observed datum represents the outcome of a Random Variable.
NOTATION: upper case letter, W, X, Y, etc.

Types of Random Variables
Qualitative
• Nominal (e.g., blood type – A, B, AB, O)
• Ordinal (e.g., therapy response – none, some, cured)
Quantitative
• Discrete (e.g., number of nests – 0, 1, 2, …)
• Continuous (e.g., cholesterol conc. – 220.2, 210.4, 180.9, etc.)

Random Samples
We take data as samples from a larger population.
DEF’N: A SAMPLE is a collection of ‘subjects’ upon which we measure one or more variables.
DEF’N: The SAMPLE SIZE is the number of subjects in a sample. NOTATION: n.

Observations
DEF’N: The OBSERVATIONAL UNIT is the type of subject being sampled.
Example: observational units could be (i) baby, (ii) moth, (iii) Petri dish, etc.
DEF’N: An OBSERVATION is a recorded outcome of a variable from a random sample.
NOTATION: lower case letter, x, y, etc.

Frequency Distributions
DEF’N: A FREQUENCY DISTRIBUTION is a summary display of the frequencies of occurrence of each value in a sample.
DEF’N: A RELATIVE FREQUENCY (a percent or proportion) is a raw frequency divided by sample size, n:
Rel. Freq. = Freq / n

Frequency Distn’s
Frequency distributions come in varied shapes:
• Symmetric & bell-shaped
• Symmetric, not bell-shaped
• Asymmetric & skewed right
• Asymmetric & skewed left
• Bimodal
We use histograms, etc., to visualize these shapes in the data.

Example 2.4
Ex. 2.4: Y = no.
of piglets surviving 21 days (litter size). A sample of n = 36 pigs (sows) generated the data in Table 2.4.

Dot Plot
A DOT PLOT is a simple graphic where dots indicate observed data in a sample.
Ex. 2.4: Fig. 2.4 gives the dot plot for the litter size data.

Histogram
A HISTOGRAM is a simple bar chart where the bars replace the dots in a dot plot.
Ex. 2.4 (cont’d): Fig. 2.5 gives the histogram for the litter size data.

Stemplot
A STEMPLOT (a.k.a. STEM-LEAF DIAGRAM) is a dot plot (often drawn on its side) with data information replacing the dots. The ‘stems’ are the core values of the data, set in common groups. The ‘leaves’ are the last digits of each datum.

Example 2.8
Ex. 2.8: Y = radish growth. Data in Table 2.8: Radish Growth after 3 days in Total Darkness

Descriptive Statistics
DEF’N: The SAMPLE MEAN is the arithmetic average of a set of n data values.
NOTATION: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{y_1 + y_2 + \cdots + y_n}{n}$
The sample mean is often viewed as a kind of ‘balance point’ in the data.

Example 2.15
Ex. 2.15: Y = weight gain (lb) of lambs on special diet. Data: {11, 13, 19, 2, 10, 1}, n = 6:
$\bar{y} = \frac{11 + 13 + 19 + 2 + 10 + 1}{6} = \frac{56}{6} = 9.33$ lb
(See Fig. 2.27.)

Sample Median
DEF’N: The SAMPLE MEDIAN is the value of the data nearest to their middle. Find the median by ordering the data, and calculating their middle point (n odd) or the average of their two middle points (n even).
NOTATION: Q2

Example 2.17
Ex.
2.17: (2.15 cont’d) Lamb weight gain. n = 6 is even, so find Q2 as avg. of two middle points.
Ordered data: y(1) = 1, y(2) = 2, y(3) = 10, y(4) = 11, y(5) = 13, y(6) = 19.
$Q_2 = \frac{10 + 11}{2} = 10.5$ lb

Example 2.19
Ex. 2.19: Y = cricket singing times. Data in Table 2.10.

Skewness
Mean & median indicate skewness:
• If data are skewed right, mean > median.
• If data are skewed left, mean < median.
• If data are symmetric, mean ≈ median.
Both the mean and the median are useful summary measures of location. The median is slightly more ROBUST to extreme values of yi, but of course, the mean is easier to calculate.

Quartiles
DEF’N: The QUARTILES of a distribution are points that separate the data into quarters or fourths:
• The first quartile separates the lower 25% of the data from the upper 75%. NOTATION: Q1
• The second quartile separates the lower 50% of the data from the upper 50%. NOTATION: Q2
• The third quartile separates the lower 75% of the data from the upper 25%. NOTATION: Q3

Example 2.20
Ex. 2.20: Y = Systolic blood pressure (mm Hg) in men; n = 7.
Ordered data: y(1) = 113, y(2) = 124, y(3) = 124, y(4) = 132, y(5) = 146, y(6) = 151, y(7) = 170.
Q1 = 124, Q2 = 132, Q3 = 151

IQR
DEF’N: The INTER-QUARTILE RANGE is IQR = Q3 – Q1
DEF’N: The MINIMUM is the smallest value of a data set or distribution. NOTATION: y(1)
DEF’N: The MAXIMUM is the largest value of a data set or distribution.
NOTATION: y(n)

Five Number Summary
DEF’N: The FIVE NUMBER SUMMARY is {y(1), Q1, Q2, Q3, y(n)}
DEF’N: A BOXPLOT is a graphic plot of the 5-no. summary, with a box spanning the IQR and bridging the quartiles.
[Diagram: boxplot marked at y(1), Q1, Q2, Q3, y(n)]

Example 2.22
Ex. 2.22: Y = radish growth data from Ex. 2.8. Five-no. summary is {8, 15, 21, 30, 37}. Boxplot is given in Fig. 2.30.

Example 2.23
Ex. 2.23: Y = radish growth data over three different growth regimes (see Ex. 2.9). In Fig. 2.32, we use boxplots for comparative purposes.

Outliers
DEF’N: An OUTLIER is an obsv’n that differs dramatically from the rest of the data.
Formally: Yi is an outlier if
Yi < Q1 – (1.5)(IQR) (the “lower fence”) or Yi > Q3 + (1.5)(IQR) (the “upper fence”).

Example 2.25
Ex. 2.25: Y = radish growth data in full light (from Ex. 2.23). The ordered data are:
3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21
IQR = Q3 – Q1 = 10 – 7 = 3
Upper fence = Q3 + (1.5)(IQR) = 10 + (1.5)(3) = 14.5
Lower fence = Q1 – (1.5)(IQR) = 7 – (1.5)(3) = 2.5
So y = 20 and y = 21 are outliers.

Dispersion
DEF’N: The SAMPLE RANGE is Range = Y(n) – Y(1) = Max. – Min.
DEF’N: The SAMPLE VARIANCE is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
DEF’N: The SAMPLE STANDARD DEVIATION (SD) is $S = \sqrt{S^2}$

The Empirical Rule
The sample mean and the sample SD are useful in describing data sets (that are unimodal and not too skewed).
The EMPIRICAL RULE states that
• ~68% of the data lie between $\bar{Y} - S$ and $\bar{Y} + S$
• ~95% of the data lie between $\bar{Y} - 2S$ and $\bar{Y} + 2S$
• >99% of the data lie between $\bar{Y} - 3S$ and $\bar{Y} + 3S$

Example 2.36
Ex. 2.36: Suppose Y = pulse rate after 5 mins. of exercise. For n = 28 subjects, we find $\bar{Y}$ = 98 (beats/min) and S = 13.4 (beats/min). Thus, e.g., from the empirical rule we expect ~95% of the data to lie between
98 – (2)(13.4) = 98 – 26.8 = 71.2 beats/min and 98 + (2)(13.4) = 98 + 26.8 = 124.8 beats/min.

Inference
DEF’N: The POPULATION is the larger group of subjects (organisms, plots, regions, ecosystems, etc.) on which we wish to draw inferences.
DEF’N: A PARAMETER is a quantified population characteristic. E.g., the popl’n mean is µ and the popl’n standard deviation is σ.
DEF’N: A STATISTIC is a sample quantity used to estimate a popl’n parameter.

Proportions
DEF’N: The POPULATION PROPORTION is the proportion of subjects exhibiting a particular trait or outcome in the popl’n. (It generalizes to the probability that any popl’n element will exhibit the trait.) NOTATION: p
DEF’N: The SAMPLE PROPORTION is the number of sample elements exhibiting the trait, divided by the sample size, n. NOTATION: $\hat{p}$

Chapter 3: Random Sampling, Probability, and the Binomial Distribution
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.

Random Samples
DEF’N: A SIMPLE RANDOM SAMPLE of n items is a data set where
(a) every popl’n element has an equal chance of selection, and
(b) every popl’n element is chosen independently of every other element.
This draws upon the larger concept of RANDOMIZATION: selection of data that avoids sources of possible bias.

Random Sampling
To choose a random sample:
1. assign each popl’n element a unique code (or set of codes);
2. from a random number table (Table 1, p. 670) or via computer, in a systematic manner select n random digits whose range corresponds to the codes assigned above; and
3. select every element if its code appears in step (2), ignoring repeated codes or those with no assignment.

Example 3.1
Ex. 3.1: Simple random sample of size n = 6 from population of 75 elements.
1. label each element 01, 02, …, 75
2. select random digits from a source such as Table 1 or DoStat
3. choose elements for the sample if they correspond to the selected random digits (ignore repeats and drop-outs)
See Table 3.1.

Example 3.1 (cont’d)
The sample uses elements 23, 38, 59, 21, 08, 09.

Probability
DEF’N: A PROBABILITY is the chance of some event, E, occurring in a specified manner. NOTATION: P{E}
We often view probabilities from a Relative Frequency Interpretation:
P{E} = (# ways E occurs) / (# total events)

Example 3.12
Ex. 3.12: Toss a fair coin twice. We know P{H} = 1/2 (see Ex. 3.8). What is P{HH}?
Consider all possible outcomes: HH, HT, TH, TT. If each outcome is equally likely, then
P{HH} = (# HH) / (# all outcomes) = 1/4

Probability Rules
Rule 1: 0 ≤ P{E} ≤ 1.
Rule 2: The entirety of events has probability = 1. That is, if E1, ..., Ek are all the possible events, ∑P{Ei} = 1.
Rule 3 (The Complement Rule): If $E^c$ = {not E}, then P{$E^c$} = 1 – P{E}.

Example 3.19
Ex. 3.19: U.S. Blood types:
P{O} = 0.44, P{A} = 0.42, P{B} = 0.10, P{AB} = 0.04
Note: (1) all are between 0 and 1, and (2) P{O} + P{A} + P{B} + P{AB} = 0.44 + 0.42 + 0.10 + 0.04 = 1.00
So, e.g., P{$O^c$} = 1 – P{O} = 1 – 0.44 = 0.56

Probability (cont’d)
DEF’N: Two events, E1 and E2, are DISJOINT (a.k.a. MUTUALLY EXCLUSIVE) if they cannot occur simultaneously.
DEF’N: The UNION of two events, E1 and E2, is the event that E1 or E2 (or both) occurs.
DEF’N: The INTERSECTION of two events, E1 and E2, is the event that both E1 and E2 occur.

Venn Diagrams
A useful graphic to conceptualize how events interrelate is the Venn Diagram. For example, Fig. 3.8 shows a Venn Diagram with 2 intersecting events, E1 and E2.

Probability Rules (cont’d)
We often denote the entirety of events as the Sample Space, S. Conversely, the Null Space is $\emptyset = S^c$.
Rule 4: If E1 and E2 are disjoint, then P{E1 or E2} = P{E1} + P{E2}.
Rule 5: If E1 and E2 are any two events, then P{E1 or E2} = P{E1} + P{E2} – P{E1 and E2}.

Example 3.20
Ex. 3.20: Hair/Eye color of 1770 men. We have the following distribution of traits (Table 3.3). So, e.g., P{Black Hair} = 500/1770, etc.
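
Rules 4 and 5 can be sketched in a few lines of Python. This is only an illustration, not part of the course software: the function names are made up for this sketch, and the counts are the ones quoted for Example 3.20 (500 men with black hair, 70 with red hair, 1050 with blue eyes, and 200 with both black hair and blue eyes, out of 1770).

```python
from fractions import Fraction

N = 1770  # total number of men in Example 3.20

# Probabilities built from the counts quoted in Example 3.20.
p_black_hair = Fraction(500, N)
p_red_hair = Fraction(70, N)
p_blue_eyes = Fraction(1050, N)
p_black_and_blue = Fraction(200, N)


def union_disjoint(p1, p2):
    """Rule 4: P{E1 or E2} when E1 and E2 cannot occur together."""
    return p1 + p2


def union_general(p1, p2, p_both):
    """Rule 5: P{E1 or E2} for any two events."""
    return p1 + p2 - p_both


# Hypothetical names, chosen for this sketch only.
p_black_or_red = union_disjoint(p_black_hair, p_red_hair)
p_black_or_blue = union_general(p_black_hair, p_blue_eyes, p_black_and_blue)

print(f"P(Black Hair or Red Hair)  = {float(p_black_or_red):.2f}")
print(f"P(Black Hair or Blue Eyes) = {float(p_black_or_blue):.2f}")
```

Exact fractions are used so that the sums match the hand calculation before any rounding. Rule 4 is just Rule 5 with P{E1 and E2} = 0.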

Example 3.20 (cont’d)
Find P{Black Hair OR Red Hair}. Clearly, E1 = {Black Hair} and E2 = {Red Hair} are disjoint, so from Rule 4,
P{Black Hair OR Red Hair} = P{Black Hair} + P{Red Hair} = 500/1770 + 70/1770 = 570/1770 = 0.32.

Example 3.20 (cont’d)
Now, find P{Black Hair OR Blue Eyes}. Here, E1 = {Black Hair} and E2 = {Blue Eyes} are NOT disjoint, so apply Rule 5:
P{Black Hair OR Blue Eyes} = P{Black Hair} + P{Blue Eyes} – P{Black Hair AND Blue Eyes} = 500/1770 + 1050/1770 – 200/1770 = 1350/1770 = 0.76.

Probability (cont’d)
DEF’N: Two events, E1 and E2, are INDEPENDENT if knowledge that E1 occurs does not affect P{E2} and vice versa. If two events are not independent, they are DEPENDENT.
DEF’N: A CONDITIONAL PROBABILITY is the probability that one event occurs, given that the other has already occurred. NOTATION: P{E1 | E2}.

Probability Rules (cont’d)
Rule 6: If E1 and E2 are independent, then P{E1 and E2} = P{E1} P{E2}.
Rule 7: If E1 and E2 are any two events, then P{E1 and E2} = P{E1} P{E2 | E1} = P{E2} P{E1 | E2}.
Consequences:
• if E1 and E2 are independent, then P{E1} = P{E1 | E2} and P{E2} = P{E2 | E1}
• also, P{E2 | E1} = P{E1 and E2}/P{E1} if P{E1} ≠ 0.

Examples 3.21–3.22
Exs. 3.21–3.22 (3.20, cont’d): Hair/Eye color of 1770 men. Refer back to Table 3.3. There, we saw P{Blue Eyes AND Black Hair} = 200/1770, while P{Black Hair} = 500/1770. So,
P{Blue Eyes | Black Hair} = P{Blue Eyes AND Black Hair} / P{Black Hair} = (200/1770) / (500/1770) = 200/500 = 0.40

Example 3.25
Ex. 3.25 (3.20, cont’d): Hair/Eye color of 1770 men.
In Table 3.3, there is no evidence of independence between Hair & Eye color. So, e.g.,
P{Red Hair AND Brown Eyes} = P{Red Hair} P{Brown Eyes | Red Hair} = (70/1770)(20/70) = 20/1770
which agrees with the display in Table 3.3.

Density Curves
DEF’N: A RANDOM VARIABLE is a measured outcome of some random process.
When a random variable is discrete, it is usually straightforward to interpret probabilities associated with it. For instance, if Y = {# leaves on tree}:
• P{Y = 122} = 0.42 is interpretable
• P{Y = 18} = 0.02 is interpretable
• but P{Y = 120.472} is not interpretable.

Probability Histogram
A probability histogram is used to visualize discrete probability masses.
[Figure: probability histogram of P{Y = k} against k]
Notice: each “mass” has area = probability, and all masses sum to 1.

Continuous Random Variables
By contrast, a continuous random variable has a different probability interpretation. Extending the probability histogram to the continuous case, we say Y has a PROBABILITY DENSITY CURVE, where area still represents probability.

Continuous Random Variables
Consequences of the continuous probability model:
• P{Y = a} = 0 = P{Y = b} (area of a line is zero)
• So, P{Y ≤ a} = P{Y < a} + P{Y = a} = P{Y < a}
• And for that matter: P{a ≤ Y ≤ b} = P{a < Y ≤ b} = P{a ≤ Y < b} = P{a < Y < b} (all if Y is continuous).

Example 3.30
Ex. 3.30: Y = diameter (in.) of tree trunk.
• Suppose the density has the form given in Fig.
3.13.
• Then, for example, P{Y > 8} = P{8 < Y ≤ 10} + P{Y > 10} = 0.12 + 0.07 = 0.19

Mean and Expected Value
DEF’N: If Y is a discrete random variable, its POPULATION MEAN is given by
$\mu_Y = \sum y_i P\{Y = y_i\}$ (where the sum is taken over all possible $y_i$’s)
More generally, the EXPECTED VALUE of Y is $E(Y) = \sum y_i P\{Y = y_i\}$.

Example 3.35
Ex. 3.35: Y = # tail vertebrae in fish. From Table 3.4 we find
yi:          20    21    22    23
P{Y = yi}:   .03   .51   .40   .06
So, $E(Y) = \sum y_i P\{Y = y_i\}$ = (20)(.03) + (21)(.51) + (22)(.40) + (23)(.06) = … = 21.49.

Variance
DEF’N: If Y is a discrete random variable, its POPULATION VARIANCE is given by
$\sigma_Y^2 = \sum (y_i - \mu_Y)^2 P\{Y = y_i\}$
One can show this is also
$\sigma_Y^2 = E(Y^2) - \{E(Y)\}^2 = E(Y^2) - \mu_Y^2$
From this, the POPULATION STANDARD DEVIATION of Y is $\sigma_Y = (\sigma_Y^2)^{1/2}$.

Example 3.37
Ex. 3.37: (3.35, cont’d). From Table 3.4 we were given the values of P{Y = yi}. Recall $\mu_Y$ = 21.49. So,
$\sigma_Y^2 = \sum (y_i - \mu_Y)^2 P\{Y = y_i\} = (20 - 21.49)^2(.03) + (21 - 21.49)^2(.51) + (22 - 21.49)^2(.40) + (23 - 21.49)^2(.06)$ = … = 0.4299.

Example 3.37 (cont’d)
So $\sigma_Y^2$ = 0.4299. But, it’s a lot easier to use
$\sigma_Y^2 = E(Y^2) - \mu_Y^2 = \{(20)^2(.03) + (21)^2(.51) + (22)^2(.40) + (23)^2(.06)\} - (21.49)^2$ = 462.25 – 461.8201 = 0.4299.

Rules of Expected Value
E(·) is a mathematical operator.
It has certain general properties (it is a “linear operator”):
• Rule E1: $E(aX + bY) = aE(X) + bE(Y) = a\mu_X + b\mu_Y$
• Rule E2: $E(a + bY) = a + bE(Y) = a + b\mu_Y$

Rules of Variance
The special variance operator also has certain general properties:
• Rule E3: If X and Y are independent, then $\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2$.
• Rule E4: If X and Y are independent, then $\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2$.
• General rule: If X and Y are independent, then $\sigma_{aX+bY}^2 = a^2\sigma_X^2 + b^2\sigma_Y^2$.

Example 3.41
Ex. 3.41: X = mass of cylinder from balance. Y = mass of cylinder from 2nd balance. Suppose $\sigma_X$ = 0.03 and $\sigma_Y$ = 0.04. Then, if we calculate the difference between the two weighings, X – Y, we know
$\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2} = \sqrt{0.03^2 + 0.04^2} = \sqrt{0.0009 + 0.0016} = \sqrt{0.0025} = 0.05$

Independent Trials
DEF’N: The INDEPENDENT TRIALS MODEL occurs when
(i) n independent trials are studied
(ii) each trial results in a single binary obsv’n
(iii) each trial’s success has (constant) probability: P{success} = p
Notice that if P{success} = p, P{failure} = 1 – p.
We call this a BInS (Binary / Indep. / n is const. / Same p) setting.

Example 3.43
Ex. 3.43: Suppose 39% of organisms in a popl’n exhibit a mutant trait. Sample n = 5 organisms randomly and check for mutation:
• Binary? (mutant vs. non-mutant)
• Indep.? (if no bias in sampling)
• n const.? (n = 5)
• Same p? (p = 0.39)

Binomial Distribution
DEF’N: In a BInS setting, if we let Y = {# successes} then Y has a BINOMIAL DISTRIBUTION. NOTATION: Y ~ Bin(n, p).
The binomial probability function is
$P\{Y = j\} = {}_nC_j\, p^j (1 - p)^{n-j}$ (j = 0, 1, …, n).
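
The binomial probability function can be evaluated directly. Below is a minimal Python sketch, assuming Python 3.8+ so that the standard library's `math.comb` is available for the binomial coefficient nCj; the function name `binomial_pmf` is chosen for this illustration and is not part of DoStat or StatCrunch. It uses the setting of Ex. 3.43 (n = 5, p = 0.39).

```python
from math import comb


def binomial_pmf(j, n, p):
    """P{Y = j} for Y ~ Bin(n, p): nCj * p^j * (1 - p)^(n - j)."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)


# Setting from Ex. 3.43: n = 5 organisms, P{mutant} = 0.39.
n, p = 5, 0.39
probs = [binomial_pmf(j, n, p) for j in range(n + 1)]

for j, prob in enumerate(probs):
    print(f"P(Y = {j}) = {prob:.4f}")

# The probabilities over j = 0, ..., n should add to 1 (Rule 2).
print(f"total = {sum(probs):.4f}")
```

Listing the whole distribution, rather than a single value, makes it easy to check that the masses sum to 1.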

Binomial Coefficient
In the binomial probability function $P\{Y = j\} = {}_nC_j\, p^j (1 - p)^{n-j}$, the BINOMIAL COEFFICIENT is
${}_nC_j = \frac{n!}{j!\,(n-j)!}$
Also, j! is the FACTORIAL OPERATOR: j! = j(j–1)(j–2)…(2)(1). We define 0! = 1.

Factorial Operator
Example of factorial operator: at n = 5,
5! = (5)(4)(3)(2)(1) = 120
4! = (4)(3)(2)(1) = 24
3! = (3)(2)(1) = 6
2! = (2)(1) = 2
So:
j:     0   1   2    3    4   5
nCj:   1   5   10   10   5   1
(Also see Table 3.6 on page 105 of text.) Values of nCj are given in Table 2 (p. 674).

Table 3.6

Example 3.45
Ex. 3.45 (Ex. 3.43 cont’d): Y ~ Bin(5, 0.39); so
$P\{Y = 3\} = {}_5C_3 (.39)^3 (.61)^2$ = (10)(.0593)(.3721) = 0.22.
Can also find this via DoStat. Table 3.7 gives the full distribution. Figure 3.15 gives a probability histogram.

Binomial Mean & Variance
If Y ~ Bin(n, p), the population mean and variance are: $\mu_Y = np$ and $\sigma_Y^2 = np(1 - p)$.
Ex. 3.49: Y = {# Rh+ in BInS sample}. We’re given p = P{Rh+} = 0.85. So, if n = 6, we expect $\mu_Y$ = (6)(0.85) = 5.1 Rh+ in the sample, with $\sigma_Y^2$ = (6)(.85)(.15) = 0.765, so that $\sigma_Y$ = √.765 = 0.87 Rh+.

Chapter 4: The Normal Distribution
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.

Normal Distribution
DEF’N: A continuous random variable Y has a NORMAL DISTRIBUTION if its probability density can be written as
$f(y) = \frac{1}{\sigma_Y\sqrt{2\pi}}\, e^{-(y - \mu_Y)^2 / (2\sigma_Y^2)}$
over –∞ < y < ∞.
NOTATION: $Y \sim N(\mu_Y, \sigma_Y^2)$
The mean and variance of a normal dist’n are $E(Y) = \mu_Y$ and $E[(Y - \mu_Y)^2] = \sigma_Y^2$.

Normal Dist’n Examples
The Normal distribution appears in many biological contexts:
Ex. 4.1: Y = serum cholesterol (mg/dL)
Ex. 4.2: Y = eggshell thickness (mm)
Ex. 4.3: Y = nerve cell interspike times (ms)

Normal Curve
The Normal density curve is
(i) continuous over –∞ < y < ∞
(ii) symmetric about y = µ
(iii) unimodal, and hence “bell-shaped”

Figure 4.7
Since each (µ, σ²) pair indexes a different Normal dist’n, this represents a rich family of curves.

Standard Normal
DEF’N: The STANDARDIZATION FORMULA for Y ~ N(µ, σ²) is Z = (Y – µ)/σ. This is often called a ‘Z-score’.
If Y ~ N(µ, σ²), then Z ~ N(0,1) and we say Z has a STANDARD NORMAL dist’n.
Std. Normal probab’s are tabulated in Table 3 (p. 675) and on text’s inside front cover.

(Portion of) Table 3, p. 675

P(Z ≤ z)
Example: (p. 124) Suppose Z ~ N(0,1). Find P{Z ≤ 1.53}.
In Table 3, row 1.5 and column 0.03 give P{Z ≤ 1.53} = 0.9370.
Hint: “always draw the picture”

P(a < Z ≤ b)
If Z ~ N(0,1), and we find P{Z ≤ 1.53} = 0.937, notice then that P{Z > 1.53} = 1 – 0.937 = 0.063.
Example: (p. 125) Suppose Z ~ N(0,1); then
P{–1.20 < Z ≤ 0.80} = P{Z ≤ 0.80} – P{Z ≤ –1.20} = 0.7881 – 0.1151 = 0.6730. (See Fig. 4.11)
Can also find Std. Normal probabilities using DoStat’s Normal dist’n calculator!
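
Standard Normal probabilities can also be computed without a printed table. The sketch below is one possible approach, assuming Python 3.8+ and its standard-library `statistics.NormalDist` class; the variable names are illustrative only. It repeats the two table look-ups above. Small differences in the fourth decimal place can appear because the table values are rounded before they are subtracted.

```python
from statistics import NormalDist

# Standard Normal distribution, Z ~ N(0, 1).
Z = NormalDist(mu=0.0, sigma=1.0)

# P{Z <= 1.53} and its complement P{Z > 1.53}.
p_le_153 = Z.cdf(1.53)
p_gt_153 = 1.0 - p_le_153

# P{-1.20 < Z <= 0.80} = P{Z <= 0.80} - P{Z <= -1.20}.
p_between = Z.cdf(0.80) - Z.cdf(-1.20)

print(f"P(Z <= 1.53)         = {p_le_153:.4f}")
print(f"P(Z > 1.53)          = {p_gt_153:.4f}")
print(f"P(-1.20 < Z <= 0.80) = {p_between:.4f}")
```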

Empirical Rule, revisited
If Z ~ N(0,1), it mimics the empirical rule very closely. The same effect holds for any Y ~ N(µ, σ²).

Example 4.5
Ex. 4.5: Y = length of herrings (mm). Suppose Y ~ N(54, 20.25). Then we know
$Z = \frac{Y - 54}{\sqrt{20.25}} = \frac{Y - 54}{4.5} \sim N(0,1)$
(a) What % of fish are less than 60 mm long?
$P[Y < 60] = P\left[\frac{Y - 54}{4.5} < \frac{60 - 54}{4.5}\right] = P\left[Z < \frac{6}{4.5}\right] = P[Z < 1.33] = 0.9082$

Example 4.5 (cont’d)
Y = length of herrings ~ N(54, 20.25).
(c) What % of fish are between 51 and 60 mm long?
$P[51 < Y < 60] = P\left[\frac{51 - 54}{4.5} < \frac{Y - 54}{4.5} < \frac{60 - 54}{4.5}\right] = P\left[\frac{-3}{4.5} < Z < \frac{6}{4.5}\right] = P[-.67 < Z < 1.33]$
$= P[Z < 1.33] - P[Z < -.67] = 0.9082 - 0.2514 = 0.6568$

Std. Normal Tail Areas
We can also INVERT the std. Normal table (Table 3): Z ~ N(0,1), so find P{Z < 1.96} = 0.975. Then we know P{Z > 1.96} = 1 – 0.975 = 0.025. So, 2.5% of the std. normal popl’n exceeds 1.96.

$z_\alpha$
More generally, if we find some number $z_\alpha$ such that P{Z ≤ $z_\alpha$} = 1 – α, we know P{Z > $z_\alpha$} = α, and vice versa.

Std. Normal Critical Point
DEF’N: The UPPER-α CRITICAL POINT from Z ~ N(0,1) is the value $z_\alpha$ such that P{Z > $z_\alpha$} = α.
Find $z_\alpha$ by:
• carefully inverting Table 3
• reading off the bottom row (df = ∞) of Table 4 (p. 677)
• using DoStat’s Normal dist’n calculator

Percentiles
DEF’N: The point of a distribution below which p% lies is the p th PERCENTILE of the dist’n.
If Z ~ N(0,1), $z_\alpha$ is the 100(1 – α)th percentile of Z.
We often ask what value is the p th percentile of a biological population (see Ex. 4.6).
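
Critical points and percentiles amount to inverting the Normal distribution function, which can be done numerically instead of reading Table 3 backwards. A minimal Python sketch follows, assuming Python 3.8+ and the standard library's `statistics.NormalDist`; the helper names `upper_critical_point` and `normal_percentile` are made up for this illustration. The herring parameters (mean 54 mm, SD 4.5 mm) are those of Ex. 4.5.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def upper_critical_point(alpha):
    """Return z_alpha such that P{Z > z_alpha} = alpha for Z ~ N(0, 1)."""
    return STD_NORMAL.inv_cdf(1.0 - alpha)


def normal_percentile(mu, sigma, p):
    """Return y* with P{Y < y*} = p for Y ~ N(mu, sigma^2).

    Here p is a proportion strictly between 0 and 1 (e.g. 0.70 for the
    70th percentile), and sigma is the SD, not the variance.
    """
    return mu + sigma * STD_NORMAL.inv_cdf(p)


z_025 = upper_critical_point(0.025)
herring_70th = normal_percentile(54.0, 4.5, 0.70)

print(f"z_0.025 = {z_025:.2f}")
print(f"70th percentile of herring length = {herring_70th:.2f} mm")
```

The percentile helper simply reverses the standardization formula: it finds the Z-score first and then converts back with y* = µ + zσ.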

Example 4.6

Example 4.6 (cont’d)
We want to find y* such that P{Y < y*} = 0.70. This is
$P\left[\frac{Y - 54}{4.5} < \frac{y^* - 54}{4.5}\right] = P\left[Z < \frac{y^* - 54}{4.5}\right]$
Now, from Table 3 we find P{Z < 0.52} = 0.6985 is close to 0.70. This tells us to equate (approximately) 0.52 and (y* – 54)/4.5:
y* – 54 ≈ (0.52)(4.5)
y* ≈ (0.52)(4.5) + 54 = 56.34

Example 4.6 (conclusion)
So, we find that approximately 70% (69.85%, exactly) of herring are less than 56.34 mm long.
Notice also that we derived the critical point $z_{0.30}$ ≈ 0.52. (More precisely, we found $z_{0.3015}$ = 0.52.)
Using DoStat, we can find $z_{0.30}$ = 0.5244: this yields the exact value y* = (0.5244)(4.5) + 54 = 56.36 for Example 4.6.

Assessing Normality
Since many statistical procedures are based on having data from a normal population, we need ways to assess whether it is reasonable to use a normal model. We have shown that a histogram can be distorted by the selection of group size (binwidth), so we will consider a statistical graph called a normal probability plot or QQ plot.

QQ Plots
A QQ Plot can be used to assess normality of the data. A QQ Plot is a scatter plot of the ordered pairs of normal score (x) vs. data value (y) for all values in a data set. If the plotted points show a linear pattern, we can infer that the data values follow a normal distribution.

Example
The heights in inches of 11 women are listed below. Check the assumption that the data are distributed normally.
61, 62.5, 63, 64, 64.5, 65, 66.5, 67, 68, 68.5, 70.5

Normal Probability Plot of the Height Data

Example
Measurements made for 62 mammals.
Reference: Sleep in Mammals: Ecological and Constitutional Correlates, by Allison, T. and Cicchetti, D. (1976), Science, November 12, vol. 194, pp. 732-734.
Variable: Brain Weight (g)

Normal Probability Plot of Brain Weight (g)

Normal Probability Plot of log(brainweight(g))

Chapter 5: Sampling Distributions
Selected tables and figures from Samuels, M. L., and Witmer, J. A., Statistics for the Life Sciences, 3rd Ed. © 2003, Prentice Hall, Upper Saddle River, NJ. Used by permission.

Sampling Variability
Question: If Y is random, say Y ~ N(µ, σ²), and we take a random sample, Y1, Y2, …, Yn, aren’t the Yi’s also random?
And, if the Yi’s are random, aren’t any statistics based on them, such as $\bar{Y}$ or $S^2$, also random?
This is known as SAMPLING VARIABILITY.

Sampling Distributions
Because a sample statistic is itself random, it has a probab. dist’n of its own, called the SAMPLING DISTRIBUTION of the statistic.
Think of it as repeatedly taking a new sample from the same popl’n and finding each sample mean, ad infinitum.
• What will the probab. histogram/density function of the sample mean look like?
The textbook calls this a Meta-Experiment.

Binary Data
Recall that for Y ~ Bin(n, p) we can estimate p, if it is unknown, using the SAMPLE PROPORTION:
$\hat{p} = \frac{Y}{n}$
Since Y is random, so is this statistic.
What is the sampling dist’n of $\hat{p}$?

Example 5.4
Ex. 5.4: Y = # of people with 20/15 vision (“superior”). Say n = 2. We are given P{superior} = 0.3. Let $\hat{p} = Y/n$. What are its possible values? Clearly, Y = 0, 1, or 2, so $\hat{p}$ = 0, 1/2, or 1. Thus, e.g.,

$$P\left[\hat{p} = \tfrac{1}{2}\right] = P[Y = 1] = {}_2C_1\,(0.3)^1(0.7)^1 = (2)(0.3)(0.7) = 0.42$$

Example 5.4 (cont’d)
Sampling dist’n of $\hat{p}$:

j                     0       1       2
$\hat{p} = j/2$       0       1/2     1
$P(Y = j)$            0.49    0.42    0.09
$P(\hat{p} = j/2)$    0.49    0.42    0.09

Large-Sample Dist’n
Example 5.4 gives the sampling dist’n at n = 2. The effort gets harder as n increases. (Try it at n = 10….) Fig. 5.5 shows the effect at larger n:

[Figure: Fig. 5.5]

Continuous Data
DEF’N: Given a random sample, $Y_1, Y_2, \ldots, Y_n$, where $E(Y_i) = \mu$ and $E[(Y_i - \mu)^2] = \sigma^2$, then
(i) the POPL’N MEAN of $\bar{Y}$ is $E(\bar{Y}) = \mu$
(ii) the POPL’N VARIANCE of $\bar{Y}$ is $\sigma_{\bar{Y}}^2 = \sigma^2 / n$
(iii) the POPL’N SD of $\bar{Y}$ is $\sigma_{\bar{Y}} = \sigma / \sqrt{n}$
Notice: same popl’n mean, while SD ↓ as n ↑.

Distribution of the Sample Mean
If $Y_i \sim$ i.i.d. $N(\mu, \sigma^2)$ for i = 1,…,n, then

$$\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Once again:
• Same mean
• SD ↓ as n ↑
• So, more precision as n ↑

Example 5.9
Ex. 5.9: Y = weight of seeds ~ N(500, 14400). Suppose n = 4. Since Y is normal, so is the sample mean:

$$\bar{Y} \sim N\left(500, \frac{14400}{4}\right) = N(500, 3600)$$

And so,

$$Z = \frac{\bar{Y} - 500}{\sqrt{3600}} = \frac{\bar{Y} - 500}{60} \sim N(0, 1)$$

Example 5.9 (cont’d)
So, e.g.,

$$P[\bar{Y} > 550] = P\left[\frac{\bar{Y} - 500}{\sqrt{3600}} > \frac{550 - 500}{\sqrt{3600}}\right] = P\left[Z > \frac{50}{60}\right] \approx P[Z > 0.83] = 1 - P[Z < 0.83] = 1 - 0.7967 = 0.2033$$

CLT
Theorem: The CENTRAL LIMIT THEOREM states that for any i.i.d.
random sample, $Y_1, Y_2, \ldots, Y_n$, where $E(Y_i) = \mu$ and $E[(Y_i - \mu)^2] = \sigma^2$,

$$\bar{Y} \mathrel{\dot{\sim}} N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty.$$

For finite n this holds only approximately, and the approximation improves as $n \to \infty$. (A powerful tool!)

CLT and Sample Size
Sometimes, the CLT kicks in after only a few observations (small n). But sometimes we need a very large n:

Example 5.13
Ex. 5.13: Y = # eye facets in fruit fly.
• Clearly Y is a count and can’t be exactly normal (see the idealized plot in Fig. 5.13).
• But, by about n = 32, the sampling dist’n of $\bar{Y}$ is close to normal:

Unbiased Estimation
Parameters such as µ or p are usually unknown, and we use the sample data to estimate them.
DEF’N: If an estimator $\hat{\theta}$ of an unknown parameter $\theta$ has the property that $E(\hat{\theta}) = \theta$, we say it is an UNBIASED ESTIMATOR. (An estimator with $E(\hat{\theta}) \ne \theta$ is BIASED.)
For instance, we know $E(\bar{Y}) = \mu$, so $\bar{Y}$ is unbiased for µ.

Standard Error
DEF’N: The STANDARD ERROR of a point estimator is the estimated SD (the square root of the estimated variance) of the estimator:

$$\mathrm{SE}(\hat{\theta}) = \sqrt{\widehat{\mathrm{Variance}}(\hat{\theta})}$$

DEF’N: The STANDARD ERROR OF THE MEAN (SEM) is the estimated SD of the sample mean:

$$\mathrm{SE}(\bar{Y}) = \sqrt{\frac{\hat{\sigma}_Y^2}{n}} = \sqrt{\frac{S_Y^2}{n}} = \frac{S_Y}{\sqrt{n}}$$

Examples 6.1-6.2
Ex. 6.1-6.2: Y = stem length of soybean plants (cm), n = 13. We find $\bar{Y} = 21.34$ cm and $S^2 = 1.486$, so

$$\mathrm{SE}(\bar{Y}) = \sqrt{\frac{1.486}{13}} = \frac{1.22}{\sqrt{13}} = 0.338 \text{ cm}$$

SE vs. SD
DO NOT confuse the SE with the SD! In Ex. 6.2, the SD of the sample was $S = \sqrt{1.486} = 1.22$, but the SEM was $1.22/\sqrt{13} = 0.34$. (Usually, we round SEM to 2 signif. digits.) Notice here again that as n ↑, SEM ↓: more precision in larger samples.
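The arithmetic in Examples 6.1-6.2 can be checked with a short sketch. It uses only the summary values quoted above (n = 13, S² = 1.486); the raw stem lengths are not reproduced here, and the variable names are illustrative.

```python
import math

# Summary values quoted in Examples 6.1-6.2 (soybean stem lengths, cm).
n = 13
sample_variance = 1.486  # S^2, in cm^2

sd = math.sqrt(sample_variance)  # sample SD: spread of individual plants
sem = sd / math.sqrt(n)          # standard error: spread of the sample mean

print(f"SD  = {sd:.2f} cm")   # SD  = 1.22 cm
print(f"SEM = {sem:.3f} cm")  # SEM = 0.338 cm

# Holding S fixed (an assumption made only for illustration),
# the SEM shrinks as the sample size grows.
for size in (13, 52, 208):
    print(f"n = {size:3d}: SEM = {sd / math.sqrt(size):.3f} cm")
```

The SD describes the variability among individual plants and does not shrink systematically with n, whereas the SEM describes the variability of the sample mean and is divided by √n.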