C-N M120 Lecture Notes
Based very loosely on Bluman’s Elementary Statistics (2008)
B. A. Starnes
4 – 2014

Lecture

Why do we study and/or use statistics? Is it for academic harassment ... to make sure that students in various majors run the “gauntlet”? Is it to appease baseball fans with meaningless reams of numbers? Is it to employ meaningless analysis to “prove” things we already know? Is it to actually determine whether a potential cure for some ailment ends up killing someone? How about hurricane path predictions, insurance rate setting, and agricultural, oil spill, and tsunami predictions? These are all good questions that we hope to answer in time. The last questions are the best ones, and the answers to those are all “yes”!

We will begin our study of Statistics with the notion of probability. Wackerly, Mendenhall and Scheaffer (2002) define this as the belief in the occurrence of a future event. This is a good definition. We will say, more precisely, that it is man’s guess at what God is going to do.

There are three major interpretations of probability: subjective, relative frequency and classical. Although we will study the classical interpretation in depth, the other two are quite useful in certain situations. For instance, we know that the law of sin is at work in mankind because of the constant failure of world systems to maintain peace or prevent crime. Or, one might flip a coin 100 times, note that the number of heads is 52, and conclude that heads will occur about half the time. These are obvious examples of the relative frequency interpretation of probability.

Here are some examples of subjective probability. Someone might feel that C-N has a 5% chance of winning the NCAA DII national football championship. Mel Kiper makes a good living off of assessing the future NFL value of college football players. Both of these would be examples of the subjective interpretation of probability.
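The coin-flip illustration of the relative frequency interpretation can be tried on a computer. The following is only a minimal sketch, assuming a simulated fair coin; the function name and the seed are illustrative choices and do not come from the notes.

```python
import random


def relative_frequency_of_heads(n_flips, seed=0):
    """Flip a simulated fair coin n_flips times; return the fraction of heads."""
    rng = random.Random(seed)  # seeded (an arbitrary choice) so runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The observed fraction of heads tends to settle near 1/2 as flips accumulate.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With only 100 flips the fraction can wander a few points away from one half, much like the 52 heads in the example; with more flips it typically moves closer.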
In this course, we will concentrate on the classical (or theoretical) interpretation of probability. We can discuss the rules of probability at a later time. For now we will visualize the concepts using Venn diagrams. These are diagrams introduced by the English mathematician John Venn to illustrate concepts with sets. They are particularly useful for probabilities as well. We'll use one here to set out some definitions for the course.

[Figure: Venn diagram of sets A and B inside the set Ω]

Here we have some illustrations of sets A and B in the set Ω (the omega set). We can think of Ω as a dart board which the thrown dart can never miss. Sometimes the Ω set is also known as the sample space, universal set, or population. We will also assume that the dart can hit each point in Ω with equal likelihood ... that is, it is thrown randomly in Ω. The area of a specified set (A for example) divided by the total area (Ω) is the probability of the set being hit with the dart. Notice that in real applications, the concept of area can easily be replaced with the number of elements within the boundaries of each set, and the idea of throwing a dart replaced with a random selection.

In the graph above, the light blue area represents (A ∪ B)^c, read “the complement of, the quantity, A union B, the quantity”. The medium blue areas represent A\B and B\A, read “A remove B” and “B remove A” respectively. The darker blue represents A ∩ B, read “A intersect B”. Note finally that A ∪ A^c = Ω and A ∩ A^c = ∅, the “null” set.

Now any study of classical probability must include the rules of probability. As Andy Griffith once told Opie, “I like to have fun like everybody else, but we have to obey the rules!” The first rule is ... P(E) (“the probability of event E”) ∈ [0, 1]. This means that all probabilities must lie in the interval [0, 1]. The closer to certainty we are for event E, the closer the probability is to 1.
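The dart-board picture, with area replaced by counting, can be sketched with ordinary sets. This is a minimal illustration under stated assumptions: the twelve-point Ω and the particular members of A and B are made up for the example and do not come from the notes.

```python
from fractions import Fraction

# Hypothetical sample space: 12 equally likely points on the "dart board".
omega = set(range(1, 13))
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}


def complement(event):
    """Everything in omega that is not in the event."""
    return omega - event


def prob(event):
    """Classical probability: count in the event divided by count in omega."""
    return Fraction(len(event), len(omega))


print("P(A)         =", prob(A))
print("P(A ∩ B)     =", prob(A & B))
print("P(A \\ B)     =", prob(A - B))
print("P((A ∪ B)^c) =", prob(complement(A | B)))

# The set identities and the first rule from the text hold for this example.
assert A | complement(A) == omega
assert A & complement(A) == set()
assert all(0 <= prob(e) <= 1 for e in (A, B, A & B, A | B, set(), omega))
```

Using Fraction keeps the probabilities exact, which makes it easy to compare them with hand calculations.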
Do we/you know of something having a probability of 0 or 1? Of course not! These probabilities belong solely to God. Man's realistic probabilities lie on the open interval (0, 1). But in the course of our study we will write probabilities of 0 or 1 for events that are highly unlikely or very nearly certain, respectively. Here's a brief video to illustrate this. http://www.youtube.com/watch?v=KX5jNnDMfxA&feature=related

Next we wish to assert some relation rules which will help us ascertain various probabilities. The Complementation Rule states that for an event A, P(A) = 1 - P(A^c). That should make sense based on the notion of the complementary event. Here is an example.

Ex. The probability of rain tomorrow in Jefferson City is given to be 70% (or .7). What is the probability that it doesn't rain? P(R) = 1 - P(R^c), so P(R) - 1 = -P(R^c), and 1 - P(R) = P(R^c) = 1 - .7 = .3. The probability that it doesn't rain in Jefferson City tomorrow is 30% or .3. []

Lecture

The General Addition Rule for 2 events is given by ... P(A ∪ B) = P(A) + P(B) - P(A ∩ B). We also call this “the Pirate Rule” because its acronym is “GAR” ... matey! The General Addition Rule for n events is an extension of this that the reader should attempt to figure out.

Ex. Consider the situation in which someone is selected at random from the famous Carson-Newman Marching Eagle band represented by this contingency table.

C-N Band   Front   Brass   Wind   Percussion   Total
Female        14      12     24           10      60
Male           0      14     16           10      40
Total         14      26     40           20     100

Notice that the contingency table is, in fact, Ω, the universal set for this experiment. Observe the following probabilities.
P(Male) = 40/100 = .4.
P(Female) = P(Male^c) = 1 - .4 = .6.
P(Brass) = 26/100 = .26.
P(Brass ∪ Male) = P(Brass) + P(Male) - P(Brass ∩ Male) = .26 + .4 - .14 = .52.
Let's see, there are 40 males and 12 females who play brass.
Yes, 52 of the C-N band members are either male or play brass! []

Lecture

It's important to see in the previous example that the events Male and Female are considered to be disjoint ... that is, they cannot occur simultaneously. Whenever we deal with a contingency table, we'll always consider events on the same side of the table to be disjoint. In this case P(Male ∩ Female) = 0.

The conditional probability of event A, given that (or conditioned on) event B has already occurred, is written P(A|B) and is defined, for P(B) > 0, as P(A|B) = P(A ∩ B)/P(B). We can use this type of probability for various things ... not the least of which is joint probability! Now we can express P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B). The reader should contemplate this last equation.

This brings us to the notion of dependence and independence. Two events are considered to be independent if the outcome of one does not affect the outcome of the other. That is, P(A|B) = P(A) (or vice versa). If this does not occur, the events are considered to be dependent. Another way to view this mathematically is P(A|B) = P(A ∩ B)/P(B) = P(A) => P(A ∩ B) = P(A)P(B). So then, there are two equivalent definitions of independence.

Another way of looking at these concepts is to consider information. Check out this commercial from AT&T. The gentleman arriving late to the party asks if the invitation is information he “would like to know”. If the prior occurrence of B leads to greater information on P(A), then that is indeed information you would like to know and the two events are dependent. If the prior occurrence of B gives no more information on P(A), then that information is unhelpful and the two events are considered independent.

Ex. Consider again the C-N Marching Band example from above. Observe the following:
P(Male|Brass) = 14/26 = .5385 (approximately).
P(Brass|Male) = 14/40 = .35.
Note that they are not equal.
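The band calculations so far (the General Addition Rule and the two conditional probabilities) can be reproduced from the contingency table. This is a minimal sketch rather than a prescribed method: the dictionary layout and the helper names are assumptions made for illustration, while the counts are the ones in the C-N Band table.

```python
from fractions import Fraction

# Counts from the C-N Band contingency table.
band = {
    "Female": {"Front": 14, "Brass": 12, "Wind": 24, "Percussion": 10},
    "Male":   {"Front": 0,  "Brass": 14, "Wind": 16, "Percussion": 10},
}
total = sum(sum(row.values()) for row in band.values())  # size of omega


def p_sex(sex):
    """Marginal probability of a row event, e.g. P(Male)."""
    return Fraction(sum(band[sex].values()), total)


def p_section(section):
    """Marginal probability of a column event, e.g. P(Brass)."""
    return Fraction(sum(row[section] for row in band.values()), total)


def p_joint(sex, section):
    """Joint probability, e.g. P(Brass ∩ Male)."""
    return Fraction(band[sex][section], total)


def p_union(sex, section):
    """General Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)."""
    return p_sex(sex) + p_section(section) - p_joint(sex, section)


def p_sex_given_section(sex, section):
    """Conditional probability, e.g. P(Male | Brass)."""
    return p_joint(sex, section) / p_section(section)


def p_section_given_sex(section, sex):
    """Conditional probability, e.g. P(Brass | Male)."""
    return p_joint(sex, section) / p_sex(sex)


print("P(Brass ∪ Male) =", float(p_union("Male", "Brass")))
print("P(Male | Brass) =", float(p_sex_given_section("Male", "Brass")))
print("P(Brass | Male) =", float(p_section_given_sex("Brass", "Male")))
```

Comparing p_sex_given_section("Male", "Brass") with p_sex("Male") is one way to test a pair of events for independence.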
Also observe the relationships between Brass and Male.
P(Brass ∩ Male) = .14 = P(Brass)P(Male|Brass) = .26(14/26).
P(Male|Brass) = .5385 ≠ P(Male) = .4.
Therefore these two events are considered to be dependent. Can you find any pairs of independent events in Ω? []

Lecture

Let's have a quick review of various symbols and their meanings.
1. ∩ - “intersection”, “and”
2. ∪ - “union”, “or”
3. | - “given”
4. E^c - “not E”, “E complement”
5. P(E) - “the probability of event E”

Practice saying these things aloud in order to develop the relationships in your mind. Then, using the file given on the CMS content page, try using the Bucket Fill tool to identify some compound events for yourself.

Student problem set A

A1. Consider selecting someone at random from Virginia Tech's Marching Virginians with the following contingency table:

VT Band   Front   Brass   Wind   Percussion   Total
Female       27      55     70            8     160
Male          3      75     45           37     160
Total        30     130    115           45     320

Find the following ... a) P(Front|Female), b) P(Female|Front), c) P(Front ∩ Female), d) P(Front ∪ Female), e) P(Female), f) P(Wind). g) Are the events Brass and Male independent? Why?

A2. In a regular 52 card deck (this is your Ω set), find the following ... a) P(Red) b) P(Ace) c) P(Heart) d) P(Face) e) P(Red | Ace) f) P(Red | Heart) g) P(Heart | Red) h) P(Face | Black).

A3. John and Beth Dickerson plan to have three children. Assuming they are successful, find the Ω set for the outcomes of this family (use a tree diagram if necessary). What is the probability they will have two boys and a girl? What is the probability they will have two boys and a girl, in that order, given that they have two boys and a girl?

A4. Using the image in the course content, graphically illustrate the following sets: a) A ∩ B^c. b) A^c ∪ B^c. c) A^c ∩ B^c.

A5. If P(A | B) = .35 = P(B | A), and P(B) = .3, find P(A ∩ B) and P(A ∩ B^c).
A6. If a Christmas secret passes through three people to a fourth person, and each of the first three independently has a 95% probability of telling the secret accurately ... a) what is the probability that the secret is successfully transmitted to the fourth person? b) what is the probability that at least one person incorrectly transmits the information on its way to the fourth person?

A7. Using the information in problem A6, what is the probability that the fourth person successfully gets the information given that the first person correctly transmits the information to the second one?

Student Solution set A

A1. a) P(Front|Female) = 27/160 = .16875, b) P(Female|Front) = 27/30 = .9, c) P(Front ∩ Female) = 27/320 = .0844, d) P(Front ∪ Female) = 163/320 = .5094, e) P(Female) = 160/320 = .5, f) P(Wind) = 115/320 = .3594, g) No, P(Brass|Male) = .46875 ≠ .40625 = P(Brass).

A3. Ω = {GGG, GGB, GBG, BGG, GBB, BGB, BBG, BBB}. P(2 Boys and Girl) = 3/8 = .375. P(BBG|2 Boys and Girl) = 1/3 = .3333. []

Lecture

Next, let's define the notion of the random variable X. If there is a function X: Ω → A (⊂ ℝ), then this is the random variable X. Here, Ω is exactly what we defined before (the sample space, or the universal set).

Okay, okay, what's this all about? Well, think of X as a big cat (like a Kentucky Wildcat! Where did that come from?). The cat is sitting on the classroom podium and JUMPS out on someone or something in the Ω space (the class of students). And then it turns and reports back a value ... like age ... or length ... or hair color ... or place of origin. We can classify the random variable according to what it reports back.

Random variables (r.v.s) that report back a numerical quantity are called quantitative r.v.s. Random variables that report back a non-numerical value (like hair color) are referred to as qualitative r.v.s.
Moreover, quantitative r.v.s for which the range is an interval are called continuous r.v.s, and those for which the range is a finite or countable set are called discrete r.v.s. Another way to recognize discrete r.v.s is that there is separation between the possible values. The following chart should help.

                 RV
               /    \
    Qualitative      Quantitative
                      /        \
               Discrete      Continuous

A general mathematical convention is to use letters from the last part of the Roman (or Latin) alphabet for random variables (X, Y, Z), while letters at the beginning (a, b, c, d) are normally reserved for constants.

Ex. Height is a quantitative continuous r.v. It is numerical and can take values on an interval. Birthplace is a qualitative r.v. since its values are not numerical and we can't order them. Number of siblings is a quantitative discrete r.v. since it is numerical, but a person cannot have 4.3 siblings. []

Lecture

So now that we have an idea about the r.v. X, we want to introduce the idea of a distribution. Here's the dealio! A distribution is the random variable's numerical occurrence pattern. Think about this. What would one need to do to describe the occurrence pattern? Maybe one would want to say how and where the r.v. occurs most. Maybe one would want to indicate an expectation of the occurrences (like giving an average value). Maybe one would want to describe how far apart the occurrences are. These measures are called the shape, location and spread of the distribution, respectively.

At this point we will plan on using the terms random variable and distribution interchangeably. Are they the same word? No, but they do have similar adjectives.

The quantifications of these notions of shape, location and spread are called parameters of the r.v. All of these are important, so we use a graph to illustrate each facet in a consolidated picture.
The function on the right is called the probability density function (or pdf), OR in the discrete case the probability mass function (or pmf). The pmf graph appears as “points in space”, while the pdf graph appears as a smooth curve. Here's a picture of a pdf (and cdf ... to the left ... ask your teacher about that) to help get us started.

[Figure: a pdf (right) with its cdf (left)]

Ex. Take, for instance, the situation in which we track the movement of a vehicle through an intersection. In this case Ω = {Left, Straight, Right}. Define X: Ω → ℤ as X(Left) = -1, X(Straight) = 0, and X(Right) = 1 (note that A = {-1, 0, 1}). X is a quantitative and discrete r.v. Further, define P(-1) = P(0) = P(1) = 1/3. This is an example of an r.v. with a Discrete Uniform distribution. This is a famous distribution which is also considered to be symmetric. Note too that P(vehicle turns) = P(-1) + P(1) = 2/3 ≈ .67. Here, we would also say that X is a symmetric r.v. []

James 1:17 Every good gift and every perfect gift is from above, and cometh down from the Father of lights, with whom is no variableness, neither shadow of turning.

Ex. Next suppose that we have another intersection with Y being the direction taken by the motorist, the same set A, but these probabilities.

 Y    P(Y)
-1    .2
 0    .7
 1    .1

What is the probability of a motorist turning here? What is the probability that the motorist turns left (wrong)? Is the spread of this r.v. greater than, less than, or about the same as that of X? []

Student problem set B

B1. Classify the following r.v.s: Birthplace, Letter grade in this class, ACT Score, Weight, Hair Color, Class Rank, Number of Siblings, Age.

B2. Describe the shape of the distributions affiliated with the same r.v.s listed in B1 for the current class.

B3. Give the shape, location and spread of the discrete uniform distribution (X) in the intersection problem above.

B4.
Give the shape, location and spread of the variable Y from the second intersection problem above.

Student Solution Set B

B1.
Birthplace    Qualitative
ACT Score     Quantitative Discrete
Weight        Quantitative Continuous
Hair Color    Qualitative
Age           Quantitative Continuous
[]

Lecture

For the large remainder of the course, we will focus on quantitative r.v.s ... which have dim = 1. These quantitative r.v.s, then, have distributions that we can study in C-N MATH 120. In this case we can use a variety of graphs to give us insight into not only the shape, but the location and spread of the r.v. as well.

In some instances, particularly those in which the r.v. is either continuous or close to it, we utilize a stem 'n' leaf plot. This is somewhat similar to the histogram, except that it gives much more information. The idea is to divide the data values into stems that are multiples of a power of 10, and then list the values of the next lower power of 10 (the leaves) beside the appropriate stem. A histogram, on the other hand, is a stem 'n' leaf plot without the leaves!

Ex. Suppose we have a class of 25 students having the following commuting times (door to door) for getting to M201 (in min). Consider this a sample of 25 observations of the r.v. X ... the commuting time of a 2007 C-N student.

Data (min): 23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2, 45, 35, 2, 5, 5, 25, 20, 5, 5, 5, 2, 6.

And here is the data displayed in a stem 'n' leaf plot.

X (10 min) | X (min)
         0 | 2,2,2,2,2,5,5,5,5,5,5,5,6,6,6,6
         1 | 0,0,5,5
         2 | 0,3,5
         3 | 5
         4 | 5

Do you see what's going on in terms of the powers of 10? The left column deals with units of 10^1 while the right deals with units of 10^0. This can be done with other powers of 10, as you shall see. Since the “heights” of the stacks go “down” from left to right, we classify the shape of this variable as right skewed. Turn your head to the right to see this?
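The stem 'n' leaf construction described above can be sketched in a few lines. This is a minimal illustration, assuming non-negative whole-number data and stems of 10; the function name stem_and_leaf and the variable names are hypothetical and not part of the notes.

```python
from collections import defaultdict

# Commuting times (min) from the example above.
commute_minutes = [23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2,
                   45, 35, 2, 5, 5, 25, 20, 5, 5, 5, 2, 6]


def stem_and_leaf(data, stem_unit=10):
    """Group non-negative integers into stems (tens) with sorted leaves (ones)."""
    rows = defaultdict(list)
    for x in sorted(data):
        rows[x // stem_unit].append(x % stem_unit)
    # Keep empty stems so any gaps in the data remain visible in the plot.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


plot = stem_and_leaf(commute_minutes)
for stem, leaves in plot.items():
    print(stem, "|", ",".join(str(leaf) for leaf in leaves))
```

Sorting the data first means the leaves on each stem come out in increasing order, as in the hand-drawn plot.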
This can also be generated using the TI calculator, SAS or R. []

Lecture

There are obviously ways to quantify what we're seeing in the distributions in terms of shape, center and spread ... particularly the last two. As mentioned earlier, these are called the parameters of the r.v. (or distribution). We shall now consider the following parameters of location: mean, median, mode.

The mean of a r.v. X is its expectation. That is, it's where we expect the r.v. to occur in general. Sometimes the mean is written E(X) for “expectation of X”. Yet another way to express the mean is to say the arithmetic average of X. The mean in statistics is the same thing mathematically as the first raw moment in physics. It is a function of distance and mass. As a rule, the mean of a random variable is assigned the symbol μ and is the balance point of the distribution. The mean gives a fair approximation for the location of a r.v., but the drawback is that it is susceptible to extreme values.

A more robust (less susceptible) parameter is the median. The median is the probabilistic midpoint of the distribution. It is the value for which the r.v. X has an equal probability (or likelihood) of occurring above or below. The median is also called the 50th %ile or Q2. Similarly, Q1 and Q3 represent the 25th %ile (First Quartile) and 75th %ile (Third Quartile) of X, respectively. The median is estimated by taking the “middle value(s)” in the ordered data set. Can you see why this would be a more robust parameter?

Finally, the mode of a r.v. is simply the value of X corresponding to the highest point(s) of the pdf or pmf. It is estimated by finding the value in a sample data set that occurs most often. An interesting note is that the order in which the mean, median and mode occur often follows the skew of the r.v.

Ex. Consider the r.v. X having the following graph ... Notice that the r.v.
is quantitative continuous and left skewed. Consequently, the mean, median and mode will occur from left to right. The student should also attempt to locate the quartiles on the graph. []

Lecture

Measures of spread for a r.v. X are not as easily quantified. Nevertheless, they are useful. We shall consider the following parameters of spread: variance, standard deviation, range.

Like the expectation (or mean), the variance is a moment. In the case of the mean we are looking at an expected location for the r.v. X. Here, we are looking at an expected squared missed distance from μ, although we can't get at that just yet. If we simply added up the missed distances from μ together, we'd always get 0, so we have to use the squared missed distances to get a meaningful value. The expectation of this squared missed distance is called the variance, and is denoted σ².

This is all well and good, but something is lost in the translation ... namely the unit of X. It's hard to see the variance on a graph of X. So we can take the square root of the variance and have (essentially) what we wanted to begin with ... the average missed distance ... sort of. This is called the standard deviation of X and is designated with σ. This is probably the most useful measure of spread.

The range of X is simply that ... the distance between the smallest and largest possible values of X. Note that this is the length of A, the other range ... from back when we first defined r.v.s.

Ex. Consider again the r.v. from the last example. Notice that we have depicted here three of the spread parameters. The variance would make little to no sense here. []

Lecture

So is it enough to just look at pictures and estimate the parameters of the r.v.s? Of course not! Again, from the (hopefully) random sample of observations (data set) of the r.v.
we estimate the aforementioned parameters with these estimators ... er ... statistics. Since they are obtained from a data set, we use the word “sample” in front of them. Here is a recap.

Statistic           Formula / description                          Estimates
Sample Shape        Left skewed, Right skewed, Symmetric
Sample Mean         X̄_n = (Σ_{i=1}^{n} x_i)/n                      μ
Sample Median       middle value in the ordered data set           X_.5
Sample Mode         the most recurrent value in the data set
Sample Quartiles    the medians of the halves (ordered data set)   X_.75, X_.25
Sample Variance     s² = Σ_{i=1}^{n} (x_i - X̄_n)²/(n - 1)          σ²
Sample Std Devn     s = √(s²)                                      σ
Sample Range        max(x_i) - min(x_i)

We divide by n - 1 in the variance estimate to account for using the data to estimate one parameter already (the mean) and thereby losing a degree of freedom.

Ex. Suppose we have a class of 25 students having the following commuting times (door to door) for getting to M201 (in min).

Data (min): 23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2, 45, 35, 2, 5, 5, 25, 20, 5, 5, 5, 2, 6.

Find the aforementioned shape, location and spread estimators for these bad boys. Earlier we estimated the shape to be right skewed from the stem 'n' leaf plot.
X̄_25 = 10.68 min, X_.5 = 6 min, mode = 5 min.
(From the stem 'n' leaf plot) First Quartile = 5 min, Third Quartile = 15 min.
s = 11.0707 min, s² = 122.56 min², IQR = Q3 - Q1 = 10 min, Range = 43 min.

And finally, one quick way to summarize all three features of a r.v. X is to give what is known as the 5 number summary. This is simply min(x_i), Q1, Q2, Q3, max(x_i) ... in that order. The 5 number summary, (2, 5, 6, 15, 45) min, is useful in creating the boxplot (below). The boxplot reveals the shape, location and spread of the r.v. X. The 5 number summary can also be found easily using the TI calculator, SAS or R. []

Student problem set C

C1. Here we have a data set on the r.v. X = winter temperature in °C in Jefferson City.
Hourly readings (in °C) were taken on a randomly selected blustery winter day. f represents the frequency for each reading.

X (°C)    f
   7      3
   8      3
   9      5
  10      6
  11      7
total    24

Find the sample mean, sample median, sample mode, s², 5 number summary, s, and a shape description of this r.v.

C2. Consider the following data set of observations from the r.v. X, attendance at C-N home football games (Roy Harmon Field at Burke-Tarr Stadium).
Attendance (fan): 5130, 4292, 3315, 4093, 4714.
Find the sample mean, sample median, s, SSE, 5 number summary and a shape description for this r.v. Do a boxplot to confirm the shape.

C3. Another way of thinking of μ for a r.v. X is as its expectation. So for a discrete r.v. we have ... μ = E(X) = Σ x P(x). Suppose it costs $3 to play a game in which a player rolls two dice. If the player rolls a 2 or 12, she wins $20. If she rolls a 7, she wins $5. If X is the r.v. representing the amount gained by playing the game, find μ = E(X). Is it wise to play this game?

C4. Let X be the previously defined random variable in problem C3. Find the variance of X.

C5. Let X be the previously defined random variable in problem C3. Describe the shape of X.

C6. Let X = number of saves/week earned by the Indy Mustangs baseball team during the 2011 BRBA season. Thus far we have the following data set.
X (save): 5, 9, 3, 4, 5, 5, 2, 4, 5
Find the sample mean and sample standard deviation for X.

C7. Let X be the previously defined random variable in problem C6. Give the 5 number summary for X and describe the distribution.

Student Solution Set C

C1. 9.4583 °C, 10 °C, 11 °C, 1.9112 °C², (7 °C, 8.5 °C, 10 °C, 11 °C, 11 °C), 1.382 °C, Quantitative Continuous and Left Skewed.

C2. 4308.8 f, 4292 f, 684.297 f, 1873050 f², (3315 f, 3704 f, 4292 f, 4922 f, 5130 f), Quantitative Discrete and Left Skewed.

C3. $17 x (1/18) + $2 x (1/6) - $3 x (7/9) = -$1.06.
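The C2 and C3 answers can be checked with a short script. This is a minimal sketch, not the course's prescribed tool (the notes mention the TI calculator, SAS or R): the helper names five_number_summary and game_gain are hypothetical, and the quartiles are computed by the “medians of the halves” rule from the recap, leaving the middle value out of both halves when n is odd.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, median, stdev

# C2: attendance at C-N home football games (fans).
attendance = [5130, 4292, 3315, 4093, 4714]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the halves."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]        # values below the median position
    upper = xs[(n + 1) // 2:]   # values above the median position
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


print("mean   =", mean(attendance))
print("median =", median(attendance))
print("s      =", stdev(attendance))   # divides by n - 1
print("5 number summary =", five_number_summary(attendance))


# C3: net gain for each dice total, after paying $3 to play.
def game_gain(dice_total):
    if dice_total in (2, 12):
        return 20 - 3
    if dice_total == 7:
        return 5 - 3
    return -3


# E(X) = sum of x * P(x) over the 36 equally likely rolls of two dice.
expected_gain = sum(
    Fraction(game_gain(a + b), 36) for a, b in product(range(1, 7), repeat=2)
)
print("E(X) =", expected_gain, "or about", round(float(expected_gain), 2))
```

The same five_number_summary helper can be applied to the commuting times or the C6 saves data.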
[]

Lecture

This should be a good primer for C-N Math 201.