• A random variable is a variable whose values are numerical outcomes of a random experiment. That is, we consider all the outcomes in a sample space S and then associate a number with each outcome.
• Example: Toss a fair coin 4 times and let X = the number of Heads in the 4 tosses. We write the so-called probability distribution of X as a list of the values X takes on along with the corresponding probabilities that X takes on those values. We know X is B(4, .5).
• Recall the figure below showing the probability distribution of X. Each individual outcome has prob = 1/16, and using dbinom(0:4, 4, .5) you can find P(X = x), where x = 0, 1, 2, 3, 4.
• There are two types of r.v.s: discrete and continuous. An r.v. X is discrete if the number of values X takes on is finite (or countably infinite). In the case of any discrete X, its probability distribution is simply a list of its values along with the corresponding probabilities X takes on those values.
    Values of X:   x1   x2   …   xk
    P(X):          p1   p2   …   pk
  NOTE: each value of p is between 0 and 1 and all the values of p sum to 1.
• We display probability distributions for discrete r.v.s with so-called probability histograms. The next slide shows the Binomial probability histogram for X = # of Hs in 4 tosses of a fair coin. The next slide gives a similar example...
• Toss a fair coin until you get the first occurrence of "H". Let X = the number of the toss on which the first "H" appears.
• What are the possible values of X? What are the corresponding probabilities?
    Values of X:   1   2   3   4   5   6   …
    P(X):
• X is called a geometric r.v. and in R is computed with the dgeom, pgeom, qgeom, rgeom functions - the d, p, q, and r stand for the same functions we've seen before… Note that R's geometric functions count the number of failures before the first success, so for X as defined here P(X = x) is dgeom(x - 1, .5). What would the probability histogram look like?
• A continuous r.v. X takes its values in an interval of real numbers.
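Before moving on to the continuous case, the two discrete examples above (the binomial and the geometric) can be reproduced in R. This is only a minimal sketch: the variable names (p_binom, toss, p_geom) are made up for illustration, and the geometric part relies on the fact that dgeom(k, prob) gives the probability of k failures before the first success, so the toss number x corresponds to k = x - 1.

```r
# Binomial: X = number of Heads in 4 tosses of a fair coin, X ~ B(4, .5)
x <- 0:4
p_binom <- dbinom(x, size = 4, prob = 0.5)
names(p_binom) <- x
print(p_binom * 16)   # how many of the 16 equally likely outcomes give each x
print(sum(p_binom))   # the probabilities of a discrete r.v. sum to 1

# Geometric: X = number of the toss on which the first Head appears
# dgeom(k, prob) = P(k failures before the first success), so use k = toss - 1
toss <- 1:6
p_geom <- dgeom(toss - 1, prob = 0.5)
names(p_geom) <- toss
print(p_geom)         # each probability is half of the one before it
```

Passing p_binom or p_geom to barplot() draws the corresponding probability histogram.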
The probability distribution of a continuous X is described by a density curve, which lies on or above the horizontal axis, has total area 1 underneath it, and for which probabilities about X correspond to areas under the curve.
• The first example is the random variable which randomly chooses a number between 0 and 1 (perhaps using a spinner). This r.v. is called the uniform random variable and has a density curve that is completely flat! Probabilities correspond to areas under the curve... use punif(x) = P(X <= x) to get the areas under the uniform density; e.g., P(.3 < X < .7) = punif(.7) - punif(.3).
• A continuous random variable X takes all values in an interval. Example: There are an infinite number of values between 0 and 1 (e.g., 0.001, 0.4, 0.0063876). How do we assign probabilities to events in an infinite sample space? We use density curves and compute probabilities for intervals. The probability of any event is the area under the density curve for the values of X that make up the event.
• [Figure: uniform density curve for X on the interval from 0 to 1, with height = 1.] The probability that X falls between 0.3 and 0.7 is the area under the density curve for that interval (base x height for this density):
    P(0.3 ≤ X ≤ 0.7) = (0.7 – 0.3)*1 = 0.4
• For a continuous random variable, the probability of any single point is zero, since there is no area above a point! Only intervals can have a non-zero probability, represented by the area under the density curve for that interval. This makes the following statement true: the probability of an interval is the same whether boundary values are included or excluded:
    P(0 ≤ X ≤ 0.5) = (0.5 – 0)*1 = 0.5
    P(0 < X < 0.5) = (0.5 – 0)*1 = 0.5
    P(0 ≤ X < 0.5) = (0.5 – 0)*1 = 0.5
• P(X < 0.5 or X > 0.8) = P(X < 0.5) + P(X > 0.8) = 0.5 + 0.2 = 0.7, or equivalently 1 – P(0.5 ≤ X ≤ 0.8) = 1 – 0.3 = 0.7. (You may use either the "OR" Rule or the "NOT" Rule...)
• The other example of a continuous r.v. that we've already seen is the normal random variable. See the next slide for a reminder of how we've used the normal and how it relates to probabilities under the normal curve...

Continuous random variable and population distribution
• The shaded area under a density curve shows the proportion, or %, of individuals in a population with values of X between x1 and x2.
• Because the probability of drawing one individual at random depends on the frequency of this type of individual in the population, the probability is also the shaded area under the curve.
• [Figure: density curve with the shaded area = % of individuals with X such that x1 < X < x2.]

Mean of a random variable
• The mean x-bar of a set of observations is their arithmetic average.
• The mean µ of a random variable X is a weighted average of the possible values of X, reflecting the fact that all outcomes might not be equally likely.
• A basketball player shoots three free throws. The random variable X is the number of baskets successfully made ("H" = made, "M" = missed).
    Outcomes:      MMM   HMM MHM MMH   HHM HMH MHH   HHH
    Value of X:    0     1             2             3
    Probability:   1/8   3/8           3/8           1/8
• The mean of a random variable X is also called the expected value of X. What is the expected number of baskets made? Do the computations...
• We've already discussed the mean of a density curve as being the "balance point" of the curve… to establish this mathematically requires some higher level math… So we'll think of the mean of a continuous r.v. in this way. For a discrete r.v., we'll compute the mean (or expected value) as a weighted average of the values of X, the weights being the corresponding probabilities. E.g., the mean # of Hs in 4 tosses of a fair coin is computed as: (1/16)*0 + (4/16)*1 + (6/16)*2 + (4/16)*3 + (1/16)*4 = (32/16) = 2.
• In either case (discrete or continuous), the interpretation of the mean is as the long-run average value of X (in a large number of repetitions of the experiment giving rise to X).
• We've used mean(rbinom(1000, 5, .1)), for example, to simulate the mean of a binomial r.v. (n=5, p=.1).
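The weighted-average computation above can be sketched in a few lines of R. The helper ev() is a hypothetical name introduced here for illustration (it is not a built-in R function), and the free-throw probabilities are simply the ones from the table above.

```r
# Expected value of a discrete r.v.: sum of (value * probability)
# ev() is a made-up helper for this example, not part of base R
ev <- function(values, probs) {
  stopifnot(isTRUE(all.equal(sum(probs), 1)))  # probabilities must sum to 1
  sum(values * probs)
}

# Free throws: values 0..3 with the probabilities from the table above
ev_baskets <- ev(0:3, c(1, 3, 3, 1) / 8)
print(ev_baskets)

# Number of Heads in 4 tosses of a fair coin, probabilities from dbinom
ev_heads <- ev(0:4, dbinom(0:4, 4, 0.5))
print(ev_heads)

# Long-run average interpretation: the mean of many simulated values of X
set.seed(1)                      # arbitrary seed, only for reproducibility
sim_mean <- mean(rbinom(10000, 4, 0.5))
print(sim_mean)                  # should land close to ev_heads
```

The simulated mean will not match the exact value digit for digit, but it should get closer as the number of repetitions grows.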
• Look at the Pick 3 Lottery, like the old numbers game… you pay $1 to play (pick a 3-digit number), and if your number comes up, you win $500; otherwise, you win nothing.
• What is the probability that you win (i.e., that your 3 digits match the ones chosen that night)?
• What is the probability that you lose?
• Define X = your winnings when you play "Pick 3".
    Possible values of X:
    P(X):
  So what are your expected winnings?
• There's also a discrete uniform r.v.: like Table B that I've handed out… sample(0:9, 1, replace=TRUE) chooses 1 number from Table B; sample(0:9, 100, replace=TRUE) chooses 100.
• Let X = the digit chosen at random from Table B.
• What are the values of X? What are the corresponding probabilities? What is the mean?
• Use the sample function to simulate this and see if you can tell what the mean and s.d. are… HINT: Try this:
    z = numeric(1000); zz = numeric(1000)   # storage for 1000 sample means and sds
    for (i in 1:1000) {
      x = sample(0:9, 20, replace=TRUE)     # one sample of 20 random digits
      z[i] = mean(x); zz[i] = sd(x)
    }
    par(mfrow=c(1,2))                       # two plots side by side
    hist(z); hist(zz)
    mean(z); mean(zz)
• Now what if we look at means of samples of size n=20 of these 10 digits (0,1,…,9) - what does the distribution look like then? Try this R code:
    par(mfrow=c(1,1))                       # a single plot per page
    z = numeric(1000)
    for (i in 1:1000) {
      x = sample(0:9, 20, replace=TRUE)
      z[i] = mean(x)                        # keep only the sample mean
    }
    hist(z); mean(z); sd(z)
• What's going on here? Notice especially the standard deviation - compare the sd of this simulation of means of samples of 20 with the sd of the digits within a sample of 20… why is the sd of the means ~ .6 while the sd of the digits is ~ 2.9?
• Is there an intuitive reason why this might be happening?
• The mathematical reason is called the Central Limit Theorem.
• Central Limit Theorem: Suppose we take a large sample of size n from a population with mean = µ and sd = σ (call the sample X1, X2, … , Xn). If Xbar = mean(X1, X2, … , Xn), then the distribution of Xbar will look Normal with mean = µ and sd = σ/sqrt(n).
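To connect the CLT statement to the digit simulation above, here is a rough sketch that computes µ and σ for the population of digits 0 to 9 directly and compares σ/sqrt(n) with a simulated sd of sample means. The variable names are invented for this example, and the population sd is assumed to be the divide-by-N version (R's sd() divides by N - 1, so it is not used for the population).

```r
digits <- 0:9                            # the population: each digit equally likely
mu     <- mean(digits)                   # population mean
sigma  <- sqrt(mean((digits - mu)^2))    # population sd (divide by N, not N - 1)
n      <- 20                             # sample size used in the slides
clt_sd <- sigma / sqrt(n)                # sd of Xbar predicted by the CLT

set.seed(42)                             # arbitrary seed, only for reproducibility
xbar <- replicate(1000, mean(sample(digits, n, replace = TRUE)))

print(c(mu = mu, sigma = sigma, clt_sd = clt_sd,
        sim_mean = mean(xbar), sim_sd = sd(xbar)))
```

The individual digits have sd close to 2.9, while the means of 20 digits have sd close to 2.9/sqrt(20), which is the ~ .6 seen in the simulation.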
• All the situations we've been looking at prior to this are examples of the CLT:
  – Binomial(n, p) with large n tends to look Normal( np, sqrt(np(1-p)) )
  – p-hat with large n tends to look Normal( p, sqrt(p(1-p)/n) )
  – means of samples of size n=20 from the digits 0:9 tend to look Normal( 4.5, 2.87/sqrt(20) )
• Now you try these (Hand in next time):
  – Choose n=25 numbers at random from the interval [0,1] (see runif)
    • simulate this choice of 25 numbers enough times (1000) to convince you that the mean of this population of numbers from 0 to 1 is ~ .5 and the sd is ~ .29
    • so what would the CLT say about the mean of the samples of 25?
    • simulate this and verify by looking at a histogram and by computing the mean and sd of the means…
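The first CLT example listed above (the Binomial) can be checked numerically without any simulation. This is a sketch only: n = 100 and p = 0.5 are illustrative values chosen here, not taken from the slides, and the variable names are made up.

```r
n <- 100; p <- 0.5                       # hypothetical values, chosen for illustration
x   <- 0:n
pmf <- dbinom(x, n, p)                   # exact Binomial(n, p) probabilities

# Exact mean and sd of the binomial, computed from its distribution
exact_mean <- sum(x * pmf)
exact_sd   <- sqrt(sum((x - exact_mean)^2 * pmf))

# Mean and sd of the approximating Normal
clt_mean <- n * p
clt_sd   <- sqrt(n * p * (1 - p))

# Largest gap between P(X <= x) under the binomial and under the Normal
max_gap <- max(abs(pbinom(x, n, p) - pnorm(x, clt_mean, clt_sd)))

print(c(exact_mean = exact_mean, clt_mean = clt_mean,
        exact_sd = exact_sd, clt_sd = clt_sd, max_gap = max_gap))
```

The means and sds agree exactly; the gap between the two cumulative probabilities is small for this n and should shrink further as n grows.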