BIOINF 2118 N03 - Discrete Random Variables

Definition of a Random Variable

Consider an experiment with a sample space $\mathcal{X}$. A random variable is a real-valued function that is defined on the sample space $\mathcal{X}$. We write $X: \mathcal{X} \to \mathbb{R}$.

Capital Roman letters = random variables.
Lower case Roman letters = values of a random variable.
We use A for a subset of the sample space $\mathcal{X}$, and B for a subset of $\mathbb{R}$.

Random Variable Example

Consider an experiment in which we roll two dice. The sample space is $\mathcal{X} = \{(i, j) : i, j \in \{1, \dots, 6\}\}$.
Let X denote the sum of the two dice. X is a random variable.
Let Y denote the value of the first die. Y is a random variable.
Let Z denote the value of the first die divided by the value of the second die. Z is a r.v.

Types of Random Variables

Discrete random variable: The sample space $\mathcal{X}$ is countable (usually {0,1}, {0,1,...,n}, or {0,1,2,...}). The distribution is described by a probability mass function.
Continuous random variable: $\mathcal{X}$ is an interval (usually [0,1], [0, ∞), or (-∞, ∞)). Therefore $\mathcal{X}$ is UNcountably infinite. The distribution is described by a probability density function.

RVs discard information

Consider the sample space $\mathcal{X}_{dice}$ of results from throwing two dice. The size, or cardinality, of $\mathcal{X}_{dice}$, written $|\mathcal{X}_{dice}|$, is 36 (or 21 if the dice are indistinguishable).
Let S = the sum of the two dice. The sample space for S is $\mathcal{X}_{sum} = \{2, 3, \dots, 12\}$, so $|\mathcal{X}_{sum}|$ = _____?
Then S is a function $S: \mathcal{X}_{dice} \to \mathcal{X}_{sum}$.
Let B be the event "S=4". So $A = S^{-1}(B) = \{(1,3), (2,2), (3,1)\}$, and $S(A) = B = \{4\}$.
Three outcomes in $\mathcal{X}_{dice}$ map to the same value of S in $\mathcal{X}_{sum}$. You've "lost information". (But is it valuable information? Later we define "sufficient statistic".)
Then $\Pr(S = 4) = \Pr(A) = 3/36$.

Distribution of a RV

For any random variable S, the probability distribution of S specifies the probability that S is in (almost) any subset A of $\mathbb{R}$:

$\Pr(S \in A) = \Pr(\{x \in \mathcal{X} : S(x) \in A\}) = \Pr(S^{-1}(A))$

The first "Pr" is on the sample space $\mathbb{R}$. (Here A is a subset of $\mathbb{R}$.) The other two "Pr"s are on the original full sample space $\mathcal{X}$.
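The two-dice example above can be sketched in code. This is a minimal Python illustration (not part of the course's R material), assuming distinguishable, fair dice so that all 36 outcomes are equally likely. The names `dice_space`, `preimage`, and `prob` are hypothetical helpers introduced only for this sketch.

```python
from fractions import Fraction
from itertools import product

# Sample space for two distinguishable dice: all ordered pairs (i, j).
dice_space = list(product(range(1, 7), repeat=2))

def S(outcome):
    """Random variable: the sum of the two dice."""
    return outcome[0] + outcome[1]

def preimage(rv, space, B):
    """Outcomes in `space` that the random variable `rv` maps into the set B."""
    return [x for x in space if rv(x) in B]

def prob(event, space):
    """Probability of an event, ASSUMING equally likely outcomes."""
    return Fraction(len(event), len(space))

# The event "S = 4" pulled back to the dice sample space.
A = preimage(S, dice_space, {4})

print(len(dice_space))       # 36
print(A)                     # [(1, 3), (2, 2), (3, 1)]
print(prob(A, dice_space))   # 1/12
```

The probability of an event about S is computed entirely on the original dice sample space, by way of the preimage. Once only S is recorded, the three outcomes in `A` can no longer be told apart.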
Distribution of a Discrete RV

The distribution of a discrete random variable may be represented by a probability mass function (p.m.f.) f defined by:

$f(s) = \Pr(S = s)$

for every possible value s of the random variable S.

P.M.F.: The Dice Example

The p.m.f. of S is:

s      2     3     4     5     6     7     8     9     10    11    12
f(s)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Some Common Discrete Distributions

The symbol ~ is read thus: "is distributed as". (Notice we can use other letters: S, Z, X, ...)

Discrete Uniform: S ~ Unif({1,…,k}) if $\Pr(S = s) = 1/k$ for s = 1,…,k.
(We will encounter this with the "bootstrap" method, using the R function sample().)

Bernoulli: Z ~ Bernoulli(p) if $\Pr(Z = 1) = p$ and $\Pr(Z = 0) = 1 - p$.
(Kind of exciting: flipping a coin at the beginning of the Super Bowl?)

Binomial: X ~ Bin(n,p) if $\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x \in \{0, 1, \dots, n\}$.
Also exciting!! Why? It counts things.
Consider a sequence of Bernoulli outcomes $Z_1, \dots, Z_n$ which are i.i.d. (i.i.d. = "independent, identically distributed"). Then $X = Z_1 + \dots + Z_n \sim \mathrm{Bin}(n, p)$.
The Z's could be Super Bowl coin flips over the years, or response outcomes for some patients.

The Binomial Coefficient

$\binom{n}{x} = \frac{n!}{x!(n-x)!}$ is read "n choose x", because it is the number of ways to choose a subset of x things from a set of size n.

For n = 3:

Sample space            HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
X = Z1 + Z2 + Z3        3    2    2    1    2    1    1    0

x                       0    1    2    3
$\binom{3}{x}$          1    3    3    1

Here $\binom{3}{x}$ is the number of outcomes in the preimage $X^{-1}(x)$.

See also the document N03-Discrete Random Variables-whiteboards.docx

Multinomial: $(M_1, \dots, M_k) \sim \mathrm{Multinom}(n; p_1, \dots, p_k)$ if

$\Pr(M_1 = m_1, \dots, M_k = m_k) = \frac{n!}{m_1! \cdots m_k!} \, p_1^{m_1} \cdots p_k^{m_k}$,

where k = # categories, $m_1 + \dots + m_k = n$, and $p_1 + \dots + p_k = 1$.
See the tables from last week's class (the prisoners' picnic): what's k?
Study the notation! Ask for clarification if unfamiliar.

Exercise: You have 10 Scrabble tiles: S T A T I S T I C S. If you scramble them face down, then put them in a line, and turn them over, what is the chance that they spell "STATISTICS"?
Hint: How many permutations (orderings) are there? ("Permutation" is a new word.) How many of those put all 3 S's in the S spots? Etc.
What are the p's? What are the m's?
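For checking an answer to the exercise above, here is a minimal Python sketch (not the course's own solution). It assumes every physical ordering of the ten tiles is equally likely, so each distinguishable spelling has the same probability. The function name `distinct_orderings` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def distinct_orderings(word):
    """Number of distinguishable orderings of the letters in `word`:
    n! / (m_1! * m_2! * ... * m_k!), where m_j is the count of letter j."""
    total = factorial(len(word))
    for m in Counter(word).values():
        total //= factorial(m)   # exact: each partial product divides n!
    return total

word = "STATISTICS"
counts = Counter(word)                 # the letter counts m_1, ..., m_k
n_orderings = distinct_orderings(word)
p_spell = Fraction(1, n_orderings)     # one spelling out of all distinguishable ones

print(dict(counts))   # {'S': 3, 'T': 3, 'A': 1, 'I': 2, 'C': 1}
print(n_orderings)    # 50400
print(p_spell)        # 1/50400
```

The count is the multinomial coefficient with n = 10 and one m for each distinct letter. Dividing by each m! removes orderings that differ only by swapping identical tiles.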
See the document "multinomial and the probability of getting the right letters in the right order.docx"

Geometric Distribution

Notation: X ~ Geom(p) or X ~ NegBin(1,p), where 0<p<1.

Negative Binomial

Notation: X ~ NegBin(r,p), where r is a positive integer and 0<p<1.
The pmf is:

$\Pr(X = x) = \binom{r + x - 1}{x} p^r (1-p)^x$ for x = 0, 1, 2, …

This distribution may describe the number of tails obtained while repeatedly flipping a coin until r heads are obtained, for a coin that has probability p of landing heads.
Confusing: it counts tails, not heads. The # of heads is fixed. The # of tails is UNBOUNDED.
The negative binomial differs from the binomial ONLY in the STOPPING RULE. They have the "same" (proportional) likelihood function.

Poisson Distribution

X ~ Poisson(λ) if $\Pr(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}$ for x = 0, 1, 2, ….
The Poisson distribution is VERY exciting! It is often appropriate for count data when there is no natural upper bound.

Markov chains

Suppose the sample space at each time t = 1, 2, 3, ... is {A, B, C, D}, called "states". At each time t, we'll write the current state as $X_t$.
The key assumption is:

$\Pr(X_{t+1} \mid X_1, \dots, X_{t-1}, X_t) = \Pr(X_{t+1} \mid X_t)$,

regardless of $X_1, \dots, X_{t-1}$. This is a "memoryless" property, a special case of conditional independence.
Markov chains are tremendously useful in many ways, especially for (a) modeling processes, (b) devising computational methods.

Cumulative Distribution Function

Another representation of the distribution of a random variable is given by the cumulative distribution function (c.d.f.). The cdf of a random variable X is the function F defined by:

$F(x) = \Pr(X \le x)$ for $-\infty < x < \infty$.

F is a non-decreasing function, continuous from the right, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

CDF: The discrete case

$F(x) = \sum_{t \le x} f(t)$, where f is the p.m.f. This is a step function.

In R, the CDFs are obtained from the functions beginning with "p" for probability: pbinom, ppois, etc.
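The step-function behaviour of a discrete c.d.f. can be sketched as a running sum of the p.m.f. This is a minimal Python illustration rather than R, assuming the standard binomial and Poisson p.m.f.s with support starting at 0. The names `binom_pmf`, `pois_pmf`, and `cdf_from_pmf` are hypothetical helpers introduced only for this sketch.

```python
from math import comb, exp, factorial, floor

def binom_pmf(x, n, p):
    """Binomial p.m.f.: Pr(X = x) for X ~ Bin(n, p)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

def pois_pmf(x, lam):
    """Poisson p.m.f.: Pr(X = x) for X ~ Poisson(lam)."""
    if x < 0:
        return 0.0
    return exp(-lam) * lam**x / factorial(x)

def cdf_from_pmf(pmf, x):
    """F(x) = sum of pmf(t) over integers 0 <= t <= x.
    ASSUMES the support is a subset of {0, 1, 2, ...}."""
    if x < 0:
        return 0.0
    return sum(pmf(t) for t in range(floor(x) + 1))

def F(x):
    """c.d.f. of Bin(3, 0.5), the three-coin-flip example."""
    return cdf_from_pmf(lambda t: binom_pmf(t, 3, 0.5), x)

# F is flat between integers and jumps at 0, 1, 2, 3.
for x in [-1, 0, 0.5, 1, 1.5, 2, 3, 10]:
    print(x, F(x))
```

For three fair coin flips, `F(1)` and `F(1.5)` agree (both 0.5) because no probability mass lies strictly between 1 and 2. The analogous R call for the same quantity would be `pbinom(1, 3, 0.5)`.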
            r: random     p: CDF        q: quantile   d: prob mass ("density")
binom       rbinom()      pbinom()      qbinom()      dbinom()
geom        rgeom()       pgeom()       qgeom()       dgeom()
nbinom      rnbinom()     pnbinom()     qnbinom()     dnbinom()
pois        rpois()       ppois()       qpois()       dpois()
multinom    rmultinom()   (none)        (none)        dmultinom()

Base R supplies only rmultinom() and dmultinom() for the multinomial; there is no pmultinom() or qmultinom().

                     Bernoulli    Binomial     Geometric    Negative Binomial   Poisson
Variable type        Discrete     Discrete     Discrete     Discrete            Discrete
Sample space         {0,1}        {0,1,…,n}    {0,1,…}      {0,1,…}             {0,1,…}
Meaning of x         #(heads)     #(heads)     #(tails)     #(tails)            count
Meaning of p or λ    Pr(heads)    Pr(heads)    Pr(tails)    Pr(tails)           Pr(count)
Sample size          1            n            1            r                   1 (?)
Pr
E(X)                 p            np                                            λ
V(X)                 p(1 - p)     np(1 - p)                                     λ
CV(X)

Binomial: Sum of independent Bernoulli trials with the same probability of success for each trial.
  Stopping rule: the total sample size (n) is fixed in advance.
  x = number of successes (heads) in the first n trials.

Geometric: Independent Bernoulli trials.
  Stopping rule: stop at the first success.
  x = number of failures (tails) before the first success.

Negative Binomial: Independent Bernoulli trials with the same probability of success for each trial.
  Stopping rule: the number of successes (r) is fixed in advance.
  x = number of failures before the r-th success.

(CAUTION: the parametrization is not entirely consistent across books and software packages.)
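Because the parametrization varies, it can help to check numerically which convention is in use. The Python sketch below ASSUMES the convention in the definitions above, with x = number of failures before the r-th success, and further assumes that p denotes the probability of success on each trial (the same convention as R's `dnbinom(x, size, prob)`). Under that assumption the mean and variance are r(1 - p)/p and r(1 - p)/p². The helper names `nbinom_pmf`, `geom_pmf`, and `moments` are hypothetical.

```python
from math import comb

def nbinom_pmf(x, r, p):
    """Pr(X = x): x failures before the r-th success.
    ASSUMPTION: p is the probability of success on each trial."""
    return comb(x + r - 1, x) * p**r * (1 - p)**x

def geom_pmf(x, p):
    """Pr(X = x): x failures before the first success (same assumption on p)."""
    return p * (1 - p)**x

def moments(pmf, upper=500):
    """Approximate mean and variance by summing over x = 0, ..., upper - 1.
    The support is unbounded, so this drops a tail; for the values used
    below the dropped tail is negligible."""
    mean = sum(x * pmf(x) for x in range(upper))
    second = sum(x * x * pmf(x) for x in range(upper))
    return mean, second - mean**2

r, p = 3, 0.4
mean, var = moments(lambda x: nbinom_pmf(x, r, p))
print(mean, r * (1 - p) / p)       # both approximately 4.5
print(var, r * (1 - p) / p**2)     # both approximately 11.25

# Geometric is the r = 1 special case of the negative binomial.
print(all(abs(geom_pmf(x, p) - nbinom_pmf(x, 1, p)) < 1e-12 for x in range(20)))
```

If a book or package instead counts the total number of trials, or takes p to be the failure probability, these checks will not match, which is a quick way to detect a different parametrization.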