Chapter 2

Random Variables

If the value of a numerical variable depends on the outcome of an experiment, we call the variable a random variable.

Definition 2.0.1 (Random Variable) A function X : Ω → R is called a random variable. X assigns to each elementary event a real value. Standard notation: capital letters from the end of the alphabet.

Example 2.0.3 Very Simple Dartboard
In the case of three darts on a board as in the previous example, we are usually not interested in the order in which the darts have been thrown. We only want to count the number of times the red area has been hit. This count is a random variable! More formally: we define X to be the function that assigns to a sequence of three throws the number of times the red area is hit. X(s) = k if s consists of k hits to the red area and 3 − k hits to the gray area. X(s) is then an integer between 0 and 3 for every possible sequence.
What, then, is the probability that a player hits the red area exactly two times? We are now looking for all those elementary events s of our sample space for which X(s) = 2. Going back to the tree, we find three possibilities for s: rrg, rgr and grr. This is the subset of Ω for which X(s) = 2. Very formally, this set can be written as {s | X(s) = 2}. We want to know the total probability:

P({s | X(s) = 2}) = P(rrg ∪ rgr ∪ grr) = P(rrg) + P(rgr) + P(grr) = 8/9^3 + 8/9^3 + 8/9^3 = 24/729 ≈ 0.03.

To avoid cumbersome notation, we write X = x for the event {ω | ω ∈ Ω and X(ω) = x}.

Example 2.0.4 Communication Channel
Suppose 8 bits are sent through a communication channel. Each bit has a certain probability of being received incorrectly. So this is a Bernoulli experiment, and we can use Ω^8 as our sample space. We are interested in the number of bits that are received incorrectly. Use the random variable X to “count” the number of wrong bits. X assigns a value between 0 and 8 to each sequence in Ω^8.
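A random variable is nothing more than a function on the sample space Ω. As a small illustration (a sketch; the function and variable names are our own, not from the text), the dartboard probability P(X = 2) above can be reproduced by enumerating Ω:

```python
from itertools import product
from fractions import Fraction

# Very Simple Dartboard: a dart hits red with probability 1/9, gray with 8/9
# (these probabilities come from the earlier dartboard example).
P_RED, P_GRAY = Fraction(1, 9), Fraction(8, 9)

# Sample space: all sequences of three throws, e.g. ('r', 'r', 'g').
omega = list(product("rg", repeat=3))

def prob(s):
    """Probability of one elementary event (independent throws)."""
    p = Fraction(1)
    for throw in s:
        p *= P_RED if throw == "r" else P_GRAY
    return p

def X(s):
    """The random variable: number of red hits in the sequence s."""
    return s.count("r")

# P(X = 2) = sum over all elementary events s with X(s) = 2
p_two = sum(prob(s) for s in omega if X(s) == 2)
print(p_two, float(p_two))   # 8/243, i.e. 24/729 ≈ 0.0329
```

The same pattern works for the communication channel: replace `"rg"` by correct/wrong bits and `repeat=3` by `repeat=8`.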
Now it is very easy to write events like:

a) no wrong bit received: X = 0, probability P(X = 0)
b) at least one wrong bit received: X ≥ 1, probability P(X ≥ 1)
c) exactly three bits are wrong: X = 3, probability P(X = 3)
d) at least 3, but not more than 6 bits wrong: 3 ≤ X ≤ 6, probability P(3 ≤ X ≤ 6)

Definition 2.0.2 (Image of a random variable) The image of a random variable X is defined as the set of all possible values X can reach: Im(X) := X(Ω).

Depending on whether or not the image of a random variable is countable, we distinguish between discrete and continuous random variables.

Example 2.0.5
1. Put a disk drive into service, measure Y = “time till the first major failure”. Sample space Ω = (0, ∞). Y has an uncountable image → Y is a continuous random variable.
2. Communication channel: X = “# of incorrectly received bits”. Im(X) = {0, 1, 2, 3, 4, 5, 6, 7, 8} is a finite set → X is a discrete random variable.

2.1 Discrete Random Variables

Assume X is a discrete random variable. The image of X is therefore countable and can be written as {x1, x2, x3, . . .}. Very often we are interested in probabilities of the form P(X = x). We can think of this expression as a function that yields different probabilities depending on the value of x.

Definition 2.1.1 (Probability Mass Function, PMF) The function pX(x) := P(X = x) is called the probability mass function of X.

A probability mass function has two main properties: all values must be between 0 and 1, and the sum of all values is 1.

Theorem 2.1.2 (Properties of a pmf) pX is the pmf of X if and only if
(i) 0 ≤ pX(x) ≤ 1 for all x ∈ {x1, x2, x3, . . .}
(ii) Σ_i pX(xi) = 1

Note: this gives us an easy method to check whether a function is a probability mass function!

Example 2.1.1 Which of the following functions is a valid probability mass function?

1.  x       -3    -1     0     5     7
    pX(x)   0.1   0.45  0.15  0.25  0.05

2.  y       -1     0    1.5     3    4.5
    pY(y)   0.1   0.45  0.25  -0.05  0.25

3.
    z       0     5     7     1     3
    pZ(z)   0.22  0.17  0.18  0.18  0.24

We need to check the two properties of a pmf for pX, pY and pZ.
1st property: are all probabilities between 0 and 1? This eliminates pY from the list of potential probability mass functions, since pY(3) is negative. The other two functions fulfill the property.
2nd property: do all probabilities sum to 1? Σ_i pX(xi) = 1, so pX is a valid probability mass function. Σ_i pZ(zi) = 0.99 ≠ 1, so pZ is not a valid probability mass function.

Example 2.1.2 Probability Mass Functions

1. Very Simple Dartboard
X, the number of times a player hits the red area with three darts, is a value between 0 and 3. What is the probability mass function for X? The probability mass function pX can be given as a list of all possible values:

pX(0) = P(X = 0) = P(ggg) = 8^3/9^3 ≈ 0.70
pX(1) = P(X = 1) = P(rgg) + P(grg) + P(ggr) = 3 · 8^2/9^3 ≈ 0.26
pX(2) = P(X = 2) = P(rrg) + P(rgr) + P(grr) = 3 · 8/9^3 ≈ 0.03
pX(3) = P(X = 3) = P(rrr) = 1/9^3 ≈ 0.01

2. Roll of a fair die
Let Y be the number of spots on the upturned face of a die. Obviously, Y is a random variable with image {1, 2, 3, 4, 5, 6}. Assuming that the die is a fair die means that the probability for each side is equal. The probability mass function for Y therefore is pY(i) = 1/6 for all i in {1, 2, 3, 4, 5, 6}.

3. The diagram shows all six faces of a particular die. If Z denotes the number of spots on the upturned face after tossing this die, what is the probability mass function for Z? Assuming that each face of the die appears with the same probability, we have one possibility to get a 1 or a 4, and two possibilities for a 2 or a 3 to appear, which gives a probability mass function of:

    x      1    2    3    4
    p(x)  1/6  1/3  1/3  1/6

2.1.1 Expectation and Variance

Example 2.1.3 Game
Suppose we play a “game” where you toss a die. Let X be the number of spots. Then, if X is
1, 3 or 5: I pay you $X
2 or 4: you pay me $2·X
6: no money changes hands.
What money do I expect to win? For that, we look at another function h(x) that gives the money I win depending on the number of spots:

h(x) = −x for x = 1, 3, 5
h(x) = 2x for x = 2, 4
h(x) = 0 for x = 6.

Now we make a list:
In 1/6 of all tosses X will be 1, and I will gain −1 dollars.
In 1/6 of all tosses X will be 2, and I will gain 4 dollars.
In 1/6 of all tosses X will be 3, and I will gain −3 dollars.
In 1/6 of all tosses X will be 4, and I will gain 8 dollars.
In 1/6 of all tosses X will be 5, and I will gain −5 dollars.
In 1/6 of all tosses X will be 6, and I will gain 0 dollars.

In total I expect to get

(1/6)·(−1) + (1/6)·4 + (1/6)·(−3) + (1/6)·8 + (1/6)·(−5) + (1/6)·0 = 3/6 = 0.5 dollars per play.

Assume that instead of a fair die, we use the die from example 3. How does that change my expected gain? h(x) is not affected by the different die, but my expected gain changes. In total I expect to gain:

(1/6)·(−1) + (1/3)·4 + (1/3)·(−3) + (1/6)·8 + 0·(−5) + 0·0 = 9/6 = 1.5 dollars per play.

Definition 2.1.3 (Expectation) The expected value of a function h(X) is defined as

E[h(X)] := Σ_i h(xi) · pX(xi).

The most important version of this is h(x) = x:

E[X] = Σ_i xi · pX(xi) =: μ

Example 2.1.4 Toss of a Die
Toss a fair die, and denote by X the number of spots on the upturned face. What is the expected value of X? Looking at the above definition of E[X], we see that we need to know the probability mass function for the computation. The probability mass function of X is pX(i) = 1/6 for all i ∈ {1, 2, 3, 4, 5, 6}. Therefore

E[X] = Σ_{i=1}^{6} i·pX(i) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 3.5.

A second common measure for describing a random variable is a measure of how far its values are spread out. We measure how far we expect values to be away from the expected value:
Definition 2.1.4 (Variance of a random variable) The variance of a random variable X is defined as:

Var[X] := E[(X − E[X])^2] = Σ_i (xi − E[X])^2 · pX(xi)

The variance is measured in squared units of X. σ := √Var[X] is called the standard deviation of X; its units are the original units of X.

Example 2.1.5 Toss of a Die, continued
Toss a fair die, and denote by X the number of spots on the upturned face. What is the variance of X? Looking at the above definition of Var[X], we see that we need to know the probability mass function and E[X] for the computation. The probability mass function of X is pX(i) = 1/6 for all i ∈ {1, 2, 3, 4, 5, 6}; E[X] = 3.5. Therefore

Var[X] = Σ_{i=1}^{6} (i − 3.5)^2 pX(i) = 6.25·(1/6) + 2.25·(1/6) + 0.25·(1/6) + 0.25·(1/6) + 2.25·(1/6) + 6.25·(1/6) ≈ 2.917 (spots^2).

The standard deviation of X is:

σ = √Var[X] ≈ 1.71 (spots).

2.1.2 Some Properties of Expectation and Variance

The following theorems make computations with the expected value and variance of random variables easier:

Theorem 2.1.5 For two random variables X and Y and two real numbers a, b:

E[aX + bY] = aE[X] + bE[Y].

Theorem 2.1.6 For a random variable X and a real number a:
(i) E[X^2] = Var[X] + (E[X])^2
(ii) Var[aX] = a^2 Var[X]

Theorem 2.1.7 (Chebyshev’s Inequality) For any positive real number k and a random variable X with variance σ^2:

P(|X − E[X]| ≤ kσ) ≥ 1 − 1/k^2

2.1.3 Probability Distribution Function

Very often we are interested in the probability of a whole range of values, like P(X ≤ 5) or P(4 ≤ X ≤ 16). For that we define another function:

Definition 2.1.8 (probability distribution function) Assume X is a discrete random variable. The function FX(t) := P(X ≤ t) is called the probability distribution function of X.

Relationship between pX and FX
Since X is a discrete random variable, the image of X can be written as {x1, x2, x3, . .
.}, and we are therefore interested in all xi with xi ≤ t:

FX(t) = P(X ≤ t) = P({xi | xi ≤ t}) = Σ_{i: xi ≤ t} pX(xi).

Note: in contrast to the probability mass function, FX is defined on all of R (not only on the image of X).

Example 2.1.6 Roll a fair die
X = # of spots on the upturned face, Ω = {1, 2, 3, 4, 5, 6}, pX(1) = pX(2) = . . . = pX(6) = 1/6.

FX(t) = Σ_{i: i ≤ t} pX(i) = Σ_{i=1}^{⌊t⌋} pX(i) = ⌊t⌋/6 for 0 ≤ t ≤ 6, where ⌊t⌋ is the truncated value of t.

Properties of FX
The following properties hold for the probability distribution function FX of a random variable X:
• 0 ≤ FX(t) ≤ 1 for all t ∈ R.
• FX is monotone increasing (i.e. if x1 ≤ x2 then FX(x1) ≤ FX(x2)).
• lim_{t→−∞} FX(t) = 0 and lim_{t→∞} FX(t) = 1.
• FX has a positive jump equal to pX(xi) at each xi ∈ {x1, x2, x3, . . .}; FX is constant in the interval [xi, xi+1).

Whenever no confusion arises, we will omit the subscript X.

2.2 Special Discrete Probability Mass Functions

In many theoretical and practical problems, several probability mass functions occur often enough to be worth exploring here.

2.2.1 Bernoulli pmf

Situation: a Bernoulli experiment (only two outcomes: success/no success) with P(success) = p. We define a random variable X as:

X(success) = 1
X(no success) = 0

The probability mass function pX of X is then:

pX(0) = 1 − p
pX(1) = p

This probability mass function is called the Bernoulli mass function. The distribution function FX is then:

FX(t) = 0 for t < 0
FX(t) = 1 − p for 0 ≤ t < 1
FX(t) = 1 for 1 ≤ t

This distribution function is called the Bernoulli distribution function. That’s a very simple probability function, and we’ve already seen sequences of Bernoulli experiments. . .

2.2.2 Binomial pmf

Situation: n sequential Bernoulli experiments, with success rate p for a single trial. Single trials are independent from each other.
We are only interested in the total number of successes after n trials; therefore we define a random variable X as X = “number of successes in n trials”. This leads to the image Im(X) = {0, 1, 2, . . . , n}. We can think of the sample space Ω as the set of sequences of length n that consist only of the letters S and F for “success” and “failure”:

Ω = {F...FF, F...FS, . . . , S...SS}

This way, we get 2^n different outcomes in the sample space. Now we want to derive a probability mass function for X, i.e. we want to get a general expression for pX(k) for all possible k = 0, . . . , n. pX(k) = P(X = k), i.e. we want to find the probability that in a sequence of n trials there are exactly k successes. Think: if s is a sequence with k successes and n − k failures, we already know its probability: P(s) = p^k (1 − p)^(n−k). Now we need to know how many possibilities there are to have k successes in n trials: think of the n trials as numbers from 1 to n. To have k successes, we need to choose a set of k of these numbers out of the n possible numbers. Do you see it? - That’s the binomial coefficient again. pX(k) is therefore:

pX(k) = (n choose k) · p^k (1 − p)^(n−k).

This probability mass function is called the Binomial mass function. The distribution function FX is:

FX(t) = Σ_{i=0}^{⌊t⌋} (n choose i) p^i (1 − p)^(n−i) =: Bn,p(t)

This function is called the Binomial distribution Bn,p, where n is the number of trials and p is the probability of a success. It is a bit cumbersome to compute values of the distribution function; therefore, those values are tabulated with respect to n and p.

Example 2.2.1 Compute the probabilities for the following events. A box contains 15 components that each have a failure rate of 2%. What is the probability that
1. exactly two out of the fifteen components are defective?
2. at most two components are broken?
3. more than three components are broken?
4. more than 1 but less than 4 are broken?

Let X be the number of broken components.
Then X has a B15,0.02 distribution.

1. P(exactly two out of the fifteen components are defective) = pX(2) = (15 choose 2) · 0.02^2 · 0.98^13 = 0.0323.
2. P(at most two components are broken) = P(X ≤ 2) = B15,0.02(2) = 0.9638.
3. P(more than three components are broken) = P(X > 3) = 1 − P(X ≤ 3) = 1 − 0.9945 = 0.0055.
4. P(more than 1 but less than 4 are broken) = P(1 < X < 4) = P(X ≤ 3) − P(X ≤ 1) = 0.9945 − 0.8290 = 0.1655.

If we want to say that a random variable has a binomial distribution, we write X ∼ Bn,p.

What are the expected value and variance of X ∼ Bn,p?

E[X] = Σ_{i=0}^{n} i · pX(i)
     = Σ_{i=0}^{n} i · (n choose i) p^i (1 − p)^(n−i)
     = Σ_{i=1}^{n} i · n!/(i!(n−i)!) · p^i (1 − p)^(n−i)
     = np · Σ_{j=0}^{n−1} (n−1)!/(j!((n−1)−j)!) · p^j (1 − p)^((n−1)−j)     (substituting j := i − 1)
     = np · 1 = np,

since the last sum adds the Bn−1,p pmf over its whole image and is therefore 1.

Var[X] = . . . = np(1 − p).

2.2.3 Geometric pmf

Assume we have a single Bernoulli experiment with probability of success p. Now we repeat this experiment until we have a first success. Denote by X the number of repetitions of the experiment until the first success. Note: X = k means that we have k − 1 failures and the first success in the kth repetition of the experiment. The sample space Ω is therefore infinite and starts at 1 (we need at least one experiment): Ω = {1, 2, 3, 4, . . .}.

Probability mass function:

pX(k) = P(X = k) = (1 − p)^(k−1) · p     (k − 1 failures, then success!)

This probability mass function is called the Geometric mass function. Expected value and variance of X are:

E[X] = Σ_{i=1}^{∞} i (1 − p)^(i−1) p = . . . = 1/p,
Var[X] = Σ_{i=1}^{∞} (i − 1/p)^2 (1 − p)^(i−1) p = . . . = (1 − p)/p^2.

Example 2.2.2 Repeat-until loop
Examine the following programming statement:

Repeat S until B

Assume P(B = true) = 0.1 and let X be the number of times S is executed. Then X has a geometric distribution,

P(X = k) = pX(k) = 0.9^(k−1) · 0.1.

How often is S executed on average? - What is E[X]? Using the above formula, we get E[X] = 1/p = 10.
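The repeat-until loop of Example 2.2.2 is easy to simulate. The sketch below (function and variable names are ours) runs the loop many times and estimates E[X] empirically; the result should land near the theoretical value 1/p = 10:

```python
import random

# Sketch: simulate "Repeat S until B" with P(B = true) = 0.1.
# X counts how often S is executed before B first becomes true.
def run_loop(p=0.1, rng=random.Random(42)):
    x = 0
    while True:
        x += 1                    # one execution of S
        if rng.random() < p:      # condition B becomes true
            return x

n = 100_000
mean_x = sum(run_loop() for _ in range(n)) / n
print(mean_x)   # close to E[X] = 1/p = 10
```

Note the geometric shape of the simulation: each pass through the loop is an independent Bernoulli trial, and the loop stops at the first success.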
We still need to compute the distribution function FX. Remember, FX(t) is the probability that X ≤ t. Instead of tackling this problem directly, we use a trick and look at the complementary event X > t. If X is greater than t, this means that the first ⌊t⌋ trials all yield failures. This is easy to compute! It’s just (1 − p)^⌊t⌋. Therefore the probability distribution function is:

FX(t) = 1 − (1 − p)^⌊t⌋ =: Geop(t)

This function is called the Geometric distribution (function) Geop.

Example 2.2.3 Time Outs at the Alpha Farm
Watch the input queue at the alpha farm for a job that times out. The probability that a job times out is 0.05. Let Y be the number of the first job to time out; then Y ∼ Geo0.05. What then is the probability that
• the third job times out? P(Y = 3) = 0.95^2 · 0.05 ≈ 0.045
• Y is less than 3? P(Y < 3) = P(Y ≤ 2) = 1 − 0.95^2 = 0.0975
• the first job to time out is between the third and the seventh? P(3 ≤ Y ≤ 7) = P(Y ≤ 7) − P(Y ≤ 2) = (1 − 0.95^7) − (1 − 0.95^2) = 0.204

What is the expected value of Y, and what is Var[Y]? Plugging p = 0.05 into the above formulas gives us:

E[Y] = 1/p = 20          (we expect the 20th job to be the first time out)
Var[Y] = (1 − p)/p^2 = 380     (very spread out!)

2.2.4 Poisson pmf

The Poisson density follows from a certain set of assumptions about the occurrence of “rare” events in time or space. The kinds of variables modelled using a Poisson density are e.g.

X = # of alpha particles emitted from a polonium bar in an 8 minute period.
Y = # of flaws on a standard size piece of manufactured product (100 m coaxial cable).
Z = # of hits on a web page in a 24 h period.

The Poisson probability mass function is defined as:

p(x) = e^(−λ) λ^x / x!     for x = 0, 1, 2, 3, . . .

λ is called the rate parameter. Poλ(t) := FX(t) is the Poisson distribution (function). We need to check that p(x) as defined above is actually a probability mass function, i.e.
we need to check whether the two basic properties (see Theorem 2.1.2) are true:

• Obviously, all values of p(x) are non-negative for x ≥ 0.
• Do all probabilities sum to 1?

Σ_{k=0}^{∞} p(k) = Σ_{k=0}^{∞} e^(−λ) λ^k / k! = e^(−λ) Σ_{k=0}^{∞} λ^k / k!     (∗)

Now we need to remember from calculus that the exponential function has the series representation

e^x = Σ_{n=0}^{∞} x^n / n!.

In our case this simplifies (∗) to:

e^(−λ) Σ_{k=0}^{∞} λ^k / k! = e^(−λ) · e^λ = 1.

p(x) is therefore a valid probability mass function.

Expected value and variance of X ∼ Poλ are:

E[X] = Σ_{x=0}^{∞} x e^(−λ) λ^x / x! = . . . = λ
Var[X] = . . . = λ

Computing E[X] and Var[X] involves some math, but as it is not too hard, we can do the computation for E[X]:

E[X] = Σ_{x=0}^{∞} x e^(−λ) λ^x / x!
     = e^(−λ) Σ_{x=1}^{∞} x λ^x / x!          (for x = 0 the term is 0)
     = e^(−λ) Σ_{x=1}^{∞} λ^x / (x − 1)!
     = e^(−λ) λ Σ_{x=1}^{∞} λ^(x−1) / (x − 1)!
     = e^(−λ) λ Σ_{x=0}^{∞} λ^x / x!          (start at x = 0 again, changing the summation index)
     = e^(−λ) λ e^λ = λ.

How do we choose λ in an example? - Look at the expected value!

Example 2.2.4 A manufacturer of chips produces 1% defectives. What is the probability that in a box of 100 chips no defective is found? Let X be the number of defective chips found in the box. So far, we would have modelled X as a Binomial variable with distribution B100,0.01. Then

P(X = 0) = (100 choose 0) · 0.01^0 · 0.99^100 = 0.366.

On the other hand, a defective chip can be considered to be a rare event, since p is small (p = 0.01). What else can we do? We expect 100 · 0.01 = 1 chip out of the box to be defective. If we model X as a Poisson variable, we know that the expected value of X is λ. In this example, therefore, λ = 1. Then

P(X = 0) = e^(−1) · 1^0 / 0! = 0.3679.

No big difference between the two approaches! For larger k, however, the binomial coefficient (n choose k) becomes hard to compute, and it is easier to use the Poisson distribution instead of the Binomial distribution.
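Hypothetically, the comparison of Example 2.2.4 can be carried out for every k at once. The following sketch (function names are ours) computes both pmfs side by side for n = 100, p = 0.01, λ = np = 1:

```python
from math import comb, exp, factorial

n, p = 100, 0.01
lam = n * p          # rate parameter of the approximating Poisson pmf

def binom_pmf(k):
    """Exact Binomial probability P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    """Poisson probability P(X = k) with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

for k in range(4):
    print(k, round(binom_pmf(k), 4), round(poisson_pmf(k), 4))
# k = 0 gives 0.3660 (Binomial) vs 0.3679 (Poisson), as computed above
```

For larger k the Binomial coefficient grows quickly, while the Poisson term stays trivial to evaluate, which is exactly the point made in the text.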
Poisson approximation of Binomial pmf
For large n, the Binomial distribution is approximated by the Poisson distribution, where λ is given as np:

(n choose k) p^k (1 − p)^(n−k) ≈ e^(−np) (np)^k / k!

Rule of thumb: use the Poisson approximation if n ≥ 20 and (at the same time) p ≤ 0.05.

Why does the approximation work? - We will have a closer look at why the Poisson distribution approximates the Binomial distribution. This also explains why the Poisson pmf is defined as it is.

Example 2.2.5 Typos
Imagine you are supposed to proofread a paper. Let us assume that there are on average 2 typos on a page and a page has 1000 words. This gives a probability of 0.002 for each word to contain a typo. The number of typos on a page X is then a Binomial random variable, i.e. X ∼ B1000,0.002. Let’s have a closer look at a couple of probabilities:

• The probability of no typo on a page is P(X = 0). We know that P(X = 0) = (1 − 0.002)^1000 = 0.998^1000. We can also write this probability as

P(X = 0) = (1 − 2/1000)^1000     (= 0.13506).

From calculus we know that

lim_{n→∞} (1 + x/n)^n = e^x.

Therefore, with x = −2, the probability of no typo on the page is approximately

P(X = 0) ≈ e^(−2)     (= 0.13534).

• The probability of exactly one typo on a page is

P(X = 1) = (1000 choose 1) · 0.002 · 0.998^999     (= 0.27067).

We can write this as

P(X = 1) = 1000 · (2/1000) · (1 − 2/1000)^999 ≈ 2 · e^(−2)     (= 0.27067).

• The probability of exactly two typos on a page is

P(X = 2) = (1000 choose 2) · 0.002^2 · 0.998^998     (= 0.27094),

which we again re-write as

P(X = 2) = (1000 · 999)/(1000 · 1000) · (2^2/2) · (1 − 2/1000)^998 ≈ (2^2/2!) · e^(−2)     (= 0.27067).

• And a last one: the probability of exactly three typos on a page is

P(X = 3) = (1000 choose 3) · 0.002^3 · 0.998^997     (= 0.18063),

which is

P(X = 3) = (1000 · 999 · 998)/(1000 · 1000 · 1000) · (2^3/(3 · 2)) · (1 − 2/1000)^997 ≈ (2^3/3!) · e^(−2)     (= 0.18045).

2.2.5 Compound Discrete Probability Mass Functions

Real problems very seldom concern a single random variable.
As soon as more than one variable is involved, it is not sufficient to model them only individually - their joint behavior is important. How do we specify probabilities for more than one random variable at a time? Consider the two-variable case: X, Y are two discrete variables. The joint probability mass function is defined as

pX,Y(x, y) := P(X = x ∩ Y = y)

(Again, the individual probabilities must be between 0 and 1, and their sum must be 1.)

Example 2.2.6 A box contains 5 unmarked PowerPC G4 processors of different speeds:

2 × 400 MHz, 1 × 450 MHz, 2 × 500 MHz

Select two processors out of the box (without replacement) and let

X = speed of the first selected processor
Y = speed of the second selected processor

For a sample space we can draw a table of all possible combinations of processors, distinguishing processors of the same speed by the subscripts 1 and 2: every ordered pair of two distinct processors out of 400₁, 400₂, 450, 500₁, 500₂ is a possible outcome. In total we have 5 · 4 = 20 possible combinations. Since we draw at random, we assume that each of these combinations is equally likely. This yields the following joint probability mass function:

                    2nd processor
  1st proc.    400    450    500   (MHz)
  400          0.1    0.1    0.2
  450          0.1    0.0    0.1
  500          0.2    0.1    0.1

What is the probability that X = Y? (This might be important if we wanted to match the chips to assemble a dual processor machine.)

P(X = Y) = pX,Y(400, 400) + pX,Y(450, 450) + pX,Y(500, 500) = 0.1 + 0 + 0.1 = 0.2.

Another example: what is the probability that the first processor has a higher speed than the second?

P(X > Y) = pX,Y(450, 400) + pX,Y(500, 400) + pX,Y(500, 450) = 0.1 + 0.2 + 0.1 = 0.4.
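As a cross-check (a sketch; the names are ours), the joint pmf of Example 2.2.6 and both probabilities can be obtained by enumerating all 20 ordered draws from the box:

```python
from itertools import permutations
from fractions import Fraction

# The box: speeds in MHz; duplicates model the two 400s and two 500s.
box = [400, 400, 450, 500, 500]

draws = list(permutations(box, 2))       # 5 * 4 = 20 equally likely ordered pairs
p_each = Fraction(1, len(draws))

def joint_pmf(x, y):
    """pX,Y(x, y): probability that the 1st draw is x and the 2nd is y."""
    return sum(p_each for (a, b) in draws if a == x and b == y)

p_equal  = sum(p_each for (a, b) in draws if a == b)   # P(X = Y)
p_faster = sum(p_each for (a, b) in draws if a > b)    # P(X > Y)
print(joint_pmf(400, 500), p_equal, p_faster)          # 1/5 1/5 2/5
```

Note that `permutations` treats the two 400 MHz chips as distinct objects, which is exactly the subscript-1/subscript-2 trick used in the text.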
We can go from joint probability mass functions to the individual, “marginal” pmfs:

pX(x) = Σ_y pX,Y(x, y)
pY(y) = Σ_x pX,Y(x, y)

Example 2.2.7 Continued
For the previous example the marginal probability mass functions are

  x       400   450   500   (MHz)
  pX(x)   0.4   0.2   0.4

  y       400   450   500   (MHz)
  pY(y)   0.4   0.2   0.4

Just as we had the notion of an expected value for functions of a single random variable, there is an expected value for functions of several random variables:

E[h(X, Y)] := Σ_{x,y} h(x, y) pX,Y(x, y)

Example 2.2.8 Continued
Let X, Y be as before. What is E[|X − Y|] (the average speed difference)? Here we have the situation E[|X − Y|] = E[h(X, Y)] with h(X, Y) = |X − Y|. Using the above definition of expected value gives us:

E[|X − Y|] = Σ_{x,y} |x − y| pX,Y(x, y)
= |400 − 400| · 0.1 + |400 − 450| · 0.1 + |400 − 500| · 0.2 +
  |450 − 400| · 0.1 + |450 − 450| · 0.0 + |450 − 500| · 0.1 +
  |500 − 400| · 0.2 + |500 − 450| · 0.1 + |500 − 500| · 0.1
= 0 + 5 + 20 + 5 + 0 + 5 + 20 + 5 + 0 = 60.

The most important cases for h(X, Y) in this context are linear combinations of X and Y. For two variables we can measure how “similar” their values are:

Definition 2.2.1 (Covariance) The covariance between two random variables X and Y is defined as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Note that this definition looks very much like the definition of the variance of a single random variable. In fact, if we set Y := X in the above definition, then Cov(X, X) = Var(X).

Definition 2.2.2 (Correlation) The (linear) correlation between two variables X and Y is

ρ := Cov(X, Y) / √(Var(X) · Var(Y))     (read: “rho”)

Facts about ρ:
• ρ is between −1 and 1.
• If ρ = 1 or −1, Y is a linear function of X:
  ρ = 1 → Y = aX + b with a > 0,
  ρ = −1 → Y = aX + b with a < 0.

ρ is a measure of the linear association between X and Y. ρ near ±1 indicates a strong linear relationship, ρ near 0 indicates lack of linear association.
Example 2.2.9 Continued
What is ρ in our box with five chips? Check (use the marginal pmfs to compute!):

E[X] = E[Y] = 450
Var[X] = Var[Y] = 2000

The covariance between X and Y is:

Cov(X, Y) = Σ_{x,y} (x − E[X])(y − E[Y]) pX,Y(x, y)
= (400 − 450)(400 − 450) · 0.1 + (450 − 450)(400 − 450) · 0.1 + (500 − 450)(400 − 450) · 0.2 +
  (400 − 450)(450 − 450) · 0.1 + (450 − 450)(450 − 450) · 0.0 + (500 − 450)(450 − 450) · 0.1 +
  (400 − 450)(500 − 450) · 0.2 + (450 − 450)(500 − 450) · 0.1 + (500 − 450)(500 − 450) · 0.1
= 250 + 0 − 500 + 0 + 0 + 0 − 500 + 250 + 0 = −500.

ρ therefore is

ρ = Cov(X, Y)/√(Var(X) · Var(Y)) = −500/2000 = −0.25;

ρ indicates a weak negative (linear) association.

Definition 2.2.3 (Independence) Two random variables X and Y are independent if their joint probability mass function pX,Y is equal to the product of the marginal pmfs pX · pY.

Note: so far, we have had a definition for the independence of two events A and B: A and B are independent if P(A ∩ B) = P(A) · P(B). Random variables are independent if all events of the form X = x and Y = y are independent.

Example 2.2.10 Continued
Let X and Y be defined as previously. Are X and Y independent? Check: pX,Y(x, y) = pX(x) · pY(y) for all possible combinations of x and y. Trick: whenever there is a zero in the joint probability mass function, the variables cannot be independent:

pX,Y(450, 450) = 0 ≠ 0.04 = pX(450) · pY(450).

Therefore, X and Y are not independent!

More Properties of Variance and Expected Values

Theorem 2.2.4 If two random variables X and Y are independent, then

E[X · Y] = E[X] · E[Y]
Var[X + Y] = Var[X] + Var[Y]

Theorem 2.2.5 For two random variables X and Y and three real numbers a, b, c:

Var[aX + bY + c] = a^2 Var[X] + b^2 Var[Y] + 2ab · Cov(X, Y)

Note: by comparing the two results, we see that for two independent random variables X and Y, the covariance Cov(X, Y) = 0.
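The numbers of Example 2.2.9 can be reproduced by brute force over the 20 equally likely draws (a sketch; names are ours, and we use the ordered-pair enumeration rather than the pmf table):

```python
from itertools import permutations
from math import sqrt

# Processor box of Example 2.2.6, speeds in MHz.
box = [400, 400, 450, 500, 500]
draws = list(permutations(box, 2))
n = len(draws)                       # 20 equally likely ordered outcomes

ex = sum(a for a, b in draws) / n    # E[X] = 450
ey = sum(b for a, b in draws) / n    # E[Y] = 450
var_x = sum((a - ex) ** 2 for a, b in draws) / n      # Var[X] = 2000
var_y = sum((b - ey) ** 2 for a, b in draws) / n      # Var[Y] = 2000
cov = sum((a - ex) * (b - ey) for a, b in draws) / n  # Cov(X, Y) = -500

rho = cov / sqrt(var_x * var_y)
print(cov, rho)    # -500.0 -0.25
```

The negative covariance reflects drawing without replacement: a fast first chip makes a fast second chip slightly less likely.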
Example 2.2.11 Continued

E[X − Y] = E[X] − E[Y] = 450 − 450 = 0
Var[X − Y] = Var[X] + (−1)^2 Var[Y] − 2 Cov(X, Y) = 2000 + 2000 + 1000 = 5000

2.3 Continuous Random Variables

All previous considerations for discrete variables have direct counterparts for continuous variables. So far, a lot of sums have been involved, e.g. to compute distribution functions or expected values. Summing over uncountably many values corresponds to an integral. The main trick in working with continuous random variables is to substitute integrals for all the sums in the definitions. As in the case of a discrete random variable, we define a distribution function as the probability that a random variable has outcome t or a smaller value:

Definition 2.3.1 (probability distribution function) Assume X is a continuous random variable. The function FX(t) := P(X ≤ t) is called the probability distribution function of X.

The only difference from the discrete case is that the distribution function of a continuous variable is not a step function.

Properties of FX
The following properties hold for the probability distribution function FX of a random variable X:
• 0 ≤ FX(t) ≤ 1 for all t ∈ R.
• FX is monotone increasing (i.e. if x1 ≤ x2 then FX(x1) ≤ FX(x2)).
• lim_{t→−∞} FX(t) = 0 and lim_{t→∞} FX(t) = 1.

Now, however, the situation is slightly different from the discrete case:

Definition 2.3.2 (density function) For a continuous variable X with distribution function FX, the density function of X is defined as:

fX(x) := F′X(x).

(Careful: f(x) is not a probability! f(x) may be greater than 1.)

Theorem 2.3.3 (Properties of f(x)) A function fX is a density function of X if
(i) fX(x) ≥ 0 for all x,
(ii) ∫_{−∞}^{∞} f(x) dx = 1.
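Theorem 2.3.3 can be checked numerically for a candidate density. The sketch below (our own helper names; a plain trapezoidal rule rather than any library routine) verifies the two properties for f(x) = e^(−x) on x ≥ 0:

```python
from math import exp

def f(x):
    """Candidate density: e^(-x) for x >= 0, and 0 otherwise."""
    return exp(-x) if x >= 0 else 0.0

def integrate(g, a, b, steps=200_000):
    """Trapezoidal approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    total += sum(g(a + i * h) for i in range(1, steps))
    return total * h

# (i) f is non-negative by construction; (ii) it integrates to 1.
# The tail beyond x = 50 is negligible, so [0, 50] is enough numerically.
area = integrate(f, 0.0, 50.0)
p01 = integrate(f, 0.0, 1.0)     # P(0 <= X <= 1), compare 1 - e^(-1)
print(area, p01)
```

The same `integrate` helper can evaluate any probability of the form P(a ≤ X ≤ b) once a density is given, which is exactly the relationship between fX and FX discussed next.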
Relationship between fX and FX
Since the density function fX is defined as the derivative of the distribution function, we can regain the distribution function from the density by integrating:

• FX(t) = P(X ≤ t) = ∫_{−∞}^{t} f(x) dx
• P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx

Therefore,

P(X = a) = P(a ≤ X ≤ a) = ∫_{a}^{a} f(x) dx = 0.

Example 2.3.1 Let Y be the time until the first major failure of a new disk drive. A possible density function for Y is

f(y) = e^(−y) for y > 0, and f(y) = 0 otherwise.

First, we need to check that f(y) is actually a density function. Obviously, f(y) is a non-negative function on the whole of R. The second condition f must fulfill to be a density of Y is

∫_{−∞}^{∞} f(y) dy = ∫_{0}^{∞} e^(−y) dy = [−e^(−y)]_{0}^{∞} = 0 − (−1) = 1.

What is the probability that the first major disk drive failure occurs within the first year?

P(Y ≤ 1) = ∫_{0}^{1} e^(−y) dy = [−e^(−y)]_{0}^{1} = 1 − e^(−1) ≈ 0.63.

What is the distribution function of Y?

FY(t) = ∫_{−∞}^{t} f(y) dy = ∫_{0}^{t} e^(−y) dy = 1 − e^(−t) for all t ≥ 0.

Figure 2.1: Density and distribution function of the random variable Y.

Summary:

discrete random variable:
  image Im(X) finite or countably infinite
  probability distribution function: FX(t) = P(X ≤ t) = Σ_{k ≤ ⌊t⌋} pX(k)
  probability mass function: pX(x) = P(X = x)
  expected value: E[h(X)] = Σ_x h(x) · pX(x)
  variance: Var[X] = E[(X − E[X])^2] = Σ_x (x − E[X])^2 pX(x)

continuous random variable:
  image Im(X) uncountable
  probability distribution function: FX(t) = P(X ≤ t) = ∫_{−∞}^{t} f(x) dx
  probability density function: fX(x) = F′X(x)
  expected value: E[h(X)] = ∫_{−∞}^{∞} h(x) · fX(x) dx
  variance: Var[X] = E[(X − E[X])^2] = ∫_{−∞}^{∞} (x − E[X])^2 fX(x) dx

2.4 Some special continuous density functions

2.4.1 Uniform Density

One of the most basic cases of a continuous density is the uniform density. On the finite interval (a, b) each value has the same density (cf.
Figure 2.2):

f(x) = 1/(b − a) if a < x < b, and f(x) = 0 otherwise.

The distribution function FX is

Ua,b(x) := FX(x) = 0 if x ≤ a,
                   (x − a)/(b − a) if a < x < b,
                   1 if x ≥ b.

We now know how to compute the expected value and variance of a continuous random variable.

Figure 2.2: Density function of a uniform variable X on (a, b).

Assume X has a uniform distribution on (a, b). Then

E[X] = ∫_{a}^{b} x · 1/(b − a) dx = 1/(b − a) · [x^2/2]_{a}^{b} = (b^2 − a^2)/(2(b − a)) = (a + b)/2

(we expect X to be in the middle between a and b - makes sense, doesn’t it?), and

Var[X] = ∫_{a}^{b} (x − (a + b)/2)^2 · 1/(b − a) dx = . . . = (b − a)^2/12.

Example 2.4.1 The (pseudo) random number generator on my calculator is supposed to create realizations of U(0, 1) random variables. Define U as the next random number the calculator produces. What is the probability that the next number is higher than 0.85? For that, we want to compute P(U ≥ 0.85). We know the density function of U: fU(u) = 1/(1 − 0) = 1. Therefore

P(U ≥ 0.85) = ∫_{0.85}^{1} 1 du = 1 − 0.85 = 0.15.

2.4.2 Exponential distribution

This density is commonly used to model waiting times between occurrences of “rare” events and lifetimes of electrical or mechanical devices.

Definition 2.4.1 (Exponential density) A random variable X has an exponential density (cf. Figure 2.3) if

fX(x) = λ e^(−λx) if x ≥ 0, and fX(x) = 0 otherwise.

λ is called the rate parameter. Mean, variance and distribution function are easy to compute. They are:

E[X] = 1/λ
Var[X] = 1/λ^2
Expλ(t) := FX(t) = 0 if t < 0, and FX(t) = 1 − e^(−λt) if t ≥ 0.

Figure 2.3: Density functions of exponential variables for the rate parameters 0.5, 1, and 2.

The following example will accompany us throughout the remainder of this class:

Example 2.4.2 Hits on a webpage
On average there are 2 hits per minute on a specific web page.
I start to observe this web page at a certain time point 0, and decide to model the waiting time till the first hit Y (in min) using an exponential distribution.

What is a sensible value for λ, the rate parameter? Think: on average there are 2 hits per minute - which makes an average waiting time of 0.5 minutes between hits. We will use this value as the expected value for Y: E[Y] = 0.5. On the other hand, we know that the expected value for Y is 1/λ → we are back at λ = 2 as a sensible choice for the parameter! λ describes the rate at which this web page is hit.

What is the probability that we have to wait at most 40 seconds to observe the first hit? We know the rate at which hits come to the web page in minutes - so it's advisable to express the 40 s in minutes as well: 40 s = 2/3 min. The above probability then becomes P(Y ≤ 2/3), which we can compute:

P(Y ≤ 2/3) = Exp_λ(2/3) = 1 − e^{−2·2/3} ≈ 0.736

How long do we have to wait at most, to observe a first hit with a probability of 0.9? This is a very different question to what we have looked at so far! Here, we want to find a t for which P(Y ≤ t) = 0.9:

P(Y ≤ t) = 0.9  ⟺  1 − e^{−2t} = 0.9  ⟺  e^{−2t} = 0.1  ⟺  t = −0.5 ln 0.1 ≈ 1.15 (min)

- that's approx. 69 s.

Memoryless property

Example 2.4.3 Hits on a web page
In the previous example I stated that we start to observe the web page at time point 0. Does the choice of this time point affect our analysis in any way? Let's assume that during the first minute after we started to observe the page, there is no hit. What is the probability that we have to wait another 40 seconds for the first hit? - this implies an answer to the question what would have happened if we had started our observation of the web page a minute later - would we still get the same results?

The probability we want to compute is a conditional probability.
If we think back - the conditional probability of A given B was defined as

P(A|B) := P(A ∩ B) / P(B)

Now we have to identify what the events A and B are in our case. The information we have is that during the first minute we did not observe a hit: B = (Y > 1). The probability we want to know is that the first hit occurs within another 40 s, i.e. within 1 min 40 s overall: A = (Y ≤ 5/3).

P(first hit within 5/3 min | no hit during 1st min) = P(A|B) = P(A ∩ B)/P(B) = P(Y ≤ 5/3 ∩ Y > 1)/P(Y > 1)
= P(1 < Y ≤ 5/3)/(1 − P(Y ≤ 1)) = (e^{−2} − e^{−10/3})/e^{−2} = 0.736.

That's exactly the same probability as we had before!!! The result of this example is no coincidence. We can generalize:

P(Y ≤ t + s | Y ≥ s) = 1 − e^{−λt} = P(Y ≤ t)

This means: a random variable with an exponential distribution "forgets" about its past. This is called the memoryless property of the exponential distribution. An electrical or mechanical device whose lifetime we model as an exponential variable therefore "stays as good as new" until it suddenly breaks, i.e. we assume that there's no aging process.

2.4.3 Erlang density

Example 2.4.4 Hits on a web page
Remember: we modeled waiting times until the first hit as Exp_2. How long do we have to wait for the second hit? In order to get the waiting time for the second hit, we can add the waiting time until the first hit and the time between the first and the second hit. For both of these we know the distribution: Y_1, the waiting time until the first hit, is an exponential variable with λ = 2. After we have observed the first hit, we start the experiment again and wait for the next hit. Since the exponential distribution is memoryless, this is as good as waiting for the first hit. We can therefore model Y_2, the time between first and second hit, by another exponential distribution with the same rate λ = 2. What we are interested in is Y := Y_1 + Y_2. Unfortunately, we don't know the distribution of Y, yet.
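Before moving on to the Erlang density, the exponential computations of Examples 2.4.2 and 2.4.3 can be verified in a few lines of Python. This is only a sketch for checking the arithmetic - the helper names are our own:

```python
import math

LAM = 2.0  # rate: 2 hits per minute

def exp_cdf(t, lam=LAM):
    # Exp_lam(t) = 1 - e^{-lam*t} for t >= 0, 0 otherwise
    return 1.0 - math.exp(-lam * t) if t >= 0 else 0.0

# Example 2.4.2: P(Y <= 40 s) = P(Y <= 2/3 min)
p_40s = exp_cdf(2 / 3)
print(round(p_40s, 3))          # 0.736

# Example 2.4.2: 90% quantile, solve 1 - e^{-2t} = 0.9 for t
t90 = -math.log(1 - 0.9) / LAM
print(round(t90, 2))            # 1.15 (min), about 69 s

# Example 2.4.3: memorylessness, P(Y <= 5/3 | Y > 1) equals P(Y <= 2/3)
cond = (exp_cdf(5 / 3) - exp_cdf(1)) / (1 - exp_cdf(1))
print(round(cond, 3))           # 0.736 again
```

The last two printed values agreeing is exactly the memoryless property in action.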
Definition 2.4.2 (Erlang density) If Y_1, ..., Y_k are k independent exponential random variables with parameter λ, their sum X has an Erlang distribution:

X := Σ_{i=1}^k Y_i  is Erlang(k, λ)

The Erlang density f_{k,λ} is

f(x) = λe^{−λx} · (λx)^{k−1}/(k−1)! for x ≥ 0, and f(x) = 0 for x < 0.

k is called the stage parameter, λ is the rate parameter.

Expected value and variance of an Erlang distributed variable X can be computed using the properties of expected value and variance for sums of independent random variables:

E[X] = E[Σ_{i=1}^k Y_i] = Σ_{i=1}^k E[Y_i] = k · 1/λ
Var[X] = Var[Σ_{i=1}^k Y_i] = Σ_{i=1}^k Var[Y_i] = k · 1/λ²

In order to compute the distribution function, we need another result about the relationship between Po_λ and Exp_λ.

Theorem 2.4.3 Let X_1, X_2, X_3, ... be independent exponential random variables with parameter λ and (cf. fig. 2.4)

W := largest index j such that Σ_{i=1}^j X_i ≤ T

for some fixed T > 0. Then W ∼ Po_{λT}.

[Figure 2.4: Occurrence times X_1, X_2, X_3, ... marked on a time axis up to T; in this example W = 3.]

With this theorem, we can derive an expression for the Erlang distribution function. Let X be an Erlang_{k,λ} variable:

Erlang_{k,λ}(x) = P(X ≤ x)
= 1 − P(X > x)                                      (1st trick)
= 1 − P(Σ Y_i > x)                                  (X > x means: fewer than k hits observed by time x)
= 1 − P(a Poisson r.v. with rate λx ≤ k − 1)        (above theorem)
= 1 − Po_{λx}(k − 1).

Example 2.4.5 Hits on a web page
What is the density of the waiting time until the second hit? We said that Y, as previously defined, is the sum of two exponential variables, each with rate λ = 2. Y therefore has an Erlang distribution with stage parameter 2, and the density is given as

f_Y(x) = f_{2,2}(x) = 4xe^{−2x} for x ≥ 0

If we wait for the third hit, what is the probability that we have to wait more than 1 min? Z := waiting time until the third hit has an Erlang(3, 2) distribution.

P(Z > 1) = 1 − Erlang_{3,2}(1) = 1 − (1 − Po_{2·1}(3 − 1)) = Po_2(2) = 0.677
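The identity Erlang_{k,λ}(x) = 1 − Po_{λx}(k − 1) is easy to turn into code. A minimal Python sketch (the function names are our own) that reproduces P(Z > 1) ≈ 0.677 from Example 2.4.5:

```python
import math

def poisson_cdf(k, mu):
    # P(N <= k) for N ~ Po(mu)
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

def erlang_cdf(x, k, lam):
    # Erlang_{k,lam}(x) = 1 - Po_{lam*x}(k - 1) for x >= 0
    return 1.0 - poisson_cdf(k - 1, lam * x) if x >= 0 else 0.0

# Example 2.4.5: Z ~ Erlang(3, 2), waiting time until the third hit
p = 1.0 - erlang_cdf(1.0, 3, 2)   # P(Z > 1) = Po_2(2)
print(round(p, 3))                # 0.677
```

Note how the survival probability of the continuous Erlang variable is computed entirely from a discrete Poisson sum - that is the content of Theorem 2.4.3.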
Note: The exponential distribution is a special case of the Erlang distribution: Exp_λ = Erlang(k=1, λ).

Erlang distributions are used to model waiting times of components that are exposed to peak stresses. It is assumed that they can withstand k − 1 peaks and fail with the kth peak. We will come across the Erlang distribution again when modelling the waiting times in queueing systems, where customers arrive with a Poisson rate and need exponential time to be served.

2.4.4 Gaussian or Normal density

The normal density is the archetypical "bell-shaped" density. The density has two parameters, µ and σ², and is defined as

f_{µ,σ²}(x) = 1/√(2πσ²) · e^{−(x−µ)²/(2σ²)}

The expected value and variance of a normally distributed r.v. X are:

E[X] = ∫_{−∞}^∞ x f_{µ,σ²}(x) dx = ... = µ
Var[X] = ∫_{−∞}^∞ (x − µ)² f_{µ,σ²}(x) dx = ... = σ².

Note: the parameters µ and σ² are actually mean and variance of X - and that's what they are called.

[Figure 2.5: Normal densities for several parameters. µ determines the location of the peak on the x-axis, σ² determines the "width" of the bell.]

The distribution function of X is

N_{µ,σ²}(t) := F_{µ,σ²}(t) = ∫_{−∞}^t f_{µ,σ²}(x) dx

Unfortunately, there does not exist a closed form for this integral - f_{µ,σ²} does not have a simple antiderivative. However, to get probabilities we need to evaluate this integral. This leaves us with several choices:

1. personal numerical integration - uuuh, bad, bad idea
2. use of statistical software - later
3. standard tables of normal probabilities - this is what we are going to do!

We will mainly use the third option. First of all: only a special case of the normal distributions is tabled - only positive values of N(0, 1). N(0, 1) is the normal distribution that has mean 0 and a variance of 1. This is the so-called standard normal distribution, whose distribution function is also written as Φ. A table for this distribution is enough, though.
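A short aside on option 2: statistical software evaluates Φ numerically. In Python, for instance, the standard library's error function gives the standard normal distribution function via the standard identity Φ(z) = (1 + erf(z/√2))/2 (the sketch below is ours, not part of the lecture notes):

```python
import math

def Phi(z):
    # standard normal distribution function via the error function:
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(Phi(0.0), 4))   # 0.5
print(round(Phi(1.0), 4))   # 0.8413, matching the table
print(round(Phi(4.0), 4))   # 1.0 for practical purposes
```

This little function will also let us double-check the table look-ups in the examples that follow.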
We will use several tricks to get any normal distribution into the shape of a standard normal distribution.

Basic facts about the normal distribution that allow the use of tables:

(i) For X ∼ N(µ, σ²):

Z := (X − µ)/σ ∼ N(0, 1)

This process is called standardizing X. (This is at least plausible, since E[Z] = (E[X] − µ)/σ = 0 and Var[Z] = Var[X]/σ² = 1.)

(ii) Φ(−z) = 1 − Φ(z), since f_{0,1} is symmetric about 0 (see fig. 2.6 for an explanation).

[Figure 2.6: Standard normal density with the tail areas P(Z ≤ −z) and P(Z ≥ +z) shaded.]

Remember, the area below the graph up to a specified vertical line represents the probability that the random variable Z is less than this value. It's easy to see that the areas in the tails are equal: P(Z ≤ −z) = P(Z ≥ +z). And we already know that P(Z ≥ +z) = 1 − P(Z ≤ z), which proves the above statement.

Example 2.4.6 Suppose Z is a standard normal random variable.

1. P(Z < 1) = Φ(1) = 0.8413 (straight look-up).

2. P(0 < Z < 1) = P(Z < 1) − P(Z < 0) = Φ(1) − Φ(0) = 0.8413 − 0.5 = 0.3413.

3. P(Z < −2.31) = 1 − Φ(2.31) = 1 − 0.9896 = 0.0104.

4. P(|Z| > 2) = P(Z < −2) + P(Z > 2) = 2(1 − Φ(2)) = 2(1 − 0.9772) = 0.0456.

Example 2.4.7 Suppose X ∼ N(1, 2). P(1 < X < 2) = ? A standardization of X gives Z := (X − 1)/√2.

P(1 < X < 2) = P((1 − 1)/√2 < (X − 1)/√2 < (2 − 1)/√2) = P(0 < Z < 1/√2)
= Φ(0.71) − Φ(0) = 0.7611 − 0.5 = 0.2611.

Note that the standard normal table only shows probabilities for z < 3.99. This is all we need, though, since P(Z ≥ 4) ≤ 0.0001.

Example 2.4.8 Suppose the battery life of a laptop is normally distributed with σ = 20 min. Engineering design requires that only 1% of batteries fail to last 300 min. What mean battery life is required to ensure this condition?
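Before working out Example 2.4.8, the table look-ups of Examples 2.4.6 and 2.4.7 can be cross-checked in software (a sketch using the error-function identity for Φ; small deviations from the table come from the table's two-digit rounding of z):

```python
import math

def Phi(z):
    # standard normal distribution function, Phi(z) = (1 + erf(z/sqrt(2)))/2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 2.4.6
print(round(Phi(1.0), 4))                 # 0.8413
print(round(Phi(1.0) - Phi(0.0), 4))      # 0.3413
print(round(1.0 - Phi(2.31), 4))          # 0.0104
print(round(2.0 * (1.0 - Phi(2.0)), 4))   # 0.0455 (table rounding gave 0.0456)

# Example 2.4.7: X ~ N(1, 2), standardize with Z = (X - 1)/sqrt(2)
p = Phi(1.0 / math.sqrt(2.0)) - Phi(0.0)
print(round(p, 4))                        # 0.2602 (the table, with z = 0.71, gave 0.2611)
```

The standardization step is the same as in the hand calculation: only the final evaluation of Φ is delegated to the machine.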
Let X denote the battery life in minutes; then X has a normal distribution with unknown mean µ and standard deviation σ = 20 min. What is µ? The condition that only 1% of batteries is allowed to fail the 300 min limit translates to:

P(X < 300) ≤ 0.01

We must choose µ such that this condition holds. In order to compute the probability, we standardize X: Z := (X − µ)/20. Then

P(X ≤ 300) = P((X − µ)/20 ≤ (300 − µ)/20) = P(Z ≤ (300 − µ)/20) = Φ((300 − µ)/20)

The condition requires:

P(X ≤ 300) ≤ 0.01
⟺ Φ((300 − µ)/20) ≤ 0.01 = 1 − 0.99 = 1 − Φ(2.33) = Φ(−2.33)
⟺ (300 − µ)/20 ≤ −2.33
⟺ µ ≥ 346.6.

Normal distributions have a "reproductive property", i.e. if X and Y are normal variables, then W := aX + bY is also a normal variable, with:

E[W] = aE[X] + bE[Y]
Var[W] = a²Var[X] + b²Var[Y] + 2ab Cov(X, Y)

The normal distribution is extremely common and useful for one main reason: it approximates a lot of other distributions. This is the result of one of the most fundamental theorems in mathematics:

2.5 Central Limit Theorem (CLT)

Theorem 2.5.1 (Central Limit Theorem) If X_1, X_2, ..., X_n are n independent, identically distributed random variables with E[X_i] = µ and Var[X_i] = σ², then the sample mean

X̄ := (1/n) Σ_{i=1}^n X_i

is approximately normally distributed with E[X̄] = µ and Var[X̄] = σ²/n, i.e.

X̄ ∼ N(µ, σ²/n)   or equivalently   Σ_i X_i ∼ N(nµ, nσ²).

Corollary 2.5.2
(a) For large n the binomial distribution B_{n,p} is approximately normal N_{np, np(1−p)}.
(b) For large λ the Poisson distribution Po_λ is approximately normal N_{λ,λ}.
(c) For large k the Erlang distribution Erlang_{k,λ} is approximately normal N_{k/λ, k/λ²}.

Why?

(a) Let X be a variable with a B_{n,p} distribution. We know that X is the result of repeating the same Bernoulli experiment n times and counting the overall number of successes. We can therefore write X as the sum of n B_{1,p} variables X_i: X := X_1 + X_2 + ...
+ X_n. X is then the sum of n independent, identically distributed random variables. The Central Limit Theorem then states that X has an approximate normal distribution with E[X] = nE[X_i] = np and Var[X] = nVar[X_i] = np(1 − p).

(b) It is enough to show the statement for the case that λ is a large integer. Let Y be a Poisson variable with rate λ. Then we can think of Y as the number of occurrences in an experiment that runs for time λ - that is the same as observing λ experiments that each run independently for time 1 and adding their results: Y = Y_1 + Y_2 + ... + Y_λ, with Y_i ∼ Po_1. Again, Y is the sum of independent, identically distributed random variables, so the Central Limit Theorem states that Y has an approximate normal distribution with E[Y] = λ · 1 = λ and Var[Y] = λVar[Y_i] = λ.

(c) This statement is the easiest to prove, since an Erlang_{k,λ} distributed variable Z is by definition the sum of k independent exponential variables Z_1, ..., Z_k. For Z the CLT holds, and we get that Z is approximately normally distributed with E[Z] = kE[Z_i] = k/λ and Var[Z] = kVar[Z_i] = k/λ².

Why do we need the central limit theorem at all? First of all, the CLT gives us the distribution of the sample mean in a very general setting: the only thing we need to know is that all the observed values come from the same distribution, and that the variance of this distribution is not infinite. A second reason is that most tables only contain probabilities up to a certain limit - the Poisson table e.g. only has values for λ ≤ 10, and the binomial distribution is tabled only for n ≤ 20. Beyond that, we can use the normal approximation to get probabilities.

Example 2.5.1 Hits on a webpage
Hits occur with a rate of 2 per min. What is the probability to wait for more than 20 min for the 50th hit? Let Y be the waiting time until the 50th hit. We know: Y has an Erlang_{50,2} distribution.
Therefore:

P(Y > 20) = 1 − Erlang_{50,2}(20) = 1 − (1 − Po_{2·20}(50 − 1)) = Po_40(49)
≈ N_{40,40}(49)                                         (CLT!)
= Φ((49 − 40)/√40) = Φ(1.42) = 0.9222                   (table look-up)

Example 2.5.2 Mean of Uniform Variables
Let U_1, U_2, U_3, U_4, and U_5 be standard uniform variables, i.e. U_i ∼ U(0,1). Without the CLT we would have no idea what distribution the sample mean Ū = (1/5) Σ_{i=1}^5 U_i has! With it, we know: Ū is approximately N(0.5, 1/60) distributed.

Accuracy of the approximation:

• increases with n
• increases with the amount of symmetry in the distribution of the X_i

Rule of thumb for the binomial distribution: use the normal approximation for B_{n,p} if np > 5 (when p ≤ 0.5) or n(1 − p) > 5 (when p ≥ 0.5)!
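The normal approximation in Example 2.5.1 can be compared against the exact Poisson sum, which is still feasible here because k only runs up to 49. A Python sketch (the helper names are ours):

```python
import math

def Phi(z):
    # standard normal distribution function, Phi(z) = (1 + erf(z/sqrt(2)))/2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

lam = 40.0  # number of hits in 20 min at rate 2/min is Po_40

# exact: P(Po_40 <= 49), summed term by term
exact = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(50))

# CLT approximation: Po_40 is approximately N(40, 40)
approx = Phi((49 - lam) / math.sqrt(lam))

print(round(approx, 3))   # 0.923 (the two-digit table, with z = 1.42, gave 0.9222)
print(round(exact, 3))    # exact Poisson value, for comparison
```

The two values are close but not identical - exactly the kind of approximation error the accuracy remarks above describe, and it shrinks as the rate grows.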