3 Random Variables and Their Properties

3.1 Random Variables

Mathematically, a random variable is a real-valued function whose domain is a sample space and whose range is the set of real numbers. Intuitively, however, a random variable can be treated as a numerical outcome of a random experiment. The set of all values a random variable can take with positive probability is called the support of the random variable.

Example 1: Consider a simple experiment of tossing two coins and observing the results. Let Y denote the number of heads obtained. Then Y is a random variable that takes the values 0, 1 and 2 with certain probabilities.

Example 2: Consider the daily amount of rainfall at a specified geographical location. With measuring equipment of perfect accuracy, the amount of rainfall could take on any value in the interval [0, 5]. Each of the uncountably infinite number of points in the interval represents a distinct possible value of the amount of rainfall in a day.

Random variables are of two types:

1. Discrete Random Variable
2. Continuous Random Variable

The number of heads in Example 1 is a discrete random variable, as it takes only a finite number of values, whereas the amount of rainfall in Example 2 is a continuous random variable, as it takes infinitely many values in an interval.

3.2 Discrete Random Variables

A random variable Y is said to be a discrete random variable (DRV) if it can assume only a finite or countably infinite number of distinct values. The set of values the random variable can take on is called the support of the random variable, denoted S_Y. The list of probabilities of Y for all the values in its support is called the probability distribution of Y and is denoted by P(y), where y takes values in S_Y. For a discrete random variable Y, its probability distribution function P(y) is called the probability mass function.

Example 3: Consider the Bus Ridership model in your recommended text [Section 1.1, Probability and Statistics for Data Science by Norman Matloff], which is also discussed in Section 2B.1.3, as follows:

• At each stop, each passenger alights from the bus, independently of the actions of the others, with probability 0.2.
• Either 0, 1 or 2 new passengers get on the bus, with probabilities 0.5, 0.4 and 0.1, respectively. Passengers at successive stops act independently.
• Assume the bus is so large that it never becomes full, so new passengers can always board.
• Suppose the bus is empty when it arrives at its first stop.

Let L1 and L2 denote the number of passengers on the bus as it leaves the first and second stop, respectively. Then L1 and L2 are two discrete random variables with supports S_L1 = {0, 1, 2} and S_L2 = {0, 1, 2, 3, 4}, respectively.

Example 4: Consider an experiment in which we toss a coin until we get a head. Let X denote the number of tosses needed. Then X is a discrete random variable with support {1, 2, 3, ...}. Note that this support is a countably infinite set.

You can visualize tossing a coin until you get a head as follows:

    At the first toss:    H
    At the second toss:   TH
    At the third toss:    TTH
    ...
    At the yth toss:      TT...TH   (y - 1 tails followed by a head)
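As a small illustration, here is an R sketch that simulates this experiment many times (the helper name countTosses and the number of repetitions are arbitrary choices made for this illustration). The estimated probabilities P(X = 1), P(X = 2), ... come out close to 1/2, 1/4, 1/8, ..., and arbitrarily large values of X can occur, consistent with the countably infinite support.

countTosses <- function() {
  tosses <- 0
  repeat {
    tosses <- tosses + 1
    if (runif(1) < 0.5) return(tosses)   # heads with probability 0.5 ends the experiment
  }
}
x <- replicate(10000, countTosses())
head(table(x) / 10000)   # estimated P(X = 1), P(X = 2), ... close to 1/2, 1/4, 1/8, ...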
3.2.1 Expected Value of a Discrete Random Variable

Definition: Let Y be a DRV with probability mass function P(y). Then the expected value of Y, denoted E(Y), is defined as

    E(Y) = Σ_{y ∈ S_Y} y P(y),

provided the sum exists, i.e., Σ_{y ∈ S_Y} |y| P(y) < ∞.

Intuitively, the expected value of a DRV Y is the long-run average value of Y as we repeat the experiment indefinitely. The long-run average value of Y can be written as

    lim_{n→∞} (Y1 + Y2 + ... + Yn) / n,

and

    E(Y) = lim_{n→∞} (Y1 + Y2 + ... + Yn) / n = Σ_{y ∈ S_Y} y P(Y = y).

In other words, the expected value of a DRV Y with support S_Y is a weighted sum of the values in the support of Y, with the weights being the probabilities of those values.

3.2.2 Variance of a Discrete RV

The variance of a random variable (RV) X, for which the expected value exists, is defined as

    Var(X) = E(X - E(X))².

That is, Var(X) is the expected squared difference of X from E(X). The positive square root of Var(X) is called the standard deviation of the RV X.

3.2.3 The Bus Ridership Model - EV and Variance

Recall the Bus Ridership model of Example 3 [Section 1.1, Probability and Statistics for Data Science by Norman Matloff; also discussed in Section 2B.1.3]: at each stop each passenger alights independently with probability 0.2; either 0, 1 or 2 new passengers board with probabilities 0.5, 0.4 and 0.1; passengers at successive stops act independently; the bus never becomes full; and the bus is empty when it arrives at its first stop.

Let L1 and L2 denote the number of passengers on the bus as it leaves the first and second stop, respectively, with supports S_L1 = {0, 1, 2} and S_L2 = {0, 1, 2, 3, 4}. We want to find the expected value and the variance of L1 and L2, that is, E(L1), Var(L1), E(L2) and Var(L2).

From the given information you can write the probability mass function (pmf) of L1 as follows:

Table 1: Probability mass function of L1

    l1       0     1     2
    p(l1)    0.5   0.4   0.1

Expected Value

    E(L1) = Σ_{l1=0}^{2} l1 p(l1) = 0 × 0.5 + 1 × 0.4 + 2 × 0.1 = 0.6

Variance

    Var(L1) = E(L1²) - [E(L1)]² = [(0² × 0.5) + (1² × 0.4) + (2² × 0.1)] - 0.6² = 0.8 - 0.36 = 0.44

From the given information you can calculate the probability masses for L2 = 0, 1, 2, 3, 4 (at the second stop, each of the L1 passengers stays with probability 0.8 and then 0, 1 or 2 new passengers board) and hence write the pmf of L2 as follows:

Table 2: Probability mass function of L2

    l2       0       1        2        3        4
    p(l2)    0.292   0.4096   0.2312   0.0608   0.0064

Expected Value

    E(L2) = Σ_{l2=0}^{4} l2 p(l2) = 1.08

Variance

    Var(L2) = E(L2²) - [E(L2)]² = Σ_{l2=0}^{4} l2² p(l2) - (1.08)² = 1.984 - 1.1664 = 0.8176
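As a quick check, here is a short R sketch that computes these quantities directly from the probability tables; the vectors below simply encode Tables 1 and 2.

l1 <- 0:2; p1 <- c(0.5, 0.4, 0.1)                              # Table 1
l2 <- 0:4; p2 <- c(0.292, 0.4096, 0.2312, 0.0608, 0.0064)      # Table 2
sum(l1 * p1)                        # E(L1)   = 0.6
sum(l1^2 * p1) - sum(l1 * p1)^2     # Var(L1) = 0.44
sum(l2 * p2)                        # E(L2)   = 1.08
sum(l2^2 * p2) - sum(l2 * p2)^2     # Var(L2) = 0.8176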
Expected Value and Variance using Monte Carlo Simulation

Let us now consider finding the expected number of passengers, and the variance of the number of passengers, on the bus as it leaves the tenth stop, L10. You may try finding the probability mass function of L10 and then applying the definitions of expected value and variance. Note that the supports of the number of passengers on the bus as it leaves the third stop, L3, the fourth stop, L4, and so on up to the tenth stop, L10, are as follows:

    S_L3  = {0, 1, 2, 3, 4, 5, 6}
    S_L4  = {0, 1, 2, 3, 4, 5, 6, 7, 8}
    ...
    S_L10 = {0, 1, 2, ..., 20}

Thus it is tedious to find the probability mass function of L10 and then use the definitions to find E(L10) and Var(L10). Instead, you can use Monte Carlo simulation to approximate E(L10) and Var(L10).

nreps <- 10000
nstops <- 10
total <- 0
total2 <- 0
for (i in 1:nreps) {
  passengers <- 0
  for (j in 1:nstops) {
    # each passenger currently on board alights with probability 0.2
    if (passengers > 0)
      for (k in 1:passengers)
        if (runif(1) < 0.2) passengers <- passengers - 1
    # 0, 1 or 2 new passengers board with probabilities 0.5, 0.4, 0.1
    newpass <- sample(0:2, 1, prob = c(0.5, 0.4, 0.1))
    passengers <- passengers + newpass
  }
  total <- total + passengers
  total2 <- total2 + passengers * passengers
}
EV <- total / nreps
EV

## [1] 2.7021

VAR <- total2 / nreps - (total / nreps)^2
VAR

## [1] 2.269756

Thus, on average, there will be about 2.7 passengers on the bus as it leaves the tenth stop, with a variance of about 2.27.

3.3 Continuous Random Variables

A random variable that takes on any value in an interval is called a continuous random variable. Recall the rainfall amount in Example 2, which takes values in the interval [0, 5]. If we let Z denote the amount of rainfall, then S_Z = [0, 5] is the support of Z.

Other examples of continuous random variables are (i) the length of life, in years, of a washing machine, (ii) systolic and diastolic blood pressure measurements, and (iii) intelligence quotient (IQ).

Note: Mathematically, it is impossible to assign nonzero probabilities to all the points in an interval and at the same time satisfy the requirement that the probabilities of the distinct possible values sum to 1. Thus the probability distribution of a continuous random variable is defined using a different approach, based on the rules of calculus.

3.3.1 Cumulative Distribution Function (CDF) of a Random Variable

Definition: Let Y denote any random variable (discrete or continuous). The distribution function of Y, denoted F_Y(y), is defined as

    F_Y(y) = P(Y ≤ y),   -∞ < y < ∞.

If the CDF of a RV is a step function, the random variable is discrete, whereas if the CDF of a RV is continuous, the random variable is continuous.

Example 5: Let Y be a discrete random variable with probability mass function

    p_Y(y) = C(2, y) (1/2)^y (1/2)^(2-y),   y = 0, 1, 2,

where C(2, y) denotes the binomial coefficient "2 choose y". Accordingly, p(0) = 1/4, p(1) = 1/2, p(2) = 1/4. The CDF of Y can be written as:

    F_Y(y) = P(Y ≤ y) =  0     for y < 0
                         1/4   for 0 ≤ y < 1
                         3/4   for 1 ≤ y < 2
                         1     for y ≥ 2

Note that the CDF of Y is a step function with a jump at each of the values 0, 1, 2 that Y takes. Also note that F(-2) = 0 and F(1.5) = P(Y ≤ 1.5) = P(Y = 0) + P(Y = 1) = 1/4 + 1/2 = 3/4, which you can read directly from the step function. Since the CDF F_Y(y) is discontinuous, the associated RV is not continuous, but discrete.

Properties of a CDF

The CDF F_Y(y) of a random variable Y satisfies the following properties:

1. F(-∞) = lim_{y→-∞} F(y) = 0
2. F(∞) = lim_{y→∞} F(y) = 1
3. F(y) is a nondecreasing function of y. That is, if y1 and y2 are any values such that y1 < y2, then F(y1) ≤ F(y2).

Definition: Let Y denote a random variable with CDF F(y). Then Y is said to be a continuous random variable if F(y) is a continuous function for -∞ < y < ∞. Note that for a continuous random variable Y, P(Y = y) = 0 for any real value y.

Example 6: Let the random variable Y have the following CDF:

    F_Y(y) = P(Y ≤ y) =  0   for y < 0
                         y   for 0 ≤ y ≤ 1
                         1   for y > 1

You can sketch the graph of F_Y(y): it is 0 for y < 0, rises linearly from 0 to 1 on the interval [0, 1], and equals 1 for y > 1.
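As a small illustration, here is an R sketch of this CDF written directly from its piecewise definition (the name F is an arbitrary choice). Since Example 6 is the CDF of the Uniform(0, 1) distribution, it agrees with the built-in punif().

F <- function(y) ifelse(y < 0, 0, ifelse(y <= 1, y, 1))
F(c(-0.5, 0.25, 0.8, 2))       # 0.00 0.25 0.80 1.00
punif(c(-0.5, 0.25, 0.8, 2))   # same values: Example 6 is the Uniform(0, 1) CDF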
Density Function

Definition: Let F_Y(y) be the distribution function of a continuous random variable Y. Then f_Y(y), given by

    f_Y(y) = dF_Y(y)/dy = F′(y),

wherever the derivative exists, is called the probability density function (pdf) of the random variable Y. From the fundamental theorem of calculus, it follows that

    F_Y(y) = ∫_{-∞}^{y} f(u) du,

where f(·) is the probability density function (pdf) and u is the variable of integration.

Example 7: Let the random variable Y have the following CDF:

    F_Y(y) = P(Y ≤ y) =  0   for y < 0
                         y   for 0 ≤ y ≤ 1
                         1   for y > 1

Let us find the pdf of Y and graph it. Differentiating piece by piece,

    f_Y(y) = dF(y)/dy =  d(0)/dy = 0   for y < 0
                         d(y)/dy = 1   for 0 < y < 1
                         d(1)/dy = 0   for y > 1

Note that f(y) is not defined at y = 0 and y = 1, since the derivative does not exist at those points. Therefore the pdf of Y is given by

    f(y) = 1,   0 < y < 1.

The random variable Y is said to have a uniform probability distribution.

Properties of a Density Function

The density function f(y) satisfies the following properties:

1. f(y) ≥ 0 for any value in the support of Y
2. ∫_{-∞}^{∞} f(y) dy = 1
3. P(a ≤ Y ≤ b) = ∫_{a}^{b} f(y) dy

Notes: 1) When it is clear that f_Y(y) and F_Y(y) are the pdf and CDF of the random variable Y, you can drop the subscript and denote these by f(y) and F(y), respectively. 2) We will use R to find the CDF of a continuous variable with known pdf, and the probability within an interval (property 3).

3.3.2 Expected Value of a Continuous Random Variable

Definition: The expected value of a continuous random variable Y is defined as

    E(Y) = ∫_{-∞}^{∞} y f(y) dy,

provided the integral exists. Let g(Y) be a function of Y. Then the expected value of g(Y) is given by

    E[g(Y)] = ∫_{-∞}^{∞} g(y) f(y) dy,

provided the integral exists.

3.3.3 Variance of a Continuous Random Variable

Let us denote the expected value of Y by the Greek symbol µ, that is, E(Y) = µ, and let g(Y) = (Y - µ)². Then the variance of Y is simply E[g(Y)]. That is,

    Var(Y) = E[(Y - µ)²] = ∫_{-∞}^{∞} (y - µ)² f(y) dy.

Example 8: Let X be a continuous random variable with density

    f(x) = 2x/15   for 1 ≤ x ≤ 4
           0       elsewhere.

Find the expected value and variance of X, that is, find E(X) and Var(X). You can use the definitions of E(X) and Var(X) to find these directly as follows:

    E(X) = ∫_{1}^{4} x f(x) dx = ∫_{1}^{4} x (2x/15) dx = ∫_{1}^{4} (2x²/15) dx = (2/45)[x³]_{1}^{4} = (2/45)(64 - 1) = 126/45 = 2.8

    Var(X) = E(X²) - 2.8² = ∫_{1}^{4} x² f(x) dx - 7.84 = ∫_{1}^{4} (2x³/15) dx - 7.84
           = (2/15)[x⁴/4]_{1}^{4} - 7.84 = (2/60)(4⁴ - 1⁴) - 7.84 = 8.5 - 7.84 = 0.66

You can use the R function integrate() to evaluate the integrals and hence find the expected value and variance of X as follows. To find E(X), we integrate g1(x) = x f(x) = 2x²/15 over [1, 4]; to find Var(X), we integrate g3(x) = x² f(x) = 2x³/15 over [1, 4] and subtract [E(X)]² = 7.84.

# E(X)
g1 <- function(x) 2*x^2/15
integrate(g1, 1, 4)$value

## [1] 2.8

# Var(X)
g3 <- function(x) 2*x^3/15
integrate(g3, 1, 4)$value - 7.84

## [1] 0.66

3.4 Properties of Expected Value and Variance

3.4.1 Properties of Expected Value

1. If Y is a DRV and g(Y) is a function of Y, then

       E(g(Y)) = Σ_{y ∈ S_Y} g(y) p(y).

2. If Y is a DRV and g(Y) = c, where c is a constant, then

       E(g(Y)) = E(c) = Σ_{y ∈ S_Y} c p(y) = c.

   The last equality holds because Σ_{y ∈ S_Y} p(y) = 1.

3. For any random variable Y (this holds for continuous RVs as well) and constant c,

       E(cY) = c E(Y).

4. For any random variables Y1 and Y2 (this holds for continuous RVs as well) and constants c1 and c2,

       E(c1 Y1 + c2 Y2) = c1 E(Y1) + c2 E(Y2).

   This result holds for more than two random variables. Note that for c1 = c2 = 1, we can write E(Y1 + Y2) = E(Y1) + E(Y2).

5. If Y1 and Y2 are two independent random variables, then

       E(Y1 Y2) = E(Y1) E(Y2).

Note that, if you can find the expected values of one or more random variables, these properties can be used to find expected values under multiple scenarios.
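For instance, here is a small simulation sketch illustrating properties 4 and 5; the particular variables (a fair die roll and an independent Binomial(2, 0.5) count) and the sample size are arbitrary choices made only for this illustration.

set.seed(1)
y1 <- sample(1:6, 1e5, replace = TRUE)   # fair die: E(Y1) = 3.5
y2 <- rbinom(1e5, 2, 0.5)                # independent Binomial(2, 0.5): E(Y2) = 1
mean(3*y1 + 2*y2)   # close to 3*E(Y1) + 2*E(Y2) = 12.5   (property 4)
mean(y1 * y2)       # close to E(Y1)*E(Y2) = 3.5           (property 5)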
3.4.2 Properties of Variance

1. Var(c) = 0, where c is a constant
2. Var(X) = E(X²) - (E(X))²
3. Var(cX) = c² Var(X)
4. Var(cX + d) = c² Var(X)

Why do these properties make sense?

1. Var(c) = E(c - E(c))² = E(c - c)² = 0.

2. Var(X) = E(X - E(X))²
          = E(X² - 2X E(X) + E(X)²)
          = E(X²) - 2E(X)E(X) + E(X)²
          = E(X²) - E(X)².

   The second equality expands the square, the third holds because the expectation E is a linear operator, and the last step uses the fact that E(X) is a constant.

3. Var(cX) = E(cX - E(cX))²
           = E(cX - cE(X))²
           = c² E(X - E(X))²
           = c² Var(X).

4. Var(cX + d) = c² Var(X) is left as an exercise.

3.4.3 Chebyshev's Inequality

Chebyshev's inequality gives a bound on the probability that a random variable is a given number of standard deviations or more away from its mean. For a random variable X with finite mean µ and variance σ²,

    P(|X - µ| ≥ cσ) ≤ 1/c².

Here µ = E(X), σ² = E(X - µ)², and c is a positive constant. For c = 2, the inequality says that no more than 25% of the probability can lie two or more standard deviations from the mean, whatever the underlying distribution. For example, let X denote the number of claims an insurance company receives from its customers. X is a RV and, without knowing anything about its underlying probability distribution except that the mean and variance exist, Chebyshev's inequality tells you that the probability that the number of claims is three or more standard deviations from its mean is at most 1/9, about 11%.

3.4.4 Coefficient of Variation

To compare the variability of random variables with different probability distributions, it is useful to consider the size of Var(X) relative to the size of E(X). The coefficient of variation, defined as the ratio of the standard deviation to the mean,

    Coef of var = √Var(X) / E(X),

is a scale-free measure that serves as a good way to compare variability across several random variables.

A Few Remarks on Expected Value and Variance

1. For a random variable X, the function g(θ) = E[(X - θ)²] is minimized at θ = E(X). You can use calculus, in particular the optimization process, to prove this result.
2. Note that if you plug θ = E(X) into the function, it becomes the variance of X.
3. If we are interested in an unknown constant θ, known as the parameter of interest (such as a mean weight), then X - θ is the prediction error that we want to minimize.
4. E[(X - θ)²] is the expected squared error, which is minimized at θ = E(X). Later in the course we will learn that when E(X) = θ, X is called an unbiased estimator of θ.
5. Another related function is g(θ) = E(|X - θ|), which is minimized when we replace θ by the median of the distribution.

3.4.5 Covariance

Covariance is a measure of the degree to which two random variables X and Y vary together, and is defined as

    Cov(X, Y) = E[(X - E(X))(Y - E(Y))].

If Cov(X, Y) is scaled by the inverse of the product of the standard deviations of X and Y, you get what is called the Pearson correlation coefficient, denoted ρ_XY. That is,

    ρ_XY = E[(X - E(X))(Y - E(Y))] / √(Var(X) Var(Y)).

For example, if X and Y denote the heights and weights of adults, then in general people taller than average tend to be heavier and people shorter than average tend to be lighter. Thus, typically,

    X - E(X) > 0  ⟹  Y - E(Y) > 0  ⟹  (X - E(X))(Y - E(Y)) > 0
    X - E(X) < 0  ⟹  Y - E(Y) < 0  ⟹  (X - E(X))(Y - E(Y)) > 0

If two variables are positively related, as in the above case, their covariance is positive. But if two variables are negatively related, e.g., the price of a commodity and its supply in the market, their covariance is negative.

Properties of Covariance

1. Cov(X, Y) = E(XY) - E(X)E(Y)
2. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
3. Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
4. If X and Y are independent, then Cov(X, Y) = E(XY) - E(X)E(Y) = E(X)E(Y) - E(X)E(Y) = 0, and hence Var(X + Y) = Var(X) + Var(Y).

Properties (3) and (4) can be extended as follows:

1. Var(a1 X1 + ... + ak Xk) = Σ_{i=1}^{k} ai² Var(Xi) + 2 Σ_{1 ≤ i < j ≤ k} ai aj Cov(Xi, Xj)

2. If the Xi's are independent,

       Var(a1 X1 + ... + ak Xk) = Σ_{i=1}^{k} ai² Var(Xi)
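As a small illustration, here is a simulation sketch of two positively related variables in the spirit of the height/weight example; the coefficients, the noise level and the sample size below are arbitrary choices made only for this illustration.

set.seed(3)
x <- rnorm(1e5, mean = 170, sd = 10)           # "height"-like variable
y <- 0.9 * x + rnorm(1e5, mean = 0, sd = 8)    # positively related "weight"-like variable
cov(x, y)                          # positive, close to 0.9 * Var(X) = 90
cor(x, y)                          # Pearson correlation, between 0 and 1 here
var(x + y)                         # close to var(x) + var(y) + 2*cov(x, y)   (property 2)
var(x) + var(y) + 2 * cov(x, y)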
3.5 Other Properties of a RV: Skewness and Kurtosis

The expected value and the variance of a random variable X are based on the first- and second-order moments of the probability distribution of X, and give you measures of the central tendency and the dispersion of the distribution. Higher-order moments are needed to learn about the shape of the distribution. The coefficients of skewness and kurtosis are two measures of the shape of the probability distribution of a random variable X, defined as the standardized third and fourth moments, respectively.

The coefficient of skewness, denoted β1, is defined as follows:

    β1 = E(X - E(X))³ / [Var(X)]^(3/2)

The numerator in the above definition is called the third central moment. The normal distribution, which we will present in a few weeks, and other symmetric distributions with a finite third moment have skewness 0.

The coefficient of kurtosis, denoted β2, is defined as follows:

    β2 = E(X - E(X))⁴ / [Var(X)]²

The numerator in the above definition is called the fourth central moment. The excess kurtosis is defined as the kurtosis minus 3. Distributions with zero excess kurtosis are called mesokurtic. A distribution with positive excess kurtosis is called leptokurtic; in terms of shape, a leptokurtic distribution has fatter tails. An example of a leptokurtic distribution is the Student's t-distribution. A distribution with negative excess kurtosis is called platykurtic; in terms of shape, a platykurtic distribution has thinner tails. An example of a platykurtic distribution is the uniform distribution.

References

1. Norman Matloff. Probability and Statistics for Data Science. CRC Press/Taylor & Francis Group, 2020.
2. Wackerly, Dennis D., Mendenhall III, W., and Scheaffer, Richard L. Mathematical Statistics with Applications, 7th Edition. Brooks/Cole, 2008.
3. https://en.wikipedia.org/wiki/Skewness
4. https://en.wikipedia.org/wiki/Kurtosis