
Random Variables & Properties: Expected Value, Variance, CDF

3 Random Variables and Their Properties
3.1 Random Variables
Mathematically, a random variable is a real-valued function whose domain is a sample space and whose
range is the set of real numbers. Intuitively, however, a random variable can be treated as a numerical outcome
of a random experiment. The set of all possible values a random variable can take with positive
probability is called the support of the random variable.
Example 1: Consider a simple experiment of tossing two coins and observing the results. Let Y denote the
number of heads obtained. Note, Y is a random variable that takes values with certain probabilities.
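For instance, if the coins are fair, the four equally likely outcomes HH, HT, TH and TT give P(Y = 0) = 1/4, P(Y = 1) = 1/2 and P(Y = 2) = 1/4.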
Example 2: Consider the daily amount of rainfall at a specified geographical location. With measuring
equipment of perfect accuracy, the amount of rainfall could take on any value in the interval [0, 5]. Each of
the uncountably infinite number of points in the interval represents a distinct possible value of the amount of
rainfall in a day.
Random variables are of two types:
1. Discrete Random Variable
2. Continuous Random Variable
The number of heads in Example 1 is a discrete random variable, as it takes only a finite number of values,
whereas the amount of rainfall in Example 2 is a continuous random variable, as it takes an uncountably
infinite number of values in an interval.
3.2 Discrete Random Variables
A random variable Y is said to be a discrete random variable (DRV), if it can assume only a finite or
countably infinite number of distinct values. The set of values the random variable can take on is called the
support, denoted as Sy , of the random variable. Note that the list of probabilities of Y for all the values
in its support is called probability distribution of Y , and is denoted by P (y), where y takes values in Sy .
For a discrete random variable Y , its probability distribution function P (y) is called the probability mass
function.
Example 3: Consider the Bus Ridership model in your recommended text [section 1.1, Probability and
Statistics for Data Science by Norman Matloff] which is also discussed in section 2B.1.3 as follows:
• At each stop, each passenger alights from the bus, independently of the actions of others, with probability
0.2 each.
• Either 0, 1 or 2 new passengers get on the bus, with probabilities 0.5, 0.4 and 0.1, respectively. Passengers
at successive stops act independently.
• Assume the bus is so large that it never becomes full, so the new passengers can always board.
• Suppose the bus is empty when it arrives at its first stop.
Let L1 and L2 denote the number of passengers on the bus as it leaves the first and second stop respectively.
Note that L1 and L2 are two discrete random variables with supports SL1 = {0, 1, 2} and SL2 = {0, 1, 2, 3, 4}
respectively.
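These supports follow because the bus arrives empty at the first stop, so at most 2 passengers can board there, and at the second stop at most 2 more can board while up to 2 of the original passengers remain.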
Example 4: Consider an experiment in which we toss a coin until we get a head. Let X denote the number
of tosses needed. Then X is a discrete random variable with support {1, 2, 3, . . .}. Note that this support is a
countably infinite set.
You can visualize tossing a coin until you get a head as follows:

H            at the first toss
TH           at the second toss
TTH          at the third toss
...
TT . . . H   at the y-th toss
3.2.1 Expected Value of a Discrete Random Variable
Definition: Let Y be a DRV with the probability mass function P (y). Then the mathematical definition of
the expected value of Y , denoted as E(Y ) is given by
E(Y) = Σ_{y ∈ SY} y P(y),

provided that the sum exists, i.e., Σ_{y ∈ SY} |y| P(y) < ∞.
Intuitively, the expected value of a DRV Y , is defined as the long-run average value of Y , as we repeat the
experiment indefinitely. The long-run average value of Y can be written as:
lim_{n→∞} (Y1 + Y2 + . . . + Yn)/n
Note,

E(Y) = lim_{n→∞} (Y1 + Y2 + . . . + Yn)/n = Σ_{y ∈ SY} y P(Y = y).

Note that the expected value of a DRV Y with support SY is a weighted sum of the values in the support of
Y, with the weights being the probabilities of those values.
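As a quick illustration of the long-run-average interpretation, here is a minimal R sketch: it simulates Y, the number of heads in two fair coin tosses (Example 1), many times and compares the average with E(Y) = 0(1/4) + 1(1/2) + 2(1/4) = 1. The fair-coin assumption and the use of rbinom() are purely for illustration.

set.seed(1)
nreps = 100000
y = rbinom(nreps, size = 2, prob = 0.5)  # each Y is the number of heads in 2 fair tosses
mean(y)                                  # long-run average; should be close to E(Y) = 1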
3.2.2 Variance of Discrete RV
The variance of a random variable (RV) X, for which the expected value exists, is defined mathematically as

Var(X) = E[(X − E(X))²].

Note that Var(X) is the expected squared difference of X from E(X). The positive square root of Var(X) is
called the standard deviation of the RV X.
3.2.3 The Bus Ridership model - EV and Variance
Consider the Bus Ridership model in your recommended text [section 1.1, Probability and Statistics for
Data Science by Norman Matloff], which is also discussed in section 2B.1.3, as follows:
• At each stop, each passenger alights from the bus, independently of the actions of others, with probability 0.2 each.
• Either 0, 1 or 2 new passengers get on the bus, with probabilities 0.5, 0.4 and 0.1, respectively. Passengers at successive stops act independently.
• Assume the bus is so large that it never becomes full, so the new passengers can always board.
• Suppose the bus is empty when it arrives at its first stop.
Let L1 and L2 denote the number of passengers on the bus as it leaves the first and second stop respectively.
Note that L1 and L2 are two discrete random variables with supports SL1 = {0, 1, 2} and SL2 = {0, 1, 2, 3, 4}
respectively.
We want to find the expected value and the variance of L1 and L2 , that is find E(L1 ), V ar(L1 ), E(L2 ), V ar(L2 ).
From the given information you can write the probability mass function (pmf) of L1 as follows:
Table 1: Probability mass function of L1

  l1    p(l1)
   0     0.5
   1     0.4
   2     0.1
Expected Value

E(L1) = Σ_{l1=0}^{2} l1 p(l1) = 0 × 0.5 + 1 × 0.4 + 2 × 0.1 = 0.6
Variance

Var(L1) = E(L1²) − [E(L1)]² = [(0² × 0.5) + (1² × 0.4) + (2² × 0.1)] − 0.6² = 0.8 − 0.36 = 0.44
From the given information you can calculate the probability masses for L2 = 0, 1, 2, 3, 4 and hence write pmf
of L2 as follows:
Table 2: Probability mass function of L2

  l2    p(l2)
   0    0.292
   1    0.4036
   2    0.2244
   3    0.066
   4    0.014
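For instance, the first entry can be obtained by conditioning on L1: leaving the second stop empty requires every passenger on board to alight and nobody to board, so

P(L2 = 0) = P(L1 = 0)(0.5) + P(L1 = 1)(0.2)(0.5) + P(L1 = 2)(0.2)²(0.5) = 0.25 + 0.04 + 0.002 = 0.292.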
Expected Value

E(L2) = Σ_{l2=0}^{4} l2 p(l2) = 1.1064
Variance

Var(L2) = E(L2²) − [E(L2)]² = Σ_{l2=0}^{4} l2² p(l2) − (1.1064)² = 0.8952
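You can verify these weighted-sum calculations in R directly from the values in Tables 1 and 2 (a minimal sketch):

l1 = 0:2
p1 = c(0.5, 0.4, 0.1)
sum(l1 * p1)                      # E(L1) = 0.6
sum(l1^2 * p1) - sum(l1 * p1)^2   # Var(L1) = 0.44
l2 = 0:4
p2 = c(0.292, 0.4036, 0.2244, 0.066, 0.014)
sum(l2 * p2)                      # E(L2) = 1.1064
sum(l2^2 * p2) - sum(l2 * p2)^2   # Var(L2), approximately 0.895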
Expected value and Variance using Monte Carlo Simulation
Let us now consider finding the value of the expected number of passengers and the variance of the number
of passengers on the bus as it leaves the tenth stop, L10 . You may try finding the probability mass function
of the random variable L10 and then apply the definition of expected value. Note that the supports for the
number of passengers on the bus as it leaves the third stop (L3), the fourth stop (L4), and so on up to the
tenth stop (L10) are as follows:

SL3 = {0, 1, 2, 3, 4, 5, 6}
SL4 = {0, 1, 2, 3, 4, 5, 6, 7, 8}
...
SL10 = {0, 1, 2, . . . , 20}
Thus it is tedious to find the probability mass function of L10 and then use the definitions of expected value
and variance to find E(L10) and Var(L10). Instead, you can use Monte Carlo simulation to approximate
E(L10) and Var(L10).
nreps = 10000      # number of simulated trips through the first ten stops
nstops = 10
total = 0          # running sum of the simulated L10 values
total2 = 0         # running sum of the squared L10 values
for (i in 1:nreps) {
  passengers = 0   # the bus arrives empty at the first stop
  for (j in 1:nstops) {
    # each passenger currently on board alights with probability 0.2
    if (passengers > 0)
      for (k in 1:passengers)
        if (runif(1) < 0.2)
          passengers = passengers - 1
    # 0, 1 or 2 new passengers board with probabilities 0.5, 0.4, 0.1
    newpass = sample(0:2, 1, prob = c(0.5, 0.4, 0.1))
    passengers = passengers + newpass
  }
  total = total + passengers
  total2 = total2 + passengers*passengers
}
EV = total/nreps                      # approximates E(L10)
EV

## [1] 2.7021

VAR = total2/nreps - (total/nreps)^2  # approximates Var(L10) = E(L10^2) - [E(L10)]^2
VAR

## [1] 2.269756
Thus, on average, there will be about 2.7 passengers on the bus as it leaves the tenth stop, with a variance of
about 2.27.
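As a rough sanity check under the stated model assumptions, linearity of expectation gives the recursion E(L_{n+1}) = 0.8 E(L_n) + 0.6, since each passenger on board stays with probability 0.8 and on average 0.6 new passengers board at each stop. The expected load therefore approaches the fixed point of x = 0.8x + 0.6, namely 3, as the number of stops grows, and it is roughly 2.68 at the tenth stop, consistent (up to simulation error) with the estimate above.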
3.3 Continuous Random Variables
The type of random variable that takes on any value in an interval is called a continuous random variable.
Recall the rainfall amount in Example 2 that takes values in the interval [0, 5]. If we let Z denote the
amount of rainfall, then SZ = [0, 5] is the support of Z.
Other examples of a continuous random variable are i) the lifetime, in years, of a washing machine, ii)
systolic and diastolic blood pressure measurements, and iii) intelligence quotient (IQ).
Note: Mathematically it is impossible to assign nonzero probabilities to all the points on an interval and at
the same time satisfy the requirement that the probabilities of the distinct possible values sum to 1.
Thus the probability distribution of a continuous random variable is defined using different methods, based
on rules of calculus.
3.3.1 Cumulative Distribution Function (CDF) of a random variable
Definition: Let Y denote any random variable (discrete or continuous). The distribution function of Y ,
denoted by FY (y) is defined as
FY (y) = P (Y ≤ y), −∞ < y < ∞
Note that if the CDF of a RV is a step function, the random variable is discrete, whereas if the CDF of a RV
is a continuous function, the random variable is continuous.
Example 5: Let Y be a discrete random variable with the probability mass function given by,
pY(y) = C(2, y) (1/2)^y (1/2)^(2−y), y = 0, 1, 2,

where C(2, y) denotes the binomial coefficient "2 choose y".
Accordingly, p(0) = 1/4, p(1) = 1/2, p(2) = 1/4.
The cdf of Y can be written as:

FY(y) = P(Y ≤ y) =
  0,    for y < 0
  1/4,  for 0 ≤ y < 1
  3/4,  for 1 ≤ y < 2
  1,    for y ≥ 2
Note that the CDF of Y is a step function with a jump at each of the values 0, 1, 2 that Y takes. Also note that F(−2) = 0
and F(1.5) = P(Y ≤ 1.5) = P(Y = 0) + P(Y = 1) = 1/4 + 1/2 = 3/4, which you can read directly from the step
function. Since the CDF FY(y) is discontinuous, the associated RV is not continuous, but discrete.
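In R you can recover this step-function CDF from the probability mass function with a cumulative sum (a minimal sketch for Example 5):

y = 0:2
p = c(1/4, 1/2, 1/4)
cumsum(p)         # CDF values at the jump points: 0.25, 0.75, 1.00
sum(p[y <= 1.5])  # F(1.5) = P(Y <= 1.5) = 0.75, as computed above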
Properties of a CDF
The CDF FY(y) of a random variable Y satisfies the following properties:

1. F(−∞) = lim_{y→−∞} F(y) = 0

2. F(∞) = lim_{y→∞} F(y) = 1

3. F(y) is a nondecreasing function of y. That is, if y1 and y2 are any values of y such that y1 < y2, then
F(y1) ≤ F(y2).
Definition: Let Y denote a random variable with CDF F (y), then Y is said to be a continuous random
variable, if the CDF F (y) is a continuous function for −∞ < y < ∞.
Note that for a continuous random variable Y, P(Y = y) = 0 for any real value y.
Example 6: Let the random variable Y have the following CDF:

FY(y) = P(Y ≤ y) =
  0,  for y < 0
  y,  for 0 ≤ y ≤ 1
  1,  for y > 1
You can sketch the graph of FY(y): it equals 0 for y < 0, increases linearly from 0 to 1 on [0, 1], and equals 1 for y > 1. [Graph of FY(y) omitted.]
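A minimal R sketch of this graph (using base R graphics) is:

ygrid = seq(-0.5, 1.5, by = 0.01)
Fy = pmin(pmax(ygrid, 0), 1)  # equals 0 for y < 0, y on [0, 1], and 1 for y > 1
plot(ygrid, Fy, type = "l", xlab = "y", ylab = "F(y)", main = "CDF of Y")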
Density function
Definition: Let FY(y) be the distribution function of a continuous random variable Y. Then fY(y), given by

fY(y) = dFY(y)/dy = F′(y),

wherever the derivative exists, is called the probability density function (pdf) of the random variable Y. From
the fundamental theorem of calculus, it follows that

FY(y) = ∫_{−∞}^{y} f(u) du,

where f(·) is the probability density function (pdf) and u is the variable of integration.
Example 7
Let the random variable Y have the following cdf:

FY(y) = P(Y ≤ y) =
  0,  for y < 0
  y,  for 0 ≤ y ≤ 1
  1,  for y > 1
Let’s find the pdf of Y and graph it.

fY(y) = dF(y)/dy =
  d(0)/dy = 0,  for y < 0
  d(y)/dy = 1,  for 0 < y < 1
  d(1)/dy = 0,  for y > 1
Note f (y) is not defined at y = 0 and y = 1 since the derivative does not exist at those points.
Therefore the pdf of Y is given by
f (y) = 1, 0 < y < 1
The random variable Y is said to have a uniform probability distribution on the interval (0, 1).
Properties of a density function
The density function f(y) satisfies the following properties:

1. f(y) ≥ 0 for any value in the support of Y

2. ∫_{−∞}^{∞} f(y) dy = 1

3. P(a ≤ Y ≤ b) = ∫_{a}^{b} f(y) dy
Note : 1) When it is clear that fY (y) and FY (y) are the pdf and cdf of the random variable Y , you can drop
the subscript and denote these by f (y) and F (y) respectively.
2) We will use R to find the cdf of a continuous variable with known pdf and the probability within an
interval (property 3).
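For instance (a minimal sketch using the uniform pdf f(y) = 1, 0 < y < 1, from Example 7), property 3 and the cdf can be evaluated with integrate():

f = function(y) ifelse(y > 0 & y < 1, 1, 0)  # the pdf of Y from Example 7
integrate(f, 0.2, 0.7)$value                 # P(0.2 <= Y <= 0.7) = 0.5
integrate(f, 0, 0.6)$value                   # F(0.6) = P(Y <= 0.6) = 0.6 (pdf is 0 below 0)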
3.3.2 Expected Value of a Continuous Random Variable
Definition: The expected value of a continuous random variable Y is defined as
E(Y) = ∫_{−∞}^{∞} y f(y) dy,

given that the integral exists.
Let g(Y) be a function of Y; then the expected value of g(Y) is given by

E[g(Y)] = ∫_{−∞}^{∞} g(y) f(y) dy,

given that the integral exists.
3.3.3 Variance of a Continuous Random Variable
Let us denote the expected value of Y by the Greek symbol µ, that is, E(Y) = µ, and let g(Y) = (Y − µ)².
Then the variance of Y is simply E[g(Y)]. That is,

Var(Y) = E[(Y − µ)²] = ∫_{−∞}^{∞} (y − µ)² f(y) dy
Example 8
Let X be a continuous random variable with density

f(x) =
  2x/15,  1 ≤ x ≤ 4
  0,      elsewhere
Find the expected value and variance of X. That is find E(X), V ar(X).
You can use the definition of E(X) and V ar(X) to find these directly as follows:
E(X) = ∫_{1}^{4} x f(x) dx = ∫_{1}^{4} x (2x/15) dx = ∫_{1}^{4} (2x²/15) dx = (2/45)[x³]_{1}^{4} = (2/45)(64 − 1) = 126/45 = 2.8

Var(X) = E(X²) − 2.8² = ∫_{1}^{4} x² f(x) dx − 7.84 = ∫_{1}^{4} (2x³/15) dx − 7.84
       = (2/15)[x⁴/4]_{1}^{4} − 7.84 = (2/60)(4⁴ − 1⁴) − 7.84 = 8.5 − 7.84 = 0.66
You can use the R function integrate() to evaluate the integrals and hence find the expected value and
variance of X as follows.
To find E(X), we need to integrate g1(x) = 2x²/15 over the range [1, 4]. To find Var(X), we need to integrate
g3(x) = 2x³/15 over the range [1, 4] and subtract [E(X)]² = 7.84.
# E(X)
g1 = function(x) 2*x^2/15
integrate(g1, 1, 4)$value

## [1] 2.8

# Var(X)
g3 = function(x) 2*x^3/15
integrate(g3, 1, 4)$value - 7.84

## [1] 0.66
3.4 Properties of Expected Value and Variance
3.4.1 Properties of Expected Value
1. If Y is a DRV and g(Y) is a function of Y, then

E[g(Y)] = Σ_{y ∈ SY} g(y) p(y)
2. If Y is a DRV and g(Y) = c, where c is a constant, then

E[g(Y)] = E(c) = Σ_{y ∈ SY} c p(y) = c.

Note that the last equality holds because Σ_{y ∈ SY} p(y) = 1.
3. For any random variable Y (holds for continuous RVs as well), and constant c
E(cY ) = cE(Y ).
4. For any random variables Y1 and Y2 (holds for continuous RVs as well), and constants c1 and c2 ,
E(c1 Y1 + c2 Y2 ) = c1 E(Y1 ) + c2 E(Y2 ).
This result holds for more than two random variables. Note that for c1 = c2 = 1, we can write
E(Y1 + Y2 ) = E(Y1 ) + E(Y2 ).
5. If Y1 and Y2 are two independent random variables, then
E(Y1 Y2 ) = E(Y1 )E(Y2 )
Note that, if you can find the expected values of one or more random variables, these properties can be used
to find expected values under multiple scenarios.
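As a quick Monte Carlo illustration of property 5 (a minimal sketch; the binomial and exponential distributions are chosen only for illustration):

set.seed(42)
n = 100000
y1 = rbinom(n, size = 2, prob = 0.5)  # E(Y1) = 1
y2 = rexp(n, rate = 1)                # E(Y2) = 1, generated independently of y1
mean(y1 * y2)                         # approximately E(Y1)E(Y2) = 1
mean(y1) * mean(y2)                   # approximately 1 as well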
3.4.2 Properties of Variance
1. Var(c) = 0, where c is a constant
2. Var(X) = E(X²) − (E(X))²
3. Var(cX) = c²Var(X)
4. Var(cX + d) = c²Var(X)
Why do these properties make sense?
1.

Var(c) = E[(c − E(c))²] = E[(c − c)²] = 0

2.

Var(X) = E[(X − E(X))²]
       = E[X² − 2X E(X) + (E(X))²]
       = E(X²) − 2E(X)E(X) + (E(X))²
       = E(X²) − 2(E(X))² + (E(X))²
       = E(X²) − (E(X))²

Note that the second equality expands the square, the third equality holds because expectation E is a linear
operator, and the fourth equality holds since E(X) is a constant.
3.

Var(cX) = E[(cX − E(cX))²]
        = E[(cX − cE(X))²]
        = c² E[(X − E(X))²]
        = c² Var(X)

4.

Var(cX + d) = c² Var(X),

which is left as an exercise.
3.4.3 Chebyshev’s Inequality
Chebyshev’s inequality bounds the probability that a random variable lies a given number of standard
deviations away from its mean. For a random variable X with finite mean µ and variance σ²,

P(|X − µ| ≥ cσ) ≤ 1/c².
Here µ = E(X), σ² = E[(X − µ)²], and c is a positive constant.
For c = 2, the inequality says that, whatever the underlying distribution, no more than 25% of the probability
mass can lie two or more standard deviations from the mean.
For example, let X denote the number of claims an insurance company receives from its customers. X is a
RV, and without knowing anything about its underlying probability distribution except that the mean and
variance exist, Chebyshev’s inequality tells you that no more than 1/9 ≈ 11% of the claims can be three or
more standard deviations from the mean.
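A minimal simulation sketch (the exponential distribution is chosen only for illustration) comparing the actual tail probability with the Chebyshev bound for c = 3:

set.seed(7)
x = rexp(100000, rate = 1)  # Exp(1) has mean 1 and standard deviation 1
k = 3
mean(abs(x - 1) >= k * 1)   # empirical P(|X - mu| >= 3 sigma), roughly 0.018
1/k^2                       # Chebyshev bound, about 0.111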
3.4.4 Coefficient of Variation
To compare the variability of random variables with different probability distributions, it is useful to consider
the size of Var(X) relative to the size of E(X). The coefficient of variation, defined as the ratio of the standard
deviation to the mean,

CV(X) = √Var(X) / E(X),

is a scale-free measure that serves as a good way to compare variability across several random variables.
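For instance, for the random variable X of Example 8, the coefficient of variation is √0.66 / 2.8 ≈ 0.29.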
A Few Remarks on Expected Value and Variance
1. For a random variable X, the function g(θ) = E[(X − θ)²] is minimized at θ = E(X). You can use calculus,
in particular the optimization process, to prove this result.
2. Note that if you plug θ = E(X) into this function, g(E(X)) is the variance of X.
3. Note that if we are interested in an unknown constant θ, known as the parameter of interest (such as a
mean weight), then X − θ is the prediction error that we want to minimize.
4. E[(X − θ)²] is the expected squared error, which is minimized at θ = E(X). Later in the course we will
learn that when E(X) = θ, X is called an unbiased estimator of θ.
5. Another related function is g(θ) = E(|X − θ|), which is minimized when we replace θ by the median of
the distribution.
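A minimal numerical sketch of remark 1, approximating g(θ) = E[(X − θ)²] by a sample average over a grid of θ values (the exponential distribution is chosen only for illustration):

set.seed(3)
x = rexp(100000, rate = 0.5)  # E(X) = 2 for this choice
theta = seq(0, 4, by = 0.01)
g = sapply(theta, function(t) mean((x - t)^2))  # Monte Carlo estimate of E[(X - theta)^2]
theta[which.min(g)]                             # close to mean(x), i.e. close to E(X) = 2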
3.4.5 Covariance
Covariance is a measure of the degree to which two random variables X and Y vary together and is defined as
Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))]
If Cov(X, Y) is divided by the product of the standard deviations of X and Y, you get what is called the
Pearson correlation coefficient, denoted ρXY. That is,

ρXY = E[(X − E(X))(Y − E(Y))] / √(Var(X) Var(Y))
For example, if X and Y denote the heights and weights of adults, in general people taller than the average
are heavier and people shorter than the average are lighter. Thus, typically,

X − E(X) > 0 =⇒ Y − E(Y) > 0 =⇒ (X − E(X))(Y − E(Y)) > 0
X − E(X) < 0 =⇒ Y − E(Y) < 0 =⇒ (X − E(X))(Y − E(Y)) > 0
Note that if two variables are positively related, as in the above case, their covariance is positive, but if two
variables are negatively related, e.g., the price of a commodity and its supply in the market, their covariance
is negative.
Properties of covariance
1. Cov(X, Y) = E(XY) − E(X)E(Y)

2. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

3. Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y)

4. If X and Y are independent, then

Cov(X, Y) = E(XY) − E(X)E(Y) = E(X)E(Y) − E(X)E(Y) = 0.

Then

Var(X + Y) = Var(X) + Var(Y)
Properties (3) and (4) can be extended as follows:
1.

Var(a1X1 + . . . + akXk) = Σ_{i=1}^{k} ai² Var(Xi) + 2 Σ_{1 ≤ i < j ≤ k} ai aj Cov(Xi, Xj)

2. If the Xi's are independent,

Var(a1X1 + . . . + akXk) = Σ_{i=1}^{k} ai² Var(Xi)
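The second covariance property, Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y), can be checked numerically; the sample versions var() and cov() in R satisfy the same identity exactly (a minimal sketch with simulated data chosen only for illustration):

set.seed(11)
x = rnorm(1000)
y = 0.5*x + rnorm(1000)        # deliberately correlated with x
var(x + y)
var(x) + var(y) + 2*cov(x, y)  # equals var(x + y) exactly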
3.5 Other Properties of a RV : Skewness and Kurtosis
The expected value and the variance of a random variable X are the first moment and the second central
moment of the probability distribution of X; they give you measures of the central tendency and the dispersion
of the distribution. Higher order moments are needed to learn about the shape of the distribution. The
coefficients of skewness and kurtosis are two measures of the shape of the probability distribution of a random
variable X and are defined as the standardized third and fourth central moments respectively.
The coefficient of skewness, denoted β1, is defined as follows:

β1 = E[(X − E(X))³] / [Var(X)]^(3/2)

The numerator in the above definition is called the third central moment.
The normal distribution, which we will present in a few weeks, and other symmetric distributions with finite
third moment have skewness 0.
The coefficient of kurtosis, denoted β2, is defined as follows:

β2 = E[(X − E(X))⁴] / [Var(X)]²

The numerator in the above definition is called the fourth central moment.
The excess kurtosis is defined as kurtosis minus 3. Distributions with zero excess kurtosis are called mesokurtic.
A distribution with positive excess kurtosis is called leptokurtic. In terms of shape, a leptokurtic distribution
has fatter tails; an example is the Student’s t-distribution. A distribution with negative excess kurtosis is
called platykurtic. In terms of shape, a platykurtic distribution has thinner tails; an example is the uniform
distribution.
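A minimal R sketch computing sample versions of these standardized moments (the estimators are written out directly rather than taken from an add-on package, and the t and uniform distributions are chosen only to illustrate leptokurtic and platykurtic shapes):

set.seed(5)
std_moment = function(x, k) mean((x - mean(x))^k) / (mean((x - mean(x))^2))^(k/2)
x_t = rt(100000, df = 10)  # Student's t: heavier tails than the normal
x_u = runif(100000)        # uniform: lighter tails than the normal
std_moment(x_t, 3)         # sample skewness, close to 0 (symmetric)
std_moment(x_t, 4) - 3     # sample excess kurtosis, positive (theory: 6/(df - 4) = 1)
std_moment(x_u, 4) - 3     # sample excess kurtosis, negative (theory: -1.2)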
References
1. Norman Matloff. Probability and Statistics for Data Science, CRC Press/Taylor & Francis Group, 2020.
2. Wackerly, Dennis D., Mendenhall III, W., Scheaffer, Richard L. Mathematical Statistics with Applications, 7th Edition, Brooks/Cole, 2008.
3. https://en.wikipedia.org/wiki/Skewness
4. https://en.wikipedia.org/wiki/Kurtosis