1. Fundamentals
1.1 Probability Theory
1.1.1 Sample Space and Probability
1.1.2 Random Variables
1.1.3 Limit Theorems
1.1 Probability Theory
A statistical model (probability model) deals with experiments whose
outcomes are not precisely repeatable, even under supposedly identical
conditions.
- an experiment involving chance (Zufallsexperiment)
The formulation of a statistical model requires two ingredients:
- sample space (Stichprobenraum)
- probability (Wahrscheinlichkeit)
1.1 Probability Theory
Experiment of chance:
A repeatable operation under specified conditions, whose outcome is not predictable
- tossing a coin
- drawing a card from a complete deck
Elementary outcomes (Ergebnis eines Experiments):
An elementary outcome is a possible outcome of an experiment of chance
- in experiment ‘tossing a coin’ there are 2 elementary outcomes
- in experiment ‘drawing card’ there are 52 elementary outcomes
1.1 Probability Theory
Sample space:
The sample space of an experiment of chance, Ω, is the set of possible outcomes
of the experiments
- the sample space in the experiment ‘tossing a coin’ is {head, tail}
Events (Ereignisse):
An event is a list of the elementary outcomes. It can be considered as a subset
of the sample space of an experiment of chance. An event can be specified by
E = { x | x satisfies condition E}
- an event of the experiment ‘drawing card’: heart
- the sample space is the largest event of an experiment of chance
1.1 Probability Theory
Intersections and Unions are operations used to obtain new events
Union or Addition
Definition: a point is in the union of event E and event F if and only if it lies
either in E or in F (or possibly both):
E ∪ F = E + F = {E or F} = {ω | ω is in E or F}
Properties:
E + F = F + E (commutative)
[E+F]+G = E+[F+G] (associative)
Intersection or product
Definition: The intersection of two events is that event whose outcomes are
those lying in both of the events:
E ∩ F = EF = {E and F} = {ω | ω is in both E and F}
Properties:
EF = FE, [EF]G = E[FG]
E(F+G)=EF+EG
1.1 Probability Theory
More about operations:
Two events are said to be disjoint or mutually exclusive when their
intersection is empty:
EF = ∅
Difference E-F of two events is given by
E - F = {ω |ω is in E but not in F}
The complement of an event E, Ec, is defined to be
Ec = Ω - E
One has
E + Eᶜ = Ω, EEᶜ = ∅
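The event operations above map directly onto Python's set type; a minimal sketch (the sample space and events below are made-up examples, not from the text):

```python
omega = set(range(1, 11))          # sample space: {1, ..., 10}
E = {1, 2, 3, 4}
F = {3, 4, 5, 6}

union = E | F                      # E + F
intersection = E & F               # EF
difference = E - F                 # E - F
complement = omega - E             # E^c

print(union)                       # {1, 2, 3, 4, 5, 6}
print(intersection)                # {3, 4}
print((E | complement) == omega)   # True: E + E^c = Omega
print(E & complement)              # set(): E E^c is empty
```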
1.1 Probability Theory
Probability
The probability of an event E of a repeatable experiment is given by
P(E) = lim_{N→∞} N_E / N
where N_E is the number of trials in which E occurs and N is the total number of trials.
[Figure: experiment ‘tossing a coin’: the relative frequency of the event ‘head up’ as a function of the number of trials]
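The limiting relative frequency N_E / N can be illustrated with a short simulation; a hedged sketch assuming a fair coin modeled by random.random():

```python
import random

random.seed(0)

def relative_frequency(n_trials):
    # count 'head' (random draw below 0.5) and divide by the trial count
    heads = sum(1 for _ in range(n_trials) if random.random() < 0.5)
    return heads / n_trials

for n in (10, 100, 10_000):
    print(n, relative_frequency(n))   # approaches 0.5 as n grows
```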
1.1 Probability Theory
The probability assigned to an event, P(E), cannot be completely arbitrary and
has to satisfy the following axioms
Probability axioms
1. P (Ω) = 1
2. 0 ≤ P ( E ) ≤ 1, for every event E
3. P(∪i Ei) = P(E1) + P(E2) + ⋯, for every sequence of disjoint events E1, E2, …
Consequences
1. P ( E c ) = 1 − P ( E )
2. P( E1 ) ≤ P( E2 ) for events E1 ⊂ E2
1.1 Probability Theory
The discrete case
- the sample space has only a finite or countably infinite number of outcomes
- the probability for each individual outcome is a nonnegative number
- the probability of a given event is given by
P(E) = ∑_{ω∈E} P(ω)
The continuous case
- the sample space is uncountably infinite
- the probability of individual outcomes is zero
- it is necessary to assign probabilities to events rather than to individual points,
defining the probability of an event E, P(E), via a probability density f
P(E) = ∫_{ω∈E} f(ω) dω
where
1 = ∫_Ω f(ω) dω
1.1 Probability Theory
The addition law
a general rule for determining the probability of a union of two events in
terms of the probabilities of these events
P(E+F) = P(E) + P(F) - P(EF)
The addition law is a consequence of Axiom 3, which is a special addition rule
for disjoint events. Decomposing E+F into three disjoint parts
E+F = EF + EFᶜ + EᶜF,
one has according to Axiom 3
P(E+F) = P(EF) + P(EFᶜ) + P(EᶜF)
= [P(EF) + P(EFᶜ)] + [P(EF) + P(EᶜF)] − P(EF)
= P(E) + P(F) − P(EF)
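The addition law can be checked by direct enumeration; a sketch on a hypothetical fair-die example (the events E = ‘even’ and F = ‘at least 5’ are illustrative choices, not from the text):

```python
from fractions import Fraction

omega = range(1, 7)                      # one roll of a fair six-sided die
E = {w for w in omega if w % 2 == 0}     # {2, 4, 6}
F = {w for w in omega if w >= 5}         # {5, 6}

def P(event):
    return Fraction(len(event), 6)       # equally likely outcomes

lhs = P(E | F)
rhs = P(E) + P(F) - P(E & F)             # addition law
print(lhs, rhs)                          # 2/3 2/3
```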
1.1 Probability Theory
Conditional Probability (Die bedingte Wahrscheinlichkeit)
For an event E and an event F of positive probability, the
conditional probability of E given F, written P(E|F), is defined to be
P(E|F) = P(E)/P(F), when E ⊂ F,   or   P(E|F) = P(EF)/P(F), in general
[Figure: Venn diagrams in Ω, with F acting as the new sample space; the shaded overlap is EF = E ∩ F]
Example: the probability of rain, given that the temperature is larger than 20 °C
- P(F) cannot be zero
- P(E|F) is proportional to P(E)
- P(E|F) satisfies the probability axioms
1.1 Probability Theory
The multiplication law
P( EF ) = P( E | F ) P( F )
Independence
Events E and F are said to be independent if and only if
P(E|F) = P(E), so that P(EF) = P(E) P(F)
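Independence can likewise be verified by enumeration; a sketch using a made-up example of two fair dice, with E = ‘first die is even’ and F = ‘second die shows 6’:

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs
E = {w for w in omega if w[0] % 2 == 0}
F = {w for w in omega if w[1] == 6}

def P(event):
    return Fraction(len(event), len(omega))

print(P(E & F) == P(E) * P(F))   # True: E and F are independent
```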
1.1 Probability Theory
Bayes’ Theorem
Using the multiplicative rule to express the probability of an intersection as a
product
P( EF ) = P( E | F ) P( F ) = P( F | E ) P( E )
with neither P(E) nor P(F) being zero, yields
P(F|E) = P(E|F) P(F) / P(E)
Using
P(E) = P(EF) + P(EFᶜ) = P(E|F) P(F) + P(E|Fᶜ) P(Fᶜ)
Bayes’ theorem is obtained:
P(F|E) = P(E|F) P(F) / [P(E|F) P(F) + P(E|Fᶜ) P(Fᶜ)]
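Bayes’ theorem in this two-case form is a one-line computation; a sketch with made-up numbers (F = ‘rain’, E = ‘forecast says rain’; all probabilities below are hypothetical):

```python
P_F = 0.3                 # P(F): prior probability of rain (assumed)
P_E_given_F = 0.9         # P(E|F) (assumed)
P_E_given_Fc = 0.2        # P(E|F^c) (assumed)

# denominator: total probability P(E) = P(E|F)P(F) + P(E|F^c)P(F^c)
P_E = P_E_given_F * P_F + P_E_given_Fc * (1 - P_F)
P_F_given_E = P_E_given_F * P_F / P_E   # Bayes' theorem
print(round(P_F_given_E, 4))            # 0.6585
```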
1.1 Probability Theory
Probability Space
a probability space (Ω, F , P ) contains Ω as a sample space, F a collection of
subsets of Ω called events and P(E) the probability assigned to the event E
Random Variable and Vectors
A random variable (vector) is a measurable (vector-valued) function defined
on a probability space
- as a function on a probability space, a random variable maps outcomes in Ω
into values in the value space of the function, and conversely
- a random vector is a vector-valued function on a probability space whose
components are random variables
1.1 Probability Theory
Example I:
An individual tosses a coin with the outcomes being sequences of heads and
tails. The sample space contains all possible sequences. Some of the random
variables are
X(ω)=number of tosses required to equalize the number of heads and the
number of tails for the first time
Y(ω)=1 or 0, depending whether the third toss is heads or tails
For the sequence
H,H,T,H,T,T,T,H,T….
X(ω)=6, Y(ω)=0
and for the sequence
H,T,T,T,H,T,H,H,T…
X(ω)=2, Y(ω)=0
The probabilities of X(ω) = 2, 4, 6, …, and the probability of Y(ω)=0 can be worked
out given a fair coin
1.1 Probability Theory
Example II
The pressure measured in Hamburg is a random variable. In this case
Ω= F =ℜ
A possible value of the random variable is
X(ω)=1003 (hPa)
Relative to example I, it is now much harder to assess the probabilities of
X(ω)=…,1002,1003,…
1.1 Probability Theory
The Distribution Function
In general, it is cumbersome to use the sample space and the probability P
to describe the random characteristics of a random variable. One uses
instead the distribution function.
The distribution function describes how the probability is distributed by a
random variable in the space ℜ of its values. In particular, the cumulative
distribution function (c.d.f. or distribution function) gives the amount
assigned to the interval from −∞ up to λ:
F(λ) = P([X(ω) ≤ λ]) = P(X ≤ λ)
1.1 Probability Theory
Given a probability space (Ω, F , P ) and a random variable X(ω) defined on
it, the corresponding distribution function (c.d.f) is completely determined as
F ( x) = P Ω ( X (ω ) ≤ x)
This function has the following properties
1. F (−∞) = 0
2. F (∞) = 1
3. F ( x) is a non - decreasing function
4. F ( x) is continuous from the right at each x
1.1 Probability Theory
A c.d.f. serves as a basis for computing the probabilities of other events of
interest. For instance, the probability of (a, b] can be computed in terms of
the c.d.f.:
Since (−∞, b] = (−∞, a] + (a, b], one has F(b) = F(a) + P((a, b]), or
P (a < X ≤ b) = F (b) − F ( a )
1.1 Probability Theory
Fractile or Quantile (das Quantil) of the c.d.f
The range of a distribution function is [0,1], and this interval can be divided into
equal parts. If a c.d.f rises steadily from 0 to 1, with no jumps or intervals
of constancy, there is a unique number xp for each p on the interval [0,1]
such that
F ( x p ) = P( X ≤ x p ) = p
xp is called a fractile (quantile) of the distribution
Other notations of xp:
the median (der Median): p=0.5
the kth quartile (das k. Quartil): 4p=k
the kth decile (das k. Dezil): 10p=k
the kth percentile (das k. Perzentil): 100p=k
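When the c.d.f. has a closed-form inverse, the quantile xp solves F(xp) = p directly; a sketch using the exponential distribution F(x) = 1 − e^(−x) as an illustrative choice:

```python
import math

def quantile(p):
    # inverts F(x) = 1 - exp(-x): solves F(x_p) = p for x_p
    return -math.log(1 - p)

median = quantile(0.5)          # the p = 0.5 fractile
print(median, math.log(2))      # the median of this distribution is ln 2
```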
1.1 Probability Theory
Notice:
In the expression
F(λ) = P(X ≤ λ)
F refers to the whole function. Thus
F(λ) = P(X ≤ λ), F(x) = P(X ≤ x)
refer to the same function F(·). A different distribution function results from
using a different random variable:
F_X(λ) = P(X ≤ λ), F_Y(λ) = P(Y ≤ λ)
Throughout this course, upper case (e.g. X) refers to a random variable,
whereas the corresponding lower case (e.g. x) indicates a particular value
taken by the random variable X
1.1 Probability Theory
Discrete Random Variables
A random variable, whose distribution function jumps at values x1,x2,…, and
is constant between adjacent jump points is called discrete. One has
pi = P(X = xi),   p1 + p2 + ⋯ = 1
Example:
Three chips are drawn together at random from a bowl containing five
chips numbered 1, 2, 3, 4, and 5. There are 10 possible outcomes in this
experiment:
123, 124, 125, 134, 135, 145, 234, 235, 245, 345
Each having probability 1/10. Let X(ω) denote the sum of the numbers on
the chips in outcome ω. The values of X are
6, 7, 8, 8, 9, 10, 9, 10, 11, 12.
The corresponding probabilities for X=6, 7, 8, 9, 10, 11 and 12 are
0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1
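The chip example can be reproduced by enumerating all draws:

```python
from itertools import combinations
from collections import Counter

# draw 3 of 5 numbered chips; X(w) is the sum of the drawn numbers
outcomes = list(combinations(range(1, 6), 3))   # 10 equally likely draws
sums = Counter(sum(draw) for draw in outcomes)

probs = {x: n / len(outcomes) for x, n in sorted(sums.items())}
print(probs)   # {6: 0.1, 7: 0.1, 8: 0.2, 9: 0.2, 10: 0.2, 11: 0.1, 12: 0.1}
```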
1.1 Probability Theory
Continuous Random Variables
A random variable is said to be continuous, if P(.) is a continuous probability
measure on Ω.
The c.d.f. of a continuous random variable is differentiable almost
everywhere. The derivative of it is called the density function
f(λ) = F′(λ) = lim_{h→0} (1/h) [F(λ + h/2) − F(λ − h/2)]
1.1 Probability Theory
The Density Function
The probability model for a continuous distribution can be defined by
specifying a density function f, which has the properties
1. f(x) ≥ 0
2. ∫_{−∞}^{∞} f(x) dx = 1
The c.d.f. F(x) can be constructed using f(x) as follows
F(x) = ∫_{−∞}^{x} f(u) du
as can the probability of an interval
P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(λ) dλ
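The construction F(x) = ∫ f(u) du can be checked numerically; a sketch using the exponential density f(x) = e^(−x) on x ≥ 0 (an illustrative choice) and a simple trapezoidal rule:

```python
import math

def f(x):
    # exponential density, an example choice with known c.d.f.
    return math.exp(-x) if x >= 0 else 0.0

def F(x, steps=100_000):
    # trapezoidal approximation of the integral of f from 0 to x
    if x <= 0:
        return 0.0
    h = x / steps
    total = 0.5 * (f(0) + f(x)) + sum(f(i * h) for i in range(1, steps))
    return total * h

# closed form is F(x) = 1 - exp(-x); compare at x = 1
print(F(1.0), 1 - math.exp(-1))
```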
1.1 Probability Theory
Analogy between relations in the discrete and continuous case
Discrete                                  Continuous
f(xi) = P(X = xi)                         f(x) dx ≅ P(x < X ≤ x + dx)
F(x) = ∑_{xi ≤ x} f(xi)                   F(x) = ∫_{−∞}^{x} f(λ) dλ
P(E) = ∑_{xi ∈ E} f(xi)                   P(E) = ∫_E f(x) dx
∑_{all i} f(xi) = 1                       ∫_{−∞}^{∞} f(λ) dλ = 1
f(xi) = F(xi) − F(xi−1)                   f(x) = F′(x)
1.1 Probability Theory
Bivariate Distributions
A random vector (X(ω), Y(ω)) introduces a probability distribution in the
plane, the ‘value’ space ℜ² of the random vector. This distribution is
bivariate and given by
F ( x, y ) ≡ P ( X ≤ x and Y ≤ y )
= P Ω [ X (ω ) ≤ x and Y (ω ) ≤ y ]
(also referred to as the joint distribution function)
The bivariate distribution function satisfies
1. F ( x, ∞) and F (∞, y ) are univariate distribution
functions of x and y, respectively
2. F (−∞, y ) = F ( x,−∞) = 0
3. P( x < X ≤ x + h, y < Y ≤ y + k )
= F ( x + h, y + k ) − F ( x + h, y ) − F ( x, y + k ) + F ( x, y ) ≡ ∆2 F ≥ 0
for every rectangle with sides parallel to the axes
1.1 Probability Theory
A bivariate distribution is said to be of discrete type, if
f ( x, y ) = P ( X = x and Y = y )
A bivariate distribution is said to be of continuous type, if the distribution is
continuous and has a second-order, mixed partial derivative function
f(x, y) = ∂²F(x, y) / ∂x∂y
from which F can be recovered by
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du
The bivariate density function f satisfies
1. f(x, y) ≥ 0
2. ∫∫_{ℜ²} f(x, y) dx dy = 1
1.1 Probability Theory
Marginal Distributions (Randverteilungen)
The following distributions are called the marginal distributions of X and Y
F ( x, ∞) = P( X ≤ x and Y ≤ ∞) = P( X ≤ x),
F (∞, y ) = P( X ≤ ∞ and Y ≤ y ) = P(Y ≤ y )
In the discrete case, with pij=P(X=xi,Y=yj), the probability that X=xi is
obtained by summing over yj’s for fixed xi
P(X = xi) = ∑_j P(X = xi, Y = yj) = ∑_j pij
In the continuous case, the marginal density function is obtained by
differentiating the marginal distribution function
F_X(x) = F_{X,Y}(x, ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f_{X,Y}(u, y) dy du
to obtain
f_X(x) = F_X′(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
1.1 Probability Theory
Conditional Distributions
The new distribution in the value space of a random variable X, given event
E with a positive probability, is called a conditional distribution and is
defined by
F_{X|E}(x) = P(X ≤ x | E) = P(X ≤ x and E) / P(E)
A conditional distribution may be continuous, in which case its density is
f_{X|E}(x) = (d/dx) F_{X|E}(x)
Or it may be discrete, characterized by
f_{X|E}(x) = P(X = x | E) = P(X = x and E) / P(E)
1.1 Probability Theory
The most commonly used conditional distributions have to do with two
variables X and Y comprising a bivariate random vector and the condition is
then a condition on the value of the other variable, e.g.
F_{X|Y=y}(x) = P(X ≤ x | Y = y)
If Y is continuous, the distribution is defined through the conditional density
f(x|y) ≡ f_{X|Y=y}(x) ≡ f(x, y) / f_Y(y)
F(x|y) ≡ F_{X|Y=y}(x) = ∫_{−∞}^{x} f(u|y) du
X and Y are independent random variables, when
FX ,Y ( x, y ) = FX ( x) FY ( y )
f_{X,Y}(x, y) = f_X(x) f_Y(y) → f(x|y) = f_X(x)
1.1 Probability Theory
Expectation
The expectation of a random variable X, µ, is defined by
E(X) = ∑_i xi f(xi)   or   E(X) = ∫_{−∞}^{∞} x f(x) dx
or in general by
E(X) = ∫_{−∞}^{∞} x dF(x)
It holds
E ( X + Y ) = E ( X ) + E (Y )
E(aX + b) = aE(X) + b
E ( XY ) = E ( X ) E (Y ) when X is independent of Y
1.1 Probability Theory
Example:
Consider the experiment of three independent tosses of a fair coin. Let X(ω)
denote the number of heads in the sequence ω (e.g. HHT → 2). The sample
points are
HHH, HHT, HTH, THH, TTH, THT, HTT, TTT
and the value of X are
3, 2, 2, 2, 1, 1, 1, 0
The probabilities of values of X are
P(3)=P(0)=1/8, P(1)=P(2)=3/8
E(X) = 0·(1/8) + 1·(3/8) + 2·(3/8) + 3·(1/8) = 3/2
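The same expectation follows from enumerating all eight equally likely sequences:

```python
from itertools import product
from fractions import Fraction

# all 2^3 = 8 sequences of three coin tosses, each with probability 1/8
sequences = list(product("HT", repeat=3))

# E(X) = sum over outcomes of X(w) * P(w), with X(w) = number of heads
E_X = sum(Fraction(seq.count("H"), len(sequences)) for seq in sequences)
print(E_X)   # 3/2
```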
1.1 Probability Theory
Moments
The k-th moment of a random variable X is defined by
E(X^k) = ∑_i xi^k f(xi)   or   E(X^k) = ∫_{−∞}^{∞} x^k f_X(x) dx
and the k-th central moment by
E((X − E(X))^k) = ∑_i (xi − µ)^k f(xi)   or   E((X − E(X))^k) = ∫_{−∞}^{∞} (x − µ)^k f_X(x) dx
1.1 Probability Theory
Variance (2nd central moment): the most frequently used higher-order moment
Var ( X ) = σ 2 = E (( X − µ ) 2 ) = E ( X 2 ) − ( E ( X )) 2
Variance of sum
Consider variance of the sum of random variables X and Y, Z=X+Y,
Var ( Z ) = E (( Z − E ( Z )) 2 ) = Var ( X ) + Var (Y ) + 2 cov( X , Y )
Variance is not additive!
It holds in general
Var ( aX + bY ) = a 2Var ( X ) + b 2Var (Y ) + 2ab cov( X , Y )
and in case X and Y are independent
Var ( X − Y ) = Var ( X ) + Var (Y ), Var ( X + Y ) = Var ( X ) + Var (Y )
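The variance-of-a-sum identity holds exactly for the corresponding sample quantities as well; a sketch on simulated, deliberately correlated data (the construction of Y below is a made-up example):

```python
import random

random.seed(1)
n = 10_000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [0.5 * x + random.gauss(0, 1) for x in X]   # correlated with X
Z = [x + y for x, y in zip(X, Y)]               # Z = X + Y

def var(v):
    # sample variance with 1/n normalization
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(u, v):
    # sample covariance with 1/n normalization
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Var(Z) = Var(X) + Var(Y) + 2 cov(X, Y), identical up to rounding
print(var(Z), var(X) + var(Y) + 2 * cov(X, Y))
```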
1.1 Probability Theory
Covariance and Correlation (2nd moments between two random variables)
The covariance between two random variables X and Y is defined by
cov( X , Y ) = E (( X − µ X )(Y − µY )) = E ( XY ) − ( EX )( EY )
and the correlation by
ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)
One has for independent random variables X and Y
cov( X , Y ) = 0
1.1 Probability Theory
Moments describe various properties of a distribution
- The mean (1st moment): the center of gravity
- The variance (2nd central moment): the spread
- The skewness (a scaled version of the 3rd central moment): symmetry
skewness = γ1 = ∫_ℜ ((x − µ)/σ)³ f_X(x) dx
γ1 = 0: symmetric
γ1 < 0: negatively skewed or skewed to the left*
γ1 > 0: positively skewed or skewed to the right
- The kurtosis (a scaled and shifted version of the 4th central moment): peakedness
kurtosis = γ2 = ∫_ℜ ((x − µ)/σ)⁴ f_X(x) dx − 3
γ2 < 0: platykurtic (less peaked than the normal distribution)
γ2 > 0: leptokurtic (more peaked than the normal distribution)
* skewed left: the left tail is heavier than the right tail
[Figure: sketches of negatively and positively skewed densities]
1.1 Probability Theory
The Central Limit Theorem
If Xk, k = 1, 2, …, is an infinite sequence of independent and identically
distributed random variables with E(Xk) = µ and Var(Xk) = σ², then the average
(1/n) ∑_{k=1}^{n} Xk
is asymptotically normally distributed. That is,
lim_{n→∞} [(1/n) ∑_{k=1}^{n} Xk − µ] / (σ/√n) ~ N(0, 1)
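A quick simulation sketch of the theorem, using Uniform(0, 1) summands as an illustrative choice; only a crude symmetry check (the fraction of standardized means below zero) is performed:

```python
import math
import random

random.seed(2)
mu, sigma = 0.5, math.sqrt(1 / 12)     # mean and std of Uniform(0, 1)

def standardized_mean(n):
    # ((1/n) sum X_k - mu) / (sigma / sqrt(n)), as in the theorem
    xs = [random.random() for _ in range(n)]
    return (sum(xs) / n - mu) / (sigma / math.sqrt(n))

samples = [standardized_mean(50) for _ in range(5_000)]
frac_below_zero = sum(s < 0 for s in samples) / len(samples)
print(frac_below_zero)   # close to 0.5 for a standard normal
```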
1.1 Probability Theory
Empirical distribution functions of the amount of precipitation, summed over a day, a
week, a month or a year, at West Glacier, Montana, USA. The amounts have been
normalized by the respective means, and are plotted on a probability scale so that a
normal distribution appears as a straight line.