1. Fundamentals
1.1 Probability Theory
  1.1.1 Sample Space and Probability
  1.1.2 Random Variables
  1.1.3 Limit Theorems

1.1 Probability Theory

A statistical model (probability model) deals with experiments whose outcomes are not precisely repeatable, even under supposedly identical conditions: experiments involving chance (Zufallsexperiment).

The formulation of a statistical model requires two ingredients:
- a sample space (Stichprobenraum)
- a probability (Wahrscheinlichkeit)

Experiment of chance: a repeatable operation under specified conditions whose outcome is not predictable, e.g.
- tossing a coin
- drawing a card from a complete deck

Elementary outcomes (Ergebnis eines Experiments): an elementary outcome is a possible outcome of an experiment of chance.
- in the experiment 'tossing a coin' there are 2 elementary outcomes
- in the experiment 'drawing a card' there are 52 elementary outcomes

Sample space: the sample space of an experiment of chance, Ω, is the set of all possible outcomes of the experiment.
- the sample space of the experiment 'tossing a coin' is {head, tail}

Events (Ereignisse): an event is a collection of elementary outcomes. It can be considered a subset of the sample space of an experiment of chance.
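As an illustration, these notions can be expressed with Python sets; the encoding of the card-drawing experiment below is a hypothetical sketch, not part of the text:

```python
from itertools import product

# Sample space of 'drawing a card': 52 elementary outcomes
# (illustrative encoding of the deck, assumed for this sketch)
suits = ["hearts", "diamonds", "clubs", "spades"]
values = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
omega = set(product(suits, values))

# An event is a subset of the sample space, e.g. 'the card is a heart'
heart = {card for card in omega if card[0] == "hearts"}

print(len(omega))               # 52 elementary outcomes
print(heart <= omega)           # every event is a subset of Omega -> True
print(len(heart) / len(omega))  # 13/52 = 0.25 under equal likelihood
```

The same set operations (union, intersection, complement) used on events later in this section are directly available on Python sets via `|`, `&`, and `omega - heart`.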
An event can be specified as E = {x | x satisfies condition E}.
- an event of the experiment 'drawing a card': heart
- the sample space is the largest event of an experiment of chance

Intersections and unions are operations used to obtain new events.

Union (addition): a point lies in the union of events E and F if and only if it lies in E or in F (or possibly in both):
E ∪ F = E + F = {E or F} = {ω | ω is in E or in F}
Properties: E + F = F + E (commutative); [E + F] + G = E + [F + G] (associative)

Intersection (product): the intersection of two events is the event whose outcomes lie in both events:
E ∩ F = EF = {E and F} = {ω | ω is in both E and F}
Properties: EF = FE; [EF]G = E[FG]; E(F + G) = EF + EG

More about operations:
Two events are said to be disjoint or mutually exclusive when their intersection is empty: EF = ∅.
The difference E − F of two events is given by E − F = {ω | ω is in E but not in F}.
The complement of an event E, E^c, is defined as E^c = Ω − E. One has E + E^c = Ω and EE^c = ∅.

Probability
The probability of an event E of a repeatable experiment is given by
P(E) = lim_{N→∞} N_E / N,
where N_E is the number of trials in which E occurs and N is the total number of trials.

[Figure: experiment 'tossing a coin': the relative frequency of the event 'head up' as a function of the number of trials.]

The probability assigned to an event, P(E), cannot be completely arbitrary; it has to satisfy the following axioms.

Probability axioms:
1. P(Ω) = 1
2. 0 ≤ P(E) ≤ 1 for every event E
3. P(∪ E_i) = P(E_1) + P(E_2) + … for every sequence of disjoint events E_1, E_2, …

Consequences:
1. P(E^c) = 1 − P(E)
2.
P(E_1) ≤ P(E_2) for events E_1 ⊂ E_2

The discrete case:
- the sample space has only a finite or countably infinite number of outcomes
- the probability of each individual outcome is a nonnegative number
- the probability of a given event is P(E) = Σ_{ω∈E} P(ω)

The continuous case:
- the sample space is uncountably infinite
- the probability of any individual outcome is zero
- it is necessary to assign probabilities to events rather than to individual points; the probability of an event E, P(E), is defined via a probability density f:
P(E) = ∫_{ω∈E} f(ω) dω, where ∫_Ω f(ω) dω = 1

The addition law: a general rule for determining the probability of a union of two events in terms of the probabilities of these events:
P(E + F) = P(E) + P(F) − P(EF)
The addition law is a consequence of Axiom 3, which is the special addition rule for disjoint events. Decomposing E + F into three disjoint parts, E + F = EF + EF^c + E^cF, Axiom 3 gives
P(E + F) = P(EF) + P(EF^c) + P(E^cF)
         = [P(EF) + P(EF^c)] + [P(EF) + P(E^cF)] − P(EF)
         = P(E) + P(F) − P(EF)

Conditional probability (die bedingte Wahrscheinlichkeit): for any event E and an event F of positive probability, the conditional probability of E given F, written P(E|F), is defined as
P(E|F) = P(E)/P(F) when E ⊂ F, or P(E|F) = P(EF)/P(F) in general.
F acts as the new sample space, and E is replaced by EF = E ∩ F.
Example: the probability of rain, given that the temperature is above 20 °C.
- P(F) cannot be zero
- P(E|F) is proportional to P(E)
- P(E|F) satisfies the probability axioms

The multiplication law:
P(EF) = P(E|F) P(F)

Independence: events E and F are said to be independent if and only if P(E|F) = P(E), so that P(EF) = P(E) P(F).

Bayes' theorem: using the multiplication law to express the probability of an intersection as a product,
P(EF) = P(E|F) P(F) = P(F|E) P(E),
with neither
P(E) nor P(F) equal to zero, this yields
P(F|E) = P(E|F) P(F) / P(E)
Using
P(E) = P(EF) + P(EF^c) = P(E|F) P(F) + P(E|F^c) P(F^c),
Bayes' theorem is obtained:
P(F|E) = P(E|F) P(F) / [P(E|F) P(F) + P(E|F^c) P(F^c)]

Probability space: a probability space (Ω, F, P) consists of a sample space Ω, a collection F of subsets of Ω called events, and the probability P(E) assigned to each event E.

Random variables and vectors: a random variable (vector) is a measurable (vector-valued) function defined on a probability space.
- as a function on a probability space, a random variable maps the events of Ω into sets of values in the value space of the function, and conversely
- a random vector is a vector-valued function on a probability space whose components are random variables

Example I: an individual tosses a coin, the outcomes being sequences of heads and tails. The sample space contains all possible sequences. Two possible random variables are
X(ω) = the number of tosses required to equalize the number of heads and the number of tails for the first time
Y(ω) = 1 or 0, depending on whether the third toss is heads or tails
For the sequence H,H,T,H,T,T,T,H,T,… one has X(ω) = 6 and Y(ω) = 0, and for the sequence H,T,T,T,H,T,H,H,T,… one has X(ω) = 8 and Y(ω) = 0. The probabilities of X(ω) = …, 6, 7, 8, …, and the probability of Y(ω) = 0, can be worked out for a fair coin.

Example II: the pressure measured in Hamburg is a random variable. In this case Ω = ℜ. A possible value of the random variable is X(ω) = 1003 (hPa). Relative to Example I, it is now much harder to assess the probabilities of X(ω) = …, 1002, 1003, ….

The distribution function: in general, it is cumbersome to use the sample space and the probability P to describe the random characteristics of a random variable. One uses instead the distribution function.
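Before moving on, Bayes' theorem can be checked numerically. The sketch below uses the rain example from above with hypothetical probabilities (the numbers are assumptions, not from the text):

```python
# Numerical check of Bayes' theorem, P(F|E) = P(E|F) P(F) / P(E),
# with hypothetical values: F = 'rain', E = 'temperature above 20 C'.
p_F = 0.3           # P(F): probability of rain (assumed)
p_E_given_F = 0.2   # P(E|F): warm day given rain (assumed)
p_E_given_Fc = 0.6  # P(E|F^c): warm day given no rain (assumed)

# Total probability: P(E) = P(E|F) P(F) + P(E|F^c) P(F^c)
p_E = p_E_given_F * p_F + p_E_given_Fc * (1 - p_F)

# Bayes' theorem
p_F_given_E = p_E_given_F * p_F / p_E
print(p_E, p_F_given_E)  # 0.48 0.125
```

Conditioning on the warm temperature lowers the probability of rain from 0.3 to 0.125 here, because under the assumed values a warm day is more likely without rain than with it.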
The distribution function describes how the probability is distributed by a random variable over the space ℜ of its values. In particular, the cumulative distribution function (c.d.f., or simply distribution function) gives the amount of probability assigned to the interval from −∞ up to λ:
F(λ) = P({ω | X(ω) ≤ λ}) = P(X ≤ λ)

Given a probability space (Ω, F, P) and a random variable X(ω) defined on it, the corresponding distribution function (c.d.f.) is completely determined as
F(x) = P_Ω(X(ω) ≤ x)
This function has the following properties:
1. F(−∞) = 0
2. F(∞) = 1
3. F(x) is a non-decreasing function
4. F(x) is continuous from the right at each x

A c.d.f. serves as a basis for computing the probability of other events of interest. For instance, the probability of (a, b] can be computed in terms of the c.d.f.: since (−∞, b] = (−∞, a] + (a, b], one has F(b) = F(a) + P((a, b]), or
P(a < X ≤ b) = F(b) − F(a)

Fractile or quantile (das Quantil) of the c.d.f.: the range of a distribution function is [0, 1], and this interval can be divided into equal parts. If a c.d.f. rises steadily from 0 to 1, with no jumps or intervals of constancy, there is for each p in the interval [0, 1] a unique number x_p such that
F(x_p) = P(X ≤ x_p) = p
x_p is called a fractile (quantile) of the distribution. Special cases of x_p:
- the median (der Median): p = 0.5
- the kth quartile (das k. Quartil): 4p = k
- the kth decile (das k. Dezil): 10p = k
- the kth percentile (das k. Perzentil): 100p = k

Notice: in the expression F(λ), F refers to the whole function. Thus F(λ) = P(X ≤ λ) and F(x) = P(X ≤ x) refer to the same function F(·). A different random variable gives a different distribution function:
F_X(λ) = P(X ≤ λ), F_Y(λ) = P(Y ≤ λ)
Throughout this course, upper case (e.g. X) refers to a random variable, whereas the corresponding lower case (e.g.
x) indicates a particular value taken by the random variable X.

Discrete random variables: a random variable whose distribution function jumps at values x_1, x_2, … and is constant between adjacent jump points is called discrete. One has
p_i = P(X = x_i), p_1 + p_2 + … = 1
Example: three chips are drawn together at random from a bowl containing five chips numbered 1, 2, 3, 4 and 5. There are 10 possible outcomes in this experiment:
123, 124, 125, 134, 135, 145, 234, 235, 245, 345
each having probability 1/10. Let X(ω) denote the sum of the numbers on the chips in outcome ω. The values of X are 6, 7, 8, 8, 9, 10, 9, 10, 11, 12. The corresponding probabilities for X = 6, 7, 8, 9, 10, 11 and 12 are 0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1.

Continuous random variables: a random variable is said to be continuous if P(·) is a continuous probability measure on Ω. The c.d.f. of a continuous random variable is differentiable almost everywhere; its derivative is called the density function:
f(λ) = F′(λ) = lim_{h→0} (1/h) [F(λ + h/2) − F(λ − h/2)]

The density function: the probability model for a continuous distribution can be defined by specifying a density function f, which has the properties
1. f(x) ≥ 0
2. ∫_{−∞}^{∞} f(x) dx = 1
The c.d.f. F(x) can be constructed from f(x) as
F(x) = ∫_{−∞}^{x} f(u) du
and so can the probability
P(a < X < b) = F(b) − F(a) = ∫_{a}^{b} f(λ) dλ

Analogy between relations in the discrete and continuous case:

  Discrete                          Continuous
  f(x_i) = P(X = x_i)               f(x) dx ≅ P(x < X < x + dx)
  F(x) = Σ_{x_i ≤ x} f(x_i)         F(x) = ∫_{−∞}^{x} f(λ) dλ
  P(E) = Σ_{x_i ∈ E} f(x_i)         P(E) = ∫_E f(x) dx
  Σ_{all i} f(x_i) = 1              ∫_{−∞}^{∞} f(λ) dλ = 1
  f(x_i) = F(x_i) − F(x_{i−1})      f(x) = F′(x)

Bivariate distributions: a random vector (X(ω), Y(ω)) introduces a probability distribution in the plane, the 'value' space ℜ² of the random vector.
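The chip-drawing example above can be reproduced by direct enumeration, a minimal sketch using Python's standard library:

```python
from itertools import combinations
from collections import Counter

# All 10 equally likely ways to draw three of the five chips numbered 1..5
outcomes = list(combinations(range(1, 6), 3))

# X(omega) = sum of the numbers on the drawn chips
counts = Counter(sum(draw) for draw in outcomes)
pmf = {x: n / len(outcomes) for x, n in sorted(counts.items())}

print(len(outcomes))  # 10
print(pmf)            # {6: 0.1, 7: 0.1, 8: 0.2, 9: 0.2, 10: 0.2, 11: 0.1, 12: 0.1}
```

The enumeration confirms the probabilities stated in the text: the sums 8, 9 and 10 each arise from two outcomes, all other sums from one.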
This distribution is bivariate and given by
F(x, y) ≡ P(X ≤ x and Y ≤ y) = P_Ω[X(ω) ≤ x and Y(ω) ≤ y]
(also referred to as the joint distribution function). The bivariate distribution function satisfies
1. F(x, ∞) and F(∞, y) are univariate distribution functions of x and y, respectively
2. F(−∞, y) = F(x, −∞) = 0
3. P(x < X ≤ x + h, y < Y ≤ y + k) = F(x + h, y + k) − F(x + h, y) − F(x, y + k) + F(x, y) ≡ Δ²F ≥ 0 for every rectangle with sides parallel to the axes

A bivariate distribution is said to be of discrete type if
f(x, y) = P(X = x and Y = y)
A bivariate distribution is said to be of continuous type if the distribution is continuous and has a second-order mixed partial derivative
f(x, y) = ∂²F(x, y) / ∂x∂y
from which F can be recovered by
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du
The bivariate density function f satisfies
1. f(x, y) ≥ 0
2. ∫∫_{ℜ²} f(x, y) dx dy = 1

Marginal distributions (Randverteilungen): the following distributions are called the marginal distributions of X and Y:
F(x, ∞) = P(X ≤ x and Y ≤ ∞) = P(X ≤ x)
F(∞, y) = P(X ≤ ∞ and Y ≤ y) = P(Y ≤ y)
In the discrete case, with p_ij = P(X = x_i, Y = y_j), the probability that X = x_i is obtained by summing over the y_j for fixed x_i:
P(X = x_i) = Σ_j P(X = x_i, Y = y_j) = Σ_j p_ij
In the continuous case, the marginal density function is obtained by differentiating the marginal distribution function
F_X(x) = F_{X,Y}(x, ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f_{X,Y}(u, y) dy du
to obtain
f_X(x) = F′_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy

Conditional distributions: the new distribution in the value space of a random variable X, given an event E with positive probability, is called a conditional distribution and is defined by
F_{X|E}(x) = P(X ≤ x | E) = P(X ≤ x and E) / P(E)
A conditional distribution may be continuous, in which case its density is
f_{X|E}(x) = d F_{X|E}(x) / dx,
or it may be discrete,
characterized by
f_{X|E}(x) = P(X = x | E) = P(X = x and E) / P(E)

The most commonly used conditional distributions involve two variables X and Y comprising a bivariate random vector; the condition is then a condition on the value of the other variable, e.g.
F_{X|Y=y}(x) = P(X ≤ x | Y = y)
If Y is continuous, the distribution is defined through the conditional density
f(x|y) ≡ f_{X|Y=y}(x) ≡ f(x, y) / f_Y(y),
F(x|y) ≡ F_{X|Y=y}(x) = ∫_{−∞}^{x} f(u|y) du
X and Y are independent random variables when
F_{X,Y}(x, y) = F_X(x) F_Y(y), f_{X,Y}(x, y) = f_X(x) f_Y(y) → f(x|y) = f_X(x)

Expectation: the expectation of a random variable X, µ, is defined by
E(X) = Σ_i x_i f(x_i) or E(X) = ∫_{−∞}^{∞} x f(x) dx,
or in general by
E(X) = ∫_{−∞}^{∞} x dF(x)
It holds that
E(X + Y) = E(X) + E(Y)
E(aX + b) = aE(X) + b
E(XY) = E(X) E(Y) when X is independent of Y

Example: consider the experiment of three independent tosses of a fair coin. Let X(ω) denote the number of heads in the sequence ω (e.g. X(HHT) = 2). The sample points are
HHH, HHT, HTH, THH, TTH, THT, HTT, TTT
and the corresponding values of X are 3, 2, 2, 2, 1, 1, 1, 0. The probabilities of the values of X are P(3) = P(0) = 1/8 and P(1) = P(2) = 3/8, so
E(X) = 0·(1/8) + 1·(3/8) + 2·(3/8) + 3·(1/8) = 3/2

Moments: the k-th moment of a random variable X is defined by
E(X^k) = Σ_i x_i^k f(x_i) or E(X^k) = ∫_{−∞}^{∞} x^k f_X(x) dx
and the k-th central moment by
E((X − E(X))^k) = Σ_i (x_i − µ)^k f(x_i) or E((X − E(X))^k) = ∫_{−∞}^{∞} (x − µ)^k f_X(x) dx

Variance (2nd central moment): the most frequently used higher-order moment,
Var(X) = σ² = E((X − µ)²) = E(X²) − (E(X))²

Variance of a sum: consider the variance of the sum of random variables X and Y, Z = X + Y:
Var(Z) = E((Z − E(Z))²) = Var(X) + Var(Y) + 2 cov(X, Y)
Variance is not additive!
It holds in general that
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab cov(X, Y),
and in case X and Y are independent,
Var(X − Y) = Var(X) + Var(Y), Var(X + Y) = Var(X) + Var(Y)

Covariance and correlation (2nd moment between two random variables): the covariance between two random variables X and Y is defined by
cov(X, Y) = E((X − µ_X)(Y − µ_Y)) = E(XY) − (EX)(EY)
and the correlation by
ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)
For independent random variables X and Y one has cov(X, Y) = 0.

Moments describe various properties of a distribution:
- the mean (1st moment): the center of gravity
- the variance (2nd central moment): the spread
- the skewness (a scaled version of the 3rd central moment): the symmetry
  skewness = γ_1 = ∫_ℜ ((x − µ)/σ)³ f_X(x) dx
  γ_1 = 0: symmetric
  γ_1 < 0: negatively skewed, or skewed to the left (the left tail is heavier than the right tail)
  γ_1 > 0: positively skewed, or skewed to the right
- the kurtosis (a scaled and shifted version of the 4th central moment): the peakedness
  kurtosis = γ_2 = ∫_ℜ ((x − µ)/σ)⁴ f_X(x) dx − 3
  γ_2 < 0: platykurtic (less peaked than the normal distribution)
  γ_2 > 0: leptokurtic (more peaked than the normal distribution)

The Central Limit Theorem: if X_k, k = 1, 2, …, is an infinite series of independent and identically distributed random variables with E(X_k) = µ and Var(X_k) = σ², then the average
(1/n) Σ_{k=1}^{n} X_k
is asymptotically normally distributed. That is,
lim_{n→∞} [ (1/n) Σ_{k=1}^{n} X_k − µ ] / (σ/√n) ~ N(0, 1)

[Figure: empirical distribution functions of the amount of precipitation, summed over a day, a week, a month or a year, at West Glacier, Montana, USA. The amounts have been normalized by the respective means and are plotted on a probability scale, so that a normal distribution appears as a straight line.]
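The Central Limit Theorem can be illustrated by simulation. The following is a minimal sketch, assuming NumPy is available; the choice of uniform X_k, sample size and seed are arbitrary:

```python
import numpy as np

# Monte Carlo illustration of the CLT: X_k ~ Uniform(0, 1),
# so mu = 1/2 and sigma^2 = 1/12 (standard results for the uniform law).
rng = np.random.default_rng(0)
mu, sigma = 0.5, (1.0 / 12.0) ** 0.5
n, reps = 1000, 20000

x = rng.random((reps, n))                      # reps independent samples of size n
z = (x.mean(axis=1) - mu) / (sigma / n**0.5)   # standardized averages

# For large n the standardized average should be approximately N(0, 1):
# sample mean near 0 and sample standard deviation near 1.
print(round(float(z.mean()), 2), round(float(z.std()), 2))
```

Plotting a histogram of `z` against the standard normal density, or its empirical c.d.f. on a probability scale as in the West Glacier figure, makes the convergence visible.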