Random Variables
EE 278 Lecture Notes # 3, Winter 2010–2011
EE 278: Introduction to Statistical Signal Processing — © R.M. Gray 2011

Probability space (Ω, F, P); random variables, vectors, and processes.

A (real-valued) random variable is a real-valued function defined on Ω, subject to a technical condition (to be stated). Common to use upper-case letters: a random variable X is a function X : Ω → R; also Y, Z, U, V, Θ, ...

Also common: a random variable may take on values only in some subset Ω_X ⊂ R, sometimes called the alphabet of X (A_X and 𝒳 are also common notations).

Intuition: the randomness is in the experiment, which produces an outcome ω according to the probability measure P ⇒ the random variable's outcome is X(ω) ∈ Ω_X ⊂ R.

Examples

Consider (Ω, F, P) with Ω = R and P determined by the uniform pdf on [0, 1).

Coin flip from earlier: X : R → {0, 1} defined by

  X(r) = 0 if r ≤ 0.5,  1 otherwise.

Observe X; do not observe the outcome of the fair spin.

Lots of other possible random variables, e.g., W(r) = r², Z(r) = e^r, V(r) = r, L(r) = −r ln r (requires r ≥ 0), Y(r) = cos(2πr), etc.

Can think of random variables as observations or measurements made on an underlying experiment.

Functions of random variables

Suppose that X is a random variable defined on (Ω, F, P) and that g : Ω_X → R is another real-valued function. Then the function g(X) : Ω → R defined by g(X)(ω) = g(X(ω)) is also a real-valued mapping of Ω, i.e., a real-valued function of a random variable is a random variable.

Can express the previous examples as W = V², Z = e^V, L = −V ln V, Y = cos(2πV). Similarly, 1/W, sinh(Y), L³ are all random variables.

Random vectors and random processes

A finite collection of random variables (defined on a common probability space (Ω, F, P)) is a random vector, e.g., (X, Y), (X_0, X_1, ..., X_{k−1}).

An infinite collection of random variables (defined on a common probability space) is a random process, e.g., {X_n; n = 0, 1, 2, ...}, {X(t); t ∈ (−∞, ∞)}.

So the theory of random vectors and random processes mostly boils down to the theory of random variables.

Derived distributions

In general: an "input" probability space (Ω, F, P) plus a random variable X ⇒ an "output" probability space, say (Ω_X, B(Ω_X), P_X), where Ω_X ⊂ R and P_X is the distribution of X:

  P_X(F) = Pr(X ∈ F).

Typically P_X is described by a pmf p_X or a pdf f_X. For the binary quantizer special case we derived P_X. The idea generalizes and forces a technical condition on the definition of a random variable (and hence also on random vectors and random processes).
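The following is a small simulation sketch, not part of the original notes, assuming numpy is available; the seed and sample size are arbitrary. It illustrates the point above: the randomness lives in the experiment (the fair spinner on [0, 1)), each random variable is just a function of the outcome, and P_X(F) = Pr(X ∈ F) can be estimated by the fraction of simulated outcomes whose image lands in F.

```python
# Minimal Monte Carlo sketch (not in the original notes): map simulated spinner
# outcomes through the functions defined above and estimate output probabilities.
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(0.0, 1.0, size=100_000)    # outcomes of the underlying experiment

X = np.where(omega <= 0.5, 0, 1)               # binary quantizer
W = omega ** 2                                 # W = V^2
Y = np.cos(2 * np.pi * omega)                  # Y = cos(2*pi*V)

print("estimate of P_X({1})      :", X.mean())            # should be near 0.5
print("estimate of P_Y((-inf,0]) :", (Y <= 0).mean())      # should be near 0.5
```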
Inverse image formula

Given (Ω, B(Ω), P) and a random variable X, find P_X.

Basic method: P_X(F) = the probability, computed using P, of all the original sample points that are mapped by X into the subset F:

  P_X(F) = P({ω : X(ω) ∈ F}).

A shorthand way to write this formula uses the inverse image of an event F ∈ B(Ω_X) under the mapping X : Ω → Ω_X, X^{-1}(F) = {ω : X(ω) ∈ F}:

Inverse image formula:  P_X(F) = P(X^{-1}(F)).

Written informally as P_X(F) = Pr(X ∈ F) = P{X ∈ F} = "probability that the random variable X assumes a value in F."

The inverse image formula is fundamental to probability, random processes, and signal processing. It shows how to compute probabilities of output events in terms of the input probability space.

Does the definition make sense? I.e., is P_X(F) = P(X^{-1}(F)) well-defined for all output events F? Yes, if we include a requirement in the definition of a random variable —

Careful definition of a random variable

Given a probability space (Ω, F, P), a (real-valued) random variable X is a function X : Ω → Ω_X ⊂ R with the property that

  if F ∈ B(Ω_X), then X^{-1}(F) ∈ F.

Notes:
• In English: X : Ω → Ω_X ⊂ R is a random variable iff the inverse image of every output event is an input event, and therefore P_X(F) = P(X^{-1}(F)) is well-defined for all events F.
• Another name for a function with this property: measurable function.
• Most every function we encounter is measurable, but the calculus of probability rests on this property, and advanced courses prove measurability of important functions.

In the simple binary quantizer example, X is measurable (easy to show since F = B([0, 1)) contains intervals):

  P_X({0}) = P({r : X(r) = 0}) = P(X^{-1}({0})) = P({r : 0 ≤ r ≤ 0.5}) = P([0, 0.5]) = 0.5
  P_X({1}) = P(X^{-1}({1})) = P((0.5, 1)) = 0.5
  P_X(Ω_X) = P_X({0, 1}) = P(X^{-1}({0, 1})) = P([0, 1)) = 1
  P_X(∅) = P(X^{-1}(∅)) = P(∅) = 0.

In general, find P_X by computing the pmf or pdf, as appropriate. There are many shortcuts, but the basic approach is the inverse image formula.

Random vectors

One k-dimensional random vector = k one-dimensional random variables defined on a common probability space. It can be discrete (described by a multidimensional pmf), continuous (e.g., described by a multidimensional pdf), or mixed.

All of the theory, calculus, and applications of individual random variables are useful for studying random vectors and random processes, since random vectors and processes are simply collections of random variables.

Several notations are used, e.g., X^k = (X_0, X_1, ..., X_{k−1}), which is shorthand for X^k(ω) = (X_0(ω), X_1(ω), ..., X_{k−1}(ω)), or X, or {X_n; n = 0, 1, ..., k−1}, or {X_n; n ∈ Z_k}.

Recall that a real-valued function of a random variable is a random variable. Similarly, a real-valued function of a random vector (several random variables) is a random variable. E.g., if X_0, X_1, ..., X_{n−1} are random variables, then the sample mean

  S_n = (1/n) Σ_{k=0}^{n−1} X_k

is a random variable, defined by S_n(ω) = (1/n) Σ_{k=0}^{n−1} X_k(ω).

Earlier example: two coin flips, k coin flips (the first k binary coefficients of the fair spinner).
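A minimal sketch (not in the original notes), assuming numpy; the value of k is arbitrary. It treats the first k binary digits of one spinner outcome as k coin-flip random variables defined on the same probability space, and evaluates the sample mean S_n as a function of that random vector.

```python
# Minimal sketch: k coin flips as the binary coefficients of one spinner outcome,
# and the sample mean S_n as a function of the resulting random vector.
import numpy as np

rng = np.random.default_rng(1)
k = 16
r = rng.uniform(0.0, 1.0)                      # one outcome of the underlying experiment

# X_i(r) = i-th binary coefficient of r (each a function of the single outcome r)
bits = [int(r * 2 ** (i + 1)) % 2 for i in range(k)]
S_n = sum(bits) / k                            # sample mean of the random vector

print("outcome r      :", r)
print("first k bits   :", bits)
print("sample mean S_n:", S_n)
```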
Inverse image formula for random vectors

A random vector is a finite collection of random variables defined on a common probability space. The inverse image formula carries over:

  P_X(F) = P(X^{-1}(F)) = P({ω : X(ω) ∈ F}) = P({ω : (X_0(ω), X_1(ω), ..., X_{k−1}(ω)) ∈ F}),

where the various forms are equivalent and all stand for Pr(X ∈ F).

Technically, the formula holds for suitable events F ∈ B(R^k), the Borel field of R^k (or some suitable subset). See the book for discussion.

One multidimensional event of particular interest is a Cartesian product of one-dimensional events (called a rectangle):

  F = ×_{i=0}^{k−1} F_i = {x : x_i ∈ F_i; i = 0, ..., k−1},

for which

  P_X(F) = P({ω : X_0(ω) ∈ F_0, X_1(ω) ∈ F_1, ..., X_{k−1}(ω) ∈ F_{k−1}}).

Random processes

A random process is an infinite family of random variables defined on a common probability space. Also called a stochastic process. Many types:

  {X_n; n = 0, 1, 2, ...}   (discrete-time, one-sided)
  {X_n; n ∈ Z}              (discrete-time, two-sided)
  {X_t; t ∈ [0, ∞)}         (continuous-time, one-sided)
  {X_t; t ∈ R}              (continuous-time, two-sided)

In general: {X_t; t ∈ T} or {X(t); t ∈ T}. Other notations: {X(t)}, {X[n]} (for discrete time). Sloppy but common: X(t), with context telling that it is a random process and not a single random variable. Discrete-time random processes are also called time series.

Always: a random process is an indexed family of random variables, with T the index set. For each t, X_t is a random variable, and all X_t are defined on a common probability space.

Keep in mind the suppressed argument ω — e.g., each X_t is X_t(ω), a function defined on the sample space; X(t) is X(t, ω) and can be viewed as a function of two arguments.

The index is usually time; in some applications it is space, e.g., a random field {X(t, s); t, s ∈ [0, 1)} models a random image, and {V(x, y, t); x, y ∈ [0, 1); t ∈ [0, ∞)} models analog video.

Have seen one example — fair coin flips, a Bernoulli random process.

Another, simpler, example: random sinusoids. Suppose that A and Θ are two random variables with a joint pdf f_{A,Θ}(a, θ) = f_A(a) f_Θ(θ); for example, Θ ∼ U([0, 2π)) and A ∼ N(0, σ²). Define a continuous-time random process X(t) for all t ∈ R by

  X(t) = A cos(2πt + Θ),

or, making the dependence on ω explicit,

  X(t, ω) = A(ω) cos(2πt + Θ(ω)).
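A minimal generation sketch (not in the original notes), assuming numpy; the sample times and σ are arbitrary. It shows that one draw of (A, Θ) determines an entire sample path of the random sinusoid.

```python
# Minimal sketch: each outcome omega fixes A(omega) and Theta(omega), and the whole
# waveform X(t, omega) = A(omega) * cos(2*pi*t + Theta(omega)) is then deterministic.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
t = np.linspace(0.0, 2.0, 9)                   # a few sample times

for trial in range(3):                         # three independent outcomes omega
    A = rng.normal(0.0, sigma)                 # A ~ N(0, sigma^2)
    Theta = rng.uniform(0.0, 2 * np.pi)        # Theta ~ U([0, 2*pi))
    X_t = A * np.cos(2 * np.pi * t + Theta)    # one sample path evaluated at times t
    print(f"path {trial}:", np.round(X_t, 3))
```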
Derived distributions for random variables

General problem: given a probability space (Ω, F, P) and a random variable X with range space (alphabet) Ω_X, find the distribution P_X.

If Ω_X is discrete, then P_X is described by a pmf

  p_X(x) = P(X^{-1}({x})) = P({ω : X(ω) = x}),   P_X(F) = Σ_{x∈F} p_X(x).

This is a probability, and the inverse image formula works.

If Ω_X is continuous, then we need a pdf. But a pdf is not a probability, so the inverse image formula does not apply immediately ⇒ alter the approach.

Cumulative distribution functions

Define the cumulative distribution function (cdf) by

  F_X(x) ≡ Pr(X ≤ x) = ∫_{−∞}^{x} f_X(r) dr.

This is a probability, and the inverse image formula works:

  F_X(x) = P(X^{-1}((−∞, x])),

and from calculus

  f_X(x) = d F_X(x) / dx.

So first find the cdf F_X(x), then differentiate to find f_X(x).

Notes:
• If a ≥ b, then since (−∞, a] = (−∞, b] ∪ (b, a] is a union of disjoint intervals, F_X(a) = F_X(b) + P_X((b, a]), and hence
  P_X((b, a]) = F_X(a) − F_X(b) = ∫_{b}^{a} f_X(x) dx
  ⇒ F_X(x) is monotonically nondecreasing.
• The cdf is well defined for discrete random variables too: F_X(r) = Pr(X ≤ r) = Σ_{x: x≤r} p_X(x), but it is not as useful and is not needed for derived distributions.

If the original space (Ω, F, P) is a discrete probability space, then a random variable X defined on it is also discrete. The inverse image formula ⇒

  p_X(x) = P_X({x}) = P(X^{-1}({x})) = Σ_{ω: X(ω)=x} p(ω).

Example: discrete derived distribution

Ω = Z_+, with P determined by the geometric pmf p(k) = (1 − p)^{k−1} p; k = 1, 2, .... Define a random variable Y by Y(ω) = 1 if ω is even, 0 if ω is odd.

Using the inverse image formula for the pmf for Y(ω) = 1:

  p_Y(1) = Σ_{k=2,4,...} (1 − p)^{k−1} p = p(1 − p) Σ_{k=0}^{∞} ((1 − p)²)^k
         = p(1 − p) / (1 − (1 − p)²) = (1 − p) / (2 − p),

  p_Y(0) = 1 − p_Y(1) = 1 / (2 − p).

Example: continuous derived distribution

Suppose the original space is (Ω, F, P) = (R, B(R), P), where P is described by a pdf g:

  P(F) = ∫_F g(r) dr;  F ∈ B(R).

Let X be a random variable. The inverse image formula ⇒

  P_X(F) = P(X^{-1}(F)) = ∫_{r: X(r)∈F} g(r) dr.

If X is discrete, find the pmf p_X(x) = ∫_{r: X(r)=x} g(r) dr. (The quantizer example did this.)

If X is continuous, we want the pdf: first find the cdf, then differentiate.

Square of a random variable

(R, B(R), P) with P induced by a Gaussian pdf. Define W : R → R by W(r) = r²; r ∈ R. Find the pdf f_W.

First find the cdf F_W, then differentiate. If w < 0, F_W(w) = 0. If w ≥ 0,

  F_W(w) = Pr(W ≤ w) = P({r : r² ≤ w}) = P([−w^{1/2}, w^{1/2}]) = ∫_{−w^{1/2}}^{w^{1/2}} g(r) dr.

This can be complicated, but we don't need to plug in g yet. Use the integral differentiation (Leibniz) formula to get the pdf directly:

  d/dw ∫_{a(w)}^{b(w)} g(r) dr = g(b(w)) db(w)/dw − g(a(w)) da(w)/dw.

In our example

  f_W(w) = g(w^{1/2}) (w^{-1/2}/2) − g(−w^{1/2}) (−w^{-1/2}/2) = (w^{-1/2}/2) [ g(w^{1/2}) + g(−w^{1/2}) ].

E.g., if g = N(0, σ²), then

  f_W(w) = w^{-1/2} e^{−w/(2σ²)} / √(2πσ²);  w ∈ [0, ∞)

— a chi-squared pdf with one degree of freedom.
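A numerical check (not in the original notes), assuming numpy; the bin edges, seed, and sample size are arbitrary. It compares a histogram of W = X² for Gaussian X against the chi-squared density just derived.

```python
# Monte Carlo check of the derived density f_W(w) = w**-0.5 * exp(-w/(2*sigma**2)) / sqrt(2*pi*sigma**2).
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
W = rng.normal(0.0, sigma, size=500_000) ** 2

edges = np.linspace(0.1, 4.0, 14)              # avoid the integrable singularity at w = 0
hist, _ = np.histogram(W, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f_W = centers ** -0.5 / np.sqrt(2 * np.pi * sigma ** 2) * np.exp(-centers / (2 * sigma ** 2))

print(np.round(hist, 3))                       # empirical density
print(np.round(f_W, 3))                        # formula, should be close
```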
The max and min functions

Let X ∼ f_X(x) and Y ∼ f_Y(y) be independent, so that f_{X,Y}(x, y) = f_X(x) f_Y(y). Define U = max{X, Y}, V = min{X, Y}, where

  max(x, y) = x if x ≥ y, y otherwise;   min(x, y) = y if x ≥ y, x otherwise.

Find the pdfs of U and V. (A numerical check of the resulting formulas is sketched at the end of this segment.)

To find the pdf of U, first find its cdf. U ≤ u iff both X and Y are ≤ u, so using independence

  F_U(u) = Pr(U ≤ u) = Pr(X ≤ u, Y ≤ u) = F_X(u) F_Y(u).

Using the product rule for derivatives,

  f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u).

To find the pdf of V, first find its cdf. V ≤ v iff either X or Y is ≤ v, so using independence

  F_V(v) = Pr(X ≤ v or Y ≤ v) = 1 − Pr(X > v, Y > v) = 1 − (1 − F_X(v))(1 − F_Y(v)).

Thus

  f_V(v) = f_X(v) + f_Y(v) − f_X(v) F_Y(v) − f_Y(v) F_X(v).

Directly-given random variables

All named examples of pmfs (uniform, Bernoulli, binomial, geometric, Poisson) and pdfs (uniform, exponential, Gaussian, Laplacian, chi-squared, etc.) and the probability spaces they imply can be considered as describing random variables.

Suppose (Ω, F, P) is a probability space with Ω ⊂ R. Define a random variable V : Ω → Ω by V(ω) = ω — the identity mapping; the random variable just reports the original sample value ω. This implies an output probability space in a trivial way: P_V(F) = P(V^{-1}(F)) = P(F). If the original space is discrete (continuous), so is the random variable, and the random variable is described by a pmf (pdf).

A random variable is said to be Bernoulli, binomial, etc. if its distribution is determined by a Bernoulli, binomial, etc. pmf (or pdf).

Two random variables V and X (possibly defined on different experiments) are said to be equivalent or identically distributed if P_V = P_X, i.e., P_V(F) = P_X(F) for all events F — e.g., both continuous with the same pdf, or both discrete with the same pmf.

Example: a binary random variable defined as the quantization of the fair spinner vs. directly given as above.

Note: two ways to describe random variables:
1. Describe a probability space (Ω, F, P) and define a function X on it. Together these imply the distribution P_X for the random variable (by a pmf or pdf).
2. (Directly given) Describe the distribution P_X directly (by a pmf or pdf). Implicitly (Ω, F, P) = (Ω_X, B(Ω_X), P_X) and X(ω) = ω.
Both representations are useful.

Derived distributions: random vectors

As in the scalar case, the distribution of a random vector can be described by probability functions — cdfs and either pmfs or pdfs (or both).

If the random vector has a discrete range space, the distribution can be described by a multidimensional pmf p_X(x) = P_X({x}) = Pr(X = x) as

  P_X(F) = Σ_{x∈F} p_X(x) = Σ_{(x_0,x_1,...,x_{k−1})∈F} p_{X_0,X_1,...,X_{k−1}}(x_0, x_1, ..., x_{k−1}).

If the random vector X has a continuous range space, then the distribution can be described by a multidimensional pdf f_X:

  P_X(F) = ∫_F f_X(x) dx.

Use the multidimensional cdf to find the pdf. Given a k-dimensional random vector X, define the cumulative distribution function (cdf) F_X by

  F_X(x) = P_X(×_{i=0}^{k−1} (−∞, x_i]) = P({ω : X_i(ω) ≤ x_i; i = 0, 1, ..., k−1}) = P(∩_{i=0}^{k−1} X_i^{-1}((−∞, x_i])).

Other ways to express the multidimensional cdf:

  F_X(x) = F_{X_0,X_1,...,X_{k−1}}(x_0, x_1, ..., x_{k−1})
         = P_X({α : α_i ≤ x_i; i = 0, 1, ..., k−1})
         = Pr(X_i ≤ x_i; i = 0, 1, ..., k−1)
         = ∫_{−∞}^{x_0} ∫_{−∞}^{x_1} ··· ∫_{−∞}^{x_{k−1}} f_{X_0,X_1,...,X_{k−1}}(α_0, α_1, ..., α_{k−1}) dα_0 dα_1 ··· dα_{k−1}.

Integration and differentiation are inverses of each other ⇒

  f_{X_0,X_1,...,X_{k−1}}(x_0, x_1, ..., x_{k−1}) = ∂^k F_{X_0,X_1,...,X_{k−1}}(x_0, x_1, ..., x_{k−1}) / (∂x_0 ∂x_1 ··· ∂x_{k−1}).
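The numerical check promised in the max/min discussion above (not in the original notes), assuming numpy; it uses two independent standard Gaussians, for which the formula specializes to f_U(u) = 2 φ(u) Φ(u).

```python
# Check f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u) for U = max{X, Y}, X, Y iid N(0, 1).
import numpy as np
from math import erf, pi, sqrt, exp

rng = np.random.default_rng(4)
X = rng.normal(size=1_000_000)
Y = rng.normal(size=1_000_000)
U = np.maximum(X, Y)

def phi(u):   # standard Gaussian pdf
    return exp(-u * u / 2) / sqrt(2 * pi)

def Phi(u):   # standard Gaussian cdf
    return 0.5 * (1 + erf(u / sqrt(2)))

for u in (-1.0, 0.0, 1.0):
    h = 0.05
    empirical = np.mean(np.abs(U - u) < h) / (2 * h)      # local density estimate
    print(u, round(empirical, 3), round(2 * phi(u) * Phi(u), 3))
```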
Joint and marginal distributions

A random vector X = (X_0, X_1, ..., X_{k−1}) is a collection of random variables defined on a common probability space (Ω, F, P). Alternatively, X is a random vector that takes on values randomly as described by a probability distribution P_X, without explicit reference to the underlying probability space. Either the original probability measure P or the induced distribution P_X can be used to compute probabilities of events involving the random vector — e.g., for finding the distributions of individual components of the random vector.

For example, if X = (X_0, X_1, ..., X_{k−1}) is discrete, described by a pmf p_X, then the distribution P_{X_0} is described by the pmf p_{X_0}(x_0), which can be computed as

  p_{X_0}(x_0) = P({ω : X_0(ω) = x_0})
              = P({ω : X_0(ω) = x_0, X_i(ω) ∈ Ω_X; i = 1, 2, ..., k−1})
              = Σ_{x_1,x_2,...,x_{k−1}} p_X(x_0, x_1, x_2, ..., x_{k−1}).

In English, all of these are Pr(X_0 = x_0).

In general we have for cdfs that

  F_{X_0}(x_0) = P({ω : X_0(ω) ≤ x_0})
              = P({ω : X_0(ω) ≤ x_0, X_i(ω) ∈ Ω_X; i = 1, 2, ..., k−1})
              = F_X(x_0, ∞, ∞, ..., ∞)

⇒ if the pdfs exist,

  f_{X_0}(x_0) = ∫ f_X(x_0, x_1, x_2, ..., x_{k−1}) dx_1 dx_2 ··· dx_{k−1}.

Can find the distribution of any component in this way: sum or integrate over all of the dummy variables corresponding to the unwanted random variables in the vector to obtain the pmf or pdf for the random variable X_i:

  F_{X_i}(α) = F_X(∞, ∞, ..., ∞, α, ∞, ..., ∞),  or  Pr(X_i ≤ α) = Pr(X_i ≤ α and X_j ≤ ∞, all j ≠ i),

  p_{X_i}(α) = Σ_{x_0,...,x_{i−1},x_{i+1},...,x_{k−1}} p_{X_0,X_1,...,X_{k−1}}(x_0, ..., x_{i−1}, α, x_{i+1}, ..., x_{k−1}),

or

  f_{X_i}(α) = ∫ dx_0 ··· dx_{i−1} dx_{i+1} ··· dx_{k−1} f_{X_0,...,X_{k−1}}(x_0, ..., x_{i−1}, α, x_{i+1}, ..., x_{k−1}).

These relations are called consistency relationships — a random vector distribution implies many other distributions, and these must be consistent with each other. Similarly one can find cdfs/pmfs/pdfs for any pair or triple of random variables in the random vector, or any other subvector (at least in theory).

2D random vectors

The ideas are clearest when there are only two random variables. Let (X, Y) be a random vector. The marginal distribution of X is obtained from the joint distribution of X and Y by leaving Y unconstrained:

  P_X(F) = P_{X,Y}({(x, y) : x ∈ F, y ∈ R});  F ∈ B(R).

The marginal cdf of X is F_X(α) = F_{X,Y}(α, ∞).

If the range space of the vector (X, Y) is discrete,

  p_X(x) = Σ_y p_{X,Y}(x, y).

If the range space of the vector (X, Y) is continuous and the cdf is differentiable so that f_{X,Y}(x, y) exists,

  f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,

with similar expressions for the distribution of the random variable Y.

Joint distributions imply marginal distributions. The opposite is not true without additional assumptions, e.g., independence.
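A minimal sketch (not in the original notes), assuming numpy; the joint pmf values are hypothetical, chosen only to illustrate the mechanics of summing out the unwanted dummy variable.

```python
# Marginalization of a small 2D joint pmf: p_X(x) = sum_y p_XY(x, y), p_Y(y) = sum_x p_XY(x, y).
import numpy as np

p_XY = np.array([[0.10, 0.20, 0.10],           # rows indexed by x, columns by y (hypothetical values)
                 [0.05, 0.25, 0.30]])
assert np.isclose(p_XY.sum(), 1.0)

p_X = p_XY.sum(axis=1)                         # sum over y
p_Y = p_XY.sum(axis=0)                         # sum over x
print("p_X =", p_X)                            # [0.4, 0.6]
print("p_Y =", p_Y)                            # [0.15, 0.45, 0.4]
```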
Examples of joint and marginal distributions

Suppose the random variables X and Y are such that the random vector (X, Y) has a pmf of the form

  p_{X,Y}(x, y) = r(x) q(y),

where r and q are both valid pmfs (p_{X,Y} is a product pmf). Then

  p_X(x) = Σ_y p_{X,Y}(x, y) = Σ_y r(x) q(y) = r(x) Σ_y q(y) = r(x),

and similarly p_Y(y) = q(y). Thus in the special case of a product distribution, knowing the marginal pmfs is enough to know the joint distribution: marginal distributions + independence ⇒ the joint distribution.

A pair of fair coins provides an example:

  p_{XY}(x, y) = p_X(x) p_Y(y) = 1/4;  x, y = 0, 1,
  p_X(x) = p_Y(y) = 1/2;  x, y = 0, 1.

Example where marginals are not enough

Flip two fair coins connected by a piece of flexible rubber:

  p_{XY}(x, y)    y = 0    y = 1
  x = 0            0.4      0.1
  x = 1            0.1      0.4

⇒ p_X(x) = p_Y(y) = 1/2, x, y = 0, 1.

Not a product distribution, but it has the same marginals as the product distribution case. Quite different joints can yield the same marginals. Marginals alone do not tell the story.

Another example

A loaded pair of six-sided dice has the property that the sum of the two dice = 7 on every roll. All six possible combinations ((1,6), (2,5), (3,4), (4,3), (5,2), (6,1)) have equal probability. Suppose the outcome of one die is X and the other is Y. Then (X, Y) is a random vector taking values in {1, 2, ..., 6}², with

  p_{X,Y}(x, y) = 1/6,  x + y = 7, (x, y) ∈ {1, 2, ..., 6}².

Find the marginal pmfs:

  p_X(x) = Σ_y p_{XY}(x, y) = p_{XY}(x, 7 − x) = 1/6,  x = 1, 2, ..., 6,

the same as if the distribution were a product distribution. Again, marginals alone do not imply the joint.

Continuous example

(X, Y) is a random vector with a pdf that is constant on the unit disk in the X–Y plane:

  f_{X,Y}(x, y) = C if x² + y² ≤ 1,  0 otherwise.

Find the marginal pdfs. Is it a product pdf?

Need C: ∫∫_{x²+y²≤1} C dx dy = 1. The integral = the area of the unit circle multiplied by C ⇒ C = 1/π. Then

  f_X(x) = ∫_{−√(1−x²)}^{+√(1−x²)} C dy = 2C √(1 − x²),  x² ≤ 1.

Could also have found C by a second integration: ∫_{−1}^{+1} 2C √(1 − x²) dx = πC = 1, or C = 1/π. Thus

  f_X(x) = (2/π) √(1 − x²),  x² ≤ 1.

By symmetry Y has the same pdf. f_{X,Y} is not a product pdf. Note that the marginal pdf is not constant, even though the joint pdf is.
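A numerical check of the disk marginal (not in the original notes), assuming numpy; the sample size and binning are arbitrary. Points are drawn uniformly on the disk by rejection sampling and the histogram of X is compared with (2/π)√(1 − x²).

```python
# Monte Carlo check of f_X(x) = (2/pi) * sqrt(1 - x**2) for (X, Y) uniform on the unit disk.
import numpy as np

rng = np.random.default_rng(5)
pts = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))
disk = pts[(pts ** 2).sum(axis=1) <= 1.0]      # rejection sampling: keep points inside the disk

edges = np.linspace(-1.0, 1.0, 11)
hist, _ = np.histogram(disk[:, 0], bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.round(hist, 3))                                       # empirical marginal of X
print(np.round(2 / np.pi * np.sqrt(1 - centers ** 2), 3))      # formula
```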
Joints and marginals: Gaussian pair

2D Gaussian pdf with k = 2, m = (0, 0)^t, and Λ = {λ(i, j) : λ(1,1) = λ(2,2) = 1, λ(1,2) = λ(2,1) = ρ}. The inverse matrix is

  Λ^{-1} = (1/(1 − ρ²)) [[1, −ρ], [−ρ, 1]],

so the joint pdf for the random vector (X, Y) is

  f_{X,Y}(x, y) = exp( −(x² + y² − 2ρxy) / (2(1 − ρ²)) ) / (2π √(1 − ρ²)),  (x, y) ∈ R².

ρ is called the "correlation coefficient." We need ρ² < 1 for Λ to be positive definite.

To find the pdf of X, integrate the joint over y, using a standard trick: complete the square,

  x² + y² − 2ρxy = (y − ρx)² − ρ²x² + x² = (y − ρx)² + (1 − ρ²)x²,

so that

  f_{X,Y}(x, y) = [ exp( −(y − ρx)² / (2(1 − ρ²)) ) / √(2π(1 − ρ²)) ] × [ exp(−x²/2) / √(2π) ].

The first factor, as a function of y, is the N(ρx, 1 − ρ²) pdf, which integrates to 1. Thus

  f_X(x) = (2π)^{-1/2} e^{−x²/2}.

Note that the marginals are the same regardless of ρ!

Consistency & directly given processes

Have seen two ways to describe (specify) a random variable — as a probability space + a function (random variable), or as a directly given random variable (a distribution — a pdf or pmf). The same idea works for random vectors. What about random processes? E.g., a direct definition of the fair coin flipping process.

For simplicity, consider a discrete time, discrete alphabet random process, say {X_n}. Given the random process, we can use the inverse image formula to compute the pmf for any finite collection of samples (X_{k_1}, X_{k_2}, ..., X_{k_K}), e.g.,

  p_{X_{k_1},X_{k_2},...,X_{k_K}}(x_1, x_2, ..., x_K) = Pr(X_{k_i} = x_i; i = 1, ..., K)
                                                     = P({ω : X_{k_i}(ω) = x_i; i = 1, ..., K}).

For example, in the fair coin flipping process

  p_{X_{k_1},X_{k_2},...,X_{k_K}}(x_1, x_2, ..., x_K) = 2^{−K},  all (x_1, x_2, ..., x_K) ∈ {0, 1}^K.

The axioms of probability ⇒ these pmfs, for any choice of K and k_1, ..., k_K, must be consistent in the sense that if any of the pmfs is used to compute the probability of an event, the answer must be the same. E.g.,

  p_{X_1}(x_1) = Σ_{x_2} p_{X_1,X_2}(x_1, x_2) = Σ_{x_0,x_2} p_{X_0,X_1,X_2}(x_0, x_1, x_2) = Σ_{x_3,x_5} p_{X_1,X_3,X_5}(x_1, x_3, x_5),

since all of these computations yield the same probability in the original probability space, Pr(X_1 = x_1) = P({ω : X_1(ω) = x_1}).

Bottom line

If given a discrete time, discrete alphabet random process {X_n; n ∈ Z}, then for any finite K and collection of K sample times k_1, ..., k_K we can find the joint pmf p_{X_{k_1},X_{k_2},...,X_{k_K}}(x_1, x_2, ..., x_K), and this collection of pmfs must be consistent. The same result holds for continuous time random processes and for continuous alphabet processes (a family of pdfs).

Kolmogorov proved a converse to this idea, now called the Kolmogorov extension theorem, which provides the most common method for describing a random process: to completely describe a random process, you need only provide a formula for a consistent family of pmfs for finite collections of samples. Difficult to prove, but the most common way to specify a model. Kolmogorov or directly-given representation of a random process: describe a consistent family of vector distributions.

Theorem (Kolmogorov extension theorem for discrete time processes). Given a consistent family of finite-dimensional pmfs p_{X_{k_1},X_{k_2},...,X_{k_K}}(x_1, x_2, ..., x_K) for all dimensions K and sample times k_1, ..., k_K, there is a random process {X_n; n ∈ Z} described by these marginals.
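Before the general statement, here is a small numerical illustration (not in the original notes) of the consistency requirement, assuming only the Python standard library: summing the three-sample fair-coin pmf 2^{−3} over the unwanted coordinates must reproduce the one-sample pmf value 1/2.

```python
# Consistency check for the fair-coin family p(x_1, ..., x_K) = 2**-K.
import itertools

def p3(x0, x1, x2):            # joint pmf of (X0, X1, X2) for fair coin flips
    return 2.0 ** -3

for x1 in (0, 1):
    total = sum(p3(x0, x1, x2) for x0, x2 in itertools.product((0, 1), repeat=2))
    print(f"sum over x0, x2 of p(x0, {x1}, x2) =", total)     # 0.5 in both cases
```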
For completeness, the general statement:

Theorem (Kolmogorov extension theorem). Suppose that one is given a consistent family of finite-dimensional distributions P_{X_{t_0},X_{t_1},...,X_{t_{k−1}}} for all positive integers k and all possible sample times t_i ∈ T; i = 0, 1, ..., k−1. Then there exists a random process {X_t; t ∈ T} that is consistent with this family.

In other words, to describe a random process completely, it is sufficient to describe a consistent family of finite-dimensional distributions of its samples.

Example: given a pmf p, define a family of vector pmfs by

  p_{X_{k_1},X_{k_2},...,X_{k_K}}(x_1, x_2, ..., x_K) = ∏_{i=1}^{K} p(x_i);

then there is a random process {X_n} having these vector pmfs for finite collections of samples. A process of this form is called an iid process.

The continuous alphabet analog is defined in terms of a pdf f — define the vector pdfs by

  f_{X_{k_1},X_{k_2},...,X_{k_K}}(x_1, x_2, ..., x_K) = ∏_{i=1}^{K} f(x_i).

A discrete time continuous alphabet process is iid if its joint pdfs factor in this way.

Independent random variables

Return to the definition of independent random variables, with more explanation. The definition of independent random variables is an application of the definition of independent events: events F and G are independent if P(F ∩ G) = P(F)P(G).

Two random variables X and Y defined on a probability space are independent if the events X^{-1}(F) and Y^{-1}(G) are independent for all F and G in B(R), i.e., if

  P(X^{-1}(F) ∩ Y^{-1}(G)) = P(X^{-1}(F)) P(Y^{-1}(G)).

Equivalently, Pr(X ∈ F, Y ∈ G) = Pr(X ∈ F) Pr(Y ∈ G), or P_{XY}(F × G) = P_X(F) P_Y(G).

If X, Y are discrete, choosing F = {x}, G = {y} ⇒

  p_{XY}(x, y) = p_X(x) p_Y(y)  for all x, y.

Conversely, if the joint pmf is the product of the marginals, then evaluate Pr(X ∈ F, Y ∈ G) as

  P(X^{-1}(F) ∩ Y^{-1}(G)) = Σ_{x∈F, y∈G} p_{XY}(x, y) = Σ_{x∈F, y∈G} p_X(x) p_Y(y)
                           = [Σ_{x∈F} p_X(x)] [Σ_{y∈G} p_Y(y)] = P(X^{-1}(F)) P(Y^{-1}(G))

⇒ independent by the general definition.

For general random variables, consider F = (−∞, x], G = (−∞, y]. Then if X, Y are independent,

  F_{XY}(x, y) = F_X(x) F_Y(y)  for all x, y.

If pdfs exist, this implies that f_{XY}(x, y) = f_X(x) f_Y(y). Conversely, if this relation holds for all x, y, then P(X^{-1}(F) ∩ Y^{-1}(G)) = P(X^{-1}(F)) P(Y^{-1}(G)), and hence X and Y are independent.

A collection of random variables {X_i; i = 0, 1, ..., k−1} is independent or mutually independent if all collections of events of the form {X_i^{-1}(F_i); i = 0, 1, ..., k−1} are mutually independent for any F_i ∈ B(R); i = 0, 1, ..., k−1.

A collection of discrete random variables X_i; i = 0, 1, ..., k−1 is mutually independent iff

  p_{X_0,...,X_{k−1}}(x_0, ..., x_{k−1}) = ∏_{i=0}^{k−1} p_{X_i}(x_i)  for all x_i.

A collection of continuous random variables is independent iff the joint pdf factors as

  f_{X_0,...,X_{k−1}}(x_0, ..., x_{k−1}) = ∏_{i=0}^{k−1} f_{X_i}(x_i).

A collection of general random variables is independent iff the joint cdf factors as

  F_{X_0,...,X_{k−1}}(x_0, ..., x_{k−1}) = ∏_{i=0}^{k−1} F_{X_i}(x_i);  (x_0, x_1, ..., x_{k−1}) ∈ R^k.

The random vector is independent, identically distributed (iid) if the components are independent and the marginal distributions are all the same.
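A minimal sketch (not in the original notes), assuming numpy: for a discrete pair, independence is equivalent to the joint pmf equaling the outer product of its marginals. The "rubber coins" joint from the earlier example fails this test even though its marginals match the fair-coin case.

```python
# Independence test: compare the joint pmf with the outer product of its marginals.
import numpy as np

p_indep = np.array([[0.25, 0.25],
                    [0.25, 0.25]])             # product joint: two fair coins
p_rubber = np.array([[0.4, 0.1],
                     [0.1, 0.4]])              # same marginals, not a product

for name, p in (("fair coins", p_indep), ("rubber coins", p_rubber)):
    p_X, p_Y = p.sum(axis=1), p.sum(axis=0)
    print(name, "independent?", np.allclose(p, np.outer(p_X, p_Y)))
```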
Conditional distributions

Apply conditional probability to distributions. This lets us express joint probabilities as products even if the random variables are not independent — e.g., the distribution of an input given an observed output (for inference). There are many types: conditional pmfs, conditional pdfs, conditional cdfs; and both elementary and nonelementary conditional probability.

Discrete conditional distributions

The simplest case is a direct application of elementary conditional probability to pmfs. Consider a 2D discrete random vector (X, Y) with alphabet A_X × A_Y, joint pmf p_{X,Y}(x, y), and marginal pmfs p_X and p_Y.

Define, for each x ∈ A_X for which p_X(x) > 0, the conditional pmf

  p_{Y|X}(y|x) = P(Y = y | X = x)
              = P(Y = y, X = x) / P(X = x)
              = P({ω : Y(ω) = y} ∩ {ω : X(ω) = x}) / P({ω : X(ω) = x})
              = p_{X,Y}(x, y) / p_X(x),

the elementary conditional probability that Y = y given X = x.

Properties of conditional pmfs

For fixed x, p_{Y|X}(·|x) is a pmf:

  Σ_{y∈A_Y} p_{Y|X}(y|x) = Σ_{y∈A_Y} p_{X,Y}(x, y) / p_X(x) = p_X(x) / p_X(x) = 1.

The joint pmf can be expressed as a product:

  p_{X,Y}(x, y) = p_{Y|X}(y|x) p_X(x).

Can compute conditional probabilities by summing conditional pmfs:

  P(Y ∈ F | X = x) = Σ_{y∈F} p_{Y|X}(y|x).

Can write probabilities of events of the form X ∈ G, Y ∈ F (rectangles) as

  P(X ∈ G, Y ∈ F) = Σ_{x∈G, y∈F} p_{X,Y}(x, y)
                  = Σ_{x∈G} p_X(x) Σ_{y∈F} p_{Y|X}(y|x)
                  = Σ_{x∈G} p_X(x) P(F | X = x).    (⋆)

Later: nonelementary conditional probability is defined to mimic this formula.

If X and Y are independent, then p_{Y|X}(y|x) = p_Y(y).

Given p_{Y|X} and p_X, Bayes' rule for pmfs:

  p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y) = p_{Y|X}(y|x) p_X(x) / Σ_u p_{Y|X}(y|u) p_X(u),

a result often referred to as Bayes' rule.

Example of Bayes' rule: Binary Symmetric Channel

Consider the following binary communication channel:

  X ∈ {0, 1} ──→ (+ mod 2) ──→ Y ∈ {0, 1},  with the noise Z ∈ {0, 1} entering the adder.

The bit sent is X ∼ Bern(p), 0 ≤ p ≤ 1, the noise is Z ∼ Bern(ε), 0 ≤ ε ≤ 0.5, the bit received is Y = (X + Z) mod 2 = X ⊕ Z, and X and Z are independent.

Find 1) p_{X|Y}(x|y), 2) p_Y(y), and 3) Pr{X ≠ Y}, the probability of error.

1. To find p_{X|Y}(x|y), use Bayes' rule:

  p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / Σ_{x'∈A_X} p_{Y|X}(y|x') p_X(x').

We know p_X(x), but we need to find p_{Y|X}(y|x):

  p_{Y|X}(y|x) = Pr{Y = y | X = x} = Pr{X ⊕ Z = y | X = x} = Pr{x ⊕ Z = y | X = x}
              = Pr{Z = y ⊕ x | X = x} = Pr{Z = y ⊕ x}   (since Z and X are independent)
              = p_Z(y ⊕ x).

Therefore

  p_{Y|X}(0|0) = p_Z(0 ⊕ 0) = p_Z(0) = 1 − ε
  p_{Y|X}(0|1) = p_Z(0 ⊕ 1) = p_Z(1) = ε
  p_{Y|X}(1|0) = p_Z(1 ⊕ 0) = p_Z(1) = ε
  p_{Y|X}(1|1) = p_Z(1 ⊕ 1) = p_Z(0) = 1 − ε.
Plugging into Bayes' rule:

  p_{X|Y}(0|0) = p_{Y|X}(0|0) p_X(0) / [p_{Y|X}(0|0) p_X(0) + p_{Y|X}(0|1) p_X(1)]
              = (1 − ε)(1 − p) / [(1 − ε)(1 − p) + εp]
  p_{X|Y}(1|0) = 1 − p_{X|Y}(0|0) = εp / [(1 − ε)(1 − p) + εp]
  p_{X|Y}(0|1) = p_{Y|X}(1|0) p_X(0) / [p_{Y|X}(1|0) p_X(0) + p_{Y|X}(1|1) p_X(1)]
              = ε(1 − p) / [(1 − ε)p + ε(1 − p)]
  p_{X|Y}(1|1) = 1 − p_{X|Y}(0|1) = (1 − ε)p / [(1 − ε)p + ε(1 − p)].

2. We already found p_Y(y) as the denominator in Bayes' rule:

  p_Y(y) = p_{Y|X}(y|0) p_X(0) + p_{Y|X}(y|1) p_X(1)
        = (1 − ε)(1 − p) + εp   for y = 0,
        = ε(1 − p) + (1 − ε)p   for y = 1.

3. To find the probability of error Pr{X ≠ Y}, consider

  Pr{X ≠ Y} = p_{X,Y}(0, 1) + p_{X,Y}(1, 0) = p_{Y|X}(1|0) p_X(0) + p_{Y|X}(0|1) p_X(1)
            = ε(1 − p) + εp = ε.

An interesting special case is ε = 1/2. Here Pr{X ≠ Y} = 1/2, which is the worst possible (no information is sent), and

  p_Y(0) = (1/2)p + (1/2)(1 − p) = 1/2 = p_Y(1).

Therefore Y ∼ Bern(1/2), independent of the value of p! In this case, the bit sent X and the bit received Y are independent (check this). (A numerical check of this example is sketched at the end of this segment.)

Conditional pmfs for vectors

Random vector (X_0, X_1, ..., X_{k−1}) with pmf p_{X_0,X_1,...,X_{k−1}}. Define conditional pmfs (assuming the denominators are nonzero)

  p_{X_l|X_0,...,X_{l−1}}(x_l|x_0, ..., x_{l−1}) = p_{X_0,...,X_l}(x_0, ..., x_l) / p_{X_0,...,X_{l−1}}(x_0, ..., x_{l−1})

⇒ chain rule:

  p_{X_0,X_1,...,X_{n−1}}(x_0, x_1, ..., x_{n−1})
    = [ p_{X_0,X_1,...,X_{n−1}}(x_0, ..., x_{n−1}) / p_{X_0,X_1,...,X_{n−2}}(x_0, ..., x_{n−2}) ] p_{X_0,X_1,...,X_{n−2}}(x_0, ..., x_{n−2})
    = ···
    = p_{X_0}(x_0) ∏_{i=1}^{n−1} p_{X_0,X_1,...,X_i}(x_0, x_1, ..., x_i) / p_{X_0,X_1,...,X_{i−1}}(x_0, x_1, ..., x_{i−1})
    = p_{X_0}(x_0) ∏_{i=1}^{n−1} p_{X_i|X_0,...,X_{i−1}}(x_i|x_0, ..., x_{i−1}).

This formula plays an important role in characterizing memory in processes. It can be used to construct joint pmfs, and to specify a random process.
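The numerical check of the binary symmetric channel example promised above (not in the original notes), assuming numpy; p, ε, seed, and sample size are arbitrary. It compares the empirical error rate and posterior with the formulas just derived.

```python
# Simulate the BSC: X ~ Bern(p), Z ~ Bern(eps), Y = X xor Z, X and Z independent.
import numpy as np

rng = np.random.default_rng(6)
p, eps, n = 0.3, 0.1, 1_000_000
X = (rng.uniform(size=n) < p).astype(int)
Z = (rng.uniform(size=n) < eps).astype(int)
Y = X ^ Z

print("empirical Pr{X != Y} :", (X != Y).mean(), " formula:", eps)
post = X[Y == 1].mean()                        # empirical P(X = 1 | Y = 1)
formula = (1 - eps) * p / ((1 - eps) * p + eps * (1 - p))
print("empirical P(X=1|Y=1) :", round(post, 4), " formula:", round(formula, 4))
```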
Continuous conditional distributions

Continuous distributions are more complicated. The problem: the conditioning event {X = x} has probability 0, so elementary conditional probability does not work.

Given X, Y with joint pdf f_{X,Y} and marginal pdfs f_X, f_Y, define the conditional pdf

  f_{Y|X}(y|x) ≡ f_{X,Y}(x, y) / f_X(x),

analogous to a conditional pmf, but, unlike a conditional pmf, not a conditional probability! It is a density of conditional probability.

The conditional pdf is a pdf:

  ∫ f_{Y|X}(y|x) dy = ∫ f_{X,Y}(x, y) dy / f_X(x) = f_X(x) / f_X(x) = 1,

provided that f_X(x) > 0 over the region of integration.

Nonelementary conditional probability

Given a conditional pdf f_{Y|X}, define the (nonelementary) conditional probability that Y ∈ F given X = x by

  P(Y ∈ F | X = x) ≡ ∫_F f_{Y|X}(y|x) dy.

This resembles the discrete form. Does it make sense as an appropriate definition of conditional probability given an event of zero probability? Observe that, analogous to the (⋆)-ed result for pmfs, assuming the pdfs all make sense,

  P(X ∈ G, Y ∈ F) = ∫∫_{x∈G, y∈F} f_{X,Y}(x, y) dx dy
                  = ∫_{x∈G} f_X(x) [ ∫_{y∈F} f_{Y|X}(y|x) dy ] dx
                  = ∫_{x∈G} f_X(x) P(F | X = x) dx.    (⋆⋆)

Our definition is ad hoc. But the careful mathematical definition of conditional probability P(F | X = x) for an event of zero probability is made not by a formula such as we have used to define conditional pmfs, conditional pdfs, and elementary conditional probability, but by its behavior inside an integral (like the Dirac delta). In particular, P(F | X = x) is defined as any measurable function satisfying equation (⋆⋆) for all events F and G — which our definition does.

Bayes' rule for pdfs:

  f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = f_{Y|X}(y|x) f_X(x) / ∫ f_{Y|X}(y|u) f_X(u) du.

Example of conditional pdfs: 2D Gaussian

U = (X, Y), Gaussian pdf with mean (m_X, m_Y)^t and covariance matrix

  Λ = [[σ_X², ρσ_Xσ_Y], [ρσ_Xσ_Y, σ_Y²]].

Algebra ⇒ det(Λ) = σ_X²σ_Y²(1 − ρ²) and

  Λ^{-1} = (1/(1 − ρ²)) [[1/σ_X², −ρ/(σ_Xσ_Y)], [−ρ/(σ_Xσ_Y), 1/σ_Y²]],

so

  f_{XY}(x, y) = (1/(2π√det Λ)) exp( −(1/2)(x − m_X, y − m_Y) Λ^{-1} (x − m_X, y − m_Y)^t )
              = (1/(2πσ_Xσ_Y√(1 − ρ²))) exp( −(1/(2(1 − ρ²))) [ ((x − m_X)/σ_X)² − 2ρ(x − m_X)(y − m_Y)/(σ_Xσ_Y) + ((y − m_Y)/σ_Y)² ] ).

Rearrange:

  f_{XY}(x, y) = [ exp( −(y − m_Y − (ρσ_Y/σ_X)(x − m_X))² / (2(1 − ρ²)σ_Y²) ) / √(2πσ_Y²(1 − ρ²)) ]
               × [ exp( −(1/2)((x − m_X)/σ_X)² ) / √(2πσ_X²) ]

⇒

  f_{Y|X}(y|x) = exp( −(y − m_Y − (ρσ_Y/σ_X)(x − m_X))² / (2(1 − ρ²)σ_Y²) ) / √(2πσ_Y²(1 − ρ²)),

Gaussian with conditional variance σ²_{Y|X} ≡ σ_Y²(1 − ρ²) and conditional mean m_{Y|X} ≡ m_Y + ρ(σ_Y/σ_X)(x − m_X). (A simulation check of this conditional mean and variance is sketched at the end of this segment.)

Integrating the joint over y (as before) ⇒

  f_X(x) = e^{−(x − m_X)²/(2σ_X²)} / √(2πσ_X²).

Similarly, f_Y(y) and f_{X|Y}(x|y) are also Gaussian. Note: X and Y jointly Gaussian ⇒ also both individually and conditionally Gaussian!

Chain rule for pdfs

Assume f_{X_0,X_1,...,X_i}(x_0, x_1, ..., x_i) > 0. Then

  f_{X_0,X_1,...,X_{n−1}}(x_0, x_1, ..., x_{n−1})
    = f_{X_0}(x_0) ∏_{i=1}^{n−1} f_{X_0,X_1,...,X_i}(x_0, x_1, ..., x_i) / f_{X_0,X_1,...,X_{i−1}}(x_0, x_1, ..., x_{i−1})
    = f_{X_0}(x_0) ∏_{i=1}^{n−1} f_{X_i|X_0,...,X_{i−1}}(x_i|x_0, ..., x_{i−1}).
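The simulation check promised in the 2D Gaussian example above (not in the original notes), assuming numpy; zero means, unit sigmas, and the particular x0 are illustrative. With these choices the formulas reduce to E[Y | X = x] = ρx and Var(Y | X = x) = 1 − ρ².

```python
# Check the conditional mean and variance of a jointly Gaussian pair by conditioning
# on X falling in a narrow window around x0.
import numpy as np

rng = np.random.default_rng(7)
rho, n = 0.8, 2_000_000
X = rng.normal(size=n)
Y = rho * X + np.sqrt(1 - rho ** 2) * rng.normal(size=n)   # jointly Gaussian, correlation rho

x0 = 1.0
sel = np.abs(X - x0) < 0.02                    # condition on X being near x0
print("E[Y | X ~ x0]  :", round(Y[sel].mean(), 3), " formula:", rho * x0)
print("Var(Y | X ~ x0):", round(Y[sel].var(), 3), " formula:", 1 - rho ** 2)
```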
Statistical detection and classification

A simple application of conditional probability mass functions describing discrete random vectors.

Transmitted: a discrete random variable X with pmf p_X (e.g., one sample of a binary random process). Received: a random variable Y, described by a conditional pmf (the noisy channel) p_{Y|X}(y|x). A more specific example as a special case is the binary symmetric channel (BSC): X Bernoulli with parameter p = p_X(1), and

  p_{Y|X}(y|x) = 1 − ε if y = x,  ε if y ≠ x.

Given the observation Y, what is the best guess X̂(Y) of the transmitted value? X̂ is called a decision rule or detection rule. Measure its quality by the probability that the guess is correct,

  P_c(X̂) = Pr(X = X̂(Y)) = 1 − P_e,   where P_e(X̂) = Pr(X̂(Y) ≠ X).

A decision rule is optimal if it yields the smallest possible P_e, or equivalently the largest possible P_c. Now,

  Pr(X̂ = X) = 1 − P_e(X̂) = Σ_{(x,y): X̂(y)=x} p_{X,Y}(x, y)
            = Σ_{(x,y): X̂(y)=x} p_{X|Y}(x|y) p_Y(y)
            = Σ_y p_Y(y) Σ_{x: X̂(y)=x} p_{X|Y}(x|y)
            = Σ_y p_Y(y) p_{X|Y}(X̂(y)|y).

To maximize the sum, maximize p_{X|Y}(X̂(y)|y) for each y. This is accomplished by

  X̂(y) ≡ arg max_u p_{X|Y}(u|y),

which yields p_{X|Y}(X̂(y)|y) = max_u p_{X|Y}(u|y). This is the maximum a posteriori (MAP) detection rule.

In the binary example: choose X̂(y) = y if ε < 1/2 and X̂(y) = 1 − y if ε > 1/2 ⇒ the minimum (optimal) error probability over all possible rules is min(ε, 1 − ε).

In the general nonbinary case, statistical detection is statistical classification: the unseen X might be the presence or absence of a disease, the observation Y the results of various tests. General Bayesian classification allows weighting of the cost of different kinds of errors (Bayes risk), minimizing a weighted average (expected cost) instead of only the probability of error.

Additive noise: discrete random variables

A common setup in communications, signal processing, and statistics: an original signal X has random noise W (independent of X) added to it, and we observe Y = X + W. Typically we use the observation Y to make an inference about X. Begin by deriving the conditional distributions. Intuitive!

Discrete case: independent random variables X and W with pmfs p_X and p_W. Form Y = X + W and find p_Y. Use the inverse image formula:

  p_{X,Y}(x, y) = Pr(X = x, Y = y) = Pr(X = x, X + W = y)
              = Σ_{α,β: α=x, α+β=y} p_{X,W}(α, β) = p_{X,W}(x, y − x)
              = p_X(x) p_W(y − x).

Note: the formula only makes sense if y − x is in the range space of W. Thus

  p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x) = p_W(y − x),

and the marginal for Y is

  p_Y(y) = Σ_x p_{X,Y}(x, y) = Σ_x p_X(x) p_W(y − x),

a discrete convolution. (A numerical sketch of this convolution is given at the end of this segment.) The above uses ordinary real arithmetic; similar results hold for other definitions of addition, e.g., modulo 2 arithmetic for binary X, W. As with linear systems, convolutions can usually be evaluated easily in the transform domain — we will do so shortly.

Additive noise: continuous random variables

Now f_{X,W}(x, w) = f_X(x) f_W(w) (independent), Y = X + W. Find f_{Y|X} and f_Y.

Since the random variables are continuous, find the joint pdf by first finding the joint cdf:

  F_{X,Y}(x, y) = Pr(X ≤ x, Y ≤ y) = Pr(X ≤ x, X + W ≤ y)
              = ∫∫_{α,β: α≤x, α+β≤y} f_{X,W}(α, β) dα dβ
              = ∫_{−∞}^{x} dα f_X(α) ∫_{−∞}^{y−α} dβ f_W(β)
              = ∫_{−∞}^{x} dα f_X(α) F_W(y − α).

Taking derivatives:

  f_{X,Y}(x, y) = f_X(x) f_W(y − x)  ⇒  f_{Y|X}(y|x) = f_W(y − x)

  ⇒  f_Y(y) = ∫ f_{X,Y}(x, y) dx = ∫ f_X(x) f_W(y − x) dx,

a convolution integral of the pdfs f_X and f_W. The pdf f_{X|Y} follows from Bayes' rule:

  f_{X|Y}(x|y) = f_X(x) f_W(y − x) / ∫ f_X(α) f_W(y − α) dα.
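The numerical sketch of the discrete convolution promised above (not in the original notes), assuming numpy; the pmf values are hypothetical. It compares numpy's convolution with a brute-force evaluation of p_Y(y) = Σ_x p_X(x) p_W(y − x).

```python
# The pmf of Y = X + W for independent discrete X and W is the discrete convolution p_Y = p_X * p_W.
import numpy as np

p_X = np.array([0.2, 0.5, 0.3])                # pmf of X on {0, 1, 2} (hypothetical)
p_W = np.array([0.6, 0.4])                     # pmf of W on {0, 1} (hypothetical)

p_Y = np.convolve(p_X, p_W)                    # pmf of Y on {0, 1, 2, 3}
print("p_Y =", p_Y, " sums to", p_Y.sum())

# brute-force check of p_Y(y) = sum_x p_X(x) * p_W(y - x)
check = [sum(p_X[x] * p_W[y - x] for x in range(3) if 0 <= y - x < 2) for y in range(4)]
print("check =", np.round(check, 3))
```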
Additive Gaussian noise

Assume f_X = N(0, σ_X²), f_W = N(0, σ_W²), f_{X,W}(x, w) = f_X(x) f_W(w), and Y = X + W. Then

  f_{Y|X}(y|x) = f_W(y − x) = e^{−(y−x)²/(2σ_W²)} / √(2πσ_W²),

which is N(x, σ_W²).

To find f_{X|Y} using Bayes' rule, we need f_Y:

  f_Y(y) = ∫_{−∞}^{∞} f_{Y|X}(y|α) f_X(α) dα
        = ∫_{−∞}^{∞} [ exp(−(y − α)²/(2σ_W²)) / √(2πσ_W²) ] [ exp(−α²/(2σ_X²)) / √(2πσ_X²) ] dα
        = (1/(2πσ_Xσ_W)) ∫_{−∞}^{∞} exp( −(1/2) [ (y² − 2αy + α²)/σ_W² + α²/σ_X² ] ) dα
        = ( exp(−y²/(2σ_W²)) / (2πσ_Xσ_W) ) ∫_{−∞}^{∞} exp( −(1/2) [ α²(1/σ_X² + 1/σ_W²) − 2αy/σ_W² ] ) dα.

We can integrate by completing the square (later we will see an easier way using transforms, but this trick is not difficult). The integrand resembles

  exp( −(1/2) ((α − m)/σ)² ),

which has integral

  ∫_{−∞}^{∞} exp( −(1/2) ((α − m)/σ)² ) dα = √(2πσ²)

(a Gaussian pdf integrates to 1). Compare

  −(1/2) [ α²(1/σ_X² + 1/σ_W²) − 2αy/σ_W² ]   vs.   −(1/2) ((α − m)/σ)² = −(1/2) [ α²/σ² − 2αm/σ² + m²/σ² ].
Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 fY|X (y|x) = 101 � pX (x) F � = pX (x) F � G � G Pr(Y ∈ G) = = � pX (x) pX (x) � G � G fW (y − x) dy. pX|Y (x|y) = fY (y) = pX (x) fY|X (y|x) = EE278: Introduction to Statistical Signal Processing, winter 2010–2011 � fY|X (y|x)pX (x) fY|X (y|x)pX (x) =� , fY (y) α pX (α) fY|X (y|α) but this is not an elementary conditional probability, conditioning event has probability 0! fY|X (y|x) dy Can be justified in similar way to conditional pdfs: fW (y − x) dy. Pr(X ∈ F and Y ∈ G) = Choosing G = (−∞, y] yields cdf FY (y) ⇒ � 102 Continuing analogy Bayes’ rule suggests conditional pmf: fY|X (y|x) dy Choosing F = R yields � c R.M. Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 a convolution, analogous to pure discrete and pure continuous cases Joint distribution combined by a combination of pmf and pdf. Pr(X ∈ F and Y ∈ G) = d d FY|X (y|x) = FW (y − x) = fW (y − x) dy dy = pX (x) fW (y − x), c R.M. Gray 2011 � � �G G 103 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 dy fY (y) Pr(X ∈ F|Y = y) dy fY (y) � pX|Y (x|y) F c R.M. Gray 2011 � 104 Binary detection in Gaussian noise so that pX|Y (x|y) satisfies Pr(X ∈ F|Y = y) = � pX|Y (x|y) F Apply to binary input and Gaussian noise: the conditional pmf of the binary input given the noisy observation is fW (y − x)pX (x) fY (y) fW (y − x)pX (x) = � ; y ∈ R, x ∈ {0, 1}. α pX (α) fW (y − α) The derivation of the MAP detector or classifier extends immediately to a binary input random variable and independent Gaussian noise As in the purely discrete case, MAP detector X̂(y) of X given Y = y is given by pX|Y (x|y) = X̂(y) = argmax pX|Y (x|y) = argmax � x x fW (y − x)pX (x) . α pX (α) fW (y − α) Denominator of the conditional pmf does not depend on x, the denominator has no effect on the maximization Can now solve classical binary detection in Gaussian noise. X̂(y) = argmax pX|Y (x|y) = argmax fW (y − x)pX (x). x c R.M. Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 105 Assume for simplicity that X is equally likely to be 0 or 1: EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 106 Error probability of optimal detector: Pe = Pr(X̂(Y) � X) 1 (x − y)2 X̂(y) = argmax pX|Y (x|y) = argmax � exp − 2 σ2W x x 2 2πσW 1 = Pr(X̂(Y) � 0|X = 0)pX (0) + Pr(X̂(Y) � 1|X = 1)pX (1) = Pr(Y > 0.5|X = 0)pX (0) + Pr(Y < 0.5|X = 1)pX (1) = argmax pX|Y (x|y) = argmin |x − y| x x = Pr(W + X > 0.5|X = 0)pX (0) + Pr(W + X < 0.5|X = 1)pX (1) x = Pr(W > 0.5|X = 0)pX (0) + Pr(W + 1 < 0.5|X = 1)pX (1) Minimum distance or nearest neighbor decision, choose closest x to y A threshold detector 0 y < 0.5 X̂(y) = . 1 y > 0.5 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 = Pr(W > 0.5)pX (0) + Pr(W < −0.5)pX (1) using the independence of W and X . In terms of Φ function: � � � � �� � � 1 0.5 0.5 1 Pe = 1 − Φ +Φ − =Φ − . 2 σW σW 2σW c R.M. Gray 2011 � 107 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 108 Statistical estimation Will later introduce another quality measure (MSE) and optimize. Now mention other approaches. Examples of estimation or regression instead of detection In detection/classification problems, goal is to guess which of a discrete set of possibilities is true. MAP rule is an intuitive solution. Different if (X, Y) continuous, observe Y , and guess X . Examples: X, W independent Gaussian, Y = X + W . 
What is best guess of X given Y ? {Xn} is a continuous alphabet random process (perhaps Gaussian). Observe Xn−1. What is best guess for Xn? What if observe X0, X1, X2, . . . , Xn−1? Quality criteria for discrete case no longer works, Pr(X̂(Y) = Y) = 0 in general. EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 109 MAP Estimation 110 Maximum Likelihood Estimation The maximum likelihood (ML) estimate of X given Y = the value of x that maximizes the conditional pdf fY|X (y|x) (instead of the a posteriori pdf fX|Y (x|y)) Mimic map detection, maximize conditional probability function X̂MAP(y) = argmax x fX|Y (x|y) Easy to describe, application of conditional pdfs + Bayes. X̂ML(y) = argmax fY|X (y|x). But can not argue “optimal” in sense of maximizing quality x Advantage: Do not need to know prior fX and use Bayes to find fX|Y (x|y). Simple Example: Gaussian signal plus noise Found fX|Y (x|y) = Gaussian with mean yσ2X /(σ2X + σ2W ) In the Gaussian case, X̂ML(Y) = y. Gaussian pdf maximized at its mean ⇒ MAP estimate of X given Y = y is the conditional mean yσ2X /(σ2X + σ2W ). EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � Will return to estimation when consider expectations in more detail. 111 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 112 Characteristic functions For discrete rv with pmf pX , define characteristic function MX MX ( ju) = When sum independent random variables, find derived distribution by convolution of pmfs or pdfs Can be complicated, avoidable using transforms as in linear systems Summing independent random variables arises frequently in signal analysis problems. E.g., iid random process {Xk } is put into a linear � filter to produce an output Yn = nk=1 hn−k Xk . What is distribution of Yn? n-fold convolution a mess. Describe shortcut. pX (x)e jux x where u is usually assumed to be real. A discrete exponential transform. Sometimes φ, Φ, j not included. (∼ notational differences in Fourier transforms) Alternative useful form: Recall definition of expectation of a random variable g defined on a discrete probability space described by a pmf � g: E(g) = ω p(ω)g(ω) Consider probability space (ΩX , B(ΩX ), PX ) with PX described by pmf pX Transforms of probability functions called characteristic functions. Variation on Fourier/Laplace transforms. Notation varies. c R.M. Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 � This is directly-given representation for rv X , X is the identity function on ΩX : X(x) = x 113 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 114 c R.M. Gray 2011 � 116 MX ( ju) = F−u/2π(pX ) = Ze ju (pX ) jux Define random � variable g(X) on this space g(X)(x) = e . Then jux E[g(X)] = pX (x)e so that Properties of characteristic functions follow from those of Fourier/Laplace/z/exponential transforms. x MX ( ju) = E[e juX ] Characteristic functions, like probabilities, can be viewed as special cases of expectation Resembles discrete time Fourier transform Fν(pX ) = � pX (x)e− j2πνx x and the z-transform Zz(pX ) = � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 pX (x)z x. x c R.M. 
Gray 2011 � 115 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 Characteristic functions and summing independent rvs Can recover pmf from MX by suitable inversion. E.g., given pX (k); k ∈ ZN , 1 2π � π/2 1 MX ( ju)e−iuk du = 2π −π/2 = � � −π/2 pX (x) x = � π/2 � x 1 2π pX (x)e jux e−iuk du � π/2 Two independent random variables X , W with pmfs pX and pW and characteristic functions MX and MW e ju(x−k) du Y = X+W −π/2 pX (x)δk−x = pX (k). To find characteristic function of Y x MY ( ju) = But usually invert by inspection or from tables, avoid inverse transforms � pY (y)e juy y use the inverse image formula pY (y) = � pX,W (x, w) x,w:x+w=y c R.M. Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 117 to obtain c R.M. Gray 2011 � EE278: Introduction to Statistical Signal Processing, winter 2010–2011 118 Iterate: � � MY ( ju) = � � juy pX,W (x, w) e juy = p (x, w)e X,W y x,w:x+w=y y x,w:x+w=y � � � = ju(x+w) = p (x, w)e pX,W (x, w)e ju(x+w) X,W y x,w:x+w=y x,w Last sum factors: MY ( ju) = � pX (x)pW (w)e juxe juw = x,w = MX ( ju)MW ( ju), � x pX (x)e jux � pW (w)e juw w Theorem 1. If {Xi; i = 1, . . . , N} are independent random variables with characteristic functions MXi , then the characteristic function of �N the random variable Y = i=1 Xi is MY ( ju) = N � MXi ( ju). i=1 If the Xi are independent and identically distributed with common characteristic function MX , then MY ( ju) = MXN ( ju). ⇒ transform of the pmf of the sum of independent random variables is the product of their transforms EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 119 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 120 Example: X Bernoulli with parameter p = pX (1) = 1 − pX (0) MX ( ju) = 1 � e juk k=0 pX (k) = (1 − p) + pe {Xi; i = 1, . . . , n} iid Bernoulli random variables, Yn = ju n MYn ( ju) = [(1 − p) + pe ] with binomial theorem ⇒ ju pYn (k) = �n k=1 Xi, then MX ( ju) = Fν ( f X ) = c R.M. Gray 2011 � 121 � � MX ( ju) = E e juX . EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 122 Paralleling the discrete case, and the Laplace transform L s( fX ) = fX (x)e jux dx. Consider again two independent random variables X and Y with pdfs fX and fW , characteristic functions MX and MW fX (x)e− j2πνx dx � � As in the discrete case, Relates to the continuous-time Fourier transform � � n (1 − p)n−k pk ; k ∈ Zn+1. k For a continous random variable X with pdf fX , define the characteristic function MX of the random variable (or of the pdf) as pYn (k) EE278: Introduction to Statistical Signal Processing, winter 2010–2011 � Same idea works for continuous rvs n � pYn (k)e juk = ((1 − p) + pe ju)n k=0 n � n � � juk n−k k = k (1 − p) p e , k=0 ������������������������������������ MYn ( ju) = Uniqueness of transforms ⇒ MY ( ju) = MX ( ju)MW ( ju). fX (x)e−sx dx Will later see simple and general proof. by MX ( ju) = F−u/2π( fX ) = L− ju( fX ) Hence can apply results from Fourier/Laplace transform theory. E.g., given a well-behaved density fX (x); x ∈ R MX ( ju), can invert transform � ∞ fX (x) = 1 2π −∞ EE278: Introduction to Statistical Signal Processing, winter 2010–2011 MX ( ju)e− jux du. c R.M. Gray 2011 � 123 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 c R.M. Gray 2011 � 124 Summing Independent Gaussian rvs As in the discrete case, iterating gives result for many independent rvs: X ∼ N(m, σ2) If {Xi; i = 1, . . . 
Summing Independent Gaussian rvs

As in the discrete case, iterating gives the result for many independent rvs: if {X_i; i = 1, ..., N} are independent random variables with characteristic functions M_{X_i}, then the characteristic function of the random variable Y = \sum_{i=1}^{N} X_i is M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). If the X_i are independent and identically distributed with common characteristic function M_X, then M_Y(ju) = [M_X(ju)]^N.

X ∼ N(m, σ²). The characteristic function is found by completing the square:

M_X(ju) = E\big(e^{juX}\big) = \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-(x-m)^2/2\sigma^2} e^{jux} \, dx
= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2} \, dx
= \Big( \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-(x-(m+ju\sigma^2))^2/2\sigma^2} \, dx \Big) e^{jum - u^2\sigma^2/2}
= e^{jum - u^2\sigma^2/2}.

Thus N(m, σ²) ↔ e^{jum - u^2\sigma^2/2}.

{X_i; i = 1, ..., n} iid Gaussian random variables with pdfs N(m, σ²), Y_n = \sum_{k=1}^{n} X_k. Then

M_{Y_n}(ju) = \big[e^{jum - u^2\sigma^2/2}\big]^n = e^{ju(nm) - u^2(n\sigma^2)/2}

= characteristic function of N(nm, nσ²).

Moral: use characteristic functions to derive distributions of sums of independent rvs.

Gaussian random vectors

A random vector is Gaussian if its density is Gaussian; the component rvs are jointly Gaussian. The description is complicated, but there are many nice properties. Multidimensional characteristic functions help the derivation.

Random vector X = (X_0, ..., X_{n-1}), vector argument u = (u_0, ..., u_{n-1}).

n-dimensional characteristic function:

M_X(ju) = M_{X_0,...,X_{n-1}}(ju_0, ..., ju_{n-1}) = E\big[e^{ju^t X}\big] = E\Big[\exp\Big(j \sum_{k=0}^{n-1} u_k X_k\Big)\Big]

Can be shown using multivariable calculus: a Gaussian random vector with mean vector m and covariance matrix Λ has characteristic function

M_X(ju) = e^{ju^t m - u^t \Lambda u / 2} = \exp\Big(j \sum_{k=0}^{n-1} u_k m_k - \frac{1}{2} \sum_{k=0}^{n-1} \sum_{m=0}^{n-1} u_k \Lambda(k, m) u_m\Big)

Same basic form as the Gaussian pdf, but depends directly on Λ, not Λ^{-1}. So it exists more generally: only need Λ to be nonnegative definite (instead of strictly positive definite). Define a Gaussian rv more generally as a rv having a characteristic function of this form (if Λ is singular, the inverse transform will have singularities).

Further examples of random processes

Introduce more classes of processes and develop some properties for various examples. In particular: Gaussian random processes and Markov processes.

Gaussian random processes

Have seen two ways to define rps: indirectly, in terms of an underlying probability space, or directly (Kolmogorov representation) by describing a consistent family of joint distributions (via pmfs, pdfs, or cdfs). Used to define discrete time iid processes and processes which can be constructed from iid processes by coding or filtering.

A random process {X_t; t ∈ T} is Gaussian if the random vectors (X_{t_0}, X_{t_1}, ..., X_{t_{k-1}}) are Gaussian for all positive integers k and all possible sample times t_i ∈ T; i = 0, 1, ..., k − 1. Works for continuous and discrete time.

Consistent family? Yes, if all mean vectors and covariance matrices are drawn from a common mean function m(t); t ∈ T and covariance function Λ(t, s); t, s ∈ T; i.e., for any choice of sample times t_0, ..., t_{k-1} ∈ T the random vector (X_{t_0}, X_{t_1}, ..., X_{t_{k-1}}) is Gaussian with mean (m(t_0), m(t_1), ..., m(t_{k-1})) and covariance matrix Λ = {Λ(t_l, t_j); l, j ∈ Z_k}.
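To make the "drawn from a common mean function and covariance function" idea concrete, here is a minimal Python sketch (not from the notes): it builds the Gaussian random vector (X_{t_0}, ..., X_{t_{k-1}}) for an assumed mean function m(t) = 0 and an assumed covariance function Λ(t, s) = σ² min(t, s), then samples it and checks the empirical covariance.

```python
# Sketch (assumed m(t) and Lambda(t,s)): sample a Gaussian random vector built
# from a mean function and covariance function evaluated at chosen sample times.
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 2.0
t = np.arange(1, 6)                               # sample times t_0, ..., t_{k-1}

m   = np.zeros(len(t))                            # mean vector (m(t_0), ..., m(t_{k-1}))
Lam = sigma2 * np.minimum.outer(t, t)             # covariance matrix {Lambda(t_l, t_j)}

samples = rng.multivariate_normal(m, Lam, size=100_000)
print(np.round(np.cov(samples, rowvar=False), 2)) # empirical covariance ~ Lam
```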
Gaussian random processes in both discrete and continuous time are extremely common in the analysis of random systems and have many nice properties.

Discrete time Markov processes

An iid process is memoryless because the present is independent of the past. A Markov process allows dependence on the past in a structured way. Introduce via example.

A binary Markov process

{X_n; n = 0, 1, ...} is a Bernoulli process with

p_{X_n}(x) = p for x = 1, and 1 − p for x = 0,

where p ∈ (0, 1) is a fixed parameter. Since the pmf p_{X_n}(x) does not depend on n, abbreviate it to p_X:

p_X(x) = p^x (1-p)^{1-x}; \quad x = 0, 1.

Since the process is iid,

p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)} (1-p)^{n - w(x^n)},

where w(x^n) = Hamming weight of the binary vector x^n.

Let {X_n} be the input to a device which produces an output binary process {Y_n} defined by

Y_n = Y_0 for n = 0, and Y_n = X_n ⊕ Y_{n-1} for n = 1, 2, ...,

where Y_0 is a binary equiprobable random variable (p_{Y_0}(0) = p_{Y_0}(1) = 0.5), independent of all of the X_n, and ⊕ is mod 2 addition (a linear filter using mod 2 arithmetic).

Alternatively:

Y_n = 1 if X_n ≠ Y_{n-1}, and 0 if X_n = Y_{n-1}.

This process is called a binary autoregressive process. As will be seen, it is also called the symmetric binary Markov process.

Unlike X_n, Y_n depends strongly on past values. If p < 1/2, Y_n is more likely to equal Y_{n-1} than not; if p is small, Y_n is likely to have long runs of 0s and 1s.

Task: find the joint pmfs for the new process, p_{Y^n}(y^n) = Pr(Y^n = y^n).

Use the inverse image formula:

p_{Y^n}(y^n) = Pr(Y^n = y^n) = Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, ..., Y_{n-1} = y_{n-1})
= Pr(Y_0 = y_0, X_1 ⊕ Y_0 = y_1, X_2 ⊕ Y_1 = y_2, ..., X_{n-1} ⊕ Y_{n-2} = y_{n-1})
= Pr(Y_0 = y_0, X_1 ⊕ y_0 = y_1, X_2 ⊕ y_1 = y_2, ..., X_{n-1} ⊕ y_{n-2} = y_{n-1})
= Pr(Y_0 = y_0, X_1 = y_1 ⊕ y_0, X_2 = y_2 ⊕ y_1, ..., X_{n-1} = y_{n-1} ⊕ y_{n-2})
= p_{Y_0, X_1, X_2, ..., X_{n-1}}(y_0, y_1 ⊕ y_0, y_2 ⊕ y_1, ..., y_{n-1} ⊕ y_{n-2})
= p_{Y_0}(y_0) \prod_{i=1}^{n-1} p_X(y_i ⊕ y_{i-1}).

Used the facts that (1) a ⊕ b = c iff a = b ⊕ c, (2) Y_0, X_1, X_2, ..., X_{n-1} are mutually independent, and (3) the X_n are iid.

Plug in the specific forms of p_{Y_0} and p_X ⇒

p_{Y^n}(y^n) = \frac{1}{2} \prod_{i=1}^{n-1} p^{y_i ⊕ y_{i-1}} (1-p)^{1 - y_i ⊕ y_{i-1}}.

Marginal pmfs for Y_n are evaluated by summing out the joints (total probability), e.g.,

p_{Y_1}(y_1) = \sum_{y_0} p_{Y_0, Y_1}(y_0, y_1) = \sum_{y_0} \frac{1}{2} p^{y_1 ⊕ y_0} (1-p)^{1 - y_1 ⊕ y_0} = \frac{1}{2}; \quad y_1 = 0, 1.

In a similar fashion it can be shown that the marginals for Y_n are all the same:

p_{Y_n}(y) = \frac{1}{2}; \quad y = 0, 1; \quad n = 0, 1, 2, ...

Note: this would not be the case with a different initialization, e.g., Y_0 = 1. Hence drop the subscript and abbreviate the pmf to p_Y.

Unlike the iid {X_n} process, p_{Y^n}(y^n) ≠ \prod_i p_Y(y_i) (provided p ≠ 1/2): {Y_n} is not iid. The joint is not the product of the marginals, but the chain rule with conditional probabilities writes it as a product of conditional pmfs, given by

p_{Y_l | Y_0, Y_1, ..., Y_{l-1}}(y_l | y_0, y_1, ..., y_{l-1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l ⊕ y_{l-1}).
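A short simulation (a Python sketch, not from the notes; p and the run length are arbitrary) illustrates what the joint pmf above predicts: the marginal of Y_n stays at 1/2 while the output flips with probability p, so small p produces long runs.

```python
# Sketch: simulate Y_n = X_n xor Y_{n-1} and check the marginal and flip probability.
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.1, 200_000

x = (rng.random(n) < p).astype(int)     # iid Bernoulli(p) inputs X_1, ..., X_n
y = np.empty(n + 1, dtype=int)
y[0] = rng.integers(0, 2)               # Y_0 equiprobable, independent of the X_n
for k in range(1, n + 1):
    y[k] = x[k - 1] ^ y[k - 1]          # Y_n = X_n xor Y_{n-1}

print(y.mean())                         # ~0.5  (marginal pmf p_Y(1))
print(np.mean(y[1:] != y[:-1]))         # ~p    (probability the output flips)
```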
Note: the conditional probability of the current output Y_l given the entire past Y_i; i = 0, 1, ..., l − 1 depends only on the most recent past output Y_{l-1}!

This property can be summarized nicely by also deriving the conditional pmf

p_{Y_l | Y_{l-1}}(y_l | y_{l-1}) = \frac{p_{Y_{l-1}, Y_l}(y_{l-1}, y_l)}{p_{Y_{l-1}}(y_{l-1})} = p^{y_l ⊕ y_{l-1}} (1-p)^{1 - y_l ⊕ y_{l-1}}

⇒ p_{Y_l | Y_0, Y_1, ..., Y_{l-1}}(y_l | y_0, y_1, ..., y_{l-1}) = p_{Y_l | Y_{l-1}}(y_l | y_{l-1}).

A discrete time random process with this property is called a Markov process or Markov chain. The binary autoregressive process is a Markov process!

The binomial counting process

Next filter a binary Bernoulli process using ordinary arithmetic. {X_n} is an iid binary random process with marginal pmf p_X(1) = p = 1 − p_X(0).

Y_n = 0 for n = 0, and Y_n = \sum_{k=1}^{n} X_k = Y_{n-1} + X_n for n = 1, 2, ...

Y_n = the output of a discrete time, time-invariant linear filter with Kronecker delta response h_k given by h_k = 1 for k ≥ 0 and h_k = 0 otherwise.

By definition,

Y_n = Y_{n-1} or Y_n = Y_{n-1} + 1; \quad n = 1, 2, ...

A discrete time process with this property is called a counting process. Will later see a continuous time counting process which also can only increase by 1.

Already found the marginal pmf p_{Y_n}(k) using transforms to be binomial ⇒ binomial counting process.

To completely describe this process we need a formula for the joint pmfs. Find the conditional pmfs, which imply the joints via the chain rule:

p_{Y_1,...,Y_n}(y_1, ..., y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l | Y_1,...,Y_{l-1}}(y_l | y_1, ..., y_{l-1}).

Now,

p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = Pr(Y_n = y_n | Y_l = y_l; l = 1, ..., n-1)
= Pr(X_n = y_n − y_{n-1} | Y_l = y_l; l = 1, ..., n-1)
= Pr(X_n = y_n − y_{n-1} | X_1 = y_1, X_i = y_i − y_{i-1}; i = 2, 3, ..., n-1).

This follows since the conditioning event {Y_i = y_i; i = 1, 2, ..., n-1} is the event {X_1 = y_1, X_i = y_i − y_{i-1}; i = 2, 3, ..., n-1} and, given this event, the event Y_n = y_n is the event X_n = y_n − y_{n-1}. Thus

p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = p_{X_n | X_{n-1},...,X_2,X_1}(y_n − y_{n-1} | y_{n-1} − y_{n-2}, ..., y_2 − y_1, y_1).

X_n iid ⇒

p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = p_X(y_n − y_{n-1}).

Hence the chain rule + the definition y_0 = 0 ⇒

p_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{i=1}^{n} p_X(y_i − y_{i-1}).

For the binomial counting process, use the Bernoulli p_X:

p_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{i=1}^{n} p^{(y_i − y_{i-1})} (1-p)^{1 - (y_i − y_{i-1})},

where y_i − y_{i-1} = 0 or 1, i = 1, 2, ..., n; y_0 = 0.

A similar derivation ⇒

p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = Pr(Y_n = y_n | Y_{n-1} = y_{n-1}) = Pr(X_n = y_n − y_{n-1} | Y_{n-1} = y_{n-1}).

The conditioning event depends only on values of X_k for k < n, hence

p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = p_X(y_n − y_{n-1}) ⇒ {Y_n} is Markov.

A similar derivation works for a sum of iid rvs with any pmf p_X to show that

p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = p_{Y_n | Y_{n-1}}(y_n | y_{n-1})

or, equivalently,

Pr(Y_n = y_n | Y_i = y_i; i = 1, ..., n-1) = Pr(Y_n = y_n | Y_{n-1} = y_{n-1}) ⇒ Markov.
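A quick check by simulation (a Python sketch, not from the notes; p, n, and the number of trials are arbitrary): generate the binomial counting process Y_n = Y_{n-1} + X_n and compare the empirical pmf of Y_n at a fixed n with the binomial pmf found earlier via transforms.

```python
# Sketch: empirical pmf of Y_n for the binomial counting process vs. the binomial pmf.
import numpy as np
from math import comb

rng = np.random.default_rng(3)
p, n, trials = 0.25, 10, 200_000

x = (rng.random((trials, n)) < p).astype(int)       # iid Bernoulli(p) inputs
y_n = x.sum(axis=1)                                 # Y_n for each trial (Y_0 = 0)

emp   = np.bincount(y_n, minlength=n + 1) / trials
binom = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
print(np.round(emp, 3))
print(np.round(binom, 3))                           # should be close
```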
Discrete random walk

Slight variation: let X_n be binary iid with alphabet {1, −1} and Pr(X_n = −1) = p.

Y_n = 0 for n = 0, and Y_n = \sum_{k=1}^{n} X_k for n = 1, 2, ...

Also has an autoregressive format: Y_n = Y_{n-1} + X_n, n = 1, 2, ...

The transform of the iid random variables is

M_X(ju) = (1-p) e^{ju} + p e^{-ju},

so

M_{Y_n}(ju) = \big((1-p) e^{ju} + p e^{-ju}\big)^n = \sum_{k=0}^{n} \binom{n}{k} (1-p)^{n-k} p^k e^{ju(n-2k)}
= \sum_{k = -n, -n+2, ..., n-2, n} \binom{n}{(n-k)/2} (1-p)^{(n+k)/2} p^{(n-k)/2} e^{juk}.

(The first equality uses the binomial theorem; the second reindexes the sum by the value k = n − 2·(number of −1 steps).)

⇒

p_{Y_n}(k) = \binom{n}{(n-k)/2} (1-p)^{(n+k)/2} p^{(n-k)/2}, \quad k = -n, -n+2, ..., n-2, n.

Note that Y_n must be even or odd depending on whether n is even or odd. This follows from the nature of the increments.

The discrete time Wiener process

{X_n} iid N(0, σ²). As with the counting process, define

Y_n = 0 for n = 0, and Y_n = \sum_{k=1}^{n} X_k for n = 1, 2, ...:

the discrete time Wiener process.

Handle in essentially the same way, but use cdfs and then pdfs. Previously found the marginal f_{Y_n} using transforms to be N(0, nσ²).

To find the joint pdfs, use conditional pdfs and the chain rule:

f_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{l=1}^{n} f_{Y_l | Y_1,...,Y_{l-1}}(y_l | y_1, ..., y_{l-1}).

To find the conditional pdf f_{Y_n | Y_1,...,Y_{n-1}}(y_n | y_1, ..., y_{n-1}), first find the conditional cdf P(Y_n ≤ y_n | Y_{n-i} = y_{n-i}; i = 1, 2, ..., n-1). Analogous to the discrete case:

P(Y_n ≤ y_n | Y_{n-i} = y_{n-i}; i = 1, 2, ..., n-1) = P(X_n ≤ y_n − y_{n-1} | Y_{n-i} = y_{n-i}; i = 1, 2, ..., n-1)
= P(X_n ≤ y_n − y_{n-1}) = F_X(y_n − y_{n-1}).

Differentiating the conditional cdf to obtain the conditional pdf ⇒

f_{Y_n | Y_1,...,Y_{n-1}}(y_n | y_1, ..., y_{n-1}) = \frac{\partial}{\partial y_n} F_X(y_n − y_{n-1}) = f_X(y_n − y_{n-1}),

and hence the pdf chain rule ⇒

f_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{i=1}^{n} f_X(y_i − y_{i-1}).

If f_X = N(0, σ²),

f_{Y^n}(y^n) = \frac{\exp\big(-y_1^2/2\sigma^2\big)}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{\exp\big(-(y_i - y_{i-1})^2/2\sigma^2\big)}{\sqrt{2\pi\sigma^2}}
= (2\pi\sigma^2)^{-n/2} \exp\Big(-\frac{1}{2\sigma^2}\Big(\sum_{i=2}^{n} (y_i - y_{i-1})^2 + y_1^2\Big)\Big).

This is a joint Gaussian pdf with mean vector 0 and covariance matrix {σ² min(m, n); m, n = 1, 2, ...}.

A similar argument implies that

f_{Y_n | Y_1,...,Y_{n-1}}(y_n | y_1, ..., y_{n-1}) = f_{Y_n | Y_{n-1}}(y_n | y_{n-1}),

with

f_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = f_X(y_n − y_{n-1}).

As in the discrete alphabet case, a process with this property is called a Markov process.

Combine the discrete alphabet and continuous alphabet definitions into a common definition: a discrete time random process {Y_n} is said to be a Markov process if the conditional cdfs satisfy the relation

Pr(Y_n ≤ y_n | Y_{n-i} = y_{n-i}; i = 1, 2, ...) = Pr(Y_n ≤ y_n | Y_{n-1} = y_{n-1})

for all y_{n-1}, y_{n-2}, ...

More specifically, such a {Y_n} is frequently called a first-order Markov process because it depends only on the most recent past value. An extended definition to nth-order Markov processes can be made in the obvious fashion.
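The covariance structure of the discrete time Wiener process can be confirmed by simulation (a Python sketch, not from the notes; σ², n, and the number of trials are arbitrary): generate many paths of partial sums of iid N(0, σ²) variables and compare the empirical covariance of (Y_m, Y_n) with σ² min(m, n).

```python
# Sketch: empirical covariance of the discrete time Wiener process vs. sigma^2 * min(m, n).
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, trials = 1.5, 6, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))  # iid N(0, sigma^2) increments
y = np.cumsum(x, axis=1)                                # paths (Y_1, ..., Y_n)

emp = np.cov(y, rowvar=False)                           # empirical covariance matrix
theory = sigma2 * np.minimum.outer(np.arange(1, n + 1), np.arange(1, n + 1))
print(np.round(emp, 2))
print(theory)                                           # sigma^2 * min(m, n)
```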