Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Introduction to Machine Learning CSE 474/574: Introduction to Probabilistic Methods Varun Chandola <chandola@buffalo.edu> Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Frequentist and Bayesian Interpretations Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Frequentist and Bayesian Interpretations What is Probability? [3, 1] Probability that a coin will land heads is 50%1 What does this mean? 1 Dr. Persi Diaconis showed that a coin is 51% likely to land facing the same way up as it is started. Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Varun Chandola Frequentist and Bayesian Interpretations Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Frequentist and Bayesian Interpretations Frequentist Interpretation Number of times an event will be observed in n trials What if the event can only occur once? My winning the next month’s powerball. Polar ice caps melting by year 2020. Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Frequentist and Bayesian Interpretations Frequentist Interpretation Number of times an event will be observed in n trials What if the event can only occur once? My winning the next month’s powerball. Polar ice caps melting by year 2020. Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Frequentist and Bayesian Interpretations Bayesian Interpretation Uncertainty of the event Use for making decisions Should I put in an offer for a sports car? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Notation Basic Rules Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Notation Basic Rules What is a Random Variable (X )? Can take any value from X Discrete Random Variable - X is finite/countably finite Continuous Random Variable - X is infinite P(X = x) or P(x) is the probability of X taking value x an event What is a distribution? Examples 1 2 Coin toss (X = {heads, tails}) Six sided dice (X = {1, 2, 3, 4, 5, 6}) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Notation Basic Rules Notation, Notation, Notation X - random variable (X if multivariate) x - a specific value taken by the random variable ((x if multivariate)) P(X = x) or P(x) is the probability of the event X = x p(x) is either the probability mass function (discrete) or probability density function (continuous) for the random variable X at x Probability mass (or density) at x Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Notation Basic Rules Basic Rules - Quick Review For two events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B) Joint Probability P(A, B) = P(A ∧ B) = P(A|B)P(B) Also known as the product rule Conditional Probability P(A|B) = P(A,B) P(B) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Notation Basic Rules Chain Rule of Probability Given D random variables, {X1 , X2 , . . . , XD } P(X1 , X2 , . . . , XD ) = P(X1 )P(X2 |X1 )P(X3 |X1 , X2 ) . . . P(XD |X1 , X2 , . . . , XD ) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Notation Basic Rules Marginal Distribution Given P(A, B) what is P(A)? Sum P(A, B) over all values for B X X P(A) = P(A, B) = P(A|B = b)P(B = b) b b Sum rule Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence Bayes Rule or Bayes Theorem Computing P(X = x|Y = y ): Bayes Theorem P(X = x|Y = y ) = = P(X = x, Y = y ) P(Y = y ) P(X = x)P(Y = y |X = x) P 0 0 x 0 P(X = x )P(Y = y |X = x ) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence Example Medical Diagnosis Random event 1: A test is positive or negative (X ) Random event 2: A person has cancer (Y ) – yes or no What we know: 1 2 3 4 5 Test has accuracy of 80% Number of times the test is positive when the person has cancer P(X = 1|Y = 1) = 0.8 Prior probability of having cancer is 0.4% P(Y = 1) = 0.004 Question? If I test positive, does it mean that I have 80% rate of cancer? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence Base Rate Fallacy Ignored the prior information What we need is: P(Y = 1|X = 1) =? More information: False positive (alarm) rate for the test P(X = 1|Y = 0) = 0.1 P(Y = 1|X = 1) = P(X = 1|Y = 1)P(Y = 1) P(X = 1|Y = 1)P(Y = 1) + P(X = 1|Y = 0)P(Y = 0) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence Classification Using Bayes Rules Given input example X, find the true class P(Y = c|X) Y is the random variable denoting the true class Assuming the class-conditional probability is known P(X|Y = c) Applying Bayes Rule P(Y = c)P(X|Y = c) 0 0 c P(Y = c ))P(X|Y = c ) P(Y = c|X) = P Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence Independence One random variable does not depend on another A ⊥ B ⇐⇒ P(A, B) = P(A)P(B) Joint written as a product of marginals Conditional Independence A ⊥ B|C ⇐⇒ P(A, B|C ) = P(A|C )P(B|C ) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence More About Conditional Independence Alice and Bob live in the same town but far away from each other Alice drives to work and Bob takes the bus Event A - Alice comes late to work Event B - Bob comes late to work Event C - A snow storm has hit the town P(A|C ) - Probability that Alice comes late to work given there is a snowstorm Now if I know that Bob has also come late to work, will it change the probability that Alice comes late to work? What if I do not observe C ? Will B have any impact on probability of A happening? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence More About Conditional Independence Alice and Bob live in the same town but far away from each other Alice drives to work and Bob takes the bus Event A - Alice comes late to work Event B - Bob comes late to work Event C - A snow storm has hit the town P(A|C ) - Probability that Alice comes late to work given there is a snowstorm Now if I know that Bob has also come late to work, will it change the probability that Alice comes late to work? What if I do not observe C ? Will B have any impact on probability of A happening? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Bayes Theorem: Example Classification Using Bayes Rule Independence and Conditional Independence More About Conditional Independence More About Conditional Independence Alice and Bob live in the same town but far away from each other Alice drives to work and Bob takes the bus Event A - Alice comes late to work Event B - Bob comes late to work Event C - A snow storm has hit the town P(A|C ) - Probability that Alice comes late to work given there is a snowstorm Now if I know that Bob has also come late to work, will it change the probability that Alice comes late to work? What if I do not observe C ? Will B have any impact on probability of A happening? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Continuous Random Variables X is continuous Can take any value How does one define probability? P x P(X = x) = 1 Probability that X lies in an interval [a, b]? P(a < X ≤ b) = P(x ≤ b) − P(x ≤ a) F (q) = P(x ≤ q) is the cumulative distribution function P(a < X ≤ b) = F (b) − F (a) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Continuous Random Variables X is continuous Can take any value How does one define probability? P x P(X = x) = 1 Probability that X lies in an interval [a, b]? P(a < X ≤ b) = P(x ≤ b) − P(x ≤ a) F (q) = P(x ≤ q) is the cumulative distribution function P(a < X ≤ b) = F (b) − F (a) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Probability Density Probability Density Function p(x) = ∂ F (x) ∂x Z P(a < X ≤ b) = b p(x)dx a Can p(x) be greater than 1? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Some Useful Statistics Quantiles F −1 (α): Value of xα such that F (xα ) = α α quantile of F What is F −1 (0.5)? What are quartiles? Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Expectation Expected value of a random variable E(X ) What is most likely to happen in terms of X ? For discrete variables X E[X ] , xp(x) x∈X For continuous variables Z E[X ] , xp(x)dx X Mean of X (µ) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Useful Statistics Expectation Variance Spread of the distribution var [X ] , E((X − µ)2 ) = E(X 2 ) − µ2 Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Discrete Distributions Continuous Distributions Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Discrete Distributions Continuous Distributions What is a Probability Distribution? Continuous Discrete Binomial,Bernoulli Gaussian (Normal) Degenerate pdf Multinomial, Multinolli Poisson Laplace Empirical Beta Gamma Pareto Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Discrete Distributions Continuous Distributions Binomial Distribution X = Number of heads observed in n coin tosses Parameters: n, θ X ∼ Bin(n, θ) Probability mass function (pmf) n k Bin(k|n, θ) , θ (1 − θ)n−k k Bernoulli Distribution Binomial distribution with n = 1 Only one parameter (θ) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Discrete Distributions Continuous Distributions Multinomial Distribution Simulates a K sided die Random variable x = (x1 , x2 , . . . , xK ) Parameters: n, θ θ ← <K θj - probability that j th side shows up Y K n x Mu(x|n, θ) , θj j x1 , x2 , . . . , xK j=1 Multinoulli Distribution Multinomial distribution with n = 1 x is a vector of 0s and 1s with only one bit set to 1 Only one parameter (θ) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Discrete Distributions Continuous Distributions Gaussian (Normal) Distribution N (x|µ, σ 2 ) , √ 1 1 2πσ 2 e − 2σ2 (x−µ) 2 0.4 0.3 0.2 Parameters: 1 2 0.1 µ = E[X ] σ 2 = var [X ] = E[(X − µ)2 ] 0 −4 −2 0 2 4 6 −4 −2 0 2 4 6 X ∼ N (µ, σ 2 ) ⇔ p(X = x) = N (µ, σ 2 ) X ∼ N (0, 1) ⇐ X is a standard normal random variable Cumulative distribution function: Z x 2 Φ(x; µ, σ ) , N (z|µ, σ 2 )dz 1 0.8 0.6 0.4 0.2 −∞ Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction References E. Jaynes and G. Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press Cambridge:, 2003. K. B. Petersen and M. S. Pedersen. The matrix cookbook, nov 2012. Version 20121115. L. Wasserman. All of Statistics: A Concise Course in Statistical Inference (Springer Texts in Statistics). Springer, Oct. 2004. Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Joint Probability Distributions Covariance and Correlation Multivariate Gaussian Distribution Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Joint Probability Distributions Covariance and Correlation Multivariate Gaussian Distribution Joint Probability Distributions Multiple related random variables p(x1 , x2 , . . . , xD ) for D > 1 variables (X1 , X2 , . . . , XD ) Discrete random variables? Continuous random variables? What do we measure? Covariance How does X vary with respect to Y For linear relationship: cov [X , Y ] , E[(X − E[X ])(Y − E[Y ])] = E[XY ] − E[X ]E[Y ] Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Joint Probability Distributions Covariance and Correlation Multivariate Gaussian Distribution Covariance and Correlation x is a d-dimensional random vector cov [X] , E[(X − E[X])(X − E[X])> ] var [X1 ] cov [X2 , X1 ] = .. . cov [X1 , X2 ] var [X2 ] .. . ··· ··· .. . cov [Xd , X1 ] cov [Xd , X2 ] ··· cov [X1 , Xd ] cov [X2 , Xd ] .. . var [Xd ] Covariances can be between 0 and ∞ Normalized covariance ⇒ Correlation Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Joint Probability Distributions Covariance and Correlation Multivariate Gaussian Distribution Correlation Pearson Correlation Coefficient corr [X , Y ] , p cov [X , Y ] var [X ]var [Y ] What is corr [X , X ]? −1 ≤ corr [X , Y ] ≤ 1 When is corr [X , Y ] = 1? Y = aX + b Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Joint Probability Distributions Covariance and Correlation Multivariate Gaussian Distribution Correlation Pearson Correlation Coefficient corr [X , Y ] , p cov [X , Y ] var [X ]var [Y ] What is corr [X , X ]? −1 ≤ corr [X , Y ] ≤ 1 When is corr [X , Y ] = 1? Y = aX + b Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Joint Probability Distributions Covariance and Correlation Multivariate Gaussian Distribution Multivariate Gaussian Distribution Most widely used joint probability distribution 1 1 > −1 (x − µ) Σ (x − µ) N (X|µ, Σ) , exp − 2 (2π)D/2 |Σ|D/2 Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Linear Transformations General Transformations Approximate Methods Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Linear Transformations General Transformations Approximate Methods Linear Transformations of Random Variables What is the distribution of f (X) (X ∼ p())? Linear transformation: Y = a> X + b Y = AX + b E[Y ]? E[Y]? var [Y ]? cov [Y]? The Matrix Cookbook [2] http://orion.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf Available on Piazza Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Linear Transformations General Transformations Approximate Methods General Transformations f () is not linear Example: X - discrete Y = f (X ) = 1 0 if X is even if X is odd What is pmf for Y ? pY (y ) = X pX (x) x:f (x)=y Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Linear Transformations General Transformations Approximate Methods General Transformations for Continuous Variables For continuous variables, work with cdf FY (y ) , P(Y ≤ y ) = P(f (X ) ≤ y ) = P(X ≤ f −1 (y )) = FX (f −1 (y )) For pdf pY (y ) , d d dx d dx FY (y ) = FX (f −1 (y )) = FX (x) = pX (x) dy dy dy dx dy x = f −1 (y ) Example Let X be Uniform(−1, 1) Let Y = X 2 1 pY (y ) = 12 y − 2 Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction Linear Transformations General Transformations Approximate Methods Monte Carlo Approximation Generate N samples from distribution for X For each sample, xi , i ∈ [1, N], compute f (xi ) Use empirical distribution as approximate true distribution Approximate Expectation Z E[f (X )] = f (x)p(x)dx ≈ N 1 X f (xi ) N i=1 Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction KL Divergence Mutual Information Outline 1 Introduction to Probability 2 Random Variables 3 Bayes Rule More About Conditional Independence 4 Continuous Random Variables 5 Different Types of Distributions 6 Handling Multivariate Distributions 7 Transformations of Random Variables 8 Information Theory - Introduction Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction KL Divergence Mutual Information Introduction to Information Theory Quantifying uncertainty of a random variable 1 0.8 Entropy 0.6 H(X ) or H(p) H(X ) , − K X 0.4 0.2 p(X = k) log2 p(X = k) 0 0 0.2 0.4 k=1 Variable with maximum entropy? Lowest entropy? Varun Chandola Introduction to Machine Learning 0.6 0.8 1 Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction KL Divergence Mutual Information Comparing Two Distributions Kullback-Leibler Divergence (or KL Divergence or relative entropy) KL(p||q) , K X p(k) log k=1 = X pk qk p(k) log p(k) − k = X p(k) log q(k) k −H(p) + H(p, q) H(p, q) is the cross-entropy Is KL-divergence symmetric? Important fact: H(p, q) ≥ H(p) Varun Chandola Introduction to Machine Learning Introduction to Probability Random Variables Bayes Rule Continuous Random Variables Different Types of Distributions References Handling Multivariate Distributions Transformations of Random Variables Information Theory - Introduction KL Divergence Mutual Information Mutual Information What does learning about one variable X tell us about another, Y ? Correlation? Mutual Information I(X ; Y ) = XX x p(x, y ) log y p(x, y ) p(x)p(y ) I(X ; Y ) = I(Y ; X ) I(X ; Y ) ≥ 0, equality iff X ⊥ Y Varun Chandola Introduction to Machine Learning