Calculus I for Machine Learning
Some Applications of Concepts of Sequence and Series
Mohammed Nasser, Professor, Dept. of Statistics, RU, Bangladesh
Email: mnasser.ru@gmail.com

P.C. Mahalanobis (1893-1972), the pioneer of statistics in Asia:
"A good mathematician may not be a good statistician, but a good statistician must be a good mathematician."

Andrey Nikolaevich Kolmogorov (Russian, 25 April 1903 - 20 October 1987)
In 1933, Kolmogorov published the book Foundations of the Theory of Probability, laying the modern axiomatic foundations of probability theory.

Statistics + Machine Learning
Vladimir Vapnik; Jerome H. Friedman

Learning and Inference
The inductive inference process:
• Observe a phenomenon
• Construct a model of the phenomenon
• Make predictions
→ This is more or less the definition of the natural sciences!
→ The goal of Machine Learning is to automate this process.
→ The goal of Learning Theory is to formalize it.

What is Learning?
• "The action of receiving instruction or acquiring knowledge"
• "A process which leads to the modification of behaviour or the acquisition of new abilities or responses, and which is additional to natural development by growth or maturation"

Machine Learning
• Negnevitsky: "In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy" (2005:165)
• Callan: "A machine or software tool would not be viewed as intelligent if it could not adapt to changes in its environment" (2003:225)
• Luger: "Intelligent agents must be able to change through the course of their interactions with the world" (2002:351)

The Sub-Fields of ML
• Supervised Learning: classification, regression
• Unsupervised Learning: clustering, density estimation
• Reinforcement Learning

Classical Problem
What is the weight of an elephant? What is the weight of, or the distance to, the sun?
What is the weight and size of a baby in the womb? What is the weight of a DNA molecule?

Solution of the Classical Problem
Let us suppose that somehow we have measurements $x_1, x_2, \ldots, x_n$.
The million-dollar question: how can we choose the optimal way, among the infinitely many possible ones, of combining these $n$ observations to estimate the target $\mu$? And what is the optimal $n$?

We need the following concepts:
• $X_i$: the $i$-th observation
• $\theta$ (here $\mu$): the target that we want to estimate
• probability distributions and probability measures
• $X_i \overset{\text{iid}}{\sim} F(x \mid \theta)$

Our Targets
We want to choose $T$ such that $T(X_1, \ldots, X_n)$ is always very near to $\mu$. How do we quantify the problem? Let us elaborate this issue through examples.

Inference with a Single Observation
Population (parameter $\mu$) → sampling → observation $X_i$ → inference
• Each observation $X_i$ in a random sample is a representative of the unobserved variables in the population.
• Each observation is an estimator of $\mu$, but its variance is as large as the population variance.

Normal Distribution
• In this problem the normal distribution is the most popular model for the overall population.
• With it we can calculate the probability of getting observations greater than or less than any value.
• Usually we do not have a single observation, but instead the mean of a set of observations.

Inference with Sample Mean
Population (parameter $\mu$) → sampling → sample → estimation → statistic $\bar{x}$ → inference
• The sample mean is our estimate of the population mean.
• How much would the sample mean change if we took a different sample?
• Key to this question: the sampling distribution of $\bar{x}$ — simulated in the sketch below.
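To make the idea concrete before formalizing it: draw many samples of size n, compute $\bar{x}$ for each, and look at the distribution of those means. A minimal R sketch follows; the normal population with mu = 179 and sigma^2 = 37 is an assumption chosen only to roughly match the head-length data examined later, not a prescription from the slides.

set.seed(1)
B <- 10000                    # number of replications
n <- 10                       # sample size
mu <- 179                     # assumed population mean (illustration only)
sigma <- sqrt(37)             # assumed population s.d. (illustration only)

# One sample mean per replication
xbar <- replicate(B, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)                    # close to mu: the sample mean is unbiased
var(xbar)                     # close to sigma^2 / n
hist(xbar, breaks = 50,
     main = "Sampling distribution of the sample mean (n = 10)")

Increasing n shrinks the variance of the simulated means by the factor 1/n, which is exactly the comparison drawn on the following slides.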
Sampling Distribution of Sample Mean
• The distribution of values taken by the statistic in all possible samples of size $n$ from the same population.
• Model assumption: our observations $x_i$ are sampled from a population with mean $\mu$ and variance $\sigma^2$.
From the population with unknown parameter $\mu$, draw sample 1 of size $n$, sample 2 of size $n$, sample 3 of size $n$, and so on, and compute $\bar{x}$ for each sample. What is the distribution of these values?

Points to Be Remembered
• If the population is finite, the number of possible sample means is finite.
• If the population is countably infinite, the number of possible sample means is countably infinite.
• If the population is uncountably infinite, the number of possible sample means is uncountably infinite.

Meaning of Sampling Distribution
(Figure: simulated sampling distributions; replications B = 10000.)

(Figure: comparing the sampling distribution of the sample mean when n = 1 vs. n = 10.)

Examination on a Real Data Set
We also consider, as our population, a real set of health data on 1491 Japanese adult male students from various districts of Japan: four head measurements (head length, head breadth, head height and head circumference) and two physical measurements (stature and weight). The data were taken by one observer, Fumio Ohtsuki (Hossain et al. 2005), using the technique of Martin and Saller (1957).

(Figure: histogram and density of head length, truncated at the left.)

Basic Information about Two Populations

  Type      | Mean   | Variance | b1 (skewness) | b2 (kurtosis) | Size
  Original  | 178.99 | 37.13    | .08           | 2.98          | 1491
  Truncated | 181.85 | 19.63    | .80           | 3.45          | 1063

Sampling Distributions
(Figures: $\bar{X}_n$ and $\sqrt{n}\,(\bar{X}_n - \mu)$ for n = 10, 20, 100 and 500; replications = 10000.)

Boxplots of Means for Original Population
(Figures: boxplots of $\bar{X}_n$ and $\sqrt{n}\,(\bar{X}_n - \mu)$; replications = 10000.)

Descriptive Statistics of the Sampling Distribution of Means for the Original Population
(rows correspond to n = 10, 20, 100, 500; varasim = n x varsim, to be compared with the population variance 37.13)

  n   | biassim | varsim | varasim
  10  |  0.0221 | 3.5084 | 35.0836
  20  | -0.0230 | 1.8560 | 37.1210
  100 |  0.0022 | 0.3634 | 36.3167
  500 |  0.0041 | 0.0715 | 35.7484

(Figure: density of means for the original population.)

(Figure: histograms of means for the truncated population.)

(Figure: boxplots of means for the truncated population.)

Descriptive Statistics of the Sampling Distribution of Means for the Truncated Population
(rows correspond to n = 10, 20, 100, 500; population variance 19.63)

  n   | biassim | varsim | varasim
  10  | -0.0105 | 2.0025 | 20.0249
  20  | -0.0002 | 0.9810 | 19.62088
  100 | -0.0014 | 0.1958 | 19.5790
  500 | -0.0029 | 0.0395 | 19.7419

Chi-square with Two D.F.
(Figure: density of the $\chi^2_2$ distribution.)

(Figure: boxplots of means for $\chi^2_2$.)

(Figure: histograms of means for $\chi^2_2$.)

Central Limit Theorem
• If the sample size is large enough, then the sample mean $\bar{x}$ has an approximately normal distribution.
• This is true no matter what the shape of the distribution of the original data!

(Figure: histogram of 100000 observations from the standard Cauchy distribution.)

(Figures: $\bar{X}_n$ and $\sqrt{n}\,(\bar{X}_n - \mu)$ for the standard Cauchy, N = 500 — here the CLT fails, since the Cauchy distribution has no mean or variance.)

Central Limit Theorem
The CLT is a special case of convergence in distribution. Let
$$F_{X_n}(x) = \Pr\!\Big(\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x\Big), \qquad \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$$
Then, subject to the existence of the mean and the variance,
$$F_{X_n}(x) \to \Phi(x) \quad \text{for every } x \text{ as } n \to \infty.$$
Research is going on to relax the i.i.d. condition.

How many sequences are there in the CLT?
• The basic random functional sequence $\bar{X}_n(\omega)$.
• The derived random functional sequence $\sqrt{n}\,(\bar{X}_n(\omega) - \mu)$.
• A real sequence, $1/n$, against which to compare the convergence of $\bar{X}_n(\omega) - \mu$ to 0.
• Another real, nonnegative functional sequence, $F_{X_n}(x)$.

Significance of the CLT
From mathematics we know that we can approximate $a_n$ by $a$ as accurately as we wish when $a_n \to a$. The sampling distribution of means can therefore be approximated by the normal distribution when the CLT holds and the sample size is fairly large — see the sketch below.
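In the same spirit as the $\chi^2_2$ simulations above, here is a minimal R sketch of the CLT in action. The replication count, the sample sizes and the $\chi^2_2$ population mirror the slides, but the code itself is an illustrative reconstruction, not the original script:

set.seed(1)
B <- 10000                    # replications, as in the slides
mu <- 2; sigma <- 2           # chi-square with 2 d.f.: mean 2, variance 4

for (n in c(10, 20, 100, 500)) {
  z <- replicate(B, sqrt(n) * (mean(rchisq(n, df = 2)) - mu))
  # As n grows, z should behave more and more like N(0, sigma^2) = N(0, 4)
  cat("n =", n, " mean(z) =", round(mean(z), 4),
      " var(z) =", round(var(z), 4), "\n")
}

# Compare the n = 500 case with the limiting normal density
hist(z, breaks = 50, freq = FALSE,
     main = "sqrt(n)(xbar - mu), chi-square(2), n = 500")
curve(dnorm(x, mean = 0, sd = sigma), add = TRUE)

Replacing rchisq(n, df = 2) with rcauchy(n) (and dropping the centering by mu) reproduces the failure seen on the Cauchy slides: the histogram of means never settles down to a normal shape, because the required mean and variance do not exist.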
The CLT also justifies building confidence intervals for $\mu$ from the sample mean and the normal table in non-normal cases.

More Topics Worth Studying
• Error bounds of the form $\sup_x |F_{X_n}(x) - \Phi(x)| \le g(n)$: the Berry-Esseen theorem (1941, 1945).
• Characterizing extreme fluctuations using sequences like $\log n$, $\log\log n$, etc.: the law of the iterated logarithm (Hartman and Wintner, 1941).
• Checking uniformity of convergence: uniform convergence is better than simple pointwise convergence. Pólya's theorem guarantees that, since the normal cdf is everywhere continuous, the convergence in the CLT is uniform.

Why do we use $\bar{x}$ to estimate $\mu$?
P1. $E(\bar{X}) = \mu$. (What is the meaning of "E"?)
P2. $V(\bar{X}) = E[(\bar{X} - \mu)^2] = V(X)/n$. (What is its significance?)
P3. $\bar{X}$ converges to $\mu$ in probability:
$$\lim_{n \to \infty} a_n(\varepsilon) = \lim_{n \to \infty} \Pr(|\bar{X}_n - \mu| < \varepsilon) = 1 \quad \forall\, \varepsilon > 0,$$
subject to $t\,[1 - F(t) + F(-t)] \to 0$ as $t \to \infty$.
P4. $\bar{X}$ converges to $\mu$ almost surely:
$$\Pr\big(\{\omega : \lim_{n \to \infty} |\bar{X}_n(\omega) - \mu| = 0\}\big) = 1,$$
subject to $E(|X_1|) < \infty$. Condition 2 implies condition 1.

Difference between the Two Limits
$\lim_{n \to \infty} \Pr(|\bar{X}_n - \mu| < \varepsilon) = 1$: the probability is calculated first, then the limit is taken.
$\Pr(\{\omega : \lim_{n \to \infty} |\bar{X}_n(\omega) - \mu| = 0\}) = 1$: the limit is calculated first, then the probability is calculated.

Why do we use $\bar{x}$?
P5. Let $X \sim N(\mu, \sigma^2)$, i.e. $\varepsilon = X - \mu \sim N(0, \sigma^2)$. Then $\bar{X} \sim N(\mu, \sigma^2/n)$, so we can make statements like $\Pr[a(\bar{X}_n) < \mu < b(\bar{X}_n)] = 1 - \alpha \approx 1$.
P6. The central limit theorem justifies statements like $\Pr[a(\bar{X}_n) < \mu < b(\bar{X}_n)] = 1 - \alpha \approx 1$ when $X$ is not normal.
P7. $V(\bar{X}_n) \le V(T_n)$ whenever $E(T_n) = \mu$ and $X \sim N(\mu, \sigma^2)$.

Meaning of the Expectation of g(X), E(g(X))
If $X$ is discrete, $E(g(X)) = \sum_i g(x_i)\,p(x_i)$, provided $E(|g(X)|) = \sum_i |g(x_i)|\,p(x_i) < \infty$. Absolute convergence implies conditional convergence, but the converse is not true; under absolute convergence, rearrangement of the terms does not change the limit.
If $X$ is absolutely continuous, $E(g(X)) = \int g(x)\,f(x)\,dx$ (in the Riemann sense), provided $E(|g(X)|) = \int |g(x)|\,f(x)\,dx < \infty$.

Binomial Distribution
• A mathematically very simple model.
• An effective model for several practical situations.
• Computationally, however, it is very troublesome:

> choose(100,50)
[1] 1.008913e+29
> choose(10000,5000)
[1] Inf

A binomial table of a thousand pages would not be sufficient to cover all cases.

Stirling's Approximation (1730)
The formula as typically used in applications is
$$\ln n! = n \ln n - n + O(\ln n)$$
(very hard to prove). The next term in the $O(\ln n)$ is $\tfrac{1}{2}\ln(2\pi n)$; a more precise variant of the formula is therefore often written
$$n! \sim \sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^{n}.$$

Journey from Binomial to Normal
Abraham de Moivre (1667-1754); Johann Carl Friedrich Gauss (1777-1855). To know this heroic journey, read the attached file and - - -

Computational Advantages of the Normal Distribution
A one-page table is enough for almost all applications. Using the wonderful properties of power series with an infinite radius of convergence, we can approximate the cdf of the standard normal as accurately as we want:
$$\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_{0}^{x} e^{-t^{2}/2}\, dt = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \Big( x - \frac{x^{3}}{3 \cdot 2 \cdot 1!} + \frac{x^{5}}{5 \cdot 2^{2} \cdot 2!} - \frac{x^{7}}{7 \cdot 2^{3} \cdot 3!} + \cdots \Big).$$

Meaning of Measure
$$\mu\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \sum_{n=1}^{\infty} \mu(A_n) \quad \text{whenever the } A_n \text{ are pairwise disjoint.}$$
Does rearrangement beget any problem? No: the terms $\mu(A_n)$ are nonnegative, so the sum does not depend on the order of the terms.

Analytic Concepts versus Probability Concepts

  Analytic concept                         | Probability analogue
  Bounded (big "Oh")                       | Stochastically bounded
  Little "oh" / convergence in measure     | Convergence in probability
  Pointwise convergence / convergence a.e. | Almost sure convergence

The main probabilistic modes of convergence:
1. Convergence in law / distribution / weak convergence
2. A.s. convergence
3. rth mean convergence

Some Definitions

Example

Definitions of Convergence in Law and Convergence in Probability
$X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$.
$X_n \xrightarrow{P} X$ if $\Pr(|X_n - X| > \varepsilon) \to 0$ as $n \to \infty$ for every $\varepsilon > 0$.

Definitions of Almost Sure Convergence and Convergence in $L_p$ Norm
$X_n \xrightarrow{a.s.} X$ if $\Pr(\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}) = 1$.
$X_n \xrightarrow{L_p} X$ if $E|X_n - X|^{p} \to 0$ as $n \to \infty$.

Relation between the Various Modes of Convergence
A.s. convergence and $L_p$ convergence each imply convergence in probability, which in turn implies convergence in law; convergence in law to a constant implies convergence in probability to that constant.

Equivalent Definitions of Convergence in Distribution
$X_n \xrightarrow{d} X$ if and only if $E f(X_n) \to E f(X)$ for every bounded continuous function $f$; the portmanteau theorem lists several further equivalents.

Continuous Mapping Theorem and Slutsky's Theorem
If $X_n \xrightarrow{d} X$ and $g$ is continuous, then $g(X_n) \xrightarrow{d} g(X)$. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$, then $X_n + Y_n \xrightarrow{d} X + c$ and $Y_n X_n \xrightarrow{d} cX$.

Classical Delta Theorem and Scheffé's Theorem
If $\sqrt{n}\,(T_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \ne 0$, then $\sqrt{n}\,(g(T_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)$. Scheffé's theorem: if densities $f_n \to f$ pointwise (a.e.), then $\int |f_n - f| \to 0$, and convergence in distribution follows.

Thanks