STATISTICS Inductive learning method Methods: Bayesian Inference and other probabilistic methods (use probabilities; input and output variables can be numeric or non-numeric) Regression (using training data, construct a model of the data; estimate output for a “new” sample) o Predictive regression (input and output are all numeric) Use ANOVA to estimate the regression error o Logistic regression (inputs are numeric or non-numeric, the output is a “binary” categorical value) o Log-linear model (input and output are non-numeric) Linear discriminant analysis (inputs are numeric, output is non-numeric). Bayesian Inference Used to find conditional probability, i.e. probabilistic “influence” of two variables Y and X, for example probability that variable Y has a certain value given that we know the value of variable X. This is written as P(Y if X), or P(Y/X) for short. General formula: P(Y/X) = P(Y and X) / P(X) = P(X/Y)*P(Y)/P(X) Used to predict (new) sample classification, i.e. to find: P(the new sample belongs to class C / the sample’s values are known). We assume that the sample is from the input samples for which we know Bayesian probabilities. Assume: we have a sample set S = {S1, S2, …, Sm) and each Si is a sample containing an ndimensional vector Xi and classification Ck (i.e. each Si is a row in the input file and each Xi is a vector of xij values for that column, and each Si is assigned output classification Ck.) i = 1, ..,m, j = 1, …, n, k=1, …, c, where c is the max number of output classes. Assume that P(X) = const for all classes (i.e. the data will not change classification in time. Also, we know P(X) for the training set.) Given a “new” sample Sz (i.e. a sample for which we know Xz, i.e. the xzj values, but we don’t know the output classification Ck) we can predict the output classification. We can predict for each Ck, k = 1, …, c: P(Xz belongs to class Ck / we know the xzj values of Xz). In other words, we predict: P(Xz/C1) = PRODUCT[j=1,n](P(xzj / C1) P(Xz/C2) = PRODUCT[j=1,n](P(xzj / C2) ….. P(Xz/Cc) = PRODUCT[j=1,n](P(xzj / Cc) What good is this information? It is not certain – it is all probabilistic. The caveat: since it is all probabilistic, events with the highest probabilities will happen the most – so our model will be very “accurate” but will miss low probability events. How can we analyze events with low probabilities (e.g. fraud)? Example p.97: Sample Attribute A1 1 1 2 0 3 2 4 1 5 0 6 2 7 1 Attribute A2 2 0 1 2 1 2 0 Attribute A3 1 1 2 1 2 2 1 Output class C 1 1 2 2 1 2 1 Regression Assume that the input table has m rows and n columns, and that the last column is the “output classification” column. ------------The long story useful for coding using matrix packages--------------Assume that the input table is in form: x11, x12, …, x1n, y1 x21, x22, …., x2n, y2 … xm1, xm2, …, xmn, ym Assume that X1, .., Xm are input row vectors, where each vector is a “row” of the input table. X1 = <x11, x12, …, x1n> X2 = <x21, x22, …., x2n> … Xm = <xm1, xm2, …, xmn> The input matrix X is X = Transpose<X1 X2 … Xm>. FYI: Assume that X1’, .., Xn’ are input column vectors, where each vector is a “column” of the input table. X1’ = Transpose<x11, x21, …, xm1> … Xn’ = Transpose<x1n, x2n, …, xnm>. Vector Y is the output vector, i.e. a column vector, and contains just the output column. Y = Transpose< y1, y2, …, ym> 1. Predictive Regression Input and output are all numeric. Today, we use mostly linear regression, i.e. the output is a straight line going through the input points: Y = X’ * B where Y, X and B are matrices Matrix B is a vector: B = Transpose<a b1 b2 b3 … bn>. Matrix X’ is obtained when we insert column with all 1’s before the first column of matrix X. When we list all rows and columns out, we get a set of m linear equations for j=1, …,m: y1 = a + b1*x11 + b2*x12 + … + bn*x1n + error1 … yj = a + b1*xj1 + b2*xj2 + … + bn*xjn + errorj … ym = a + b1*xm1 + b2*xm2 + … + bn*xmn + errorm --------- The shortcut notation, useful for understanding – watch the change of ij’s!--------Now X1, .., Xn are the columns, and xij is the value of row j in column i: Y = a + b1*X1 + b2*X2 + … + bn*Xn + Error yj = a + b1*x1j + b2*x2j + … + bn*xnj + errorj How to use linear regression: Linear regression is a way to build a *model* of input data. In the case of linear regression, the model has the form Y = X’*B. Therefore, the problem can be stated: given Y and X of the training data set of size mxn, find B; therefore you have the model of input data and can predict value of y for a new sample x, i.e. sample xz, for z > m. Solution: B = Inverse((Transpose(X’)*X’))*(Transpose(X’)*Y)) In case of one-variable regression, i.e. Y = a + bX, things are simpler. The simplest regression looks only at one input column and one output column, i.e. we assume Y = a + b*X, i.e. yj = a + b*xj j =1, …, m. In other words: we have a set of input data points X, and we are approximating them with a regression line Y. Calculate the parameters a and b by minimizing the squared error, i.e. the distance between the data points and the regression line. After some math, we get: a = mean(X) – b*mean(Y) b = Correlation(X,Y)/Correlation(X,X) where Correlation formulas are given in the PCA handout Correlation(X,Y) = (1/m) * SUM[i=1,m]((X-mean(x)) *(Y-mean(y))) Example: multivariate (multiple) regression 1. We will first use multiple regression formulas but apply them to only one variable, in order to have it look more manageable. The same principles would apply if we had more than one feature. Suppose we have 5x1 input data X and one column for output classification Y. Sample 1 2 3 4 5 X 6 7 8 9 7 Y 1 2 3 3 4 Row vectors are: X1 = <6> X2 = <7> Etc. Input matrix X is: X= 6 7 8 9 7 Now append a column of 1’s to the input vector X: X’ = 1 6 1 7 1 8 1 9 1 7 Transpose(X’) = 1 6 1 7 1 8 Matrix Y is: Y= 1 2 3 3 4 Transpose(X’)*X = 1 37 Transpose(X’)*Y = 13 99 Inverse(Transpose(X’)*X) = 10.7 -1.4 37 279 -1.4 0.2 1 9 1 7 B= -1.4 .54 Therefore, y = -1.4 + .54x 2. Let’s apply linear regression to two features: Sample 1 2 3 4 5 F1 6 7 8 9 7 F2 60 70 80 90 70 Y 1 2 3 3 4 Row vectors are: X1 = <6 60> X2 = <7 70> Etc. Input matrix X is: X= 6 60 7 70 8 80 9 90 7 70 Now append a column of 1’s to the input vector X: X’ = 1 6 60 1 7 70 1 8 80 1 9 90 1 7 70 Matrix Y is the same as in the first example, and matrix B has the same “shape” but the values will be different. Continue calculating B using matrix operations. At the end, the result should look like: Y = a + b1*x11 + b2*x12+ … + bn*x1n Our book (and other literature) writes this in shortcut as: Y = a + b1*F1 + b2*F2 + … + bn*Fn Or: Yk = a + b1*F1k + b2*F2k + … + bn*Fnk In our case, n=2, so our Y will be: Y = a + b1*x11 + b2*x12 Or, in shortcut: Y = a + b1*F1 + b2*F2 Or: Yk = a + b1*F1k + b2*F2k Example p.100: one variable regression: Table 5.2 X mean b= a= Y 1 8 11 4 3 5.4 0.920245 1.030675 3 9 11 5 2 6 X-mx -4.4 2.6 5.6 -1.4 -2.4 Y-my * -3 3 5 -1 -4 13.2 7.8 28 1.4 9.6 (X-mx)^2 19.36 6.76 31.36 1.96 5.76 12 y = 0.9202x + 1.0307 10 8 6 4 2 0 0 2 4 6 8 10 12 Issues What is the most important issue about regression? To find the columns which should be included in the regression line. This is different than feature reduction which we used in Ch.3 (e.g. entropy, PCA, mean-variance, and other methods.) Feature reduction from Ch.3. was for a generalized case, to reduce the size of data set and get the data ready for mining. Picking features to be used for regression is an additional step – once the data set is as small as it can be, pick the features used for this particular kind of mining. How can we go about picking features for regression? In the good old ways that work for so many other approaches: 1. sequential search: start with each feature being a separate set or with a group of selected features, keep on adding features until some criteria is satisfied (and then check, e.g. with ANOVA) 2. combinatorial approach: perform search across all possible combinations of features and see which combo gives the best regression model (and then check, e.g. with ANOVA) Analysis of Variance (ANOVA) Assume that we constructed the regression model, so now we have set of real Y values and set of estimated Y values, called Y*. Our goal is to find out if we included all the features that we should have included. Assume that the original number of features is n, and we included m of them into the regression equation. The pseudocode: Calculate an estimate of variance by using the square error: Var* = SQRT( SUM[i=1,m]((yi-y i *)^2)/(m-n) Then eliminate one feature Calculate Var* again. If the new variance is much greater than the old one, this feature must be included in the regression model. If the new variance is about equal to the old variance, this feature is not important for the regression model and can be omitted. 2. Logistic Regression The right side of regression equation is the same; the left side is related to a binary categorical variable Y. Y = a + b1*X1 + b2*X2 + … + bn*Xn yj = log(pj/qj) = a + b1*x1j + b2*x2j + … + bn*xnj where pj is the probability that yj=0, and qj = 1- pj. log(pj/qj) is written as logit(p). Example p.107: Given logit(p) = 1.5-0.6*x1 + 0.4*x2 – 0.3*x3 and new sample {1,0,1}, estimate Prob(output of this sample = 1). logit(p) = log(p/(1-p)) = 1.5-0.6+0-0.3 p = e^0.6 / (1 + e^ 0.6) = 0.35 3. Log Linear Models The right side of the expression is the same; the left side is related to a Poisson variable with expected value μ. All X and Y are non-numeric. Y = a + b1*X1 + b2*X2 + … + bn*Xn yj = log(μ j) = a + b1*x1j + b2*x2j + … + bn*xnj Example on p.108: how to use log linear modeling when there is no output variable and we want to analyze dependency between two columns. We want to investigate if a person’s support of abortion is related to their gender, and we interview 1100 people of both genders and collect the following data: Sample ID 1 2 3 … 1100 Gender M F F … M Support for abortion Y N Y … N How do we analyze non-numeric values? Similar to Hamming distance: we count occurrences. Then we can tabulate that in a contingency table (i.e. the “Sij” table). (Then we can analyze it using Chi-square.) 1. Calculate contingency table: Support Gender Female Male Total Yes 309 319 628 No 191 281 472 Total 500 600 1100 2. Convert the table into expected values table: Eij = (row i total * column j total)/total Gender Female Male Total Expected Support Yes No 500*628/1100 500*472/1100 600*628/1100 600*472/1100 628 472 Total 500 600 1100 3. Calculate Chi-square for a contingency table of m rows and n columns: Chi_square = SUM[i=1,m]SUM[j=1,n] ((Xji-Eji)^2)/Eji Pick the confidence interval for the Chi-square (usually a=0.05 or 0.1). The degrees of freedom for Chi-square is equal to (m-1)*(n-1). If Chi_square ≥ T(a), then the columns are dependent. Find T(a) using the chi-square tables. http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm In this case, Chi_square = 8.3 and T(a) = 3.84, so there is dependency between gender and abortion opinion.