Machine Learning Week 3 Lecture 1

Programming Competition
http://users-cs.au.dk/jasn/algocomp/programming_contest.pdf
Hand in.

Student Comparison
After 5 minutes:

Train:
MCC: 0.9481 0.9458 0.8737 0.8686 0.9017 0.8407 0.9316 0.9088 0.8493 0.8644
[[5693    0   23   19   13   44   48    6   68    9]
 [   1 6508   33   33    6   38    7   19   87   10]
 [  47   64 5192  100  123   23  103  108  163   35]
 [  27   32  142 5355    5  252   39   70  141   68]
 [  12   31   45    6 5378    8   61   13   43  245]
 [  91   46   44  220   70 4522  109   29  210   80]
 [  42   18   55    7   54   77 5605    3   54    3]
 [  42   63  108   29   76   10    4 5718   18  197]
 [  29  139   84  154   25  146   48   31 5095  100]
 [  38   31   37   91  211   43    4  192   63 5239]]

Test:
MCC: 0.9525 0.9643 0.8855 0.8848 0.9060 0.8585 0.9308 0.9068 0.8543 0.8777
[[ 957    0    2    2    0    5    9    1    4    0]
 [   0 1102    2    4    0    2    4    1   20    0]
 [  11    4  903   15   15    1   14   18   42    9]
 [   4    0   21  909    0   32    3   11   20   10]
 [   1    2    6    1  913    1   11    1    8   38]
 [  11    2    5   41   11  750   16   11   38    7]
 [  13    3    4    3   15   11  906    1    2    0]
 [   3   12   25    7    9    0    0  936    3   33]
 [   9    8    9   25    8   20   12   15  859    9]
 [  11    8    5   11   42   10    0   20    8  894]]

Today
• Recap
• Learning with infinite hypothesis sets
• Bias-variance
• The end of learning theory...

Recap: The Test Set - Take a Look at Eout
Fixed hypothesis h, N independent data points, and any ε > 0:
• Split your data into two parts, D_train and D_test.
• Train on D_train and select a hypothesis h.
• Test h on D_test; call the resulting error Etest(h).
• Apply the Hoeffding bound to h: P[|Etest(h) - Eout(h)| > ε] ≤ 2·exp(-2ε²N).
• D_test cannot be used for anything else.

Generalization Bound
Goal: extend the generalization bound to infinite hypothesis spaces.

Dichotomies
A dichotomy is a bit string of length N: fix a set of N points X = (x1,...,xN); each hypothesis h in H gives a dichotomy (h(x1),...,h(xN)).

Growth, Shattering, Breaks
The growth function m_H(N) is the maximum number of dichotomies H can produce on any N points. If m_H(N) = 2^N, we say that H shatters (x1,...,xN). If no data set of size K can be shattered by H, then K is a break point for H.

Revisit Examples
• Positive rays (threshold a)
• Intervals (endpoints a1, a2)
• Convex sets
[Figure: the three hypothesis sets, with an impossible dichotomy marked.]

2D Hyperplanes
3 is not a break point: we can shatter some point set of size 3.

2D Hyperplanes
4 is a break point. For any 4 points there is an impossible dichotomy:
• Triangle with a point inside
• 3 points on a line
• Else (4 points in convex position: labelling the two diagonal pairs oppositely cannot be separated)

Very Important Theorem
If H has a break point k, then the growth function is polynomial:
m_H(N) ≤ sum_{i=0}^{k-1} C(N, i),
a polynomial in N of degree at most k-1.

Generalization Bound / VC Theorem
For any ε > 0: P[|Ein(g) - Eout(g)| > ε] ≤ 4·m_H(2N)·exp(-ε²N/8).
Proof: book appendix (intuition in the book).
If the growth function is polynomial, we can learn with infinite hypothesis sets!

VC Dimension
The VC dimension of a hypothesis set H is the maximal number of points it can shatter, i.e. the largest N such that m_H(N) = 2^N. It is denoted dvc, and it equals the smallest break point minus one.

Revisit Examples
Hypothesis set    VC dim.
Positive rays     1
Intervals         2
Convex sets       ∞ (unbreakable)
2D hyperplanes    3

VC Dimension for Linear Classification
The VC dimension for d-dimensional inputs (d+1 parameters including the bias) is d+1.
Proof coming up:
• Show dvc ≥ d+1.
• Show dvc < d+2.

dvc ≥ d+1
Find a set of d+1 points we can shatter.
Idea: make the points (as vectors) independent by using one dimension for each point. Stack them as rows of a (d+1)×(d+1) matrix X (rows = points); X is invertible.

dvc ≥ d+1
Consider any dichotomy y in {-1,+1}^(d+1). Find θ such that sign(Xθ) = y: it suffices to solve Xθ = y, i.e. θ = X⁻¹y, which exists because X is invertible.

dvc < d+2
For any d+2 points we must show there is a dichotomy hyperplanes cannot capture. View the points as (d+1)-dimensional vectors (with the bias coordinate). They must be linearly dependent (more vectors than dimensions), so some x_j = sum_{i≠j} a_i·x_i with not all a_i zero; ignore the zero terms.

dvc < d+2
Dichotomy: y_i = sign(a_i) for the nonzero a_i, and y_j = -1.
Claim: this dichotomy is impossible to capture with hyperplanes. If sign(θ·x_i) = sign(a_i) for every nonzero a_i, then θ·x_j = sum_{i≠j} a_i·(θ·x_i) > 0, so y_j would have to be +1.

dvc = d+1
For hyperplanes, dvc = the number of free parameters.
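The dvc ≥ d+1 step can be checked numerically. Below is a minimal MATLAB/Octave sketch (not from the lecture) for d = 2: the rows of X are the d+1 points in homogeneous coordinates, chosen so that X is invertible as in the proof, and for every dichotomy y it solves Xθ = y and confirms that sign(Xθ) = y.

% Check of the dvc >= d+1 construction for d = 2 (2D hyperplanes).
% Rows of X are the d+1 points in homogeneous coordinates [1 x1 x2],
% chosen so that X is invertible, as in the proof sketch above.
d = 2;
X = [1 0 0;
     1 1 0;
     1 0 1];
% Enumerate all 2^(d+1) dichotomies and realise each with a hyperplane.
for k = 0:2^(d+1)-1
  y = 2*(dec2bin(k, d+1) - '0')' - 1;   % dichotomy as a +/-1 column vector
  theta = X \ y;                        % solve X*theta = y
  assert(isequal(sign(X*theta), y));    % the hyperplane reproduces the dichotomy
end
disp('All 8 dichotomies of the 3 points are realised, so dvc >= 3 = d+1.');

The same check works for any d; only the choice of the invertible matrix X changes.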
Vapnik-Chervonenkis (VC) Generalization Bound
For any tolerance δ > 0, with probability at least 1-δ:
Eout(g) ≤ Ein(g) + sqrt( (8/N)·ln( 4·m_H(2N)/δ ) ).

Quote from the book: "The VC generalization bound is the most important mathematical result in the theory of learning."

VC Generalization Bound
• Independent of the learning algorithm, the target function, P(x), and the out-of-sample error - general indeed.
• We need to bound the growth function for each of our hypothesis spaces. For hyperplanes we showed dvc = number of free parameters.

Exercise 2.5
With N = 100, what probability does the bound give that Ein is within 0.1 of Eout? It happens with probability at least 1-δ, but plugging in gives 1-δ < 0, i.e. the bound says nothing. Ridiculous.

Cost of Generality
• The growth function is really a worst case.
• Independent of P(x), the target, and the out-of-sample error.

Sample Complexity
Fix the tolerance δ (success probability ≥ 1-δ) and require generalization error at most ε. How big must N be? From the VC bound,
N ≥ (8/ε²)·ln( 4·m_H(2N)/δ ),
and we can upper bound further using the VC-dimension polynomial m_H(2N) ≤ (2N)^dvc + 1:
N ≥ (8/ε²)·ln( 4·((2N)^dvc + 1)/δ ).

Sample Complexity
Plug in ε = δ = 0.1 and let's plot the two sides for dvc = 3,...,9:

function sc_vcdim()
% Compare N (red line) with the sample-complexity bound
% 800*ln(40*((2N)^dvc + 1)) (blue curves) for dvc = 3,...,9 and eps = delta = 0.1.
dat = 5000:2500:100000;
hold off;
plot(dat, dat, 'r-', 'linewidth', 3);      % the line y = N
hold on;
for i = 3:9                                % dvc = 3,...,9
  tmp = 800*log(40*((2.*dat).^i + 1));     % right-hand side of the bound
  plot(dat, tmp, 'b-', 'linewidth', 3);
end

Sample Complexity
Book statement: the sample complexity grows roughly in proportion to dvc. In practice, a rule of thumb is N ≥ 10·dvc.

VC Interpretation
• We can learn with infinite hypothesis sets.
• The VC dimension captures the effective number of parameters / degrees of freedom.

In-Sample Error + Model Complexity
Rearranging the VC bound: Eout(g) ≤ Ein(g) + Ω(N, H, δ), where Ω(N, H, δ) = sqrt( (8/N)·ln( 4·m_H(2N)/δ ) ) is the model-complexity penalty.
[Figure: Ein decreases and the model complexity Ω increases with the VC dimension; the out-of-sample error is their sum. Balance these.]

Model Selection
Given t models m1,...,mt, which is better? Compare
Ein(m1) + Ω(m1)
Ein(m2) + Ω(m2)
...
Ein(mt) + Ω(mt)
and pick the minimum one. Problem: the slack in Ω.

Vapnik-Chervonenkis (VC) Theorem
For any δ > 0, with probability at least 1-δ:
Eout(g) ≤ Ein(g) + sqrt( (8/N)·ln( 4·m_H(2N)/δ ) ).
The test set estimate (M = 1 in the Hoeffding bound) is a lot tighter:
Eout(h) ≤ Etest(h) + sqrt( ln(2/δ) / (2·N_test) ).

Learning Summary
• There is theory for regression as well.
• We will not go there.
• Move on to bias-variance.
• This is the last learning theory in this course.

Bias-Variance Decomposition
• Consider the least-squares error measure again and see if we can understand the out-of-sample error.
• For simplicity, assume our target function is noiseless, i.e. it is an exact function of the input.

Experiments
• Take two models: ax + b and cx² + dx + e.
• A target function f(x), ¼ < x < ¾.
• Repeatedly pick 3 random data points (x, f(x)) and fit both models.
• Plot it and see.

Bias-Variance
• The out-of-sample error we get depends on the hypothesis.
• The hypothesis is the result of a learning algorithm, which is a function of the training data.
• So the training data affects the out-of-sample error.
• Think of the data set as a random variable and analyze what happens.
• What happens if we repeatedly sample data and run our learning algorithm?
Notation: g^(D) = the hypothesis learned on data set D.

Bias-Variance
With the average hypothesis gbar(x) = E_D[g^(D)(x)], the expected out-of-sample error decomposes as
E_D[Eout(g^(D))] = E_x[ bias(x) + var(x) ], where
bias(x) = (gbar(x) - f(x))²  - how far the average hypothesis is from the target,
var(x) = E_D[(g^(D)(x) - gbar(x))²]  - how much a single hypothesis varies around the average.

Learning Curves
Plot of the in-sample and out-of-sample error as a function of the training set size N.
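To make the experiment above concrete, here is a minimal MATLAB/Octave sketch (not from the slides). The target formula from the "Experiments" slide is not reproduced here, so sin(pi*x) is used purely as a stand-in; the two model classes, the noiseless labels, and the "fit 3 random points, repeat" protocol follow the lecture, and the bias and variance terms are the ones defined above.

% Bias-variance experiment: fit a line and a parabola to 3 random points, many times.
target = @(x) sin(pi*x);            % stand-in target (assumption), noiseless
runs   = 200;                       % number of independently sampled data sets D
xs     = linspace(0, 1, 100)';      % grid on which the learned hypotheses are evaluated
H1     = zeros(length(xs), runs);   % linear fits    a*x + b
H2     = zeros(length(xs), runs);   % quadratic fits c*x^2 + d*x + e
for r = 1:runs
  x = rand(3, 1);                   % three random inputs
  y = target(x);                    % noiseless labels, as assumed in the lecture
  H1(:, r) = polyval(polyfit(x, y, 1), xs);   % least-squares line
  H2(:, r) = polyval(polyfit(x, y, 2), xs);   % parabola (3 points => exact fit)
end
gbar1 = mean(H1, 2);                % average hypothesis gbar for each model
gbar2 = mean(H2, 2);
bias1 = mean((gbar1 - target(xs)).^2);    % bias = E_x[(gbar(x) - f(x))^2]
var1  = mean(mean((H1 - gbar1).^2, 2));   % var  = E_x[E_D[(g(x) - gbar(x))^2]]
bias2 = mean((gbar2 - target(xs)).^2);
var2  = mean(mean((H2 - gbar2).^2, 2));
fprintf('a*x+b:        bias %.4f  var %.4f\n', bias1, var1);
fprintf('c*x^2+d*x+e:  bias %.4f  var %.4f\n', bias2, var2);

Repeating the same averaging of in-sample and out-of-sample error for increasing training set sizes is what produces the learning curves mentioned on the last slide.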