An Introduction to Logic Regression
DC Data Science Meetup
October 25, 2011
John Dennison
dennison.john@gmail.com
@johnsarealtwit

Other than Getting Drunk on Marck’s Beer, What Should I Get out of Tonight?
• The thought process behind technique selection.
  • What are the questions you ask yourself when deciding upon an algorithm/technique?
  • I’ll share some quick thoughts and then I hope to spark a more open discussion.
• An introduction to Logic Regression and LogicForest
  • Basic intro to CART and RandomForest
  • What is a Logic Tree?
  • Simulated Annealing
• A Short R Demo - because nothing is as exciting as watching an algorithm train

The Supervised Classification Problem
• Supervised vs. Unsupervised
  • Labeled vs. Unlabeled Data
• Supervised Learning:
  • Use a set of pre-labeled/categorized data to train a classifier that can predict the ‘label’ of previously unseen observations.
• Examples:
  • Spam Filtering: Given a trove of emails previously classified as Spam vs. Non-Spam, train an algorithm to predict the ‘label’ of newly received emails.
  • Heart Attack Prediction: Given a set of health indicators and historical epidemiology records, predict the chance of a heart attack in new patients.
• Terms:
  • “Dependent” variable, response, or outcome - what you are trying to predict
  • “Predictor” or “independent” variables - what you use to predict
  • Test/Training Set - randomly subsetted data used to validate the model and test for overfitting

Considerations with Technique Selection
• How is the model going to be used?
  • Production vs. Exploration
• Who is going to be exposed to it?
  • Personal Analysis vs. Public/Internal Consumption
  • Business/Users vs. Nerds
• White Box vs. Black Box
  • E.g. CART vs. SVM
• The importance of Interpretability
  • Interpretability vs. Accuracy - ideally a false dichotomy
• What fits the data?
  • High Dimensionality - how the F’ do you deal with so many possible predictors?
• Would recommend Breiman’s wonderful article “Statistical Modeling: The Two Cultures”

Where are these vexing questions of morality, ontology and efficient algorithm design answered?

Noob Armed with a Computer
• The double-edged sword of R?
• library(caret)
  • Classification and Regression Training - WONDERFUL
  • Fits.of.Numbing.Variety <- train(TrainData, Response, method = "XXXX", ...)
• If you can’t understand it, you probably shouldn’t be using it.
• If you can’t explain it, you probably don’t understand it.
• Fetishism of Complexity
• Thoughts?

Logic Regression
• Not Logistic Regression
  • Logistic regression is a GLM (generalized linear model) predicting the probability of an outcome, but it is used on many of the same problems.
• Main Paper - Ruczinski I, Kooperberg C, LeBlanc ML (2003): Logic Regression, Journal of Computational and Graphical Statistics, 12 (3), pp 475-511.
  • Very Readable
• Published in R
  • library(LogicReg)

Logic Reg. Cont.
• “Logic Regression is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of binary covariates.”
• The most important contribution of Logic Regression to the field is its focus on the interactions among binary predictors.
• This is a major difference from rival techniques, where the effect of each predictor upon the response is typically measured independently and interactions, when they are considered at all, are limited to 2- or 3-way terms.
• Again, this focus is purely application specific. It may not be important to you. However, it can be very interesting where the binary predictors are highly correlated.
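
To make "Boolean combinations of binary covariates" concrete, here is a small Python sketch (not LogicReg code). The expression (x1 AND NOT x2) OR x3 and the names `logic_predictor` and `truth_table` are made up for illustration; the point is only that the predictor is one Boolean expression over several binary inputs, so the interaction is built in rather than added term by term.

```python
from itertools import product


def logic_predictor(x1, x2, x3):
    """Hypothetical logic-tree predictor: (x1 AND NOT x2) OR x3, returned as 0/1."""
    return int((x1 and not x2) or x3)


# Enumerate all 2^3 combinations of the three binary covariates.
truth_table = [
    (x1, x2, x3, logic_predictor(x1, x2, x3))
    for x1, x2, x3 in product([0, 1], repeat=3)
]

if __name__ == "__main__":
    print("x1 x2 x3 -> prediction")
    for x1, x2, x3, y in truth_table:
        print(f" {x1}  {x2}  {x3} -> {y}")
```

Note that x1 alone tells you nothing here: its effect depends entirely on x2 and x3, which is exactly the kind of structure a main-effects model would miss.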
Example Logic Tree
[Figure: example logic tree and its Boolean expression]
The not-so-English English translation:
(One and not Two) and either [(Three and Four) or (Five and either (Six or not Three))]
Image Credit: Ruczinski et al.

How?
The Highly Unscientific Explanation: a decision tree with a touch of Monte Carlo magic - Simulated Annealing

CART:
CART algorithms implement a “greedy” impurity reduction strategy, using Gini, entropy (information gain), or misclassification rate.
1. Search all attributes and calculate the potential reduction in impurity for each.
2. Split on the attribute with the greatest gain (in the binary world this is T or F).
3. Repeat with the remaining attributes until there is no further reduction in impurity or the preset maximum size is met.

Simulated Annealing
• A search technique for locating a good approximation of the global optimum in a large search space.
• The inspiration comes from metalwork.
• An adaptation of the Monte Carlo method.
  • Monte Carlo methods use repeated random sampling in order to calculate results.
  • Personally found this application to be a useful introduction to the much larger field of Monte Carlo methods.
• The Metropolis-Hastings algorithm can be viewed as Simulated Annealing with the temperature held constant.

Basic Components of Simulated Annealing
General Definitions:
• State Space - the “search space”: all the different solutions to a problem
• Neighborhood System - how all possible states are related to each other
  • Neighbors of a State - possible moves from the current state that alter the current solution, i.e. states connected by a “move”
• Temperature - a global, time-dependent parameter that controls the algorithm’s acceptance probability. The further along the annealing chain, the stricter acceptance becomes.
Application Specific Definitions:
1. Two solutions are said to be adjacent when they are within one move of each other. A move can be:
   1. Alternate Leaf
   2. Alternate Operator - change just one
   3. Prune Branch / Delete Leaf
   4. Split Leaf
   5. Grow Branch
Image Credit: Ruczinski et al.

Simulated Annealing - Application
Steps:
1. Given a certain state, move to an adjacent state in the search space.
2. If the new state is an improvement (by misclassification rate), accept it; otherwise accept it only with the acceptance probability, which depends on the score difference and the current temperature.
   1. The annealing chain is given temperature bounds (a high starting point and an end point) that dictate the cooling period. Because the temperature falls over time (the cooling period), fewer and fewer moves are accepted.
3. This is repeated until the chain reaches a predetermined number of iterations or reaches a “breakout point” (where repeated iterations lead to no moves).

A subtle point: adjacent states are compared and the move is either accepted or rejected, but an accepted “move” is not always an improvement over the originating state; it can be worse. A move is accepted if it is an improvement, or otherwise with a probability governed by the Temperature. This is the key difference between a greedy search strategy and SA. The ability of the chain to move through locally sub-optimal solutions is what allows it to eventually reach a global optimum. In theory, with unbounded run time and a sufficiently slow cooling schedule, the chain will find the global optimum.

Acceptance Rate Over Scores
Acceptance function = min(1, exp(-diff(scores)/temp))
Here diff(scores) is the new score minus the current score (lower is better), so an improvement is always accepted.

Visualization of Metropolis-Hastings Algorithm
[Figure labels: Est. of Global Optimum; The “Burn” - early discarded simulations]
Image Credit: Wikipedia

Parameters and Acceptance
• logreg.anneal.control(start = 1, end = -2, iter = 50000)
  • Start and End are on the log10 scale
• A start temperature that is too high can lead to wasted time, with the chain wandering and accepting every possible move (essentially a drug-addled unemployed graduate on his gap year in India).
• The chain should end without accepting more than 5% of moves; otherwise it will not converge properly. This can be monitored with the update parameter.
• If the end temperature is too low, the chain spends most of its time at the end rejecting moves and wasting time. Not the end of the world, and the creators of the package have implemented an optional early exit parameter: if a certain number of iterations go by without enough acceptances, the chain terminates.
• Like all Markov Chain applications, trial and error is the name of the game.

LogicForest
• Ensemble technique using LR as component classifiers
• Described in the paper:
  • “Logic Forest: an ensemble classifier for discovering logical combinations of binary markers.” Bioinformatics. 2010 Sep 1;26(17):2183-9. Epub 2010 Jul 13.
  • Less Readable
• Implemented in R in the package: LogicForest
• Introduces some powerful features and improves upon the basic LR model.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025651/pdf/btq354.pdf

Basics of RandomForest
• Invented (and trademarked) by Leo Breiman and Adele Cutler. It uses both Breiman’s “bagging” and controlled variation (stochastic discrimination) to create an incredibly powerful classifier.

Overly Simple Explanation of the Random Forest Algorithm
1. Select a random subset of attributes and observations from the training set.
2. Grow a tree using CART methodology to maximum size and do not prune.
3. Repeat n times.
4. Have each of these unpruned trees predict the label of a new observation and take the majority vote of all the trees.
• One of the most interesting aspects of RandomForest is that, because the attributes are iteratively subsetted (or “masked”), one can calculate the importance each attribute has on the outcome of the model.
• This randomization also allows for a measure of independence between the models and is the heart of the ensemble technique.
• A great intuitive discussion of this is on Day 11 of Statistical Aspects of Data Mining (YouTube)
http://www.springerlink.com/content/u0p06167n6173512/fulltext.pdf

LogicForest
• Unlike Random Forest, LogicForest does not build each component LR using a random subset of variables.
• The biggest difference from a classic RF is that the maximum size of the LR tree is randomized, not the search space.
• The inherent randomness of Simulated Annealing and the randomized tree size introduce the model independence that lies at the heart of ALL ensemble techniques.
• Also implements a RandomForest-style variable importance calculation
  • Iterative variable masking leading to variable and interaction importance.
• It also provides the proportion of trees that voted for a particular classification, giving a very useful ranking of predictive certainty.
• http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025651/pdf/btq354.pdf

Conclusion: Why LogicForest is the Bee’s Knees
• Variable Importance
  • Feature Selection
  • Interaction Importance
• High level of Accessibility. Almost a Clear-Box.
• Classification Confidence
  • Provides an ordinal ranking of confidence
• The use of Simulated Annealing to escape early split traps and the looming specter of the local optimum
• An interesting approach to binary data that is understandable.
  • It’s still just trees.
• Makes for a good introduction to more complex and advanced techniques (e.g. Markov Chains).
• Free (Let the angels sing on high for the R-core team and package contributors)

Fire up the Servers… it’s Demo Time
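
As a warm-up for the demo, here is a minimal Python sketch of the annealing loop from the Simulated Annealing slides. It is not the LogicReg implementation. Only the acceptance function min(1, exp(-diff(scores)/temp)) and the start = 1, end = -2 log10 bounds come from the slides. The function names, the linear-in-log10 cooling schedule, and the toy problem (minimize (x - 3)^2 over the integers, with neighbors one step away) are assumptions made for illustration.

```python
import math
import random


def acceptance_probability(old_score, new_score, temp):
    """min(1, exp(-diff/temp)), with diff = new - old and lower scores better."""
    diff = new_score - old_score
    if diff <= 0:
        return 1.0  # improvements (and ties) are always accepted
    return min(1.0, math.exp(-diff / temp))


def temperature(step, n_iter, start=1.0, end=-2.0):
    """Assumed schedule: log10(temp) moves linearly from start to end."""
    frac = step / max(n_iter - 1, 1)
    return 10.0 ** (start + frac * (end - start))


def anneal(score, neighbor, state, n_iter=5000, start=1.0, end=-2.0, seed=0):
    """Generic annealing loop; returns the best state seen and its score."""
    rng = random.Random(seed)
    current, current_score = state, score(state)
    best, best_score = current, current_score
    for step in range(n_iter):
        temp = temperature(step, n_iter, start, end)
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        # A worse candidate can still be accepted, more often while temp is high.
        if rng.random() < acceptance_probability(current_score, candidate_score, temp):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(x):
    """Toy objective (hypothetical): squared distance from 3."""
    return (x - 3) ** 2


def toy_neighbor(x, rng):
    """Toy move set (hypothetical): step one integer left or right."""
    return x + rng.choice([-1, 1])


best, best_score = anneal(toy_score, toy_neighbor, state=20)

if __name__ == "__main__":
    print("best state:", best, "score:", best_score)
```

In Logic Regression the state would be a logic tree, the neighbor function one of the moves (alternate leaf, alternate operator, prune, split, grow), and the score the misclassification rate; the loop itself stays the same.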