An introduction to Logic Regression

DC Data Science Meetup
October 25, 2011
John Dennison
dennison.john@gmail.com
@johnsarealtwit
Other than getting drunk on Marck’s beer,
what should I get out of tonight?
• The thought process behind technique selection.
• What are the questions you ask yourself when deciding upon an
algorithm/technique?
• I’ll share some quick thoughts and then I hope to spark a more
open discussion.
• An introduction to Logic Regression and LogicForest
• Basic intro to CART and RandomForest
• What is a Logic Tree?
• Simulated Annealing
• A Short R Demo – Because nothing is as exciting as watching an
algorithm train
The Supervised Classification Problem
• Supervised vs. Unsupervised
• Labeled vs Unlabeled Data
• Supervised Learning:
• Use a set of pre-labeled/categorized data to train a classifier that can
predict the ‘label’ of previously unseen observations.
• Examples:
• Spam Filtering: Given a trove of emails previously classified as spam or
non-spam, train an algorithm to predict the ‘label’ of newly received emails.
• Heart Attack Prediction: Given a set of health indicators and historical
epidemiology records, predict the chance of a heart attack in new patients.
• Terms:
• “Dependent” variable, response, or outcome – what you are trying to predict
• “Predictor” or “independent” variables – what you use to predict
• Test/Training Set – randomly subsetted data used to validate the model and test for overfitting
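As a rough illustration of the test/training terminology, here is a minimal base R sketch. The data frame `toy`, its columns, and the 70/30 split ratio are all hypothetical choices made for this example, not something taken from the talk.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Hypothetical toy data: 100 observations, 3 binary predictors, 1 binary response
n <- 100
toy <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.5),
  x3 = rbinom(n, 1, 0.5)
)
toy$y <- as.integer(toy$x1 == 1 & toy$x2 == 0)  # response depends on x1 and x2 only

# Randomly assign 70% of the rows to the training set; the rest form the test set
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]
```

A model would be fitted on `train_set` only, with `test_set` held back to check for overfitting.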
Considerations with Technique Selection
• How is the model going to be used?
• Production vs. Exploration
• Who is going to be exposed to it?
• Personal Analysis vs. Public/Internal Consumption
• Business/Users vs. Nerds
• White Box vs. Black Box
• E.g. CART vs. SVM
• The importance of Interpretability
• Interpretability vs. Accuracy – Ideally a false dichotomy
• What fits the data?
• High Dimensionality – How the F’ do you deal with so many possible
predictors?
• Would recommend Breiman’s wonderful article “Statistical Modeling: The Two
Cultures”
Where are these vexing questions of morality, ontology and
efficient algorithm design answered?
Noob armed with a computer
• The double-edged sword of R?
• library(caret)
• Classification and Regression Training – WONDERFUL
• Fits.of.Numbing.Variety <- train(TrainData, Response, method = "XXXX", …)
• If you can’t understand it you probably shouldn’t be using it.
• If you can’t explain it, you probably don’t understand it.
• Fetishism of Complexity
• Thoughts?
Logic Regression
• Not Logistic Regression
• Logistic regression is a GLM (generalized linear model) that predicts the probability of an
outcome, though both are used on many of the same problems.
• Main Paper - Ruczinski I, Kooperberg C, LeBlanc ML
(2003): Logic Regression, Journal of Computational and
Graphical Statistics, 12 (3), pp 475-511.
• Very Readable
• Available as an R package
• library(LogicReg)
Logic Reg. Cont.
• “Logic Regression is an adaptive regression methodology
that attempts to construct predictors as Boolean
combinations of binary covariates.”
• The most important contribution that Logic Regression makes to
the field is its focus on the interactions among binary
predictors.
• This is a major difference from rival techniques, which usually measure the effect of each
predictor upon the response independently. When
they do consider interactions, it is only in a two- or three-way manner.
• Again, this focus is purely application specific. It may not be
important to you. However, it can be very interesting when the
binary predictors are highly correlated.
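To make the point about interactions concrete, here is a small base R sketch on a hypothetical four-row data set. The response is the exclusive-or of two binary predictors, a textbook case chosen for this example rather than one from the talk. Looked at one at a time, neither predictor appears to matter, yet a Boolean combination predicts the response perfectly.

```r
# Hypothetical example: the response is an interaction (XOR) of two binary predictors
grid <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
grid$y <- as.integer(xor(grid$x1 == 1, grid$x2 == 1))

# Measured one at a time, neither predictor shifts the mean response
tapply(grid$y, grid$x1, mean)  # 0.5 for x1 = 0 and for x1 = 1
tapply(grid$y, grid$x2, mean)  # 0.5 for x2 = 0 and for x2 = 1

# The Boolean combination predicts the response exactly
combo <- (grid$x1 == 1 & grid$x2 == 0) | (grid$x1 == 0 & grid$x2 == 1)
all(combo == (grid$y == 1))  # TRUE
```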
Example Logic Tree
Boolean Expression
The not so English English Translation
(One and not Two) and either [(Three and Four) or (Five and
either (Six or not Three))]
Image Credit: Ruczinski et al.
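The English translation above can be written directly as an R expression. In this sketch the names `X1` to `X6` stand for "One" to "Six"; the function name `logic_tree` and the exhaustive grid are hypothetical, added only to show how such a tree is evaluated on binary data.

```r
# Evaluate the example logic tree on binary (TRUE/FALSE) predictors.
# X1..X6 stand for "One" .. "Six" in the English translation.
logic_tree <- function(X1, X2, X3, X4, X5, X6) {
  (X1 & !X2) & ((X3 & X4) | (X5 & (X6 | !X3)))
}

# All 2^6 = 64 combinations of six binary predictors
grid <- expand.grid(rep(list(c(FALSE, TRUE)), 6))
names(grid) <- paste0("X", 1:6)
grid$L <- do.call(logic_tree, as.list(grid[paste0("X", 1:6)]))

sum(grid$L)  # number of combinations for which the tree evaluates to TRUE
```

Because `&`, `|`, and `!` are vectorised, the same function scores a whole data set of binary covariates in one call.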
How?
The Highly Unscientific Explanation: A
decision tree with a touch of Monte Carlo
Magic. – Simulated Annealing
CART:
CART algorithms implement a “greedy” impurity reduction strategy,
using Gini, entropy (information gain), or misclassification rate.
1. Search all attributes – calculate the potential reduction in impurity for each.
2. Split on the attribute with the greatest gain (in the binary world the split is T
or F).
3. Repeat with the remaining attributes until no further reduction in impurity
is possible or the preset maximum size is met.
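Steps 1 and 2 can be sketched in a few lines of base R. This is an illustration of one greedy split using Gini impurity on binary attributes, not the code of any CART package; the helper names `gini`, `gini_gain`, and `best_split`, and the toy data, are hypothetical.

```r
# Gini impurity of a vector of binary (0/1) labels
gini <- function(y) {
  if (length(y) == 0) return(0)
  p <- mean(y)
  2 * p * (1 - p)
}

# Impurity reduction from splitting labels y on one binary attribute x
gini_gain <- function(x, y) {
  left  <- y[x == 0]
  right <- y[x == 1]
  weighted <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  gini(y) - weighted
}

# Steps 1 and 2: score every attribute, then pick the one with the greatest gain
best_split <- function(X, y) {
  gains <- vapply(X, gini_gain, numeric(1), y = y)
  names(gains)[which.max(gains)]
}

# Hypothetical toy data in which the response is identical to attribute a
X <- data.frame(
  a = c(0, 0, 1, 1, 0, 1, 0, 1),
  b = c(0, 1, 0, 1, 0, 1, 0, 1),
  c = c(1, 1, 1, 0, 0, 0, 1, 0)
)
y <- c(0, 0, 1, 1, 0, 1, 0, 1)
best_split(X, y)  # "a"
```

Step 3 would apply `best_split` again within each of the two resulting subsets.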
Simulated Annealing
• A search technique for locating a good approximation of
the global optimum in a large search space.
• The inspiration comes from metalworking.
• An adaptation of the Monte Carlo method.
• Monte Carlo methods use repeated random sampling in order to
calculate results.
• Personally found this application to be a useful introduction to the
much larger field of Monte Carlo Methods.
• The Metropolis-Hastings algorithm can be viewed as Simulated
Annealing with the temperature kept constant.
Basic Components of Simulated Annealing
General Definitions:
• State Space – the “search space” – all the different solutions to a problem
• Neighborhood System – how all possible states are related to each other
• Neighbors of a State – possible moves from the current state that result in
altering the current solution, i.e. states connected by a “move”
• Temperature – a global, time-dependent parameter that controls the algorithm’s
acceptance probability. The further along the annealing chain, the stricter this
probability becomes.
Application Specific Definitions:
1. Two solutions are said to be adjacent when they are within one move of each other.
A move can be:
1. Alternate Leaf
2. Alternate Operator – change just one
3. Prune Branch / Delete Leaf
4. Split Leaf
5. Grow Branch
Image Credit: Ruczinski et al.
Simulated Annealing - Application
Steps:
1. Given a certain state, move to an adjacent state in the search space.
2. If the new state is an improvement (by misclassification rate), accept it; otherwise
test whether the score difference falls within the acceptance probability.
1. The annealing chain is given temperature bounds (a high starting
point and an end point) that dictate the cooling period. Because the
temperature falls over time (the cooling period), fewer and
fewer moves are accepted.
3. This is repeated until the chain reaches a predetermined number of
iterations or reaches a “breakout point” (where repeated iterations
lead to no moves).
A subtle point: adjacent states are compared and the move is either accepted or
rejected, but an accepted “move” is not always an improvement over the
originating state; it can be worse. A move is accepted if it is an improvement or if it
passes a probabilistic test governed by the Temperature. This is the key difference between
a greedy search strategy and SA. The ability of the chain to move through locally
sub-optimal solutions allows it to escape local optima. In theory, if run time is
unbounded and cooling is slow enough, this algorithm will find the global optimum.
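The steps above can be sketched on a toy problem in base R. Everything here is a hypothetical stand-in: the state space is ten positions on a line instead of logic trees, a "move" is one step left or right, and the cooling schedule is an assumed log-linear one. It is meant only to show how occasionally accepting a worse move lets the chain leave a local optimum.

```r
set.seed(1)

# Hypothetical 1-D landscape: lower score is better (think misclassification rate).
# State 3 is a local optimum; state 8 is the global optimum.
scores <- c(0.50, 0.40, 0.30, 0.45, 0.50, 0.35, 0.20, 0.10, 0.25, 0.40)

anneal <- function(scores, start, temps) {
  current <- start
  best <- start
  for (temp in temps) {
    # Propose an adjacent state: one step left or right, clamped to the valid range
    proposal <- min(max(current + sample(c(-1, 1), 1), 1), length(scores))
    delta <- scores[proposal] - scores[current]
    # Improvements are always accepted; worse moves are accepted with
    # probability exp(-delta / temp), which shrinks as the temperature falls
    if (runif(1) < min(1, exp(-delta / temp))) current <- proposal
    if (scores[current] < scores[best]) best <- current
  }
  list(final = current, best = best)
}

# Assumed cooling schedule: log-linear from 10^0 down to 10^-3
temps <- 10^seq(0, -3, length.out = 5000)
result <- anneal(scores, start = 3, temps = temps)
result$best  # best state visited by the chain
```

A greedy search started at state 3 would stop immediately, since both neighbours (states 2 and 4) score worse; the annealing chain can cross that ridge while the temperature is still high.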
Acceptance Rate Over Scores
Acceptance function = min(1,exp(-diff(scores)/temp))
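The acceptance function on this slide translates almost directly into R. In the sketch below, `accept_prob` is a hypothetical helper name, and the score differences and temperatures are made-up values chosen to show the effect of temperature.

```r
# Probability of accepting a move, given the change in score and the temperature.
# score_diff > 0 means the proposed state is worse (scores are minimised).
accept_prob <- function(score_diff, temp) {
  pmin(1, exp(-score_diff / temp))
}

accept_prob(-0.05, temp = 1)     # improvement: always accepted (probability 1)
accept_prob(0.05, temp = 10)     # hot chain: worse move accepted almost always
accept_prob(0.05, temp = 0.01)   # cold chain: worse move almost never accepted
```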
Visualization of Metropolis-Hastings Algorithm
Figure annotations: estimate of the global optimum; the “burn” (early discarded simulations)
Image Credit: Wikipedia
Parameters and Acceptance
• logreg.anneal.control(start = 1, end = -2, iter = 50000)
• Start and End are on the log10 scale.
• A start temperature that is too high can lead to wasted time, with the chain
wandering and accepting every possible move (essentially a drug-addled
unemployed graduate on his gap year in India).
• The chain should end without accepting more than 5% of moves; otherwise it will
not converge properly. This can be monitored with the update parameter.
• If the end temperature is too low, the chain spends most of its time at the end
rejecting the last moves and wasting time. Not the end of the world: the
creators of the package have implemented an optional early exit parameter. If a
certain number of iterations go by without a meaningful number of acceptances, the
chain terminates.
• Like all Markov Chain applications – Trial and Error is the name of the game.
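To get a feel for what start = 1 and end = -2 mean, the sketch below converts them from the log10 scale into temperatures in base R. The log-linear interpolation between the two bounds is an assumption made for illustration; it is not taken from the LogicReg source, and the score difference of 0.01 is a made-up value.

```r
# Values from the slide: start and end are exponents on the log10 scale
start <- 1
end   <- -2
iter  <- 50000

# Assumed log-linear cooling between the two bounds (illustrative only)
temps <- 10^seq(start, end, length.out = iter)

range(temps)  # temperatures run between 0.01 and 10

# Acceptance probability for a move that worsens the score by 0.01
worse_by <- 0.01
p_accept <- pmin(1, exp(-worse_by / temps))
p_accept[c(1, iter)]  # close to 1 at the start, much lower at the end
```

Plotting `p_accept` against the iteration number is one quick way to judge whether a chosen start is so hot that nearly every move is accepted for a long stretch.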
LogicForest
• Ensemble Technique using LR as component classifiers
• Described in paper:
• “Logic Forest: an ensemble classifier for discovering logical
combinations of binary markers.” Bioinformatics. 2010 Sep
1;26(17):2183-9. Epub 2010 Jul 13.
• Less Readable
• Implemented in R in the package: LogicForest
• Introduces some powerful features and improves upon
the basic LR model.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025651/pdf/btq354.pdf
Basics of RandomForest
• Invented (and trademarked) by Leo Breiman and Adele Cutler. It uses
both Breiman’s “bagging” and controlled variation (stochastic
discrimination) to create an incredibly powerful classifier.
Overly Simple Explanation of the Random Forest Algorithm
1. Select a random subset of attributes and observations from the training set.
2. Grow a tree using CART methodology to maximum size and do not prune.
3. Repeat n times.
4. Have each of these unpruned trees predict the label of a new
observation and take the majority vote of all the trees.
• One of the most interesting aspects of RandomForest is that, because the
attributes are iteratively subsetted (or “masked”), one can calculate the
importance that each attribute has on the outcome of the model.
• This randomization also allows for a measure of independence between the
models and is the heart of the ensemble technique.
• A great intuitive discussion of this is on Day 11 of Statistical Aspects of
Data Mining (YouTube)
http://www.springerlink.com/content/u0p06167n6173512/fulltext.pdf
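The ensemble idea can be sketched in base R under a heavy simplification: each "tree" here is a single-split stump rather than a full CART tree, and each stump assumes the attribute predicts the response directly. The data, the helper `fit_stump`, and the choice of 101 stumps are hypothetical; the point is only to show the random subset of observations and attributes followed by a majority vote.

```r
set.seed(7)

# Hypothetical toy data: y is 1 when at least two of x1, x2, x3 are 1; x4 is noise
n <- 200
X <- as.data.frame(matrix(rbinom(n * 4, 1, 0.5), ncol = 4,
                          dimnames = list(NULL, paste0("x", 1:4))))
y <- as.integer(X$x1 + X$x2 + X$x3 >= 2)

# Fit one stump: bootstrap the rows, sample the attributes, keep the best one
fit_stump <- function(X, y, n_attr = 2) {
  rows  <- sample(nrow(X), replace = TRUE)
  attrs <- sample(names(X), n_attr)
  Xb <- X[rows, attrs, drop = FALSE]
  yb <- y[rows]
  # Accuracy of predicting y directly from each sampled attribute
  acc <- vapply(Xb, function(x) mean(x == yb), numeric(1))
  names(acc)[which.max(acc)]
}

# Repeat n_trees times, then let every stump vote on each observation
n_trees <- 101
stumps <- replicate(n_trees, fit_stump(X, y))
votes  <- rowMeans(as.matrix(X[, stumps]))   # share of stumps voting for class 1
pred   <- as.integer(votes > 0.5)

mean(pred == y)  # training accuracy of the ensemble
```

An odd number of stumps is used so that the majority vote can never tie.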
LogicForest
• Unlike Random Forest, LogicForest does not build each component
LR tree using a random subset of variables.
• The biggest difference from a classic RF is that the maximum size of
the LR tree is randomized – not the search space.
• The inherent randomness of Simulated Annealing and the randomized tree size
introduce the model independence that lies at the heart of ALL ensemble
techniques.
• Also implements a RandomForest style variable importance
calculation
• Iterative Variable Masking leading to variable and Interaction importance.
• It also provides the proportion of trees that voted for a particular
classification giving a very useful ranking of predictive certainty
• http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025651/pdf/btq354.pdf
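The ranking of predictive certainty from vote proportions can be illustrated with a few lines of base R. The vote matrix below is made up for the example and does not come from the LogicForest package; it only shows how the share of trees agreeing with the majority orders the observations.

```r
# Hypothetical votes: rows are observations, columns are trees in the forest,
# each entry is that tree's predicted class (0 or 1)
votes <- rbind(
  obs1 = c(1, 1, 1, 1, 1),
  obs2 = c(1, 0, 1, 1, 0),
  obs3 = c(0, 0, 1, 0, 0)
)

prop_class1 <- rowMeans(votes)                     # proportion of trees voting for class 1
predicted   <- as.integer(prop_class1 > 0.5)       # majority-vote classification
certainty   <- pmax(prop_class1, 1 - prop_class1)  # share agreeing with the majority

# Rank observations from most to least certain
names(sort(certainty, decreasing = TRUE))  # "obs1" "obs3" "obs2"
```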
Conclusion: Why LogicForest is the Bee’s Knees
• Variable Importance
• Feature Selection
• Interaction Importance
• High level of Accessibility. Almost a Clear-Box.
• Classification Confidence
• Provides an ordinal ranking of confidence
• The use of Simulated Annealing to escape early split traps
and the looming specter of the local optimum
• An interesting approach to binary data that is
understandable.
• It’s still just trees.
• Makes for a good introduction to more complex and
advanced techniques (e.g. Markov chains).
• Free (Let the angels sing on high for the R Core Team and
package contributors)
Fire up the Servers… it’s Demo Time