Introducing Apache Mahout
Scalable Machine Learning for All!
Grant Ingersoll
Lucid Imagination
Overview
• What is Machine Learning?
• Mahout
Definition
• “Machine Learning is programming
computers to optimize a performance
criterion using example data or past
experience”
– Introduction to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
– Many other fields: comp sci., biology,
math, psychology, etc.
Types
• Supervised
– Using labeled training data, create
function that predicts output of unseen
inputs
• Unsupervised
– Using unlabeled data, create function
that finds structure in the data (e.g., groupings)
• Semi-Supervised
– Uses labeled and unlabeled data
Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
– People still can help
Clustering
• Unsupervised
• Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses
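To make "find natural groupings" concrete, here is a minimal sketch of plain k-means in Python. It is not Mahout code: the function name `kmeans`, the sample points, and the choice of starting centroids are all assumptions made for illustration, and it runs in a single process rather than on Hadoop.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means on numeric tuples, starting from the given centroids."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments

# Hypothetical data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, [points[0], points[3]])
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups come only from distances between the points.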
Example: Clustering
Google News
Collaborative Filtering
• Unsupervised
• Recommend people and products
– User-User
• User likes X, you might too
– Item-Item
• People who bought X also bought Y
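The item-item idea ("people who bought X also bought Y") can be sketched as simple co-occurrence counting. This is an illustrative Python sketch, not the Taste API; the function `also_bought` and the basket data are made up for the example.

```python
from collections import Counter
from itertools import permutations

def also_bought(baskets):
    """For each item, count how often every other item shares a basket with it."""
    co_counts = {}
    for basket in baskets:
        # permutations gives both (x, y) and (y, x), so counts are symmetric.
        for x, y in permutations(set(basket), 2):
            co_counts.setdefault(x, Counter())[y] += 1
    return co_counts

# Hypothetical purchase histories, one list per customer.
baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "desk"],
    ["book", "lamp"],
]
co_counts = also_bought(baskets)
top_item, top_count = co_counts["book"].most_common(1)[0]
```

Raw counts favor popular items; a real recommender would usually normalize them with a similarity measure.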
Example: Collab Filtering
Amazon.com
Classification/Categorization
• Many, many types
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy
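As a small illustration of the spam filtering case, here is a sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is a toy in Python, not Mahout's classifier; the helper names `train` and `classify` and the four training messages are assumptions for the example.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Collect per-label word counts and per-label document counts."""
    word_counts, doc_counts = {}, Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data.
training = [
    ("spam", "win money now"),
    ("spam", "cheap money offer"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
word_counts, doc_counts = train(training)
label = classify("cheap money", word_counts, doc_counts)
```

The labeled examples are what make this supervised: the function learned from them is then applied to unseen messages.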
Example: NER
NER?
Excerpt from Yahoo News
Example: Categorization
Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking
Other
• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others
What is Apache Mahout?
• A Mahout is an elephant
trainer/driver/keeper, hence…
Hadoop (and other distributed techniques) + Machine Learning = Mahout
What?
• Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault tolerance
• Mahout brings:
– Library of machine learning algorithms
– Examples
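For readers new to the Map/Reduce model, the following single-process Python sketch shows the three stages (map, shuffle, reduce) on a word count. It only imitates the programming model; it does not use the Hadoop API, and the function names and input lines are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

# Hypothetical input; on a cluster each line could be mapped on a different node.
lines = ["the elephant", "the driver and the elephant"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each mapper call sees only its own line and each reducer call sees only one key, the work can be spread across machines, which is where the scalability comes from.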
Why Mahout?
• Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented
Why Mahout?
• Intelligent Apps are the Present and
Future
• Thus, Mahout’s Goal is:
– Scalable Machine Learning with Apache
License
Current Status
• What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering
• Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet
– Classifiers
• Naïve Bayes
• Complementary NB
– Evolutionary
• Integration with Watchmaker for fitness function
How?
• Examples
– Taste
– Clustering
– Classification
– Evolutionary
Taste: Movie Recommendations
• Given ratings by users of movies,
recommend other movies
• http://lucene.apache.org/mahout/taste.html#demo
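The task on this slide (given ratings by users, recommend other movies) can be sketched with a user-based approach: weight each other user's ratings by how similar that user is to the target user. This Python sketch is not how Taste is implemented and does not use its Java API; the names `cosine` and `recommend` and the ratings table are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank movies the user has not rated by similarity-weighted average rating."""
    weighted, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                weighted[movie] = weighted.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    scores = {m: weighted[m] / weights[m] for m in weighted}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4},
    "bob": {"Alien": 5, "Brazil": 4, "Casablanca": 5, "Dune": 2},
    "cat": {"Alien": 1, "Brazil": 5, "Casablanca": 2, "Dune": 5},
}
suggestions = recommend("ann", ratings)
```

Here "bob" rates the shared movies exactly as "ann" does, so his opinions of the unseen movies carry the most weight.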
Taste Demo
• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true
• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true
Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
• Each clustering impl. has an example
Job for running in
<MAHOUT_HOME>/examples
– o.a.mahout.clustering.syntheticcontrol.*
• Outputs clusters…
Classification: NB and CNB Examples
• 20 Newsgroups
– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Evolutionary
• Traveling Salesman
– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
What’s Next?
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• Text Clustering
• Association Rules (MAHOUT-108)
• Logistic Regression
• Solr Integration (SOLR-769)
• GSOC
When, Who
• When? Now!
– Mahout is growing
• Who? You!
– We want programmers who:
• Are comfortable with math
• Like to work on hard problems
– We want others to:
• Kick the tires
Where?
• http://lucene.apache.org/mahout
– Hadoop - http://hadoop.apache.org
• http://cwiki.apache.org/MAHOUT
• mahout-{user|dev}@lucene.apache.org
– http://www.lucidimagination.com/search/p:mahout
Resources
• “Programming Collective Intelligence”
by Segaran
• “Data Mining - Practical Machine
Learning Tools and Techniques” by
Witten and Frank
• “Taming Text” by Ingersoll and
Morton