Data Mining: An Introduction

Billy Mutell
“The Library of Babel” Analogy
Network of bookshelves with every book ever written
All the books one could possibly imagine must exist somewhere in
this library
Books have titles like ‘Axaxxas mlo’, ‘The Bible’ &
‘Tomorrow's Winning Lottery Numbers’
Roughly $25^{1,312,000}$, or about $1.956 \times 10^{1,834,097}$, volumes in the library
May be viewed as a metaphor for information in today’s society,
where the amount of data keeps growing but there is not enough
information
Content
•General Information
•Approaches to searching for information
•Project and plans
What is Data Mining?
• The nontrivial extraction of implicit,
previously unknown, and potentially
useful information from data
• The science of extracting useful
information from large data sets or
databases
How Did it Evolve to What We Have Today?
• As the amount of data grew, new techniques had to
be created to handle it
Information Retrieval
Database Management
Statistics
Data Mining
Algorithms
Machine Learning
Practical Applications
Government Intelligence
Insurance
Bank Finance
Branch Evaluation
Pharmaceutical Reactions in Patients
There are two models for mining data
Predictive: Makes predictions about data values using known results
found in other data
Includes: Regression, Classification, Time Series Analysis
Classification: Maps data into predefined groups
Example: Identifying potential credit risks
Time Series Analysis: Examining the value of an attribute as it varies over time
Example: Choosing stocks
There are two models for mining data
Descriptive: Identifies patterns or relationships in data
Includes: Clustering, Association Rules, Sequence Discovery
Clustering: Very similar to Classification, but groups are defined by data and not
predefined
Association Rules: Identifies specific types of data pairings
Example: If someone buys jelly, they’re probably buying
peanut butter
Sequence Discovery: Highlights patterns in temporal sequences
Example: If someone buys a CD player, they’ll probably buy
CDs within a week
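As a rough illustration of the association rule idea above (jelly and peanut butter), the sketch below computes the support and confidence of the rule "jelly => peanut butter". The basket data is made up, and the helper names `support` and `confidence` are assumptions for illustration, not part of the slides.

```python
# Hypothetical shopping baskets (made-up data, for illustration only).
baskets = [
    {"jelly", "peanut butter", "bread"},
    {"jelly", "peanut butter"},
    {"bread", "milk"},
    {"jelly", "bread"},
    {"peanut butter", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = support(antecedent | consequent, baskets)
    return both / support(antecedent, baskets)

rule_support = support({"jelly", "peanut butter"}, baskets)
rule_confidence = confidence({"jelly"}, {"peanut butter"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually kept only when both numbers clear thresholds chosen by the analyst.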
Information Analysis
• Statistical Based Algorithms
• Decision Tree Based Algorithms
• Rule Based Algorithms
• Distance Based Algorithms
Linear Regression Examples
Regression: Estimates an output value from input values by fitting
the input data to a formula that best matches the known outputs
$y_i = \alpha + \beta x_i + \varepsilon_i$
Statistical Based Algorithms
By determining the regression coefficients $\{c_0, c_1, \dots, c_n\}$, we
can estimate the relationship between the output parameter, $y$, and the
input parameters, $\{x_1, \dots, x_n\}$
$y = c_0 + c_1 x_1 + \dots + c_n x_n$
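To make the idea of determining regression coefficients concrete, here is a minimal sketch for the single-input case y = c0 + c1*x, using the standard closed-form least-squares estimates. The function name `fit_line` and the sample points are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares estimates (c0, c1) for the model y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c0 = mean_y - c1 * mean_x
    return c0, c1

# Made-up points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

c0, c1 = fit_line(xs, ys)
predicted = c0 + c1 * 4.0  # estimate the output for a new input x = 4
print(c0, c1, predicted)
```

With several inputs the same idea applies, but the coefficients are normally found with a linear algebra routine rather than by hand.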
Decision Tree Example: 20 Questions
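Each question splits the remaining candidates into two branches:
Dead or alive? -> Dead
Woman or man? -> Man
Mathematician or non-mathematician? -> Mathematician
Modern or ancient? -> Ancient
Answer: Pythagoras!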
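The 20 Questions tree can be sketched as a nested structure that is walked one answer at a time. Only the path ending at Pythagoras comes from the slide; the other leaves are placeholders, and the name `classify` and the answer strings are assumptions made for this example.

```python
# An internal node is (question, {answer: subtree}); a leaf is a plain string.
# Leaves marked "(other ...)" are placeholders that are not in the slide.
tree = ("dead or alive?", {
    "alive": "(other: someone alive)",
    "dead": ("woman or man?", {
        "woman": "(other: a dead woman)",
        "man": ("mathematician?", {
            "no": "(other: a dead non-mathematician)",
            "yes": ("modern or ancient?", {
                "modern": "(other: a modern mathematician)",
                "ancient": "Pythagoras",
            }),
        }),
    }),
})

def classify(node, answers):
    """Follow one answer per question from the root until a leaf is reached."""
    answers = iter(answers)
    while not isinstance(node, str):
        _question, branches = node
        node = branches[next(answers)]
    return node

guess = classify(tree, ["dead", "man", "yes", "ancient"])
print(guess)
```

The questions are always asked in the same order from the root down, which is the implied ordering that distinguishes trees from rule sets.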
Rule Based Algorithms
Works well to perform classification through if-then analysis
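$r = \langle a, c \rangle$, where $a$ is the antecedent (the "if" part) and $c$ is the consequent (the "then" part)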
Trees have an implied order in which there is splitting; rules
have no order
If $90 \le grade$, then $class = A$
If $80 \le grade$ and $grade < 90$, then $class = B$
If $70 \le grade$ and $grade < 80$, then $class = C$
If $60 \le grade$ and $grade < 70$, then $class = D$
If $grade < 60$, then $class = F$
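The grade rules can be written as a list of (antecedent, consequent) pairs. This is only a sketch: it assumes each lower bound is inclusive and each upper bound is exclusive, and the names `RULES` and `classify` are made up for illustration. Because exactly one rule matches any grade, shuffling the list does not change the result, which reflects the point that rules have no order.

```python
import random

# Each rule is (antecedent, consequent): a test on the grade and a class label.
RULES = [
    (lambda g: 90 <= g, "A"),
    (lambda g: 80 <= g < 90, "B"),
    (lambda g: 70 <= g < 80, "C"),
    (lambda g: 60 <= g < 70, "D"),
    (lambda g: g < 60, "F"),
]

def classify(grade, rules=RULES):
    """Return the class of the single rule whose antecedent matches."""
    matches = [label for test, label in rules if test(grade)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching rule, got {matches}")
    return matches[0]

# Shuffle a copy of the rules to show that evaluation order is irrelevant.
shuffled = RULES[:]
random.shuffle(shuffled)
print(classify(85), classify(85, shuffled))
```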
Parametric vs Nonparametric Models
Parametric Model- Describes the relationship between input and output
through algebraic equations where some parameters aren’t specified
Nonparametric Model- Data driven and more appropriate for mining
applications
Creates models based on input while Parametric Methods
assume models ahead of time
More flexible than Parametric Models and generally easier to
work with
Netflix: A Case Study
• Quest to improve
customer/movie
predictability through
data mining and
linear regression
• Teams win $1,000,000
prize
• Must beat Cinematch,
Netflix’s current
program to predict
movie preferences
• http://www.netflixprize.com/
What others have done so far:
“If I have seen further, it is by standing on the shoulders of giants.”
-Isaac Newton 1676
There are currently 31,443 contestants on 25,713 teams from 167
different countries.
Important to remember that everyone is given the same amount of
incomplete data, and we have to use that to predict the rest of the data
(unknown to us, known to Netflix)
Current leaders are from Budapest, Hungary, and they’ve
predicted the data 8.7% more accurately than Cinematch
K-Nearest Neighbor Algorithm (k-NN)
A set of pairs $(x_1, \theta_1), (x_2, \theta_2), \dots, (x_n, \theta_n)$ is given, where the $x_i$'s take values in a
metric space $X$ upon which is defined a metric $d$, and the $\theta_i$'s take values in the
set $\{1, 2, \dots, M\}$ of possible classes. Each $\theta_i$ is considered to be an index of the
category to which the $i$th individual belongs, and each $x_i$ is the outcome of the
set of measurements made upon that individual.
A new pair $(x, \theta)$ is given, where only the measurement $x$ is observable, and it
is desired to estimate $\theta$ by using information in the set of correctly classified
points. Thus, we will call
$x'_n \in \{x_1, x_2, \dots, x_n\}$
the nearest neighbor of $x$ if
$\min_{i = 1, 2, \dots, n} d(x_i, x) = d(x'_n, x)$
The Nearest-Neighbor classification decision method gives to $x$ the
category $\theta'_n$ of its nearest neighbor $x'_n$
K-Nearest Neighbor Algorithm (k-NN)
[Figure: an unclassified dot x among labeled triangles and squares]
If k=3, we classify the dot x as a triangle
If k=5, we classify the dot x as a square
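The effect of k on the vote can be reproduced with a small sketch. The coordinates below are invented so that the three nearest points favor the triangle and the five nearest favor the square; they are not taken from the figure, and `knn_classify` is a hypothetical helper name.

```python
import math
from collections import Counter

# Made-up labeled points around the query at the origin.
points = [
    ((1.0, 0.0), "TRIANGLE"),
    ((0.0, 1.2), "TRIANGLE"),
    ((-1.5, 0.0), "SQUARE"),
    ((0.0, -2.0), "SQUARE"),
    ((2.0, 1.0), "SQUARE"),
    ((3.0, 3.0), "TRIANGLE"),
]

def knn_classify(query, labeled_points, k):
    """Majority label among the k points closest to query (Euclidean distance)."""
    nearest = sorted(labeled_points, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

query = (0.0, 0.0)
label_k3 = knn_classify(query, points, k=3)
label_k5 = knn_classify(query, points, k=5)
print(label_k3, label_k5)
```

Using an odd k with two classes avoids tied votes, which this sketch does not otherwise handle.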
Name        Gender   Height (m)   Output
Kristina    F        1.6          Short
Jim         M        2            Tall
Maggie      F        1.9          Medium
Martha      F        1.88         Medium
Stephanie   F        1.7          Short
Bob         M        1.85         Medium
Kathy       F        1.6          Short
Dave        M        1.7          Short
Worth       M        2.2          Tall
Steven      M        2.1          Tall
Debbie      F        1.8          Medium
Todd        M        1.95         Medium
Kim         F        1.9          Medium
Amy         F        1.8          Medium
Wynette     F        1.75         Medium
Suppose we want to know what the
entry <Pat, F, 1.6> would be classified
as…
Set K=5 and find the K nearest neighbors:
<Kristina, F, 1.6> => SHORT
<Kathy, F, 1.6> => SHORT
<Stephanie, F, 1.7> => SHORT
<Dave, M,1.7> => SHORT
<Wynette, F, 1.75> => MEDIUM
Four of the five nearest neighbors are SHORT, so k-NN would classify
<Pat, F, 1.6> as SHORT
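The worked example can be checked with a short sketch. It assumes, as the neighbor list above suggests, that distance is measured on height alone (gender is ignored) and that heights are in meters. The helper name `k_nearest` is made up for illustration.

```python
from collections import Counter

# (name, gender, height, class) rows copied from the table in these slides.
people = [
    ("Kristina", "F", 1.6, "Short"), ("Jim", "M", 2.0, "Tall"),
    ("Maggie", "F", 1.9, "Medium"), ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"), ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"), ("Dave", "M", 1.7, "Short"),
    ("Worth", "M", 2.2, "Tall"), ("Steven", "M", 2.1, "Tall"),
    ("Debbie", "F", 1.8, "Medium"), ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"), ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]

def k_nearest(height, rows, k):
    """Return the k rows whose height is closest to the given height."""
    return sorted(rows, key=lambda row: abs(row[2] - height))[:k]

# Classify <Pat, F, 1.6> by majority vote among the 5 nearest heights.
neighbors = k_nearest(1.6, people, k=5)
votes = Counter(row[3] for row in neighbors)
label = votes.most_common(1)[0][0]
print([row[0] for row in neighbors], dict(votes), label)
```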
What I plan to do from here:
Take data from Netflix and sift through it
Develop a function that maps non-linear data to a linear format so
that it may be clustered and regressed
Map data to matrices in $\mathbb{R}^n$
Use Support Vector Machines to map input vectors to a higher
dimensional space where a maximal separating hyper-plane is
constructed
Create a way to interpret this data in the form of movie
recommendations
Also…
Use k-NN Approach along with Latent Semantic Indexing
techniques to analyze scripts and key thematic plots and look for
correlations/clusters
Questions?