slides - Spatial Database Group

Data Mining Concepts
Emre Eftelioglu
What is Knowledge Discovery in Databases?
• Data mining is actually one step of a larger process known as knowledge
discovery in databases (KDD).
• The KDD process model consists of six phases:
• Data selection
• Data cleansing
• Enrichment
• Data transformation
• Data mining
• Reporting and displaying discovered knowledge

[Figure: the KDD pipeline: Databases / Data Warehouses → Data Selection → Data Cleansing & Enrichment → Data Transformation → Data Mining → Reporting]
Data Warehouse
• A subject-oriented, integrated, non-volatile, time-variant collection
of data in support of management’s decisions.
• Used to understand and improve the performance of an organization.
• Designed for query and data retrieval, not for transaction processing.
• Contains historical data derived from transaction data, but can
include data from other sources.
• Data is consolidated, aggregated and summarized.
• The underlying engine used by business intelligence environments
to generate reports.
Why is Data Mining Needed?
Data is large-scale, high-dimensional, heterogeneous,
complex and distributed, and useful information must be
extracted from it.

Commercial Perspective
• Sales can be increased.
• Lots of data is on hand but not yet interpreted.
• Computers are cheap and powerful.
• There is competition between companies:
• Better service, more sales
• Easy access for customers
• Customized user experience
• Etc.

Scientific Perspective
• Data is collected and stored at enormous speeds.
• Traditional techniques are infeasible.
• Data mining improves/eases the work of scientists.

Overall
- There is often information “hidden” in the data that is
not readily evident.
- Even for small datasets, human analysts may take
weeks to discover useful information.
What is the Goal of Data Mining?
• Prediction:
• Determine how certain attributes will behave in the future.
• Credit request by a bank customer: by analyzing credit card
usage, a customer’s buying capacity can be predicted.
• Identification:
• Identify the existence of an item, event, or activity.
• Scientists are trying to identify life on Mars by analyzing
different soil samples.
• Classification:
• Partition data into classes or categories.
• A company can grade its employees by classifying them by
their skills.
• Optimization:
• Optimize the use of limited resources.
• UPS avoids left turns in order to save the gas wasted while idling.
What is Data Mining?
• Definition: Non-trivial extraction of implicit, previously unknown and
potentially useful information from data.
• Another definition: Discovering new information in terms of patterns
or rules from vast amounts of data.
• Data mining is an interdisciplinary field.
• Which disciplines does data mining use?
• Databases
• Statistics
• High Performance Computing
• Machine Learning
• Visualization
• Mathematics

Data Mining: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything" - Jeff Jonas, IBM
What is Not Data Mining?
• Use “Google” to check the price of an item.
• Getting statistical details about the historical price change of an item (e.g. max price, min price, average
price).
• Checking the sales of an item by color.
• Example:
• Green item sales– 500
• Blue item sales– 1000 etc.
So what would be a data mining question?
- How many items can we sell if we produce the same item in Red color?
- How will the sales change if we make a discount on the price?
- Is there any association between the sales of item X and item Y?
Why is Data Mining not Statistical Analysis?
Statistical analysis:
• Interpretation of results is difficult and daunting
• Requires expert user guidance
• Ill-suited for nominal and structured data types
• Completely data driven; incorporation of domain knowledge is not possible
What is the Difference between DBMS, OLAP and Data Mining?
• Task
• DBMS: extraction of detailed data
• OLAP / Data Warehouse: summaries, trends, reports
• Data Mining: knowledge discovery of hidden patterns, insights
• Result
• DBMS: information
• OLAP / Data Warehouse: analysis
• Data Mining: insight and future prediction
• Method
• DBMS: deduction (ask the question, verify the data)
• OLAP / Data Warehouse: model the data, aggregate, use statistics
• Data Mining: induction (build the model, apply to new data, get the result)
• Example
• DBMS: Who purchased the Apple iPhone 6 so far?
• OLAP / Data Warehouse: What is the average income of iPhone 6 buyers by region and month?
• Data Mining: Who will buy the new Apple Watch when it is on the market?
What types of Knowledge can be revealed?
• Association Rule Discovery (descriptive)
• Clustering (descriptive)
• Classification (predictive)
• Sequential Pattern Discovery (descriptive)
• Patterns Within Time Series (predictive)
Association Rule Mining
• Association rules are frequently used to generate rules from market-basket
data.
- A market-basket corresponds to the sets of items a consumer purchases
during one visit to a supermarket.
- Create dependency rules to predict occurrence of an item based on
occurrences of other items.
• The set of items purchased by a customer is known as an item set.
• An association rule is of the form X ⇒ Y, where X = {x1, x2, …, xn} and Y = {y1, y2,
…, ym} are sets of items, with xi and yj being distinct items for all i and all j.
• For an association rule to be of interest, it must satisfy a minimum support and
confidence.
Beer & Diapers as an urban legend:
- A father goes to the grocery store to get a big pack
of diapers after he gets off work.
- When he buys diapers, he decides to get a six
pack as well.
Association Rules - Confidence and Support
• Support:
• The minimum percentage of instances in the database
that contain all items listed in a given association rule.
• Support is the percentage of transactions that contain
all of the items in the item set, LHS U RHS.
• The rule X ⇒ Y holds with support s if s% of the transactions in
the database contain X ∪ Y.
• Confidence:
• The rule X ⇒ Y holds with confidence c if c% of the
transactions that contain X also contain Y
• Confidence can be computed as
• support(LHS U RHS) / support(LHS)
TID | Items
 1  | A, B
 2  | A, B, D
 3  | A, C, D
 4  | A, B, C
 5  | B, C

Support = Occurrence / Total Transactions (Total Transactions = 5)
Support {A, B} = 3/5 = 60%
Support {B, C} = 2/5 = 40%
Support {C, D} = 1/5 = 20%
Support {A, B, C} = 1/5 = 20%

Given X ⇒ Y: Confidence = Occurrence{X ∪ Y} / Occurrence{X}
Confidence {A ⇒ B} = 3/4 = 75%
Confidence {B ⇒ C} = 2/4 = 50%
Confidence {C ⇒ D} = 1/3 = 33%
Confidence {AB ⇒ C} = 1/3 = 33%
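The support and confidence computations above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the transaction list is the 5-row example above):

```python
# Minimal sketch of support/confidence for the 5-transaction example above.
transactions = [
    {"A", "B"},
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "B", "C"},
    {"B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS U RHS) / support(LHS) for the rule LHS => RHS."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "B"}, transactions))       # 0.6
print(confidence({"A"}, {"B"}, transactions))  # approximately 0.75
```

Running it reproduces the table: Support {A, B} = 60% and Confidence {A ⇒ B} = 75%.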
Association Rules - Apriori algorithm
• A general algorithm for generating association rules is a two-step process.
• Generate all item sets that have a support exceeding the given threshold. Item sets with this property
are called large or frequent item sets.
• Generate rules for each item set as follows:
• For a frequent item set X and each nonempty subset Y of X, let Z = X - Y (set difference);
• If Support(X)/Support(Z) ≥ minimum confidence, the rule Z ⇒ Y is a valid rule.
• The Apriori algorithm was the first algorithm used to generate association rules.
• The Apriori algorithm uses the general algorithm for creating association rules together with downward
closure and anti-monotonicity.
• For a k-item set to be frequent, every one of its subsets must also be frequent.
• To generate a k-item set: Use a frequent (k-1)-item set and extend it with a frequent 1-itemset.
Downward Closure: a subset of a large (frequent) itemset must also be large.
Anti-monotonicity: a superset of a small (infrequent) itemset is also small; such
an itemset does not have sufficient support to be considered for rule generation.
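The two-step process and the join/prune idea above can be sketched as follows (a hedged plain-Python illustration, not the original Apriori implementation; the names `apriori` and `rules` are my own):

```python
# A compact Apriori sketch: find frequent itemsets level by level
# (join + downward-closure prune), then derive rules meeting min confidence.
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = {}                                   # frozenset -> support fraction

    def count(candidates):
        frequent = set()
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                support[c] = s
                frequent.add(c)
        return frequent

    level = count({frozenset([i]) for t in transactions for i in t})
    while level:
        # Join: extend frequent (k-1)-itemsets by one item to get k-itemsets
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune (downward closure): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in support
                             for s in combinations(c, len(c) - 1))}
        level = count(candidates)
    return support

def rules(support, min_confidence):
    """For frequent X and each subset Y, emit Z => Y with Z = X - Y if confident."""
    out = []
    for x in support:
        for k in range(1, len(x)):
            for y in map(frozenset, combinations(x, k)):
                z = x - y
                conf = support[x] / support[z]
                if conf >= min_confidence:
                    out.append((z, y, conf))
    return out

# The 5-transaction example from the earlier slide
transactions = [frozenset(t) for t in
                ({"A", "B"}, {"A", "B", "D"}, {"A", "C", "D"},
                 {"A", "B", "C"}, {"B", "C"})]
freq = apriori(transactions, min_support=0.4)
```

With min_support = 40%, {B, D} and {C, D} are pruned, so no size-3 itemset survives; `rules(freq, 0.7)` then yields A ⇒ B (confidence 75%) among others.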
Complications seen with Association Rules
• The cardinality of item sets in most situations is extremely large.
• Association rule mining is more difficult when transactions show
variability in factors such as geographic location and seasons.
• Item classifications exist along multiple dimensions.
• Data quality is variable; data may be missing, erroneous, conflicting, or redundant.
13
What types of Knowledge can be revealed?
• Association Rule Discovery (descriptive)
• Clustering (descriptive)
• Classification (predictive)
• Sequential Pattern Discovery (descriptive)
• Patterns Within Time Series (predictive)
Clustering (1/2)
Motivation
• Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs.
• Land use: Identification of areas of similar land use in an earth observation
database.
• Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost.
• City-planning: Identifying groups of houses according to their house type, value,
and geographical location.
• Many more…
Clustering (2/2)
- Given a set of data points, find clusters such that:
- Records in one cluster are highly similar to each other and dissimilar from the records in
other clusters.
- Clustering is an unsupervised learning technique that does not require any prior knowledge of the clusters.
[Figure: Cluster Assignments and Centroids: a 2-D scatter plot showing Cluster 1, Cluster 2, Cluster 3 and their centroids.]
k-Means Algorithm (1/2)
• The k-Means algorithm is a simple yet effective clustering technique.
• The algorithm clusters observations into k groups, where k is provided as an input
parameter.
• It then assigns each observation to clusters based upon the observation’s proximity
to the mean center of the cluster.
• The cluster’s mean center is then recomputed and the process begins again.
• The algorithm stops when the mean centers no longer change.
The objective is to minimize the error (the sum of squared distances):

Error = Σ (i = 1..k) Σ (rⱼ ∈ Cᵢ) Distance(rⱼ, mᵢ)²

where Cᵢ is the i-th cluster and mᵢ is its mean center.
k-Means Algorithm (2/2)
1. Select initial k cluster centers.
2. Assignment Step: Assign each point in the dataset to the closest
cluster, based upon the Euclidean distance between the point and each
cluster center.
3. Update Step: Once all points are clustered, re-compute the cluster
centers as the arithmetic mean of the coordinates of the points
that belong to them.
4. Repeat steps 2-3 until the centers do not change (convergence).
[Figure: Example of k-Means execution on a smiley-face dataset: input dataset; 3 user-defined initial centers; assignment; update; assignment; final output.]
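Steps 1-4 above can be sketched in plain Python (a minimal illustration, not the slides' code; for simplicity the initial centers are just the first k points rather than a user-supplied choice):

```python
import math

def kmeans(points, k, max_iter=100):
    centers = [list(p) for p in points[:k]]            # step 1: initial k centers
    clusters = []
    for _ in range(max_iter):
        # Step 2 (assignment): each point joins the cluster whose center
        # is closest by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 3 (update): recompute each center as the arithmetic mean
        # of the coordinates of the points assigned to it.
        new_centers = [[sum(xs) / len(cl) for xs in zip(*cl)] if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                     # step 4: convergence
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of 2-D points
centers, clusters = kmeans([(1, 1), (1.5, 2), (10, 10), (10, 11)], k=2)
```

On this toy input the loop converges in a few iterations to one center per group: (1.25, 1.5) and (10, 10.5).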
k-Means Algorithm – Issues (1/2)
• The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results.
• Initial centroid selection is important for the final result of the algorithm since it converges to a local
minimum.
[Figure: The same input data set with two different initial selections of centroids (green/red) produces two different outputs; the result changes with the initial selection of centers.]
k-Means Algorithm – Issues (2/2)
• Density of the clusters may bias the results since k-Means depends on the Euclidean distance between
points and centroids.
[Figure: an input dataset with clusters of differing density.]
Classification vs. Clustering
• Classification is a supervised learning technique
• Classification techniques learn a method for predicting the class of a
data record from the pre-labeled (classified) records.
• Clustering is an unsupervised learning technique that finds clusters
of records without any prior knowledge.