
Data Mining
Chetan Meshram
Class Id:221
Agenda
Introduction
Data-Mining Applications
Finding Frequent Sets of Items
Examples
The A-Priori Algorithm
References

Introduction – Data Mining
Data mining offers opportunities to learn surprising facts from existing databases.
It stresses both the query-optimization and data-management components of a system, as well as extensions such as language primitives.
A data-mining query invites the system to decide which portion of the data to focus on.
Naïve implementations result in the execution of large decision-support queries that take a long time to complete.

Data-Mining Applications

Decision Tree Construction
A decision tree is designed to guide the separation of data into two sets.
Each interior node has an attribute and a value that serves as a threshold.
The children of a node are either other interior nodes or leaves representing a decision.
The tree is constructed from a training set of tuples whose outcome is known.
The data-mining problem is to design from this data the decision tree that most reliably makes the decision for a new tuple.
The best attribute A is assigned to the root, together with the best threshold value v for that attribute.
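
As a rough illustration of choosing the root, the sketch below tries every attribute and every observed value as a threshold and keeps the pair (A, v) that misclassifies the fewest training tuples. The slides do not say how "best" is measured, so counting training errors is an assumption here, and the function name `best_split` and the sample data are hypothetical.

```python
from typing import List, Tuple

# Hypothetical training set: each row is (attribute values, known outcome).
Row = Tuple[List[float], bool]


def best_split(rows: List[Row]) -> Tuple[int, float, int]:
    """Return (attribute index A, threshold v, training errors) for the root.

    A tuple is predicted True when its value for attribute A is >= v;
    the reversed rule is also considered, and the smaller error count is kept.
    Assumption: "best" means fewest misclassified training tuples.
    """
    best = None
    n_attrs = len(rows[0][0])
    for a in range(n_attrs):
        # Candidate thresholds: every distinct value seen for this attribute.
        for v in sorted({values[a] for values, _ in rows}):
            wrong = sum((values[a] >= v) != outcome for values, outcome in rows)
            errors = min(wrong, len(rows) - wrong)  # allow the reversed rule
            if best is None or errors < best[2]:
                best = (a, v, errors)
    return best


training = [
    ([25.0, 30.0], False),
    ([40.0, 80.0], True),
    ([35.0, 20.0], False),
    ([50.0, 95.0], True),
    ([60.0, 25.0], False),
]
root_attribute, root_threshold, root_errors = best_split(training)
```

On this sample the second attribute with threshold 80.0 separates the outcomes without error. A full tree builder would repeat the same search on each of the two resulting sets.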
Data-Mining Applications
Clustering: group data into a small number of groups such that the members of each group have something substantial in common.
Search engines:
Cluster web documents according to the words they use.
Place documents in a space that has one dimension for each possible word, excluding the most common words.
The data-mining problem is to take the data and select the means, or centers, of the clusters.
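
The slides do not name a specific clustering method. As one possible sketch, the code below maps each document to a word-count vector (one dimension per word, with an assumed stop-word list standing in for "the most common words") and runs a few rounds of k-means to pick the cluster centers. The names `to_vector` and `k_means`, the stop-word list, and the sample documents are all illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple

# Assumed stop-word list standing in for "the most common words".
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def to_vector(text: str, vocabulary: List[str]) -> List[float]:
    """Place a document in word space: one dimension per vocabulary word."""
    counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
    return [float(counts[w]) for w in vocabulary]


def squared_distance(p: List[float], q: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(p, q))


def k_means(points: List[List[float]], k: int,
            rounds: int = 10) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means; the first k points are the initial centers (assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(rounds):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


documents = [
    "the goal of the striker and the keeper",
    "the query and the index of a database",
    "striker keeper goal goal",
    "database index query query",
]
vocabulary = sorted({w for d in documents for w in d.lower().split()} - STOP_WORDS)
vectors = [to_vector(d, vocabulary) for d in documents]
labels, centers = k_means(vectors, k=2)
```

With these four documents the two sports texts end up in one cluster and the two database texts in the other.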
Finding Frequent Sets of Items

Market-Basket Analysis:
Market-basket data fact table:
Baskets(basket, item)
Given a support threshold s, the frequent sets of items are all those sets that appear together in at least s baskets.
Naïve way to find high-support pairs:

SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;
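
To try the naïve query, it can be run against a small in-memory SQLite table. This is only a sketch: the function name `frequent_pairs_naive`, the sample baskets, and the added ORDER BY (used just to make the output order predictable) are assumptions, and each (basket, item) row is assumed to occur once.

```python
import sqlite3


def frequent_pairs_naive(baskets, s):
    """Run the naive self-join query on an in-memory Baskets(basket, item) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
    conn.executemany("INSERT INTO Baskets VALUES (?, ?)", baskets)
    rows = conn.execute(
        """
        SELECT I.item, J.item, COUNT(I.basket)
        FROM Baskets I, Baskets J
        WHERE I.basket = J.basket AND I.item < J.item
        GROUP BY I.item, J.item
        HAVING COUNT(I.basket) >= ?
        ORDER BY I.item, J.item
        """,
        (s,),
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample data: (basket id, item) rows, each pair listed once.
sample_baskets = [
    (1, "bread"), (1, "milk"), (1, "beer"),
    (2, "bread"), (2, "milk"),
    (3, "bread"), (3, "beer"),
    (4, "milk"), (4, "eggs"),
]
naive_pairs = frequent_pairs_naive(sample_baskets, 2)
```

With threshold 2, the pairs beer/bread and bread/milk are returned, each with a count of 2.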
Finding Frequent Sets of Items
Computing the support for every pair of items i and j with the query above is impractical when the data is large.
The query joins Baskets with itself, groups the resulting tuples by item pair, and discards groups with fewer than s baskets.
The condition I.item < J.item in the WHERE clause prevents the same pair from being counted in both orders and rules out a pair consisting of the same item twice.

A-Priori Algorithm

If a set of items X has support s, then
each subset of X must also have support
at least s.
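
A quick way to see this property is to count supports directly. The sketch below uses a hypothetical `support` helper and made-up baskets to check that every proper non-empty subset of a set X appears in at least as many baskets as X itself.

```python
from itertools import combinations


def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


# Hypothetical sample baskets, each a set of items.
baskets = [
    {"bread", "milk", "beer"},
    {"bread", "milk"},
    {"bread", "beer"},
    {"milk", "eggs"},
]

X = frozenset({"bread", "milk"})
support_X = support(X, baskets)

# Support of every proper, non-empty subset of X.
subset_supports = {
    subset: support(frozenset(subset), baskets)
    for r in range(1, len(X))
    for subset in combinations(sorted(X), r)
}
property_holds = all(count >= support_X for count in subset_supports.values())
```

Here X = {bread, milk} has support 2, while bread and milk alone each have support 3.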


Example: if a pair of items {i, j} appears in 1000 baskets, then item i appears in at least 1000 baskets, and so does item j.

A-Priori Algorithm:
First find the set of candidate items, those that appear in a sufficient number of baskets.
Then run the pair query on only the candidate items.
The following queries compute Candidates, the subset of the Baskets relation restricted to candidate items, and then join Candidates with itself.
A-Priori Algorithm
INSERT INTO Candidates
SELECT * FROM Baskets
WHERE item IN (
    SELECT item FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);

SELECT I.item, J.item, COUNT(I.basket)
FROM Candidates I, Candidates J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(*) >= s;

Here the algorithm first finds frequent items before finding frequent pairs.
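
The two queries can be exercised the same way as the naïve one. The sketch below is an illustration under assumptions: the Candidates table definition, the function name `frequent_pairs_apriori`, the sample data, and the added ORDER BY are not part of the slides.

```python
import sqlite3


def frequent_pairs_apriori(baskets, s):
    """Two-pass A-Priori for pairs: filter to frequent items, then self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
    # Assumed schema: Candidates has the same columns as Baskets.
    conn.execute("CREATE TABLE Candidates (basket INTEGER, item TEXT)")
    conn.executemany("INSERT INTO Baskets VALUES (?, ?)", baskets)

    # Pass 1: keep only the rows whose item appears in at least s baskets.
    conn.execute(
        """
        INSERT INTO Candidates
        SELECT * FROM Baskets
        WHERE item IN (
            SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?
        )
        """,
        (s,),
    )
    candidate_rows = conn.execute("SELECT COUNT(*) FROM Candidates").fetchone()[0]

    # Pass 2: the pair query, now over the smaller Candidates table.
    pairs = conn.execute(
        """
        SELECT I.item, J.item, COUNT(I.basket)
        FROM Candidates I, Candidates J
        WHERE I.basket = J.basket AND I.item < J.item
        GROUP BY I.item, J.item
        HAVING COUNT(*) >= ?
        ORDER BY I.item, J.item
        """,
        (s,),
    ).fetchall()
    conn.close()
    return candidate_rows, pairs


# Hypothetical sample data: (basket id, item) rows, each pair listed once.
sample_rows = [
    (1, "bread"), (1, "milk"), (1, "beer"),
    (2, "bread"), (2, "milk"),
    (3, "bread"), (3, "beer"),
    (4, "milk"), (4, "eggs"),
]
candidate_count, apriori_pairs = frequent_pairs_apriori(sample_rows, 2)
```

With threshold 2, the infrequent item eggs is dropped in the first pass, so the self-join runs over 8 rows instead of 9 and returns the same frequent pairs as the naïve query. On real data the saving depends on how many items fall below the threshold.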

References
http://en.wikipedia.org/wiki/DataMining
http://en.wikipedia.org/wiki/DataMining A Priori

Questions?