Data Mining
Chetan Meshram
Class Id: 221

Agenda
- Introduction
- Data-Mining Applications
- Finding Frequent Sets of Items
- Examples
- The A-Priori Algorithm
- References

Introduction: Data Mining
- Offers opportunities to learn surprising facts from existing databases.
- Stresses both the query-optimization and data-management components of a database system, and calls for extensions such as new language primitives.
- A data-mining query invites the system to decide which portion of the data to focus on.
- Naïve implementations execute large decision-support queries and take a long time to complete.

Data-Mining Applications
Decision Tree Construction:
- A decision tree is designed to guide the separation of data into two sets.
- Each interior node has an attribute and a value that serves as a threshold.
- The children of a node are other interior nodes, or leaves representing a decision.
- The tree is constructed from a training set of tuples whose outcome is known.
- The data-mining problem is to design, from this data, the decision tree that most reliably makes the decision for new data.
- The best attribute A is assigned to the root, together with the best threshold value v for that attribute.

Data-Mining Applications
Clustering:
- Group data into a small number of groups such that the members of each group have something substantial in common.
- Search engines cluster web documents according to the words they use.
- Documents are placed in a space that has one dimension for each possible word, excluding the most common words.
- Data mining selects, from the data, the means (centers) of the clusters.

Finding Frequent Sets of Items
Market-Basket Analysis:
- Market-basket data is stored in a fact table: Baskets(basket, item)
- Given a support threshold s, the frequent sets of items are all sets of items that appear together in at least s baskets.
- Naïve way to find high-support pairs:

    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

Finding Frequent Sets of Items
- Computing the support for every pair of items i and j with the query above won't work in practice, because it is too expensive on large data.
- This query joins Baskets with itself, groups the resulting tuples, and throws away the groups that contain fewer than s baskets.
- The condition I.item < J.item in the WHERE clause prevents the same pair from being considered in both orders, and excludes pairs consisting of the same item twice.

A-Priori Algorithm
- If a set of items X has support s, then each subset of X must also have support at least s.

A-Priori Algorithm
- If a pair of items {i, j} appears in 1000 baskets, then there are at least 1000 baskets containing item i and at least 1000 baskets containing item j.
- First finds the set of candidate items: those that appear in a sufficient number of baskets.
- Runs the pair query on only the candidate items.
- As in the following queries, it computes Candidates, the subset of the Baskets relation restricted to candidate items, and then joins Candidates with itself.

A-Priori Algorithm

    INSERT INTO Candidates
    SELECT * FROM Baskets
    WHERE item IN (
        SELECT item FROM Baskets
        GROUP BY item
        HAVING COUNT(*) >= s
    );

    SELECT I.item, J.item, COUNT(I.basket)
    FROM Candidates I, Candidates J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(*) >= s;

- Here the algorithm first finds the frequent items, and only then finds the frequent pairs.

References
- http://en.wikipedia.org/wiki/DataMining
- http://en.wikipedia.org/wiki/DataMining A Priori

Questions?
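
Appendix: trying the queries on toy data

As a supplement to the A-Priori slides, the naïve pair query and the two-step A-Priori queries can be tried out on a small in-memory database. This is only a minimal sketch under stated assumptions: the toy rows, the helper name frequent_pairs, and the use of Python's sqlite3 module are illustrative choices that are not part of the original slides. The threshold s is passed as a bound parameter, and an ORDER BY is added so that the output order is deterministic.

```python
import sqlite3

# Hypothetical toy data: (basket id, item) rows for the Baskets relation.
# Assumes no basket lists the same item twice.
ROWS = [
    (1, "bread"), (1, "milk"), (1, "eggs"),
    (2, "bread"), (2, "milk"),
    (3, "bread"), (3, "milk"), (3, "jam"),
    (4, "milk"), (4, "eggs"),
]

# The pair-counting query from the slides, parameterized by table name.
# The threshold s is bound to the "?" placeholder at execution time.
PAIR_QUERY = """
    SELECT I.item, J.item, COUNT(I.basket)
    FROM {table} I, {table} J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?
    ORDER BY I.item, J.item
"""


def frequent_pairs(rows, s):
    """Return (naive_result, apriori_result) for support threshold s."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
    conn.execute("CREATE TABLE Candidates (basket INTEGER, item TEXT)")
    conn.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)

    # Naive approach: self-join the full Baskets relation.
    naive = conn.execute(PAIR_QUERY.format(table="Baskets"), (s,)).fetchall()

    # A-Priori step 1: keep only rows whose item is frequent on its own.
    conn.execute(
        """
        INSERT INTO Candidates
        SELECT * FROM Baskets
        WHERE item IN (
            SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?
        )
        """,
        (s,),
    )
    # A-Priori step 2: run the same pair query on the smaller relation.
    apriori = conn.execute(PAIR_QUERY.format(table="Candidates"), (s,)).fetchall()

    conn.close()
    return naive, apriori


naive, apriori = frequent_pairs(ROWS, s=2)
print(naive)
print(apriori)
```

With s = 2, the item "jam" appears in only one basket, so it is dropped from Candidates before the self-join. Both approaches should report the same frequent pairs; the difference is that the A-Priori join runs over fewer rows.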