1. Explain the FP-growth algorithm for discovering frequent itemsets. What are its limitations? [Dec-14/Jan 2015] [8 marks]

The FP-growth algorithm: mining frequent patterns without candidate generation [Han, Pei & Yin 2000]
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoids costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only
Method (divide-and-conquer)
– For each item, construct its conditional pattern base, and then its conditional FP-tree.
– Repeat the process on each newly created conditional FP-tree.
– Stop when the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern).

2. Explain descriptive tasks in detail. [June/July 2014] [10 marks]

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems.
Therefore, it is necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:

3. Develop the Apriori algorithm for generating frequent itemsets. [Dec-14/Jan 2015] [8 marks], [Dec 13] [10 marks], [June/July-15] [10 marks]

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted $L_1$. Next, $L_1$ is used to find $L_2$, the set of frequent 2-itemsets, which is used to find $L_3$, and so on, until no more frequent k-itemsets can be found. Finding each $L_k$ requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space. We first describe this property and then show how it is used.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

The Apriori property is based on the following observation. By definition, if an itemset $I$ does not satisfy the minimum support threshold, $\mathit{min\_sup}$, then $I$ is not frequent; that is, $P(I) < \mathit{min\_sup}$. If an item $A$ is added to the itemset $I$, then the resulting itemset (i.e., $I \cup A$) cannot occur more frequently than $I$. Therefore, $I \cup A$ is not frequent either; that is, $P(I \cup A) < \mathit{min\_sup}$.
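The level-wise search and the Apriori property described above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the parameter `min_count` (a minimum support count rather than a fraction), and the toy transactions are all assumptions made for the example. Candidates are formed here simply as unions of two frequent k-itemsets that have k+1 items, and any candidate with an infrequent k-subset is discarded before the database is scanned.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset.

    Hypothetical helper for illustration; min_count is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}

    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Candidate (k+1)-itemsets: unions of two frequent k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Apriori property: drop candidates that have an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        # One full scan of the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# Toy market-basket data, made up for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum count of 3, {bread, beer} is not frequent, so {bread, diapers, beer} is discarded without being counted; this is the saving the Apriori property provides.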
This property belongs to a special category of properties called antimonotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotone because the property is monotonic in the context of failing a test.

"How is the Apriori property used in the algorithm?" To understand this, let us look at how $L_{k-1}$ is used to find $L_k$ for $k \ge 2$. A two-step process is followed, consisting of join and prune actions.

1. The join step: To find $L_k$, a set of candidate k-itemsets is generated by joining $L_{k-1}$ with itself. This set of candidates is denoted $C_k$. Let $l_1$ and $l_2$ be itemsets in $L_{k-1}$. The notation $l_i[j]$ refers to the $j$th item in $l_i$ (e.g., $l_1[k-2]$ refers to the second to the last item in $l_1$). By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the $(k-1)$-itemset $l_i$, this means that the items are sorted such that $l_i[1] < l_i[2] < \cdots < l_i[k-1]$. The join, $L_{k-1} \bowtie L_{k-1}$, is performed, where members of $L_{k-1}$ are joinable if their first $(k-2)$ items are in common. That is, members $l_1$ and $l_2$ of $L_{k-1}$ are joined if $(l_1[1] = l_2[1]) \wedge (l_1[2] = l_2[2]) \wedge \cdots \wedge (l_1[k-2] = l_2[k-2]) \wedge (l_1[k-1] < l_2[k-1])$. The condition $l_1[k-1] < l_2[k-1]$ simply ensures that no duplicates are generated. The resulting itemset formed by joining $l_1$ and $l_2$ is $\{l_1[1], l_1[2], \ldots, l_1[k-2], l_1[k-1], l_2[k-1]\}$.

2. The prune step: $C_k$ is a superset of $L_k$; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in $C_k$. A scan of the database to determine the count of each candidate in $C_k$ would result in the determination of $L_k$ (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to $L_k$). $C_k$, however, can be huge. To reduce its size, the Apriori property is applied: any candidate that has a $(k-1)$-subset not in $L_{k-1}$ cannot be frequent and is removed from $C_k$.

4. What is association analysis?
[Dec 13/Jan-14] [10 marks]

"What is keyword-based association analysis?" Such analysis collects sets of keywords or terms that occur frequently together and then finds the association or correlation relationships among them. Like most analyses in text databases, association analysis first preprocesses the text data by parsing, stemming, removing stop words, and so on, and then invokes association mining algorithms. In a document database, each document can be viewed as a transaction, while a set of keywords in the document can be considered as a set of items in the transaction. That is, the database is in the format {document_id, a_set_of_keywords}. The problem of keyword association mining in document databases is thereby mapped to item association mining in transaction databases, where many interesting methods have been developed.

Notice that a set of frequently occurring consecutive or closely located keywords may form a term or a phrase. The association mining process can help detect compound associations, that is, domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President, George W. Bush], or noncompound associations, such as [dollars, shares, exchange, total, commission, stake, securities]. Mining based on these associations is referred to as "term-level association mining" (as opposed to mining on individual words). Term recognition and term-level association mining enjoy two advantages in text analysis: (1) terms and phrases are automatically tagged, so there is no need for human effort in tagging documents; and (2) the number of meaningless results is greatly reduced, as is the execution time of the mining algorithms. With such term and phrase recognition, term-level mining can be invoked to find associations among a set of detected terms and keywords.
Some users may like to find associations between pairs of keywords or terms from a given set of keywords or phrases, whereas others may wish to find the maximal set of terms occurring together. Therefore, based on user mining requirements, standard association mining or max-pattern mining algorithms may be invoked.

5. List the measures used for evaluating association patterns. [Dec-14/Jan 2015] [4 marks]

Evaluation of Association Patterns
1. Subjective vs. Objective Measures of Interestingness
2. Objective Measures of Interestingness
3. Interest Factor
4. Correlation Analysis
5. IS Measure
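Three of the objective measures listed above can be computed directly from a 2x2 contingency table for a pair of items A and B. The sketch below assumes the standard definitions: interest factor $s(A,B) / (s(A)\,s(B))$, the phi correlation coefficient for binary variables, and the IS measure $s(A,B) / \sqrt{s(A)\,s(B)}$. The function name `association_measures`, the argument names `f11`, `f10`, `f01`, `f00` (counts of transactions containing both items, only A, only B, and neither), and the example counts are assumptions chosen for illustration.

```python
from math import sqrt


def association_measures(f11, f10, f01, f00):
    """Compute interest factor, phi correlation, and IS from 2x2 counts.

    f11: both A and B, f10: A only, f01: B only, f00: neither.
    Hypothetical helper for illustration.
    """
    n = f11 + f10 + f01 + f00
    s_a = (f11 + f10) / n   # support of A
    s_b = (f11 + f01) / n   # support of B
    s_ab = f11 / n          # support of {A, B}

    interest = s_ab / (s_a * s_b)
    phi = (f11 * f00 - f10 * f01) / sqrt(
        (f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00)
    )
    is_measure = s_ab / sqrt(s_a * s_b)
    return {"interest": interest, "phi": phi, "is": is_measure}


# Made-up counts for 1000 transactions.
measures = association_measures(f11=150, f10=50, f01=650, f00=150)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

For these counts the interest factor is below 1 and the phi coefficient is slightly negative, which suggests a weak negative association between the two items even though they co-occur in 150 transactions.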