databases algorithm

1. Explain the FP-growth algorithm for discovering frequent itemsets. What are its
limitations? [Dec-14/Jan 2015] [8 marks]
The FP-growth algorithm: mining frequent patterns without candidate generation [Han, Pei &
Yin 2000]
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only
Method (divide-and-conquer)
– For each item, construct its conditional pattern base, and then its conditional FP-tree.
– Repeat the process on each newly created conditional FP-tree.
– Stop when the resulting FP-tree is empty, or it contains only one path (a single path generates all
the combinations of its sub-paths, each of which is a frequent pattern).
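The divide-and-conquer method above can be sketched in Python as follows. This is a minimal illustration, not the implementation from the original paper: the names `Node`, `build_tree`, and `fp_growth` and the toy transactions are assumptions made for the example, and the single-path shortcut is omitted (the recursion simply continues until the conditional FP-tree is empty).

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, count) pairs.

    Returns (header, frequent): header maps each frequent item to the list
    of tree nodes carrying it; frequent maps each item to its support count.
    """
    # Scan 1: support count of every item.
    support = defaultdict(int)
    for items, count in weighted_transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: c for i, c in support.items() if c >= min_count}

    # Scan 2: insert transactions, frequent items in descending support order.
    root = Node(None, None)
    header = defaultdict(list)
    for items, count in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_count, suffix=()):
    """Yield (frozenset_of_items, support_count) for every frequent itemset."""
    header, frequent = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        pattern = (item,) + suffix
        yield frozenset(pattern), frequent[item]
        # Conditional pattern base: the prefix path above each node of `item`,
        # weighted by that node's count.
        base = []
        for node in nodes:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recurse on the conditional FP-tree built from that base.
        yield from fp_growth(base, min_count, pattern)


# Hypothetical toy database; each transaction has weight 1.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
patterns = dict(fp_growth([(t, 1) for t in transactions], min_count=3))
for itemset, count in sorted(patterns.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

Note that the database is scanned only twice to build the first tree; every later step works on conditional pattern bases, which is what "sub-database test only" refers to.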
2. Explain descriptive tasks in detail. [June/July 2014] [10 marks]
Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database
systems, statistics, machine learning, visualization, and information science. Moreover, depending
on the data mining approach used, techniques from other disciplines may be applied, such as neural
networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming,
or high-performance computing.
Depending on the kinds of data to be mined or on the given data mining application, the data
mining system may also integrate techniques from spatial data analysis, information retrieval,
pattern recognition, image analysis, signal processing, computer graphics, Web technology,
economics, business, bioinformatics, or psychology. Because of the diversity of disciplines
contributing to data mining, data mining research is expected to generate a large variety of data
mining systems. Therefore, it is necessary to provide a clear classification of data mining systems,
which may help potential users distinguish between such systems and identify those that best
match their needs. Data mining systems can be categorized according to various criteria, as
follows:
3. Develop the Apriori algorithm for generating frequent itemsets. [Dec-14/Jan
2015] [8 marks], [Dec 13] [10 marks], [June/July-15] [10 marks]
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules. The name reflects the fact that the algorithm uses prior
knowledge of frequent itemset properties, as we shall see below. Apriori employs an iterative
approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First,
the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each
item and collecting those items that satisfy minimum support. The resulting set is denoted $L_1$.
Next, $L_1$ is used to find $L_2$, the set of frequent 2-itemsets, which is used to find $L_3$,
and so on, until no more frequent k-itemsets can be found. Finding each $L_k$ requires one full
scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property
called the Apriori property is used to reduce the search space. We first describe this property and
then show how the algorithm uses it.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset $I$ does
not satisfy the minimum support threshold, min_sup, then $I$ is not frequent; that is,
$P(I) < \mathit{min\_sup}$. If an item $A$ is added to the itemset $I$, then the resulting itemset
(i.e., $I \cup A$) cannot occur more frequently than $I$. Therefore, $I \cup A$ is not frequent
either; that is, $P(I \cup A) < \mathit{min\_sup}$. This property belongs to a special category of
properties called antimonotone, in the sense that if a set cannot pass a test, all of its supersets
will fail the same test as well. It is called antimonotone because the property is monotonic in the
context of failing a test.
“How is the Apriori property used in the algorithm?” To understand this, let us look at how
$L_{k-1}$ is used to find $L_k$ for $k \geq 2$. A two-step process is followed, consisting of join
and prune actions.
1. The join step: To find $L_k$, a set of candidate k-itemsets is generated by joining $L_{k-1}$
with itself. This set of candidates is denoted $C_k$. Let $l_1$ and $l_2$ be itemsets in $L_{k-1}$.
The notation $l_i[j]$ refers to the jth item in $l_i$ (e.g., $l_1[k-2]$ refers to the
second-to-last item in $l_1$). By convention, Apriori assumes that items within a transaction or
itemset are sorted in lexicographic order. For the (k-1)-itemset $l_i$, this means that the items
are sorted such that $l_i[1] < l_i[2] < \cdots < l_i[k-1]$. The join, $L_{k-1} \bowtie L_{k-1}$, is
performed, where members of $L_{k-1}$ are joinable if their first (k-2) items are in common. That
is, members $l_1$ and $l_2$ of $L_{k-1}$ are joined if
$(l_1[1] = l_2[1]) \land (l_1[2] = l_2[2]) \land \cdots \land (l_1[k-2] = l_2[k-2]) \land (l_1[k-1] < l_2[k-1])$.
The condition $l_1[k-1] < l_2[k-1]$ simply ensures that no duplicates are generated. The resulting
itemset formed by joining $l_1$ and $l_2$ is $l_1[1], l_1[2], \ldots, l_1[k-2], l_1[k-1], l_2[k-1]$.
2. The prune step: $C_k$ is a superset of $L_k$; that is, its members may or may not be frequent,
but all of the frequent k-itemsets are included in $C_k$. A scan of the database to determine the
count of each candidate in $C_k$ would result in the determination of $L_k$ (i.e., all candidates
having a count no less than the minimum support count are frequent by definition, and therefore
belong to $L_k$). $C_k$, however, can be huge. To reduce its size before the scan, the Apriori
property is applied: any candidate with a (k-1)-subset that is not in $L_{k-1}$ cannot be frequent
and is removed from $C_k$.
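The join and prune steps, together with the level-wise loop, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: itemsets are represented as sorted tuples, the function names `apriori_gen` and `apriori` and the toy transactions are chosen for illustration, and candidate counting is a plain nested loop rather than an optimized structure such as a hash tree.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for a in prev:
        for b in prev:
            # Join step: first k-2 items agree, and a's last item < b's last item.
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                cand = a + (b[k - 2],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset_tuple: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: one scan to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(level)

    k = 2
    while level:
        candidates = apriori_gen(level, k)
        # One full scan of the database per level to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database and minimum support count.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda p: (len(p[0]), p[0])):
    print(itemset, count)
```

Pruning before the scan matters because each surviving candidate must be tested against every transaction; discarding candidates with an infrequent subset avoids that counting work entirely.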
4. What is association analysis? [Dec 13/jan-14][10marks]
“What is keyword-based association analysis?” Such analysis collects sets of keywords or terms
that occur frequently together and then finds the association or correlation relationships among
them. Like most of the analyses in text databases, association analysis first preprocesses the text
data by parsing, stemming, removing stop words, and so on, and then invokes association mining
algorithms. In a document database, each document can be viewed as a transaction, while a set of
keywords in the document can be considered as a set of items in the transaction. That is, the
database is in the format {document_id, a set of keywords}.
The problem of keyword association mining in document databases is thereby mapped to item
association mining in transaction databases, where many interesting methods have been developed.
Notice that a set of frequently occurring consecutive or closely located keywords may form a term
or a phrase. The association mining process can help detect compound associations, that is,
domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President, George W.
Bush], or noncompound associations, such as [dollars, shares, exchange, total, commission, stake,
securities]. Mining based on these associations is referred to as “term-level association mining” (as
opposed to mining on individual words). Term recognition and term-level association mining enjoy
two advantages in text analysis: (1) terms and phrases are automatically tagged so there is no need
for human effort in tagging documents; and (2) the number of meaningless results is greatly
reduced, as is the execution time of the mining algorithms.
With such term and phrase recognition, term-level mining can be invoked to find associations
among a set of detected terms and keywords. Some users may like to find associations between
pairs of keywords or terms from a given set of keywords or phrases, whereas others may wish to
find the maximal set of terms occurring together. Therefore, based on user mining requirements,
standard association mining or max-pattern mining algorithms may be invoked.
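The mapping from documents to transactions can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions: the stop-word list is a tiny hand-picked set, stemming is skipped, only keyword pairs are counted, and the documents and the names `to_transaction` and `frequent_keyword_pairs` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "on", "is"}


def to_transaction(text):
    """Parse a document into its set of keywords (lowercased, stop words removed)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return frozenset(t for t in tokens if t not in STOP_WORDS)


def frequent_keyword_pairs(database, min_count):
    """Count keyword pairs that co-occur in at least min_count documents."""
    pair_counts = Counter()
    for keywords in database.values():
        pair_counts.update(combinations(sorted(keywords), 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Hypothetical document collection.
docs = {
    "d1": "Shares of the exchange rose",
    "d2": "The exchange suspended trading in shares",
    "d3": "Dollars and shares on the exchange",
}

# The database in the form {document_id: set of keywords}.
database = {doc_id: to_transaction(text) for doc_id, text in docs.items()}
pairs = frequent_keyword_pairs(database, min_count=2)
print(pairs)
```

Once the documents are in this form, any frequent itemset algorithm for transaction databases can be run on them unchanged; the pair counter here stands in for such an algorithm.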
5. List the measures used for evaluating association patterns. [Dec-14/Jan 2015] [4 marks]
Evaluation of Association Patterns
1. Subjective vs. Objective Measures of Interestingness
2. Objective Measures of Interestingness
3. Interest Factor
4. Correlation Analysis
5. IS Measure
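The three objective measures in the list can be computed from a 2x2 contingency table of a pair of items A and B. The sketch below uses the usual textbook definitions: the interest factor is s(A,B) / (s(A) s(B)), correlation for binary variables is the phi coefficient, and the IS measure is s(A,B) / sqrt(s(A) s(B)). The function name `association_measures`, the argument names, and the counts are assumptions made for illustration, and the code assumes no row or column total is zero.

```python
from math import sqrt


def association_measures(f11, f10, f01, f00):
    """Objective interestingness measures from a 2x2 contingency table.

    f11: transactions with both A and B    f10: with A but not B
    f01: with B but not A                  f00: with neither
    """
    n = f11 + f10 + f01 + f00
    a = f11 + f10          # transactions containing A
    b = f11 + f01          # transactions containing B

    # Interest factor: 1 means independence, >1 positive, <1 negative association.
    interest_factor = n * f11 / (a * b)
    # Phi coefficient: ranges from -1 to +1, 0 means no linear association.
    phi = (f11 * f00 - f10 * f01) / sqrt(a * b * (n - a) * (n - b))
    # IS measure: geometric mean of the two rule confidences (cosine for binary data).
    is_measure = f11 / sqrt(a * b)

    return {"interest_factor": interest_factor, "phi": phi, "is_measure": is_measure}


# Hypothetical counts for two items over 1000 transactions.
measures = association_measures(f11=150, f10=50, f01=650, f00=150)
for name, value in measures.items():
    print(name, round(value, 4))
```

For these counts the interest factor is below 1 and phi is negative, so both flag a slightly negative association even though A and B appear together in 150 transactions.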