Chapter 2: Data Mining
Faculty of Computer Science and Engineering, HCM City University of Technology, October 2010

Outline
1. Overview of data mining
2. Association rules
3. Classification
4. Regression
5. Clustering
6. Other data mining problems
7. Applications of data mining

DATA MINING
Data mining refers to the mining or discovery of new information, in the form of patterns or rules, from vast amounts of data. To be practically useful, data mining must be carried out efficiently on large files and databases. This chapter briefly reviews the state of the art of this extensive field. Data mining uses techniques from areas such as machine learning, statistics, neural networks, and genetic algorithms.

1. OVERVIEW OF DATA MINING

Data Mining as a Part of the Knowledge Discovery Process
Knowledge Discovery in Databases, abbreviated as KDD, encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and displaying of the discovered information.

Example
Consider a transaction database maintained by a specialty consumer-goods retailer. Suppose the client data include customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database.
- During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
- The data cleansing process may then correct invalid zip codes or eliminate records with incorrect phone prefixes.
- Enrichment enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.
- Data transformation and encoding may be done to reduce the amount of data.

Example (cont.)
The result of mining may be to discover the following types of "new" information:
- Association rules, e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
- Sequential patterns, e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies, then within six months he or she is likely to buy an accessory item. This defines a sequential pattern of transactions. A customer who buys more than twice in the regular periods may be likely to buy at least once during the Christmas period.
- Classification trees, e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.

We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase. This information can then be utilized to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions. The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.
Goals of Data Mining and Knowledge Discovery
Data mining is carried out with some end goals. These goals fall into the following classes:
- Prediction: data mining can show how certain attributes within the data will behave in the future.
- Identification: data patterns can be used to identify the existence of an item, an event, or an activity.
- Classification: data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.
- Optimization: one eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced databases and information repositories: object-oriented and object-relational databases, spatial databases, time-series and temporal data, text and multimedia databases, heterogeneous and legacy databases, the World Wide Web

Types of Knowledge Discovered During Data Mining
Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules; in a structured form, it may be represented in decision trees, semantic networks, or hierarchies of classes or frames. It is common to describe the knowledge discovered during data mining in five ways:
- Association rules: these rules correlate the presence of a set of items with a range of values for another set of variables.
- Classification hierarchies: the goal is to work from an existing set of events or transactions to create a hierarchy of classes.
- Patterns within time series.
- Sequential patterns: a sequence of actions or events is sought. Detecting sequential patterns is equivalent to detecting associations among events with a certain temporal relationship.
- Clustering: a given population of events can be partitioned into sets of "similar" elements.

Main phases of the KDD process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

(Figure: main phases of data mining — Data Sources -> Data Integration and Data Cleaning -> Data Warehouse -> Selection/Transformation -> Task-relevant Data -> Data Mining -> Patterns -> Pattern Evaluation/Presentation.)

2. ASSOCIATION RULES

What Is Association Rule Mining?
Association rule mining finds frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
- Rule form: "Body => Head [support, confidence]".

Association rule mining examples:
buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
major(x, "CS") and takes(x, "DB") => grade(x, "A") [1%, 75%]

Association Rule Mining Problem
Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit).
Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

Rule Measures: Support and Confidence
Let J = {i1, i2, ..., im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A => B, where A ⊂ J, B ⊂ J and A ∩ B = ∅.
The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is taken to be the probability P(A ∪ B). The rule A => B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.

Support and confidence
That is:
- support, s: the probability that a transaction contains A ∪ B, s = P(A ∪ B)
- confidence, c: the conditional probability that a transaction containing A also contains B, c = P(B|A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.

Frequent itemsets
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. An itemset satisfies minimum support if its occurrence frequency is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.

Example 2.1
Transaction-ID   Items_bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
With minimum support 50% and minimum confidence 50%, we have:
A => C (50%, 66.6%)
C => A (50%, 100%)

Types of Association Rules
- Boolean vs. quantitative associations (based on the types of values handled):
  buys(x, "SQLServer") and buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
  age(x, "30..39") and income(x, "42..48K") => buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multidimensional associations: a rule that references two or more dimensions, such as the dimensions buys, income, and age, is a multidimensional association rule.
- Single-level vs. multiple-level analysis: some methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of mined association rules includes the following:
  age(x, "30..39") => buys(x, "laptop computer")
  age(x, "30..39") => buys(x, "computer")
  in which "computer" is a higher-level abstraction of "laptop computer".

How to mine association rules from large databases?
Association rule mining is a two-step process:
1. Find all frequent itemsets (the sets of items that have minimum support). A subset of a frequent itemset must also be a frequent itemset
(the Apriori principle), i.e., if {A, B} is a frequent itemset, then both {A} and {B} must be frequent itemsets. Frequent itemsets are found iteratively, with cardinality from 1 to k (k-itemsets).
2. Generate strong association rules from the frequent itemsets.
The overall performance of mining association rules is determined by the first step.

The Apriori Algorithm
Apriori is an important algorithm for mining frequent itemsets for Boolean association rules. The Apriori algorithm employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

Apriori property
Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset I ∪ A cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, i.e., P(I ∪ A) < min_sup. This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.

Finding Lk using Lk-1
A two-step process is used to find Lk from Lk-1:
- Join step: Ck is generated by joining Lk-1 with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so candidates containing such a subset are removed.

Pseudo code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = apriori_gen(Lk, min_sup);
    for each transaction t in database D do    // scan D for counts
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

procedure apriori_gen(Lk: frequent k-itemsets; min_sup: minimum support threshold)
(1)  for each itemset l1 ∈ Lk
(2)    for each itemset l2 ∈ Lk
(3)      if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-1] = l2[k-1]) ∧ (l1[k] < l2[k]) then {
(4)        c = l1 ∪ l2;
(5)        if some k-subset s of c is not in Lk then
(6)          delete c;    // prune step: remove unfruitful candidate
(7)        else add c to Ck+1;
(8)      }
(9)  return Ck+1;
(10) end procedure
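The pseudocode above can be turned into a short runnable sketch. The following Python version is illustrative only (itemsets are frozensets, min_sup is an absolute support count, and the candidate generation is a simplified join-and-prune rather than the ordered join of apriori_gen):

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise search for frequent itemsets (absolute support count)."""
        transactions = [frozenset(t) for t in transactions]
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent = dict(Lk)
        k = 1
        while Lk:
            # join step: merge pairs of frequent k-itemsets into (k+1)-candidates
            prev = list(Lk)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1:
                        # prune step: every k-subset of the candidate must be frequent
                        if all(frozenset(s) in Lk for s in combinations(union, k)):
                            candidates.add(union)
            # one full scan of the database to count the surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(Lk)
            k += 1
        return frequent

    # Example 2.1 data: min_sup = 2 of 4 transactions (50%)
    db = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'F'}]
    print(apriori(db, min_sup=2))   # frequent itemsets {A}, {B}, {C} and {A, C}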
Example 2.2
TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Assume that the minimum transaction support count required is 2 (i.e., min_sup = 2/9 = 22%).

C1                          L1
Itemset    Sup. count       Itemset    Sup. count
{I1}       6                {I1}       6
{I2}       7                {I2}       7
{I3}       6                {I3}       6
{I4}       2                {I4}       2
{I5}       2                {I5}       2

C2                          L2
Itemset     Sup. count      Itemset     Sup. count
{I1, I2}    4               {I1, I2}    4
{I1, I3}    4               {I1, I3}    4
{I1, I4}    1               {I1, I5}    2
{I1, I5}    2               {I2, I3}    4
{I2, I3}    4               {I2, I4}    2
{I2, I4}    2               {I2, I5}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

C3 (after pruning)           L3
Itemset         Sup. count   Itemset         Sup. count
{I1, I2, I3}    2            {I1, I2, I3}    2
{I1, I2, I5}    2            {I1, I2, I5}    2
({I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5} are pruned because they contain infrequent subsets.)

C4 = {{I1, I2, I3, I5}} before pruning; this candidate is removed because its subset {I2, I3, I5} is not frequent, so L4 = ∅ and the algorithm terminates.

Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them. This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support counts:
    confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
where support_count(X) is the number of transactions containing the itemset X.

Based on this equation, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule "s => (l - s)" if support_count(l)/support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support.

Example 2.3
From Example 2.2, suppose the data contain the frequent itemset l = {I1, I2, I5}. The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting association rules are shown below:
I1 ∧ I2 => I5     confidence = 2/4 = 50%
I1 ∧ I5 => I2     confidence = 2/2 = 100%
I2 ∧ I5 => I1     confidence = 2/2 = 100%
I1 => I2 ∧ I5     confidence = 2/6 = 33%
I2 => I1 ∧ I5     confidence = 2/7 = 29%
I5 => I1 ∧ I2     confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output.

Properties of the Apriori algorithm
- It may generate a huge number of candidate itemsets: 10^4 frequent 1-itemsets lead to more than 10^7 (≈ 10^4 (10^4 - 1)/2) candidate 2-itemsets, and discovering a frequent itemset of size k requires generating at least 2^k - 1 candidate itemsets (all of its nonempty subsets).
- It examines the dataset several times, so the cost grows quickly as the sizes of the itemsets increase: if the largest frequent itemsets are k-itemsets, the algorithm scans the dataset k+1 times.

Improving the efficiency of Apriori
- Hash-based technique: hash itemsets into corresponding buckets.
- Transaction reduction: reduce the number of transactions scanned in future iterations.
- Partitioning: partition the data to find candidate itemsets.
- Sampling: mine on a subset of the given data.
- Dynamic itemset counting: add candidate itemsets at different points during a scan.
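Before moving on, here is a small illustrative companion to the apriori() sketch above for the rule-generation step of Example 2.3. It enumerates the nonempty proper subsets s of each frequent itemset l and keeps the rules s => (l - s) whose confidence reaches min_conf; the function name and data layout are assumptions, not part of the original material:

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        """Generate strong rules s => (l - s) from frequent itemsets.

        `frequent` maps frozenset -> support count, as returned by the
        apriori() sketch above; min_conf is a fraction, e.g. 0.7.
        """
        rules = []
        for l, sup_l in frequent.items():
            if len(l) < 2:
                continue
            for size in range(1, len(l)):
                for s in combinations(l, size):
                    s = frozenset(s)
                    conf = sup_l / frequent[s]   # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        rules.append((set(s), set(l - s), conf))
        return rules

    # With min_conf = 70%, the itemset {I1, I2, I5} of Example 2.3 yields
    # I1 ^ I5 => I2, I2 ^ I5 => I1 and I5 => I1 ^ I2, each with confidence 100%.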
3. CLASSIFICATION

Classification is the process of learning a model that describes different classes of data. The classes are predetermined. Example: in a banking application, customers who apply for a credit card may be classified as a "good risk", a "fair risk", or a "poor risk". Hence, this type of activity is also called supervised learning. Once the model is built, it can be used to classify new data.

The first step, learning the model, is accomplished by using a training set of data that has already been classified. Each record in the training data contains an attribute, called the class label, that indicates which class the record belongs to. The model that is produced is usually in the form of a decision tree or a set of rules.

Some of the important issues with regard to the model and the algorithm that produces it include:
- the model's ability to predict the correct class for new data,
- the computational cost associated with the algorithm,
- the scalability of the algorithm.
Let us examine the approach where the model is in the form of a decision tree. A decision tree is simply a graphical representation of the description of each class or, in other words, a representation of the classification rules.

Example 3.1
Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics. Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending promotional literature to every new customer in the database can be quite costly; a more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose. Figure 2 shows a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.

(Figure 2: a decision tree for the concept buys_computer. Each internal node represents a test on an attribute; each leaf node represents a class.)

Algorithm for decision tree induction
Input: a set of training records R1, R2, ..., Rm and a set of attributes A1, A2, ..., An
Output: a decision tree
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping the partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed to classify the leaf.
- There are no samples left.

Procedure Build_tree(Records, Attributes);
Begin
(1)  Create a node N;
(2)  If all Records belong to the same class C then
(3)    Return N as a leaf node with the class label C;
(4)  If Attributes is empty then
(5)    Return N as a leaf node with the class label C such that the majority of Records belong to it;
(6)  Select the attribute Ai with the highest information gain from Attributes;
(7)  Label node N with Ai;
(8)  For each known value aj of Ai do begin
(9)    Add a branch from node N for the condition Ai = aj;
(10)   Sj = the subset of Records where Ai = aj;
(11)   If Sj is empty then
(12)     Add a leaf L with the class label C such that the majority of Records belong to it, and return L
       else
(13)     Add the node returned by Build_tree(Sj, Attributes - {Ai});
     end
End

Attribute Selection Measure
The expected information needed to classify a training set of s samples, where the class attribute has m values (a1, ..., am) and si is the number of samples belonging to class label ai, is given by:
    I(s1, s2, ..., sm) = - Σ_{i=1..m} pi log2(pi)
where pi is the probability that a random sample belongs to the class with label ai. An estimate of pi is si/s.
Consider an attribute A with values {a1, ..., av} used as the test attribute for splitting in the decision tree. Attribute A partitions the samples into the subsets S1, ..., Sv, where the samples in each Sj have the value aj for attribute A. Each Sj may contain samples that belong to any of the classes. The number of samples in Sj that belong to class i is denoted sij. The entropy, or expected information based on the partitioning by A, is given by:
    E(A) = Σ_{j=1..v} ((s1j + ... + smj)/s) * I(s1j, ..., smj)
I(s1j, ..., smj) is defined using the formulation for I(s1, ..., sm) with pi replaced by pij = sij/|Sj|. The information gained by partitioning on attribute A is then defined as:
    Gain(A) = I(s1, s2, ..., sm) - E(A)

Example 3.1 (cont.)
Table 1 presents a training set of data tuples taken from the AllElectronics customer database. The class label attribute, buys_computer, has two distinct values, so there are two distinct classes (m = 2). Let class C1 correspond to yes and class C2 to no. There are 9 samples of class yes and 5 samples of class no. To compute the information gain of each attribute, we first use Equation (1) to compute the expected information needed to classify a given sample:
    I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Table 1: training data tuples from the AllElectronics customer database
age       income   student   credit_rating   Class (buys_computer)
<=30      high     no        fair            no
<=30      high     no        excellent       no
31...40   high     no        fair            yes
>40       medium   no        fair            yes
>40       low      yes       fair            yes
>40       low      yes       excellent       no
31...40   low      yes       excellent       yes
<=30      medium   no        fair            no
<=30      low      yes       fair            yes
>40       medium   yes       fair            yes
<=30      medium   yes       excellent       yes
31...40   medium   no        excellent       yes
31...40   high     yes       fair            yes
>40       medium   no        excellent       no

Next, we need to compute the entropy of each attribute. Let's start with the attribute age. We look at the distribution of yes and no samples for each value of age and compute the expected information for each of these distributions.
For age = "<=30":    s11 = 2, s21 = 3,  I(s11, s21) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971
For age = "31...40": s12 = 4, s22 = 0,  I(s12, s22) = -(4/4)log2(4/4) = 0 (the term with 0/4 is taken to be 0)
For age = ">40":     s13 = 3, s23 = 2,  I(s13, s23) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
Using Equation (2), the expected information needed to classify a given sample if the samples are partitioned according to age is
    E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22) + (5/14) I(s13, s23) = (10/14) * 0.971 = 0.694
Hence, the gain in information from such a partitioning would be
    Gain(age) = I(s1, s2) - E(age) = 0.940 - 0.694 = 0.246
Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values. The samples are then partitioned accordingly, as shown in Figure 3.
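The gain computation above is easy to reproduce. A small Python sketch (function names are illustrative) that recomputes I(9,5), E(age) and Gain(age):

    from math import log2

    def info(counts):
        # Expected information I(s1,...,sm) = -sum_i p_i log2(p_i), with 0*log2(0) taken as 0
        s = sum(counts)
        return -sum((c / s) * log2(c / s) for c in counts if c > 0)

    # Class distribution of Table 1: 9 "yes" and 5 "no" samples
    print(f"{info([9, 5]):.3f}")            # 0.940

    # Partitioning by age: <=30 -> (2 yes, 3 no), 31...40 -> (4, 0), >40 -> (3, 2)
    parts = [(2, 3), (4, 0), (3, 2)]
    s = sum(sum(p) for p in parts)
    E_age = sum(sum(p) / s * info(p) for p in parts)
    print(f"{E_age:.3f}")                   # 0.694
    print(f"{info([9, 5]) - E_age:.3f}")    # 0.247 (the text reports 0.246, subtracting the rounded values)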
(Figure 3: the partitioning of the training tuples after splitting on age.)

Branch age <= 30:
income   student   credit_rating   class
high     no        fair            no
high     no        excellent       no
medium   no        fair            no
low      yes       fair            yes
medium   yes       excellent       yes

Branch age 31...40:
income   student   credit_rating   class
high     no        fair            yes
low      yes       excellent       yes
medium   no        excellent       yes
high     yes       fair            yes

Branch age > 40:
income   student   credit_rating   class
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.
Example:
IF age = "<=30" AND student = "no"  THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31...40"                  THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair"      THEN buys_computer = "yes"

Neural Networks and Classification
A neural network is a technique derived from AI that uses generalized approximation and provides an iterative method to carry it out. ANNs use a curve-fitting approach to infer a function from a set of samples. This technique provides a "learning approach"; it is driven by a training sample that is used for the initial inference and learning. With this kind of learning method, responses to new inputs may be interpolated from the known samples. This interpolation depends on the model developed by the learning method.

ANNs and classification
ANNs can be classified into two categories: supervised and unsupervised networks. Adaptive methods that attempt to reduce the output error are supervised learning methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods. ANNs can learn from information on a specific problem. They perform well on classification tasks and are therefore useful in data mining.

(Figures: information processing at a neuron in an ANN; a multilayer feed-forward neural network; the backpropagation algorithm.)

Classification with ANN (backpropagation)
For a unit j fed by inputs Oi over weights wij, with bias θj:
    Ij = Σ_i wij Oi + θj,    Oj = 1 / (1 + e^(-Ij))
For an output unit j with target value Tj:
    Errj = Oj (1 - Oj)(Tj - Oj)
For a hidden unit j, with k ranging over the units of the next layer:
    Errj = Oj (1 - Oj) Σ_k Errk wjk
Weights and biases are updated with learning rate l:
    wij = wij + (l) Errj Oi,    θj = θj + (l) Errj
The error is thus propagated backwards from the output vector, through the hidden nodes, towards the input nodes (input vector xi).

Other Classification Methods
- k-nearest neighbor classifier
- case-based reasoning
- genetic algorithms
- rough set approach
- fuzzy set approaches

The k-Nearest Neighbor Algorithm
All instances (samples) correspond to points in an n-dimensional space. The nearest neighbors are defined in terms of Euclidean distance. The Euclidean distance of two points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is
    d(X, Y) = sqrt( Σ_{i=1..n} (xi - yi)^2 )
When given an unknown sample xq, the k-nearest neighbor classifier searches the space for the k training samples that are closest to xq. The unknown sample is assigned the most common class among its k nearest neighbors; when k = 1, it is assigned the class of the single closest training sample. Once we have obtained xq's k nearest neighbors using the distance function, the neighbors vote to determine xq's class. Two approaches are common.
Majority voting: in this approach, all votes are equal. For each class, we count how many of the k neighbors have that class and return the class with the most votes.
Inverse distance-weighted voting: in this approach, closer neighbors get higher votes. While there are better-motivated methods, the simplest version is to take a neighbor's vote to be the inverse of its squared distance to xq:
    w = 1 / d(xq, xi)^2
We then sum the votes and return the class with the highest total vote.

Genetic Algorithms
- GA: based on an analogy to biological evolution.
- Each rule is represented by a string of bits. Example: the rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where the two left bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001".
- An initial population is created consisting of randomly generated rules.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring.
- The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.

4. REGRESSION

Data mining tasks are either predictive or descriptive; regression is a predictive technique. Definition (J. Han et al., 2001 & 2006): regression is a method used to predict continuous values for a given input.

Regression analysis can be used to model the relationship between one or more independent (predictor) variables and one or more dependent (response) variables. Categories:
- linear regression and nonlinear regression
- univariate and multivariate regression
- parametric, nonparametric, and semi-parametric regression

Regression function
Regression function: Y = f(X, β)
- X: predictor/independent variables, used to explain the changes of the response variable Y
- Y: response/dependent variables, used to describe the target phenomenon
- β: regression coefficients, describing the influence of X on Y
The relationship between Y and X can be represented by the functional dependence of Y on X.

Regression with a single predictor variable
Given N observed objects, the model is described in the following forms, with εi representing the part of the response Y that cannot be explained from X:
- Line form: yi = β0 + β1 xi + εi
- Parabola form: yi = β0 + β1 xi + β2 xi^2 + εi

Linear regression with a single predictor variable
Estimate the parameters β (to obtain the fitted linear regression model ŷi = b0 + b1 xi) by minimizing the sum of squared residuals:
    SSE = Σ_{i=1..N} (yi - ŷi)^2
The estimated values b0 and b1 are the least-squares estimates of β0 and β1.

Linear multiple regression
This linear regression model analyses the relationship between the response/dependent variable and two or more independent variables:
    yi = b0 + b1 xi1 + b2 xi2 + ... + bk xik,    i = 1..n
where n is the number of observed objects, k is the number of predictor variables (attributes), Y is the dependent variable, the X's are the independent variables, b0 is Y's value when all predictors are 0, and b1, ..., bk are the regression coefficients.

The estimated value of Y is ŷ = b0 + b1 x1 + b2 x2 + ... + bk xk, and the estimated coefficients are obtained as
    b = (X^T X)^(-1) X^T Y
where
    Y = [y1, y2, ..., yn]^T,    b = [b0, b1, ..., bk]^T,
and X is the n x (k+1) design matrix whose i-th row is [1, xi1, xi2, ..., xik].
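A minimal numpy sketch of the normal-equation estimate b = (X^T X)^(-1) X^T Y, applied to the Tackey Toys data tabulated in the example that follows (numpy is assumed to be available; in practice np.linalg.lstsq is the numerically safer route):

    import numpy as np

    # Advertising (x1, $000), population (x2, 000) and toy sales (y, $000)
    # for the six market areas of the Tackey Toys example below.
    x1 = np.array([1.0, 5.0, 8.0, 6.0, 3.0, 10.0])
    x2 = np.array([200, 700, 800, 400, 100, 600], dtype=float)
    y  = np.array([100, 300, 400, 200, 100, 400], dtype=float)

    # Design matrix with a leading column of 1s for the intercept b0
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # b = (X^T X)^-1 X^T Y, solved as a linear system
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(b)   # approximately [6.3972, 20.4921, 0.2805]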
Linear multiple regression: example
A sales manager at Tackey Toys needs to predict sales of Tackey products in selected market areas. He believes that advertising expenditures and the population in each market area can be used to predict sales. He gathered a sample of toy sales, advertising expenditures, and population figures, shown below. Find the linear multiple regression equation that best fits the data.

Market Area   Advertising Expenditures (Thousands of Dollars) x1   Population (Thousands) x2   Toy Sales (Thousands of Dollars) y
A             1.0                                                  200                         100
B             5.0                                                  700                         300
C             8.0                                                  800                         400
D             6.0                                                  400                         200
E             3.0                                                  100                         100
F             10.0                                                 600                         400

Software packages such as SPSS, SAS, and R give the fitted equation
    ŷ = 6.3972 + 20.4921 x1 + 0.2805 x2

Nonlinear regression
Y = f(X, β), where Y is a nonlinear function of the combination of the coefficients β. Examples: exponential functions, logarithmic functions, Gaussian functions, ...
Determining the optimal β requires optimization algorithms (local or global optimization) applied to the sum of squared residuals.

Applications of regression in data mining
- In the preprocessing stage and in the data mining stage
- In descriptive data mining and in predictive data mining
- Application areas: biology, agriculture, social issues, economy, business, ...

Some problems with regression
- Some assumptions go along with regression.
- Danger of extrapolation.
- Evaluation of regression models.
- Other, more advanced techniques for regression: artificial neural networks (ANN), support vector machines (SVM).

5. CLUSTERING

What is Cluster Analysis?
- Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised learning: no predefined classes, no class-labeled training samples.
- Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.

General Applications of Clustering
- Pattern recognition
- Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- World Wide Web: document classification; clustering weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults.

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. A popular one is the Minkowski distance:
    d(i, j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q )^(1/q)
where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects and q is a positive integer.
If q = 1, d is the Manhattan distance:
    d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|
If q = 2, d is the Euclidean distance:
    d(i, j) = sqrt( |xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2 )
Properties:
- d(i, j) >= 0
- d(i, i) = 0
- d(i, j) = d(j, i)
- d(i, j) <= d(i, k) + d(k, j)
One can also use a weighted distance or other dissimilarity measures.
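A tiny Python sketch of the Minkowski family of distances (the function name and sample points are illustrative only):

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional objects x and y."""
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    i = (1.0, 2.0, 3.0)
    j = (4.0, 6.0, 3.0)
    print(minkowski(i, j, q=1))   # Manhattan distance: 7.0
    print(minkowski(i, j, q=2))   # Euclidean distance: 5.0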
Types of Data in Cluster Analysis
- Interval-scaled variables/attributes
- Binary variables/attributes
- Categorical variables/attributes
- Ordinal variables/attributes
- Ratio-scaled variables/attributes
- Variables/attributes of mixed types

Interval-scaled variables/attributes
Standardization uses the mean absolute deviation
    sf = (1/n)( |x1f - mf| + |x2f - mf| + ... + |xnf - mf| )
where the mean is
    mf = (1/n)( x1f + x2f + ... + xnf )
and the z-score measurement
    zif = (xif - mf) / sf

Binary variables/attributes
A contingency table for binary data (objects i and j):

                     Object j
                     1        0        sum
    Object i   1     a        b        a + b
               0     c        d        c + d
               sum   a + c    b + d    p (= a + b + c + d)

Dissimilarity (if the attributes are symmetric):   d(i, j) = (b + c) / (a + b + c + d)
Dissimilarity (if the attributes are asymmetric):  d(i, j) = (b + c) / (a + b + c)

Example (binary attributes)
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

gender is a symmetric attribute; the other binary attributes are asymmetric. Coding Y and P as 1 and N as 0 and using the asymmetric measure:
    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Variables/attributes of mixed types
In the general case, the dissimilarity between objects i and j is
    d(i, j) = ( Σ_{f=1..p} δij(f) dij(f) ) / ( Σ_{f=1..p} δij(f) )
where δij(f) = 0 if xif or xjf is missing (and 1 otherwise), and the contribution dij(f) of variable/attribute f depends on its type:
- f binary or categorical: dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise
- f interval-scaled: dij(f) = |xif - xjf| / (max_h xhf - min_h xhf)
- f ordinal: compute the rank rif, map it to zif = (rif - 1)/(Mf - 1), and treat zif as interval-scaled

Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
- Global optimum: exhaustively enumerate all partitions.
- Heuristic methods: the k-means and k-medoids algorithms.
- k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering Method
Input: a database D of m records r1, r2, ..., rm and a desired number of clusters k.
Output: a set of k clusters that minimizes the square-error criterion.
Given k, the k-means algorithm is implemented in four steps (a runnable sketch follows the square-error criterion below):
- Step 1: randomly choose k records as the initial cluster centers.
- Step 2: assign each record ri to the cluster whose centroid (mean) is closest to ri among the k clusters.
- Step 3: recalculate the centroid (mean) of each cluster based on the records assigned to it.
- Step 4: go back to Step 2; stop when there are no more new assignments.

The algorithm begins by randomly choosing k records to represent the centroids (means) m1, m2, ..., mk of the clusters C1, C2, ..., Ck. Each record is placed in the cluster whose mean is closest: if the distance between mi and record rj is the smallest among all cluster means, then rj is placed in cluster Ci. Once all records have been placed in a cluster, the mean of each cluster is recomputed. Then the process repeats, examining each record again and placing it in the cluster whose mean is closest. Several iterations may be needed, but the algorithm will converge, although it may terminate at a local optimum.

Square-error criterion
The terminating condition is usually the square-error criterion, defined as
    E = Σ_{i=1..k} Σ_{p ∈ Ci} |p - mi|^2
where E is the sum of the square error for all objects in the database, p is the point in space representing a given object, and mi is the mean of cluster Ci. This criterion tries to make the resulting clusters as compact and as separate as possible.
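A compact Python sketch of the four steps above, applied to the (age, years of service) records of Example 4.1 below. It is illustrative rather than a reference implementation; math.dist (Python 3.8+) and the tuple-based data layout are assumptions:

    import math, random

    def kmeans(records, k, max_iter=100):
        """Plain k-means on numeric records (tuples of equal length)."""
        centroids = random.sample(records, k)            # Step 1: random initial centers
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for r in records:                            # Step 2: assign to nearest centroid
                d = [math.dist(r, c) for c in centroids]
                clusters[d.index(min(d))].append(r)
            new_centroids = []
            for old, members in zip(centroids, clusters):   # Step 3: recompute the means
                if members:
                    new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
                else:
                    new_centroids.append(old)            # keep an empty cluster's old center
            if new_centroids == centroids:               # Step 4: stop when nothing moves
                break
            centroids = new_centroids
        return centroids, clusters

    # Example 4.1 data (age, years of service); k = 2
    data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (30, 25)]
    print(kmeans(data, k=2))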
Example 4.1
Consider the k-means clustering algorithm applied to the (two-dimensional) records in Table 2, and assume that the number of desired clusters k is 2.

RID   Age   Years of Service
1     30    5
2     50    25
3     50    15
4     25    5
5     30    10
6     30    25

Let the algorithm choose the record with RID 3 for cluster C1 and the record with RID 6 for cluster C2 as the initial cluster centroids.
The first iteration:
    distance(r1, C1) = sqrt((50-30)^2 + (15-5)^2) = 22.4;  distance(r1, C2) = 32.0, so r1 ∈ C1
    distance(r2, C1) = 10.0;  distance(r2, C2) = 5.0,  so r2 ∈ C2
    distance(r4, C1) = 25.5;  distance(r4, C2) = 36.6, so r4 ∈ C1
    distance(r5, C1) = 20.6;  distance(r5, C2) = 29.2, so r5 ∈ C1
Now the new means (centroids) of the two clusters are computed. The mean of a cluster Ci with n records of m dimensions is the vector
    ( (1/n) Σ_{rj ∈ Ci} rj1, ..., (1/n) Σ_{rj ∈ Ci} rjm )
The new mean of C1 is (33.75, 8.75) and the new mean of C2 is (52.5, 25).
The second iteration: r1, r4, r5 ∈ C1 and r2, r3, r6 ∈ C2. The means of C1 and C2 are recomputed as (28.3, 6.7) and (51.7, 21.7). In the next iteration, all records stay in their previous clusters and the algorithm terminates.

(Figure: clustering of a set of objects based on the k-means method.)

Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- The number of clusters k must be specified in advance.
- Unable to handle noisy data and outliers.
- The quality of the clustering depends on the choice of the initial cluster centroids.

Partitioning Around Medoids (k-medoids)
Instead of the mean, each cluster is represented by one of its objects, the medoid. In each iteration the algorithm considers swapping a current medoid Oj with a non-medoid object Orandom and computes the total cost S of the swap:
    S = Σ_p Cp
where Cp is the change in cost for a non-medoid object p. Depending on which medoid p is currently assigned to and which object it would be assigned to after the swap, Cp takes one of four forms:
- p is currently assigned to Oj and is reassigned to another medoid Oi:       Cp = d(p, Oi) - d(p, Oj)
- p is currently assigned to Oj and is reassigned to Orandom:                 Cp = d(p, Orandom) - d(p, Oj)
- p is currently assigned to another medoid Oi and stays with Oi:             Cp = 0
- p is currently assigned to another medoid Oi and is reassigned to Orandom:  Cp = d(p, Orandom) - d(p, Oi)
If the total cost S is negative, the swap is performed.

Properties of the k-medoids algorithm
- Each cluster has a representative object, the medoid, i.e., the most centrally located object of its cluster.
- It reduces the influence of noise (outliers/irregularities/extreme values).
- The number of clusters k needs to be predetermined.
- Complexity of each iteration: O(k(n-k)^2), so the algorithm becomes very costly for large values of n and k.

Hierarchical Clustering
A hierarchical clustering method works by grouping data objects into a tree of clusters. In general, there are two types of hierarchical clustering methods:
- Agglomerative hierarchical clustering: this bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category; they differ only in their definition of intercluster similarity.
- Divisive hierarchical clustering: this top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the distance between the two closest clusters exceeding a certain threshold.
Agglomerative algorithm
Assume that we are given n data records r1, r2, ..., rn and a function D(Ci, Cj) for measuring the distance between two clusters Ci and Cj. Then an agglomerative algorithm for clustering can be written as follows:

for i = 1, ..., n do let Ci = {ri};
while there is more than one cluster left do begin
    let Ci and Cj be the pair of clusters that minimizes the distance D(Ck, Ch) over all pairs of clusters Ck, Ch;
    Ci = Ci ∪ Cj;
    remove cluster Cj;
end

Example 4.2
Figure 4 shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES places each object into a cluster of its own. The clusters are then merged step by step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters. This is a single-link approach, in that each cluster is represented by all of the objects in the cluster and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster-merging process repeats until all of the objects are eventually merged to form one cluster.

(Figure 4: agglomerative and divisive hierarchical clustering on the data objects {a, b, c, d, e}.)

In DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster contains only a single object. In general, divisive methods are more computationally expensive and tend to be less widely used than agglomerative methods.

There are a variety of methods for defining the intercluster distance D(Ck, Ch). Local pairwise distance measures (i.e., between pairs of clusters) are especially suited to hierarchical methods. One of the most important intercluster distances is the nearest-neighbor or single-link distance, which defines the distance between two clusters as the distance between the two closest points, one from each cluster:
    Dsl(Ci, Cj) = min{ d(x, y) | x ∈ Ci, y ∈ Cj }
where d(x, y) is the distance between objects x and y. If the distance between two clusters is instead defined as the distance between the two farthest points, one from each cluster,
    Dcl(Ci, Cj) = max{ d(x, y) | x ∈ Ci, y ∈ Cj }
the method is called a complete-linkage algorithm.

(Figure: criteria to merge two clusters — single-linkage vs. complete-linkage.)

Major weaknesses of hierarchical clustering methods:
1. They do not scale well: the time complexity is at least O(n^2), where n is the total number of objects. The agglomerative algorithm requires, in the first iteration, that we locate the closest pair of objects; this alone takes O(n^2) time, so in most cases the algorithm requires O(n^2) time and frequently much more.
2. We can never undo what was done previously.

Some advanced hierarchical clustering algorithms
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): partitions the objects using two concepts, the clustering feature and the clustering-feature (CF) tree.
- ROCK (RObust Clustering using linKs): a clustering algorithm for categorical/discrete attributes.
- Chameleon: a hierarchical clustering algorithm using dynamic modeling.
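Before turning to BIRCH, here is a small Python sketch of the agglomerative algorithm given above with the single-link distance Dsl; it is a naive O(n^3)-ish illustration (function names and the sample points are assumptions), not an efficient implementation:

    import math

    def single_link(ci, cj):
        """Nearest-neighbor (single-link) distance between two clusters."""
        return min(math.dist(x, y) for x in ci for y in cj)

    def agglomerative(records, target_k=1, D=single_link):
        """Repeatedly merge the two closest clusters until target_k clusters remain."""
        clusters = [[r] for r in records]        # each object starts in its own cluster
        while len(clusters) > target_k:
            i, j = min(
                ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                key=lambda p: D(clusters[p[0]], clusters[p[1]]),
            )
            clusters[i].extend(clusters[j])      # Ci = Ci U Cj
            del clusters[j]                      # remove cluster Cj
        return clusters

    points = [(1, 1), (1, 2), (2, 2), (8, 8), (8, 9), (9, 9)]
    print(agglomerative(points, target_k=2))     # two well-separated groups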
Introduction to BIRCH (1996)
- Hierarchical representation of a clustering.
- The data set needs to be scanned only once; clustering decisions are made on the fly as new data points are inserted.
- Outlier removal.
- Good for very large data sets under limited computational resources.

Features of a cluster
- Centroid: the Euclidean mean of the cluster's points.
- Radius: the average distance from any member point to the centroid.
- Diameter: the average pairwise distance between two data points in the cluster.

Alternative measures for the closeness of two clusters
Any of the following can be used in the BIRCH algorithm as a distance metric to compare a new data point to existing clusters:
- D0: Euclidean distance between centroids
- D1: Manhattan distance between centroids
And for deciding whether to merge clusters:
- D2: average inter-cluster distance between the two clusters
- D3: average intra-cluster distance inside a cluster (geometrically, the diameter of the new cluster if the two clusters are merged)
- D4: variance-increase distance, the amount by which the intra-cluster distance variance changes if the two clusters are merged

Clustering feature (CF)
- Maintained for each subcluster.
- Holds enough information to calculate intra-cluster distances.
Additivity theorem: clustering features are additive, which allows us to merge two subclusters by simply adding their clustering features.

CF tree
- A hierarchical representation of a clustering, updated dynamically when new data points are inserted.
- Each entry in a leaf node is not a data point but a subcluster.
CF tree parameters:
- The diameter of a leaf entry has to be less than the threshold T.
- A nonleaf node contains at most B entries; a leaf node contains at most L entries.
- The tree size is a function of T; B and L are determined by the requirement that each node fit into a memory page of a given size.

(Figures: an example CF tree and CF-tree insertion.)

The BIRCH algorithm
1. Build an in-memory CF tree.
2. Optional: condense it into a smaller CF tree.
3. Global clustering.
4. Optional: cluster refining.

Phase 1
Insert data points one at a time, building the CF tree dynamically. If the tree grows too large, increase the threshold T and rebuild from the current tree. Optionally remove outliers when rebuilding the tree.

Insertion algorithm
- Start from the root node and find the closest leaf node and leaf entry.
- If the distance to the centroid is less than the threshold T, the entry absorbs the new data point; otherwise create a new leaf entry.
- If there is no space on the leaf for the new entry, split the leaf node: choose the farthest pair of entries and redistribute the remaining entries between the two new leaves.
- Insert a new nonleaf entry into the parent node; we may have to split the parent node as well. If the root is split, the tree height increases by one.

Phase 2
Scan the leaf entries to rebuild a smaller CF tree: remove outliers and group the more crowded subclusters into larger ones. The idea is to make the next phase more efficient.

Phase 3
Leaf nodes do not necessarily represent clusters, so a regular clustering algorithm is used to cluster the leaf entries.

Phase 4
Redistribute the data points to their closest centroid.
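The slides leave the exact contents of a clustering feature to a figure; in the original BIRCH paper it is the triple CF = (N, LS, SS) — the number of points, their linear sum, and the sum of their squared norms. Under that assumption, a small Python sketch of the additivity theorem and of the centroid and radius features:

    from dataclasses import dataclass
    import math

    @dataclass
    class CF:
        """Clustering feature of a subcluster: point count, linear sum, square sum."""
        n: int
        ls: tuple      # per-dimension linear sum of the points
        ss: float      # sum of squared norms of the points

        def __add__(self, other):
            # Additivity theorem: CF1 + CF2 summarizes the merged subcluster
            return CF(self.n + other.n,
                      tuple(a + b for a, b in zip(self.ls, other.ls)),
                      self.ss + other.ss)

        def centroid(self):
            return tuple(s / self.n for s in self.ls)

        def radius(self):
            # average (root-mean-square) distance from member points to the centroid
            c2 = sum(c * c for c in self.centroid())
            return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def cf_of(points):
        return CF(len(points),
                  tuple(map(sum, zip(*points))),
                  sum(x * x for p in points for x in p))

    a = cf_of([(1.0, 1.0), (2.0, 1.0)])
    b = cf_of([(2.0, 3.0)])
    merged = a + b                 # merge two subclusters without revisiting their points
    print(merged.centroid(), round(merged.radius(), 3))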
6. OTHER DATA MINING PROBLEMS

Discovering Sequential Patterns
The discovery of sequential patterns is based on the concept of a sequence of itemsets. We assume that transactions are ordered by time of purchase; that ordering yields a sequence of itemsets. For example, {milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee} may be such a sequence of itemsets, based on three visits of the same customer to the store. The support for a sequence S of itemsets is the percentage of the given set U of sequences of which S is a subsequence. In this example, {milk, bread, juice}, {bread, eggs} and {bread, eggs}, {cookies, milk, coffee} are considered subsequences.
The problem of identifying sequential patterns, then, is to find all subsequences of the given set of sequences that have a user-defined minimum support. The sequence S1, S2, S3, ... is a predictor of the fact that a customer who buys itemset S1 is likely to buy itemset S2 and then S3, and so on. This prediction is based on the frequency (support) of this sequence in the past. Various algorithms have been investigated for sequence detection.

Mining Time Series
A time-series database consists of sequences of values or events changing with time. The values are typically measured at equal time intervals. Time-series databases are popular in many applications, such as studying the daily fluctuations of a stock market, traces of scientific experiments, medical treatments, and so on. A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time.

(Figure: time-series data — the stock price of IBM over time.)

Categories of time-series movements:
- Long-term or trend movements
- Cyclic movements or cyclic variations, e.g., business cycles
- Seasonal movements or seasonal variations, i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
- Irregular or random movements

Similarity Search in Time-Series Analysis
A normal database query finds exact matches; a similarity search finds data sequences that differ only slightly from the given query sequence. There are two categories of similarity queries:
- Whole matching: find a sequence that is similar to the query sequence.
- Subsequence matching: find all pairs of similar sequences.
Typical applications: financial markets, transaction data analysis, scientific databases (e.g., power consumption analysis), medical diagnosis (e.g., cardiogram analysis).

Data transformation
For similarity analysis of time-series data, Euclidean distance is typically used as the similarity measure. Many techniques for signal analysis require the data to be in the frequency domain, so distance-preserving transformations are often used to transform the data from the time domain to the frequency domain. Usually data-independent transformations are used, where the transformation matrix is determined a priori, e.g., the discrete Fourier transform (DFT) or the discrete wavelet transform (DWT). The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain. DFT does a good job of concentrating energy in the first few coefficients, so if we keep only the first few DFT coefficients, we can compute lower bounds of the actual distance.

Multidimensional Indexing
A multidimensional index can be constructed for efficient access using the first few Fourier coefficients. The index is used to retrieve the sequences that are at most a certain small distance away from the query sequence; post-processing then computes the actual distance between sequences in the time domain and discards any false matches.

Subsequence Matching
- Break each sequence into pieces using a sliding window of length w.
- Extract the features of the subsequence inside the window.
- Map each sequence to a "trail" in the feature space.
- Divide the trail of each sequence into "subtrails" and represent each of them with a minimum bounding rectangle. (R-trees and R*-trees have been used to store minimum bounding rectangles so as to speed up the similarity search.)
- Use a multipiece assembly algorithm to search for longer sequence matches.

(Figure: clusters of data points can be grouped into "boxes" called minimum bounding rectangles (MBRs), e.g., R1, ..., R9, which can be further grouped recursively into larger MBRs such as R10, R11, R12. These nested MBRs are organized as a tree, called a spatial access tree or multidimensional tree, whose leaf (data) nodes contain the points; examples include the R-tree, the Hybrid-tree, etc.)

Discretization
Discretization of a time series transforms it into a symbolic string. The main benefit of this discretization is that there is an enormous wealth of existing algorithms and data structures that allow the efficient manipulation of symbolic representations. Lin and Keogh et al. (2003) proposed a method called Symbolic Aggregate approXimation (SAX), which allows the discretization of original time series into symbolic strings.

Symbolic Aggregate approXimation (SAX) [Lin et al. 2003]
The first symbolic representation of time series that allows:
- lower bounding of the Euclidean distance
- dimensionality reduction
- numerosity reduction

How do we obtain SAX? First convert the time series to its PAA (Piecewise Aggregate Approximation) representation, then convert the PAA to symbols; this takes linear time.
(Figure: a time series C converted to PAA and then to the symbol string baabccbc.)
SAX has two parameter choices: the word size (the number of PAA segments, 8 in this example) and the alphabet size (cardinality, 3 in this example).

Time-Series Data Mining tasks
- Similarity search
- Classification
- Clustering
- Motif discovery
- Novelty/anomaly detection
- Time-series visualization
- Time-series prediction

Some representative data mining tools
- Acknosoft (Kate): decision trees, case-based reasoning
- DBMiner Technology (DBMiner): OLAP analysis, associations, classification, clustering
- IBM (Intelligent Miner): classification, association rules, predictive models
- NCR (Management Discovery Tool): association rules
- SAS (Enterprise Miner): decision trees, association rules, neural networks, regression, clustering
- Silicon Graphics (MineSet): decision trees, association rules
- Oracle (Oracle Data Mining): classification, prediction, regression, clustering, association, feature selection, feature extraction, anomaly detection
- Weka (http://www.cs.waikato.ac.nz/ml/weka), University of Waikato, New Zealand: written in Java; runs on Linux, Windows, and Macintosh

7. POTENTIAL APPLICATIONS OF DATA MINING
Database analysis and decision support:
- Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection and management
Other applications:
- Text mining (newsgroups, email, documents) and Web analysis
- Intelligent query answering

Market Analysis and Management
Where are the data sources for analysis?
- Data sources: credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing: find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time, e.g., the conversion of a single bank account to a joint one (marriage), etc.
- Cross-market analysis: associations/correlations between product sales; prediction based on the association information.
- Customer profiling: data mining can tell what types of customers buy what products (clustering or classification).
- Identifying customer requirements: identify the best products for different customers; use prediction to find what factors will attract new customers.
- Provision of summary information: various multidimensional summary reports; statistical summary information (central tendency and variation of the data).

Corporate Analysis and Risk Management
- Finance planning and asset evaluation: cash flow analysis and prediction; contingent claim analysis to evaluate assets; cross-sectional and time-series analysis (financial ratios, trend analysis, etc.).
- Resource planning: summarize and compare the resources and spending.
- Competition: monitor competitors and market directions; group customers into classes and apply a class-based pricing procedure; set pricing strategy in a highly competitive market.

Fraud Detection and Management
- Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - Auto insurance: detect groups of people who stage accidents to collect on the insurance.
  - Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
  - Medical insurance: detect "professional" patients and rings of doctors and referrers.
- Detecting inappropriate medical treatment: the Australian Health Insurance Commission identified many cases in which blanket screening tests were requested (saving about AUD 1 million per year).
- Detecting telephone fraud: a telephone-call model uses the destination of the call, its duration, and the time of day or week, and analyzes patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.
- Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications
- Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and the Miami Heat.
- Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
- Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.