TABLE OF CONTENTS

DATA MINING TECHNIQUES
  INTRODUCTION
  CLUSTERING ANALYSIS AND AUTO-CLUSTERING
    BOUNDARY MODELING & CENTROID MODELING
  LINK ANALYSIS
  VISUALIZATION
    CONVENTIONAL AND SPATIAL VISUALIZATION
  RULE INDUCTION
WORKS CITED

DATA MINING TECHNIQUES

INTRODUCTION

Data mining can be divided into two subgroups: discovery and exploitation. Discovery means that meaningful patterns are uncovered in the data and properly distinguished; exploitation means that these meaningful patterns are then used to create useful applications for data modeling purposes. The process of discovering knowledge in data is referred to as "knowledge discovery" (KD) when the data being worked on comes from a conventional data store rather than a data warehouse or an online transaction system. KD can be performed in two ways, either manually or through an automated process. KD finds connections between data and facts; it can be described as a relationship between a prior fact and a subsequent fact. The prior fact is called the antecedent, and the subsequent fact is known as the consequent.
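The antecedent/consequent relationship described above can be made concrete with a small sketch. The `Rule` class, the sample facts, and the printer/ink example are invented for illustration; they do not come from the source text.

```python
# Minimal sketch of an antecedent -> consequent rule, the basic unit of
# discovered knowledge described above. The Rule class and the sample
# facts below are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: str   # the prior fact
    consequent: str   # the subsequent fact

    def applies_to(self, facts: set) -> bool:
        """The rule fires when the antecedent is among the observed facts."""
        return self.antecedent in facts

rule = Rule(antecedent="customer bought a printer",
            consequent="customer buys ink within 30 days")

observed = {"customer bought a printer", "customer paid by card"}
if rule.applies_to(observed):
    print(rule.consequent)
```

Note that, as the next paragraph stresses, a rule like this records only an observed association; it does not assert that the antecedent causes the consequent.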
There should be no assumption of an underlying cause-and-effect connection between the antecedent and the consequent. Throughout the process of data mining it should be kept in mind that data is not facts; to arrive at knowledge and facts, one needs data. Knowledge is the main connection between antecedent facts and consequent facts, and this knowledge takes several shapes in the domain of data mining. Knowledge can take the form of if-then-else rules, known as a knowledge-based expert system (KBES), or it can take the form of a complex mathematical transform. This section of the paper will discuss the four main techniques used in data mining: auto-clustering, link analysis, visualization, and rule induction. These four techniques are used to create knowledge and information based on data from the database.

CLUSTERING ANALYSIS AND AUTO-CLUSTERING

Patterns are central to data mining: in order to find the relationships between data that produce knowledge, we first need to better understand patterns. "A 'pattern' is regularities in data, but it should always be kept in mind that data can be very complex and provide no useful information" (Data Mining Explained, Chapter 8, p. 141). Data mining looks for patterns that give information about the behavior behind the data; this behavior may be our own, that of customers or suppliers, or that of the industry after a significant event occurs. When dealing with numeric data, the most useful regularity that can be found is the "cluster." A cluster is a coherent aggregation of data points; clusters are groups of data with similar behaviors and characteristics.
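To make the idea of a cluster concrete, the sketch below groups a handful of 2-D points around two centers by nearest-centroid assignment (a single assignment step of a k-means-style procedure). The data points and starting centroids are invented for illustration.

```python
# Group 2-D points into clusters by assigning each point to its nearest
# centroid -- one assignment step of a k-means-style procedure.
# The data points and centroids are invented for illustration.
import math

points = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0),   # one tight aggregation
          (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]   # a second aggregation
centroids = [(1.0, 1.0), (5.0, 5.0)]

def nearest(p, centers):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: math.dist(p, centers[i]))

clusters = {0: [], 1: []}
for p in points:
    clusters[nearest(p, centroids)].append(p)

for label, members in clusters.items():
    print(label, members)
```

With well-separated data like this, the two coherent aggregations fall out immediately; the difficulty described next arises when the aggregations overlap.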
Clusters can take the shape of normally distributed aggregations of nonoverlapping point sets, long thin strands of data points, or ragged blobs of various shapes and sizes. A data set usually includes a mixture of cluster groups of various sorts, which frequently overlap in very complex ways. This overlap is what makes it difficult, in data mining, to determine where one cluster ends and another begins. The ultimate goal of clustering is to find groups of data that are similar to each other and to separate them from groups of data without those similar characteristics or features. Business knowledge is very important to the clustering process so that the data can be properly processed and interpreted. Groups of clusters can also be used to classify new data; common algorithms used to perform clustering include boundary modeling and centroid modeling.

BOUNDARY MODELING & CENTROID MODELING

In boundary modeling, clusters are distinguished by specifying the boundaries between them, while in centroid modeling clusters are distinguished by specifying their locations and extents. Boundary modeling is also known as a bottom-up approach to analysis, because clusters are determined by looking at the locations of individual points rather than at the data as a whole. Centroid modeling is sometimes called top-down modeling because the locations and extents of clusters are determined by the distribution of the points as a group.

LINK ANALYSIS

Link analysis is a graphical approach to the analysis of data that establishes connections and relationships in order to derive knowledge and facts from the data. It helps to explore data and identify relationships among values in a database. There are various types of link analysis, but the most common are stratification, association discovery, and sequence discovery.
Stratification is the process of retrieving records from the data store using hypothesized links as the retrieval keys, while association discovery finds rules about items that appear together in an event such as a purchase transaction. The third commonly used form of link analysis is sequence discovery; it is very similar to association discovery, but the association is a relation over time.

VISUALIZATION

Data visualization's role in data mining is to allow the data analyst to gain intuition about the data being observed. Data visualization applications assist the analyst in selecting the display formats, viewer perspectives, and data representation schemas that foster a deep, intuitive understanding of the data to be mined. There are two types of data visualization techniques, "conventional visualization" and "spatial visualization." Both have been around for a while, but conventional visualization has been around much longer, so it has been more thoroughly researched and better developed over the years. The most commonly used visualization techniques include scatter plots, pie charts, tables, and other graphic representations that depict the relationships in the data being observed. These applications also include features such as drill-down menus, hyperlinks to related data sources, multiple simultaneous data views, select-count-analyze objects in displays, and dynamic and active views of the database, to name a few. The most effective pattern recognition equipment known to date is "grayware," that is, the human mind; researchers have found that "grayware allows manual knowledge discovery to be facilitated when domain data are presented in a form that allows the human mind to interact with them at an intuitive level" (Data Mining Explained, Chapter 8, p. 150).
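As a toy illustration of the spatial idea, the sketch below renders a tiny scatter plot of 2-D points as a character grid. Real visualization tools offer far richer displays (drill-down, linked views, and so on); this only shows the core notion of placing each data point in its feature space. The points are invented.

```python
# Toy spatial-visualization sketch: render a scatter plot of 2-D points
# as a character grid. Each point is scaled into the grid and marked
# with '*'; the y axis is flipped so that y grows upward as on paper.
def ascii_scatter(points, width=20, height=10):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - min(xs)) / (max(xs) - min(xs)) * (width - 1))
        row = round((y - min(ys)) / (max(ys) - min(ys)) * (height - 1))
        grid[height - 1 - row][col] = "*"   # flip so y grows upward
    return "\n".join("".join(r) for r in grid)

plot = ascii_scatter([(1, 1), (2, 3), (4, 2), (5, 5)])
print(plot)
```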
CONVENTIONAL AND SPATIAL VISUALIZATION

Conventional visualization uses graphs, charts, and other graphic representations to depict information about a population, not the individual population data itself. It is a visual depiction of certain types of metadata, and it involves the computation of descriptive statistics such as counts, frequencies, ranges, and proportions. Spatial visualization, on the other hand, uses plots that depict actual members of the population in their feature spaces. With spatial visualization there is no computation of descriptive statistics; the depiction preserves the geometric relationships among the data, which are the "spatial" characteristics of the population being sampled and visualized.

RULE INDUCTION

Rule induction is the process of creating rules directly from the data without human interaction. It is a process within data mining that "derives a set of rules to classify cases" (Introduction to Data Mining and Knowledge Discovery, p. 17). The purpose of rule induction is to create a set of independent rules, which do not necessarily form a decision tree. The generated rules do not have to cover all possible scenarios, and their results can sometimes conflict with the predictions. When a result does not match the prediction, the practice is to assign a confidence to each rule and use the rule in which the analyst is more confident, keeping in mind knowledge of the business and of the data being analyzed. When more than two rules conflict, it is common practice for analysts to weight their votes of confidence on each rule. Rule induction algorithms determine decision surfaces of a specified type that correctly classify as much of the given data as possible. Most algorithms use decision surfaces that are combinations of points, lines, planes, and hyperplanes. The most common rule induction algorithms are based on a "splitting" technique.
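A minimal sketch of the splitting idea: scan candidate thresholds on one numeric feature and keep the split (a point decision surface) that classifies the most training cases correctly, then state it as an if-then rule. The feature, the labeled data, and the `best_split` helper are invented for illustration; real rule induction algorithms search many features and split criteria.

```python
# Minimal "splitting" sketch: try every candidate threshold on one
# numeric feature and keep the split that classifies the most training
# cases correctly. The feature and labels are invented for illustration.
def best_split(values, labels):
    """Return (threshold, accuracy) for the best rule
    'if value <= threshold then class 0 else class 1'."""
    best = (None, 0.0)
    for t in sorted(set(values)):
        correct = sum(1 for v, y in zip(values, labels)
                      if (0 if v <= t else 1) == y)
        acc = correct / len(values)
        if acc > best[1]:
            best = (t, acc)
    return best

ages = [22, 25, 30, 41, 45, 52]      # feature: customer age
bought = [0, 0, 0, 1, 1, 1]          # class: responded to an offer
threshold, accuracy = best_split(ages, bought)
print(f"if age <= {threshold} then 0 else 1  (accuracy {accuracy:.0%})")
```

The chosen threshold is the "split"; the resulting decision surface is a single point on the age axis, expressed as a predicate in an if-then rule.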
These splits create decision surfaces that are expressed as predicates in rules. Data mining techniques are very important to the data mining process as a whole. The choice of technique can vary and should depend on the type of data to be analyzed. The analyst should also have a good understanding of the business and of what the data is to be used for. This is important in determining what data constitutes information and how to go about creating rules that generate useful information of benefit to the business.

WORKS CITED

1. Two Crows Corporation. Introduction to Data Mining and Knowledge Discovery. Third Edition, 1999.
2. Delmater, R., and M. Hancock. Data Mining Explained. Boston: Digital Press, 2001.