JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER
SCIENCE AND APPLICATIONS
A STUDY OF DATA MINING APPLICATION.
1 MS. ZARANA C. PADIYA, 2 MS. SEEMA ZOPE, 3 MS. YESHA DAVE
1 Asst. Professor, MCA, SLDCCA, Bharuch, Gujarat.
2 Asst. Professor, MCA, SLDCCA, Bharuch, Gujarat.
3 Asst. Professor, MCA, RKCET, Rajkot, Gujarat.
Zarnapadia86@yahoo.in, spshekhawat@rediffmail.com, yesha112ysd@gmail.com
ABSTRACT: Data Mining was a totally new concept for us, and it took quite a long time to understand this
technology. We went through some data and a few datasets and found relationships among the attributes, but
it was a difficult task to establish relations for a dataset that was not known to us, and also to find the hidden
information in that dataset, since finding hidden and useful information in a given dataset is the major goal of data
mining technology. The major question was how to initiate the process, and this led us to study the entire technology in depth.
This study was very helpful in understanding the system, and it became the ladder by which we could climb
step by step towards analysis. The resultant system of this research is named the Web Based Data Mining
Application (WDMA). Before discussing WDMA as a whole, we would like to list the findings of our research.
Keywords: Data, Dataset, WDMA, Information, Mining.
I: INTRODUCTION
Data Mining refers generally to the exploration,
analysis and presentation of historical data, or can
refer to the specific act of retrieving information,
usually from a Data Warehouse.
In big organizations, datasets can be real assets.
Commercial datasets, for example in the retail sector,
are growing at unpredictable rates. Such datasets
contain a lot of information, which can often be
accessed only with the help of suitably designed
computer-based search and analysis applications. The
scientific approach to such search and analysis is
referred to as Data Mining.
Data Mining is the extraction of hidden information
from large datasets, and it is a powerful new
technology with great potential to help companies
focus on the most important information in their Data
Warehouses. This is enough to start with; we now
focus on the functionalities of Data Mining in more detail.
II: WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of
media or data; it should be applicable to
any kind of information repository. However,
algorithms and approaches may differ when applied to
different types of data. Data mining is being put into
use and studied for databases, including relational
databases, object-relational databases and
object-oriented databases; data warehouses;
transactional databases; unstructured and
semi-structured repositories such as the World Wide
Web; advanced databases such as spatial databases,
multimedia databases, time-series databases and
textual databases; and even flat files. Here are some
examples in more detail:
II.I FLAT FILES: Flat files are actually the most
common data source for data mining algorithms,
especially at the research level. Flat files are simple
data files in text or binary format with a structure
known by the data mining algorithm to be applied.
The data in these files can be transactions, time-series
data, scientific measurements, etc.
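A minimal sketch of this idea, assuming a hypothetical comma-separated rental file whose structure (transaction id, date, a ';'-separated item list) is known in advance by the mining algorithm; the file contents are invented for illustration:

```python
import csv
import io

# Hypothetical flat file of rental transactions, one line per
# transaction: id, date, and a ';'-separated item list.
raw = """T1,2013-01-05,video:Alien;game:Chess
T2,2013-01-06,video:Matrix
T3,2013-01-07,video:Alien;video:Matrix"""

# The mining algorithm must know the structure in advance: here we
# parse each row into an (id, date, items) tuple.
transactions = []
for row in csv.reader(io.StringIO(raw)):
    tid, date, items = row
    transactions.append((tid, date, items.split(";")))

for tid, date, items in transactions:
    print(tid, date, len(items))
```

The parsed tuples are then the input records for whatever mining step follows.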
II.II RELATIONAL DATABASES: Briefly, a relational
database consists of a set of tables containing either
values of entity attributes, or values of attributes from
entity relationships. Tables have columns and rows,
where columns represent attributes and rows represent
tuples. A tuple in a relational table corresponds to
either an object or a relationship between objects and
is identified by a set of attribute values representing a
unique key. In the figure below we present the
relations Customer, Items and Borrow, representing
business activity in a fictitious video store, Our Video
Store. These relations are just a subset of what could
be a database for the video store and are given as an
example.
Figure-1
ISSN: 0975 – 6728 | NOV 12 TO OCT 13 | VOLUME – 02, ISSUE – 02 | Page 106
The most commonly used query language for
relational databases is SQL, which allows retrieval and
manipulation of the data stored in the tables, as well
as the calculation of aggregate functions such as
average, sum, min, max and count. For instance, an
SQL query to select the videos grouped by category
would be:
SELECT category, COUNT(*) FROM Items
WHERE type = 'video' GROUP BY category;
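As a runnable sketch of this query, using Python's built-in sqlite3 module and an assumed Items schema (the paper does not specify one, so the columns and rows below are invented):

```python
import sqlite3

# Build a tiny in-memory Items table; schema assumed for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Items (item_id INTEGER, type TEXT, category TEXT)")
con.executemany("INSERT INTO Items VALUES (?, ?, ?)", [
    (1, "video", "action"),
    (2, "video", "action"),
    (3, "video", "drama"),
    (4, "game", "puzzle"),
])

# Count the videos grouped by category, as in the query above.
rows = con.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category"
).fetchall()
print(rows)
```

The game row is excluded by the WHERE clause, so only the two video categories are counted.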
Data mining algorithms using relational databases can
be more versatile than data mining algorithms
specifically written for flat files, since they can take
advantage of the structure inherent in relational
databases. While data mining can benefit from SQL
for data selection, transformation and consolidation, it
goes beyond what SQL can provide, for tasks such as
prediction, comparison and deviation detection.
II.III TRANSACTION DATABASES: A transaction
database is a set of records representing transactions,
each with a time stamp, an identifier and a set of items.
Associated with the transaction files could also be
descriptive data for the items. For example, in the case
of the video store, the rentals table shown in
the figure below represents the transaction database.
Each record is a rental contract with a customer
identifier, a date, and the list of items rented (i.e.
video tapes, games, VCRs, etc.). Since relational
databases do not allow nested tables (i.e. a set as an
attribute value), transactions are usually stored in flat
files or in two normalized transaction tables,
one for the transactions and one for the transaction
items. One typical data mining analysis on such data
is the so-called market basket analysis, or mining of
association rules, in which associations between items
occurring together or in sequence are studied.
Figure-2: Transaction Database.
II.IV MULTIMEDIA DATABASES: Multimedia
databases include video, image, audio and text
media. They can be stored on extended object-relational
or object-oriented databases, or simply on a
file system. Multimedia data are characterized by high
dimensionality, which makes data mining even more
challenging. Data mining from multimedia
repositories may require computer vision, computer
graphics, image interpretation and natural language
processing methodologies.
II.V SPATIAL DATABASES: Spatial databases are
databases that, in addition to usual data, store
geographical information such as maps and global or
regional positioning. Such spatial databases present
new challenges to data mining algorithms.
Figure: Spatial Database.
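The market basket analysis mentioned for transaction databases (II.III) can be sketched by counting items that co-occur across baskets; the rental baskets and the minimum support threshold below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

# Hypothetical rental transactions (baskets) from the video store.
rentals = [
    {"Alien", "Matrix", "Chess"},
    {"Alien", "Matrix"},
    {"Matrix", "Chess"},
    {"Alien", "Matrix"},
]

# Count how often each unordered pair of items is rented together.
pair_counts = Counter()
for basket in rentals:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs meeting a minimum support of 3 baskets are kept as associations.
frequent = {p: c for p, c in pair_counts.items() if c >= 3}
print(frequent)
```

Real association-rule miners (e.g. Apriori) generalise this counting to item sets of any size and add a confidence measure.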
II.VI TIME-SERIES DATABASES:
Time-series databases contain time-related data such as
stock market data or logged activities. These
databases usually have a continuous flow of new data
coming in, which sometimes creates the need for
challenging real-time analysis. Data mining in such
databases commonly includes the study of trends and
correlations between the evolutions of different variables,
as well as the prediction of trends and movements of
the variables in time. The figure below shows some
examples of time-series data.
Figure: Time series Database
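The study of trends and correlations between variables described above can be sketched as follows; the two daily series and the smoothing window are invented for illustration:

```python
from statistics import mean, pstdev

# Two hypothetical daily series: a stock price and its trading volume.
price  = [10.0, 10.5, 11.2, 10.8, 11.5, 12.1, 12.0]
volume = [100, 120, 150, 130, 160, 180, 175]

# Pearson correlation between the evolutions of the two variables.
mp, mv = mean(price), mean(volume)
cov = mean((p - mp) * (v - mv) for p, v in zip(price, volume))
r = cov / (pstdev(price) * pstdev(volume))

# A 3-day moving average smooths the series to expose the trend.
trend = [mean(price[i:i + 3]) for i in range(len(price) - 2)]
print(round(r, 3), [round(t, 2) for t in trend])
```

A high positive r here indicates the two variables evolve together, the kind of relationship such mining looks for.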
III: FUNCTIONS OF DATA MINING:
III. I KNOWLEDGE DISCOVERY:
What are Data Mining and Knowledge Discovery?
With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly
important, if not necessary, to develop powerful
means for analysis and perhaps interpretation of such
data and for the extraction of interesting knowledge
that could help in decision-making.
Data Mining, also popularly known as Knowledge
Discovery in Databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown
and potentially useful information from data in
databases. While data mining and knowledge
discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of
the knowledge discovery process. The following
figure shows data mining as a step in an iterative
knowledge discovery process.
Data mining software can explore large sets of data
without a predetermined hypothesis to find interesting
patterns, and then present this information to the user.
As for terminology, the broad process of knowledge
discovery is known as Knowledge Discovery in
Databases (KDD). The term refers to the overall
process of finding knowledge in data, and emphasizes
the "high level" application of particular Data Mining
methods. It is of interest to researchers in pattern
recognition, databases, statistics, artificial intelligence,
knowledge acquisition for expert systems, and data
visualization. The unifying goal of the KDD process
is to extract knowledge from data in the context of
large datasets. The process comprises a few steps
leading from raw data collections to some form of
new knowledge; the iterative process consists of the
following steps:
III.I.I Data cleaning: also known as data cleansing,
this is a phase in which noisy and irrelevant data
are removed from the collection.
III.I.II Data integration: at this stage, multiple data
sources, often heterogeneous, may be combined in a
common source.
III.I.III Data selection: at this step, the data
relevant to the analysis is decided on and retrieved
from the data collection.
III.I.IV Data transformation: also known as data
consolidation, it is a phase in which the selected data
is transformed into forms appropriate for the mining
procedure.
III.I.V Data mining: this is the crucial step in which
clever techniques are applied to extract potentially
useful patterns.
III.I.VI Pattern evaluation: in this step, strictly
interesting patterns representing knowledge are
identified based on given measures.
III.I.VII Knowledge representation: is the final
phase in which the discovered knowledge is visually
represented to the user.
KDD is an iterative process. Once the discovered
knowledge is presented to the user, the evaluation
measures can be enhanced, the mining can be further
refined, new data can be selected or further
transformed, or new data sources can be integrated,
in order to get different, more appropriate results.
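The steps above can be sketched end-to-end on toy records; the field names, values and age bands below are invented for illustration, not taken from the paper:

```python
from collections import Counter

# Raw collection with one noisy record (missing age).
raw = [
    {"age": 25, "income": 30000, "bought": "yes"},
    {"age": None, "income": 52000, "bought": "no"},
    {"age": 40, "income": 61000, "bought": "yes"},
    {"age": 33, "income": 45000, "bought": "yes"},
]

# 1. Data cleaning: drop records with missing values.
clean = [r for r in raw if None not in r.values()]

# 2-3. Integration/selection: keep only the fields relevant to mining.
selected = [(r["age"], r["bought"]) for r in clean]

# 4. Transformation: discretise age into coarse bands.
banded = [("young" if a < 35 else "older", b) for a, b in selected]

# 5. Mining: count the frequency of each pattern.
patterns = Counter(banded)

# 6-7. Evaluation/representation: report the best-supported pattern.
best, support = patterns.most_common(1)[0]
print(best, support)
```

In a real KDD loop the result would be shown to the user, and the selection, transformation or mining steps refined accordingly.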
III.II PREDICTIVE MODELING:
This function uses discovered patterns to predict
future behavior. For example, information gathered
from credit card transactions can be used to identify
customer priorities or instances of fraud.
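As an illustration of such prediction, a minimal sketch that flags a card transaction deviating far from a customer's historical spending pattern; the amounts and the 3-sigma threshold are invented assumptions:

```python
from statistics import mean, pstdev

# Hypothetical past card transactions for one customer.
history = [25.0, 40.0, 32.0, 28.0, 35.0]

def looks_fraudulent(amount, past, k=3.0):
    """Flag amounts more than k standard deviations from the mean."""
    mu, sigma = mean(past), pstdev(past)
    return abs(amount - mu) > k * sigma

print(looks_fraudulent(30.0, history))   # typical spend
print(looks_fraudulent(900.0, history))  # far outside the pattern
```

Production fraud models are far richer, but the principle is the same: patterns learned from past data score new behavior.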
III.III FORENSIC ANALYSIS:
This is the process of applying the extracted patterns
to find anomalous and unusual patterns in the data.
For example, a retail analyst could explore the reasons
why a particular population group makes certain
types of purchases in a particular store.
IV. DATA MINING TECHNOLOGIES:
Data mining software uses a variety of different
approaches to sift and sort data, identify patterns and
process information. Methods adopted include:
Figure: Data mining as the core of the knowledge discovery process.
Decision-tree approach
Regression approach
Rule discovery approach
Neural network approach
Genetic programming
Fuzzy logic
Nearest Neighbor approach
These methods can be combined in different ways to
sift and sort complex data. These methods no doubt
provide the means to mine the data, but there are many
statistical activities forming the base for these
techniques. Since working with these kinds of
technologies is not an easy task, the hurdles are
simplified by statistical methods and various
statistical tests. Commercial software packages often
use a combination of two or more of these methods.
IV.I DECISION-TREE APPROACH
Decision-tree systems partition data sets into smaller
subsets, based on simple conditions, from a single
starting point. In the example below, the decision tree
is used to make decisions about expenses, from the
starting point of 'Grade'.
A disadvantage of this approach is that there will
always be some information loss, because a decision
tree selects one specific attribute for partitioning at
each stage, with a single starting point. The decision
tree can present one set of outcomes, but not more
than one set, as there is a single starting point.
Therefore decision trees are suited to data sets where
there is one natural attribute to start with.
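Choosing which single attribute to partition on can be scored by entropy reduction (information gain, the criterion the ID3 algorithm described below uses); a minimal sketch on invented expense records, where 'grade' separates the classes perfectly and 'city' not at all:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, label="class"):
    """Entropy reduction from partitioning records on one attribute."""
    base = entropy([r[label] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[label] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Toy records: 'grade' splits the classes cleanly, 'city' does not.
data = [
    {"grade": "A", "city": "X", "class": "approve"},
    {"grade": "A", "city": "Y", "class": "approve"},
    {"grade": "B", "city": "X", "class": "reject"},
    {"grade": "B", "city": "Y", "class": "reject"},
]
print(information_gain(data, "grade"), information_gain(data, "city"))
```

A tree builder would therefore split on 'grade' first, which is exactly the single-attribute commitment (and potential information loss) discussed above.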
While studying decision trees we came across some
algorithms specially designed to generate decision
trees: CHAID, C&RT and ID3. They are discussed
below.
CHAID
The acronym CHAID stands for Chi-squared
Automatic Interaction Detector. It is one of the oldest
tree classification methods, originally proposed by
Kass (1980; according to Ripley, 1996, the CHAID
algorithm is a descendant of THAID, developed by
Morgan and Messenger, 1973). CHAID will "build"
non-binary trees (i.e., trees where more than two
branches can attach to a single root or node), based
on a relatively simple algorithm that is particularly
well suited to the analysis of larger datasets. Also,
because the CHAID algorithm will often effectively
yield many multi-way frequency tables, it has been
particularly popular in marketing research, in the
context of market segmentation studies.
C&RT
C&RT builds classification and regression trees for
predicting continuous dependent variables
(regression) and categorical dependent variables
(classification). The classic C&RT algorithm was
popularized by Breiman et al. (Breiman, Friedman,
Olshen, & Stone, 1984; see also Ripley, 1996). This
algorithm is used basically for solving regression and
classification problems.
ID3
The ID3 classification algorithm builds a decision
tree from a fixed set of examples. The resulting tree
is used to classify future samples. Each example has
several attributes and belongs to a class (such as yes
or no). The leaf nodes of the decision tree contain the
class name, whereas a non-leaf node is a decision
node. The decision node is an attribute test, with each
branch (to another decision tree) being a possible
value of the attribute.
ID3 uses information gain to help it decide which
attribute goes into a decision node. The advantage of
learning a decision tree is that a program, rather than
a knowledge engineer, elicits knowledge from an
expert.
IV.II NEURAL NETWORKING:
Neural networking classifies large sets of data and
assigns weights or scores to the data. This information
is then retained by the software and adjusted as it
undergoes further iterations.
A neural net consists of a number of interconnected
elements (called neurons) which learn by modifying
the connections between them. Each neuron has a set
of weights that determine how it evaluates the
combined strength of the input signals. Once the
neural network has calculated the relative effect each
of these characteristics has on the data, it can apply
the knowledge it has learned to a new set of data.
Neural networks can "learn" from examples.
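A minimal sketch of such learning from examples, using a single neuron and the classic perceptron weight-update rule; the AND dataset and the learning rate are invented for illustration:

```python
# A single artificial neuron with the perceptron learning rule.
def fire(weights, bias, inputs):
    """The neuron combines the input signals by their weights."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

# Learn the AND function from labelled examples.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, rate = [0.0, 0.0], 0.0, 0.1

for _ in range(20):  # repeated iterations adjust the weights
    for x, target in samples:
        error = target - fire(w, b, x)
        w = [wi + rate * error * xi for wi, xi in zip(w, x)]
        b += rate * error

print([fire(w, b, x) for x, _ in samples])  # -> [0, 0, 0, 1]
```

Each wrong output nudges the connection weights toward the target, which is the "modifying the connections" behavior described above.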
Figure: A simple neural network.
Figure: A neural network with feedback and competition.
However, the disadvantage of neural networks is that
the input has to be numeric, which may lead to
complications when dealing with non-scalar fields
such as Country or Product, where numeric labels
have to be given to fields of equal value. A neural
network in the process of iteration may come to
assign relationships or values based on these arbitrary
numbers, which would corrupt the output.
IV.III FUZZY LOGIC
Rules incorporate probability, so "good" might mean a
70% success rate or a 90% success rate. This is called
an inexact rule. A "fuzzy" rule can vary in terms of
the numeric values in the body of the rule.
For example, the confidence might vary according to
the value of one of the variables (e.g. as the age
increases). Fuzzy logic assesses data in terms of
possibility and uncertainty.
IF income is low AND person is young
THEN credit limit is low
This rule is fuzzy because of the imprecise definitions
of "income", "young" and "credit limit". The credit
limit will change as the age or income changes.
IV.IV NEAREST NEIGHBOUR APPROACH
The nearest neighbour method matches patterns
between different sets of data. This approach is based
on data retention. When a new record is presented for
prediction, the "distance" between it and similar
records in the data set is found, and the most similar
neighbours are identified.
For example, a bank may compare a new customer
with all existing bank customers, by examining age,
income etc., and so set an appropriate credit rating.
V. CONCLUSION
With this research we would like to conclude that the
study of data mining and its techniques has made us
understand the vast and varying world of data
mining. By learning the scope and applications of
data mining we got the idea for our work, i.e. the
Web Based Data Mining Application.
Although it has many complexities still to unfold, it
has its own importance and advantages, so we would
like to go ahead with it.
REFERENCES:
Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann.
Jean-Marc Adamo, Data Mining for Association
Rules and Sequential Patterns, Springer.