International Journal of Engineering Trends and Technology - Volume 3, Issue 2 - 2012

Balancing Exploration and Exploitation using Search Mining Techniques

Sonal Kapoor, Narendra Kumar and Alok Aggrawal
Singhania University, Jhunjhanu; ACEM, Agra; JIIT, Noida

Abstract: Search Mining, the process of extracting patterns from data, is becoming an increasingly important tool for transforming data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. In this paper we present an overview of balancing exploration and exploitation, and we also look into Search Mining methods on the application side.

Introduction: Search Mining techniques are becoming indispensable parts of business intelligence programs. Search Mining is used to uncover patterns in data, but it is often carried out only on samples of data. The mining process will be ineffective if the samples are not a good representation of the larger body of data: Search Mining cannot discover patterns that may be present in the larger body of data if those patterns are not present in the sample being "mined". The inability to find patterns may become a cause of disputes between customers and service providers. Search Mining is therefore not foolproof, but it may be useful if sufficiently representative data samples are collected. The discovery of a particular pattern in a particular set of data does not necessarily mean that the pattern holds elsewhere in the larger data from which that sample was drawn; an important part of the process is the verification and validation of patterns on other samples of data. The related terms data dredging, data fishing and data snooping refer to the use of Search Mining techniques on sample sizes that are (or may be) too small for statistical inferences to be made about the validity of any patterns discovered. Data dredging may, however, be used to develop new hypotheses, which must then be validated with sufficiently large sample sets.

Evolution: In addition to industry-driven demand for standards and interoperability, professional and academic activity has also made a considerable contribution to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarizes the results of a literature survey which traces and analyzes this evolution. The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Search Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings, and since 1999 it has published a biannual academic journal, SIGKDD Explorations.

Research Methodology:
• Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structure in the data.
• Classification – the task of generalizing known structure to apply to new data. For example, an email programme might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naïve Bayesian classification and neural networks.
• Regression – attempts to find a function which models the data with the least error.
• Association rule learning – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits; using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis (a minimal sketch of this technique follows the list below).
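As a concrete illustration of the last bullet, here is a minimal market basket analysis sketch in Python. The transactions, item names and thresholds are all invented for the example, and the brute-force pair counting merely stands in for a real association rule algorithm such as Apriori.

```python
from itertools import combinations
from collections import Counter

# Hypothetical supermarket transactions (each set is one basket).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"bread", "milk"},
]

MIN_SUPPORT = 0.4      # a pair must appear in at least 40% of baskets
MIN_CONFIDENCE = 0.7   # P(B | A) threshold for keeping a rule A -> B

n = len(transactions)
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))

# Emit rules A -> B whose support and confidence clear the thresholds.
for (a, b), count in pair_counts.items():
    support = count / n
    if support < MIN_SUPPORT:
        continue
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]
        if confidence >= MIN_CONFIDENCE:
            print(f"{lhs} -> {rhs}  support={support:.2f} "
                  f"confidence={confidence:.2f}")
```

On this toy data the sketch reports the rules "bread -> butter" and "butter -> bread", which is exactly the "frequently bought together" signal a supermarket would exploit for marketing.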
SEARCH MINING IN BUSINESS

Through the use of automated statistical analysis (or "data mining") techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once they have uncovered this vital intelligence, it can be used in a predictive manner for a variety of applications. Brian James, assistant coach of the Toronto Raptors, uses Search Mining techniques to rack and stack his team against the rest of the NBA, and the Bank of Montreal's business intelligence and knowledge discovery program is used to gain insight into customer behavior.

RESULTS VALIDATION

The final step of knowledge discovery from data is to verify that the patterns produced by the mining algorithms occur in the wider data set. Not all patterns produced by Search Mining algorithms are necessarily valid: it is common for the algorithms to find patterns in the training set which are not present in the general data set, and this is called overfitting. To overcome this, the evaluation uses a test set of data on which the Search Mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a Search Mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails; once trained, the learnt patterns would be applied to a test set of emails on which it had not been trained, and the accuracy of the patterns can then be measured by how many emails they classify correctly. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves. If the learnt patterns do not meet the desired standards, it is necessary to re-evaluate and change the preprocessing and mining steps; if they do meet the desired standards, the final step is to interpret the learnt patterns and turn them into knowledge.
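The spam example above can be made concrete with a short sketch of the holdout step in Python. Everything here is hypothetical: the labeled test set is invented, and the keyword rule inside `classify` merely stands in for whatever patterns the mining step actually learnt on the training set.

```python
# Hypothetical labeled test set the classifier was NOT trained on:
# (email text, True if spam).
test_set = [
    ("win a free prize now", True),
    ("meeting moved to 3pm", False),
    ("cheap loans, act now", True),
    ("quarterly report attached", False),
    ("free tickets inside", True),
]

# Stand-in for the patterns learnt on the training set: here, a
# trivial keyword rule. A real classifier would come from training.
SPAM_WORDS = {"free", "prize", "cheap", "now"}

def classify(email: str) -> bool:
    """Predict True (spam) if the email contains a learnt spam word."""
    return any(word in email.split() for word in SPAM_WORDS)

# Apply the learnt patterns to the held-out test set and measure
# accuracy: the fraction of emails classified correctly.
correct = sum(classify(text) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(f"held-out accuracy: {accuracy:.0%}")
```

If this held-out accuracy falls below the desired standard, the text above prescribes the remedy: go back and change the preprocessing and mining steps rather than trusting the overfit patterns.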
GATHERING DATA

The first step towards building a productive Search Mining program is, of course, to gather data. Most businesses already perform these data-gathering tasks to some extent; the key here is to locate the data critical to your business, refine it and prepare it for the Search Mining process. If you are currently tracking customer data in a modern DBMS, chances are you are almost done. Take a look at the article "Mining Customer Data" from DB2 Magazine for a feature on preparing your data for the mining process.

SELECTING AN ALGORITHM

At this point, take a moment to pat yourself on the back: you have a data warehouse. The next step is to choose one or more Search Mining algorithms to apply to your problem. If you are just starting out, it is probably a good idea to experiment with several techniques to get a feel for how they work. Your choice of algorithm will depend upon the data you have gathered, the problem you are trying to solve and the computing tools you have available.

REGRESSION

Regression is the oldest and most well-known statistical technique that the Search Mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you are ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula, and you have a prediction. The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you are working with categorical data where order is not significant (like color, name or gender), you are better off choosing another technique.
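Here is a minimal sketch of this "develop a formula, then plug in new data" workflow, using the closed-form ordinary least squares fit for a straight line. The dataset is invented purely for illustration.

```python
# Fit y = a*x + b by ordinary least squares, then predict new points.
# The (x, y) pairs are invented: say, engine size vs. fuel use.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x: float) -> float:
    """Plug new data into the developed formula."""
    return slope * x + intercept

print(f"model: y = {slope:.3f}*x + {intercept:.3f}")
print(f"prediction for x=6: {predict(6.0):.2f}")
```

Note that the formula only makes sense because x is continuous and ordered; as the section says, for categorical inputs like color or name there is no meaningful "slope", and another technique is needed.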
CLASSIFICATION

Working with categorical data, or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You will also find that the output is much easier to interpret: instead of the complicated mathematical formula given by the regression technique, you receive a decision tree that requires a series of binary decisions (a small sketch of such a tree appears after the glossary below).

SEARCH MINING PRODUCTS

Search Mining products are taking the industry by storm. The major database vendors have already taken steps to ensure that their platforms incorporate Search Mining techniques. Oracle's Search Mining Suite (Darwin) implements classification and regression trees, neural networks, k-nearest neighbors, regression analysis and clustering algorithms. Microsoft's SQL Server also offers Search Mining functionality through the use of classification trees and clustering algorithms. If you are already working in a statistics environment, you are probably familiar with the Search Mining algorithm implementations offered by the advanced statistical packages SPSS, SAS and S-Plus.

Glossary of Terms:
Analytical Model: A structure and process for analyzing a dataset.
Anomalous Data: Data that result from errors.
Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
CART: Classification and Regression Trees.
CHAID: Chi-squared Automatic Interaction Detection.
Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another.
Data Cleansing: The process of ensuring that all values in a dataset are consistent and correctly recorded.
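To make the classification discussion and the glossary's CART entry concrete, here is a hand-written decision tree in Python: a series of binary decisions over mixed categorical and numeric data, as described in the CLASSIFICATION section above. The features, thresholds and class labels are all invented; a real tree would be induced from training data by an algorithm such as CART.

```python
# A hand-written decision tree: each internal node is a binary
# decision, each leaf a class label. Features and thresholds are
# invented; a real tree would be learnt from data (e.g. by CART).

def classify_customer(record: dict) -> str:
    """Classify a customer as 'likely buyer' or 'unlikely buyer'."""
    if record["age"] < 30:                 # first binary decision (numeric)
        if record["owns_card"]:            # second decision (categorical)
            return "likely buyer"
        return "unlikely buyer"
    if record["income"] >= 50_000:         # alternative branch (numeric)
        return "likely buyer"
    return "unlikely buyer"

# Usage: plug in new records and read off the predicted class.
customers = [
    {"age": 25, "owns_card": True,  "income": 20_000},
    {"age": 45, "owns_card": False, "income": 80_000},
    {"age": 35, "owns_card": True,  "income": 30_000},
]
for c in customers:
    print(c, "->", classify_customer(c))
```

Unlike the regression formula earlier, this output can be read directly as business rules, which is precisely the interpretability advantage the CLASSIFICATION section claims.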