A Publication on Data Mining
Mgr. Ing. Adriana Horníková, PhD
Inovace 2010, Praha, 30.11.–3.12.2010

Data Mining
Data mining is a new and very dynamically developing scientific discipline intended for discovering new knowledge in databases (extracting information and relationships from data sets). According to the technology magazine ZDNET News (February 8, 2001), data mining is expected to undergo revolutionary development in the coming decade. [16] MIT Technology Review selected data mining as one of ten emerging technologies that will change the world.

The authors D. Hand, H. Mannila and P. Smyth write: "Data mining is the analysis of (often large) sets of observed data with the aim of finding unexpected relationships and of summarizing the data in new ways that are both understandable and useful to the data owner." Data mining goes by several synonymous names, such as knowledge extraction, knowledge discovery, information retrieval, data archaeology and/or data pattern processing.

Data Mining – Definition
Today we define data mining as an interactive and iterative process of discovering knowledge from experimental data. It consists of the following steps: defining the problem, formulating hypotheses, collecting the data, preprocessing the collected data, building a model or estimating, interpreting the results, and summarizing. Summarization is partly equivalent to solving the problem, or to solving it by an iterative procedure.

Data Mining – Approach
Data mining currently uses one of two problem-solving approaches: predictive or descriptive. The predictive approach uses known variables to predict an unknown realization of a variable or to predict other unknown variables. Different approaches suit different tasks.
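The predictive approach can be illustrated with a minimal sketch: a one-variable least-squares regression that predicts an unseen realization of a variable from known observations. The data and variable names below are invented for this illustration and are not taken from the publication.

```python
# Minimal sketch of the predictive approach: ordinary least squares
# with one known variable x predicting an unknown variable y.
# The observations below are invented purely for illustration.

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Known observations (e.g. past consumption measured against a driver variable).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6.0 + intercept  # predict the unseen realization at x = 6
```

The same idea scales to the multiple linear regression mentioned later in the task table, where several known variables predict the unknown one.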
The descriptive approach uses the recognition of patterns, shapes and relationships in a descriptive process; these can be interpreted from the experimental data and evaluated with statistical tools.

Data Mining – Approach 2

Task type   | Specific task     | Example methods
Descriptive | Association       | association rules, decision trees, data visualization
Descriptive | Segmentation      | clustering, decision trees
Predictive  | Outlier detection | clustering, data visualization
Predictive  | Classification    | discriminant analysis, logistic regression, naive Bayes classifier
Predictive  | Regression        | multiple linear regression

Data Mining – Results
The output of knowledge discovery in databases is the newly generated knowledge, usually described in terms of rules, patterns, classification models, associations, trends, statistical analyses, etc. [3] But what is the actual purpose of data mining? It is the process of making decisions. Decisions in organizations should be based on extensive data mining and analytics to model what-if scenarios, forecast the future, and minimize risks. In the book Information Revolution, the authors classify an organization that utilizes data mining as being at level 5 (the highest level possible).

Data Mining – Applications
There are currently several different areas of application of data mining; some examples are: prediction of energy consumption, prediction of market exchange rates, scoring of clients of banks or insurance companies, analysis and comparison of service providers, reliability analysis of machines or their parts, analysis of hospital patients, market basket analysis, and similar tasks connected with large data banks or series of records. Examples can also be found at the macroeconomic level.

Data Mining – Philosophy
To formalize the knowledge discovery process within a common framework, several authors introduced the concept of a process model and, further, of a standardized process model.
In general, several standard methodologies for data mining are currently available. There are many nuances to this process, but roughly the steps are to preprocess the raw data, mine the data, and interpret the results.

Data Mining – Methodology
Several widespread methods for discovering knowledge in databases are currently in use. The dominant methods are the SEMMA methodology (Sample, Explore, Modify, Model and Assess), the 5A methodology (Assess, Access, Analyze, Act and Automate) and, not least important, the CRISP-DM methodology (the CRoss Industry Standard Process for Data Mining).

CRISP-DM Methodology
CRISP-DM is an iterative and adaptive process of six basic steps that can be used in differing order when analyzing a data mining problem:
1. the research understanding or business understanding phase (with several sub-steps: determination of business objectives, assessment of the situation, determination of data mining goals and generation of a project plan),
2. the data understanding phase (with several sub-steps: collection of initial data, description of data, exploration of data and verification of data quality),
3. the data preparation phase (with several sub-steps: selection of data, cleansing of data, construction of data, integration of data and formatting of data subsets),
4. the modeling phase (with several sub-steps: selection of modeling techniques, generation of test design, creation of models and assessment of generated models),
5. the evaluation phase (with several sub-steps: evaluation of the results, process review and determination of the next step) and
6. the deployment phase (with several sub-steps: plan deployment, plan monitoring and maintenance, generation of the final report and review of the process sub-steps).
This methodology is very often presented as a circle of the data mining project with six phases.
The order in which the steps are carried out is not fixed; the outputs of one step merely influence the choice of approaches within the subsequent step. Sometimes it is necessary to return to and re-evaluate several earlier steps. A circle is thus a good symbol of the repeatedly re-evaluated cycle of the CRISP-DM methodology's steps. This methodology is supported by the Clementine® data mining software suite from SPSS.

5A Methodology
The 5A methodology uses the following steps: Assess, Access, Analyze, Act and Automate. The starting step covers the definition of aims, the declaration of strategy, and processes. The second step is the creation of the database. In the third step the data mining algorithms are employed. In the subsequent step, interpretations of the results and recommendations are formulated. The last step (Automate) stands for the implementation of the recommendations in practical applications and the improvement of the business process.

SEMMA Methodology
The software of the SAS company has its own methodology for data mining, SEMMA. The model's abbreviation stands for Sample (identify input data sets), Explore (explore the data sets statistically and graphically), Modify (prepare the data for analysis), Model (fit a predictive model) and Assess (compare the predictive models). A specialized licensed module of the SAS company's package dedicated to data mining is SAS Enterprise Miner®.

The New Publication
The book should have 240 pages and is divided into 10 chapters. The first four chapters are dedicated to the philosophy of data mining. Models and techniques for seeking hidden nuggets of information are described, with insight into how data mining algorithms operate, and so on. Each method is presented in detail, and the underlying statistical or mathematical principles are explained as well, in the form of a "white-box" approach.

Chapter Overview
Chapter 1 is entitled Data Mining with Datasets.
Chapter 2 details steps for the classification of data, approaches to missing data, outlier seeking, and the reduction of data into feature vectors or signatures characterizing observations, customers, records, etc. Chapter 3 explains the learning process. Chapter 4 deals with the interpretation of results and the comparison of models. Chapter 5 covers decision trees and the creation of decision rules, e.g. pruning.

Chapter Overview 2
Chapter 6 is about finding association rules, also called market basket analysis. Chapter 7 is about neural networks, the model of an artificial neuron, architectures of artificial neural networks, etc. Chapter 8 is about regression and correlation analysis (the logit regression function). Chapter 9 covers the simple Bayesian classifier, discriminant analysis, clustering and Kohonen's maps. Chapter 10 refers to genetic algorithms.

Association Rules Mining
Here the reader finds a thorough definition of the association rule "IF antecedent THEN consequent". A purchase pattern can be characterized by several statistics evaluated as probabilities. The most commonly used are the descriptive support, the descriptive confidence and the lift ratio. [2] Some less well-known ones are the benchmark confidence value, the coverage, the quality, the causal support, the causal confidence, the descriptive confirmation, the causal confirmation, the conviction, the interestingness and the dependency.

Association Rules Mining 2
[Figure 1. Associations diagram for a chemist's purchases (items: tanning cream, bag, powder, nail polish, eye shadow, brush, hand cream). Example.]

Clustering
The majority of the space in this chapter is dedicated to cluster analysis (clustering). There is a wide spectrum of different clustering principles: agglomerative, divisive or hierarchical (table 2). Cluster analysis can utilize principles such as probability distributions, fuzzy logic, mathematical models, grids, etc.
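The agglomerative principle just mentioned can be sketched in a few lines. The following bottom-up variant is a minimal sketch only, assuming single linkage and Euclidean distance; the data points are invented for illustration. It repeatedly merges the two closest clusters until the desired number remains:

```python
# Minimal sketch of agglomerative (bottom-up hierarchical) clustering
# with single linkage and Euclidean distance. The data points are
# invented; real implementations also build the full merge tree.

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage(c1, c2):
    """Cluster distance = distance of the closest pair of points."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.3)]
clusters = agglomerate(points, 2)  # two groups: near the origin, near (5, 5)
```

A divisive method would run the other way, splitting one all-encompassing cluster top-down; the choice of linkage and distance measure is discussed next.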
The distinguishing feature is the distance measure used for measuring the distance between objects. Several simple measures (Euclidean distance or Hamming / city-block distance) or more sophisticated ones (Minkowski distance and Mahalanobis distance) can be used, and there are many other similarity and distance measures besides.

Clustering Methods (Table 2)

Type of method | Advantages                                                                        | Disadvantages
Hierarchical   | shows hierarchical relationships, does not require predefining the number of clusters | slow to compute, not suitable for large data sets
Partitioned    | fast to compute, works with large data sets                                       | the number of clusters must be specified upfront, does not show relationships
Fuzzy          | fast to compute, works with large data sets                                       | the number of clusters must be specified upfront

Genetic Algorithms
Evolutionary algorithms (of which genetic algorithms are one kind) have several desirable qualities: the ability to search the experimental space globally, good handling of interactions between variables, open-ended pattern finding, and the ability to find knowledge at the highest level of generalization. Nowadays genetic algorithms are considered to be among the best approaches to machine learning. The most significant negative feature of genetic algorithms is their long computation time.

Genetic Algorithms 2
Important when applying genetic algorithms is the Fundamental Theorem of Genetic Algorithms (the schema theorem), which says that short, low-order schemata with high fitness values will exhibit an exponentially growing presence in the following generations. The author, J. Holland, also exploited other features of schemata. The chapter concludes with a simple example of the implementation of a genetic algorithm (the authors use examples like the travelling salesman problem, input-to-output mapping, or finding the optimum of a function on a given interval of definition), followed by a graphical presentation of genetic programming.
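The last of these examples, finding the optimum of a function on a given interval, can be sketched as follows. The objective function, the operator choices (tournament selection, arithmetic crossover, Gaussian mutation) and all parameter values are illustrative assumptions, not taken from the book.

```python
import random

# Minimal sketch of a genetic algorithm maximizing a function on an
# interval. Objective, operators and parameters are illustrative only.

def fitness(x):
    return -(x - 0.7) ** 2 + 1.0   # toy objective: maximum 1.0 at x = 0.7

LOW, HIGH = 0.0, 1.0               # interval of definition

def evolve(generations=60, pop_size=30, p_mut=0.2, seed=42):
    rng = random.Random(seed)
    pop = [rng.uniform(LOW, HIGH) for _ in range(pop_size)]
    for _ in range(generations):
        def select():
            # Tournament selection: the fitter of two random individuals.
            a, b = rng.choice(pop), rng.choice(pop)
            return a if fitness(a) > fitness(b) else b
        children = []
        while len(children) < pop_size:
            mother, father = select(), select()
            child = (mother + father) / 2.0      # arithmetic crossover
            if rng.random() < p_mut:             # occasional mutation
                child += rng.gauss(0.0, 0.05)
            children.append(min(max(child, LOW), HIGH))  # clamp to interval
        pop = children
    return max(pop, key=fitness)

best = evolve()  # converges close to the optimum at x = 0.7
```

The exponential growth of fit schemata described by the schema theorem is what drives such a population toward the optimum over the generations.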
Genetic Algorithms 3
Data mining applications of genetic algorithms are varied, e.g. numerical optimization of algorithms, machine learning, and model finding in economic, social and population fields. Their optimization qualities can be exploited in industry, including complex planning problems, resource optimization of large factories, classification of large complex databases (e.g. of e-mail messages), and predictive modelling. Examples of implementing genetic algorithms in conjunction with other data mining methodologies are the optimization of the weights of a neural network (assigned through the genetic algorithm) and the use of a genetic algorithm to find the best topology of an artificial neural network.

Conclusions
Early methods of identifying patterns in data include Bayes' theorem and regression analysis. As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming these data into information. Discoveries in computer science enabled new methods such as neural networks, clustering, genetic algorithms, decision trees and support vector machines. Data mining is today commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

Conclusions 2
The evolution of knowledge discovery in databases (data mining) has already undergone three distinct phases: first-generation systems provided only one data mining technique with very weak support for the overall process framework; second-generation systems (suites) provided multiple types of integrated data analysis methods; and third-generation systems introduced a vertical approach which enables them to address specific business problems.