Data Mining (In-Depth Data Analysis)

A publication on in-depth data analysis,
i.e. on data mining
Mgr. Ing. Adriana Horníková, PhD
Inovace 2010,
Praha 30.11.-3.12.2010
Data Mining
Data mining is a new and very dynamically
developing scientific discipline, designed for
discovering new knowledge in databases
(extracting information and relationships from
data sets). According to the technology magazine
ZDNET News (of February 8, 2001), data mining is
expected to undergo revolutionary development in
the coming decade. [16]
MIT Technology Review selected data mining as
one of ten emerging technologies that will
change the world.
Data Mining 2
The authors D. Hand, H. Mannila and P. Smyth write:
"Data mining is the analysis of (often large) sets
of observational data with the aim of finding
unsuspected relationships and of summarizing the
data in new ways that are both understandable and
useful to the data owner."
Data mining has several synonymous names, such as
knowledge extraction, knowledge discovery,
information harvesting, data archaeology and/or
data pattern processing.
Data Mining – Definition
Currently we define data mining as an interactive
and iterative process of discovering knowledge
from experimental data. It consists of the
following steps: problem definition, formulation
of hypotheses, data collection, preprocessing of
the collected data, model building or estimation,
interpretation of the results and summarization.
Summarization is partly equivalent to solving the
problem, or to solving the problem by an
iterative procedure.
Data Mining – Approach
Currently, data mining uses one of two
problem-solving methods: the predictive or the
descriptive approach.
The predictive approach uses known variables to
predict an unknown realization of a variable or
to predict other unknown variables. Different
procedures can be used for different tasks.
The descriptive approach uses the recognition of
patterns/shapes and relationships in a descriptive
process, which can be interpreted from
experimental data and evaluated with statistical
tools.
Data Mining – Approach 2
Task type    | Specific task      | Example methods
-------------|--------------------|---------------------------------
Descriptive  | Association        | Association rules, decision
             |                    | trees, data visualization
             | Segmentation       | Clustering, decision trees
Predictive   | Outlier detection  | Clustering, data visualization
             | Classification     | Discriminant analysis, logistic
             |                    | regression, naive Bayes
             |                    | classifier
             | Regression         | Multiple linear regression
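As an illustration of the classification task listed in the table, here is a minimal sketch of a naive Bayes classifier on categorical data; the toy purchase records and feature names are invented for illustration, not taken from the publication:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(feature value | class) from categorical data."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    cond = defaultdict(int)  # cond[(feature_index, value, class)] = count
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, v, c)] += 1

    def classify(row):
        best, best_p = None, -1.0
        for c in class_counts:
            p = priors[c]
            for i, v in enumerate(row):
                # Laplace smoothing avoids zero probabilities for unseen values
                p *= (cond[(i, v, c)] + 1) / (class_counts[c] + 2)
            if p > best_p:
                best, best_p = c, p
        return best

    return classify

# toy purchase data: (age_group, owns_card) -> buys?
rows = [("young", "yes"), ("young", "no"), ("old", "yes"), ("old", "yes")]
labels = ["buys", "no", "buys", "buys"]
classify = train_naive_bayes(rows, labels)
print(classify(("old", "yes")))  # -> buys
```

The "+2" smoothing denominator assumes two possible values per feature; a production classifier would count the actual number of distinct values.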
Data Mining – Results
The output from knowledge discovery in databases is
the newly generated knowledge, usually described in
terms of rules, patterns, classification models,
associations, trends, statistical analyses, etc. [3]
But what is the actual purpose of data mining? It is
the process of making decisions. Decisions in
organizations should be based on extensive data
mining and analytics to model what-if scenarios,
forecast the future and minimize risks. In the book
Information Revolution, the authors classify an
organization that uses data mining as being at
level 5 (the highest level possible).
Data Mining – Applications
There are currently several different areas of
application of data mining; some examples are:
prediction of energy consumption, prediction of
exchange rates for markets, scoring of clients of
banks or insurance companies, analysis and
comparison of service providers, reliability
analysis of machines or their parts, analysis of
hospital patients, market basket analysis and
similar tasks connected with large data banks or
series of records. Examples can also be found at
the macroeconomic level.
Data Mining – Philosophy
To formalize the knowledge discovery process
within a common framework, several authors
introduced the process model concept, or
further the standardized process model. In
general, several standard methodologies for
data mining are currently in use. There are
many nuances to this process, but roughly the
steps are to preprocess the raw data, mine the
data and interpret the results.
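The three rough steps just named (preprocess, mine, interpret) can be sketched as a tiny pipeline; the functions and toy data below are illustrative stand-ins, not part of any named methodology:

```python
from collections import Counter

def preprocess(raw):
    """Preprocess raw data: drop empty records and normalize case/whitespace."""
    return [r.strip().lower() for r in raw if r and r.strip()]

def mine(records):
    """Mine the data: here, simple frequency counting stands in for a miner."""
    return Counter(records)

def interpret(patterns):
    """Interpret the results: report the most frequent pattern."""
    item, count = patterns.most_common(1)[0]
    return f"most frequent: {item} ({count}x)"

raw = ["Bread", "milk", " bread ", "", "MILK", "bread"]
print(interpret(mine(preprocess(raw))))  # -> most frequent: bread (3x)
```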
Data Mining – Methodology
There are currently several widespread methods
in use for discovering knowledge in databases.
The dominant methods are:
the SEMMA methodology (Sample, Explore,
Modify, Model and Assess),
the 5A methodology (Assess, Access, Analyze, Act
and Automate) and, not least important,
the CRISP-DM methodology (the CRoss Industry
Standard Process for Data Mining).
CRISP-DM Methodology
CRISP-DM is an iterative and adaptive process of six
basic steps that can be used in differing order
when analyzing a data mining problem:
1. the research understanding or business
understanding phase (with several sub-steps:
determination of business objectives, assessment
of the situation, determination of data mining
goals and generation of a project plan),
2. the data understanding phase (with several
sub-steps: collection of initial data, description
of data, exploration of data and verification of
data quality),
CRISP-DM Methodology 2
3. the data preparation phase (with several sub-steps:
selection of data, cleansing of data, construction of data,
integration of data and formatting of data subsets),
4. the modeling phase (with several sub-steps: selection of
modeling techniques, generation of the test design, creation
of models and assessment of the generated models),
5. the evaluation phase (with several sub-steps: evaluation of
the results, process review and determination of the next
step) and
6. the deployment phase (with several sub-steps: deployment
planning, monitoring and maintenance planning, generation of
the final report and review of the process sub-steps).
CRISP-DM Methodology 3
This methodology is very often presented as a circle
of the data mining project with six phases. The
order in which the steps are implemented is not
fixed; the outputs of one step merely influence the
selection of approaches within the subsequent step.
Sometimes it is necessary to return to and
re-evaluate several earlier steps. A circle is a
good symbol for the cycles of the CRISP-DM
methodology's steps being re-evaluated repeatedly.
This methodology is supported by the Clementine®
data mining software suite from SPSS.
5A Methodology
The 5A methodology uses the following steps: Assess,
Access, Analyze, Act and Automate. The first step
covers the definition of aims, the declaration of
the strategy and the processes. The second step is
creating the database. In the third step the data
mining algorithms are employed. In the subsequent
step, interpretations of the results and
recommendations are formulated. The last step
(Automate) stands for the implementation of the
recommendations in practical applications and the
improvement of the business process.
SEMMA Methodology
The software of the SAS Company has its own
methodology for data mining, SEMMA. The
model's abbreviation stands for Sample (identify
input datasets), Explore (explore datasets
statistically and graphically), Modify (prepare
data for analysis), Model (fit a predictive
model) and Assess (compare predictive
models). A specialized licensed module of the
SAS Company package dedicated to data
mining is SAS Enterprise Miner®.
New Publication
The book should have 240 pages and will be
divided into 10 chapters. The first four chapters
are dedicated to the philosophy of data mining.
It describes models and techniques for seeking
hidden nuggets of information, gives insight into
how data mining algorithms operate, and so on.
Each method is presented in detail, and the
statistical or mathematical principles are
explained as well, in the form of a "white-box"
approach.
Chapter Overview
Chapter 1 is entitled Data Mining with Datasets.
Chapter 2 details the steps for the classification of
data, approaches to missing data, outlier detection
and reduction into feature vectors or signatures
characterizing observations, customers, records, etc.
Chapter 3 explains the learning process.
Chapter 4 refers to the interpretation of results and
the comparison of models.
Chapter 5 is on decision trees and the creation of
decision rules, e.g. pruning.
Chapter Overview 2
Chapter 6 is about association rule mining, also
called market basket analysis.
Chapter 7 is about neural networks: the model of an
artificial neuron, the architecture of artificial
neural networks, etc.
Chapter 8 is about regression and correlation analysis
(the logit regression function).
Chapter 9 covers the simple Bayesian classifier,
discriminant analysis, clustering and Kohonen's maps.
Chapter 10 refers to genetic algorithms.
Association Rule Mining
Here the reader finds a thorough definition of the
association rule "IF antecedent THEN consequent".
A purchase pattern can be characterized by several
statistics using probability evaluation. The most
commonly used are the descriptive support, the
descriptive confidence and the lift ratio. [2] Some
less well-known ones are the benchmark confidence
value, the coverage, the quality, the causal
support, the causal confidence, the descriptive
confirmation, the causal confirmation, the
conviction, the interestingness and the dependency.
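The three most common rule statistics named above can be computed directly from transaction data; a small sketch with an invented toy basket set:

```python
def rule_stats(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule IF antecedent THEN consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n                                   # P(A and C)
    confidence = both / ante if ante else 0.0            # P(C | A)
    lift = confidence / (cons / n) if cons else 0.0      # P(C | A) / P(C)
    return support, confidence, lift

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s, c, l = rule_stats(baskets, {"bread"}, {"milk"})
print(round(s, 2), round(c, 2), round(l, 2))  # -> 0.5 0.67 0.89
```

A lift below 1, as here, suggests the antecedent and consequent appear together less often than independence would predict.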
Association Rule Mining 2
[Figure: associations diagram linking chemists' purchases
such as tanning cream, bag, powder, nail polish, eye
shades, brush and hand cream]
Figure 1. Associations diagram for chemists' purchases, example
Clustering
The majority of this chapter is dedicated to
cluster analysis (clustering). There is a wide
spectrum of different clustering principles:
agglomerative, divisive or hierarchical (Table 2).
Cluster analysis can utilize principles such as
probability distributions, fuzzy logic,
mathematical models, grids, etc. The distinguishing
element is the distance measure used for measuring
the distance between objects. Several simple
measures (Euclidean distance or Hamming / city
block distance) or more sophisticated ones
(Minkowski distance and Mahalanobis distance) can
be used, but there are many other similarity and
distance measures as well.
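A sketch of the simpler distance measures mentioned (Mahalanobis distance is omitted here, since it additionally requires the inverse covariance matrix of the data):

```python
def minkowski(x, y, p):
    """Minkowski distance; p=1 gives city-block, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)

def city_block(x, y):
    return minkowski(x, y, 1)

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))   # -> 5.0
print(city_block(x, y))  # -> 7.0
```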
Clustering Methods

Type of method | Advantages                | Disadvantages
---------------|---------------------------|---------------------------
Hierarchical   | Shows hierarchical        | Slow to compute, not
               | relationships, does not   | suitable for large
               | require predefining the   | datasets
               | number of clusters        |
Partitioned    | Fast to compute, works    | Number of clusters must be
               | with large datasets       | specified upfront; does
               |                           | not show relationships
Fuzzy          | Fast to compute, works    | Number of clusters must be
               | with large datasets       | specified upfront
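As a sketch of the partitioned approach from the table, a minimal k-means implementation; the naive deterministic initialization and the toy two-cluster dataset are choices made for illustration:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means: alternate assignment and center update steps."""
    centers = list(points[:k])  # naive deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # move each center to the mean of its cluster
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

This illustrates the table's trade-off: the code is fast and simple, but k must be chosen upfront and no hierarchy of clusters is produced.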
Genetic Algorithms
Evolutionary algorithms (of which genetic
algorithms are one kind) have several qualities,
such as the ability to search the experimental
space globally, good handling of interactions
between variables, the ability to find patterns
even with an open ending, and the ability to find
knowledge at the highest level of generalization.
Nowadays genetic algorithms are considered to be
among the best approaches to machine learning. The
most notable negative feature of genetic
algorithms is their long computation time.
Genetic Algorithms 2
When applying genetic algorithms, the Fundamental
Theorem of Genetic Algorithms (the schema theorem)
is important: it says that short, low-order
schemas with a high fitness value will exhibit an
exponentially growing presence in the next
generation. Its author, J. Holland, also
exploited other features of schemas. The chapter
concludes with a simple example of the
implementation of a genetic algorithm (the authors
use examples such as the travelling salesman
problem, input-to-output mapping, or finding the
optimum of a function on a given interval of
definition), followed by a graphical presentation
of genetic programming.
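A toy version of the "optimum of a function on a given interval" example mentioned above; this sketch is not the book's implementation, and the encoding, operators and parameter values are all assumptions:

```python
import random

def genetic_maximize(f, lo, hi, pop_size=30, generations=60, seed=1):
    """Toy genetic algorithm maximizing f on [lo, hi].

    Individuals are real numbers; selection is 3-way tournament,
    crossover is averaging, mutation adds small Gaussian noise.
    """
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            a = max(rng.sample(pop, 3), key=f)       # tournament selection
            b = max(rng.sample(pop, 3), key=f)
            child = (a + b) / 2                      # crossover: averaging
            child += rng.gauss(0, (hi - lo) * 0.02)  # mutation
            nxt.append(min(hi, max(lo, child)))      # clamp to the interval
        pop = nxt
    return max(pop, key=f)

# maximize f(x) = -(x - 2)^2 + 5 on [0, 10]; the true optimum is at x = 2
best = genetic_maximize(lambda x: -(x - 2) ** 2 + 5, 0.0, 10.0)
print(best)
```

Real-valued encoding is used here for brevity; Holland's original formulation, to which the schema theorem applies, uses binary strings.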
Genetic Algorithms 3
The data mining implementations of genetic
algorithms vary, e.g. numerical optimization of
algorithms, machine learning, and model finding in
the economic, social and population fields. Their
optimization qualities can be applied in industry,
including complex planning problems, resource
optimization in large factories, classification of
large complex databases of data (e.g. e-mail
messages) or predictive modelling. Examples of
implementing genetic algorithms in conjunction
with other data mining methodologies are the
optimization of weights (assigned through the
genetic algorithm) while using a neural network,
or the use of a genetic algorithm to find the best
topology of an artificial neural network.
Conclusions
Early methods of identifying patterns in data include
Bayes' theorem and regression analysis. As more
data are gathered, with the amount of data doubling
every three years, data mining is becoming an
increasingly important tool for transforming these
data into information.
Discoveries in computer science enabled new
methods such as neural networks, clustering, genetic
algorithms, decision trees and support vector
machines. Today data mining is commonly used in a
wide range of profiling practices, such as marketing,
surveillance, fraud detection and scientific discovery.
Conclusions 2
The evolution of knowledge discovery in databases
(data mining) has already undergone three distinct
phases:
- first-generation systems, which provided only one
data mining technique with very weak support for
the overall process framework,
- second-generation systems (suites), which provided
multiple types of integrated data analysis methods,
- third-generation systems, which introduced a
vertical approach enabling them to address specific
business problems.