STUDY MATERIAL
Subject  : Data Mining
Staff    : Usha.P
Class    : III B.Sc (CS-B)
Semester : VI
UNIT-I:
Introduction: Nature of Data Sets – Models and Patterns – Data Mining Tasks – Components of Data Mining Algorithms – The Interacting Roles of Statistics and Data Mining – Dredging, Snooping and Fishing. Measurement of Data: Types of Measurement – Distance Measures – Transforming Data – The Form of Data – Data Quality for Individual Measurements – Data Quality for Collections of Data
Introduction
Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process
of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is
one of a number of analytical tools for analyzing data. It allows users to analyze data from
many different dimensions or angles, categorize it, and summarize the relationships
identified. Technically, data mining is the process of finding correlations or patterns among
dozens of fields in large relational databases.
Data mining is also known as knowledge discovery in databases, or KDD. This term
originated in the artificial intelligence (AI) research field. The KDD process involves several
stages:
• selecting the target data,
• pre-processing the data,
• transforming them if necessary,
• performing data mining to extract patterns and relationships,
• and then interpreting and assessing the discovered structures.
The process of seeking relationships within a data set— of seeking accurate, convenient, and
useful summary representations of some aspect of the data—involves a number of steps:
• determining the nature and structure of the representation to be used;
• deciding how to quantify and compare how well different representations fit the data;
• choosing an algorithmic process to optimize the score function;
• deciding what principles of data management are required to implement the algorithms efficiently.
Example
For example, one Midwest grocery chain used the data mining capacity of Oracle
software to analyze local buying patterns. They discovered that when men bought diapers on
Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these
shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however,
they only bought a few items. The retailer concluded that they purchased the beer to have it
available for the upcoming weekend. The grocery chain could use this newly discovered
information in various ways to increase revenue. For example, they could move the beer
display closer to the diaper display. And, they could make sure beer and diapers were sold at
full price on Thursdays.
Data, Information, and Knowledge
Data
Data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different formats and
different databases. This includes:
• operational or transactional data, such as sales, cost, inventory, payroll, and accounting
• nonoperational data, such as industry sales, forecast data, and macroeconomic data
• meta data, i.e. data about the data itself, such as logical database design or data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield
information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future
trends. For example, summary information on retail supermarket sales can be analyzed in
light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a
manufacturer or retailer could determine which items are most susceptible to promotional
efforts.
Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and storage
capabilities are enabling organizations to integrate their various databases into data
warehouses. Data warehousing is defined as a process of centralized data management and
retrieval.
What can data mining do?
Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to
determine relationships among "internal" factors such as price, product positioning, or staff
skills, and "external" factors such as economic indicators, competition, and customer
demographics. And, it enables them to determine the impact on sales, customer satisfaction,
and corporate profits. Finally, it enables them to "drill down" into summary information to
view detail transactional data.
With data mining, a retailer could use point-of-sale records of customer purchases to
send targeted promotions based on an individual's purchase history. By mining demographic
data from comment or warranty cards, the retailer could develop products and promotions to
appeal to specific customer segments.
For example, Blockbuster Entertainment mines its video rental history database to
recommend rentals to individual customers. American Express can suggest products to its
cardholders based on analysis of their monthly expenditures.
WalMart is pioneering massive data mining to transform its supplier relationships.
WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and
continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart
allows more than 3,500 suppliers to access data on their products and perform data analyses.
These suppliers use this data to identify customer buying patterns at the store display level.
They use this information to manage local store inventory and identify new merchandising
opportunities. In 1995, WalMart computers processed over 1 million complex data queries.
The National Basketball Association (NBA) is exploring a data mining application
that can be used in conjunction with image recordings of basketball games. The Advanced
Scout software analyzes the movements of players to help coaches orchestrate plays and
strategies. For example, an analysis of the play-by-play sheet of the game played between the
New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark
Price played the Guard position, John Williams attempted four jump shots and made each
one! Advanced Scout not only finds this pattern, but explains that it is interesting because it
differs considerably from the average shooting percentage of 49.30% for the Cavaliers during
that game.
By using the NBA universal clock, a coach can automatically bring up the video clips
showing each of the jump shots attempted by Williams with Price on the floor, without
needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open
jump shot.
How does data mining work?
While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two. Data mining software
analyzes relationships and patterns in stored transaction data based on open-ended user
queries. Several types of analytical software are available: statistical, machine learning, and
neural networks. Generally, any of four types of relationships are sought:
• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
• Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
2. Nature of Data Sets
• A data set is a set of measurements taken from some environment or process. In the simplest case, we have a collection of objects, and for each object we have the same set of p measurements.
• In this case, we can think of the collection of measurements on n objects as a form of n × p data matrix.
• The n rows represent the n objects on which measurements were taken (for example, medical patients, credit card customers, or individual objects observed in the night sky, such as stars and galaxies).
• The n rows may be referred to as individuals, entities, cases, objects, or records, depending on the context.
• The p columns are called variables, features, attributes, or fields.
Measurements come in many forms:
1. Quantitative: numerical values, e.g. the age and income columns.
2. Categorical: non-numerical (text) values, e.g. the sex and marital status columns.
The data set is also referred to by other names:
1. Training data
2. Sample data
3. Database
A small data matrix sketch follows.
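As a minimal illustrative sketch (not taken from the study material; the column names and values below are invented), an n × p data matrix mixing quantitative and categorical variables can be represented in plain Python as follows:

    # A toy n x p data matrix: n = 4 objects (rows), p = 4 variables (columns).
    # age and income are quantitative; sex and marital_status are categorical.
    columns = ["age", "income", "sex", "marital_status"]
    data_matrix = [
        [34, 52000, "F", "married"],   # object 1
        [27, 31000, "M", "single"],    # object 2
        [45, 78000, "F", "divorced"],  # object 3
        [51, 66000, "M", "married"],   # object 4
    ]

    n = len(data_matrix)       # number of rows (objects / cases / records)
    p = len(data_matrix[0])    # number of columns (variables / features / attributes)
    print(f"{n} x {p} data matrix")

    # Quantitative variables support arithmetic, e.g. the mean age:
    mean_age = sum(row[0] for row in data_matrix) / n
    print("mean age:", mean_age)

    # Categorical variables support only counting / equality tests:
    n_married = sum(1 for row in data_matrix if row[3] == "married")
    print("married:", n_married)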
Types of Structure
Structures can be classified into two types:
1. Models (global structure): a model structure, as defined here, is a global summary of a data set; it makes statements about any point in the full measurement space. A simple model might take the form Y = aX + c, where Y and X are variables and a and c are parameters of the model. A sketch of fitting such a model is given below.
2. Patterns (local structure): pattern structures make statements only about restricted regions of the space spanned by the variables; such a structure consists of constraints that hold only in those restricted regions.
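The simple model Y = aX + c can be fitted to data by choosing the parameters a and c that minimize a score function such as the sum of squared errors. A minimal sketch, with invented data values:

    # Fit the global model Y = a*X + c by ordinary least squares.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.1, 9.8]          # roughly Y = 2X, plus noise

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Closed-form least-squares estimates of the slope a and intercept c.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - a * mean_x

    print(f"fitted model: Y = {a:.2f} * X + {c:.2f}")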
4. Data Mining Tasks
1. Exploratory Data Analysis (EDA):
• The goal here is simply to explore the data without any clear ideas of what we are looking for.
• EDA techniques are interactive and visual.
• There are many effective graphical display methods for relatively small, low-dimensional data sets.
• As the dimensionality (number of variables, p) increases, it becomes much more difficult to visualize the cloud of points in p-space.
• Large numbers of cases can also be difficult to visualize effectively, and notions of scale and detail come into play: "lower resolution" data samples can be displayed or summarized at the cost of possibly missing important details.
2. Descriptive Modelling:
• The goal of a descriptive model is to describe all of the data (or the process generating the data).
• Examples of such descriptions include models for the overall probability distribution of the data (density estimation),
• partitioning of the p-dimensional space into groups (cluster analysis and segmentation), and
• models describing the relationship between variables (dependency modelling).
In segmentation analysis, for example, the aim is to group together similar records, as
in market segmentation of commercial databases. Here the goal is to split the records into
homogeneous groups so that similar people (if the records refer to people) are put into the
same group. This enables advertisers and marketers to efficiently direct their promotions to those most likely to respond. The number of groups here is chosen by the researcher; there is no "right" number. This contrasts with cluster analysis, in which the aim is to discover "natural" groups in data, as in scientific databases, for example. A minimal clustering sketch is given below.
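As an illustration of descriptive modelling, the sketch below groups a handful of invented customer records into k segments with a simple k-means clustering; scikit-learn is assumed to be available, and the feature values are hypothetical.

    # Segmenting customers by age and annual spend with k-means
    # (the number of clusters k is chosen by the analyst).
    from sklearn.cluster import KMeans

    customers = [
        [22, 300], [25, 350], [27, 400],      # younger, low spend
        [45, 2200], [48, 2500], [52, 2400],   # older, high spend
    ]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(customers)

    for row, label in zip(customers, labels):
        print(f"customer {row} -> segment {label}")
    print("segment centres:", kmeans.cluster_centers_)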
3. Predictive Modelling: Classification and Regression:
• The aim here is to build a model that will permit the value of one variable to be predicted from the known values of other variables.
• In classification, the variable being predicted is categorical, while in regression the variable is quantitative. The term "prediction" is used here in a general sense, and no notion of a time continuum is implied.
So, for example, while we might want to predict the value of the stock market at some
future date, or which horse will win a race, we might also want to determine the diagnosis of
a patient, or the degree of brittleness of a weld. A large number of methods have been
developed in statistics and machine learning to tackle predictive modeling problems, and
work in this area has led to significant theoretical advances and improved understanding of
deep issues of inference. The key distinction between prediction and description is that
prediction has as its objective a unique variable (the market's value, the disease class, the
brittleness, etc.), while in descriptive problems no single variable is central to the model.
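A minimal sketch contrasting the two kinds of predictive model: a classifier for a categorical target and a regressor for a quantitative one. scikit-learn is assumed to be available, and the tiny data sets are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LinearRegression

    # Classification: predict a categorical diagnosis (0 = healthy, 1 = ill).
    X_clf = [[36.6, 60], [38.9, 95], [37.0, 70], [39.5, 110]]  # [temperature, pulse]
    y_clf = [0, 1, 0, 1]
    clf = DecisionTreeClassifier().fit(X_clf, y_clf)
    print("predicted class:", clf.predict([[38.5, 100]]))      # categorical output

    # Regression: predict a quantitative value (e.g. brittleness of a weld).
    X_reg = [[100], [150], [200], [250]]                        # welding temperature
    y_reg = [4.1, 3.2, 2.5, 1.9]                                # measured brittleness
    reg = LinearRegression().fit(X_reg, y_reg)
    print("predicted brittleness:", reg.predict([[175]]))       # quantitative output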
4. Discovering Patterns and Rules:
The three types of tasks listed above are concerned with model building. Other data mining applications are concerned with pattern detection. One example is spotting fraudulent behavior by detecting regions of the space defining the different types of transactions where the data points are significantly different from the rest. Another use is in astronomy, where
detection of unusual stars or galaxies may lead to the discovery of previously unknown
phenomena. Yet another is the task of finding combinations of items that occur frequently in
transaction databases (e.g., grocery products that are often purchased together). This problem
has been the focus of much attention in data mining and has been addressed using algorithmic
techniques based on association rules.
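The sketch below illustrates the frequent-itemset idea behind association rules on a few invented shopping baskets: pairs of items bought together at least as often as a chosen support threshold are reported.

    # Count how often pairs of items occur together across transactions.
    from itertools import combinations
    from collections import Counter

    baskets = [
        {"diapers", "beer", "milk"},
        {"diapers", "beer"},
        {"bread", "milk"},
        {"diapers", "beer", "bread"},
    ]

    min_support = 2  # minimum number of baskets a pair must appear in
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    for pair, count in pair_counts.items():
        if count >= min_support:
            print(f"{pair} appears together in {count} baskets")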
5. Retrieval by Content:
• The user has a pattern of interest and wishes to find similar patterns in the data set.
• This task is most commonly used for text and image data sets. For text, the pattern may be a set of keywords, and the user may wish to find relevant documents within a large set of possibly relevant documents (e.g., Web pages). A keyword-matching sketch is given below.
• For images, the user may have a sample image, a sketch of an image, or a description of an image, and wish to find similar images from a large set of images. In both cases the definition of similarity is critical, but so are the details of the search strategy.
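For text retrieval, a minimal keyword-matching sketch: documents are scored by how many of the query keywords they contain. The documents and query are made up; real systems use more refined similarity measures (for example TF-IDF weighting with cosine similarity).

    # Rank documents by overlap with a set of query keywords.
    documents = {
        "doc1": "data mining finds patterns in large databases",
        "doc2": "the basketball game was played on saturday",
        "doc3": "mining association rules from transaction databases",
    }
    query = {"mining", "databases", "patterns"}

    def score(text, keywords):
        words = set(text.lower().split())
        return len(words & keywords)          # number of matching keywords

    ranked = sorted(documents.items(), key=lambda kv: score(kv[1], query), reverse=True)
    for name, text in ranked:
        print(name, "score =", score(text, query))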
5. Components of Data Mining Algorithms
The data mining algorithms that address these tasks have four basic components:
1. Model or Pattern Structure: determining the underlying structure or functional
forms that we seek from the data.
2. Score Function: judging the quality of a fitted model.
3. Optimization and Search Method: optimizing the score function and searching over
different model and pattern structures.
4. Data Management Strategy: handling data access efficiently during the
search/optimization.
We have already discussed the distinction between model and pattern structures. In the remainder of this section we briefly discuss the other three components of a data mining algorithm.
Score Functions
The score function is a measure of how well aspects of the data match proposed models or patterns. Usually, these models or patterns are described in terms of a structure, sometimes with unknown parameter values.
Score functions quantify how well a model or parameter structure fits a given data set. Simple, "generic" score functions, such as least squares and classification accuracy, are commonly used.
Several score functions are widely used for this purpose, including the sum of squared errors and the misclassification rate (the latter is used in supervised classification problems). The well-known squared error score function is defined as
SSE = Σ (y(i) - ŷ(i))², summed over i = 1, . . . , n,
where we are predicting n "target" values y(i), 1 ≤ i ≤ n, and our prediction for each is denoted ŷ(i) (typically this is a function of some other "input" variable values used for prediction and of the parameters of the model).
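A minimal sketch of the two generic score functions mentioned above, computed on small invented data sets: the sum of squared errors for a quantitative target and the misclassification rate for a categorical one.

    # Sum of squared errors: SSE = sum over i of (y(i) - yhat(i))^2
    y_true = [3.0, 5.0, 7.5, 9.0]
    y_pred = [2.8, 5.4, 7.0, 9.3]
    sse = sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))
    print("sum of squared errors:", sse)

    # Misclassification rate for a supervised classification problem.
    labels_true = ["spam", "ham", "spam", "ham", "spam"]
    labels_pred = ["spam", "spam", "spam", "ham", "ham"]
    errors = sum(1 for t, p in zip(labels_true, labels_pred) if t != p)
    print("misclassification rate:", errors / len(labels_true))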
Optimization and Search Methods
The goal of optimization and search is to determine the structure and the parameter
values that achieve a minimum (or maximum, depending on the context) value of the score
function.
The task of finding the "best" values of parameters in models is typically cast as an
optimization (or estimation) problem. The task of finding interesting patterns (such as rules)
from a large family of potential patterns is typically cast as a combinatorial search problem,
and is often accomplished using heuristic search techniques.
In linear regression, a prediction rule is usually found by minimizing a least squares
score function (the sum of squared errors between the prediction from a model and the
observed values of the predicted variable). Such a score function is amenable to mathematical
manipulation, and the model that minimizes it can be found algebraically. In contrast, a score
function such as misclassification rate in supervised classification is difficult to minimize
analytically.
For example, since it is intrinsically discontinuous, the powerful tools of differential calculus cannot be brought to bear. Of course, while we can choose score functions to
produce a good match between a model or pattern and the data, in many cases this is not
really the objective. As noted above, we are often aiming to generalize to new data which
might arise (new customers, new chemicals, etc.) and having too close a match to the data in
the database may prevent one from predicting new cases accurately. We discuss this point
later in the chapter.
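The point that a least squares score can be minimized algebraically, while other scores must be searched, can be illustrated with a small sketch: the same simple linear model (with invented data) is fitted first by the closed-form least-squares solution and then, for contrast, by a crude grid search over the parameter.

    # Minimizing the least squares score for Y = a*X (no intercept, for brevity).
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.2, 3.9, 6.1, 8.0]

    # 1) Algebraic solution: setting the derivative of SSE to zero gives
    #    a = sum(x*y) / sum(x*x).
    a_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

    # 2) Heuristic search: evaluate the score on a grid of candidate values of a.
    def sse(a):
        return sum((y - a * x) ** 2 for x, y in zip(xs, ys))

    candidates = [i / 100 for i in range(0, 400)]   # a in [0.00, 3.99]
    a_search = min(candidates, key=sse)

    print("closed-form a:", round(a_closed, 3), " grid-search a:", round(a_search, 3))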
Data Management Strategies
The final component in any data mining algorithm is the data management strategy: the ways in which the data are stored, indexed, and accessed.
Most well-known data analysis algorithms in statistics and machine learning have been developed under the assumption that all individual data points can be accessed quickly and efficiently in random-access memory (RAM). While main memory technology has improved rapidly, there have been equally rapid improvements in secondary (disk) and tertiary (tape) storage technologies, to the extent that many massive data sets still reside largely on disk or tape and will not fit in available RAM. Thus, there will probably be a price to pay for accessing massive data sets, since not all data points can be simultaneously close to the main processor.
Many data analysis algorithms have been developed without including any explicit specification of a data management strategy. While this has worked in the past on relatively small data sets, many algorithms (such as classification and regression tree algorithms) scale very poorly when the "traditional version" is applied directly to data that reside mainly in secondary storage.
The field of databases is concerned with the development of indexing methods, data structures, and query algorithms for efficient and reliable data retrieval. Many of these techniques have been developed to support relatively simple counting (aggregating) operations on large data sets for reporting purposes. However, in recent years, development has begun on techniques that support the "primitive" data access operations necessary to implement efficient versions of data mining algorithms (for example, tree-structured indexing systems used to retrieve the neighbours of a point in multiple dimensions).
6. The Interacting Roles of Statistics and Data Mining
• Statistical techniques alone may not be sufficient to address some of the more challenging issues in data mining, especially those arising from massive data sets.
• Nevertheless, statistics plays a very important role in data mining: it is a necessary component in any data mining enterprise.
• With large data sets (and particularly with very large data sets) we may simply not know even straightforward facts about the data. Simple eye-balling of the data is not an option. This means that sophisticated search and examination methods may be required to illuminate features which would be readily apparent in small data sets.
• The database provides the set of objects which will be used to construct the model or search for a pattern, but the ultimate objective will not generally be to describe those data.
• In most cases the objective is to describe the general process by which the data arose, and other data sets which could have arisen by the same process. All of this means that it is necessary to avoid models or patterns which match the available database too closely.
• Statistical ideas and methods are thus fundamental to data mining; in one view, data mining is merely exploratory statistics, albeit for potentially huge data sets.
• The most fundamental difference between classical statistical applications and data mining is the size of the data set. To a conventional statistician, a "large" data set may contain a few hundred or a thousand data points. To someone concerned with data mining, however, many millions or even billions of data points is not unexpected; gigabyte and even terabyte databases are by no means uncommon. Such large databases occur in all walks of life.
• Further difficulties arise when there are many variables. One problem that is important in some contexts is the curse of dimensionality.
• Various problems also arise from the difficulties of accessing very large data sets. The statistician's conventional viewpoint of a "flat" data file, in which rows represent objects and columns represent variables, may bear no resemblance to the way the data are stored.
• Just as the size of a data set may lead to difficulties, so also may other properties not often found in standard statistical applications. We have already remarked that data mining is typically a secondary process of data analysis; that is, the data were originally collected for some other purpose. In contrast, much statistical work is concerned with primary analysis: the data are collected with particular questions in mind, and then are analyzed to answer those questions.
• Beyond problems arising from the way the data have been collected, we expect other distortions to occur in large data sets, including missing values, contamination, and corrupted data points.
7. Dredging, Snooping, and Fishing
1. The historical use of terms such as "data mining," "dredging," "snooping," and "fishing" dates to the 1960s, when computers were increasingly applied to data analysis problems.
2. It was noted that if you searched long enough, you could always find some model to fit a data set arbitrarily well.
There are two factors contributing to this situation:
• the complexity of the model, and
• the size of the set of possible models.
Definition: Data dredging, sometimes referred to as "data fishing," is a data mining practice in which
large volumes of data are analyzed seeking any possible relationships between data. The
traditional scientific method, in contrast, begins with a hypothesis and follows with an
examination of the data. Sometimes conducted for unethical purposes, data dredging often
circumvents traditional data mining techniques and may lead to premature conclusions.
Data dredging is sometimes described as "seeking more information from a data set than it
actually contains."
Data dredging sometimes results in relationships between variables announced as significant
when, in fact, the data require more study before such an association can legitimately be
determined. Many variables may be related through chance alone; others may be related
through some unknown factor. To make a valid assessment of the relationship between any
two variables, further study is required in which isolated variables are contrasted with a
control group. Data dredging is sometimes used to present an unexamined concurrence of
variables as if they led to a valid conclusion, prior to any such study.
Although data dredging is often used improperly, it can be a useful means of finding
surprising relationships that might not otherwise have been discovered. However, because the
concurrence of variables does not constitute information about their relationship (which
could, after all, be merely coincidental), further analysis is required to yield any useful
conclusions.
8. Measurement and Data (CHAPTER 2)
Data:
Data are collected by mapping entities in the domain of interest to symbolic
representation by means of some measurement procedure, which associates the value of a
variable with a given property of an entity. The relationships between objects are represented
by numerical relationships between variables. These numerical representations, the data
items, are stored in the data set.
• Schema of data: the a priori structure imposed on the data.
• No data set is perfect, and this is particularly true of large data sets. Measurement error, missing data, sampling distortion, human mistakes, and a host of other factors corrupt the data. Since data mining is concerned with detecting unsuspected patterns in data, it is very important to be aware of these imperfections.
Types of Measurement
Measurements may be categorized in many ways.
1. Some of the distinctions arise from the nature of the properties the measurements
represent
2. Others arise from the use to which the measurements are put.
We will begin by considering how we might measure the property WEIGHT.
Let us imagine we have a collection of rocks. The first thing we observe is that we can
rank the rocks according to the WEIGHT property. We could do this, for example, by placing
a rock on each pan of a weighing scale and seeing which way the scale tipped. On this basis,
we could assign a number to each rock so that larger numbers corresponded to heavier rocks.
Note that here only the ordinal properties of these numbers are relevant. The fact that one
rock was assigned the number 4 and another was assigned the number 2 would not imply that
the first was in any sense twice as heavy as the second.
We can assign numbers to the rocks in such a way that not only does the order of the
numbers correspond to the order observed from the weighing scales, but the sum of the
numbers assigned to the two smaller rocks equals the number assigned to the larger rock.
That is, the total weight of the two smaller rocks equals the weight of the larger rock. Note
that even now the assignment of numbers is not unique.
Suppose we had assigned the numbers 2 and 3 to the smaller rocks, and the number 5
to the larger rock. This assignment satisfies the ordinal and additive property requirements,
but so too would the assignment of 4, 6, and 10 respectively. There is still some freedom in
how we define the variable weight corresponding to the WEIGHT property. The point of this
example is that our numerical representation reflects the empirical properties of the system
we are studying.
Relationships between rocks in terms of their WEIGHT property correspond to
relationships between values of the measured variable weight. This representation is useful
because it allows us to make inferences about the physical system by studying the numerical
system. Without juggling sacks of rocks, we can see which sack contains the largest rock,
which sack has the heaviest rocks on average, and so on. The rocks example involves two
empirical relationships: the order of the rocks, in terms of how they tip the scales, and their
concatenation property—the way two rocks together balance a third.
Other empirical systems might involve less than or more than two relationships. The
order relationship is very common; typically, if an empirical system has only one
relationship, it is an order relationship. Examples of the order relationship are provided by the
SEVERITY property in medicine and the PREFERENCE property in psychology.
Of course, not even an order relationship holds with some properties, for example, the
properties HAIR COLOR, RELIGION, and RESIDENCE OF PROGRAMMER, do not
have a natural order. Numbers can still be used to represent "values" of the properties,
(blond = 1, black = 2, brown = 3, and so on), but the only empirical relationship being
represented is that the colours are different (and so are represented by different numbers). It is
perhaps even more obvious here that the particular set of numbers assigned is not unique.
Any set in which different numbers correspond to different values of the property will do.
Given that the assignment of numbers is not unique, we must find some way to restrict
this freedom—or else problems might arise if different researchers use different assignments.
The solution is to adopt some convention. For the rocks example, we would adopt a basic
"value" of the property WEIGHT, corresponding to a basic value of the variable weight, and
define measured values in terms of how many copies of the basic value are required to
balance them. Examples of such basic values for the WEIGHT/weight system are the gram
and pound.
Types of measurement may be categorized in terms of the empirical relationships
they seek to preserve.
Measurements are also categorized by the type of numerical scale on which their values lie:
• Nominal scale
• Ordinal scale
• Interval scale
• Ratio scale
Scale Type | Permissible Statistics | Admissible Scale Transformation | Mathematical Structure
nominal (also denoted as categorical) | mode, chi-square | one-to-one (equality (=)) | standard set structure (unordered)
ordinal | median, percentile | monotonic increasing (order (<)) | totally ordered set
interval | mean, standard deviation, correlation, regression, analysis of variance | positive linear (affine) | affine line
ratio | all statistics permitted for interval scales plus: geometric mean, harmonic mean, coefficient of variation, logarithms | positive similarities (multiplication) | field
Nominal scale
• For a nominal category, one uses labels; for example, rocks can be generally categorized as igneous, sedimentary, and metamorphic. For this scale, some valid operations are equivalence and set membership. Nominal measures offer names or labels for certain characteristics.
• Variables assessed on a nominal scale are called categorical variables.
Ordinal scale
• Rank-ordering data simply puts the data on an ordinal scale. Ordinal measurements describe order, but not relative size or degree of difference between the items measured. In this scale type, the numbers assigned to objects or events represent the rank order (1st, 2nd, 3rd, etc.) of the entities assessed.
• A scale may also use names with an order, such as "bad", "medium", and "good", or "very satisfied", "satisfied", "neutral", "unsatisfied", "very unsatisfied".
• An example of an ordinal scale is the result of a horse race, which says only which horses arrived first, second, or third but includes no information about race times. Another is the Mohs scale of mineral hardness, which characterizes the hardness of various minerals through the ability of a harder material to scratch a softer one, saying nothing about the actual hardness of any of them.
• When using an ordinal scale, the central tendency of a group of items can be described by using the group's mode (the most common item) or its median (the middle-ranked item), but the mean (or average) cannot be meaningfully defined, as illustrated in the sketch below.
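A short sketch of which summaries are meaningful for ordinal data (the satisfaction ratings are invented): the mode and median are well defined, while an arithmetic mean of the rank codes would not be meaningful.

    # Ordinal responses coded 1 ("very unsatisfied") .. 5 ("very satisfied").
    from statistics import median, mode

    responses = [5, 4, 4, 3, 5, 2, 4]

    print("mode:", mode(responses))      # most common rating: meaningful
    print("median:", median(responses))  # middle-ranked rating: meaningful
    # An arithmetic mean such as sum(responses)/len(responses) would treat the
    # rank codes as if the gaps between categories were equal, which the
    # ordinal scale does not guarantee.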
Interval scale
• Quantitative attributes are all measurable on interval scales, as any difference between the levels of an attribute can be multiplied by any real number to exceed or equal another difference. A highly familiar example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure.
• The "zero point" on an interval scale is arbitrary, and negative values can be used. The formal mathematical term is an affine space (in this case an affine line). Variables measured at the interval level are called "interval variables" or sometimes "scaled variables", as they have units of measurement.
• Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly. But ratios of differences can be expressed; for example, one difference can be twice another.
Ratio measurement
• Most measurement in the physical sciences and engineering is done on ratio scales. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales.
• The scale type takes its name from the fact that measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind (Michell, 1997, 1999). Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value.
9. Distance Measures
Many data mining techniques (for example, nearest neighbour classification methods, cluster analysis, and multidimensional scaling methods) are based on similarity measures between objects.
Similarity between objects can be measured in two ways:
1. Similarity measures can be obtained directly from the objects.
2. Otherwise, it is necessary to define precisely what we mean by "similar," so that formal similarity measures can be calculated from the measurements on the objects.
Some additional terminology:
• Instead of talking about how similar two objects are, we can equivalently talk about how dissimilar they are.
• s(i, j) denotes the similarity between objects i and j, and d(i, j) denotes the dissimilarity between objects i and j.
• The term proximity is often used as a general term to denote either a measure of similarity or dissimilarity.
• Possible transformations from similarity to dissimilarity include d(i, j) = 1 - s(i, j), among others.
Two additional terms, distance and metric, are often used in this context. The term distance is often used informally to refer to a dissimilarity measure derived from the characteristics describing the objects.
A metric, on the other hand, is a dissimilarity measure that satisfies three conditions:
1. d(i, j) ≥ 0 for all i and j, and d(i, j) = 0 if and only if i = j;
2. d(i, j) = d(j, i) for all i and j; and
3. d(i, j) ≤ d(i, k) + d(k, j) for all i, j, and k.
The third condition is called the triangle inequality.
Suppose we have n data objects with p real-valued measurements on each object. We denote the vector of observations for the ith object by x(i) = (x1(i), x2(i), . . . , xp(i)), 1 ≤ i ≤ n, where the value of the kth variable for the ith object is xk(i). The Euclidean distance between the ith and jth objects is defined as
dE(i, j) = sqrt( (x1(i) - x1(j))² + . . . + (xp(i) - xp(j))² ),
that is, the square root of the sum of squared differences across the p variables.
This measure assumes some degree of commensurability between the different variables. Thus, it would be effective if each variable were a measure of length (with the number p of dimensions being 2 or 3, it would yield our standard physical measure of distance) or a measure of weight, with each variable measured using the same units. It makes less sense if the variables are not commensurate. For example, if one variable were length and another were weight, there would be no obvious choice of units; by altering the choice of units we would change which variables were most important as far as the distance was concerned.
One way to deal with this is to standardize the variables before computing distances. The standard deviation of the kth variable Xk can be estimated from the data as
σ̂k = sqrt( (1/n) Σ (xk(i) - x̄k)² ), summed over i = 1, . . . , n,
where x̄k is the sample mean of the kth variable; dividing each variable by its estimated standard deviation puts the variables on a comparable scale, as in the sketch below.
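A minimal sketch of the Euclidean distance and of a standardized version in which each variable is divided by its estimated standard deviation, so that non-commensurate variables contribute on a comparable footing. The measurement vectors are invented.

    import math

    # Two objects described by p = 3 variables (e.g. height in cm, weight in kg, age).
    x_i = [180.0, 75.0, 34.0]
    x_j = [165.0, 60.0, 29.0]

    def euclidean(a, b):
        return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

    print("raw Euclidean distance:", euclidean(x_i, x_j))

    # Standard deviation of each variable, estimated from a small sample of objects.
    sample = [
        [180.0, 75.0, 34.0],
        [165.0, 60.0, 29.0],
        [172.0, 68.0, 41.0],
        [158.0, 55.0, 25.0],
    ]
    n, p = len(sample), len(sample[0])
    means = [sum(row[k] for row in sample) / n for k in range(p)]
    stds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in sample) / n)
            for k in range(p)]

    # Divide each variable by its standard deviation before computing the distance.
    x_i_std = [v / s for v, s in zip(x_i, stds)]
    x_j_std = [v / s for v, s in zip(x_j, stds)]
    print("standardized Euclidean distance:", euclidean(x_i_std, x_j_std))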
10. Transforming Data
Sometimes raw data are not in the most convenient form and it can be advantageous
to modify them prior to analysis.
Note that there is a duality between the form of the model and the nature of the data. For example, if we speculate that a variable Y is a function of the square of a variable X, then we could either try to find a suitable function of X², or we could square X first, setting U = X², and fit a function to U. The equivalence of the two approaches is obvious in this simple example, but sometimes one or the other can be much more straightforward.
Common data transformations include taking square roots, reciprocals, logarithms, and raising variables to positive integral powers. For data expressed as proportions, the logit transformation, f(p) = log(p / (1 - p)), is often used.
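A small sketch applying the transformations just listed (square root, reciprocal, logarithm, logit) to made-up values; each transformation assumes values in the appropriate range (positive numbers, or proportions strictly between 0 and 1 for the logit).

    import math

    values = [0.5, 2.0, 10.0, 50.0]       # positive raw measurements
    proportions = [0.1, 0.5, 0.9]         # data expressed as proportions

    sqrt_values = [math.sqrt(v) for v in values]                    # square root
    recip_values = [1.0 / v for v in values]                        # reciprocal
    log_values = [math.log(v) for v in values]                      # natural logarithm
    logit_values = [math.log(p / (1.0 - p)) for p in proportions]   # logit

    print("sqrt:      ", sqrt_values)
    print("reciprocal:", recip_values)
    print("log:       ", log_values)
    print("logit:     ", logit_values)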
Example 2.3
Clearly variable V1 in figure 2.2 is nonlinearly related to variable V2. However, if we work with the reciprocal of V2, that is, V3 = 1/V2, we obtain the linear relationship shown in figure 2.3.
Figure 2.2: A simple nonlinear relationship between variables V1 and V2. (In this and subsequent figures, V1 and V2 are on the X and Y axes respectively.)
Figure 2.3: The data of figure 2.2 after the simple transformation of V2 to 1/V2.
Figure 2.4: Another simple nonlinear relationship. Here the variance of V2 increases as V1 increases.
Figure 2.5: The data of figure 2.4 after a simple square root transformation of V2. Now the variance of V2 is relatively constant as V1 increases.
11. The Form of Data
o The data sets come in different forms; these forms are known as schemas.
o The simplest form of data is a set of vector measurements on objects o(1), . . . ,
o(n)
o For each object we have measurements of p variables X1, . . . , Xp.
o The data can be viewed as a matrix with n rows and p columns.
o We refer to this standard form of data as a data matrix, or simply standard data. We can also refer to the data set as a table.
Often there are several types of objects we wish to analyze. For example, in a payroll database, we might have data both about employees, with variables name, department-name, age, and salary, and about departments, with variables department-name, budget, and manager.
These data matrices are connected to each other by the occurrence of the same (categorical) values in the department-name fields and in the fields name and manager. Data sets consisting of several such matrices or tables are called multirelational data.
In many cases multirelational data can be mapped to a single data matrix or table. For example, we could join the two data tables using the values of the variable department-name. More important, from the point of view of efficiency in storage and data access, "flattening" multirelational data to form a single large table may involve the needless replication of numerous values.
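A small sketch of flattening multirelational data: an employees table and a departments table joined on department-name. pandas is assumed to be available, and the records are invented.

    import pandas as pd

    employees = pd.DataFrame({
        "name": ["Asha", "Ben", "Carla"],
        "department_name": ["sales", "sales", "research"],
        "age": [29, 41, 35],
        "salary": [30000, 45000, 52000],
    })
    departments = pd.DataFrame({
        "department_name": ["sales", "research"],
        "budget": [200000, 500000],
        "manager": ["Deepa", "Evan"],
    })

    # "Flatten" the two tables into a single data matrix; note that the budget and
    # manager values for "sales" are replicated for every sales employee.
    flat = employees.merge(departments, on="department_name")
    print(flat)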
There are several different forms of data schemas:
1. A string is a sequence of symbols from some finite alphabet. A sequence of values from a categorical variable is a string, and so is standard English text, in which the values are alphanumeric characters, spaces, and punctuation marks. Protein and DNA/RNA sequences are other examples. A string is another data type that is ordered and for which the standard matrix form is not necessarily suitable.
2. A related ordered data type is the event sequence. Given a finite alphabet of categorical event types, an event sequence is a sequence of pairs of the form {event, occurrence time}. This is quite similar to a string, but here each item in the sequence is tagged with an occurrence time.
3. Ordered data are spread along a unidimensional continuum (per individual variable), but other data often lie in higher dimensions. Spatial, geographic, or image data are located in two- and three-dimensional spaces.
4. A hierarchical structure is a more complex data schema. For example, a data set
of children might be grouped into classes, which are grouped into years, which are
grouped into schools, which are grouped into counties, and so on.
12. Data Quality for Individual Measurements
The effectiveness of data mining depends critically on the quality of the data.
GIGO: Garbage In, Garbage Out. In a secondary analysis of large data sets, the dangers are multiplied, and it is quite possible that the most interesting patterns we discover during a data mining exercise are really artefacts of drawbacks in the data, such as:
1. measurement inaccuracies,
2. distorted samples, or
3. some other unsuspected difference between the reality of the data and our perception of it.
It is convenient to characterize data quality in two ways:
1. The quality of the individual records and fields
2. The overall quality of the collection of data.
The measurement procedure itself also carries a risk of error. The sources of error are countless, ranging from human carelessness and instrumentation failure to inadequate definition of what it is that we are measuring.
Measuring instruments can lead to errors in two ways:
• they can be inaccurate, or
• they can be imprecise.
A precise measurement procedure is one that has small variability (often measured by its variance). Using a precise process, repeated measurements on the same object under the same conditions will yield very similar values.
An accurate measurement procedure, in contrast, not only possesses small variability, but also yields results close to what we think of as the true value. A measurement procedure may therefore yield precise but inaccurate measurements, as the short simulation below illustrates.
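A short simulation sketch of the distinction: repeated measurements of a known true value from a precise-but-inaccurate (biased) instrument versus an accurate one. The bias and noise levels are invented for illustration.

    import random
    import statistics

    random.seed(0)
    true_value = 100.0

    # Precise but inaccurate: small variability, but systematically 5 units too high.
    biased = [true_value + 5.0 + random.gauss(0, 0.2) for _ in range(1000)]
    # Accurate: small variability and centred on the true value.
    accurate = [true_value + random.gauss(0, 0.2) for _ in range(1000)]

    for name, sample in [("precise but inaccurate", biased), ("accurate", accurate)]:
        print(name,
              "mean =", round(statistics.mean(sample), 2),
              "std dev =", round(statistics.stdev(sample), 3))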
The reliability of a measurement procedure is the same as its precision. Two aspects of measurement quality have been described here: one is basic precision, the extent to which repeated measurements of the same object give similar results; the other is the extent to which the distribution of measurements is centered on the true value. While the first, precision, corresponds to reliability, the other component corresponds to validity.
13. Data Quality for Collections of Data
Inaccuracies can occur in two ways:
1. Estimates from different samples might vary greatly, so that they are unreliable: using a different sample might have led to a very different estimate.
2. Estimates might be systematically distorted, so that they are not centred on the true value (paralleling the distinction between precision and accuracy above).
In addition to the quality of individual observations, we need to consider the quality of collections of observations. Much of statistics and data mining is concerned with inference from a sample to a population, that is, with how, on the basis of examining just a fraction of the objects in a collection, one can infer things about the entire population. Statisticians use the term parameter to refer to descriptive summaries of populations or distributions of objects (more generally, of course, a parameter is a value that indexes a family of mathematical functions). Values computed from a sample of objects are called statistics, and appropriately chosen statistics can be used as estimates of parameters. Thus, for example, we can use the average of a sample as an estimate of the mean (parameter) of an entire population or distribution.