STUDY MATERIAL
Subject: Data Mining
Staff: Usha.P
Class: III B.Sc (CS-B)
Semester: VI

UNIT-I
Introduction: Nature of Data Sets – Models and Patterns – Data Mining Tasks – Components of Data Mining Algorithms – The Interacting Roles of Statistics and Data Mining – Dredging, Snooping and Fishing.
Measurement of Data: Types of Measurement – Distance Measures – Transforming Data – The Form of Data – Data Quality for Individual Measurements – Data Quality for Collections of Data.

Introduction

Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining is also known as knowledge discovery in databases, or KDD. This term originated in the artificial intelligence (AI) research field. The KDD process involves several stages: selecting the target data, pre-processing the data, transforming them if necessary, performing data mining to extract patterns and relationships, and then interpreting and assessing the discovered structures.

The process of seeking relationships within a data set, that is, of seeking accurate, convenient, and useful summary representations of some aspect of the data, involves a number of steps:
Determining the nature and structure of the representation to be used;
Deciding how to quantify and compare how well different representations fit the data;
Choosing an algorithmic process to optimize the score function; and
Deciding what principles of data management are required to implement the algorithms efficiently.

Example
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display, and they could make sure beer and diapers were sold at full price on Thursdays.

Data, Information, and Knowledge

Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
operational or transactional data, such as sales, cost, inventory, payroll, and accounting;
nonoperational data, such as industry sales, forecast data, and macroeconomic data;
meta data, that is, data about the data itself, such as logical database design or data dictionary definitions.

Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval.

What can data mining do?
Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the guard position, John Williams attempted four jump shots and made each one. Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game. By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.

How does data mining work?
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data with application software.
Present the data in a useful format, such as a graph or table.

2. Nature of Data Sets
A data set is a set of measurements taken from some environment or process. In the simplest case, we have a collection of objects, and for each object we have the same set of p measurements. In this case, we can think of the collection of measurements on n objects as an n × p data matrix. The n rows represent the n objects on which measurements were taken (for example, medical patients, credit card customers, or individual objects observed in the night sky, such as stars and galaxies). The n rows may be referred to as individuals, entities, cases, objects, or records, depending on the context. The p columns are called variables, features, attributes, or fields.

Measurements come in many forms:
1. Quantitative: measured on a numerical scale, e.g. age and income columns.
2. Categorical: taking one of a set of text values, e.g. sex and marital status columns.

The data set is also referred to by other names:
1. Training data
2. Sample data
3. Database

Types of Structure
Structures can be classified into two types:
1. Models, also referred to as global models. A model structure, as defined here, is a global summary of a data set; it makes statements about any point in the full measurement space. A simple model might take the form Y = aX + c, where Y and X are variables and a and c are parameters of the model.
2. Patterns, also referred to as local patterns. Pattern structures make statements only about restricted regions of the space spanned by the variables; they are described in terms of constraints on those restricted regions.
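To make the n × p data matrix described above concrete, the following short Python sketch (not part of the original text) builds a small data matrix containing both quantitative and categorical variables. It assumes the pandas library is available; the column names and values are invented for illustration.

# A minimal sketch of an n x p data matrix: n = 4 objects (rows),
# p = 3 variables (columns). "age" and "income" are quantitative;
# "marital_status" is categorical. Names and values are illustrative only.
import pandas as pd

data_matrix = pd.DataFrame({
    "age":            [23, 41, 35, 58],                             # quantitative
    "income":         [28000, 52000, 47000, 61000],                 # quantitative
    "marital_status": ["single", "married", "married", "widowed"],  # categorical
})

print(data_matrix.shape)   # (4, 3): n rows (objects) by p columns (variables)
print(data_matrix.dtypes)  # numeric columns versus text (categorical) columns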
4. Data Mining Tasks

1. Exploratory Data Analysis (EDA): The goal here is simply to explore the data without any clear ideas of what we are looking for. EDA techniques are interactive and visual, and there are many effective graphical display methods for relatively small, low-dimensional data sets. As the dimensionality (number of variables, p) increases, it becomes much more difficult to visualize the cloud of points in p-space. Large numbers of cases can also be difficult to visualize effectively, and notions of scale and detail come into play: "lower resolution" samples of the data can be displayed or summarized, at the cost of possibly missing important details.

2. Descriptive Modelling: The goal of a descriptive model is to describe all of the data (or the process generating the data). Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the p-dimensional space into groups (cluster analysis and segmentation), and models describing the relationship between variables (dependency modelling). In segmentation analysis, for example, the aim is to group together similar records, as in market segmentation of commercial databases. Here the goal is to split the records into homogeneous groups so that similar people (if the records refer to people) are put into the same group. This enables advertisers and marketers to efficiently direct their promotions to those most likely to respond. The number of groups here is chosen by the researcher; there is no "right" number. This contrasts with cluster analysis, in which the aim is to discover "natural" groups in data, for example in scientific databases.

3. Predictive Modelling (Classification and Regression): The aim here is to build a model that will permit the value of one variable to be predicted from the known values of other variables. In classification, the variable being predicted is categorical, while in regression the variable is quantitative (a short sketch contrasting the two appears after this list). The term "prediction" is used here in a general sense, and no notion of a time continuum is implied. So, for example, while we might want to predict the value of the stock market at some future date, or which horse will win a race, we might also want to determine the diagnosis of a patient, or the degree of brittleness of a weld. A large number of methods have been developed in statistics and machine learning to tackle predictive modelling problems, and work in this area has led to significant theoretical advances and improved understanding of deep issues of inference. The key distinction between prediction and description is that prediction has as its objective a unique variable (the market's value, the disease class, the brittleness, etc.), while in descriptive problems no single variable is central to the model.

4. Discovering Patterns and Rules: The three types of tasks listed above are concerned with model building. Other data mining applications are concerned with pattern detection. One example is spotting fraudulent behavior by detecting regions of the space defining the different types of transactions where the data points are significantly different from the rest. Another use is in astronomy, where detection of unusual stars or galaxies may lead to the discovery of previously unknown phenomena. Yet another is the task of finding combinations of items that occur frequently in transaction databases (e.g., grocery products that are often purchased together). This problem has been the focus of much attention in data mining and has been addressed using algorithmic techniques based on association rules.

5. Retrieval by Content: Here the user has a pattern of interest and wishes to find similar patterns in the data set. This task is most commonly used for text and image data sets. For text, the pattern may be a set of keywords, and the user may wish to find relevant documents within a large set of possibly relevant documents (e.g., Web pages). For images, the user may have a sample image, a sketch of an image, or a description of an image, and wish to find similar images from a large set of images. In both cases the definition of similarity is critical, but so are the details of the search strategy.
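The following sketch illustrates the distinction between classification (categorical target) and regression (quantitative target) referred to in the predictive modelling task above. It is only an illustrative example, assuming scikit-learn and numpy are installed; the data and variable names are made up.

# Sketch contrasting the two predictive-modelling tasks: classification
# (categorical target variable) and regression (quantitative target variable).
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two input variables for 100 objects

# Classification: the predicted variable is categorical (two classes here).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print("predicted class:", clf.predict([[0.5, -0.2]]))

# Regression: the predicted variable is quantitative.
y_quant = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_quant)
print("predicted value:", reg.predict([[0.5, -0.2]]))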
5. Components of Data Mining Algorithms
The data mining algorithms that address these tasks have four basic components:
1. Model or Pattern Structure: determining the underlying structure or functional forms that we seek from the data.
2. Score Function: judging the quality of a fitted model.
3. Optimization and Search Method: optimizing the score function and searching over different model and pattern structures.
4. Data Management Strategy: handling data access efficiently during the search/optimization.

We have already discussed the distinction between model and pattern structures. In the remainder of this section we briefly discuss the other three components of a data mining algorithm.

Score Functions
The score function is a measure of how well aspects of the data match proposed models or patterns. Usually, these models or patterns are described in terms of a structure, sometimes with unknown parameter values. Score functions quantify how well a model or parameter structure fits a given data set. Simple, "generic" score functions, such as least squares and classification accuracy, are commonly used. Widely used score functions include the sum of squared errors and the misclassification rate (the latter is used in supervised classification problems). The well-known squared error score function is defined as

S = Σ (y(i) - ŷ(i))^2, summed over i = 1, . . . , n,

where we are predicting n "target" values y(i), 1 ≤ i ≤ n, and our prediction for each is denoted ŷ(i) (typically this is a function of some other "input" variable values and of the parameters of the model).

Optimization and Search Methods
The goal of optimization and search is to determine the structure and the parameter values that achieve a minimum (or maximum, depending on the context) value of the score function. The task of finding the "best" values of parameters in models is typically cast as an optimization (or estimation) problem. The task of finding interesting patterns (such as rules) from a large family of potential patterns is typically cast as a combinatorial search problem, and is often accomplished using heuristic search techniques. In linear regression, a prediction rule is usually found by minimizing a least squares score function (the sum of squared errors between the predictions from a model and the observed values of the predicted variable). Such a score function is amenable to mathematical manipulation, and the model that minimizes it can be found algebraically. In contrast, a score function such as misclassification rate in supervised classification is difficult to minimize analytically; for example, since it is intrinsically discontinuous, the powerful tools of differential calculus cannot be brought to bear.

Of course, while we can construct score functions that measure how closely a model or pattern matches the data, in many cases a close match is not really the objective. As noted above, we are often aiming to generalize to new data which might arise (new customers, new chemicals, etc.), and having too close a match to the data in the database may prevent us from predicting new cases accurately. We discuss this point later in the chapter.
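As an illustration of the ideas above, the sketch below computes the squared error score for the simple model Y = aX + c and finds the parameter values that minimize it algebraically (the least squares solution). This is a minimal example assuming numpy; the data values are invented.

# Sketch of the squared-error score function and its algebraic minimization
# for the simple model y = a*x + c discussed earlier.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # roughly y = 2x, with noise

def squared_error_score(params, x, y):
    """Sum of squared errors between observed y(i) and predictions yhat(i)."""
    a, c = params
    y_hat = a * x + c
    return np.sum((y - y_hat) ** 2)

# Least squares is amenable to algebraic solution: np.polyfit minimizes the
# squared-error score in closed form, with no iterative search required.
a_hat, c_hat = np.polyfit(x, y, deg=1)
print("fitted parameters a, c:", a_hat, c_hat)
print("score at the minimum:", squared_error_score((a_hat, c_hat), x, y))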
Data Management Strategies
The final component in any data mining algorithm is the data management strategy: the ways in which the data are stored, indexed, and accessed. Most well-known data analysis algorithms in statistics and machine learning have been developed under the assumption that all individual data points can be accessed quickly and efficiently in random-access memory (RAM). While main memory technology has improved rapidly, there have been equally rapid improvements in secondary (disk) and tertiary (tape) storage technologies, to the extent that many massive data sets still reside largely on disk or tape and will not fit in available RAM. Thus, there will probably be a price to pay for accessing massive data sets, since not all data points can be simultaneously close to the main processor.

Many data analysis algorithms have been developed without any explicit specification of a data management strategy. While this has worked in the past on relatively small data sets, many algorithms (such as classification and regression tree algorithms) scale very poorly when the "traditional version" is applied directly to data that reside mainly in secondary storage.

The field of databases is concerned with the development of indexing methods, data structures, and query algorithms for efficient and reliable data retrieval. Many of these techniques have been developed to support relatively simple counting (aggregating) operations on large data sets for reporting purposes. In recent years, however, development has begun on techniques that support the "primitive" data access operations necessary to implement efficient versions of data mining algorithms (for example, tree-structured indexing systems used to retrieve the neighbours of a point in multiple dimensions).
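As a rough illustration of such a data access primitive, the sketch below uses a k-d tree (a tree-structured index) to retrieve the nearest neighbours of a query point without scanning every record. It assumes scipy and numpy are available; the data points are randomly generated for illustration only.

# Sketch of a tree-structured index used as a data-access primitive: a k-d tree
# retrieves the nearest neighbours of a query point without a full scan.
import numpy as np
from scipy.spatial import cKDTree

points = np.random.default_rng(1).uniform(size=(10_000, 3))  # 10,000 objects, 3 variables
tree = cKDTree(points)                                        # build the index once

query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=5)   # the 5 nearest neighbours of the query
print(indices)      # row numbers of the neighbouring objects
print(distances)    # their Euclidean distances from the query point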
6. The Interacting Roles of Statistics and Data Mining
Statistical techniques alone may not be sufficient to address some of the more challenging issues in data mining, especially those arising from massive data sets. Nevertheless, statistics plays a very important role in data mining: it is a necessary component in any data mining enterprise.

With large data sets (and particularly with very large data sets) we may simply not know even straightforward facts about the data. Simple eye-balling of the data is not an option. This means that sophisticated search and examination methods may be required to illuminate features which would be readily apparent in small data sets.

The database provides the set of objects which will be used to construct the model or search for a pattern, but the ultimate objective will not generally be to describe those data. In most cases the objective is to describe the general process by which the data arose, and other data sets which could have arisen by the same process. All of this means that it is necessary to avoid models or patterns which match the available database too closely. Statistical ideas and methods are thus fundamental to data mining; indeed, data mining has been described as merely exploratory statistics, albeit for potentially huge data sets.

The most fundamental difference between classical statistical applications and data mining is the size of the data set. To a conventional statistician, a "large" data set may contain a few hundred or a thousand data points. To someone concerned with data mining, however, many millions or even billions of data points is not unexpected; gigabyte and even terabyte databases are by no means uncommon. Such large databases occur in all walks of life.

Difficulties also arise when there are many variables. One that is important in some contexts is the curse of dimensionality. Various problems arise from the difficulties of accessing very large data sets. The statistician's conventional viewpoint of a "flat" data file, in which rows represent objects and columns represent variables, may bear no resemblance to the way the data are stored. Just as the size of a data set may lead to difficulties, so also may other properties not often found in standard statistical applications. We have already remarked that data mining is typically a secondary process of data analysis; that is, the data were originally collected for some other purpose. In contrast, much statistical work is concerned with primary analysis: the data are collected with particular questions in mind, and then are analyzed to answer those questions. Beyond problems arising from the way the data have been collected, we expect other distortions to occur in large data sets, including missing values, contamination, and corrupted data points.

7. Dredging, Snooping, and Fishing
1. Terms such as "data mining," "dredging," "snooping," and "fishing" were historically used pejoratively. In the 1960s, computers were increasingly applied to data analysis problems.
2. It was noted that if you searched long enough, you could always find some model to fit a data set arbitrarily well. Two factors contribute to this situation: the complexity of the model and the size of the set of possible models.

Definition
Data dredging, sometimes referred to as "data fishing," is a data mining practice in which large volumes of data are analyzed seeking any possible relationships between data. The traditional scientific method, in contrast, begins with a hypothesis and follows with an examination of the data. Sometimes conducted for unethical purposes, data dredging often circumvents traditional data mining techniques and may lead to premature conclusions. Data dredging is sometimes described as "seeking more information from a data set than it actually contains."

Data dredging sometimes results in relationships between variables being announced as significant when, in fact, the data require more study before such an association can legitimately be determined. Many variables may be related through chance alone; others may be related through some unknown factor. To make a valid assessment of the relationship between any two variables, further study is required in which isolated variables are contrasted with a control group. Data dredging is sometimes used to present an unexamined concurrence of variables as if it led to a valid conclusion, prior to any such study. Although data dredging is often used improperly, it can be a useful means of finding surprising relationships that might not otherwise have been discovered. However, because the concurrence of variables does not constitute information about their relationship (which could, after all, be merely coincidental), further analysis is required to yield any useful conclusions.

8. Measurement and Data (Chapter 2)
Data
Data are collected by mapping entities in the domain of interest to a symbolic representation by means of some measurement procedure, which associates the value of a variable with a given property of an entity. The relationships between objects are represented by numerical relationships between variables. These numerical representations, the data items, are stored in the data set. The schema of the data is the a priori structure imposed on the data.
No data set is perfect, and this is particularly true of large data sets. Measurement error, missing data, sampling distortion, human mistakes, and a host of other factors corrupt the data. Since data mining is concerned with detecting unsuspected patterns in data, it is very important to be aware of these imperfections.

Types of Measurement
Measurements may be categorized in many ways:
1. Some of the distinctions arise from the nature of the properties the measurements represent.
2. Others arise from the use to which the measurements are put.

We will begin by considering how we might measure the property WEIGHT. Let us imagine we have a collection of rocks. The first thing we observe is that we can rank the rocks according to the WEIGHT property. We could do this, for example, by placing a rock on each pan of a weighing scale and seeing which way the scale tipped. On this basis, we could assign a number to each rock so that larger numbers corresponded to heavier rocks. Note that here only the ordinal properties of these numbers are relevant. The fact that one rock was assigned the number 4 and another was assigned the number 2 would not imply that the first was in any sense twice as heavy as the second.

We can also assign numbers to the rocks in such a way that not only does the order of the numbers correspond to the order observed from the weighing scales, but the sum of the numbers assigned to two smaller rocks equals the number assigned to a larger rock that balances them. That is, the total weight of the two smaller rocks equals the weight of the larger rock. Note that even now the assignment of numbers is not unique. Suppose we had assigned the numbers 2 and 3 to the smaller rocks, and the number 5 to the larger rock. This assignment satisfies the ordinal and additive property requirements, but so too would the assignment of 4, 6, and 10 respectively. There is still some freedom in how we define the variable weight corresponding to the WEIGHT property.

The point of this example is that our numerical representation reflects the empirical properties of the system we are studying. Relationships between rocks in terms of their WEIGHT property correspond to relationships between values of the measured variable weight. This representation is useful because it allows us to make inferences about the physical system by studying the numerical system. Without juggling sacks of rocks, we can see which sack contains the largest rock, which sack has the heaviest rocks on average, and so on.

The rocks example involves two empirical relationships: the order of the rocks, in terms of how they tip the scales, and their concatenation property, that is, the way two rocks together balance a third. Other empirical systems might involve fewer or more than two relationships. The order relationship is very common; typically, if an empirical system has only one relationship, it is an order relationship. Examples of the order relationship are provided by the SEVERITY property in medicine and the PREFERENCE property in psychology. Of course, not even an order relationship holds for some properties; for example, the properties HAIR COLOR, RELIGION, and RESIDENCE OF PROGRAMMER do not have a natural order. Numbers can still be used to represent "values" of these properties (blond = 1, black = 2, brown = 3, and so on), but the only empirical relationship being represented is that the colours are different (and so are represented by different numbers).
It is perhaps even more obvious here that the particular set of numbers assigned is not unique. Any set in which different numbers correspond to different values of the property will do. Given that the assignment of numbers is not unique, we must find some way to restrict this freedom, or else problems might arise if different researchers use different assignments. The solution is to adopt some convention. For the rocks example, we would adopt a basic "value" of the property WEIGHT, corresponding to a basic value of the variable weight, and define measured values in terms of how many copies of the basic value are required to balance them. Examples of such basic values for the WEIGHT/weight system are the gram and the pound.

Types of measurement may be categorized in terms of the empirical relationships they seek to preserve. Measurement is also described in terms of the numerical form of its scale values:
Nominal scale
Ordinal scale
Interval scale
Ratio scale

The four scale types, the statistics they permit, the admissible scale transformations, and the associated mathematical structures are summarized below.

Scale type: nominal (also denoted as categorical)
  Permissible statistics: mode, chi square
  Admissible scale transformation: one to one (equality (=))
  Mathematical structure: standard set structure (unordered)

Scale type: ordinal
  Permissible statistics: median, percentile
  Admissible scale transformation: monotonic increasing (order (<))
  Mathematical structure: totally ordered set

Scale type: interval
  Permissible statistics: mean, standard deviation, correlation, regression, analysis of variance
  Admissible scale transformation: positive linear (affine)
  Mathematical structure: affine line

Scale type: ratio
  Permissible statistics: all statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation, logarithms
  Admissible scale transformation: positive similarities (multiplication)
  Mathematical structure: field

Nominal scale
For a nominal category, one uses labels; for example, rocks can be generally categorized as igneous, sedimentary, and metamorphic. For this scale, some valid operations are equivalence and set membership. Nominal measures offer names or labels for certain characteristics. Variables assessed on a nominal scale are called categorical variables.

Ordinal scale
Rank-ordering data simply puts the data on an ordinal scale. Ordinal measurements describe order, but not relative size or degree of difference between the items measured. In this scale type, the numbers assigned to objects or events represent the rank order (1st, 2nd, 3rd, etc.) of the entities assessed. A scale may also use names with an order, such as "bad", "medium", and "good", or "very satisfied", "satisfied", "neutral", "unsatisfied", "very unsatisfied". An example of an ordinal scale is the result of a horse race, which says only which horses arrived first, second, or third, but includes no information about race times. Another is the Mohs scale of mineral hardness, which characterizes the hardness of various minerals through the ability of a harder material to scratch a softer one, saying nothing about the actual hardness of any of them. When using an ordinal scale, the central tendency of a group of items can be described by the group's mode (the most common item) or its median (the middle-ranked item), but the mean (or average) cannot be meaningfully defined.

Interval scale
Quantitative attributes are all measurable on interval scales, as any difference between the levels of an attribute can be multiplied by any real number to exceed or equal another difference. A highly familiar example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure.
The "zero point" on an interval scale is arbitrary, and negative values can be used. The formal mathematical term is an affine space (in this case an affine line). Variables measured at the interval level are called "interval variables" or sometimes "scaled variables", as they have units of measurement. Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly. But ratios of differences can be expressed; for example, one difference can be twice another.

Ratio measurement
Most measurement in the physical sciences and engineering is done on ratio scales. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. The scale type takes its name from the fact that measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind (Michell, 1997, 1999). Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value.

9. Distance Measures
Many data mining techniques (for example, nearest neighbour classification methods, cluster analysis, and multidimensional scaling methods) are based on similarity measures between objects. Similarity between objects can be handled in two ways:
1. Similarity measures can be obtained directly from the objects.
2. Otherwise it is necessary to define precisely what we mean by "similar," so that we can calculate formal similarity measures.
3. Instead of talking about how similar two objects are, we can talk about how dissimilar they are.
4. s(i, j) denotes the similarity between objects i and j.
5. d(i, j) denotes the dissimilarity between objects i and j.
6. The term proximity is often used as a general term to denote either a measure of similarity or dissimilarity.
7. Possible transformations from similarity to dissimilarity include d(i, j) = 1 - s(i, j).

Two additional terms, distance and metric, are often used in this context. The term distance is often used informally to refer to a dissimilarity measure derived from the characteristics describing the objects. A metric, on the other hand, is a dissimilarity measure that satisfies three conditions:
1. d(i, j) ≥ 0 for all i and j, and d(i, j) = 0 if and only if i = j;
2. d(i, j) = d(j, i) for all i and j; and
3. d(i, j) ≤ d(i, k) + d(k, j) for all i, j, and k.
The third condition is called the triangle inequality.

Suppose we have n data objects with p real-valued measurements on each object. We denote the vector of observations for the ith object by x(i) = (x1(i), x2(i), . . . , xp(i)), 1 ≤ i ≤ n, where the value of the kth variable for the ith object is xk(i). The Euclidean distance between the ith and jth objects is defined as

dE(i, j) = sqrt( (x1(i) - x1(j))^2 + (x2(i) - x2(j))^2 + . . . + (xp(i) - xp(j))^2 ).

This measure assumes some degree of commensurability between the different variables. Thus, it would be effective if each variable was a measure of length (with the number p of dimensions being 2 or 3, it would yield our standard physical measure of distance) or a measure of weight, with each variable measured using the same units. It makes less sense if the variables are not commensurate. For example, if one variable were length and another were weight, there would be no obvious choice of units; by altering the choice of units we would change which variables were most important as far as the distance was concerned. One way to remove this dependence on the choice of units is to standardize each variable by its spread. The standard deviation for the kth variable Xk can be estimated as

σk = sqrt( (1/n) Σ (xk(i) - x̄k)^2 ), with the sum over i = 1, . . . , n,

where x̄k is the sample mean of the kth variable; dividing each variable by its estimated standard deviation before computing distances puts the variables on a comparable scale.
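The sketch below, a minimal example assuming numpy, computes the Euclidean distance between two objects and the standardized version in which each variable is first divided by its estimated standard deviation, so that non-commensurate variables contribute on a comparable scale. The data matrix and units are invented for illustration.

# Sketch of the Euclidean distance between two objects, and of a standardized
# version in which each variable is divided by its estimated standard deviation.
import numpy as np

X = np.array([[1.70, 65.0],     # object 1: height (m), weight (kg)
              [1.80, 80.0],     # object 2
              [1.60, 58.0]])    # object 3

def euclidean(x_i, x_j):
    return np.sqrt(np.sum((x_i - x_j) ** 2))

def standardized_euclidean(x_i, x_j, sigma):
    return np.sqrt(np.sum(((x_i - x_j) / sigma) ** 2))

sigma_k = X.std(axis=0)                              # estimated std of each variable (1/n divisor)
print(euclidean(X[0], X[1]))                         # dominated by the weight variable
print(standardized_euclidean(X[0], X[1], sigma_k))   # variables on a comparable scale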
10. Transforming Data
Sometimes raw data are not in the most convenient form, and it can be advantageous to modify them prior to analysis. Note that there is a duality between the form of the model and the nature of the data. For example, if we speculate that a variable Y is a function of the square of a variable X, then we could either try to find a suitable function of X^2 directly, or first square X, setting U = X^2, and then fit a function to U. The equivalence of the two approaches is obvious in this simple example, but sometimes one or the other can be much more straightforward.

Common data transformations include taking square roots, reciprocals, logarithms, and raising variables to positive integral powers. For data expressed as proportions, the logit transformation, f(p) = log(p / (1 - p)), is often used.

Example 2.3
Variable V1 in figure 2.2 is clearly nonlinearly related to variable V2. However, if we work with the reciprocal of V2, that is, V3 = 1/V2, we obtain the linear relationship shown in figure 2.3.

Figure 2.2: A Simple Nonlinear Relationship between Variables V1 and V2. (In These and Subsequent Figures V1 and V2 are on the X and Y Axes Respectively.)
Figure 2.3: The Data of Figure 2.2 after the Simple Transformation of V2 to 1/V2.
Figure 2.4: Another Simple Nonlinear Relationship. Here the Variance of V2 Increases as V1 Increases.
Figure 2.5: The Data of Figure 2.4 after a Simple Square Root Transformation of V2. Now the Variance of V2 is Relatively Constant as V1 Increases.
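The following minimal sketch, assuming numpy, applies the transformations discussed above (reciprocal, square root, and logit) to invented data values.

# Sketch of the common data transformations mentioned above.
import numpy as np

v2 = np.array([0.5, 1.0, 2.0, 4.0])
v3 = 1.0 / v2                        # reciprocal, as in V3 = 1/V2 (example 2.3)
v4 = np.sqrt(v2)                     # square root, used to stabilize an increasing variance

p = np.array([0.1, 0.25, 0.5, 0.9])  # data expressed as proportions
logit = np.log(p / (1.0 - p))        # logit transformation f(p) = log(p / (1 - p))
print(v3, v4, logit)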
11. The Form of Data
Data sets come in different forms; these forms are known as schemas.
The simplest form of data is a set of vector measurements on objects o(1), . . . , o(n).
For each object we have measurements of p variables X1, . . . , Xp.
The data can be viewed as a matrix with n rows and p columns.
We refer to this standard form of data as a data matrix, or simply standard data. We can also refer to the data set as a table.

Often there are several types of objects we wish to analyze. For example, in a payroll database, we might have data both about employees, with variables name, department-name, age, and salary, and about departments, with variables department-name, budget, and manager. These data matrices are connected to each other by the occurrence of the same (categorical) values in the department-name fields and in the fields name and manager. Data sets consisting of several such matrices or tables are called multirelational data. In many cases multirelational data can be mapped to a single data matrix or table. For example, we could join the two data tables using the values of the variable department-name. More importantly, from the point of view of efficiency in storage and data access, "flattening" multirelational data to form a single large table may involve the needless replication of numerous values.

There are different forms of data schemas:
1. A string is a sequence of symbols from some finite alphabet. A sequence of values from a categorical variable is a string, and so is standard English text, in which the values are alphanumeric characters, spaces, and punctuation marks. Protein and DNA/RNA sequences are other examples. A string is a data type that is ordered and for which the standard matrix form is not necessarily suitable.
2. A related ordered data type is the event-sequence. Given a finite alphabet of categorical event types, an event-sequence is a sequence of pairs of the form {event, occurrence time}. This is quite similar to a string, but here each item in the sequence is tagged with an occurrence time.
3. Ordered data are spread along a unidimensional continuum (per individual variable), but other data often lie in higher dimensions. Spatial, geographic, or image data are located in two- and three-dimensional spaces.
4. A hierarchical structure is a more complex data schema. For example, a data set of children might be grouped into classes, which are grouped into years, which are grouped into schools, which are grouped into counties, and so on.

12. Data Quality for Individual Measurements
The effectiveness of data mining depends critically on the quality of the data: GIGO, Garbage In, Garbage Out. Because data mining is usually a secondary analysis of large data sets, the dangers are multiplied. It is quite possible that the most interesting patterns we discover during a data mining exercise arise not from reality but from flaws in the data, such as:
1. measurement inaccuracies,
2. distorted samples, or
3. some other unsuspected difference between the reality of the data and our perception of it.

It is convenient to characterize data quality in two ways:
1. the quality of the individual records and fields, and
2. the overall quality of the collection of data.

Any measurement procedure carries a risk of error. The sources of error are infinite, ranging from human carelessness and instrumentation failure to inadequate definition of what it is that we are measuring. Measuring instruments can lead to errors in two ways: they can be inaccurate, or they can be imprecise. A precise measurement procedure is one that has small variability (often measured by its variance); using a precise process, repeated measurements on the same object under the same conditions will yield very similar values. An accurate measurement procedure, in contrast, not only possesses small variability, but also yields results close to what we think of as the true value. A measurement procedure may therefore yield precise but inaccurate measurements. The reliability of a measurement procedure is the same as its precision. One component of measurement quality is this basic precision, the extent to which repeated measurements of the same object give similar results; the other is the extent to which the distribution of measurements is centered on the true value. While precision corresponds to reliability, this other component corresponds to validity.

13. Data Quality for Collections of Data
In addition to the quality of individual observations, we need to consider the quality of collections of observations. Much of statistics and data mining is concerned with inference from a sample to a population, that is, how, on the basis of examining just a fraction of the objects in a collection, one can infer things about the entire population. Statisticians use the term parameter to refer to descriptive summaries of populations or distributions of objects (more generally, of course, a parameter is a value that indexes a family of mathematical functions). Values computed from a sample of objects are called statistics, and appropriately chosen statistics can be used as estimates of parameters. Thus, for example, we can use the average of a sample as an estimate of the mean (a parameter) of an entire population or distribution.

Inaccuracies can occur in two ways:
1. Estimates from different samples might vary greatly, so that they are unreliable: using a different sample might have led to a very different estimate.
2. Estimates might be systematically distorted, so that, whichever sample is drawn, they tend to differ from the true population value (paralleling the accuracy component for individual measurements).
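As a small illustration of inference from a sample to a population, the sketch below (assuming numpy, with a simulated population) uses the sample mean, a statistic, as an estimate of the population mean, a parameter, and shows how estimates from small samples can vary.

# Sketch of inference from a sample to a population: the sample mean estimates
# the population mean, and different samples give different estimates.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)  # "true" mean is about 50

for n in (10, 100, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    print(f"sample of {n:6d} objects: estimated mean = {sample.mean():.2f}")
# Larger samples typically give estimates closer to the parameter, illustrating
# why estimates based on small samples can be unreliable.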