Data Mining – Input: Concepts, Instances, Attributes
Chapter 2

Concept
• Thing to be learned
  – Ignore any philosophy about what a concept is
  – Need a description that is
    • Intelligible – can be understood, so its validity can be argued/discussed by humans
    • Operational – it can be applied to future examples
• How the concept is expressed is the “concept description”
• The concept may differ based on different styles of learning … classification, association, clustering, numeric prediction …
• The concept description may differ based on the learning scheme/algorithm used

Styles of Learning
• Classification – learn a way of “classifying” unseen examples – put them in the correct category
• Association – learn any associations between attributes
• Clustering – seek groups of examples that belong together, without pre-classification
• Numeric prediction – predict a numeric quantity instead of a category

Classification
• “Supervised” – the learning scheme is provided the correct classification/class/category for the “training” data
• Success is measured by trying out what is learned on independent/previously unseen “test” data (withholding the category/class until checking the program’s answer)

Supervision
• Classification and numeric prediction are “supervised”
• Association and clustering are “unsupervised”

Inputs – What’s in an Example?
• Input is a set of instances (records/examples)
• An instance has a set of values for pre-determined attributes (like a record in a DB)
• I.e. input is like a single DB table, or “flat file”
  – There may be things we’d like to learn that don’t fit into this simple structure – but current technology is largely only up to handling simple input
  – You may find it useful sometimes to “denormalize” a DB – do a JOIN of two or more tables to produce a flat file (just make sure you don’t simply re-learn the primary or foreign keys!)

Attributes
• Flat file format means that all examples are expected to have values for the same attributes
  – Some attributes may be irrelevant for some examples
  – Some attributes’ relevance may depend on the value of another attribute
  – Usual workaround – irrelevant attributes get a special “irrelevant” value

Kinds of Attributes
• Binary/boolean – two-valued; e.g. Resident Student?
• Nominal/categorical/enumerated/discrete – multiple-valued, unordered; e.g. Major
• Ordinal – ordered, but no sense of distance between values
  – e.g. Fr, So, Jr, Sr
  – e.g. Household Income: 1 – < 15K, 2 – 15–20K, 3 – 20–25K, 4 – 25–30K, 5 – 30–40K, 6 – 40–50K, 7 – > 50K
• Interval – ordered, distance is measurable; e.g. birth year
• Ratio – an actual measurement with a defined zero point, so that we can say one value is double, triple, or half another; e.g. GPA

Kinds of Attributes
• Many algorithms cannot handle all of those different types of attributes
• One approach –
  – treat binary and nominal as nominal
  – treat ordinal, interval, and ratio as “numeric”
• Requires coding ordinals such as Fr, So, etc. as numbers (see the sketch below)
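A minimal sketch of that ordinal coding, using plain Python dictionaries; the mapping tables and the helper name encode_ordinal are illustrative, not part of Weka or the textbook:

# Illustrative ordinal coding: map ordered labels to integers so that
# schemes expecting numeric attributes can at least compare them with < > =.
CLASS_YEAR = {"Fr": 1, "So": 2, "Jr": 3, "Sr": 4}
INCOME_BAND = {"<15K": 1, "15-20K": 2, "20-25K": 3, "25-30K": 4,
               "30-40K": 5, "40-50K": 6, ">50K": 7}

def encode_ordinal(label, mapping):
    """Return the integer code for an ordinal label, or None if missing/unknown."""
    return mapping.get(label)

print(encode_ordinal("Jr", CLASS_YEAR))       # 3
print(encode_ordinal("30-40K", INCOME_BAND))  # 5

Note that the codes preserve order only; the gaps between codes do not reflect real distances (the income bands above have unequal widths), which is exactly the distance-distortion issue raised under Data Preparation below.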
Preparing the Data
• Preparing the data “usually consumes the bulk of the effort invested in the entire data mining process”
• Real data is frequently low quality
• Data cleaning is frequently necessary and time consuming

Preparing the Data
• Integrating data from multiple sources
  – E.g. data from different departments – marketing, sales, billing, customer service
  – E.g. sometimes outside data is valuable – economic conditions, weather data
• Challenges – different coding conventions, different time periods, different aggregations, different keys, different kinds of errors
• Point of intersection with data warehousing – this work needs to be done for BOTH!
• May need to iterate to get it right

Preparing the Data
• Standard format – any tool needs data to be in some standard format
• The Weka tool requires data to be in ARFF format

ARFF Format
• Lines beginning with % are comments
• The file starts with the name of the relation
• Attributes are defined
  – Nominal attributes are followed by their set of values
  – Numeric attributes list the keyword “numeric”
  – No identification of the class to be predicted – flexible
• The beginning of the data is flagged with @data
• The data itself is comma-delimited (easily created from Access or Excel)
• Missing values are represented with a ?

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no

Figure 2.2 ARFF file for the weather data.

Data Preparation
• You need to understand machine learning schemes before using them for data mining
  – Some schemes treat numerics as ordinals and only compare with < > =
  – Others treat numerics as ratios and perform distance and other measurements
• If distance measurements are to be made, avoid a scheme if the dataset contains ordinals that distort distances (e.g. the income example earlier)
• Distance between nominals is frequently all or nothing (0 or 1)
• If a scheme only deals with nominals, any numerics need to be converted to nominals (e.g. age converted to young, mid, old) (some information is lost)
• If a dataset has nominals that are coded as integers, don’t confuse the scheme by marking them numeric

Normalization
• Some schemes require all numeric attributes to be on a similar scale – thus normalize or standardize (a different meaning than DB normalization)
• One normalization approach (a sketch of both rescalings follows this slide):
  Norm val = (val – min value for attribute) / (max value for attribute – min value for attribute)
• One standardization approach:
  Stand val = (val – mean) / SD
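A minimal sketch of the two rescalings above, using only the Python standard library; the function names min_max_normalize and standardize are illustrative, and the sample values are the temperature column from the weather data in Figure 2.2:

import statistics

def min_max_normalize(values):
    """Rescale to [0, 1]: (val - min) / (max - min). Assumes max != min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1: (val - mean) / SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
print(min_max_normalize(temperature))
print(standardize(temperature))

Both functions preserve the order of the values; only the scale changes, which is what distance-based schemes need.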
Missing Values
• In real datasets, missing values are frequently coded with a strange value (e.g. –1, 999999)
• Sometimes different types of missing values are distinguished – unknown vs unrecorded vs not applicable vs …
• Missing values may have meaning
  – e.g. income may be left blank more often by people whose income is particularly high or low
  – e.g. in diagnosis, a particular test may not need to be done for a particular case
  – Get a data-knowledgeable person involved
• Most machine learning schemes assume that a missing value is not particularly meaningful
  – If it is meaningful, you need to let the scheme know …

Inaccurate Values
• Errors and omissions may matter more to mining algorithms than to the source system
• Misspellings of nominal attribute values may suggest incorrect possible values
• Typos or incorrect measurements may yield numeric outliers
  – Find them via graphing / involve a data-knowledgeable person
• Duplicate records confuse the scheme by giving that data heavier weight
• Deliberate mis-entry occurs (e.g. a supermarket checkout operator entering their own bonus card)

Data Age
• We are frequently using data to predict the future
• At some point, the world/business has changed enough that the data is no longer appropriate for that

Getting to Know Your Data
• Several points above reflect this need
• Graphic display of the data can help find problems (e.g. outliers, large numbers of a special “unknown” value such as 9999, typos in nominals)
• Domain-knowledgeable people are valuable – they can explain anomalies, missing values, coding schemes
• Data cleaning is extremely important
  – At least look at some records to see what is going on
  – “Time spent looking at your data is always time well spent”

End Chapter 2
• Work on basic formatting of data into ARFF format – do japanbank – see www.lasalle.edu/~redmond/teach/658/resources.htm (a small ARFF-writing sketch follows)
• (Data courtesy of Dr. Markov of Central Connecticut State University)
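To get a feel for the ARFF layout before formatting a dataset such as japanbank, here is a minimal sketch of writing comma-delimited rows out as an ARFF file; the helper name write_arff and the two-row example are illustrative (the attributes are the weather data from Figure 2.2, not japanbank):

import csv

def write_arff(relation, attributes, rows, path):
    """Write rows as a simple ARFF file.

    attributes is a list of (name, type) pairs, where type is either the
    string "numeric" or a list of nominal values. Missing values should
    already be coded as "?" in rows. Values containing spaces or commas
    would need quoting, which this sketch does not handle.
    """
    with open(path, "w", newline="") as f:
        f.write(f"@relation {relation}\n\n")
        for name, kind in attributes:
            if kind == "numeric":
                f.write(f"@attribute {name} numeric\n")
            else:
                f.write(f"@attribute {name} {{ {', '.join(kind)} }}\n")
        f.write("\n@data\n")
        csv.writer(f).writerows(rows)

# Two instances of the weather data, just to show the output shape
attrs = [("outlook", ["sunny", "overcast", "rainy"]),
         ("temperature", "numeric"),
         ("humidity", "numeric"),
         ("windy", ["true", "false"]),
         ("play?", ["yes", "no"])]
rows = [["sunny", 85, 85, "false", "no"],
        ["overcast", 83, 86, "false", "yes"]]
write_arff("weather", attrs, rows, "weather.arff")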