Chapter 7: Transformations

Attribute Selection
• Adding irrelevant attributes confuses learning algorithms, so such attributes should be avoided
• Both divide-and-conquer and separate-and-conquer algorithms suffer from this; Naïve Bayes copes well with irrelevant attributes (though redundant ones do hurt it)
• So first choose the attributes to be considered and then proceed: a form of dimensionality reduction
• Scheme-independent selection:
– Keep just enough attributes to divide up the instance space in a way that separates all the training instances. For example, in Table 1, if we were to drop outlook, instances 1 and 4 would become indistinguishable, which is not acceptable. Checking subsets this way is a very tedious procedure.
• Using machine learning algorithms for attribute selection:
– Decision tree: build a tree on all attributes and select only those that are actually used in its decisions; the selected attributes can then be passed to another chosen learning algorithm
– Linear SVM: rank attributes by the magnitude of their weights and discard the lowest-ranked ones, repeating the process (recursive feature elimination)
– Instance-based learning methods:
• Sample instances randomly from the training set
• Check neighboring records of the same and different classes (near hits and near misses)
• If a near hit has a different value for a certain attribute, that attribute appears to be irrelevant: reduce its weight
• If a near miss has a different value, the attribute appears to be relevant: increase its weight
• After repeating this procedure many times, selection takes place: only attributes with positive weights are chosen
• Searching the attribute space (Fig. 7.1):
– Forward selection: start with the empty set and keep adding attributes
– Backward elimination: start with all attributes and eliminate them one by one
– Bidirectional search: a combination of the above two
• Scheme-specific selection:
– Cross-validation is used to measure the effectiveness of a subset of attributes

Discretizing Numeric Attributes
• Global discretization, as used in the 1R learning scheme: sort the instances by the attribute's value and place range boundaries at the points where the class value changes, subject to a minimum-coverage criterion for each interval
• Local discretization, as used in decision trees: whenever a specific attribute is used to split a node, a decision is made about the value at which the break should take place
• A numeric attribute can also be transformed into k binary variables
• Unsupervised discretization: ignore the classes of the training set and break the value range into intervals, e.g., equal-interval binning or equal-frequency binning; this runs the risk of destroying distinctions within an interval or bin (a short sketch of both schemes follows this section)
• Supervised discretization: takes the classes into account while forming the intervals
• Proportional k-interval discretization: equal-frequency binning with the number of bins chosen as the square root of the number of instances; this has been found to work very well
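The following is a minimal Python/NumPy sketch of the two unsupervised binning schemes above. The function names and example values are illustrative assumptions, not code from the book; the choice of k as the square root of the number of instances mirrors proportional k-interval discretization.

import numpy as np

def equal_width_bins(values, k):
    # Equal-interval binning: split the value range into k equally wide
    # intervals, ignoring the class labels entirely.
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.digitize(values, edges[1:-1], right=True), edges

def equal_frequency_bins(values, k):
    # Equal-frequency binning: choose cut points (quantiles) so that each
    # bin holds roughly the same number of instances.
    edges = np.quantile(values, np.linspace(0.0, 1.0, k + 1))
    return np.digitize(values, edges[1:-1], right=True), edges

if __name__ == "__main__":
    # Hypothetical numeric attribute (e.g., temperature readings).
    temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85], dtype=float)
    k = int(round(np.sqrt(len(temps))))   # proportional k-interval: k ~ sqrt(n)
    bins, edges = equal_frequency_bins(temps, k)
    print("k =", k, "cut points:", edges[1:-1], "bin index per instance:", bins)

Either function returns a bin index (0 to k-1) per instance plus the interval edges; note that neither looks at the class labels, which is exactly why distinctions within a bin can be lost.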
Entropy-based Discretization
• One example: order the values of the attribute and, for each possible break point, compute the information value (entropy) of the resulting split (pp. 298-299); split at the point where this value is smallest, i.e., where the information gain is largest
– Over all candidate break points, find the one with the smallest information value (call it A)
– Repeat this procedure for each of the parts formed by breaking at A
– Repeat this step recursively until a stopping criterion is met

Some Useful Transformations
• Examples:
– Subtracting one date attribute from another to obtain a new age attribute
– Converting two attributes A and B into a new attribute A/B representing their ratio
– Reducing several nominal attributes to one by concatenating their values, producing a single attribute with k1 x k2 possible values
• Principal component analysis: use a special coordinate system that depends on the given cloud of points, as follows: place the first axis in the direction of greatest variance of the points, so that the variance along that axis is maximized; choose the second axis perpendicular to the first (in the multidimensional case, choose it to maximize the remaining variance); and so on. Finally, keep the axes that contribute the most variance: the principal components
• http://en.wikipedia.org/wiki/Principal_components_analysis
• http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Random Projections
• Since PCA is expensive (cubic in the number of dimensions), an alternative is to project the data randomly into a subspace with a predetermined number of dimensions

Text to attribute vector
• Convert a document to a vector over the words that occur in it; the entries can be word frequencies or simply the presence/absence of each word
• In other words, a document is characterized by the words that appear often in it

Time series
• Sometimes we may replace the attribute values by the differences between successive values; this is a standard time-series transformation

Automatic Data Cleansing
• Data mining techniques themselves can sometimes help to solve the problem of cleansing corrupted data
• Decision trees induced from data can be improved by discarding misclassified instances from the training set, relearning, and repeating until there are no more misclassified instances
• Robust regression: by removing outliers, linear regression is improved

Combining Multiple Models
• Bagging, boosting, and stacking are prominent methods for combining multiple models
• Bagging: models receive equal weight; the combined output is, for example, the majority vote over the individual models' predictions (a minimal sketch follows these notes)
• Boosting: similar to bagging, except that it assigns different weights to different models' outputs
• Option trees (Fig. 7.10) and Fig. 7.11 (a negative value means play = yes; a positive value means play = no)
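Below is a minimal sketch of bagging by majority vote, using scikit-learn's decision tree as an assumed base learner; the function names and parameters (n_models, the random seed) are illustrative, not the implementation described in the chapter.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier   # any base learner with fit/predict would do

def bagging_fit(X, y, n_models=10, seed=0):
    # Train n_models classifiers, each on a bootstrap sample (drawn with
    # replacement) of the training set; every model gets equal weight.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)              # bootstrap sample indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Combine the models by majority vote: the predicted class for each
    # instance is the one output by most of the individual models.
    votes = np.array([m.predict(X) for m in models])  # shape: (n_models, n_instances)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

With X as an (instances x attributes) NumPy array and y the class labels, bagging_predict(bagging_fit(X, y), X_test) gives the bagged predictions; boosting differs in that the models would be trained sequentially and their votes weighted rather than counted equally.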