Data Transformation and Feature Selection/Extraction
Qiang Yang
Thanks: J. Han, Isabelle Guyon, Martin Bachler
Data Mining: Concepts and Techniques
April 13, 2015

Continuous Attribute: Temperature

Outlook    Temperature  Humidity  Windy  Class
sunny      40           high      false  N
sunny      37           high      true   N
overcast   34           high      false  P
rain       26           high      false  P
rain       15           normal    false  P
rain       13           normal    true   N
overcast   17           normal    true   P
sunny      28           high      false  N
sunny      25           normal    false  P
rain       23           normal    false  P
sunny      27           normal    true   P
overcast   22           high      true   P
overcast   40           normal    false  P
rain       31           high      true   N

Discretization

Three types of attributes:
Nominal: values from an unordered set. Example: attribute "outlook" in the weather data, with values "sunny", "overcast", and "rainy".
Ordinal: values from an ordered set. Example: attribute "temperature" in the weather data, with values "hot" > "mild" > "cool".
Continuous: real numbers.

Discretization divides the range of a continuous attribute into intervals. It is needed because some classification algorithms only accept categorical attributes, and it also reduces data size. Methods are either supervised (e.g., entropy-based) or unsupervised (e.g., binning).

Simple Discretization Methods: Binning

Equal-width (distance) partitioning divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A) / N. This is the most straightforward method, but outliers may dominate the result, and skewed data is not handled well.

Equal-depth (frequency) partitioning divides the range into N intervals, each containing approximately the same number of samples.

Histograms

A histogram is a popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket. In one dimension an optimal histogram can be constructed using dynamic programming. Histograms are related to quantization problems.
[Figure: example histogram with value buckets from 10,000 to 100,000 on the x-axis and counts from 0 to 40 on the y-axis.]

Supervised Method: Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)
The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization. This is a greedy method: candidate boundaries T are scanned from the smallest to the largest value of attribute A, and the process is applied recursively to each resulting interval until some stopping criterion is met, e.g., the information gain Ent(S) - E(T, S) falls below some user-given threshold. (A worked code sketch follows the ordinal-to-boolean slide below.)

How to Calculate Ent(S)?

Given two classes, Yes and No, in a set S:
Let p1 be the proportion of Yes and p2 the proportion of No, so p1 + p2 = 100%.
The entropy is Ent(S) = -p1*log(p1) - p2*log(p2).
When p1 = 1 and p2 = 0, Ent(S) = 0; when p1 = p2 = 50%, Ent(S) is at its maximum.

Transformation: Normalization

Min-max normalization:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Z-score normalization:
v' = (v - mean_A) / stddev_A
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Transforming Ordinal to Boolean

A simple transformation codes an ordinal attribute with n values using n-1 boolean attributes. Example for attribute "temperature":

Original data    Transformed data
Temperature      Temperature > cold    Temperature > medium
cold             false                 false
medium           true                  false
hot              true                  true

How many binary attributes should we introduce for nominal values such as "Red" vs. "Blue" vs. "Green"?
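To make the entropy-based discretization concrete, here is a minimal Python sketch (not part of the original slides) that applies the Ent(S) and E(S, T) formulas above to the temperature column of the weather table; the data list and function names are illustrative assumptions.

```python
import math

# (temperature, class) pairs taken from the weather table above.
data = [(40, 'N'), (37, 'N'), (34, 'P'), (26, 'P'), (15, 'P'), (13, 'N'), (17, 'P'),
        (28, 'N'), (25, 'P'), (23, 'P'), (27, 'P'), (22, 'P'), (40, 'P'), (31, 'N')]

def ent(labels):
    """Ent(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_boundary(pairs):
    """Pick the boundary T minimizing E(S,T) = |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    pairs = sorted(pairs)                      # scan T from smallest to largest value
    labels = [c for _, c in pairs]
    n = len(pairs)
    best_t, best_e = None, float('inf')
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:     # skip duplicate attribute values
            continue
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        e = (i / n) * ent(labels[:i]) + ((n - i) / n) * ent(labels[i:])
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

t, e_after = best_boundary(data)
print("boundary T =", t, "information gain =", ent([c for _, c in data]) - e_after)
```

In a full recursive implementation the same search would be re-applied to each resulting interval until the gain Ent(S) - E(T, S) drops below the user-given threshold.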
Data Sampling

Sampling allows a mining algorithm to run in a complexity that is potentially sub-linear in the size of the data by choosing a representative subset of the data. Simple random sampling may perform very poorly in the presence of skewed (uneven) classes, so adaptive sampling methods have been developed. Stratified sampling approximates the percentage of each class (or subpopulation of interest) in the overall database and is used in conjunction with skewed data. (A stratified-sampling code sketch follows the summary below.)

[Figure: a random sample drawn from the raw data.]

Sampling Example
[Figure: raw data versus a cluster/stratified sample.]

Summary

Data preparation is a big issue for data mining. Data preparation includes transformations such as:
Data sampling and feature selection
Discretization
Missing value handling
Incorrect value handling
Feature selection and feature extraction
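As a companion to the sampling slides, here is a minimal stratified-sampling sketch in Python; the record layout, the `cls` field, and the sampling fraction are illustrative assumptions, not from the slides. It draws the same fraction from each class so the sample preserves the class proportions of the raw data.

```python
import random
from collections import defaultdict

def stratified_sample(records, label, fraction, seed=0):
    """Draw about `fraction` of the records from each class (stratum),
    so class proportions in the sample mirror the overall database."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[label]].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(fraction * len(group)))   # keep at least one record per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical skewed data set: 95 records of class 'P', 5 of class 'N'.
records = ([{'id': i, 'cls': 'P'} for i in range(95)] +
           [{'id': i, 'cls': 'N'} for i in range(95, 100)])
sample = stratified_sample(records, label='cls', fraction=0.2)
classes = [r['cls'] for r in sample]
print(len(sample), "sampled:", classes.count('P'), "P,", classes.count('N'), "N")
```

A simple random sample of the same size could easily miss the minority class 'N' entirely, which is the failure mode the sampling slide warns about.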