Feature Engineering

【Unit Outline】
1. Missing values and imputation
2. Anomaly detection / outlier detection
3. Sampling / resampling
4. Imbalanced classification
5. Standardization / normalization
6. Dimensionality reduction

(Minimal Python sketches for each unit are collected at the end of these notes.)

1. Missing Values and Imputation

6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples):
- Do nothing
- Imputation using mean/median values
- Imputation using most-frequent or zero/constant values
- Imputation using k-NN
- Imputation using Multivariate Imputation by Chained Equations (MICE)
- Imputation using deep learning (Datawig)

【Imputation Using k-NN】
It first fills the missing entries with a basic mean impute, then uses the resulting complete dataset to construct a KDTree and imputes each missing value from its nearest neighbours.
Pros: Can be much more accurate than the mean, median, or most-frequent imputation methods (it depends on the dataset).
Cons: Computationally expensive, since k-NN stores the whole training dataset in memory; k-NN is also quite sensitive to outliers in the data (unlike SVM).

【Multiple Imputation by Chained Equations (MICE) tutorial video】
【Multiple Imputation by Chained Equations (MICE)】

2. Anomaly Detection / Outlier Detection

【Three methods implemented】
- Standard deviation method: flag values more than 3 * standard deviation from the mean
- Interquartile range method: flag values outside [Q1 − 1.5 * IQR, Q3 + 1.5 * IQR]
- Automatic outlier detection: DBSCAN, Isolation Forest

3. Sampling / Resampling

Basic sampling methods:
- Simple random sampling
- Systematic sampling
- Stratified sampling
- Cluster sampling
【Source】https://zh.wikipedia.org/wiki/抽樣

Resampling: the bootstrap
【Source】Bootstrapping – A Powerful Resampling Method in Statistics

4. Imbalanced Classification

Example: classifying edible vs. poisonous mushrooms.
- Oversample the minority class
- Undersample the majority class
- SMOTE: Synthetic Minority Oversampling Technique
【Source】Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results

5. Standardization / Normalization

Common approaches:
- Min-max normalization: rescales each feature proportionally into a fixed interval such as [0, 1] or [−1, 1]. For [0, 1]: x_new = (x − min) / (max − min)
- Standard deviation normalization (standardization): rescales each feature to mean 0 and standard deviation 1: z = (x − μ) / σ

6. Dimensionality Reduction

Principal component analysis (PCA)
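【Appendix: Python Code Sketches】

The sketches below are minimal illustrations of the techniques listed in each unit; all datasets, column names, and parameter values in them are made up for demonstration and are not part of the original notes. First, the simpler imputation strategies from Unit 1 (mean/median, most frequent, zero/constant) can be expressed with scikit-learn's SimpleImputer:

```python
# Minimal sketch: mean / median / most-frequent / constant imputation with SimpleImputer.
# X is a made-up matrix with two missing entries.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0],
              [4.0, 7.0]])

for strategy in ["mean", "median", "most_frequent"]:
    print(strategy, SimpleImputer(strategy=strategy).fit_transform(X))

# zero / constant fill
print("constant", SimpleImputer(strategy="constant", fill_value=0).fit_transform(X))
```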
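A minimal k-NN imputation sketch using scikit-learn's KNNImputer. Note that this class averages the k nearest complete neighbours using a NaN-aware distance, which is a stand-in for (not identical to) the mean-impute-then-KDTree procedure described above; n_neighbors=2 and the toy matrix are illustrative choices.

```python
# k-NN imputation: each missing value is filled with the mean of its 2 nearest rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)  # illustrative neighbour count
print(imputer.fit_transform(X))
```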
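For MICE, a rough stand-in is scikit-learn's IterativeImputer, which cycles regressions of each incomplete feature on the other features. It is still flagged experimental (hence the enable_iterative_imputer import), and a single call like this yields one completed dataset rather than full multiple imputation.

```python
# MICE-style sketch with IterativeImputer; X and the parameters are illustrative.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [np.nan, 8.0, 1.0]])

mice = IterativeImputer(max_iter=10, random_state=0)
print(mice.fit_transform(X))
```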
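For Unit 2, a sketch of the 3 * standard deviation and 1.5 * IQR rules, run on synthetic data with two planted outliers:

```python
# Rule-of-thumb outlier filters: mean ± 3*std, and [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, size=200), [120.0, -30.0]])  # two planted outliers

# 3 * standard deviation rule
mu, sigma = data.mean(), data.std()
sd_outliers = data[np.abs(data - mu) > 3 * sigma]

# 1.5 * IQR rule
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(sd_outliers, iqr_outliers)
```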
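A sketch of the two automatic detectors named above; contamination, eps, and min_samples are illustrative values that would need tuning on real data.

```python
# Automatic outlier detection: IsolationForest flags outliers with predict() == -1,
# DBSCAN labels noise points as -1 in labels_.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0], [-9.0, 7.0]]])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
iso_outliers = X[iso.predict(X) == -1]

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
dbscan_outliers = X[db.labels_ == -1]

print(len(iso_outliers), len(dbscan_outliers))
```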
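For Unit 3, a toy pandas illustration of the four basic sampling schemes; the DataFrame, the stratum column, and the cluster ids are invented purely for demonstration.

```python
# Simple random, systematic, stratified, and cluster sampling on a toy DataFrame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=100),
    "stratum": rng.choice(["A", "B", "C"], size=100),  # e.g. region or class label
    "cluster": rng.integers(0, 10, size=100),          # e.g. school or village id
})

simple_random = df.sample(n=20, random_state=0)                 # simple random sampling
systematic = df.iloc[::5]                                       # systematic: every 5th row
stratified = df.groupby("stratum", group_keys=False).sample(frac=0.2, random_state=0)
chosen = rng.choice(df["cluster"].unique(), size=3, replace=False)
cluster_sample = df[df["cluster"].isin(chosen)]                 # cluster: keep whole clusters

print(len(simple_random), len(systematic), len(stratified), len(cluster_sample))
```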
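A bootstrap sketch: resample the observed data with replacement many times and read a confidence interval off the distribution of the recomputed statistic (here the mean). The sample itself is synthetic.

```python
# Percentile bootstrap confidence interval for the mean.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=50)   # e.g. 50 observed heights

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile interval
print(round(ci_low, 1), round(ci_high, 1))
```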
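For Unit 4, a sketch of random oversampling, random undersampling, and SMOTE using the third-party imbalanced-learn package (assumed installed as imblearn); a synthetic 90/10 dataset stands in for the edible/poisonous mushroom example.

```python
# Rebalancing an imbalanced binary dataset three ways and comparing class counts.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

for name, sampler in [("oversample", RandomOverSampler(random_state=0)),
                      ("undersample", RandomUnderSampler(random_state=0)),
                      ("SMOTE", SMOTE(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```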
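The two scaling formulas in Unit 5 map directly onto scikit-learn's MinMaxScaler and StandardScaler; the small matrix is made up.

```python
# MinMaxScaler applies x_new = (x - min) / (max - min) per column;
# StandardScaler applies z = (x - mu) / sigma per column.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 1000.0]])

print(MinMaxScaler().fit_transform(X))     # per-column min-max normalization to [0, 1]
print(StandardScaler().fit_transform(X))   # per-column standardization (z-scores)
```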
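Finally, a minimal PCA sketch for Unit 6; the iris dataset and the 95% explained-variance threshold are arbitrary stand-ins.

```python
# Standardize the features, then keep enough principal components to explain ~95% of the variance.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)          # float: keep components covering 95% of variance
X_reduced = pca.fit_transform(X_std)
print(X.shape, "->", X_reduced.shape, pca.explained_variance_ratio_)
```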