Uploaded by Luo Yao Lin

Feature Engineering

【Unit Outline】
1. Missing values and imputation
2. Anomaly detection / Outlier detection
3. Sampling / Resampling
4. Imbalanced classification
5. Standardization / Normalization
6. Dimensionality reduction
1. Missing values and imputation

6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)

Do Nothing

Imputation Using (Mean/Median) Values

Imputation Using (Most Frequent) or (Zero/Constant) Values

Imputation Using k-NN

Imputation Using Multivariate Imputation by Chained Equation (MICE)

Imputation Using Deep Learning (Datawig)
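The simplest strategies in the list above (mean, median, most frequent, and zero/constant) can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the helper name `impute`, the strategy names, and the toy `ages` column are all assumptions made for this example.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=0):
    """Return a copy of `values` with every None replaced by one fill value."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Toy column (made up for illustration); None marks a missing value.
ages = [20, None, 30, 30, None, 100]
mean_filled = impute(ages, "mean")
median_filled = impute(ages, "median")
mode_filled = impute(ages, "most_frequent")
zero_filled = impute(ages, "constant", fill_value=0)
```

In this toy column the mean fill (45) is pulled upward by the single large value 100, while the median and most-frequent fills (30) are not, which is one reason to compare strategies before choosing one.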
【Imputation Using k-NN】


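The method first creates a basic mean imputation, then uses the resulting complete data to construct a KDTree, which is queried to find the nearest neighbours of each sample with missing values.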
Pros:

Can be much more accurate than mean, median, or most-frequent imputation, depending on the dataset.

Cons:

Computationally expensive: k-NN works by storing the whole training dataset in memory.

Quite sensitive to outliers in the data (unlike SVM).
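The k-NN idea described above can be sketched as follows. Note the assumptions: this version searches neighbours by brute force instead of building a KDTree, it averages the neighbours without weighting, and the names `column_means` and `knn_impute` and the toy data are hypothetical.

```python
import math

def column_means(rows):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return means

def knn_impute(rows, k=2):
    """Fill None entries with the average of the k nearest rows."""
    means = column_means(rows)
    # Step 1: basic mean imputation so that every row is complete.
    filled = [[means[j] if v is None else v for j, v in enumerate(row)]
              for row in rows]
    result = [list(r) for r in filled]
    for i, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Step 2: brute-force nearest-neighbour search on the filled data
        # (a KDTree would serve the same purpose on larger datasets).
        others = [r for idx, r in enumerate(filled) if idx != i]
        neighbours = sorted(others, key=lambda r: math.dist(r, filled[i]))[:k]
        # Step 3: replace each missing entry with the neighbours' average.
        for j in missing:
            result[i][j] = sum(n[j] for n in neighbours) / len(neighbours)
    return result

# Toy data (made up): two tight groups, one missing value in the first group.
rows = [[1.0, 2.0], [1.1, None], [0.9, 1.8], [8.0, 9.0], [8.2, 9.4]]
imputed = knn_impute(rows, k=2)
```

Here plain mean imputation would fill the gap with 5.55, the average over both groups, whereas the two nearest rows give about 1.9, which matches the group the sample belongs to.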
【Multiple Imputation by Chained Equations (MICE) tutorial video】

【Multiple Imputation by Chained Equations (MICE)】
2. Anomaly detection / Outlier detection
【Three Implementation Methods】

Standard Deviation Method (cut-off: 3 * Standard Deviation from the mean)

Interquartile Range Method (cut-off: 1.5 * IQR beyond the first and third quartiles)

Automatic Outlier Detection

DBSCAN

Isolation Forest
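The first two methods reduce to a cut-off rule, so they can be sketched with the standard library. This is a minimal sketch under stated assumptions: the function names and the toy data are made up, the sample standard deviation is used, and quartiles come from `statistics.quantiles` with its default method. DBSCAN and Isolation Forest are not covered here.

```python
from statistics import mean, stdev, quantiles

def std_outliers(data, n_std=3):
    """Values more than n_std sample standard deviations from the mean."""
    mu, sigma = mean(data), stdev(data)
    lower, upper = mu - n_std * sigma, mu + n_std * sigma
    return [x for x in data if x < lower or x > upper]

def iqr_outliers(data, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

# Toy data (made up): the integers 1..20 plus one extreme value.
data = list(range(1, 21)) + [200]
by_std = std_outliers(data)
by_iqr = iqr_outliers(data)
```

Both rules flag only the value 200 in this example. On small samples the standard deviation rule can miss an extreme value, because the outlier itself inflates the standard deviation; the IQR rule is less affected by this.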
3. Sampling / Resampling
Basic sampling methods [Reference] https://zh.wikipedia.org/wiki/抽樣

Simple random sampling

Systematic sampling

Stratified sampling

Cluster sampling
【Source】https://zh.wikipedia.org/wiki/抽樣
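The four basic sampling methods can each be sketched in a few lines. The function names, the `fraction` parameter, and the toy populations are assumptions for illustration; the systematic version assumes the sample size is no larger than the population.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, rng):
    """Every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Pick a random start, then take every step-th element."""
    step = len(population) // n
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(n)]

def stratified_sample(population, strata_key, fraction, rng):
    """Sample the same fraction from each stratum (group)."""
    strata = defaultdict(list)
    for item in population:
        strata[strata_key(item)].append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = list(range(100))
srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
people = [(i, "A") for i in range(80)] + [(i, "B") for i in range(80, 100)]
stratified = stratified_sample(people, lambda p: p[1], 0.1, rng)
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8, 9]}
clustered = cluster_sample(clusters, 2, rng)
```

With a 10% fraction, the stratified sample keeps the 80/20 group ratio of the population (8 from A, 2 from B), which simple random sampling only achieves on average.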
Resampling -- Bootstrap
【Source】Bootstrapping – A Powerful Resampling Method in Statistics
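The bootstrap draws many samples with replacement from the observed data and recomputes a statistic on each one. The sketch below uses the percentile method to turn those estimates into a confidence interval; the function name `bootstrap_ci`, its defaults, and the toy measurements are assumptions for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the same size as the data, drawn WITH replacement.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy measurements (made up for illustration).
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.3, 4.0]
low, high = bootstrap_ci(data)
```

The same function works for other statistics, for example `bootstrap_ci(data, stat=median)` after importing `median` from `statistics`, which is the main appeal of the method: no formula for the sampling distribution is needed.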
4. Imbalanced classification

Example: edible versus poisonous mushrooms

#### Oversample Minority Class

#### Undersample Majority Class
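Random oversampling duplicates minority samples until every class matches the largest one; random undersampling discards majority samples until every class matches the smallest one. A minimal sketch follows, with hypothetical function names and a made-up toy dataset that borrows the edible/poisonous labels from the mushroom example.

```python
import random
from collections import Counter

def random_oversample(X, y, rng):
    """Duplicate random samples of smaller classes up to the largest class size."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = rng.choices(pool, k=target - count)  # with replacement
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        kept = rng.sample(pool, target)  # without replacement
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data (made up): 8 edible and 2 poisonous samples, one feature each.
X = [[float(i)] for i in range(10)]
y = ["edible"] * 8 + ["poisonous"] * 2
rng = random.Random(0)
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
```

Resampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class ratio.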

#### SMOTE: Synthetic Minority Oversampling Technique
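Instead of duplicating samples, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that core step only: it picks the base sample at random rather than generating a fixed number per sample, and the name `smote_sketch`, its parameters, and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points between minority samples and neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (brute force).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        # New point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic

# Toy minority class (made up): four points on a line.
minority = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.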
【Source】Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results
5. Standardization / Normalization
Common methods for standardization:


min max normalization:

Rescales the feature data proportionally to a given interval, for example [0, 1] or [-1, 1].

Taking [0, 1] as an example: $x_{new} = \frac{x - \min}{\max - \min}$
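standard deviation normalization:

Rescales the feature data to have a mean of 0 and a standard deviation of 1. The result follows a standard normal distribution only when the original data are normally distributed, because the shape of the distribution is unchanged.

$x_{new} = \frac{x - \mu}{\sigma}$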
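Both rescaling formulas can be sketched directly. The function names and the toy feature are assumptions; the z-score version uses the population standard deviation (`pstdev`), and neither function guards against a constant feature, where the denominator would be zero.

```python
from statistics import mean, pstdev

def min_max_scale(xs, new_min=0.0, new_max=1.0):
    """Rescale xs linearly so min(xs) -> new_min and max(xs) -> new_max."""
    lo, hi = min(xs), max(xs)
    return [new_min + (x - lo) * (new_max - new_min) / (hi - lo) for x in xs]

def z_score(xs):
    """Rescale xs to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

# Toy feature (made up for illustration).
feature = [10, 20, 30, 40, 50]
unit_range = min_max_scale(feature)             # interval [0, 1]
symmetric = min_max_scale(feature, -1.0, 1.0)   # interval [-1, 1]
standardized = z_score(feature)
```

In practice the minimum, maximum, mean, and standard deviation are computed on the training data and then reused to transform new data, so that both are scaled consistently.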
6. Dimensionality reduction

Principal component analysis (PCA)
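PCA finds the directions of largest variance in centered data. As a minimal sketch, the code below recovers only the first principal component, using power iteration on the covariance matrix rather than a full eigendecomposition; the function names, the random starting vector, and the toy data are assumptions for illustration.

```python
import math
import random

def first_principal_component(rows, n_iter=200, seed=0):
    """Return (column means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """1-D coordinates of each row along the component v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Toy data (made up): points lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
means, component = first_principal_component(rows)
scores = project(rows, means, component)
```

Since the toy points lie on a single line, the component comes out parallel to (1, 2) and the one-dimensional scores keep all of the variance, which is the ideal case for reducing two dimensions to one.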