Lab 6: Dimensionality Reduction

In this lab, we will explore dimensionality reduction, which can be divided into two components: Feature Selection and Feature Extraction. Feature selection is a process by which we automatically search for the best subset of attributes in our dataset. The notion of "best" is relative to the problem we are trying to solve, but typically means the highest performance score or the lowest error. Feature extraction reduces data in a high-dimensional space to a lower-dimensional space.

Three key benefits of performing dimensionality reduction on the data are:

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise
- Improves Accuracy: Less misleading data means modelling accuracy improves
- Reduces Training Time: Less data means that algorithms train faster

Prepare and Explore Data

We will use the Pima Indians Diabetes dataset for this lab. The dataset corresponds to a classification task in which we need to predict whether a person has diabetes based on 8 features. In particular, all patients are females at least 21 years old of Pima Indian heritage. There are 9 attributes, of which the Outcome attribute is the class label (0: no diabetes, 1: has diabetes).

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration (2 hours in an oral glucose tolerance test)
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

First, let us load the diabetes dataset.

In [1]:
# Import pandas library
import pandas as pd

In [2]:
# Read csv data file
df = pd.read_csv('diabetes.csv')
df.head()

Out[2]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

In [3]:
# Display the summary of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

We first examine the features to gain a better understanding of the data types we are working with. From the summary, we observe that all 9 attributes are stored as numeric (continuous) variables. Since this is a classification task, Outcome should be a categorical variable, but we will not convert Outcome to a categorical variable yet (we will do the conversion later).

In [4]:
# Find out the number of instances and attributes
df.shape

Out[4]:
(768, 9)

There are 768 instances and 9 attributes in the dataset.
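As an optional extra check (not part of the original lab output), we might also look at how the instances are distributed between the two classes before splitting the data. A minimal sketch, assuming the df loaded above:

# Count how many instances belong to each class (0: no diabetes, 1: has diabetes)
df['Outcome'].value_counts()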
In [4]:
# Indicate the target column
target = df['Outcome']
# Indicate the columns that will serve as features
features = df.drop('Outcome', axis = 1)

In [5]:
# Split data into train and test sets
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(features, target, \
                                                    test_size = 0.2, random_state = 0)

We will select and extract features from the training set.

Feature Selection

We will implement two approaches to perform feature selection.

- Filter approach
- Wrapper approach

Filter Approach

The filter approach takes only the subset of the most relevant features based on some statistical measure and does not involve any machine learning algorithm. There are many statistical measures that can be used for filter-based feature selection, and the choice of measure depends heavily on the data types of the feature and target variables. The data type can either be continuous/numerical or categorical.

We will discuss four filter approaches:
1. Variance threshold
2. Correlation coefficient
3. Chi-squared
4. Information gain

Variance Threshold

- Removes features with low variance.
- Motivated by the idea that low-variance features contain less information.
- Calculate the variance of each feature, then drop features with variance below some threshold.
- Important to make sure features have the same scale.

In order to use this filter, we will first have to transform the features to have the same scale. In the previous lab, we experimented with data standardization for feature scaling. Here we will experiment with another feature scaling method known as data normalization. Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1. After applying normalization, the scaler returns the features as an array (no longer in the form of a data frame).

In [6]:
# Import normalizer module
from sklearn import preprocessing

# Create the Scaler object
scaler = preprocessing.Normalizer()

# Fit the data on the Scaler object
scaled_features = scaler.fit_transform(x_train)

# View the first 5 rows of scaled_features array
scaled_features[0:5]

Out[6]:
array([[3.14609065e-02, 6.74162282e-01, 3.50564386e-01, 1.30338041e-01,
        5.66296317e-01, 1.58203415e-01, 3.11013533e-03, 2.42698421e-01],
       [3.28427192e-02, 7.96435940e-01, 4.92640787e-01, 1.88845635e-01,
        0.00000000e+00, 2.31541170e-01, 3.63733115e-03, 1.80634955e-01],
       [0.00000000e+00, 2.32861743e-01, 1.27015496e-01, 4.65723486e-02,
        9.59672638e-01, 7.38101161e-02, 6.02617965e-04, 3.24595157e-02],
       [5.35681315e-03, 5.83892633e-01, 2.99981536e-01, 1.12493076e-01,
        7.23169775e-01, 1.34991691e-01, 4.46222535e-03, 1.23206702e-01],
       [6.17065882e-02, 9.25598823e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 2.31399706e-01, 1.41153821e-03, 2.93106294e-01]])
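To see what the Normalizer has done, we can verify that every transformed row now has a Euclidean (L2) length of 1. This quick check is not part of the original lab; a minimal sketch, assuming the scaled_features array from the cell above:

# Optional check: compute the L2 norm of each row of scaled_features
import numpy as np

row_norms = np.linalg.norm(scaled_features, axis = 1)

# Every printed value should be (very close to) 1.0
print(row_norms[0:5])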
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. We have to set this threshold parameter. If we are not sure what value of threshold to set, we can first examine the variance of each column. Variance usually only makes sense for continuous variables.

In [7]:
names = x_train.columns

# After normalization, scaled_features is an array, so we convert it back into a data frame
scaled_features_df = pd.DataFrame(scaled_features, columns = names)

# Compute the variance of each column
scaled_features_df.var()

Out[7]:
Pregnancies                 0.000457
Glucose                     0.026139
BloodPressure               0.023945
SkinThickness               0.008652
Insulin                     0.116097
BMI                         0.003973
DiabetesPedigreeFunction    0.000004
Age                         0.006816
dtype: float64

From the column variances observed, if we set the variance threshold too high, no features will be returned. Let us set the variance threshold to 0.1.

In [8]:
# Import VarianceThreshold module
from sklearn.feature_selection import VarianceThreshold

# Create VarianceThreshold object with a variance threshold of 0.1
thresholder = VarianceThreshold(threshold = 0.1)

# Conduct variance thresholding - fit_transform() takes in an array
features_high_variance = thresholder.fit_transform(scaled_features)

# Use the get_support() function to identify the feature(s) above the variance threshold
thresholder.get_support(indices = True)

Out[8]:
array([4], dtype=int64)

[0] Pregnancies [1] Glucose [2] BloodPressure [3] SkinThickness [4] Insulin [5] BMI [6] DiabetesPedigreeFunction [7] Age

The array index starts from [0]. Index [4] indicates column 5: Insulin. Only one feature is returned. (A small sketch showing how to map the returned indices back to column names automatically appears at the end of this subsection.)

When to use variance threshold filter
As variance is a statistical measure for continuous variables, VarianceThreshold makes more sense when the features are continuous variables. When applied to categorical variables, the interpretation of variance may not make a lot of sense.
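Rather than mapping the index positions to column names by hand, we could also let pandas do the lookup. This is an optional convenience not shown in the lab; a minimal sketch, assuming the thresholder and x_train objects from above:

# Map the boolean support mask returned by the selector back to column names
selected_columns = x_train.columns[thresholder.get_support()]
print(selected_columns)   # for threshold = 0.1 this should list only 'Insulin'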
Correlation Coefficient

One way to select features that are good predictors of the target is to identify features that are highly correlated with the target. We can compute the Pearson correlation between each feature and the target. Sklearn's feature_selection module does not provide a filter based on the correlation coefficient, but we can generate a correlation matrix and use it to select good features.

The correlation coefficient takes values between -1 and 1.
- A value closer to 0 implies weaker correlation (exactly 0 implies no correlation)
- A value closer to 1 implies stronger positive correlation
- A value closer to -1 implies stronger negative correlation

The correlation coefficient only works for continuous variables. This filter approach is suitable when both the features and the target are continuous variables. This is the reason why we did not convert Outcome from integer to string earlier, so that we can compute the correlation between each feature and the target. We usually do not use the correlation coefficient filter when our target is a categorical variable.

In [9]:
# Place the x_train and y_train data frames side by side
x = pd.concat([x_train, y_train], axis = 1)

# Generate correlation matrix
cor = x.corr()

# Print correlation matrix
cor

Out[9]:
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  ...
Pregnancies                  1.000000  0.127642       0.141417      -0.084695 -0.080762  0.003036  ...
Glucose                      0.127642  1.000000       0.147744       0.057374  0.326961  0.234836  ...
BloodPressure                0.141417  0.147744       1.000000       0.214119  0.081911  0.264209  ...
SkinThickness               -0.084695  0.057374       0.214119       1.000000  0.426754  0.405780  ...
Insulin                     -0.080762  0.326961       0.081911       0.426754  1.000000  0.192086  ...
BMI                          0.003036  0.234836       0.264209       0.405780  0.192086  1.000000  ...
DiabetesPedigreeFunction    -0.047203  0.118450       0.052385       0.194594  0.189132  0.145073  ...
Age                          0.539582  0.278075       0.229556      -0.156347 -0.063597  0.009823  ...
Outcome                      0.193991  0.459278       0.057662       0.088248  0.119886  0.303850  ...

(Only the first six columns of the 9 x 9 matrix are shown in this printout.)

In [10]:
# Generating the correlation heatmap is optional
# The heatmap is just a visualization of the correlation matrix

# Import seaborn package to generate heatmap
import seaborn as sns
# Import pyplot to control the size of the plot
import matplotlib.pyplot as plt

# Set plot size
plt.figure(figsize=(12,10))

# Generate the heatmap
sns.heatmap(
    cor,
    vmin = -1, vmax = 1, center = 0,
    cmap = sns.diverging_palette(20, 220, n=200),
    square = True,
    annot = True
)

Out[10]:
<AxesSubplot:>
[Figure: heatmap of the correlation matrix for all nine attributes]

If you are interested in learning more about the heatmap parameters, check out: https://seaborn.pydata.org/generated/seaborn.heatmap.html. The more intense the color of a cell, the stronger the correlation.

In [11]:
# Select features above a correlation threshold to the target

# Correlation with target
# Apply abs() to get the absolute value so there is no need to deal with negative correlations
cor_target = abs(cor['Outcome'])

# Selecting highly correlated features
# Say we set the correlation threshold to 0.2
relevant_features = cor_target[cor_target > 0.2]
relevant_features

Out[11]:
Glucose    0.459278
BMI        0.303850
Age        0.238986
Outcome    1.000000
Name: Outcome, dtype: float64

As we can observe, only the features [Glucose, BMI, Age] have a correlation with the target variable Outcome above the threshold. Hence, we will drop all other features apart from these.

We can even further reduce the selected feature subset. Ideally, we want to keep features that are uncorrelated with each other. If some variables in the selected feature subset are correlated with each other, then we need to keep only one of them and drop the rest. This can be done by visually checking the correlation matrix above (a small sketch of this check appears at the end of this subsection).

When to use correlation coefficient filter
When the data type of our feature to be tested and the target variable are both continuous.
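As a rough illustration of that redundancy check (an optional sketch, not part of the original lab), we could slice the correlation matrix down to just the selected features and inspect their pairwise correlations:

# Pairwise correlations among the selected features only
# Low values here suggest the selected features are not redundant with each other
selected = ['Glucose', 'BMI', 'Age']
cor.loc[selected, selected]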
Chi-squared

Chi-squared is a statistic that measures the dependence between categorical variables. We can use sklearn's univariate feature selection to select the best features based on univariate statistical tests. We will use the SelectKBest function to remove all but the highest scoring features. The k parameter indicates the number of top features to select.

In [5]:
# Currently the target is integer data type
y_train.dtype

Out[5]:
dtype('int64')

In [5]:
# Convert integer to string
y_train.astype(str)

Out[5]:
603    1
118    0
247    0
157    0
468    1
      ..
763    0
192    1
629    0
559    0
684    0
Name: Outcome, Length: 614, dtype: object

Note that astype(str) returns a converted copy for display; y_train itself still holds the original integer labels.

In [8]:
# Import SelectKBest and chi2 modules
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Create a selector
# Setting k = 3 means we want the top 3 features
selector = SelectKBest(chi2, k = 3)

# Select top 3 features based on the training set
x_new = selector.fit_transform(x_train, y_train)

selector.get_support(indices=True)

Out[8]:
array([1, 4, 7], dtype=int64)

[0] Pregnancies [1] Glucose [2] BloodPressure [3] SkinThickness [4] Insulin [5] BMI [6] DiabetesPedigreeFunction [7] Age

The three best selected features are index [1], index [4] and index [7]: Glucose, Insulin and Age.

When to use Chi-squared
When the data type of our feature to be tested and the target variable are both categorical.

Information Gain

The information gain filter can be implemented using mutual_info_classif.

In [107…
# Import SelectKBest and mutual_info_classif modules
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# Create a selector
# Setting k = 3 means we want the top 3 features
selector = SelectKBest(mutual_info_classif, k = 3)

# Select top 3 features based on the training set
x_new = selector.fit_transform(x_train, y_train)

selector.get_support(indices=True)

Out[107…
array([1, 5, 7], dtype=int64)

[0] Pregnancies [1] Glucose [2] BloodPressure [3] SkinThickness [4] Insulin [5] BMI [6] DiabetesPedigreeFunction [7] Age

The three best selected features are index [1], index [5] and index [7]: Glucose, BMI and Age.

When to use Information Gain
When the data type of our feature to be tested and the target variable are both categorical.

Summary: Filter Approach

It is possible that different filter-based feature selection methods will return distinct subsets of the top n best features.

- Variance threshold: [Insulin]
- Correlation coefficient: [Glucose, BMI, Age]
- Chi-squared: [Glucose, Insulin, Age]
- Information gain: [Glucose, BMI, Age]

To determine which feature subset works the best, we can then proceed to examine the performance of the different subsets on the test set, as sketched below.
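The lab does not prescribe a particular way to do this comparison, so the following is only a minimal sketch. It assumes we evaluate each candidate subset with a simple logistic regression classifier (an arbitrary choice for illustration; any classifier could be substituted):

# Compare the candidate feature subsets returned by the filter methods above
from sklearn.linear_model import LogisticRegression

subsets = {
    'Variance threshold':      ['Insulin'],
    'Correlation coefficient': ['Glucose', 'BMI', 'Age'],
    'Chi-squared':             ['Glucose', 'Insulin', 'Age'],
    'Information gain':        ['Glucose', 'BMI', 'Age'],
}

for name, cols in subsets.items():
    # Train on the training set restricted to the subset, then score accuracy on the test set
    model = LogisticRegression(max_iter = 1000)
    model.fit(x_train[cols], y_train)
    print(name, ':', round(model.score(x_test[cols], y_test), 3))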
Wrapper Approach

We will implement Recursive Feature Elimination (RFE), which is a type of wrapper feature selection method. RFE works by recursively removing attributes and building a model on the attributes that remain. It selects features by recursively considering smaller and smaller sets of features (backward selection), and it uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.

To find out more details on the parameters of RFE, check out: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE

In [9]:
# Import RFE and machine learning algorithm
from sklearn.feature_selection import RFE
# Import SVM - We are using SVM as an example
from sklearn.svm import SVC

# Create a SVM classifier with linear kernel
svmlinear = SVC(kernel = 'linear')

# Use RFE to rank features and return top 3 features
# Parameter step corresponds to the (integer) number of features to remove at each iteration
rfe = RFE(estimator = svmlinear, n_features_to_select = 3, step = 1)
rfe.fit(x_train, y_train)

print("Number of Features: ", rfe.n_features_)
print("Feature Ranking: ", rfe.ranking_)
print("Selected Features: ", rfe.support_)

Number of Features:  3
Feature Ranking:  [1 2 4 5 6 1 1 3]
Selected Features:  [ True False False False False  True  True False]

The selected features are marked True in the support array and marked with rank "1" in the ranking array. The ranking array indicates the strength of these features.

[0] Pregnancies [1] Glucose [2] BloodPressure [3] SkinThickness [4] Insulin [5] BMI [6] DiabetesPedigreeFunction [7] Age

From the results above, the top 3 features are Pregnancies, BMI and DiabetesPedigreeFunction.

RFE (SVM Linear Kernel): [Pregnancies, BMI, DiabetesPedigreeFunction]

Feature Extraction

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized to extract information from a high-dimensional space by projecting it into a lower-dimensional sub-space.

In [130…
# Import PCA
from sklearn.decomposition import PCA

# Specify the number of components = 2
pca = PCA(n_components = 2)

# Generate the principal components
pca.fit(x_train)

# Transform the training set into principal components
train_pca = pca.transform(x_train)

# Transform the test set into principal components
test_pca = pca.transform(x_test)

# Convert train set into a data frame to make it easier to view
principalDf = pd.DataFrame(data = train_pca, \
                           columns = ['principal component 1', 'principal component 2'])
principalDf.head()

Out[130…
   principal component 1  principal component 2
0              46.852538             -28.453231
1             -83.841176              19.211268
2             599.569091               9.606891
3              51.211001              20.305400
4             -83.989380               1.647829

PCA transforms the 8 features below into the 2 principal components above, which can be used as new extracted features representing the data. This reduces the number of features fed into the machine learning model for training from 8 to 2.

In [124…
# Print the original training set before PCA
x_train.head()

Out[124…
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
603            7      150             78             29      126  35.2                     0.692   54
118            4       97             60             23        0  28.2                     0.443   22
247            0      165             90             33      680  52.3                     0.427   23
157            1      109             56             21      135  25.2                     0.833   23
468            8      120              0              0        0  30.0                     0.183   38

In [145…
# Print explained variance by principal components
print('Explained variance by component: ', pca.explained_variance_ratio_)

Explained variance by component:  [0.89142243 0.059357  ]

We can view the percentage of variance explained by each principal component using explained_variance_ratio_. The array indicates that most of the variance is explained by principal component 1 (89.1%), while principal component 2 explains about 5.9% of the variance. We can therefore conclude that the first two principal components together explain the majority of the variance.
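One caveat worth noting: PCA is sensitive to the scale of the features, and above it was applied to the unscaled training set, so the first component is dominated by the features with the largest numeric ranges (such as Insulin). A hedged sketch of an alternative, assuming we standardize the data first (the lab itself does not do this, and the variable names below are illustrative):

# Illustrative alternative: standardize the features before applying PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Rescale each feature to zero mean and unit variance (fit on training data only)
scaler_std = StandardScaler()
x_train_std = scaler_std.fit_transform(x_train)
x_test_std = scaler_std.transform(x_test)

# Fit PCA on the standardized training set and transform both sets
pca_std = PCA(n_components = 2)
train_pca_std = pca_std.fit_transform(x_train_std)
test_pca_std = pca_std.transform(x_test_std)

# With standardized inputs, the explained variance is usually spread more evenly across components
print(pca_std.explained_variance_ratio_)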
Factor Analysis

Factor Analysis (FA) can be used to search for influential underlying factors, or latent variables, in a set of observed variables. It helps reduce the number of variables by explaining the shared variance among the observed variables and condensing the set of observed variables into unobserved variables called factors.

In [49]:
# Import FactorAnalysis
from sklearn.decomposition import FactorAnalysis

# Set the number of factors to 4
factor = FactorAnalysis(n_components = 4, random_state = 101).fit(x_train)

# Convert the factor loadings into a data frame
factor_df = pd.DataFrame(factor.components_, columns = x_train.columns)
factor_df

Out[49]:
   Pregnancies    Glucose  BloodPressure  SkinThickness     Insulin       BMI  DiabetesPedigreeFunction       Age
0    -0.271827  10.712261       1.634843       6.837644  117.384877  1.536430                  0.063206 -0.722935
1    -0.574407 -29.883031      -2.818284       1.459739    1.254262 -1.447046                 -0.019330 -3.859221
2    -0.422100   0.515181     -19.009104      -4.219335    0.032111 -2.008041                 -0.013237 -2.162622
3    -0.311025   0.135564      -1.519511      13.273705   -0.057911  2.614398                  0.044057 -2.200015

After loading the data and storing all the predictive features, the FactorAnalysis class is initialized with a request to look for four factors, and the data is then fitted. We can explore the results through the components_ attribute, which returns an array containing measures of the relationship between the newly created factors, placed in rows, and the original features, placed in columns. These values are known as factor loadings. At the intersection of each factor and feature, a positive number indicates that a positive proportion exists between the two; a negative number instead points out that they diverge and one is the contrary of the other. Pay attention to the high factor loadings, regardless of the sign (positive or negative).

Method 1: Feature Selection Using Factor Analysis

One method of using factor analysis for feature selection is to study the relationships between the attributes and the factors. Think of factor analysis as an intermediate step that identifies factors as a new set of features. When several original features load on the same factor, we can pick the single most relevant feature from that factor and drop the remaining features in the same factor to reduce the number of features. In our example, we tried to group the original features into 4 meaningful factors; on this dataset, however, factor analysis did not really group the original features, as each factor ends up dominated by a single feature. For example, Insulin is highly correlated with Factor 1, Glucose is highly correlated with Factor 2, and so on. The factor loadings for DiabetesPedigreeFunction are all very low, which indicates that it is not strongly associated with any of the factors, so we can possibly remove it from further consideration. A small sketch of how the representative feature for each factor could be identified programmatically follows below.
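As one possible way to carry out Method 1 (a sketch only; the lab does this by inspection), we could pick, for each factor, the original feature with the largest absolute loading, assuming the factor_df data frame from above:

# For each factor (row of factor_df), find the feature with the largest absolute loading
representative_features = factor_df.abs().idxmax(axis = 1)
print(representative_features)
# Based on the loadings shown above, this should point to Insulin, Glucose,
# BloodPressure and SkinThickness for the four factors respectively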
Method 2: Feature Extraction Using Factor Analysis

The second method is to transform the original data into factor scores, so that our original 8 attributes are transformed into 4 attributes, with each attribute representing a factor.

In [35]:
# Get factor scores for each row of data
x_transformed = factor.fit_transform(x_train)

# Returns factor scores in an array
x_transformed

Out[35]:
array([[ 0.38291592, -0.84029596, -0.33905185,  0.36214101],
       [-0.70539703,  0.55392437,  0.29562535,  0.53180043],
       [ 5.08854146,  0.3295323 , -0.54832708, -1.68408777],
       ...,
       [-0.70644016,  0.64892983,  0.04124787,  0.35466829],
       [-0.71025868,  0.93100429, -0.34030281, -1.30005474],
       [-0.69202444, -0.76640417, -0.49131209, -1.33382239]])

In [51]:
# Convert array into a data frame
df_x_transformed = pd.DataFrame(x_transformed)

# Set the index in df_x_transformed to match the index in x_train so the rows can be mapped back to the original data
df_x_transformed.set_index(x_train.index, inplace = True)

# Rename column names
df_x_transformed.columns = ['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4']

# Factor scores are transformed into features
df_x_transformed

Out[51]:
     Factor 1  Factor 2  Factor 3  Factor 4
603  0.382916 -0.840296 -0.339052  0.362141
118 -0.705397  0.553924  0.295625  0.531800
247  5.088541  0.329532 -0.548327 -1.684088
157  0.444040  0.567956  0.649008 -0.048220
468 -0.698988 -0.160049  3.525566 -0.034909
..        ...       ...       ...       ...
763  0.824161  0.952684 -0.539542  1.174097
192 -0.684019 -1.513588  0.400504 -0.836227
629 -0.706440  0.648930  0.041248  0.354668
559 -0.710259  0.931004 -0.340303 -1.300055
684 -0.692024 -0.766404 -0.491312 -1.333822

614 rows × 4 columns

Factor analysis transforms the 8 features into 4 features (factors) that can be used as new extracted features representing the data. This reduces the number of features fed into the machine learning model for training from 8 to 4.
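As with PCA earlier, before training a model we would also project the test set into the same factor space using the already-fitted factor object. A minimal sketch (the variable names below are illustrative and not part of the lab):

# Transform the test set with the factor model fitted on the training set
x_test_transformed = factor.transform(x_test)

# Convert to a data frame with matching index and column names
df_x_test_transformed = pd.DataFrame(x_test_transformed, index = x_test.index,
                                     columns = ['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4'])
df_x_test_transformed.head()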