Advanced Business Analytics

Topics covered
• Data Cleansing
• Exploratory Data Analysis (EDA)
• Random Forest

Introduction to the dataset
Objective: To find the dependence of house prices on different factors
Some features of the dataset:
• Number of columns: 25
• Number of rows: 1460
• Some of the categorical variables are already in encoded form in the dataset
• Response variable: SalePrice
• PavedDrive, SaleType and SaleCondition are the three categorical variables

Data Cleansing: Columns having missing values
• PavedDrive (3.22%): Categorical
  Missing values for PavedDrive have been replaced by the most frequent category
• GrLivArea (2.81%): Numerical
  Missing values for GrLivArea have been replaced by the column mean

EDA
After data cleansing, the following steps are performed as part of EDA:
1. For the three categorical variables, the percentage of values belonging to each category is obtained.
   Observations:
   • The most frequent value of PavedDrive is Yes
   • The most frequent value of SaleType is WD (Warranty Deed)
   • SaleCondition is Normal in most cases
2. For the numerical variables, descriptive statistics are computed: mean, quartiles, minimum, maximum values etc. for each of the 21 numerical variables

EDA: Histogram
Summary statistics of SalePrice:
count     1460.000000
mean    180921.195890
std      79442.502883
min      34900.000000
25%     129975.000000
50%     163000.000000
75%     214000.000000
max     755000.000000
Observations from the histogram:
• SalePrice deviates from the normal distribution
• Highly right skewed (skewness is 1.8829)
• Peaked (kurtosis is 6.5363)
• Thus the response variable contains a significant number of outliers

EDA for numerical variables: Correlation matrix
Identifying the explanatory variables with maximum correlation with SalePrice (correlation coefficient > 0.5)
Observations: High correlation between some explanatory variables
• TotalBsmtSF and 1stFlrSF
• GarageCars and GarageArea
Final set of numerical explanatory variables:
• OverallQual
• YearBuilt
• YearRemodAdd
• TotalBsmtSF
• GrLivArea
• FullBath
• GarageCars
Note: The correlation matrix includes only the numerical variables, not the non-numerical variables

EDA for numerical variables: Scatter plot
Observations:
• Strong linear relationship between SalePrice and GrLivArea
• Linear/exponential relationship between SalePrice and TotalBsmtSF

EDA: Relationship with categorical features
Box plot observations:
• SalePrice is on the higher side when the driveway is paved
• SalePrice increases with the Quality rating, where 1 is the lowest category and 10 is the highest

Random Forest
Random forest is a supervised learning algorithm which is used for both classification and regression. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The trees in a random forest are built independently, so they can be trained in parallel; there is no interaction between them during training.

STEPS in implementing Random Forest Regression
• All the non-numerical variables are encoded to numerical variables and the correlation matrix is rechecked. The correlations found earlier remain the same in this case
• The given data set is split into train and test data sets
• The features include the explanatory variables only, while the labels contain the target response variable
• Establish a baseline error: our model will be accepted only if it can improve upon the baseline
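The preparation steps up to this point (imputation, category percentages, skewness and kurtosis, correlation filter, encoding, train/test split, baseline error) can be sketched roughly as below. This is an illustrative sketch, not the code behind the slides: the data is a small synthetic stand-in, and the one-hot encoding, the 75/25 split and the choice of baseline (always predicting the mean training SalePrice) are assumptions, since the slides do not state them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data (the real file is not shown).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GrLivArea": rng.uniform(600, 3000, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "PavedDrive": rng.choice(["Y", "N", "P"], n, p=[0.90, 0.06, 0.04]),
})
df["SalePrice"] = (50000 + 80 * df["GrLivArea"]
                   + 20000 * df["OverallQual"] + rng.normal(0, 15000, n))

# Knock out a few values so there is something to impute.
df.loc[df.index[:5], "GrLivArea"] = np.nan
df.loc[df.index[5:10], "PavedDrive"] = np.nan

# Data cleansing: most frequent category for PavedDrive, mean for GrLivArea.
df["PavedDrive"] = df["PavedDrive"].fillna(df["PavedDrive"].mode()[0])
df["GrLivArea"] = df["GrLivArea"].fillna(df["GrLivArea"].mean())

# EDA: percentage of values per category, skewness and (excess) kurtosis.
paved_share = df["PavedDrive"].value_counts(normalize=True) * 100
skewness = df["SalePrice"].skew()
kurtosis = df["SalePrice"].kurt()

# EDA: numerical variables whose correlation with SalePrice exceeds 0.5.
corr = df.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")
selected = corr[corr > 0.5].index.tolist()

# Encode non-numerical variables (one-hot encoding is an assumption here).
features = pd.get_dummies(df.drop(columns="SalePrice"))
labels = df["SalePrice"]

# Train/test split (the 75/25 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

# Baseline (assumed): always predict the mean SalePrice of the training set.
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mae = float(np.mean(np.abs(baseline_pred - y_test)))
print(f"Baseline MAE: {baseline_mae:.0f}")
```

Any model fitted afterwards would then be compared against `baseline_mae` on the same test set.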
• Instantiate the model and fit it on the training data
• Make predictions on the test data using the model
  Observations: The mean absolute error of our predictions is less than the baseline error. Thus, our model can be accepted as it improves upon the baseline
• Determine performance metrics of our model
  We calculate the accuracy of our predictions and it is found to be 89.6%
• Obtain feature importance: quantification of the usefulness of each feature variable across the entire random forest

THANK YOU
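As a supplementary sketch, the modelling steps of the Random Forest Regression slide (fit, predict, compare with the baseline, accuracy, feature importance) might look as follows with scikit-learn. The data is synthetic and the numbers it produces are unrelated to the 89.6% reported above. Several choices are assumptions rather than details from the slides: the mean-of-training baseline, the 75/25 split, `n_estimators=200`, and the definition of accuracy as 100 minus the mean absolute percentage error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data using three of the selected feature names.
rng = np.random.default_rng(0)
n = 500
feature_names = ["OverallQual", "GrLivArea", "GarageCars"]
X = np.column_stack([
    rng.integers(1, 11, n),      # OverallQual, rated 1..10
    rng.uniform(600, 3000, n),   # GrLivArea
    rng.integers(0, 4, n),       # GarageCars
])
y = (50000 + 20000 * X[:, 0] + 80 * X[:, 1] + 10000 * X[:, 2]
     + rng.normal(0, 15000, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Baseline (assumed): always predict the mean of the training labels.
baseline_mae = float(np.mean(np.abs(y_train.mean() - y_test)))

# Instantiate the model and fit it on the training data.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test data.
pred = model.predict(X_test)

# Mean absolute error, to be compared with the baseline error.
mae = float(np.mean(np.abs(pred - y_test)))

# Accuracy taken here as 100 - MAPE (an assumed definition).
mape = 100 * float(np.mean(np.abs(pred - y_test) / y_test))
accuracy = 100 - mape

# Feature importances, sorted from most to least useful.
importances = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print(f"Baseline MAE: {baseline_mae:.0f}, model MAE: {mae:.0f}")
print(f"Accuracy (100 - MAPE): {accuracy:.1f}%")
for name, score in importances:
    print(f"{name}: {score:.3f}")
```

The model would be accepted under the rule stated in the steps only if `mae` comes out below `baseline_mae`.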