Uploaded by Nabarati Bhattacharya

Presentation1

Advanced Business
Analytics
Topics covered
• Data Cleansing
• Exploratory Data Analysis (EDA)
• Random Forest
Introduction to the dataset
Objective: To find the dependence of house prices on
different factors
Some features of the dataset:
• Number of columns: 25
• Number of rows: 1460
• Some of the categorical variables are
already in encoded form in the dataset
Response variable: SalePrice
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
PavedDrive, SaleType and
SaleCondition are the three categorical
variables
Data Cleansing: Columns having
missing values
• PavedDrive (3.22%): Categorical
Missing values for PavedDrive have
been replaced by the category that
occurs maximum times
• GrLivArea (2.81%): Numerical
Missing values for GrLivArea have been
replaced by their mean
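The imputation described above can be sketched with pandas. This is a minimal illustration on a made-up five-row frame, not the presenter's code: only the column names PavedDrive and GrLivArea come from the slides, and every value is invented for the example.

```python
import pandas as pd

# Toy data invented for illustration; only the column names come from the slides.
df = pd.DataFrame({
    "PavedDrive": ["Y", "Y", None, "N", "Y"],
    "GrLivArea": [1500.0, None, 1200.0, 1800.0, 1500.0],
})

# Share of missing values per column, in percent.
missing_pct = df.isna().mean() * 100

# Categorical column: fill gaps with the most frequent category (the mode).
df["PavedDrive"] = df["PavedDrive"].fillna(df["PavedDrive"].mode()[0])

# Numerical column: fill gaps with the column mean (mean() skips missing values).
df["GrLivArea"] = df["GrLivArea"].fillna(df["GrLivArea"].mean())
```

Computing the mode and mean before filling, as here, keeps the replacement value based only on the observed entries.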
EDA
After data cleansing the following steps are
performed as part of EDA:
1. For the three categorical variables, the
percentage of values belonging to each
category is obtained.
Observations:
• PavedDrive is Yes for the largest
percentage of values
• SaleType is WD (Warranty Deed) for the
largest percentage of values
• SaleCondition is Normal in most cases
2. For numerical variables, perform descriptive
statistics: finding mean, quartiles, minimum, maximum
values etc. for each of the 21 numerical
variables
HISTOGRAM
Observations from the histogram:
• SalePrice deviates from a normal distribution
• Highly right skewed (skewness is 1.8829)
• Peaked (kurtosis is 6.5363)
Thus the response variable has a
significant number of outliers
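The EDA steps above map onto a few pandas calls. The sketch below uses a small invented frame (the values are not from the dataset). Note that pandas' `Series.kurt()` returns excess kurtosis, for which a normal distribution scores 0; whether the figures on the slide were computed this way is an assumption.

```python
import pandas as pd

# Invented toy data; column names follow the slides, values do not.
df = pd.DataFrame({
    "SaleType": ["WD", "WD", "New", "WD", "COD", "WD", "New", "WD"],
    "SalePrice": [120000, 135000, 150000, 160000, 175000, 190000, 240000, 520000],
})

# Step 1: percentage of rows in each category of a categorical variable.
sale_type_pct = df["SaleType"].value_counts(normalize=True) * 100

# Step 2: descriptive statistics (count, mean, std, min, quartiles, max).
summary = df["SalePrice"].describe()

# Shape of the distribution: positive skewness means a long right tail,
# positive excess kurtosis means heavier tails than a normal distribution.
skewness = df["SalePrice"].skew()
excess_kurtosis = df["SalePrice"].kurt()
```

In this toy frame the single very large price is what pulls both statistics above zero, which is the same pattern the histogram observations describe.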
EDA for numerical variables: Correlation
matrix
Identifying the explanatory variables with
maximum correlation with SalePrice
(correlation coefficient > 0.5)
Observations: High correlation between some
explanatory variables
• TotalBsmtSF and 1stFlrSF
• GarageCars and GarageArea
Only one variable from each pair (TotalBsmtSF, GarageCars)
is kept in the final set
Final set of numerical explanatory variables:
• OverallQual
• YearBuilt
• YearRemodAdd
• TotalBsmtSF
• GrLivArea
• FullBath
• GarageCars
Note: The correlation matrix includes only the
numerical variables, not the non-numerical variables
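The selection rule above (keep predictors whose correlation with SalePrice exceeds 0.5, then look for predictors that are highly correlated with each other) can be sketched as follows. The data are synthetic, the 0.8 cutoff for "high" correlation between predictors is an assumption (the slides give no figure), and `LotNoise` is a hypothetical column added only to show a variable failing the 0.5 rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data for illustration only.
gr_liv_area = rng.normal(1500, 400, size=n)
garage_cars = rng.integers(0, 4, size=n).astype(float)
garage_area = 250 * garage_cars + rng.normal(0, 40, size=n)  # nearly redundant with GarageCars
lot_noise = rng.normal(0, 1, size=n)                         # hypothetical unrelated column
sale_price = 100 * gr_liv_area + 40000 * garage_cars + rng.normal(0, 15000, size=n)

df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "GarageCars": garage_cars,
    "GarageArea": garage_area,
    "LotNoise": lot_noise,
    "SalePrice": sale_price,
})

# Pearson correlation between all numerical columns.
corr = df.corr()

# Predictors whose correlation with SalePrice exceeds 0.5.
target_corr = corr["SalePrice"].drop("SalePrice")
selected = target_corr[target_corr > 0.5].index.tolist()

# Pairs of selected predictors that are strongly correlated with each other
# (0.8 is an assumed cutoff, not a value from the slides).
collinear_pairs = [
    (a, b)
    for i, a in enumerate(selected)
    for b in selected[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
```

Each pair flagged in `collinear_pairs` carries largely the same information, which is the usual reason for keeping only one member of the pair.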
EDA for numerical variables: Scatter plot
Observations:
Strong Linear relationship between SalePrice and GrLivArea
Linear/Exponential relationship between SalePrice and
TotalBsmtSF
EDA: Relationship with categorical features
Box Plot
Observations:
SalePrice tends to be higher when the
driveway is paved
SalePrice increases with the
quality rating, where 1 is
the lowest category and 10
the highest
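The comparison a box plot makes can also be read off numerically by grouping SalePrice by category. A minimal sketch on invented values (only the column names come from the slides):

```python
import pandas as pd

# Invented toy data for illustration.
df = pd.DataFrame({
    "PavedDrive": ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"],
    "SalePrice": [185000, 210000, 164000, 98000, 120000, 230000, 110000, 175000],
})

# Quartiles per category: the box edges and centre line of a box plot.
quartiles = (
    df.groupby("PavedDrive")["SalePrice"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Median SalePrice per category, for a quick side-by-side comparison.
median_by_group = df.groupby("PavedDrive")["SalePrice"].median()
```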
Random Forest
Random forest is a supervised
learning algorithm that is used
for both classification and
regression.
It operates by constructing a
multitude of decision trees at
training time and outputting
the class that is the mode of the
classes (classification) or mean
prediction (regression) of the
individual trees.
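The aggregation rule described above can be shown in a few lines. The per-tree outputs below are hypothetical values chosen for illustration; no trees are actually trained here.

```python
from statistics import mean, mode

# Hypothetical outputs of five individual trees for a single input.
tree_class_votes = ["high", "low", "high", "high", "low"]
tree_price_predictions = [182000.0, 175500.0, 190000.0, 179000.0, 186500.0]

# Classification: the forest returns the most common class among the trees.
forest_class = mode(tree_class_votes)

# Regression: the forest returns the average of the trees' predictions.
forest_price = mean(tree_price_predictions)
```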
The trees in a random forest are
built independently, so they can be
trained in parallel. There is no
interaction between the trees
while they are being built.
STEPS in implementing Random Forest
Regression
All the non-numerical variables are encoded to numerical variables and the correlation matrix is rechecked. The
correlation matrix remains the same in this case
The given data set is split into train and test data sets
The feature variables include only the explanatory variables, while the labels hold the target response variable
Establish a baseline error: Our model will be accepted only if it can improve upon the baseline.
Instantiate the model and fit it on the training data
Make predictions on the test data using the model
Observations:
The mean absolute error of our predictions is less than the baseline error. Thus, our model can be accepted as it
improves upon the baseline
Determine performance metrics of our model
• We calculate the accuracy of our predictions and it is found to be 89.6%
Obtain feature importance: Quantification of the usefulness of all the feature variables in the entire random forest
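The steps above, from the train/test split through feature importance, can be sketched with scikit-learn. This is an illustration under stated assumptions, not the presenter's code: the data are synthetic, the baseline is assumed to be "always predict the mean SalePrice of the training set", "accuracy" is assumed to mean 100% minus the mean absolute percentage error, and the hyperparameters are arbitrary. The slides do not specify any of these.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for three of the explanatory variables (values invented).
feature_names = ["OverallQual", "GrLivArea", "GarageCars"]
overall_qual = rng.integers(1, 11, size=n)
gr_liv_area = rng.normal(1500, 400, size=n)
garage_cars = rng.integers(0, 4, size=n)
features = np.column_stack([overall_qual, gr_liv_area, garage_cars])
labels = (60000 + 15000 * overall_qual + 60 * gr_liv_area
          + 8000 * garage_cars + rng.normal(0, 10000, size=n))

# Split the data into train and test sets.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=42)

# Baseline (assumed): always predict the mean SalePrice of the training set.
baseline_preds = np.full_like(test_y, train_y.mean())
baseline_mae = mean_absolute_error(test_y, baseline_preds)

# Instantiate the model and fit it on the training data.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train_X, train_y)

# Make predictions on the test data and measure the error.
predictions = model.predict(test_X)
model_mae = mean_absolute_error(test_y, predictions)

# "Accuracy" (assumed definition): 100% minus mean absolute percentage error.
mape = np.mean(np.abs(test_y - predictions) / test_y) * 100
accuracy = 100 - mape

# Feature importances: one non-negative value per feature, summing to 1.
importances = dict(zip(feature_names, model.feature_importances_))
```

The model is accepted in the sense used above when `model_mae` is smaller than `baseline_mae`. Sorting `importances` by value gives a ranking of how much each feature contributed across the forest.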
THANK YOU