Uploaded by Britt Smith


3/19/2021
Final Project
MSDS 430 Final Project
Complete the following and submit your notebook and HTML to Canvas. Your completed notebook should include all output, i.e., run each cell and save your file before submitting.
In this final project you will continue working with the building prices dataset. You have already run Python code to fix the bad data detected in Milestone 1.
In Milestone 2 you selected a single variable that you think is best for predicting sale price. Use your single variable from Milestone 2 to complete this assignment. You can change your mind, but tell me what you are doing and why.
The code provided below uses age_home for the input variable. This is just so the code will run. Replace age_home with the name of the variable you selected in Milestone 2.
Section 1 Import and Explore
Start by rebuilding the dataframes you used in Milestone 1 and 2.
Changing the single variable from age_home to taxes due to the higher correlation coefficient we identified in Milestone 2.
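That choice can be double-checked by ranking each column's correlation with price. A minimal sketch on a toy frame (the values below are made up for illustration, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the building-prices data (made-up values)
toy = pd.DataFrame({
    'price':    [25.9, 29.5, 27.9, 25.9, 29.9],
    'taxes':    [4.918, 5.021, 4.543, 4.557, 5.060],
    'age_home': [42, 62, 40, 54, 30],
})

# Absolute correlation of every other column with price, strongest first
corr_with_price = toy.corr()['price'].drop('price').abs().sort_values(ascending=False)
print(corr_with_price)
```

On the real data, whichever variable tops this ranking is the natural single-variable choice.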
In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn as sns
sns.set(style="ticks", palette="bright")
In [14]:
df = pd.read_csv('building prices 2018-1.csv')
df.columns = [s.lower() for s in df.columns]
df.head()
Out[14]:
   price  taxes  bathrooms  lot_size  living_space  num_garage  num_rooms  num_bedrms  age_home  num_fplaces
0   25.9  4.918        1.0     3.472         0.998         1.0          7           4        42            0
1   29.5  5.021        1.0     3.531         1.500         2.0          7           4        62            0
2   27.9  4.543        1.0     2.275         1.175         1.0          6           3        40            0
3   25.9  4.557        1.0     4.050         1.232         1.0          6           3        54            0
4   29.9  5.060        1.0     0.000         1.121         1.0          6           3         0            0
In [15]:
def scat(dataframe, var1, var2):
    plt.scatter(dataframe[var1], dataframe[var2], s=30)
    plt.title('Scatter')
    plt.xlabel(var1)
    plt.ylabel(var2)

scat(df, 'taxes', 'price')
Section 2 Base Model and Residual Mean Square
Run a base model and calculate the residual mean square for the model. Kaggle is set up to score your models using a different calculation called root mean square error (RMSE). Residual mean square is used here because the statsmodels least squares fit calculates it for you.
Using the statsmodels API for this linear model with taxes as the single prediction variable. taxes will be used as the single prediction variable in each of the sections; what changes between sections is improvement in the data due to the EDA. The result will be model residual mean square improvement due to the EDA rather than changes to the input variable.
In [16]:
base_model = smf.ols(formula='price ~ taxes', data=df).fit()
predictions = base_model.fittedvalues
predictions.head()
Out[16]:
0    33.003663
1    33.163371
2    32.422201
3    32.443909
4    33.223843
dtype: float64
Join the actual price with the model predicted price from the model so you can look at a plot later.
In [17]:
d = {'p_price': predictions}
d2 = pd.DataFrame(data=d)
base_join = pd.concat([df, d2], axis=1).reindex(df.index)
Note, statsmodels has a feature that calculates some very useful things for you regarding your model. Use dir(base_model) to see what these are.
In [18]:
dir(base_model)
Out[18]: ['HC0_se',
'HC1_se',
'HC2_se',
'HC3_se',
'_HCCM',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abat_diagonal',
'_cache',
'_data_attr',
'_data_in_cache',
'_get_robustcov_results',
'_is_nested',
'_use_t',
'_wexog_singular_values',
'aic',
'bic',
'bse',
'centered_tss',
'compare_f_test',
'compare_lm_test',
'compare_lr_test',
'condition_number',
'conf_int',
'conf_int_el',
'cov_HC0',
'cov_HC1',
'cov_HC2',
'cov_HC3',
'cov_kwds',
'cov_params',
'cov_type',
'df_model',
'df_resid',
'eigenvals',
'el_test',
'ess',
'f_pvalue',
'f_test',
'fittedvalues',
'fvalue',
'get_influence',
'get_prediction',
'get_robustcov_results',
'initialize',
'k_constant',
'llf',
'load',
'model',
'mse_model',
'mse_resid',
'mse_total',
'nobs',
'normalized_cov_params',
'outlier_test',
'params',
'predict',
'pvalues',
'remove_data',
'resid',
'resid_pearson',
'rsquared',
'rsquared_adj',
'save',
'scale',
'ssr',
'summary',
'summary2',
't_test',
't_test_pairwise',
'tvalues',
'uncentered_tss',
'use_t',
'wald_test',
'wald_test_terms',
'wresid']
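The dunder entries in that listing are just Python object plumbing; a one-line comprehension trims it down to the fitted statistics. Shown on a stand-in class (not the real results object) so the sketch runs on its own:

```python
# Stand-in for a fitted results object; attribute names echo statsmodels
class FakeResults:
    mse_resid = 23.16
    rsquared = 0.35
    _cache = {}

# Keep only the public attributes: drop anything starting with an underscore
public_attrs = [a for a in dir(FakeResults) if not a.startswith('_')]
print(public_attrs)
```

Running the same comprehension on base_model surfaces entries like mse_resid, rsquared, params, and summary without the noise.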
Find the residual mean square and r squared for this model so you can use it to compare with your other models.
In [19]:
base_mse = base_model.mse_resid
base_mse = np.around([base_mse], decimals = 2)
base_mse
Out[19]: array([23.16])
In [20]:
base_rsquared = np.around([base_model.rsquared], decimals = 2)
base_rsquared
Out[20]: array([0.35])
Look at a scatter plot of price vs p_price to see if this model looks reasonable (hope your model is better than my example).
In [21]:
scat(base_join, 'price', 'p_price')
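One optional check when eyeballing this plot: overlaying the y = x line makes it easier to judge how close predictions sit to actuals. A sketch with made-up values (the Agg backend is used so it renders off-screen):

```python
import matplotlib
matplotlib.use('Agg')   # off-screen rendering so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Toy actual vs predicted prices (made up); a good model hugs the y = x line
actual = np.array([25.9, 29.5, 27.9, 25.9, 29.9])
predicted = np.array([27.0, 29.0, 27.5, 27.1, 29.3])

fig, ax = plt.subplots()
ax.scatter(actual, predicted, s=30)
lims = [actual.min(), actual.max()]
ax.plot(lims, lims, 'k--', label='perfect prediction (y = x)')
ax.set_xlabel('price')
ax.set_ylabel('p_price')
ax.legend()
fig.savefig('scatter_check.png')
```

Points scattered far from the dashed line are the observations the model prices worst.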
This study source was downloaded by 100000824742498 from CourseHero.com on 02-22-2022 19:25:36 GMT -06:00
https://www.coursehero.com/file/84994658/Final-Projectpdf/
Section 3 Residual Mean Square (after removing duplicate records)
In [22]:
df1 = df.drop_duplicates(keep='first')
print('Number of Records:', len(df1))
Number of Records: 24
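You can confirm how many rows drop_duplicates removed by counting with duplicated() first. A sketch on a toy frame with one repeated row (values are made up):

```python
import pandas as pd

# Toy frame with row 2 an exact copy of row 0
toy = pd.DataFrame({'price': [25.9, 29.5, 25.9],
                    'taxes': [4.918, 5.021, 4.918]})

n_dups = int(toy.duplicated().sum())          # rows flagged as repeats of an earlier row
deduped = toy.drop_duplicates(keep='first')   # keep the first copy of each
print(n_dups, len(deduped))
```

On the real data the record count before and after should differ by exactly this duplicate count.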
Model the data with the nodups dataset df1:
In [23]:
nodups_model = smf.ols(formula='price ~ taxes', data=df1).fit()
nodups_preds = nodups_model.fittedvalues
nodups_preds.head()
Out[23]:
0    33.046073
1    33.197726
2    32.493937
3    32.514551
4    33.255148
dtype: float64
Calculate the NODUPS residual mean square and r squared:
In [24]:
d = {'p_price': nodups_preds}
d2 = pd.DataFrame(data=d)
nodups_join = pd.concat([df, d2], axis=1).reindex(df.index)
In [25]:
nodups_mse = nodups_model.mse_resid
nodups_mse = np.around([nodups_mse], decimals = 2)
nodups_mse
Out[25]: array([25.08])
In [26]:
nodups_rsquared = np.around([nodups_model.rsquared], decimals = 2)
nodups_rsquared
Out[26]: array([0.34])
In [27]:
scat(nodups_join, 'price', 'p_price')
Section 4 Residual Mean Square & R Squared (after EDA, fixing outliers, and missing data)
In [28]:
df2 = df1.copy()
# Convert the 9999 bad-data sentinels to 0 so one fix handles both cases
df2 = df2.replace({'living_space': {9999: 0}})
df2 = df2.replace({'age_home': {9999: 0}})
# Replace 0 (missing) with the median of the valid values in each column
m = np.median(df2.taxes[df2.taxes > 0])
df2 = df2.replace({'taxes': {0: m}})
n = np.median(df2.lot_size[df2.lot_size > 0])
df2 = df2.replace({'lot_size': {0: n}})
p = np.median(df2.living_space[df2.living_space > 0])
df2 = df2.replace({'living_space': {0: p}})
q = np.median(df2.age_home[df2.age_home > 0])
df2 = df2.replace({'age_home': {0: q}})
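The replace-with-median pattern above can be verified on a single toy column (values are made up):

```python
import numpy as np
import pandas as pd

# Toy column where 0.0 stands in for a missing value
toy = pd.DataFrame({'taxes': [4.918, 0.0, 5.021, 4.543]})

m = np.median(toy.taxes[toy.taxes > 0])   # median of the valid (nonzero) entries
toy = toy.replace({'taxes': {0: m}})      # backfill the sentinel with that median
print(toy.taxes.tolist())
```

Filtering to the positive values before taking the median matters; otherwise the 0 sentinels would drag the imputed value down.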
Use df2 with outlier and missing data fixes in place.
In [34]:
taxes_model = smf.ols(formula='price ~ taxes', data=df2).fit()
taxes_model.summary()
taxes_preds = taxes_model.fittedvalues
taxes_preds.head()
Out[34]:
0    29.044122
1    29.408913
2    27.716001
3    27.765584
4    29.547037
dtype: float64
Calculate taxes_mse for df2, the data with duplicates, outliers, and missing values fixed.
In [35]:
d = {'p_price': taxes_preds}
d2 = pd.DataFrame(data=d)
taxes_join = pd.concat([df, d2], axis=1).reindex(df.index)
In [36]:
taxes_mse = taxes_model.mse_resid
taxes_mse = np.around([taxes_mse], decimals = 2)
taxes_mse
Out[36]: array([8.62])
In [37]:
taxes_rsquared = np.around([taxes_model.rsquared], decimals = 2)
taxes_rsquared
Out[37]: array([0.77])
In [38]:
scat(taxes_join, 'price', 'p_price')
# You hope this will look like a line group.
# Your plot should look better than mine.
In your Python career you will see instances where others have offered a predictive model without
doing any EDA and no data fixes. You know how to fix this. Be kind in sharing your skill.
Summarize the Residual Mean Square and r squared for each of the models.
In [39]:
models_df = {'Base Model': base_mse,
             'nodups Model': nodups_mse,
             'taxes Model': taxes_mse}
models = pd.Series(models_df)
models
Out[39]: Base Model      [23.16]
         nodups Model    [25.08]
         taxes Model      [8.62]
         dtype: object
In [40]:
models_df = {'Base Model': base_rsquared,
             'nodups Model': nodups_rsquared,
             'taxes Model': taxes_rsquared}
models = pd.Series(models_df)
models
Out[40]: Base Model      [0.35]
         nodups Model    [0.34]
         taxes Model     [0.77]
         dtype: object
Bonus Points
Submit your model to Kaggle.com for scoring. Use the code below to create the csv file for Kaggle.
You may have to add the column variable names to the csv file to submit to Kaggle. Kaggle expects two columns labeled index and predict. Use Excel to change the column names.
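If you would rather skip the Excel step, the same two-column layout can be built in pandas before writing the csv. A sketch with made-up predictions standing in for taxes_preds:

```python
import pandas as pd

# Toy predictions Series standing in for taxes_preds (made-up values)
preds = pd.Series([29.04, 29.41, 27.72])

# Name the values 'predict' and the index 'index', then flatten to two columns
submission = preds.rename('predict').rename_axis('index').reset_index()
print(submission.columns.tolist())
# submission.to_csv('predictions.csv', index=False)
```

Writing with index=False keeps pandas from adding a third, unnamed index column that Kaggle would reject.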
In [ ]:
taxes_preds.to_csv('c:/data/predictions.csv')
Everything below is optional
The taxes model with removal of duplicate records and fixes for outliers and missing data shows major improvement for the simple single-variable model. There is still more you can do by including other variables.
One more look at a pairplot for price vs p_price with the single variable model:
In [ ]:
sns.set(style="ticks", palette="bright")
sns.pairplot(x_vars=["price"], y_vars=["p_price"], data=taxes_join, height=6)
In [ ]:
import seaborn as sns
sns.set(style="ticks", palette="bright")
sns.pairplot(x_vars=["taxes"], y_vars=["price"], data=df2, hue="bathrooms",
             palette=["g", "b"], height=6)
Bathrooms may be a good addition to the model. Maybe something else too. You decide.
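Adding a second variable is just another term in the statsmodels formula. A sketch on a toy frame (made-up values, not the real dataset), using a hypothetical price ~ taxes + bathrooms model:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy frame standing in for df2 (made-up values)
toy = pd.DataFrame({
    'price':     [25.9, 29.5, 27.9, 25.9, 29.9, 31.0],
    'taxes':     [4.918, 5.021, 4.543, 4.557, 5.060, 5.898],
    'bathrooms': [1.0, 1.0, 1.0, 1.0, 1.0, 1.5],
})

# '+' in the formula adds bathrooms as a second predictor alongside taxes
two_var = smf.ols(formula='price ~ taxes + bathrooms', data=toy).fit()
print(two_var.params.index.tolist())
```

Comparing this model's mse_resid and rsquared against the single-variable values shows whether the extra term earns its keep.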