3/19/2021 Final Project

MSDS 430 Final Project

Complete the following and submit your notebook and its HTML file to Canvas. Your completed notebook should include all output, i.e. run each cell and save your file before submitting.

In this final project you will continue working with the building prices dataset. You have already run Python code to fix the bad data detected in Milestone 1. In Milestone 2 you selected a single variable that you think is best for predicting sale price. Use your single variable from Milestone 2 to complete this assignment. You can change your mind, but tell me what you are doing and why.

The code provided below uses age_home for the input variable. This is just so the code will run. Replace age_home with the name of the variable you selected in Milestone 2.

Section 1 Import and Explore

Start by rebuilding the dataframes you used in Milestone 1 and 2.

Changing the single variable from age_home to taxes because of the higher correlation coefficients identified in Milestone 2.

In [13]:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf
    import seaborn as sns
    sns.set(style="ticks", palette="bright")

In [14]:
    df = pd.read_csv('building prices 2018-1.csv')
    df.columns = [s.lower() for s in df.columns]  # normalize column names to lower case
    df.head()

Out[14]:
       price  taxes  bathrooms  lot_size  living_space  num_garage  num_rooms  num_bedrms  age_home  num_fplaces
    0   25.9  4.918        1.0     3.472         0.998         1.0          7           4        42            0
    1   29.5  5.021        1.0     3.531         1.500         2.0          7           4        62            0
    2   27.9  4.543        1.0     2.275         1.175         1.0          6           3        40            0
    3   25.9  4.557        1.0     4.050         1.232         1.0          6           3        54            0
    4   29.9  5.060        1.0     0.000         1.121         1.0          6           3         0            0

In [15]:
    def scat(dataframe, var1, var2):
        # Scatter plot with var1 on the x axis and var2 on the y axis
        plt.scatter(dataframe[var1], dataframe[var2], s=30)
        plt.title('Scatter')
        plt.xlabel(var1)
        plt.ylabel(var2)

    scat(df, 'taxes', 'price')

Section 2 Base Model and Residual Mean Square

Run a base model and calculate the residual mean square for the model.
Kaggle is set up to score your models using a different calculation called root mean square error (RMSE). The residual mean square is used here because statsmodels least squares calculates it for you.

This linear model uses the statsmodels API with taxes as the single prediction variable. The same variable is used in every section, so any improvement in the model's residual mean square comes from the data fixes made during EDA rather than from a change to the input variable.

In [16]:
    base_model = smf.ols(formula='price ~ taxes', data=df).fit()
    predictions = base_model.fittedvalues
    predictions.head()

Out[16]:
    0    33.003663
    1    33.163371
    2    32.422201
    3    32.443909
    4    33.223843
    dtype: float64

Join the actual price with the model's predicted price so you can look at a plot later.

In [17]:
    d = {'p_price': predictions}
    d2 = pd.DataFrame(data=d)
    # Add the predicted price as a new column next to the original data
    base_join = pd.concat([df, d2], axis=1).reindex(df.index)

Note: statsmodels calculates some very useful things about your model for you. Use dir(base_model) to see what these are.
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__',
 '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__',
 '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__',
 '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
 '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr',
 '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t',
 '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test',
 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el',
 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type',
 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test',
 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results',
 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total',
 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues',
 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale',
 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss',
 'use_t', 'wald_test', 'wald_test_terms', 'wresid']

Find the residual mean square and R-squared for this model so you can compare them with your other models.

In [19]:
    base_mse = base_model.mse_resid
    base_mse = np.around([base_mse], decimals=2)
    base_mse

Out[19]:
    array([23.16])

In [20]:
    base_rsquared = np.around([base_model.rsquared], decimals=2)
    base_rsquared

Out[20]:
    array([0.35])

Look at a scatter plot of price vs p_price to see if this model looks reasonable (hope your model is better than my example).
In [21]:
    scat(base_join, 'price', 'p_price')

Section 3 Residual Mean Square (after removing duplicate records)

In [34]:
    taxes_model = smf.ols(formula='price ~ taxes', data=df2).fit()
    taxes_model.summary()
    # Take the fitted values from taxes_model (the original cell referenced age_home_model)
    taxes_preds = taxes_model.fittedvalues
    taxes_preds.head()

Out[34]:
    0    29.044122
    1    29.408913
    2    27.716001
    3    27.765584
    4    29.547037
    dtype: float64

Calculate taxes_mse for df2, the data with duplicates removed and outliers and missing values fixed.

In [35]:
    d = {'p_price': taxes_preds}
    d2 = pd.DataFrame(data=d)
    # Join with df2, the cleaned data the model was fit on, so prices and predictions line up
    taxes_join = pd.concat([df2, d2], axis=1).reindex(df2.index)

In [36]:
    taxes_mse = taxes_model.mse_resid
    taxes_mse = np.around([taxes_mse], decimals=2)
    taxes_mse

Out[36]:
    array([8.62])

In [37]:
    taxes_rsquared = np.around([taxes_model.rsquared], decimals=2)
    taxes_rsquared

Out[37]:
    array([0.77])

In [38]:
    scat(taxes_join, 'price', 'p_price')
    # you hope this will look like a line group.
    # Your plot should look better than mine

In your Python career you will see instances where others have offered a predictive model without doing any EDA or data fixes. You know how to fix this. Be kind in sharing your skill.

Summarize the residual mean square and R-squared for each of the models.

In [39]:
    models_df = ({'Base Model': base_mse, 'nodups Model': nodups_mse, 'taxes Model': taxes_mse})
    models = pd.Series(models_df)
    models

Out[39]:
    Base Model      [23.16]
    nodups Model    [25.08]
    taxes Model      [8.62]
    dtype: object

In [40]:
    models_df = ({'Base Model': base_rsquared, 'nodups Model': nodups_rsquared, 'taxes Model': taxes_rsquared})
    models = pd.Series(models_df)
    models

Out[40]:
    Base Model      [0.35]
    nodups Model    [0.34]
    taxes Model     [0.77]
    dtype: object

Bonus Points

Submit your model to Kaggle.com for scoring. Use the code below to create the csv file for Kaggle. You may have to add the column names to the csv file before submitting it to Kaggle. Kaggle expects two columns labeled index and predict. Use Excel to change the column names.
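As an alternative to renaming the columns in Excel, the headers can be set in pandas before the file is written. The sketch below is only an illustration: it assumes the predictions are a pandas Series like taxes_preds, and the stand-in values and the in-memory buffer are hypothetical choices made so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for taxes_preds (values are hypothetical, for illustration only).
preds = pd.Series([29.04, 29.41, 27.72])

# Name the values "predict" and the index "index", then turn the index
# into a regular column so both labels appear in the header row.
submission = preds.rename("predict").rename_axis("index").reset_index()

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With index=False the row labels are not written a second time, so the file holds exactly the two columns Kaggle expects. The provided cell below writes the predictions as they are: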
In [ ]:
    # Write the predictions from the taxes model (the template used age_home_preds)
    taxes_preds.to_csv('c:/data/predictions.csv')

Everything below is optional

The taxes model, fit after removing duplicate records and fixing outliers and missing data, shows a major improvement over the base model for a simple single-variable model. There is still more you can do by including other variables.

One more look at a pairplot for price vs p_price with the single-variable model:

In [ ]:
    sns.set(style="ticks", palette="bright")
    sns.pairplot(x_vars=["price"], y_vars=["p_price"], data=taxes_join, height=6)

In [ ]:
    import seaborn as sns
    sns.set(style="ticks", palette="bright")
    sns.pairplot(x_vars=["age_home"], y_vars=["price"], data=df2, hue="bathrooms", palette=["g", "b"], height=6)

Bathrooms may be a good addition to the model. Maybe something else too. You decide.
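One way to check whether bathrooms helps is to fit the model with and without it and compare the same statistics used in Section 3. The sketch below is a rough illustration, not the assignment's solution: it runs on synthetic data (the demo dataframe, its coefficients, and the two bathroom levels are all made up), so the numbers it prints say nothing about the real building prices data. In the notebook you would pass df2 instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical); in the notebook this would be df2.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "taxes": rng.uniform(3.0, 10.0, n),
    "bathrooms": rng.choice([1.0, 1.5], n),
})
demo["price"] = (10 + 3.5 * demo["taxes"] + 8 * demo["bathrooms"]
                 + rng.normal(0, 2, n))

# Fit the single-variable model and the model with bathrooms added.
single = smf.ols(formula="price ~ taxes", data=demo).fit()
double = smf.ols(formula="price ~ taxes + bathrooms", data=demo).fit()

# Collect the comparison statistics side by side.
comparison = pd.DataFrame(
    {
        "mse_resid": [single.mse_resid, double.mse_resid],
        "rsquared": [single.rsquared, double.rsquared],
        "rsquared_adj": [single.rsquared_adj, double.rsquared_adj],
    },
    index=["taxes", "taxes + bathrooms"],
).round(2)
print(comparison)
```

R-squared can never go down when a variable is added to a least squares model fit on the same data, so rsquared_adj and mse_resid, which both account for the extra parameter, are the fairer numbers to compare.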