• karlduane

Ames Housing Data model

Updated: Jan 19, 2021


The primary purpose of this exercise is to develop a regression model that accurately predicts home values for homes at sale in Ames, Iowa, based on a database of sales from 2007 to 2010.

For this model, I will be looking at linear regression, lasso and ridge models with the goal of getting the Root Mean Squared Error (RMSE) as close to 0 as possible for the model.

Let's get started. We need to import a few packages:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LassoCV, Ridge, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

and pull in the data:

df = pd.read_csv('datasets/train.csv', index_col = 'PID')
df.drop(df[df['Gr Liv Area'] > 4000].index, inplace = True)  

In this case, we're dropping known outliers (homes with more than 4,000 square feet of above-ground living area) that jse.amstat.org, the originators of the dataset, recommend eliminating. We'll also drop a few columns based on their correlation to the sale price and their level of collinearity.

In this chart, we're looking at the correlation between each feature and the final sale price, where the closer the correlation coefficient is to 0, the weaker the feature's linear relationship with the final price. Unsurprisingly, the two most highly correlated features are Overall Quality and Above Ground Living Area. We'll rule out anything with a correlation below 0.4, as it has only a weak relationship with the price. We'll also need to one-hot encode several of the categorical columns.

df = pd.get_dummies(df, columns = ["Neighborhood",
                                   "Roof Style",
                                   "Mas Vnr Type"],
                    drop_first = True)
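The 0.4 correlation cutoff described above can be sketched in code. This is a minimal illustration on synthetic data, not the exact procedure used for this post: the helper `weak_features`, the `demo` frame, and the choice to compare the absolute value of the correlation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (made-up values, illustration only).
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n).astype(float)
area = rng.normal(1500, 400, size=n)
unrelated = rng.normal(0, 1, size=n)
price = 20000 * quality + 120 * area + rng.normal(0, 20000, size=n)

demo = pd.DataFrame({
    "Overall Qual": quality,
    "Gr Liv Area": area,
    "Misc Val": unrelated,
    "SalePrice": price,
})

def weak_features(frame, target="SalePrice", threshold=0.4):
    """Hypothetical helper: numeric columns whose |correlation| with the
    target falls below the threshold."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

to_drop = weak_features(demo)
print(to_drop)
```

Only the column that is unrelated to the synthetic price is flagged. A helper like this covers the correlation cutoff only; the collinearity check mentioned earlier would need a separate look at correlations between the features themselves.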

# Columns to exclude: weakly correlated or collinear features, plus unencoded categoricals
explanatory_vars = df.drop(columns = [
    'BsmtFin SF 1', 'Screen Porch', 'Enclosed Porch', 'Kitchen AbvGr', 'Lot Frontage', 'Id', 'Overall Cond',
    'Mas Vnr Area', 'Paved Drive', 'Sale Type', 'MS Zoning', 'Functional', 'Heating QC', 'House Style',
    'MS SubClass', 'Bsmt Half Bath', 'Low Qual Fin SF', 'Yr Sold', 'Misc Val', 'BsmtFin SF 2', 'Pool Area',
    'Mo Sold', '3Ssn Porch', 'Bedroom AbvGr', 'Bsmt Unf SF', '2nd Flr SF', 'Half Bath', 'Bsmt Full Bath',
    'Lot Area', 'Wood Deck SF', 'Open Porch SF', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
    'Lot Config', 'Land Slope', 'Condition 1', 'Condition 2', 'Garage Cars', 'Bldg Type', 'Roof Matl',
    'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond',
    'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Electrical', 'Kitchen Qual',
    'Fireplace Qu', 'Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Pool QC', 'Fence',
    'Misc Feature', 'Central Air', 'Garage Yr Blt']).columns

That reduces the number of features we have significantly. We'll also need to fill the empty cells. In this case, we're left with a few columns, such as 'Garage Area', which are blank when there is no garage, so it is reasonable to fill the blanks with 0, for no garage square footage.

model = df[explanatory_vars].copy()
model.fillna(0, inplace = True)

In this case, our explanatory variables, X, are everything left, and the target, y, for our predictive model will be the natural log of the Sale Price.

X = model.drop(columns = 'SalePrice')
y = np.log(model['SalePrice'])

We'll also take a moment to split our data into training data and testing data so we can evaluate our model's performance on data it hasn't encountered before.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

At this point, we're finally ready to fit our model.

ols = LinearRegression()
ols.fit(X_train, y_train);

Evaluating the Model

ols.score(X_train, y_train), ols.score(X_test, y_test)

(0.8874287745574233, 0.8564766118679286)

cross_val_score(ols, X_train, y_train).mean()


Okay, we have a cross-validated score of 87.6%, meaning that the features we're examining can explain 87.6% of the variation in the (log) sale price. We can see that the score changes somewhat depending on whether the model is working with data it has seen before (88.7%) or data it has not yet seen (85.6%).
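The introduction mentions lasso and ridge models alongside linear regression. A minimal sketch of how the three could be compared with the same cross-validated score is shown here on synthetic data. The `candidates` dictionary, the alpha grid, and the use of `make_pipeline` with `StandardScaler` are assumptions for illustration, not the exact setup behind the numbers in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features and six pure-noise features.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 8))
y_demo = 0.4 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 0.1, size=300)

# Regularized models are scale sensitive, so they are scaled inside a pipeline.
# That way the scaler is re-fit on each training fold only.
candidates = {
    "ols": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))),
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=7)),
}

# Mean R^2 across 5 folds for each candidate model.
scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5).mean()
    for name, est in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On real data the three scores can differ more than they do here, and regularization tends to matter most when many correlated or weak features are kept.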

While this is promising, I'm more interested in the Root Mean Squared Error (RMSE), as it is a more interpretable metric. For this model the RMSE is $22,942, meaning that the model's predictions typically miss a home's sale price by about $22,942. While this may not be a big deal for homes on the higher end of the scale, such as the $500,000 to $600,000 range, it is a much larger relative error for a home on the lower end of the scale, say one priced between $50,000 and $100,000.
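The post does not show how the RMSE was computed. Because the target above is the log of the sale price, one plausible approach is to convert predictions back to dollars with `np.exp` before measuring the error. The sketch below does this on synthetic data; `rmse_in_dollars` and the `demo_` variables are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic log-price data (made-up values, illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
log_price = 12 + X_demo @ np.array([0.3, 0.1, -0.05]) + rng.normal(0, 0.1, size=400)

Xtr, Xte, ytr, yte = train_test_split(X_demo, log_price, random_state=42)
demo_ols = LinearRegression().fit(Xtr, ytr)

def rmse_in_dollars(model, X, y_log):
    """Hypothetical helper: RMSE after undoing the log transform."""
    pred_dollars = np.exp(model.predict(X))
    true_dollars = np.exp(y_log)
    return float(np.sqrt(mean_squared_error(true_dollars, pred_dollars)))

rmse = rmse_in_dollars(demo_ols, Xte, yte)
print(f"Test RMSE: ${rmse:,.0f}")
```

Computing the RMSE directly on the log values would give a number in log units rather than dollars, so the back-transformation is what makes the figure readable as a dollar amount.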


There are several strategies for maximizing the sale price of a home, and they differ somewhat depending on the type of stakeholder. The Linear Regression model selected accounts for 88% of the variability in a home's sale price and predicts the value with a typical error of about $23,000.

Let's take a closer look at the factors with the biggest effect on the sale price:

And at the features with a negative effect:
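The charts for these two rankings are not reproduced in this text. As a rough sketch of how such rankings could be produced, the coefficients of a fitted `LinearRegression` can be paired with the column names and sorted. The data and the feature names (`feat_a` through `feat_d`) below are synthetic placeholders, not results from the Ames model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features with known effects (placeholders, illustration only).
rng = np.random.default_rng(1)
features = ["feat_a", "feat_b", "feat_c", "feat_d"]
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=features)
true_coefs = np.array([0.30, 0.20, -0.10, 0.05])
y_demo = X_demo.to_numpy() @ true_coefs + rng.normal(0, 0.01, size=300)

demo_ols = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name, largest first.
coefs = pd.Series(demo_ols.coef_, index=X_demo.columns).sort_values(ascending=False)

top_positive = coefs.head(2)                      # largest positive effects
top_negative = coefs[coefs < 0].sort_values()     # most negative first

print(top_positive)
print(top_negative)
```

Coefficients are only directly comparable when the features share a scale, so on unscaled data a ranking like this reflects units as much as importance.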

For a homeowner looking to make changes to an existing home, the biggest impact on the sale price would come from increasing the overall living area: all else being constant, each additional square foot is expected to increase the sale price by $138. If adding on to the home is not possible (due to zoning or space restrictions), the homeowner may want to consider whether it is feasible to remove any masonry veneer attached to the home, as the impact of having no masonry veneer is greater than the impact of any other option.
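The post does not show how the per-square-foot dollar figure was derived. Since the code above fits the log of the sale price, a raw coefficient is on the log scale, and one common way to read it is as a percentage change that can then be applied to a reference price. The sketch below illustrates that conversion; the function names, the coefficient `0.0005`, and the `180000` base price are hypothetical values chosen for the example, not numbers from this model.

```python
import numpy as np

def pct_change_per_unit(log_coef):
    """Percent change in price for a one-unit increase in a feature,
    given its coefficient from a model fit on log(price)."""
    return float(np.expm1(log_coef) * 100)

def dollar_change_per_unit(log_coef, base_price):
    """Approximate dollar change at a chosen reference price."""
    return float(base_price * np.expm1(log_coef))

# Hypothetical inputs, for illustration only.
example_coef = 0.0005
example_base = 180000

print(f"{pct_change_per_unit(example_coef):.4f}% per unit")
print(f"${dollar_change_per_unit(example_coef, example_base):.2f} per unit at ${example_base:,}")
```

Under a log target the dollar effect of an extra square foot scales with the price of the home, so any single dollar figure holds only near the reference price used.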

Finally, let's take a look at the location. Different neighborhoods have an (un)surprisingly strong effect on the sale price of a home.

For an existing homeowner, this isn't as much in their control. When looking to buy, a potential homeowner may want to consider how much of the home price is due to the location.

Recommendations for Further Research

Due to time constraints, I was unable to engineer as many features as I would have liked. In particular, I would be interested in comparing the effects of lot size and square footage. One interpretation of the results of this model is that a real estate developer's best strategy would be to buy a number of lots in the desirable Green Hills neighborhood and build houses with as large a footprint as possible on each lot. I would be interested in plotting more of these interactions and examining their effect on the model.
