Ames Housing Model
The primary purpose of this project is to develop a regression model that accurately predicts the sale price of homes in the Ames, Iowa housing market.
The project compares linear regression, lasso, and ridge models, with the goal of getting the Root Mean Squared Error (RMSE) as close to 0 as possible.
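As a rough illustration of how three such models can be compared on RMSE, here is a minimal sketch using scikit-learn. The data is synthetic, and the alpha values, the scaling step, and the 5-fold split are assumptions made for the example, not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and prices (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([4.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=150)

# Hypothetical candidate models; the alpha values are placeholders.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # Scale features first so the lasso and ridge penalties treat them evenly.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(
        pipeline, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    # scikit-learn reports negated RMSE, so flip the sign back.
    cv_rmse[name] = -scores.mean()

for name, value in cv_rmse.items():
    print(f"{name}: cross-validated RMSE = {value:.3f}")
```

The model with the lowest cross-validated RMSE would be the natural candidate to carry forward.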
Exploratory Data Analysis:
We start by dropping the known outliers that jse.amstat.org, the originators of the dataset, recommend eliminating. We'll also drop a few columns based on their correlation with sale price and their level of collinearity.
Next, we look at the correlation between each feature and the final sale price; the closer the correlation coefficient is to 0, the weaker the feature's linear relationship with the final price. Unsurprisingly, the two most highly correlated features are Overall Quality and Above Ground Living Area. We'll rule out anything with a correlation below 0.4, as it has only a weak relationship with price. We'll also need to one-hot encode several of the categorical columns.
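The correlation filter and one-hot encoding described above might look something like the following pandas sketch. The toy DataFrame, its column names, and the use of the absolute value of the correlation are assumptions for illustration; the real dataset has many more columns.

```python
import pandas as pd

# Toy stand-in for the Ames data; names and values are illustrative only.
df = pd.DataFrame({
    "overall_qual": [5, 6, 7, 8, 9, 4],
    "gr_liv_area": [1000, 1300, 1500, 1900, 2400, 900],
    "mo_sold": [6, 1, 6, 1, 6, 1],
    "neighborhood": ["A", "B", "A", "C", "C", "B"],
    "saleprice": [120000, 150000, 180000, 240000, 320000, 100000],
})

# Correlation of each numeric feature with the sale price.
corr = df.select_dtypes("number").corr()["saleprice"].drop("saleprice")

# Keep features whose absolute correlation meets the 0.4 threshold.
THRESHOLD = 0.4
kept = corr[corr.abs() >= THRESHOLD].index.tolist()

# One-hot encode the categorical column, dropping the first level
# to avoid perfectly collinear dummy columns.
dummies = pd.get_dummies(df["neighborhood"], prefix="neighborhood", drop_first=True)
X = pd.concat([df[kept], dummies], axis=1)

print(corr.round(2))
print(list(X.columns))
```

In this toy data, `mo_sold` falls below the threshold and is dropped, while the two strongly correlated features are kept alongside the neighborhood dummies.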
That reduces the number of features we have significantly. We'll also need to fill the empty cells. In this case, we're left with a few columns such as 'garage area' which are currently left blank if there is no garage, so it is reasonable to fill the blanks with 0, for no garage square footage.
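Filling those blanks can be sketched as follows. The column name `garage_area` and the sample values are assumptions for the example; the same pattern would apply to any column where a blank means the feature is absent.

```python
import numpy as np
import pandas as pd

# Toy data: two homes have no garage, so their garage area is blank (NaN).
df = pd.DataFrame({
    "garage_area": [480.0, np.nan, 576.0, np.nan],
    "saleprice": [150000, 90000, 210000, 85000],
})

# Count the blanks before filling, to confirm how many rows will change.
missing_before = int(df["garage_area"].isna().sum())

# A blank here is taken to mean "no garage", i.e. zero square feet.
df["garage_area"] = df["garage_area"].fillna(0)

print(missing_before)
print(df["garage_area"].tolist())
```

Filling with 0 is only appropriate where a blank genuinely means "none"; a column where blanks mean "unknown" would need a different treatment.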
We have a cross-validated score (R²) of 87.6%, meaning that the features we're examining can explain 87.6% of the variation in the sale price. This changes somewhat depending on whether the model is scored on data it was trained on (88.7%) or on data it has not yet seen (85.6%).
While this is promising, I'm more interested in the Root Mean Squared Error (RMSE), as it is a more interpretable metric. For this model the RMSE is $22,942, meaning that the model's predictions are typically off by about $22,942. While this may not be a big deal for homes on the higher end of the scale, such as the $500,000 to $600,000 range, it is likely to have a larger effect for a home on the lower end of the scale, say one priced between $50,000 and $100,000.
There are several different strategies that can be pursued around maximizing the sale price of a home, and the strategies will differ somewhat depending on the type of stakeholder. The linear regression model selected accounts for about 88% of the variability in sale price, with a typical prediction error of about $23,000.
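The three scores above (cross-validated, training, and held-out) can be obtained along these lines with scikit-learn. This is a sketch on synthetic data; the split, the fold count, and the random seeds are assumptions, so the numbers it prints will not match the figures reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic features and target standing in for the cleaned housing data.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, 2.0, 1.0]) + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()

# Mean R^2 across 5 folds of the training data.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()

# R^2 on data the model was fitted on versus data it has not seen.
model.fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"cross-validated R^2: {cv_r2:.3f}")
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A modest gap between the training and test scores, like the one reported above, suggests the model generalizes reasonably well rather than memorizing the training data.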
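To make the metric concrete, here is a small sketch of how RMSE is computed, followed by the reported RMSE expressed as a share of price at each end of the range. The sample prices are made up, and 75,000 and 550,000 are simply the midpoints of the two ranges mentioned above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and predicted prices, for illustration only.
actual = [100000, 200000, 300000]
predicted = [110000, 190000, 320000]
example_rmse = rmse(actual, predicted)

# The RMSE reported for this model, relative to homes at either end of the scale.
REPORTED_RMSE = 22942
share_of_price = {price: REPORTED_RMSE / price for price in (75000, 550000)}

print(round(example_rmse, 2))
for price, share in share_of_price.items():
    print(f"${price:,}: error is about {share:.1%} of the price")
```

The same dollar error is roughly 31% of a $75,000 home's price but only about 4% of a $550,000 home's price, which is why it matters more at the lower end.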
For a homeowner looking to make changes to an existing home, the biggest impact on the sale price would come from increasing the overall living area: all else being constant, each additional square foot is expected to increase the sale price by $138. Barring the ability to add on to the home (due to zoning or space restrictions), the homeowner may want to consider whether it is feasible to remove a masonry veneer attached to the home, as the impact of no masonry veneer is greater than the impact of any other option.
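The "all else being constant" reading of a coefficient can be illustrated with a small sketch. The data below is synthetic and built with a known $138-per-square-foot effect so that the fitted coefficient recovers it; the other coefficients, the feature names, and the 200-square-foot addition are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic homes: prices are constructed with a known $138-per-square-foot
# effect (the figure quoted above); all other numbers are made up.
rng = np.random.default_rng(1)
living_area = rng.uniform(800, 3000, size=100)
overall_qual = rng.integers(1, 11, size=100)
price = 20000 + 138 * living_area + 15000 * overall_qual

X = np.column_stack([living_area, overall_qual])
model = LinearRegression().fit(X, price)

# The coefficient on living area is the expected price change per extra
# square foot when every other feature is held fixed.
per_sq_ft = model.coef_[0]

# Same home, same quality, 200 more square feet.
base = np.array([[1500, 6]])
bigger = np.array([[1700, 6]])
gain = (model.predict(bigger) - model.predict(base))[0]

print(f"per square foot: ${per_sq_ft:.2f}")
print(f"predicted gain from 200 extra sq ft: ${gain:,.0f}")
```

Under these assumptions a 200-square-foot addition corresponds to a predicted increase of 200 × $138 = $27,600.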
We also see that neighborhoods have a strong impact on the sale price of a home.
For an existing homeowner, this isn't as much in their control. When looking to buy, a potential homeowner may want to consider how much of the home price is due to the location.