Project Luther
Predicting Neighborhood Stability via Property Tax Assessment Data
Project Goals
Can we predict neighborhood stability?
Let's use time since the last sale of a house as a proxy.
Can we predict this from available property tax data?
First, can we accumulate enough property tax data?
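As a minimal sketch of the proxy, "time since last sale" can be computed from a sale date and a reference date. The function name and the 365.25-day year are illustrative assumptions, not details taken from the project.

```python
from datetime import date


def longevity_years(last_sale: date, as_of: date) -> float:
    """Years elapsed between the last recorded sale and a reference date."""
    if last_sale > as_of:
        raise ValueError("last sale is after the reference date")
    # 365.25 days per year is an approximation that averages in leap years.
    return (as_of - last_sale).days / 365.25


# Illustrative dates only; these are not records from the dataset.
example = longevity_years(date(2003, 6, 1), date(2016, 6, 1))
print(round(example, 1))
```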
Web Scraping
The Will County website has the data we need.
Extracting data is straightforward if you have the address or PIN.
But guessing these can be slow.
Methods and Tools
A bit of AJAX Hijacking
Scrapy spiders, run in parallel
MongoDB for storage
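The general shape of this pipeline might look like the sketch below: build a query for a backend endpoint, parse the JSON it returns, and store one record per PIN. The URL, the field names, and the in-memory dict standing in for a MongoDB collection are all hypothetical; the real site's endpoint and the project's Scrapy code are not shown in these slides.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://example.invalid/api/parcel"


def build_query_url(pin: str) -> str:
    """Build the request URL a page script might issue for one PIN."""
    return f"{BASE_URL}?{urlencode({'pin': pin})}"


def parse_record(payload: str) -> dict:
    """Turn a JSON response into a flat record keyed by PIN.

    The keys 'pin', 'saleAmount', and 'saleDate' are assumed names.
    """
    raw = json.loads(payload)
    return {
        "_id": raw["pin"],
        "sale_amount": float(raw["saleAmount"]),
        "sale_date": raw["saleDate"],
    }


def upsert(store: dict, record: dict) -> None:
    """Insert or overwrite by PIN, so re-scraped parcels do not duplicate."""
    store[record["_id"]] = record


store = {}  # stand-in for a database collection
sample = '{"pin": "12-34", "saleAmount": "250000", "saleDate": "2003-06-01"}'
upsert(store, parse_record(sample))
upsert(store, parse_record(sample))  # second write leaves a single record
```

Keying records by PIN keeps the store idempotent, which matters when parallel spiders may reach the same parcel more than once.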
Results
~60,000 records
~25% of all households
Is this enough?
Yes, as we'll see...
Linear Regression
Can we predict "Longevity" (time since last sale)?
Not very well.
The mean "Longevity" is about 13 years.
The better models had a typical error of about 5 years, roughly 38% of the mean.
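To put those two numbers side by side, here is a small sketch of a one-feature least-squares fit and a root-mean-square error, plus the ratio of error to mean. The slides do not say which error metric was used, so RMSE is an assumption, and the single-feature fit is a simplification of whatever models were actually tried.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Figures quoted on the slide: mean of about 13 years, error of about 5 years.
relative_error = 5 / 13
print(f"error is about {relative_error:.0%} of the mean")
```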
Learning Curves
The model converges quickly.
More data is likely not going to help.
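A learning curve of this kind can be sketched by training on growing subsets and scoring each fit on a fixed held-out set. The data below is synthetic (a weak signal buried in noise), and the subset sizes and one-feature model are assumptions chosen only to illustrate the shape of the computation.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def learning_curve(xs, ys, sizes, n_val):
    """Return (size, train_rmse, validation_rmse) for each training size.

    The last n_val points are held out; training uses the first `size` points.
    """
    x_val, y_val = xs[-n_val:], ys[-n_val:]
    x_train, y_train = xs[:-n_val], ys[:-n_val]
    rows = []
    for size in sizes:
        x_sub, y_sub = x_train[:size], y_train[:size]
        slope, intercept = fit_line(x_sub, y_sub)
        train_err = rmse(y_sub, [slope * x + intercept for x in x_sub])
        val_err = rmse(y_val, [slope * x + intercept for x in x_val])
        rows.append((size, train_err, val_err))
    return rows


# Synthetic data: weak linear signal, noise with standard deviation 5.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(600)]
ys = [0.5 * x + rng.gauss(0, 5) for x in xs]

curve = learning_curve(xs, ys, sizes=[20, 50, 100, 200, 400], n_val=200)
for size, train_err, val_err in curve:
    print(f"{size:4d}  train={train_err:.2f}  validation={val_err:.2f}")
```

When the two error columns settle near the same value at small training sizes, adding rows does little; the remaining error comes from noise or missing features rather than from too little data.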
With more data, we continue to see huge variance.
Poorly correlated features?
Correlation Matrix Snippet
Most of our features have poor correlation with our target.
Feature evaluation confirms Sale Amount is the strongest signal.
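One simple way to rank features against the target is the Pearson correlation coefficient, sorted by absolute value. The sketch below assumes that metric; the slides do not say how the feature evaluation was done, and the feature names and values here are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def rank_features(features, target):
    """Sort features by the absolute value of their correlation with the target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values with hypothetical feature names, not the project's data.
features = {
    "sale_amount": [1.0, 2.0, 3.0, 4.0],
    "lot_flag": [1.0, -1.0, 1.0, -1.0],
}
target = [2.0, 4.0, 6.0, 8.5]
ranked = rank_features(features, target)
```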
Linear Regression - Assumptions
Have we met the key assumptions for Linear Regression?
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
Collinearity
Several features fell into clusters of closely related measurements.
Within each cluster, all but one feature was removed.
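The pruning step can be sketched as a greedy pass: keep a feature only if it is not strongly correlated with any feature already kept. The 0.9 threshold, the keep-the-first-seen rule, and the feature names are assumptions for illustration; the slides do not state how the clusters were identified or which member was retained.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def drop_collinear(features, threshold=0.9):
    """Return the names of features kept after greedy collinearity pruning.

    Features are visited in insertion order; a feature is dropped when its
    absolute correlation with any already-kept feature reaches the threshold.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values with hypothetical names: 'rooms' nearly duplicates 'sqft'.
features = {
    "sqft": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rooms": [2.0, 4.0, 6.0, 8.0, 10.5],
    "age": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_collinear(features)
```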
Normally distributed features?
Several features were transformed via a log function.
Outlier removal followed.
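A rough sketch of these two steps follows. It uses log(1 + x) so that zero values are handled, and a z-score cutoff for outliers; both choices are assumptions, since the slides name neither the exact log form nor the outlier rule.

```python
from math import log1p
from statistics import mean, stdev


def log_transform(values):
    """Apply log(1 + x) to compress a long right tail; inputs must be >= 0."""
    return [log1p(v) for v in values]


def remove_outliers(values, z_max=3.0):
    """Drop values more than z_max sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= z_max * sigma]


# Toy data: twenty typical values and one extreme value.
raw = [10.0] * 20 + [1000.0]
cleaned = remove_outliers(raw)
logged = log_transform(cleaned)
```

Transforming before filtering, as the slide order suggests, means the cutoff is applied on the log scale in practice; the toy example above filters first only to keep the effect of each function easy to see.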
Normally distributed target?
The target was not very Gaussian either.
Transforming didn't improve the modeling.
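One way to quantify "not very Gaussian" is sample skewness, which is near zero for a symmetric distribution and positive for a long right tail. This is an illustrative measure chosen for the sketch, not necessarily the check used in the project, and the values are toy data.

```python
from math import log1p


def skewness(values):
    """Population skewness: third central moment over variance to the 3/2 power."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0


# Toy right-skewed values, then the same values after a log(1 + x) transform.
toy_target = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 30.0]
before = skewness(toy_target)
after = skewness([log1p(v) for v in toy_target])
```

A smaller skewness after transforming only says the target looks more symmetric; as the slide notes, that did not translate into better model performance here.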