Project Luther

Predicting Neighborhood Stability via Property Tax Assessment Data

Project Goals

  • Can we predict neighborhood stability?
  • Let's use time since the last sale of a house as a proxy.
  • Can we predict this from available property tax data?
  • First, can we accumulate enough property tax data?

Web Scraping

  • The Will County website has the data we need.
  • Extracting data is straightforward if you have the address or PIN.
  • But guessing these can be slow.
chart

Web Scraping

Methods and Tools

  • A bit of AJAX Hijacking
  • Scrapy Spiders - parallel
  • MongoDB for storage

Results

  • ~60,000 records
  • ~25% of all households
  • Is this enough?
  • Yes, as we'll see...

Linear Regression

  • Can we predict "Longevity" (time since last sale)?
  • Not very well.
  • The mean "Longevity" is about 13 years.
  • The typical error of the better models was 5 years.
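To put those two numbers side by side, here is a minimal sketch. It assumes the "typical error" is something like a root-mean-square error (the slides do not say which metric was used), and the sample values passed to `rmse` are illustrative only, not project data.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Figures quoted on the slide.
mean_longevity_years = 13.0
typical_error_years = 5.0

# Error expressed as a fraction of the mean target value.
relative_error = typical_error_years / mean_longevity_years
print(f"Typical error is {relative_error:.0%} of the mean Longevity")

# Illustrative values only (assumption): three actual vs. predicted Longevity values.
print(rmse([10.0, 15.0, 20.0], [12.0, 11.0, 23.0]))
```

An error of 5 years against a mean of 13 years is more than a third of the typical value, which is why the fit is described as "not very well".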

Learning Curves

chart
  • The model converges quickly.
  • More data is unlikely to help.
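A learning curve of this kind can be sketched as follows. This is a stdlib-only illustration on synthetic data with a single feature, not the project's pipeline: the helper names (`fit_line`, `learning_curve`) and the noise level are assumptions chosen to mimic a weak signal.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))

def learning_curve(train, valid, sizes):
    """Return (size, train_rmse, valid_rmse) for each training-set size."""
    valid_x, valid_y = zip(*valid)
    rows = []
    for size in sizes:
        xs, ys = zip(*train[:size])
        intercept, slope = fit_line(xs, ys)
        rows.append((size,
                     rmse(xs, ys, intercept, slope),
                     rmse(valid_x, valid_y, intercept, slope)))
    return rows

# Synthetic data (assumption): a weak linear signal buried in noise.
rng = random.Random(0)
points = []
for _ in range(2000):
    x = rng.uniform(0, 10)
    points.append((x, 13 + 0.1 * x + rng.gauss(0, 5)))

train, valid = points[:1500], points[1500:]
curve = learning_curve(train, valid, [50, 200, 800, 1500])
for size, train_err, valid_err in curve:
    print(f"{size:5d}  train={train_err:.2f}  valid={valid_err:.2f}")
```

When the two error columns settle near the same value after a few hundred rows, the remaining error comes from noise or missing features rather than from too little data.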

Learning Curves

  • With more data, we continue to see huge variance.
chart

Poorly correlated features?

chart (Correlation Matrix Snippet)
  • Most of our features have poor correlation with our target.
  • Feature evaluation confirms Sale Amount is the strongest signal.
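One simple way to rank features against the target is by absolute Pearson correlation. The sketch below is stdlib-only; the column names (`sale_amount`, `lot_size`) and all values are hypothetical toy data, and the slides do not say which feature-evaluation method was actually used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Sort (name, r) pairs by |r| with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy values (assumption), in thousands of dollars and thousands of square feet.
columns = {
    "sale_amount": [250, 180, 300, 120, 90, 210],
    "lot_size": [7, 9, 6, 8, 7, 9],
}
longevity = [3, 10, 2, 18, 25, 8]  # years since last sale

ranked = rank_features(columns, longevity)
for name, r in ranked:
    print(f"{name:12s} r = {r:+.2f}")
```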

Linear Regression - Assumptions

Have we met the key assumptions for Linear Regression?

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

Homoscedasticity

chart
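Alongside a residual plot, a crude numeric check is to compare the residual variance for low fitted values against the variance for high fitted values. The sketch below is an assumption about how one might do this, run on synthetic residuals. It is not a formal test and not necessarily what was done in the project.

```python
import random

def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def residual_variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values over the lower half."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return variance(high) / variance(low)

# Synthetic residuals (assumption): one set with constant spread,
# one whose spread grows with the fitted value.
rng = random.Random(1)
fitted = [rng.uniform(1, 20) for _ in range(1000)]
constant = [rng.gauss(0, 2) for _ in fitted]
growing = [rng.gauss(0, 0.3 * f) for f in fitted]

ratio_constant = residual_variance_ratio(fitted, constant)
ratio_growing = residual_variance_ratio(fitted, growing)
print(f"constant spread: ratio = {ratio_constant:.2f}")
print(f"growing spread:  ratio = {ratio_growing:.2f}")
```

A ratio close to 1 is consistent with homoscedasticity; a ratio well above or below 1 suggests the error spread changes with the prediction.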

Collinearity

  • Several groups of features were closely related to one another.
  • Within each group, all but one feature was removed.
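The pruning step can be sketched as a greedy filter: keep a feature only if it is not strongly correlated with any feature already kept. The 0.9 threshold, the feature names, and the values below are all assumptions for illustration; the slides do not say how the groups were identified.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_collinear(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: market_value is nearly 3x assessed_value.
columns = {
    "assessed_value": [100, 150, 200, 250, 300],
    "market_value": [305, 448, 601, 752, 899],
    "year_built": [1990, 1965, 2005, 1978, 1999],
}

kept = drop_collinear(columns)
print(kept)
```

Which member of a group survives depends on column order here, so in practice one would put the preferred feature of each group first.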

Normally distributed features?

  • Several features were transformed via a log function.
  • Outlier removal followed.
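These two steps can be sketched as below. It assumes a `log(1 + x)` transform and an interquartile-range fence for outliers; the slides do not state which log variant or which outlier rule was used, and the sale amounts are made-up values.

```python
import math
import statistics

def log_transform(values):
    """Apply log(1 + x), which is defined for zero and positive values."""
    return [math.log1p(v) for v in values]

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]

# Made-up sale amounts in dollars, with one extreme value.
sale_amounts = [95_000, 120_000, 135_000, 150_000, 160_000,
                175_000, 210_000, 240_000, 4_500_000]

logged = log_transform(sale_amounts)
cleaned = remove_outliers(logged)
print(f"{len(sale_amounts)} values before, {len(cleaned)} after outlier removal")
```

Transforming first and filtering second matters: on the raw scale a long right tail inflates the IQR fence, while on the log scale the same rule isolates only the truly extreme values.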

Normally distributed features?

chart

Normally distributed target?

  • The target also wasn't very Gaussian.
  • Transforming didn't improve the modeling.
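One way to quantify "not very Gaussian" is sample skewness, measured before and after a transform. The sketch below assumes a natural-log transform and uses invented Longevity values. It only measures the shape of the distribution and says nothing about model accuracy, which the slide reports did not improve.

```python
import math

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented right-skewed Longevity values in years (assumption).
longevity_years = [1, 2, 2, 3, 4, 5, 7, 9, 12, 16, 22, 30, 45]
logged = [math.log(v) for v in longevity_years]

skew_raw = skewness(longevity_years)
skew_log = skewness(logged)
print(f"skewness raw = {skew_raw:.2f}, after log = {skew_log:.2f}")
```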

Normally distributed target?

chart

Normally distributed target?

chart