Back

Predicting House Prices

Work in progress


Future additions:

  • Cross-validation
  • Hyperparameter search to find the best parameters for the model
  • Better data cleaning (experimenting with SimpleImputer(), one hot encoding for more columns)
  • More visualizations for categorical data
In [1]:
# Import all modules

import numpy as np # linear algebra
import pandas as pd # data processing

# ML model: RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# XGBRegressor
from xgboost import XGBRegressor

# Filling missing values
from sklearn.impute import SimpleImputer

# Visualization
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

print('Import completed')
Import completed
In [2]:
# Load csv files
train_data = pd.read_csv('../input/home-data-for-ml-course/train.csv', index_col='Id')
test_data = pd.read_csv('../input/home-data-for-ml-course/test.csv')

1. Exploratory data analysis

In [3]:
# First look at the data
print(train_data.shape)
train_data.head()
(1460, 80)
Out[3]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 80 columns

As you can see, the dataset contains 80 columns (with both numerical and categorical values) and 1460 rows (houses).
The first step is to split the dataset into numerical and categorical subsets.
We also set our target to 'SalePrice', since it is the attribute we want to predict.

In [4]:
# Set target which will be predicted later on
target = train_data['SalePrice']

# Splitting data in numerical and categorical subsets
num_attr = train_data.select_dtypes(exclude='object').drop('SalePrice', axis=1).copy()
cat_attr = train_data.select_dtypes(include='object').copy()

1.1 Analyzing numerical attributes

In [5]:
# Finding outliers by plotting each numerical attribute against SalePrice
plots = plt.figure(figsize=(12,20))

print('Loading 35 plots ...')
for i in range(len(num_attr.columns)-1):
    plots.add_subplot(9, 4, i+1)
    sns.regplot(x=num_attr.iloc[:, i], y=target)
    
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
Loading 35 plots ...

Outliers (don't follow regression line):

  • LotFrontage > 200
  • LotArea > 100,000
  • BsmtFinSF1 > 4000
  • TotalBsmtSF > 6000
  • 1stFlrSF > 4000
  • GrLivArea > 4000 and SalePrice < 300,000
  • LowQualFinSF > 550

1.2 Analyzing categorical attributes

In [6]:
cat_attr.columns
Out[6]:
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')
In [7]:
sns.countplot(x='SaleCondition', data=cat_attr)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f700f6e9290>

2. Data Cleaning

In [8]:
# Missing values for numerical attributes
num_attr.isna().sum().sort_values(ascending=False).head()
Out[8]:
LotFrontage    259
GarageYrBlt     81
MasVnrArea       8
YrSold           0
BsmtFinSF2       0
dtype: int64

259 missing values for LotFrontage --> we will use SimpleImputer() to fill them with the column mean (see the short sketch below).
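
As a minimal, standalone sketch of what that imputation does (for illustration only; the final pipeline applies it to all numerical columns at once):

# Sketch only: fill LotFrontage with the column mean using SimpleImputer
mean_imputer = SimpleImputer(strategy='mean')
lot_frontage_filled = mean_imputer.fit_transform(num_attr[['LotFrontage']])  # expects a 2D input
print(np.isnan(lot_frontage_filled).sum())  # 0 missing values left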

In [9]:
# Missing values for categorical attributes
cat_attr.isna().sum().sort_values(ascending=False).head(16)
Out[9]:
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
GarageCond        81
GarageQual        81
GarageFinish      81
GarageType        81
BsmtFinType2      38
BsmtExposure      38
BsmtFinType1      37
BsmtQual          37
BsmtCond          37
MasVnrType         8
Electrical         1
dtype: int64

There are a lot of missing values here.
For most of these columns a missing value usually means the feature is simply not present (no pool, no alley, no fence, etc.), so we will fill them with the placeholder 'None' rather than one hot encoding the NaNs, which would dramatically increase the number of columns in the dataset.

  • MasVnrType and MasVnrArea both have 8 missing values (filled with 'None' and 0 respectively)
  • Electrical --> one hot encoding (pd.get_dummies(); see the sketch below — the final pipeline handles it with SimpleImputer and OneHotEncoder instead)
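
A rough, standalone sketch of the pd.get_dummies() approach for Electrical (for illustration only; the preprocessing pipeline in section 2 uses SimpleImputer + OneHotEncoder):

# Sketch only: one hot encode 'Electrical' with pandas, keeping an indicator column for NaN
electrical_dummies = pd.get_dummies(cat_attr['Electrical'], prefix='Electrical', dummy_na=True)
print(electrical_dummies.columns.tolist())
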
In [10]:
# Copy the data to prevent changes to original data
data_copy = train_data.copy()

data_copy.MasVnrArea = data_copy.MasVnrArea.fillna(0)

# Columns which can be filled with 'None'
cat_cols_fill_none = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
                     'GarageCond', 'GarageQual', 'GarageFinish', 'GarageType',
                     'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'BsmtCond',
                     'MasVnrType']
for cat in cat_cols_fill_none:
    data_copy[cat] = data_copy[cat].fillna("None")
    
data_copy.isna().sum().sort_values(ascending=False).head()
Out[10]:
LotFrontage    259
GarageYrBlt     81
Electrical       1
SalePrice        0
Foundation       0
dtype: int64

Missing values left in the dataset: LotFrontage (259), GarageYrBlt (81), Electrical (1). These are handled by the SimpleImputer steps in the preprocessing pipeline below.

In [11]:
# Dropping outliers found when visualizing the numerical subset of our dataset
data_copy = data_copy.drop(data_copy['LotFrontage'][data_copy['LotFrontage']>200].index)
data_copy = data_copy.drop(data_copy['LotArea'][data_copy['LotArea']>100000].index)
data_copy = data_copy.drop(data_copy['BsmtFinSF1'][data_copy['BsmtFinSF1']>4000].index)
data_copy = data_copy.drop(data_copy['TotalBsmtSF'][data_copy['TotalBsmtSF']>6000].index)
data_copy = data_copy.drop(data_copy['1stFlrSF'][data_copy['1stFlrSF']>4000].index)
data_copy = data_copy.drop(data_copy.GrLivArea[(data_copy['GrLivArea']>4000) & (data_copy['SalePrice']<300000)].index)
data_copy = data_copy.drop(data_copy.LowQualFinSF[data_copy['LowQualFinSF']>550].index)

X = data_copy.drop('SalePrice', axis=1)

y = data_copy.SalePrice

# Splitting into training and validation data (cross validation will be added in the future)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

numerical_transformer = SimpleImputer(strategy='mean')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_attr.columns),
        ('cat', categorical_transformer, cat_attr.columns)
    ])

3. Building two models (RandomForestRegressor and XGBRegressor)

3.1 XGBRegressor

The parameters chosen for XGBRegressor seem to yield good results; the same applies to RandomForestRegressor.
(Future addition: tuning these parameters to fine-tune both models, see the sketch below.)
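
A rough sketch of how such tuning could look, using a hypothetical parameter grid on the XGBoost pipeline defined in the next cell (the 'model__' prefix targets the 'model' step of that pipeline):

# Sketch only: hypothetical grid search over a few XGBRegressor parameters
from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__n_estimators': [300, 500, 800],
    'model__learning_rate': [0.03, 0.05, 0.1],
}
search = GridSearchCV(xgb_pipeline, param_grid, scoring='neg_mean_absolute_error', cv=5)
search.fit(train_X, train_y)
print(search.best_params_)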

In [12]:
xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05)

xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', xgb_model)
                             ])
xgb_pipeline.fit(train_X, train_y)
xgb_val_predictions = xgb_pipeline.predict(val_X)

# Mean absolute error
xgb_val_mae = mean_absolute_error(xgb_val_predictions, val_y)
print(xgb_val_mae)
15450.050092544765

3.2 RandomForestRegressor

In [13]:
# Create RandomForestRegressor model, fitting it with train_data and create validation predictions to calculate MAE
rf_model = RandomForestRegressor(n_estimators=500, random_state=1)

rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', rf_model)
                             ])
rf_pipeline.fit(train_X, train_y)
rf_val_predictions = rf_pipeline.predict(val_X)

# Mean absolute error
rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)
print(rf_val_mae)
16282.215327823691

Comparing the MAE of both models, the XGBRegressor performs better on this validation split, so we will use it for our final predictions.
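
Since a single train/validation split can be noisy, the cross-validation planned under future additions would make this comparison more reliable. A minimal sketch, assuming the two pipelines and the cleaned data X, y from above:

# Sketch only: 5-fold cross-validated MAE for both pipelines
from sklearn.model_selection import cross_val_score

for name, pipeline in [('XGBRegressor', xgb_pipeline), ('RandomForestRegressor', rf_pipeline)]:
    scores = -cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
    print(name, scores.mean())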

In [14]:
# Preparing the test data: fill MasVnrArea with 0 and drop the Id column
# (the remaining missing values are handled by the imputers in the pipeline)

test_X = test_data.copy()
test_X.MasVnrArea = test_X.MasVnrArea.fillna(0)
test_X = test_X.drop('Id', axis=1)

4. Final predictions and submission

In [15]:
test_preds = xgb_pipeline.predict(test_X)
test_preds
Out[15]:
array([120019.76, 157997.67, 181656.42, ..., 167159.48, 131784.97,
       221632.  ], dtype=float32)

This array contains all predictions. To create the submission.csv file, we will run the next code cell:

In [16]:
output = pd.DataFrame({'Id': test_data.Id,
                      'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
print('Submitted')
Submitted
In [ ]: