Prepare and Fit Spatial Regression Models 20190222




Pay Notebook Creator: Roy Hyunjin Han0
Set Container: Numerical CPU with TINY Memory for 10 Minutes 0
Total0

Compare Spatial Regression Models on Airbnb Listings

The following example is adapted from http://darribas.org/gds_scipy16/ipynb_md/08_spatial_regression.html

For other Airbnb Listing URLs, please see http://insideairbnb.com/get-the-data.html

{ airbnb_listing_url : Airbnb Listing URL ? Specify a URL containing Airbnb Listings }

{ model1_feature_select : Features for Model 1 ? Select features to include in model }

{ model2_feature_select : Features for Model 2 ? Select features to include in model }

In [1]:
# Press the blue paper plane to preview this as a CrossCompute Tool
target_folder = '/tmp'
airbnb_listing_url = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-11-07/data/listings.csv.gz'
model1_feature_select = """
    host_listings_count
    bathrooms
    bedrooms
    beds
    guests_included

    host_acceptance_rate          
    host_listings_count           
    host_total_listings_count     
    accommodates                  
    bathrooms                     
    bedrooms                      
    beds                          
    square_feet                   
    guests_included               
    minimum_nights                
    maximum_nights                
    availability_30               
    availability_60               
    availability_90               
    availability_365              
    number_of_reviews             
    review_scores_rating          
    review_scores_accuracy        
    review_scores_cleanliness     
    review_scores_checkin         
    review_scores_communication   
    review_scores_location        
    review_scores_value           
    calculated_host_listings_count
    reviews_per_month             
"""
model2_feature_select = """
    square_feet
    number_of_reviews
    review_scores_rating

    host_acceptance_rate          
    host_listings_count           
    host_total_listings_count     
    accommodates                  
    bathrooms                     
    bedrooms                      
    beds                          
    square_feet                   
    guests_included               
    minimum_nights                
    maximum_nights                
    availability_30               
    availability_60               
    availability_90               
    availability_365              
    number_of_reviews             
    review_scores_rating          
    review_scores_accuracy        
    review_scores_cleanliness     
    review_scores_checkin         
    review_scores_communication   
    review_scores_location        
    review_scores_value           
    calculated_host_listings_count
    reviews_per_month             
"""
In [2]:
# Get selected features

def get_selected_lines(select_text):
    lines = []
    for line in select_text.strip().splitlines():
        line = line.strip()
        if not line:
            break
        lines.append(line)
    return lines

model1_features = get_selected_lines(model1_feature_select)
model2_features = get_selected_lines(model2_feature_select)
In [3]:
# Enable inline plots
# %matplotlib inline
In [4]:
# Install packages
# import subprocess
# subprocess.call('pip install -U pysal'.split())
In [5]:
import pandas as pd
import numpy as np
In [6]:
# Download listings

def download(target_path, source_url):
    from urllib.request import urlretrieve
    urlretrieve(source_url, target_path)    
    return target_path

"""
source_url = (
    'http://data.insideairbnb.com/united-states/ny/new-york-city/'
    '2018-12-06/data/listings.csv.gz')
"""
source_archive_path = '/tmp/listings.csv.gz'
source_path = '/tmp/listings.csv'
download(source_archive_path, airbnb_listing_url)
Out[6]:
'/tmp/listings.csv.gz'
In [7]:
# Unpack gzip archive
import gzip
with gzip.open(source_archive_path, 'rb') as f:
    open(source_path, 'wb').write(f.read())
In [8]:
import pandas as pd
t = pd.read_csv(source_path)
t.iloc[0]
Out[8]:
id                                                                              10080
listing_url                                        https://www.airbnb.com/rooms/10080
scrape_id                                                              20181107122143
last_scraped                                                               2018-11-07
name                                                   D1 -  Million Dollar View 2 BR
summary                             Stunning two bedroom, two bathroom apartment. ...
space                               Bed setup: 2 x queen, I can add up to 2 twin s...
description                         Stunning two bedroom, two bathroom apartment. ...
experiences_offered                                                              none
neighborhood_overview                                                             NaN
notes                               1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU...
transit                                                                           NaN
access                                 There is no access to the building ammenities.
interaction                                                                       NaN
house_rules                         1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O...
thumbnail_url                                                                     NaN
medium_url                                                                        NaN
picture_url                         https://a0.muscache.com/im/pictures/55778229/c...
xl_picture_url                                                                    NaN
host_id                                                                         30899
host_url                                      https://www.airbnb.com/users/show/30899
host_name                                                                        Rami
host_since                                                                 2009-08-10
host_location                                     Vancouver, British Columbia, Canada
host_about                                               I will be happy to host you.
host_response_time                                                     within an hour
host_response_rate                                                               100%
host_acceptance_rate                                                              NaN
host_is_superhost                                                                   f
host_thumbnail_url                  https://a0.muscache.com/im/users/30899/profile...
                                                          ...                        
extra_people                                                                    $0.00
minimum_nights                                                                     60
maximum_nights                                                                   1124
calendar_updated                                                          2 weeks ago
has_availability                                                                    t
availability_30                                                                     1
availability_60                                                                     1
availability_90                                                                     1
availability_365                                                                  252
calendar_last_scraped                                                      2018-11-07
number_of_reviews                                                                  16
first_review                                                               2011-11-15
last_review                                                                2017-02-26
review_scores_rating                                                               93
review_scores_accuracy                                                              9
review_scores_cleanliness                                                           9
review_scores_checkin                                                              10
review_scores_communication                                                         9
review_scores_location                                                             10
review_scores_value                                                                 9
requires_license                                                                    t
license                                                                     18-476608
jurisdiction_names                  {"British Columbia"," Canada"," Vancouver"," B...
instant_bookable                                                                    f
is_business_travel_ready                                                            f
cancellation_policy                                       strict_14_with_grace_period
require_guest_profile_picture                                                       f
require_guest_phone_verification                                                    f
calculated_host_listings_count                                                     27
reviews_per_month                                                                0.19
Name: 0, Length: 96, dtype: object
In [9]:
t.dtypes
Out[9]:
id                                    int64
listing_url                          object
scrape_id                             int64
last_scraped                         object
name                                 object
summary                              object
space                                object
description                          object
experiences_offered                  object
neighborhood_overview                object
notes                                object
transit                              object
access                               object
interaction                          object
house_rules                          object
thumbnail_url                       float64
medium_url                          float64
picture_url                          object
xl_picture_url                      float64
host_id                               int64
host_url                             object
host_name                            object
host_since                           object
host_location                        object
host_about                           object
host_response_time                   object
host_response_rate                   object
host_acceptance_rate                float64
host_is_superhost                    object
host_thumbnail_url                   object
                                     ...   
extra_people                         object
minimum_nights                        int64
maximum_nights                        int64
calendar_updated                     object
has_availability                     object
availability_30                       int64
availability_60                       int64
availability_90                       int64
availability_365                      int64
calendar_last_scraped                object
number_of_reviews                     int64
first_review                         object
last_review                          object
review_scores_rating                float64
review_scores_accuracy              float64
review_scores_cleanliness           float64
review_scores_checkin               float64
review_scores_communication         float64
review_scores_location              float64
review_scores_value                 float64
requires_license                     object
license                              object
jurisdiction_names                   object
instant_bookable                     object
is_business_travel_ready             object
cancellation_policy                  object
require_guest_profile_picture        object
require_guest_phone_verification     object
calculated_host_listings_count        int64
reviews_per_month                   float64
Length: 96, dtype: object
In [10]:
# Select columns that have all numerical values
numerics = 'int16', 'int32', 'int64', 'float16', 'float32', 'float64'
selected_t = t.select_dtypes(include=numerics)
selected_t.dtypes
Out[10]:
id                                  int64
scrape_id                           int64
thumbnail_url                     float64
medium_url                        float64
xl_picture_url                    float64
host_id                             int64
host_acceptance_rate              float64
host_listings_count               float64
host_total_listings_count         float64
neighbourhood_group_cleansed      float64
latitude                          float64
longitude                         float64
accommodates                        int64
bathrooms                         float64
bedrooms                          float64
beds                              float64
square_feet                       float64
guests_included                     int64
minimum_nights                      int64
maximum_nights                      int64
availability_30                     int64
availability_60                     int64
availability_90                     int64
availability_365                    int64
number_of_reviews                   int64
review_scores_rating              float64
review_scores_accuracy            float64
review_scores_cleanliness         float64
review_scores_checkin             float64
review_scores_communication       float64
review_scores_location            float64
review_scores_value               float64
calculated_host_listings_count      int64
reviews_per_month                 float64
dtype: object
In [11]:
sorted(t.columns)
Out[11]:
['access',
 'accommodates',
 'amenities',
 'availability_30',
 'availability_365',
 'availability_60',
 'availability_90',
 'bathrooms',
 'bed_type',
 'bedrooms',
 'beds',
 'calculated_host_listings_count',
 'calendar_last_scraped',
 'calendar_updated',
 'cancellation_policy',
 'city',
 'cleaning_fee',
 'country',
 'country_code',
 'description',
 'experiences_offered',
 'extra_people',
 'first_review',
 'guests_included',
 'has_availability',
 'host_about',
 'host_acceptance_rate',
 'host_has_profile_pic',
 'host_id',
 'host_identity_verified',
 'host_is_superhost',
 'host_listings_count',
 'host_location',
 'host_name',
 'host_neighbourhood',
 'host_picture_url',
 'host_response_rate',
 'host_response_time',
 'host_since',
 'host_thumbnail_url',
 'host_total_listings_count',
 'host_url',
 'host_verifications',
 'house_rules',
 'id',
 'instant_bookable',
 'interaction',
 'is_business_travel_ready',
 'is_location_exact',
 'jurisdiction_names',
 'last_review',
 'last_scraped',
 'latitude',
 'license',
 'listing_url',
 'longitude',
 'market',
 'maximum_nights',
 'medium_url',
 'minimum_nights',
 'monthly_price',
 'name',
 'neighborhood_overview',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'notes',
 'number_of_reviews',
 'picture_url',
 'price',
 'property_type',
 'require_guest_phone_verification',
 'require_guest_profile_picture',
 'requires_license',
 'review_scores_accuracy',
 'review_scores_checkin',
 'review_scores_cleanliness',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_rating',
 'review_scores_value',
 'reviews_per_month',
 'room_type',
 'scrape_id',
 'security_deposit',
 'smart_location',
 'space',
 'square_feet',
 'state',
 'street',
 'summary',
 'thumbnail_url',
 'transit',
 'weekly_price',
 'xl_picture_url',
 'zipcode']
In [12]:
# Limit table to selected columns
selected_t = t[model1_features + ['price', 'longitude', 'latitude']].dropna()
In [13]:
selected_t.head()
Out[13]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
host_listings_count bathrooms bedrooms beds guests_included price longitude latitude
0 27.0 2.0 2.0 4.0 1 $295.00 -123.121103 49.287716
1 1.0 1.0 1.0 1.0 1 $60.00 -123.112659 49.253756
2 1.0 1.0 0.0 2.0 2 $110.00 -123.105158 49.245770
3 1.0 1.0 1.0 1.0 1 $119.00 -123.125150 49.282090
4 1.0 1.0 1.0 2.0 2 $140.00 -123.081077 49.249739
In [14]:
# Prepare target value that we want to predict
import numpy as np
y = np.log(selected_t['price'].apply(
    lambda x: float(x.strip('$').replace(',', ''))) + 0.000001)
y[:5]
Out[14]:
0    5.686975
1    4.094345
2    4.700480
3    4.779124
4    4.941642
Name: price, dtype: float64
In [15]:
xys = selected_t[['longitude', 'latitude']].values
xys[:5]
Out[15]:
array([[-123.12110275,   49.28771582],
       [-123.11265899,   49.25375607],
       [-123.10515816,   49.24577007],
       [-123.12514983,   49.28208989],
       [-123.08107678,   49.2497391 ]])
In [16]:
from pysal.lib.cg import KDTree, RADIUS_EARTH_KM
kd_tree = KDTree(xys, distance_metric='Arc', radius=RADIUS_EARTH_KM)
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/lib/weights/util.py:19: UserWarning: geopandas not available. Some functionality will be disabled.
  warn('geopandas not available. Some functionality will be disabled.')
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/model/spvcm/abstracts.py:10: UserWarning: The `dill` module is required to use the sqlite backend fully.
  from .sqlite import head_to_sql, start_sql
In [17]:
# Prepare spatial weights
from pysal.lib.weights import KNN
w = KNN(kd_tree, k=2)
w.set_transform('R')
w
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/lib/weights/weights.py:170: UserWarning: The weights matrix is not fully connected. There are 296 components
  warnings.warn("The weights matrix is not fully connected. There are %d components" % self.n_components)
Out[17]:
<pysal.lib.weights.distance.KNN at 0x7fb3cbc2a898>
In [18]:
# Fit model using ordinary least squares
from pysal.model.spreg import OLS

model1 = OLS(
    y.values[:, None],
    selected_t.drop('price', axis=1).values,
    w=w,
    spat_diag=True,
    name_x=selected_t.drop('price', axis=1).columns.tolist(),
    name_y='ln(price)')
In [19]:
print(model1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   ln(price)                Number of Observations:        4662
Mean dependent var  :      4.8039                Number of Variables   :           8
S.D. dependent var  :      0.7371                Degrees of Freedom    :        4654
R-squared           :      0.3801
Adjusted R-squared  :      0.3792
Sum squared residual:    1569.596                F-statistic           :    407.7206
Sigma-square        :       0.337                Prob(F-statistic)     :           0
S.E. of regression  :       0.581                Log likelihood        :   -4077.503
Sigma-square ML     :       0.337                Akaike info criterion :    8171.006
S.E of regression ML:      0.5802                Schwarz criterion     :    8222.584

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT    -823.1150548      31.9222602     -25.7849867       0.0000000
 host_listings_count      -0.0016191       0.0004578      -3.5369718       0.0004087
           bathrooms       0.0109342       0.0075994       1.4388301       0.1502659
            bedrooms       0.2846769       0.0154308      18.4486677       0.0000000
                beds       0.0986688       0.0119785       8.2371684       0.0000000
     guests_included       0.0286971       0.0056219       5.1045273       0.0000003
           longitude      -3.0630239       0.2167956     -14.1286257       0.0000000
            latitude       9.1385140       0.4189173      21.8145997       0.0000000
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER        11466.057

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2    10588480.618           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                7         201.863           0.0000
Koenker-Bassett test              7           1.717           0.9738

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Lagrange Multiplier (lag)         1         114.311           0.0000
Robust LM (lag)                   1           2.329           0.1270
Lagrange Multiplier (error)       1         128.770           0.0000
Robust LM (error)                 1          16.788           0.0000
Lagrange Multiplier (SARMA)       2         131.099           0.0000

================================ END OF REPORT =====================================
In [20]:
# Here is an example model that tries to predict listing price
# based on whether NEARBY listings have high prices
from pysal.model.spreg import GM_Lag

model2 = GM_Lag(
    y.values[:, None],
    selected_t.drop('price', axis=1).values,
    w=w,
    spat_diag=True,
    name_x=selected_t.drop('price', axis=1).columns.tolist(),
    name_y='ln(price)')
print(model2.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   ln(price)                Number of Observations:        4662
Mean dependent var  :      4.8039                Number of Variables   :           9
S.D. dependent var  :      0.7371                Degrees of Freedom    :        4653
Pseudo R-squared    :      0.3881
Spatial Pseudo R-squared:  0.3804

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT    -789.8445200      38.1073892     -20.7268075       0.0000000
 host_listings_count      -0.0016161       0.0004544      -3.5563764       0.0003760
           bathrooms       0.0115378       0.0075539       1.5274061       0.1266600
            bedrooms       0.2833788       0.0153407      18.4722910       0.0000000
                beds       0.0981890       0.0118953       8.2544556       0.0000000
     guests_included       0.0283975       0.0055843       5.0852797       0.0000004
           longitude      -2.9228345       0.2329623     -12.5463855       0.0000000
            latitude       8.8092660       0.4656328      18.9189111       0.0000000
         W_ln(price)       0.0437882       0.0278545       1.5720366       0.1159421
------------------------------------------------------------------------------------
Instrumented: W_ln(price)
Instruments: W_bathrooms, W_bedrooms, W_beds, W_guests_included,
             W_host_listings_count, W_latitude, W_longitude

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Anselin-Kelejian Test             1          19.134          0.0000
================================ END OF REPORT =====================================
In [21]:
model2.betas
Out[21]:
array([[-7.89844520e+02],
       [-1.61610985e-03],
       [ 1.15378342e-02],
       [ 2.83378758e-01],
       [ 9.81889518e-02],
       [ 2.83975234e-02],
       [-2.92283446e+00],
       [ 8.80926599e+00],
       [ 4.37882442e-02]])
In [23]:
from sklearn.metrics import mean_squared_error as mse
from pysal.lib.cg import KDTree, RADIUS_EARTH_KM
from pysal.lib.weights import KNN
from pysal.model.spreg import GM_Lag, OLS

result_lines = []
for features in [model1_features, model2_features]:
    # Limit table to selected columns
    selected_t = t[features + ['price', 'longitude', 'latitude']].dropna()
    # Prepare target value we want to predict
    y = np.log(selected_t['price'].apply(
        lambda x: float(x.strip('$').replace(',', ''))) + 0.000001)
    # Prepare spatial weights
    xys = selected_t[['longitude', 'latitude']].values
    kd_tree = KDTree(xys, distance_metric='Arc', radius=RADIUS_EARTH_KM)
    w = KNN(kd_tree, k=2)
    w.set_transform('R')
    
    # Fit using ordinary least squares
    ols_model = OLS(
        y.values[:, None],
        selected_t.drop('price', axis=1).values,
        w=w,
        spat_diag=True,
        name_x=selected_t.drop('price', axis=1).columns.tolist(),
        name_y='ln(price)')
    mean_squared_error = mse(y, ols_model.predy.flatten())
    result_lines.append('OLS Model Mean Squared Error %s' % mean_squared_error)
    for feature, coefficient in zip(features, ols_model.betas):
        result_lines.append('%s %s' % (coefficient, feature))
    result_lines.append('')
    
    # Fit using spatial lag model
    lag_model = GM_Lag(
        y.values[:, None],
        selected_t.drop('price', axis=1).values,
        w=w,
        spat_diag=True,
        name_x=selected_t.drop('price', axis=1).columns.tolist(),
        name_y='ln(price)')
    mean_squared_error = mse(y, lag_model.predy_e)
    result_lines.append('LAG Model Mean Squared Error %s' % mean_squared_error)
    for feature, coefficient in zip(features, ols_model.betas):
        result_lines.append('%s %s' % (coefficient, feature))
    result_lines.append('')
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/lib/weights/weights.py:170: UserWarning: The weights matrix is not fully connected. There are 7 components
  warnings.warn("The weights matrix is not fully connected. There are %d components" % self.n_components)
In [24]:
from os.path import join
result_text = '\n'.join(result_lines)
result_text_path = join(target_folder, 'result.txt')
open(result_text_path, 'wt').write(result_text)
print('result_text_path = %s' % result_text_path)
result_text_path = /tmp/result.txt
In [25]:
print(result_text)
OLS Model Mean Squared Error 0.33667861041140706
[-823.11505478] host_listings_count
[-0.00161906] bathrooms
[0.01093423] bedrooms
[0.28467685] beds
[0.09866878] guests_included

LAG Model Mean Squared Error 0.33651252904985834
[-823.11505478] host_listings_count
[-0.00161906] bathrooms
[0.01093423] bedrooms
[0.28467685] beds
[0.09866878] guests_included

OLS Model Mean Squared Error 0.27812922572410775
[-880.80941305] square_feet
[2.58452926e-05] number_of_reviews
[-0.0007263] review_scores_rating

LAG Model Mean Squared Error 23.98914258358615
[-880.80941305] square_feet
[2.58452926e-05] number_of_reviews
[-0.0007263] review_scores_rating

Comparison of Models for Estimating Airbnb Listing Price

  • Lower mean squared error is better.
  • The magnitude of the coefficient suggests how much influence a feature might have on the price.

{ result_text : Result Summary ? Review mean squared error and coefficients }