Prepare and Fit Spatial Regression Models 20190222





Techniques for Preparing Your Spatial Regression Dataset

  1. Identify your dependent variable (the variable you want to predict).
  2. Identify your independent variables (the features you want to use to predict the dependent variable).

In this notebook, we will demonstrate different techniques for preparing regression variables from spatial vectors (tables) and spatial rasters (images). We will be working with three datasets provided courtesy of NYC Open Data.

Here are the techniques we will cover:

  1. Prepare a Variable By Aggregating Over a Column
  2. Prepare Neighborhood Polygons
  3. Prepare a Variable By Aggregating Over an Image
  4. Prepare a Variable By Aggregating Nearby Coordinates

Each of these techniques results in a new column in your dataset that you can use as a variable (dependent or independent) in your spatial regression.

First, let's define some convenience functions.

In [ ]:
import geotable
import numpy as np
import pandas as pd
In [ ]:
import gzip
from invisibleroads_macros.disk import uncompress
from os.path import exists
from urllib.request import urlretrieve

def download(target_path, source_url):
    if not exists(target_path):
        urlretrieve(source_url, target_path)    
    return target_path

def download_zip(target_folder, source_url):
    archive_path = download(target_folder + '.zip', source_url)
    return uncompress(archive_path, target_folder)
            
def download_gz(target_path, source_url):
    archive_path = download(target_path + '.gz', source_url)
    with gzip.open(archive_path, 'rb') as f:
        with open(target_path, 'wb') as g:
            g.write(f.read())
    return target_path
In [ ]:
# url = 'https://data.cityofnewyork.us/download/q68s-8qxv/application%2Fzip'
# download_zip('/tmp/pollution', url)
In [ ]:
# url = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-11-07/data/listings.csv.gz'
# download_gz('/tmp/list.csv', url)

Prepare a Variable By Aggregating Over a Column

Here we will average "HIV DIAGNOSES PER 100,000 POPULATION" over each neighborhood.

In [ ]:
url = 'https://data.cityofnewyork.us/api/views/ykvb-493p/rows.csv'
In [ ]:
import pandas as pd
t = disease_table = pd.read_csv(url, na_values=['*'])
len(t)
In [ ]:
t.head()
In [ ]:
len(t['Neighborhood (U.H.F)'])
In [ ]:
sorted(t['Neighborhood (U.H.F)'].unique())
In [ ]:
t.dtypes
In [ ]:
selected_t = t.dropna(subset=['HIV DIAGNOSES PER 100,000 POPULATION'])
selected_t = selected_t[selected_t['Neighborhood (U.H.F)'] != 'All']
len(selected_t)
In [ ]:
selected_t.dtypes
In [ ]:
# Aggregate by neighborhood
aggregated_disease_table = selected_t.groupby('Neighborhood (U.H.F)').mean()
aggregated_disease_table.reset_index(inplace=True)
aggregated_disease_table = aggregated_disease_table[[
    'Neighborhood (U.H.F)',
    'HIV DIAGNOSES PER 100,000 POPULATION',
]]
aggregated_disease_table[:5]
In [ ]:
aggregated_disease_table = aggregated_disease_table.rename(columns={
    'Neighborhood (U.H.F)': 'Neighborhood Name',
    'HIV DIAGNOSES PER 100,000 POPULATION': 'HIV Diagnoses Per 100,000 Population',
})
aggregated_disease_table[:3]

Prepare Neighborhood Polygons

In [ ]:
# https://www1.nyc.gov/assets/doh/downloads/pdf/ah/zipcodetable.pdf
In [ ]:
zipcodes_by_neighborhood = {
    '101 Kingsbridge - Riverdale': [10463, 10471],
    '102 Northeast Bronx': [10466, 10469, 10470, 10475],
    '103 Fordham - Bronx Park': [10458, 10467, 10468],
    '104 Pelham - Throgs Neck': [10461, 10462, 10464, 10465, 10472, 10473],
    '105 Crotona - Tremont': [10453, 10457, 10460],
    '106 High Bridge - Morrisania': [10451, 10452, 10456],
    '107 Hunts Point - Mott Haven': [10454, 10455, 10459, 10474],
    '201 Greenpoint': [11211, 11222],
    '202 Downtown - Heights - Park Slope': [11201, 11205, 11215, 11217, 11231],
    '203 Bedford Stuyvesant - Crown Heights': [11213, 11212, 11216, 11233, 11238],
    '204 East New York': [11207, 11208],
    '205 Sunset Park': [11220, 11232],
    '206 Borough Park': [11204, 11218, 11219, 11230],
    '207 East Flatbush - Flatbush': [11203, 11210, 11225, 11226],
    '208 Canarsie - Flatlands': [11234, 11236, 11239],
    '209 Bensonhurst - Bay Ridge': [11209, 11214, 11228],
    '210 Coney Island - Sheepshead Bay': [11223, 11224, 11229, 11235],
    '211 Williamsburg - Bushwick': [11206, 11221, 11237],
    '301 Washington Heights - Inwood': [10031, 10032, 10033, 10034, 10040],
    '302 Central Harlem - Morningside Heights': [10026, 10027, 10030, 10037, 10039],
    '303 East Harlem': [10029, 10035],
    '304 Upper West Side': [10023, 10024, 10025],
    '305 Upper East Side': [10021, 10028, 10044, 10128],
    '306 Chelsea - Clinton': [10001, 10011, 10018, 10019, 10020, 10036],
    '307 Gramercy Park - Murray Hill': [10010, 10016, 10017, 10022],
    '308 Greenwich Village - SoHo': [10012, 10013, 10014],
    '309 Union Square - Lower East Side': [10002, 10003, 10009],
    '310 Lower Manhattan': [10004, 10005, 10006, 10007, 10038, 10280],
    '401 Long Island City - Astoria': [11101, 11102, 11103, 11104, 11105, 11106],
    '402 West Queens': [11368, 11369, 11370, 11372, 11373, 11377, 11378],
    '403 Flushing - Clearview': [11354, 11355, 11356, 11357, 11358, 11359, 11360],
    '404 Bayside - Little Neck': [11361, 11362, 11363, 11364],
    '405 Ridgewood - Forest Hills': [11374, 11375, 11379, 11385],
    '406 Fresh Meadows': [11365, 11366, 11367],
    '407 Southwest Queens': [11414, 11415, 11416, 11417, 11418, 11419, 11420, 11421],
    '408 Jamaica': [11412, 11423, 11432, 11433, 11434, 11435, 11436],
    '409 Southeast Queens': [11004, 11005, 11411, 11413, 11422, 11426, 11427, 11428, 11429],
    '410 Rockaway': [11691, 11692, 11693, 11694, 11695, 11697],
    '501 Port Richmond': [10302, 10303, 10310],
    '502 Stapleton - St. George': [10301, 10304, 10305],
    '503 Willowbrook': [10314],
    '504 South Beach - Tottenville': [10306, 10307, 10308, 10309, 10312],
}
In [ ]:
def get_name(neighborhood):
    return ' '.join(neighborhood.split()[1:])

get_name('504 South Beach - Tottenville')
In [ ]:
def get_code(neighborhood):
    return neighborhood.split()[0]

get_code('504 South Beach - Tottenville')
In [ ]:
zipcodes_by_neighborhood_name = {
    get_name(k): v
    for k, v in zipcodes_by_neighborhood.items()
}
In [ ]:
zipcodes_by_neighborhood_code = {
    get_code(k): v
    for k, v in zipcodes_by_neighborhood.items()
}
In [ ]:
neighborhood_code_by_name = {
    get_name(k): get_code(k)
    for k in zipcodes_by_neighborhood
}
In [ ]:
# https://www.census.gov/geo/maps-data/data/cbf/cbf_zcta.html
# http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_zcta510_500k.zip
source_url = (
    'http://www2.census.gov/geo/tiger/'
    'GENZ2017/shp/cb_2017_us_zcta510_500k.zip')
source_path = '/tmp/zipcodes.zip'
download_zip('/tmp/zipcodes', source_url)
In [ ]:
import geotable
t = geotable.load(source_path)
In [ ]:
t[:5]
In [ ]:
from invisibleroads_macros.iterable import flatten_lists
zipcodes = flatten_lists(zipcodes_by_neighborhood.values())
len(zipcodes)
In [ ]:
zipcodes[0]
In [ ]:
zipcode_strings = [str(x) for x in zipcodes]
In [ ]:
t[t['ZCTA5CE10'].isin(zipcode_strings)].draw()
In [ ]:
t[t['ZCTA5CE10'] == '10282'].draw()
In [ ]:
from shapely.ops import unary_union

# Get geometry for each neighborhood
geometry_by_neighborhood_code = {}
for neighborhood_code, zipcodes in zipcodes_by_neighborhood_code.items():
    zipcode_strings = [str(x) for x in zipcodes]
    zipcode_table = t[t['ZCTA5CE10'].isin(zipcode_strings)]
    zipcode_geometries = zipcode_table.geometries
    geometry_by_neighborhood_code[neighborhood_code] = unary_union(zipcode_geometries)
geometry_by_neighborhood_code
In [ ]:
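# Add aliases for neighborhood names as they are spelled in the disease table,
# so that lookups by name succeed despite spelling variants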
neighborhood_code_by_name[
    'Union Square - Lower Eastside'
] = neighborhood_code_by_name['Union Square - Lower East Side']
In [ ]:
neighborhood_code_by_name[
    'Upper Eastside'
] = neighborhood_code_by_name['Upper East Side']
In [ ]:
neighborhood_code_by_name[
    'Upper Westside'
] = neighborhood_code_by_name['Upper West Side']
In [ ]:
def get_geometry(neighborhood_name):
    try:
        neighborhood_code = neighborhood_code_by_name[neighborhood_name]
        return geometry_by_neighborhood_code[neighborhood_code]
    except KeyError as e:
        print('could not match %s' % e)
        return

aggregated_disease_table['geometry_object'] = aggregated_disease_table[
    'Neighborhood Name'].apply(get_geometry)
In [ ]:
aggregated_disease_table[:5]
In [ ]:
aggregated_disease_table.iloc[0]['geometry_object']
In [ ]:
aggregated_disease_table.iloc[2]['geometry_object']
In [ ]:
len(aggregated_disease_table)
In [ ]:
polygon_disease_table = aggregated_disease_table.dropna(subset=['geometry_object'])
len(polygon_disease_table)

Prepare a Variable By Aggregating Over an Image

Here we will work with the NYC Air Pollution Rasters.

Citywide raster files of annual average predicted surfaces for:

  • nitrogen dioxide (NO2)
  • fine particulate matter (PM2.5)
  • black carbon (BC)
  • nitric oxide (NO)
  • summer average for ozone (O3)
  • winter average for sulfur dioxide (SO2)

The files are ESRI grid rasters at 300 m resolution in the NAD83 New York Long Island State Plane (feet) projection. The prediction surfaces were generated from land use regression modeling of December 2008 - December 2015 (years 1-7) New York Community Air Survey monitoring data. Because these are estimated annual average levels produced by a statistical model, they are not comparable to short-term localized monitoring or to monitoring done for regulatory purposes. For a description of the NYCCAS design and the land use regression modeling process, see: http://www1.nyc.gov/assets/doh/downloads/pdf/environmental/comm-air-survey-08-14.pdf

In [ ]:
# Load air pollution raster
source_url = (
    'https://data.cityofnewyork.us/download/'
    'q68s-8qxv/application%2Fzip')
source_folder = '/tmp/pollution'
download_zip(source_folder, source_url)
In [ ]:
# Show nitrogen dioxide images
from glob import glob
sorted(glob('/tmp/pollution/*no2*.ovr'))
In [ ]:
# Show nitrogen dioxide images in aa2
from glob import glob
sorted(glob('/tmp/pollution/aa2_no2300m/*'))
In [ ]:
# Install packages
# If this keeps failing, try starting the notebook with more memory (e.g. MEDIUM or LARGE)
import subprocess
assert subprocess.call('pip install rasterio'.split()) == 0
In [ ]:
# Load image
import rasterio
pollution_raster = rasterio.open('/tmp/pollution/aa2_no2300m/prj.adf')
In [ ]:
# Get coordinate reference (defines how each pixel corresponds to a specific location)
pollution_proj4 = pollution_raster.crs.to_proj4()
pollution_proj4
In [ ]:
# Load image array
x = pollution_raster.read()
In [ ]:
x.shape
In [ ]:
pollution_raster.height, pollution_raster.width
In [ ]:
# Get coordinates for specific pixel
pollution_raster.xy(pollution_raster.height // 2, pollution_raster.width // 2)
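In [ ]:
# rasterio can also map coordinates back to the containing pixel,
# which is the inverse of xy -- a quick sketch for orientation
x_coordinate, y_coordinate = pollution_raster.xy(
    pollution_raster.height // 2, pollution_raster.width // 2)
pollution_raster.index(x_coordinate, y_coordinate)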
In [ ]:
x.max()
In [ ]:
x.min()
In [ ]:
x.shape
In [ ]:
# Enable inline plots
%matplotlib inline
In [ ]:
array = x[0]
array.shape
In [ ]:
# from matplotlib.colors import Normalize
# normalize = Normalize(vmin=x.min(), vmax=x.max())
In [ ]:
import matplotlib.pyplot as plt
plt.imshow(array, vmin=0);
In [ ]:
# Convert neighborhood geometries into pollution coordinate reference
source_proj4 = neighborhood_proj4 = geotable.LONGITUDE_LATITUDE_PROJ4
target_proj4 = pollution_proj4
from geotable.projections import get_transform_shapely_geometry
f = get_transform_shapely_geometry(source_proj4, target_proj4)
In [ ]:
from rasterio.mask import mask

for index, row in aggregated_disease_table.iterrows():
    old_g = row['geometry_object']
    new_g = f(old_g)
    # print(old_g, new_g)
    break
In [ ]:
output_image, output_transform = mask(pollution_raster, [new_g], crop=True)
In [ ]:
# pollution_raster.meta
In [ ]:
output_image.shape
In [ ]:
output_image.sum()
In [ ]:
output_image[output_image < 0] = 0
In [ ]:
output_image.sum()
In [ ]:
# Convert neighborhood geometries into pollution coordinate reference
from geotable.projections import get_transform_shapely_geometry
source_proj4 = neighborhood_proj4 = geotable.LONGITUDE_LATITUDE_PROJ4
target_proj4 = pollution_proj4
f = get_transform_shapely_geometry(source_proj4, target_proj4)

# For each neighborhood polygon, get the sum of nitrogen dioxide values
from rasterio.mask import mask

def get_pollutant_sum(old_polygon):
    new_polygon = f(old_polygon)
    output_image, output_transform = mask(pollution_raster, [new_polygon], crop=True)
    output_image[output_image < 0] = 0  # Zero out nodata values so that the sum does not overflow to -inf
    return output_image.sum()

# Drop rows that do not have a geometry
aggregated_disease_table.dropna(subset=['geometry_object'], inplace=True)
# Define new column
aggregated_disease_table['pollutant_sum'] = aggregated_disease_table[
    'geometry_object'].apply(get_pollutant_sum)
In [ ]:
aggregated_disease_table[:5]
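
Note that the sum above scales with the number of pixels in each polygon, i.e. with neighborhood area. If you want an area-independent variable, you can average the pixel values instead. Here is a minimal sketch that reuses the masking logic of get_pollutant_sum (the column name pollutant_mean is our own):

In [ ]:
# Hypothetical alternative: average the pixel values inside each polygon
# so that the variable does not scale with neighborhood area
def get_pollutant_mean(old_polygon):
    new_polygon = f(old_polygon)
    output_image, output_transform = mask(pollution_raster, [new_polygon], crop=True)
    values = output_image[output_image > 0]  # Keep only valid pixel values
    return values.mean() if len(values) else 0

aggregated_disease_table['pollutant_mean'] = aggregated_disease_table[
    'geometry_object'].apply(get_pollutant_mean)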

Prepare a Variable By Aggregating Nearby Coordinates

  1. Extract a table of recreation centers.
  2. Extract a table of schools.
  3. Prepare a kd-tree of school locations.
  4. Define a column called nearby_school_count that counts the number of schools within a 0.5 mile radius of each recreation center.
In [ ]:
# https://data.cityofnewyork.us/City-Government/Facilities-Database/ji82-xba5
# https://data.cityofnewyork.us/api/views/ji82-xba5/rows.csv?accessType=DOWNLOAD

url = (
    'https://data.cityofnewyork.us/api/views/'
    'ji82-xba5/rows.csv?accessType=DOWNLOAD')
path = '/tmp/facility.csv'
download(path, url)
In [ ]:
t = pd.read_csv(path)
t.iloc[0]
In [ ]:
t['facgroup'].unique()
In [ ]:
t[t['overagency'] == 'NYC Department of Parks and Recreation'].iloc[0]
In [ ]:
t['factype'].unique()
In [ ]:
t[t['overagency'] == 'NYC Department of Parks and Recreation']['factype'].unique()
In [ ]:
facility_types = t[t['overagency'] == 'NYC Department of Parks and Recreation']['factype'].unique()
[x for x in facility_types if 'Recreation' in x]
In [ ]:
# Drop rows where factype is null
t.dropna(subset=['factype'], inplace=True)
In [ ]:
# Extract table of recreation centers
recreation_center_table = t[t['factype'].str.contains('Recreation')].copy()
recreation_center_table[:3]
In [ ]:
len(recreation_center_table)
In [ ]:
recreation_center_table.iloc[0]
In [ ]:
# Extract table of schools
school_table = t[t['factype'].str.contains('School')]
school_table[:3]
In [ ]:
# For each recreation center, count the number of schools within a 0.5 mile radius
In [ ]:
# Get longitude, latitude array
"""
school_xys = []
for index, row in school_table.iterrows():
    school_xys.append((row['longitude'], row['latitude']))
school_xys[:5]
""";
school_xys = school_table[['longitude', 'latitude']].values
In [ ]:
# Make school kdtree
from pysal.lib.cg import KDTree, RADIUS_EARTH_MILES
school_tree = KDTree(school_xys, distance_metric='Arc', radius=RADIUS_EARTH_MILES)
In [ ]:
school_count = len(school_xys)
school_count
In [ ]:
xy = -73.97975799999999, 40.615953999999995
distances, indices = school_tree.query(xy, k=school_count, distance_upper_bound=0.5)
In [ ]:
indices
In [ ]:
indices[indices < school_count]
In [ ]:
len(indices[indices < school_count])
In [ ]:
radius_in_miles = 0.5

def get_school_count(r):
    xy = r['longitude'], r['latitude']
    distances, indices = school_tree.query(
        xy, k=school_count, distance_upper_bound=radius_in_miles)
    return sum(indices < school_count)

recreation_center_table['nearby_school_count'] = recreation_center_table.apply(
    get_school_count, axis=1)
In [ ]:
recreation_center_table.iloc[0]
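
Finally, with the prepared columns in place, we can fit a regression. Here is a minimal sketch using ordinary least squares from statsmodels (the choice of library is our assumption; the steps above do not prescribe one), predicting HIV diagnoses from the pollutant sum prepared earlier:

In [ ]:
# A minimal sketch: fit an ordinary least squares regression with the
# prepared columns (assumes statsmodels is installed)
import statsmodels.api as sm
X = sm.add_constant(aggregated_disease_table[['pollutant_sum']].astype(float))
y = aggregated_disease_table['HIV Diagnoses Per 100,000 Population']
model = sm.OLS(y, X, missing='drop').fit()
model.summary()

For spatially autocorrelated data, a spatial lag or spatial error model (for example, from pysal's spreg module) may be more appropriate than plain OLS.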