Prepare and Fit Spatial Regression Models 20190222




Pay Notebook Creator: Roy Hyunjin Han0
Set Container: Numerical CPU with TINY Memory for 10 Minutes 0
Total0

Augment NYC Dataset with Tree Statistics

Given a spatial NYC dataset, return an augmented dataset with additional columns for tree statistics as taken from the 2015 Street Tree Census. You can use the tree statistics as feature variables to train a fun predictive model.

  • If a geometry is a point, the tool will use the point search radius to search for trees around the point.
  • If a geometry is a polygon, the tool will search for trees within the polygon.
  • Make sure to reserve SMALL memory and at least 10 minutes when running this tool using the full tree dataset.

{ source_geotable : Source NYC Spatial Dataset ? Make sure that your CSV has a WKT column that defines the geometry of each row }

{ point_search_radius_in_meters : Point Search Radius in Meters ? Search for trees using this search distance (only for points, does not affect polygons) }

{ source_sample_count: Source Sample Count ? Set to -1 to use the full source dataset }

{ tree_sample_count : Tree Sample Count ? Set to -1 to use the full tree dataset }

In [1]:
# CrossCompute
source_geotable_path = 'geometry.csv'
point_search_radius_in_meters = 100
tree_sample_count = 100000
source_sample_count = 10
target_folder = '/tmp'
In [2]:
import geotable
source_geotable = geotable.load(source_geotable_path)
len(source_geotable)
Out[2]:
6
In [3]:
utm_proj4 = geotable.load_utm_proj4(source_geotable_path)
utm_geotable = geotable.load(source_geotable_path, target_proj4=utm_proj4)
utm_geotable['Area in Sq Km'] = utm_geotable['geometry_object'].apply(lambda g: g.area / 1000000)
source_geotable['Area in Sq Km'] = utm_geotable['Area in Sq Km']
source_geotable
Out[3]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
geometry_layer geometry_proj4 geometry_object Area in Sq Km
0 geometry +proj=longlat +datum=WGS84 +no_defs POINT (-73.96277710000001 40.7768607) 0.000000
1 geometry +proj=longlat +datum=WGS84 +no_defs POINT (-73.96055009999999 40.7776435) 0.000000
2 geometry +proj=longlat +datum=WGS84 +no_defs POINT (-73.9269635 40.7550475) 0.000000
3 geometry +proj=longlat +datum=WGS84 +no_defs POLYGON ((-73.99309797037012 40.77272824543018... 1.748151
4 geometry +proj=longlat +datum=WGS84 +no_defs POLYGON ((-74.01653259232452 40.71865914101984... 0.145103
5 geometry +proj=longlat +datum=WGS84 +no_defs POLYGON ((-73.83760758526363 40.74323764647701... 6.760576
In [4]:
import pandas as pd

def load(
    endpoint_url,
    selected_columns=None,
    buffer_size=1000,
    search_term_by_column=None,
    **kw,
):
    buffer_url = (f'{endpoint_url}?$limit={buffer_size}')
    if selected_columns:
        select_string = ','.join(selected_columns)
        buffer_url += f'&$select={select_string}'
    for column, search_term in (search_term_by_column or {}).items():
        buffer_url += f'&$where={column}+like+"%25{search_term}%25"'
    print(buffer_url)
    tables = []
    
    if endpoint_url.endswith('.json'):
        f = pd.read_json
    else:
        f = pd.read_csv

    t = f(buffer_url, **kw)
    while len(t):
        print(len(tables) * buffer_size + len(t))
        tables.append(t)
        offset = buffer_size * len(tables)
        t = f(buffer_url + f'&$offset={offset}', **kw)
    return pd.concat(tables, ignore_index=True, sort=False)

Load Tree Table

In [5]:
endpoint_url = 'https://data.cityofnewyork.us/resource/uvpi-gqnh.csv'
selected_columns = 'tree_id', 'tree_dbh', 'latitude', 'longitude'
buffer_size = 100000
tree_table = load(endpoint_url, selected_columns, buffer_size)
tree_table[:5]
https://data.cityofnewyork.us/resource/uvpi-gqnh.csv?$limit=100000&$select=tree_id,tree_dbh,latitude,longitude
100000
200000
300000
400000
500000
600000
683788
Out[5]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
latitude longitude tree_dbh tree_id
0 40.723092 -73.844215 3 180683
1 40.794111 -73.818679 21 200540
2 40.717581 -73.936608 3 204026
3 40.713537 -73.934456 10 204337
4 40.666778 -73.975979 21 189565
In [6]:
tree_table.dropna(subset=['longitude', 'latitude'], inplace=True)
len(tree_table)
Out[6]:
683788

Prepare Sample of Tree Locations

In [7]:
if tree_sample_count < 0:
    tree_sample_count = len(tree_table)
tree_sample_count
Out[7]:
100000
In [8]:
import numpy as np
import random
sample_tree_labels = random.sample(list(tree_table.index), tree_sample_count)
sample_tree_labels = np.array(sample_tree_labels)
sample_tree_labels[:3]
Out[8]:
array([605943, 271592, 484569])
In [9]:
sample_tree_table = tree_table.loc[sample_tree_labels]
sample_tree_table[:3]
Out[9]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
latitude longitude tree_dbh tree_id
605943 40.788409 -73.954749 11 110971
271592 40.626775 -74.161810 5 467049
484569 40.631334 -74.013344 19 711944
In [10]:
sample_tree_xys = sample_tree_table[['longitude', 'latitude']].values
sample_tree_xys[:3]
Out[10]:
array([[-73.95474877,  40.78840867],
       [-74.16180988,  40.62677546],
       [-74.01334448,  40.6313344 ]])

Prepare MultiPoint of Tree Locations

In [11]:
from shapely.geometry import MultiPoint, Point
In [12]:
sample_tree_points = [Point(_) for _ in sample_tree_xys]
In [13]:
sample_tree_collection = MultiPoint(sample_tree_points)

Prepare KD Tree of Tree Locations

In [14]:
from pysal.lib.cg import KDTree, RADIUS_EARTH_KM

kd_tree = KDTree(
    sample_tree_xys,
    distance_metric='Arc',
    radius=RADIUS_EARTH_KM * 1000)
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/lib/weights/util.py:19: UserWarning: geopandas not available. Some functionality will be disabled.
  warn('geopandas not available. Some functionality will be disabled.')
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/model/spvcm/abstracts.py:10: UserWarning: The `dill` module is required to use the sqlite backend fully.
  from .sqlite import head_to_sql, start_sql
In [15]:
def query_kd_tree(g, k=tree_sample_count):
    xy = g.coords[0]
    distances, indices = kd_tree.query(
        xy,
        k=k,
        distance_upper_bound=point_search_radius_in_meters)
    if k == 1:
        relative_distance = distances
        relative_index = indices
        return [relative_distance], [relative_index]
    relative_distances = distances[indices < k]
    relative_indices = indices[indices < k]
    relative_indices = [int(_) for _ in relative_indices]
    return relative_distances, relative_indices

Define Aggregation Functions

In [16]:
from shapely.geometry import Point

def get_tree_count(r):
    g = r['geometry_object']
    if g.area:
        selected_tree_collection = g.intersection(sample_tree_collection)
        if selected_tree_collection.is_empty:
            return 0
        if isinstance(selected_tree_collection, Point):
            return 1
        return len(selected_tree_collection)
    relative_distances, relative_indices = query_kd_tree(g)
    return len(relative_indices)
In [17]:
from geopy.distance import distance as get_distance

def get_geodesic_distance_in_meters(g1, g2):
    latitude_longitude1 = g1.centroid.coords[:2][::-1]
    latitude_longitude2 = g2.centroid.coords[:2][::-1]
    return get_distance(
        latitude_longitude1,
        latitude_longitude2).meters

def get_sum_distance(r):
    g = r['geometry_object']
    if g.area:
        selected_tree_collection = g.intersection(sample_tree_collection)
        if selected_tree_collection.is_empty:
            return 0
        g_centroid = g.centroid
        return sum(get_geodesic_distance_in_meters(
            g_centroid, _) for _ in selected_tree_collection)
    relative_distances, relative_indices = query_kd_tree(g)
    return sum(relative_distances)
In [18]:
import numpy as np

def get_average_diameter_from(relative_indices):
    tree_labels = sample_tree_labels[relative_indices]
    selected_tree_table = tree_table.loc[tree_labels]
    return selected_tree_table['tree_dbh'].mean()

def get_average_diameter(r):
    g = r['geometry_object']
    if g.area:
        selected_tree_collection = g.intersection(sample_tree_collection)
        if selected_tree_collection.is_empty:
            return 0
        if isinstance(selected_tree_collection, Point):
            selected_tree_collection = [selected_tree_collection]        
        values = []
        for g in selected_tree_collection:
            relative_distances, relative_indices = query_kd_tree(g, k=1)
            values.append(get_average_diameter_from(relative_indices))
        return np.mean(values)
    relative_distances, relative_indices = query_kd_tree(g)
    return get_average_diameter_from(relative_indices)

Augment Dataset

In [20]:
if source_sample_count < 0:
    source_sample_count = len(source_geotable)
    
target_geotable = source_geotable[:source_sample_count].copy()

target_geotable[
    'Tree Count'
] = target_geotable.apply(get_tree_count, axis=1)

target_geotable[
    'Tree Count Per Sq Km'
] = target_geotable['Tree Count'] / target_geotable['Area in Sq Km']

target_geotable[
    'Tree Distance Sum'
] = target_geotable.apply(get_sum_distance, axis=1)

target_geotable[
    'Tree Distance Sum Per Sq Km'
] = target_geotable['Tree Distance Sum'] / target_geotable['Area in Sq Km']

target_geotable[
    'Average Tree Diameter'
] = target_geotable.apply(get_average_diameter, axis=1)

target_geotable[:5]
Out[20]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
geometry_layer geometry_proj4 geometry_object Area in Sq Km Tree Count Tree Count Per Sq Km Tree Distance Sum Tree Distance Sum Per Sq Km Average Tree Diameter
0 geometry +proj=longlat +datum=WGS84 +no_defs POINT (-73.96277710000001 40.7768607) 0.000000 15 inf 1011.476671 inf 9.333333
1 geometry +proj=longlat +datum=WGS84 +no_defs POINT (-73.96055009999999 40.7776435) 0.000000 14 inf 1123.257550 inf 7.357143
2 geometry +proj=longlat +datum=WGS84 +no_defs POINT (-73.9269635 40.7550475) 0.000000 6 inf 437.903296 inf 8.000000
3 geometry +proj=longlat +datum=WGS84 +no_defs POLYGON ((-73.99309797037012 40.77272824543018... 1.748151 261 149.300606 134219.539077 7.677800e+04 7.134100
4 geometry +proj=longlat +datum=WGS84 +no_defs POLYGON ((-74.01653259232452 40.71865914101984... 0.145103 43 296.341460 3879.871613 2.673876e+04 7.720930

Save and Render Dataset

In [21]:
target_path = target_folder + '/augmented-dataset.csv'
target_geotable.to_csv(target_path, index=False)
print(f'augmented_table_path = {target_path}')
augmented_table_path = /tmp/augmented-dataset.csv
In [23]:
target_path = target_folder + '/augmented-dataset-map.csv'
map_geotable = target_geotable.copy()
map_geotable['FillGreens'] = map_geotable['Tree Count']
map_geotable.to_csv(target_path, index=False)
print(f'augmented_geotable_path = {target_path}')
augmented_geotable_path = /tmp/augmented-dataset-map.csv

Augmented NYC Dataset with Tree Statistics

{ augmented_table : Dataset with Tree Statistics ? Your dataset with tree statistics }

{ augmented_geotable : Map with Tree Statistics ? Your dataset with tree statistics on a map }