Augment NYC Dataset with Tree Statistics¶

Given a spatial NYC dataset, return an augmented dataset with additional columns for tree statistics as taken from the 2015 Street Tree Census. You can use the tree statistics as feature variables to train a fun predictive model.

If a geometry is a point, the tool will use the point search radius to search for trees around the point.
If a geometry is a polygon, the tool will search for trees within the polygon.
Make sure to reserve SMALL memory and at least 10 minutes when running this tool using the full tree dataset.

{ source_geotable : Source NYC Spatial Dataset ? Make sure that your CSV has a WKT column that defines the geometry of each row }

{ point_search_radius_in_meters : Point Search Radius in Meters ? Search for trees using this search distance (only for points, does not affect polygons) }

{ source_sample_count: Source Sample Count ? Set to -1 to use the full source dataset }

{ tree_sample_count : Tree Sample Count ? Set to -1 to use the full tree dataset }

In [1]:

# CrossCompute
source_geotable_path = 'geometry.csv'
point_search_radius_in_meters = 100
tree_sample_count = 100000
source_sample_count = 10
target_folder = '/tmp'

In [2]:

import geotable
source_geotable = geotable.load(source_geotable_path)
len(source_geotable)

Out[2]:

In [3]:

utm_proj4 = geotable.load_utm_proj4(source_geotable_path)
utm_geotable = geotable.load(source_geotable_path, target_proj4=utm_proj4)
utm_geotable['Area in Sq Km'] = utm_geotable['geometry_object'].apply(lambda g: g.area / 1000000)
source_geotable['Area in Sq Km'] = utm_geotable['Area in Sq Km']
source_geotable

Out[3]:

	geometry_layer	geometry_proj4	geometry_object	Area in Sq Km
0	geometry	+proj=longlat +datum=WGS84 +no_defs	POINT (-73.96277710000001 40.7768607)	0.000000
1	geometry	+proj=longlat +datum=WGS84 +no_defs	POINT (-73.96055009999999 40.7776435)	0.000000
2	geometry	+proj=longlat +datum=WGS84 +no_defs	POINT (-73.9269635 40.7550475)	0.000000
3	geometry	+proj=longlat +datum=WGS84 +no_defs	POLYGON ((-73.99309797037012 40.77272824543018...	1.748151
4	geometry	+proj=longlat +datum=WGS84 +no_defs	POLYGON ((-74.01653259232452 40.71865914101984...	0.145103
5	geometry	+proj=longlat +datum=WGS84 +no_defs	POLYGON ((-73.83760758526363 40.74323764647701...	6.760576

In [4]:

import pandas as pd

def load(
    endpoint_url,
    selected_columns=None,
    buffer_size=1000,
    search_term_by_column=None,
    **kw,
):
    buffer_url = (f'{endpoint_url}?$limit={buffer_size}')
    if selected_columns:
        select_string = ','.join(selected_columns)
        buffer_url += f'&$select={select_string}'
    for column, search_term in (search_term_by_column or {}).items():
        buffer_url += f'&$where={column}+like+"%25{search_term}%25"'
    print(buffer_url)
    tables = []
    
    if endpoint_url.endswith('.json'):
        f = pd.read_json
    else:
        f = pd.read_csv

    t = f(buffer_url, **kw)
    while len(t):
        print(len(tables) * buffer_size + len(t))
        tables.append(t)
        offset = buffer_size * len(tables)
        t = f(buffer_url + f'&$offset={offset}', **kw)
    return pd.concat(tables, ignore_index=True, sort=False)

Load Tree Table¶

In [5]:

endpoint_url = 'https://data.cityofnewyork.us/resource/uvpi-gqnh.csv'
selected_columns = 'tree_id', 'tree_dbh', 'latitude', 'longitude'
buffer_size = 100000
tree_table = load(endpoint_url, selected_columns, buffer_size)
tree_table[:5]

https://data.cityofnewyork.us/resource/uvpi-gqnh.csv?$limit=100000&$select=tree_id,tree_dbh,latitude,longitude
100000
200000
300000
400000
500000
600000
683788

Out[5]:

	latitude	longitude	tree_dbh	tree_id
0	40.723092	-73.844215	3	180683
1	40.794111	-73.818679	21	200540
2	40.717581	-73.936608	3	204026
3	40.713537	-73.934456	10	204337
4	40.666778	-73.975979	21	189565

In [6]:

tree_table.dropna(subset=['longitude', 'latitude'], inplace=True)
len(tree_table)

Out[6]:

Prepare Sample of Tree Locations¶

In [7]:

if tree_sample_count < 0:
    tree_sample_count = len(tree_table)
tree_sample_count

Out[7]:

In [8]:

import numpy as np
import random
sample_tree_labels = random.sample(list(tree_table.index), tree_sample_count)
sample_tree_labels = np.array(sample_tree_labels)
sample_tree_labels[:3]

Out[8]:

array([605943, 271592, 484569])

In [9]:

sample_tree_table = tree_table.loc[sample_tree_labels]
sample_tree_table[:3]

Out[9]:

	latitude	longitude	tree_dbh	tree_id
605943	40.788409	-73.954749	11	110971
271592	40.626775	-74.161810	5	467049
484569	40.631334	-74.013344	19	711944

In [10]:

sample_tree_xys = sample_tree_table[['longitude', 'latitude']].values
sample_tree_xys[:3]

Out[10]:

array([[-73.95474877,  40.78840867],
       [-74.16180988,  40.62677546],
       [-74.01334448,  40.6313344 ]])

Prepare MultiPoint of Tree Locations¶

In [11]:

from shapely.geometry import MultiPoint, Point

In [12]:

sample_tree_points = [Point(_) for _ in sample_tree_xys]

In [13]:

sample_tree_collection = MultiPoint(sample_tree_points)

Prepare KD Tree of Tree Locations¶

In [14]:

from pysal.lib.cg import KDTree, RADIUS_EARTH_KM

kd_tree = KDTree(
    sample_tree_xys,
    distance_metric='Arc',
    radius=RADIUS_EARTH_KM * 1000)

/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/lib/weights/util.py:19: UserWarning: geopandas not available. Some functionality will be disabled.
  warn('geopandas not available. Some functionality will be disabled.')
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pysal/model/spvcm/abstracts.py:10: UserWarning: The `dill` module is required to use the sqlite backend fully.
  from .sqlite import head_to_sql, start_sql

In [15]:

def query_kd_tree(g, k=tree_sample_count):
    xy = g.coords[0]
    distances, indices = kd_tree.query(
        xy,
        k=k,
        distance_upper_bound=point_search_radius_in_meters)
    if k == 1:
        relative_distance = distances
        relative_index = indices
        return [relative_distance], [relative_index]
    relative_distances = distances[indices < k]
    relative_indices = indices[indices < k]
    relative_indices = [int(_) for _ in relative_indices]
    return relative_distances, relative_indices

Define Aggregation Functions¶

In [16]:

from shapely.geometry import Point

def get_tree_count(r):
    g = r['geometry_object']
    if g.area:
        selected_tree_collection = g.intersection(sample_tree_collection)
        if selected_tree_collection.is_empty:
            return 0
        if isinstance(selected_tree_collection, Point):
            return 1
        return len(selected_tree_collection)
    relative_distances, relative_indices = query_kd_tree(g)
    return len(relative_indices)

In [17]:

from geopy.distance import distance as get_distance

def get_geodesic_distance_in_meters(g1, g2):
    latitude_longitude1 = g1.centroid.coords[:2][::-1]
    latitude_longitude2 = g2.centroid.coords[:2][::-1]
    return get_distance(
        latitude_longitude1,
        latitude_longitude2).meters

def get_sum_distance(r):
    g = r['geometry_object']
    if g.area:
        selected_tree_collection = g.intersection(sample_tree_collection)
        if selected_tree_collection.is_empty:
            return 0
        g_centroid = g.centroid
        return sum(get_geodesic_distance_in_meters(
            g_centroid, _) for _ in selected_tree_collection)
    relative_distances, relative_indices = query_kd_tree(g)
    return sum(relative_distances)

In [18]:

import numpy as np

def get_average_diameter_from(relative_indices):
    tree_labels = sample_tree_labels[relative_indices]
    selected_tree_table = tree_table.loc[tree_labels]
    return selected_tree_table['tree_dbh'].mean()

def get_average_diameter(r):
    g = r['geometry_object']
    if g.area:
        selected_tree_collection = g.intersection(sample_tree_collection)
        if selected_tree_collection.is_empty:
            return 0
        if isinstance(selected_tree_collection, Point):
            selected_tree_collection = [selected_tree_collection]        
        values = []
        for g in selected_tree_collection:
            relative_distances, relative_indices = query_kd_tree(g, k=1)
            values.append(get_average_diameter_from(relative_indices))
        return np.mean(values)
    relative_distances, relative_indices = query_kd_tree(g)
    return get_average_diameter_from(relative_indices)

Augment Dataset¶

In [20]:

if source_sample_count < 0:
    source_sample_count = len(source_geotable)
    
target_geotable = source_geotable[:source_sample_count].copy()

target_geotable[
    'Tree Count'
] = target_geotable.apply(get_tree_count, axis=1)

target_geotable[
    'Tree Count Per Sq Km'
] = target_geotable['Tree Count'] / target_geotable['Area in Sq Km']

target_geotable[
    'Tree Distance Sum'
] = target_geotable.apply(get_sum_distance, axis=1)

target_geotable[
    'Tree Distance Sum Per Sq Km'
] = target_geotable['Tree Distance Sum'] / target_geotable['Area in Sq Km']

target_geotable[
    'Average Tree Diameter'
] = target_geotable.apply(get_average_diameter, axis=1)

target_geotable[:5]

Out[20]:

	geometry_layer	geometry_proj4	geometry_object	Area in Sq Km	Tree Count	Tree Count Per Sq Km	Tree Distance Sum	Tree Distance Sum Per Sq Km	Average Tree Diameter
0	geometry	+proj=longlat +datum=WGS84 +no_defs	POINT (-73.96277710000001 40.7768607)	0.000000	15	inf	1011.476671	inf	9.333333
1	geometry	+proj=longlat +datum=WGS84 +no_defs	POINT (-73.96055009999999 40.7776435)	0.000000	14	inf	1123.257550	inf	7.357143
2	geometry	+proj=longlat +datum=WGS84 +no_defs	POINT (-73.9269635 40.7550475)	0.000000	6	inf	437.903296	inf	8.000000
3	geometry	+proj=longlat +datum=WGS84 +no_defs	POLYGON ((-73.99309797037012 40.77272824543018...	1.748151	261	149.300606	134219.539077	7.677800e+04	7.134100
4	geometry	+proj=longlat +datum=WGS84 +no_defs	POLYGON ((-74.01653259232452 40.71865914101984...	0.145103	43	296.341460	3879.871613	2.673876e+04	7.720930

Save and Render Dataset¶

In [21]:

target_path = target_folder + '/augmented-dataset.csv'
target_geotable.to_csv(target_path, index=False)
print(f'augmented_table_path = {target_path}')

augmented_table_path = /tmp/augmented-dataset.csv

In [23]:

target_path = target_folder + '/augmented-dataset-map.csv'
map_geotable = target_geotable.copy()
map_geotable['FillGreens'] = map_geotable['Tree Count']
map_geotable.to_csv(target_path, index=False)
print(f'augmented_geotable_path = {target_path}')

augmented_geotable_path = /tmp/augmented-dataset-map.csv

Augmented NYC Dataset with Tree Statistics¶

{ augmented_table : Dataset with Tree Statistics ? Your dataset with tree statistics }

{ augmented_geotable : Map with Tree Statistics ? Your dataset with tree statistics on a map }

Pay Notebook Creator: Roy Hyunjin Han	0
Set Container: Numerical CPU with TINY Memory for 10 Minutes	0
Total	0

Prepare and Fit Spatial Regression Models 20190222