Atlas ESCP¶

In [1]:

# CrossCompute
ready_table_path = 'Table with average monthly savings(within 0.5 Mile) and tree count(within 0.5 Mile).csv'
search_radius_in_miles = 0.5
industry_select = """
Manufacturing

Manufacturing
Wholesale/Warehouse/Distribution
Commercial
Landlord
Public Benefit Corp
Other

"""
program_select = """
ICIP

City/State
ICAP
ICIP
IDA
Relocator
Tenant

"""
target_folder = '/tmp'

In [2]:

import pandas as pd
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import subprocess
subprocess.call('pip install geopandas'.split())
subprocess.call('pip install dill'.split())

matplotlib.rcParams['figure.figsize'] = (10,10)

Prepare building dataset¶

<font color=blue>Tasks: <br> 1. Feature selection: Company Name, Address, Industry, Program, Savings<br> 2. Smooth out time factor by calculating "Periodic Savings over months" starting from effective date (round to month) to 2017-12-31

In [3]:

# Prepare builing data
url1 = 'https://data.cityofnewyork.us/api/views/ukdt-xm28/rows.csv?accessType=DOWNLOAD'
building_table = pd.read_csv(url1, na_values = 'n/a')
# Rename features for better readability
building_table = building_table.rename(columns={'Company Type': 'Program', 
                                                'Savings from begining receiving benefits': 'Savings'})
# Feature selection
# Come back to add features later if necessary
building_table = building_table[['Company Name', 'Address','Industry','Program', 
                                 'Effective Date','Savings',
                                 'Borough','Latitude','Longitude']]
building_table[:3]

Out[3]:

	Company Name	Address	Industry	Program	Effective Date	Savings	Borough	Latitude	Longitude
0	139 ACA Realty, Inc.	43-23 35th Street	Commercial	ICIP	04/07/2008	123975.24	QUEENS	40.745706	-73.929565
1	141 Lake Avenue Realty c/o JR Produce, Inc.	141 Lake Avenue	Wholesale/Warehouse/Distribution	ICIP	12/08/2009	47512.89	STATEN IS	40.633153	-74.150999
2	14-10 123rd Street LLC	14-10 123rd Street	Commercial	ICIP	03/04/2011	21322.89	QUEENS	40.785144	-73.844833

In [4]:

len(building_table)

Out[4]:

In [5]:

# Drop rows when lonlat is NULL
building_table.dropna(axis=0, subset=['Longitude','Latitude'], inplace = True)
len(building_table)

Out[5]:

In [6]:

building_table.dtypes

Out[6]:

Company Name       object
Address            object
Industry           object
Program            object
Effective Date     object
Savings           float64
Borough            object
Latitude          float64
Longitude         float64
dtype: object

In [7]:

building_table.dtypes

Out[7]:

Company Name       object
Address            object
Industry           object
Program            object
Effective Date     object
Savings           float64
Borough            object
Latitude          float64
Longitude         float64
dtype: object

In [8]:

from datetime import date
building_table['Effective Date'] = pd.to_datetime(building_table['Effective Date'].str.strip(), format='%m/%d/%Y')
building_table['Month Count'] = ((pd.to_datetime('2017-12-31') - building_table['Effective Date']) / np.timedelta64(1, 'M'))
building_table['Month Count'] = building_table['Month Count'].astype(int)  

# Periodic Savings over months starting from effective date (round to month) to 2017-12-31
building_table['Periodic Savings over Months'] = (building_table['Savings'] / building_table['Month Count']).apply(lambda x: round(x, 2))

building_table.drop(['Savings' #,'Effective Date','Month Count' 
                         ], axis=1) 
building_table.head()

Out[8]:

	Company Name	Address	Industry	Program	Effective Date	Savings	Borough	Latitude	Longitude	Month Count	Periodic Savings over Months
0	139 ACA Realty, Inc.	43-23 35th Street	Commercial	ICIP	2008-04-07	123975.24	QUEENS	40.745706	-73.929565	116	1068.75
1	141 Lake Avenue Realty c/o JR Produce, Inc.	141 Lake Avenue	Wholesale/Warehouse/Distribution	ICIP	2009-12-08	47512.89	STATEN IS	40.633153	-74.150999	96	494.93
2	14-10 123rd Street LLC	14-10 123rd Street	Commercial	ICIP	2011-03-04	21322.89	QUEENS	40.785144	-73.844833	81	263.25
3	183 Lorriane Street LLC	183 Lorraine Street	Wholesale/Warehouse/Distribution	ICIP	2015-11-06	105016.49	BROOKLYN	40.673106	-74.002300	25	4200.66
4	21st Century Optics, Inc.	47-00 33rd Street	Manufacturing	Tenant	2009-01-07	215757.20	QUEENS	40.742386	-73.932148	107	2016.42

<font color=blue>Note: Building table feature selection is ready to this point<br>Time factor is smoothed out. Please refer to feature "Periodic Savings over Months" as reliable statistics for all participants<br>

Now load the tree dataset¶

<font color=blue>Tasks: <br>1. Feature selection: Longitude, Latitude, status, tree_id for later tree count around each location from building table<br> 2. Define a function to expedite data processing for large dataset

In [9]:

# Load function
def load(
    endpoint_url,
    selected_columns=None,
    buffer_size=1000,
    search_term_by_column=None,
    **kw,
):
    buffer_url = (f'{endpoint_url}?$limit={buffer_size}')
    if selected_columns:
        select_string = ','.join(selected_columns)
        buffer_url += f'&$select={select_string}'
    for column, search_term in (search_term_by_column or {}).items():
        buffer_url += f'&$where={column}+like+"%25{search_term}%25"'
    print(buffer_url)
    tables = []
    
    if endpoint_url.endswith('.json'):
        f = pd.read_json
    else:
        f = pd.read_csv

    t = f(buffer_url, **kw)
    while len(t):
        print(len(tables) * buffer_size + len(t))
        tables.append(t)
        offset = buffer_size * len(tables)
        t = f(buffer_url + f'&$offset={offset}', **kw)
    return pd.concat(tables, sort=False)

In [10]:

%%time
endpoint_url = 'https://data.cityofnewyork.us/resource/nwxe-4ae8.csv'
selected_columns = 'tree_id', 'status', 'Latitude', 'Longitude'
buffer_size = 100000
tree_table = load(endpoint_url, selected_columns, buffer_size)

https://data.cityofnewyork.us/resource/nwxe-4ae8.csv?$limit=100000&$select=tree_id,status,Latitude,Longitude
100000
200000
300000
400000
500000
600000
683788
CPU times: user 505 ms, sys: 76.4 ms, total: 581 ms
Wall time: 1.59 s

In [11]:

tree_table.head()

Out[11]:

	Latitude	Longitude	status	tree_id
0	40.723092	-73.844215	Alive	180683
1	40.794111	-73.818679	Alive	200540
2	40.717581	-73.936608	Alive	204026
3	40.713537	-73.934456	Alive	204337
4	40.666778	-73.975979	Alive	189565

In [12]:

tree_table.dropna(subset=['Longitude', 'Latitude'], inplace=True)
tree_table[tree_table['status'] == 'Alive'].iloc[:3,] #filter out stump and dead trees
tree_table.drop(['status'], axis = 1).head()

Out[12]:

	Latitude	Longitude	tree_id
0	40.723092	-73.844215	180683
1	40.794111	-73.818679	200540
2	40.717581	-73.936608	204026
3	40.713537	-73.934456	204337
4	40.666778	-73.975979	189565

<font color=blue>Tree table is clean and ready at this point<br> Next step: import ketree to get total tree count for each location in building table

Get total tree count within 0.5 mile for each location in building table¶

In [13]:

# building_table.loc[building_table['Industry'] == industry_select].loc[building_table['Program'] == program_select]
building_table = building_table.loc[building_table['Industry'] == "Commercial"].loc[building_table['Program'] == "ICIP"]

In [22]:

# Make kdtree
from pysal.lib.cg.kdtree import KDTree
from pysal.lib.cg import RADIUS_EARTH_MILES

tree_xys = tree_table[['Longitude', 'Latitude']].values
tree_xys

Out[22]:

array([[-73.84421522,  40.72309177],
       [-73.81867946,  40.79411067],
       [-73.9366077 ,  40.71758074],
       ...,
       [-74.13651724,  40.62076153],
       [-73.90311472,  40.85082819],
       [-73.78752646,  40.73216525]])

In [23]:

tree_count = len(tree_xys)
tree_count

Out[23]:

In [14]:

tree_count = len(tree_xys)
bin_tree = KDTree(tree_xys, distance_metric='Arc', radius=RADIUS_EARTH_MILES)
radius_in_miles = 0.5


def get_tree_count(r):
    xy = r['Longitude'], r['Latitude']
    distances, indices = bin_tree.query(
        xy, k=tree_count, distance_upper_bound=radius_in_miles)
    indices = indices[~np.isnan(indices)]
    return len(indices)

In [15]:

building_table['Total Tree Count within 0.5 Mile'] = building_table.apply(get_tree_count, axis=1)
building_table[:3]

Out[15]:

	Company Name	Address	Industry	Program	Effective Date	Savings	Borough	Latitude	Longitude	Month Count	Periodic Savings over Months	Total Tree Count within 0.5 Mile
0	139 ACA Realty, Inc.	43-23 35th Street	Commercial	ICIP	2008-04-07	123975.24	QUEENS	40.745706	-73.929565	116	1068.75	683788
2	14-10 123rd Street LLC	14-10 123rd Street	Commercial	ICIP	2011-03-04	21322.89	QUEENS	40.785144	-73.844833	81	263.25	683788
7	2840 Atlantic Avenue Realty Corp	2840 Atlantic Avenue	Commercial	ICIP	2008-06-16	105702.57	BROOKLYN	40.676789	-73.889346	114	927.22	683788

In [16]:

# Save your output files in target_folder
target_folder = '/tmp'
target_path = target_folder + '/Table with tree count.csv'
building_table.to_csv(target_path, index=False)

# Render the file as a table
print('final_table_path = %s' % target_path)

final_table_path = /tmp/Table with tree count.csv

In [17]:

building_table.to_csv('Table with tree count.csv', sep=',')

<font color=blue>Note: Tree count within 0.5 miles is attached as a new column to the building table at this point.<br> Ready table is saved to file 'Table with tree count.csv' for later analysis</font><br><br>

In [18]:

import pandas as pd
import numpy as np
table_tree_count = pd.read_csv('Table with tree count.csv',na_values='n/a')
table_tree_count.drop(['Unnamed: 0'], axis=1)
table_tree_count[:3]

Out[18]:

	Unnamed: 0	Company Name	Address	Industry	Program	Effective Date	Savings	Borough	Latitude	Longitude	Month Count	Periodic Savings over Months	Total Tree Count within 0.5 Mile
0	0	139 ACA Realty, Inc.	43-23 35th Street	Commercial	ICIP	2008-04-07	123975.24	QUEENS	40.745706	-73.929565	116	1068.75	683788
1	2	14-10 123rd Street LLC	14-10 123rd Street	Commercial	ICIP	2011-03-04	21322.89	QUEENS	40.785144	-73.844833	81	263.25	683788
2	7	2840 Atlantic Avenue Realty Corp	2840 Atlantic Avenue	Commercial	ICIP	2008-06-16	105702.57	BROOKLYN	40.676789	-73.889346	114	927.22	683788

Get average periodic savings from all companies within 0.5 mile for each location in building table (to be added - count in industry and program)¶

In [19]:

# Make kdtree
radius_in_miles = 0.5

from pysal.lib.cg.kdtree import KDTree
from pysal.lib.cg import RADIUS_EARTH_MILES
sav_xys = table_tree_count[['Longitude', 'Latitude']].values
bin_sav = KDTree(sav_xys, distance_metric='Arc', radius=RADIUS_EARTH_MILES)
sav_count = len(sav_xys)

In [20]:

def get_sav_average(r):
    xy = r['Longitude'], r['Latitude']
    distances, indices = bin_sav.query(
        xy,
        k=len(bin_sav.data),
        distance_upper_bound=radius_in_miles)
    indices = indices[~np.isnan(indices)]
    indices = [int(x) for x in indices]
    selected_sav_table = table_tree_count.loc[table_tree_count.index[indices]]
    return selected_sav_table['Periodic Savings over Months'].mean()

In [21]:

print(table_tree_count.loc([]))
table_tree_count['Periodic Savings within 0.5 Mile'] = table_tree_count.apply(get_sav_average, axis=1)
table_tree_count[:3]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-f1dae61f720d> in <module>
----> 1 print(table_tree_count.loc([]))
      2 table_tree_count['Periodic Savings within 0.5 Mile'] = table_tree_count.apply(get_sav_average, axis=1)
      3 table_tree_count[:3]

~/.virtualenvs/crosscompute/lib/python3.6/site-packages/pandas/core/indexing.py in __call__(self, axis)
    100 
    101         if axis is not None:
--> 102             axis = self.obj._get_axis_number(axis)
    103         new_self.axis = axis
    104         return new_self

~/.virtualenvs/crosscompute/lib/python3.6/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    349     @classmethod
    350     def _get_axis_number(cls, axis):
--> 351         axis = cls._AXIS_ALIASES.get(axis, axis)
    352         if is_integer(axis):
    353             if axis in cls._AXIS_NAMES:

TypeError: unhashable type: 'list'

In [ ]:

# Save your output files in target_folder
target_path = target_folder + '/Table with average monthly savings(within 0.5 Mile) and tree count(within 0.5 Mile).csv'
table_tree_count.to_csv(target_path, index=False)

# Render the file as a table
print('final_table_path = %s' % target_path)

In [ ]:

table_tree_count.to_csv('Table with average monthly savings(within 0.5 Mile) and tree count \
                        (within 0.5 Mile).csv', sep=',')

<font color=blue>Note: Tree count and average monthly savings whinin 0.5 miles are attached to building table at this point. Ready table is saved to file 'Table with average monthly savings(within 0.5 Mile) and tree count(within 0.5 Mile).csv' for later analysis

Training Model (in process)¶

In [ ]:

# Algorithm

# find dtypes

In [ ]:

# Begin creating our model
# Rearrange col
df = df[['Industry', 'Program', 'Postcode', 'Borough', 'Savings']]
#df.columns
# Split our validation dataset
array = df.values
X = array[:, 0:3]
Y = array[:, 3]
validation_size = .20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size = validation_size, random_state = seed)

In [ ]:

# Spot check algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [ ]:

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [ ]:

# Save your output files in target_folder
target_path = target_folder + '/Table with periodic savings and tree count.csv'
building_table.to_csv(target_path, index=False)

# Render the file as a table
print('final_table_path = %s' % target_path)

In [ ]:

# Export model

In [ ]:

# How is the tool made in cross compute?

In [ ]:

# import sys
# 'geopandas' in sys.modules

In [ ]:

# Convert WKT coordinates
# from geopandas import GeoDataFrame
# from shapely.geometry import Point

# geometry = [Point(xy) for xy in zip(building_table.Longitude, building_table.Latitude)]
# building_table = building_table.drop(['Longitude', 'Latitude'], axis=1)
# crs = {'init': 'epsg:4326'}
# gdf = GeoDataFrame(building_table, crs=crs, geometry=geometry)

# building_table.head()

import subprocess
subprocess.call('pip install geopandas'.split())

Get geometry points for map display¶

In [ ]:

ready_table_path = 'Table with average monthly savings(within 0.5 Mile) and tree count(within 0.5 Mile).csv'
search_radius_in_miles = 0.5

user_address = '28-10 Jackson Ave'    # this address can be located
#user_address = '236-238 25TH STREET' # this is an address geocode can't locate
target_folder = '/tmp'

In [ ]:

import pandas as pd
ready_table = pd.read_csv(ready_table_path,na_values='n/a')

In [ ]:

from shapely.geometry import Point
# Combining Lattitude and Longitude to create company geo coordinates:
ready_table['Coordinate'] = ready_table[['Longitude', 'Latitude']].values.tolist()
# Change the coordinates to a geoPoint
ready_table['Coordinate'] = ready_table['Coordinate'].apply(Point)
ready_table[:3]

In [ ]:

# Save your output files in target_folder
target_path = target_folder + '/Table with geometry points.csv'
ready_table.to_csv(target_path, index=False)

# Render the file as a table
print('final_table_path = %s' % target_path)

In [ ]:

ready_table.to_csv('Table with geometry points', sep=',')

In [ ]:

ready_table[:3]

<font color=blue>Note: geometry is attached to ready table at this point. <br>Next step: Filter out addresses from training dataset which can't be converted to coordinates in order to be displayed as reference companies on map.

Check all addresses in training dataset and delete those can't be converted to coordinate¶

<font color=blue>Note: This step is necessary because program will stop at addresses that can't be converted to geo coordiate and affect funtionality of reference data display on map.

In [ ]:

input_filename = 'Table with geometry points.csv'

In [ ]:

import pandas as pd
import requests
import logging
import time

logger = logging.getLogger('root')
logger.setLevel(logging.DEBUG)
# create console handler
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)

#------------------ CONFIGURATION -------------------------------

API_KEY = 'AIzaSyDNqc0tWzXHx_wIp1w75-XTcCk4BSphB5w'

BACKOFF_TIME = 30
output_filename = 'Table with all addresses check.csv'
input_filename = 'Table with geometry points.csv'
address_column_name = 'Address'
RETURN_FULL_RESULTS = False

#------------------ DATA LOADING --------------------------------

data = pd.read_csv(input_filename, encoding='utf8')
if address_column_name not in data.columns:
    raise ValueError('Missing Address column in input data')
addresses = data['Address'].tolist()

#------------------	FUNCTION DEFINITIONS ------------------------

def get_google_results(address, api_key='AIzaSyDNqc0tWzXHx_wIp1w75-XTcCk4BSphB5w', return_full_response=False):
    geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json?address={}'.format(address)
    if api_key is not None:
        geocode_url = geocode_url + '&key={}'.format(api_key)
    results = requests.get(geocode_url)
    results = results.json()
    
    if len(results['results']) == 0:
        output = {
            'formatted_address' : None,
            'latitude': None,
            'longitude': None,
        }
    else:    
        answer = results['results'][0]
        output = {
            'formatted_address' : answer.get('formatted_address'),
            'latitude': answer.get('geometry').get('location').get('lat'),
            'longitude': answer.get('geometry').get('location').get('lng'),
}
    # Append some other details:    
    output['input_string'] = address
    output['status'] = results.get('status')
    if return_full_response is True:
        output['response'] = results
    
    return output

#------------------ PROCESSING LOOP -----------------------------

# Create a list to hold results
results = []

for address in addresses:
    # While the address geocoding is not finished:
    geocoded = False
    while geocoded is not True:
        # Geocode the address with google
        try:
            geocode_result = get_google_results(address, API_KEY, return_full_response=RETURN_FULL_RESULTS)
        except Exception as e:
            logger.exception(e)
            logger.error('Major error with {}'.format(address))
            logger.error('Skipping!')
            geocoded = True
            
        # If we're over the API limit, backoff for a while and try again later.
        if geocode_result['status'] == 'OVER_QUERY_LIMIT':
            logger.info('Hit Query Limit! Backing off for a bit.')
            time.sleep(BACKOFF_TIME * 60) # sleep for 30 minutes
            geocoded = False
        else:
            # If we're ok with API use, save the results
            # Note that the results might be empty / non-ok - log this
            if geocode_result['status'] != 'OK':
                logger.warning('Error geocoding {}: {}'.format(address, geocode_result['status']))
            logger.debug('Geocoded: {}: {}'.format(address, geocode_result['status']))
            results.append(geocode_result)           
            geocoded = True

    # Print status every 100 addresses
    if len(results) % 100 == 0:
        logger.info('Completed {} of {} address'.format(len(results), len(addresses)))
            
    # Every 500 addresses, save progress to file(in case of a failure so you have something!)
    if len(results) % 500 == 0:
        pd.DataFrame(results).to_csv('{}_bak'.format(output_filename))

logger.info('Finished geocoding all addresses')
pd.DataFrame(results).to_csv(output_filename, encoding='utf8')

<font color=blue>Note: Finished geocoding all addresses at this point.

In [ ]:

geocoded_table = pd.read_csv('Table with all addresses check.csv')
#geocoded_table=pd.DataFrame(results)
geocoded_table = geocoded_table.rename(columns={'input_string': 'Address',
                                               'input_string': 'Address'}) 
len(geocoded_table)

<font color=blue>Note: <br>1. Check geocoded_table in this code, or output table "Table with all addresses check.csv". We know 12 addresses can't be converted, which need to be deleted from building data for map display to function properly.<br>2. Next step: delete addresses can't be converted.

In [ ]:

merged_table = pd.merge(ready_table,
                 geocoded_table[['Address','formatted_address','status']],
                 on='Address', how='right')
df2 = merged_table[merged_table.status == 'OK']
df2 = df2.drop(df2[(df2.status != 'OK')].index)
df2 = df2.drop_duplicates()
df2

In [ ]:

# Save your output files in target_folder
target_folder = '/tmp'
target_path = target_folder + '/Ready Table with 490 rows.csv'
df2.to_csv(target_path, index=False)

# Render the file as a table
print('final_table_path = %s' % target_path)

In [ ]:

df2.to_csv('Ready Table with 490 rows.csv', sep=',')

Now, we have 490 rows left in our training dataset with¶

-- 1. periodic savings over the months from each company¶

-- 2. tree count within 0.5 miles around center point (user input address)¶

-- 3. average priodic savings from all companies within 0.5 miles around center point (user input address)¶

(To do: filter also by industry and program)¶

-- 4. geometry points of all companies to display on map as reference data¶

In [ ]:

Pay Notebook Creator: Haige Cui	0
Set Container: Numerical CPU with TINY Memory for 10 Minutes	0
Total	0

ECSP

Atlas ESCP¶

Prepare building dataset¶

Now load the tree dataset¶

Get total tree count within 0.5 mile for each location in building table¶

Get average periodic savings from all companies within 0.5 mile for each location in building table (to be added - count in industry and program)¶

Training Model (in process)¶

Get geometry points for map display¶

Check all addresses in training dataset and delete those can't be converted to coordinate¶

Now, we have 490 rows left in our training dataset with¶

-- 1. periodic savings over the months from each company¶

-- 2. tree count within 0.5 miles around center point (user input address)¶

-- 3. average priodic savings from all companies within 0.5 miles around center point (user input address)¶

(To do: filter also by industry and program)¶

-- 4. geometry points of all companies to display on map as reference data¶