Prepare and Fit Spatial Regression Models 20190222





Load NYC Open Data

Here are different ways to load NYC Open Data through the Socrata Open Data API (SODA). First, find the API endpoint for your dataset.

  1. Open the page containing your dataset (e.g. https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/pi5s-9p35).
  2. Click Export > SODA API and find the API Endpoint.
  3. Copy the CSV API Endpoint (e.g. https://data.cityofnewyork.us/resource/nwxe-4ae8.csv).
  4. Paste the API Endpoint into the code block below.

Get CSV Endpoint URL

In [ ]:
url = 'https://data.cityofnewyork.us/resource/nwxe-4ae8.csv'
In [ ]:
assert url.endswith('.csv'), 'Use CSV endpoint because it is faster'
In [ ]:
import pandas as pd

Get Count

In [ ]:
pd.read_csv(url + '?$select=count(*)')

Load One

In [ ]:
t = pd.read_csv(url + '?$limit=1')
t.iloc[0]
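
To see which columns are available (useful when choosing columns to select below), you can list the column names of this one-row sample.

In [ ]:
# List the available columns from the one-row sample loaded above
list(t.columns)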

Select Rows

In [ ]:
pd.read_csv(url + "?$select=count(*)&$where=status!='Alive'")
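
The query above only counts the matching rows. Here is a minimal sketch that loads a few of the matching rows themselves, reusing the status column from the filter above.

In [ ]:
# Sketch: fetch a few rows that match the filter instead of just counting them
pd.read_csv(url + "?$select=tree_id,status&$where=status!='Alive'&$limit=3")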

Select Columns

In [ ]:
pd.read_csv(url + '?$select=tree_id,latitude,longitude&$limit=3')
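
The SODA API also accepts an $order parameter; ordering keeps the row sequence deterministic, which helps when paging with $offset later. A minimal sketch:

In [ ]:
# Sketch: sort the selected columns by tree_id for a stable row order
pd.read_csv(url + '?$select=tree_id,latitude,longitude&$order=tree_id&$limit=3')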

Load with Buffering

Use buffering if you would like to load a large dataset but memory is limited (i.e. you keep getting a message that the kernel died).

In [ ]:
total_count = pd.read_csv(url + '?$select=count(*)')['count'][0]
total_count
In [ ]:
import pandas as pd

def load(
    endpoint_url,
    selected_columns=None,
    buffer_size=1000,
    search_term_by_column=None,
    **kw,
):
    # Request at most buffer_size rows at a time
    buffer_url = f'{endpoint_url}?$limit={buffer_size}'
    if selected_columns:
        select_string = ','.join(selected_columns)
        buffer_url += f'&$select={select_string}'
    # Combine substring filters into a single $where clause;
    # %25 is the URL-encoded % wildcard for the SoQL like operator
    where_expressions = [
        f"{column}+like+'%25{search_term}%25'"
        for column, search_term in (search_term_by_column or {}).items()]
    if where_expressions:
        buffer_url += '&$where=' + '+AND+'.join(where_expressions)
    print(buffer_url)
    tables = []

    # Choose the parser that matches the endpoint format
    if endpoint_url.endswith('.json'):
        f = pd.read_json
    else:
        f = pd.read_csv

    # Page through the dataset until a request returns no rows
    t = f(buffer_url, **kw)
    while len(t):
        print(len(tables) * buffer_size + len(t))
        tables.append(t)
        offset = buffer_size * len(tables)
        t = f(buffer_url + f'&$offset={offset}', **kw)
    return pd.concat(tables, ignore_index=True, sort=False)
In [ ]:
endpoint_url = 'https://data.cityofnewyork.us/resource/nwxe-4ae8.csv'
selected_columns = 'tree_id', 'latitude', 'longitude'
buffer_size = 100000
t = load(endpoint_url, selected_columns, buffer_size)
In [ ]:
assert len(t) == total_count
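
The load function above also accepts search_term_by_column for substring filtering. Here is a minimal sketch, assuming this dataset has a spc_common column holding the species common name.

In [ ]:
# Sketch: substring filter via the SoQL like operator
# (spc_common is an assumed column name for the species common name)
maple_t = load(
    endpoint_url,
    selected_columns=('tree_id', 'spc_common'),
    buffer_size=buffer_size,
    search_term_by_column={'spc_common': 'maple'})
len(maple_t)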