Prepare and Fit Spatial Regression Models 20190222




Notebook Creator: Roy Hyunjin Han

Geocode Addresses

Here we review different techniques for handling errors when geocoding addresses.

In [ ]:
from geopy import GoogleV3
api_key = 'AIzaSyDNqc0tWzXHx_wIp1w75-XTcCk4BSphB5w'
geocode = GoogleV3(api_key).geocode
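
Besides returning None for an address it cannot resolve, a geocoder can also fail by raising an exception, for example when a request times out. The following is a minimal sketch of a retry wrapper and is not part of the original notebook. The helper name geocode_with_retry and its parameters are hypothetical. It takes the geocode function as an argument, so you could pass the GoogleV3 geocode defined above together with the exceptions you consider transient, such as geopy.exc.GeocoderTimedOut. The example below uses a stand-in function instead of a real geocoder.

```python
import time


def geocode_with_retry(
        geocode, address, retry_on=(Exception,), attempts=3,
        delay_in_seconds=1.0):
    """Call geocode(address), retrying when a transient error is raised."""
    for attempt_index in range(attempts):
        try:
            return geocode(address)
        except retry_on:
            # Re-raise after the last attempt so the caller sees the error
            if attempt_index == attempts - 1:
                raise
            # Wait a little longer before each new attempt
            time.sleep(delay_in_seconds * (attempt_index + 1))


# Hypothetical stand-in for a geocoder that times out twice, then answers
calls = []


def flaky_geocode(address):
    calls.append(address)
    if len(calls) < 3:
        raise TimeoutError('simulated timeout')
    return address.upper()


result = geocode_with_retry(
    flaky_geocode, '236-238 25th street, brooklyn, ny',
    retry_on=(TimeoutError,), delay_in_seconds=0)
```

Passing a narrow tuple in retry_on keeps the wrapper from hiding errors that a retry cannot fix, such as an invalid API key.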

Fill Incomplete Addresses

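If an address is incomplete, the geocoder may not be able to resolve it to a unique location. The same street address may exist in several places, so adding the city and state narrows the search.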

In [ ]:
address = '236-238 25TH STREET'
geocode(address) is not None
In [ ]:
address = '236-238 25TH STREET, BROOKLYN, NY'
geocode(address)
In [ ]:
address = '236-238 25TH STREET, BRONX, NY'
geocode(address)
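
When the city is unknown, it may help to see which completions of the address the geocoder accepts. The following is a minimal sketch under stated assumptions: the helper name find_matching_regions and the candidate list are hypothetical, the geocode function is passed in as an argument, and the example uses a stand-in with placeholder values instead of a real geocoder.

```python
def find_matching_regions(geocode, address, candidate_regions):
    """Return {region: location} for each completed address that geocodes."""
    location_by_region = {}
    for region in candidate_regions:
        location = geocode(address + ', ' + region)
        if location is not None:
            location_by_region[region] = location
    return location_by_region


# Hypothetical stand-in for a real geocoder.
# The tuples are placeholder values, not real coordinates.
known_locations = {
    '236-238 25TH STREET, BROOKLYN, NY': (1.0, 2.0),
    '236-238 25TH STREET, BRONX, NY': (3.0, 4.0),
}


def fake_geocode(address):
    return known_locations.get(address)


matches = find_matching_regions(
    fake_geocode, '236-238 25TH STREET',
    ['BROOKLYN, NY', 'BRONX, NY', 'NOWHERE, NY'])
```

If more than one region matches, the incomplete address is ambiguous and you need either a default region or a manual review.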

You can use the usaddress package to detect whether the city is missing and, if so, append a default region.

In [ ]:
import subprocess
assert subprocess.call('pip install usaddress'.split()) == 0
In [ ]:
address = '236-238 25TH STREET'
address
In [ ]:
import usaddress
parts = usaddress.parse(address)
parts
In [ ]:
# usaddress.parse returns (token, label) pairs; key the dictionary by label
value_by_type = {label: token for token, label in parts}
value_by_type
In [ ]:
missing_place = 'PlaceName' not in value_by_type
missing_state = 'StateName' not in value_by_type
missing_zip = 'ZipCode' not in value_by_type

if missing_place and missing_state and missing_zip:
    address += ', Brooklyn, NY'
address
In [ ]:
geocode(address)
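
Because the value_by_type dictionary above is keyed by label, a label that occurs more than once keeps only its last token. That does not affect the membership checks, but if you need the full value, for example a two-word city, you could group the tokens by label instead. The following is a minimal sketch; the helper name group_tokens_by_label is hypothetical, and the pairs are hand-written in the same (token, label) shape that usaddress.parse returns rather than actual parser output.

```python
from collections import defaultdict


def group_tokens_by_label(parts):
    """Join the tokens of each label in order of appearance."""
    tokens_by_label = defaultdict(list)
    for token, label in parts:
        # Drop trailing commas that may remain attached to tokens
        tokens_by_label[label].append(token.rstrip(','))
    return {
        label: ' '.join(tokens)
        for label, tokens in tokens_by_label.items()}


# Hand-written (token, label) pairs for illustration, not real parser output
parts = [
    ('415', 'AddressNumber'),
    ('E', 'StreetNamePreDirectional'),
    ('71st', 'StreetName'),
    ('St,', 'StreetNamePostType'),
    ('New', 'PlaceName'),
    ('York,', 'PlaceName'),
    ('NY', 'StateName'),
]
value_by_label = group_tokens_by_label(parts)
# For comparison, the one-line dictionary keeps only the last token per label
last_token_by_label = {label: token for token, label in parts}
```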

Geocode Address Table using DataFrame.apply

In [ ]:
from geopy import GoogleV3
api_key = 'AIzaSyDNqc0tWzXHx_wIp1w75-XTcCk4BSphB5w'
geocode = GoogleV3(api_key).geocode
In [ ]:
import subprocess
assert subprocess.call('pip install usaddress'.split()) == 0
In [ ]:
import numpy as np
from usaddress import parse as parse_address

def fix_address(address, default_region):
    # parse_address returns (token, label) pairs; key the dictionary by label
    address_parts = parse_address(address)
    value_by_type = {label: token for token, label in address_parts}
    missing_place = 'PlaceName' not in value_by_type
    missing_state = 'StateName' not in value_by_type
    missing_zip = 'ZipCode' not in value_by_type
    # Append the default region only if the address has no city, state or zip
    if missing_place and missing_state and missing_zip:
        address += ', ' + default_region
    return address
    
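def get_location(row):
    # Copy the row so that the original table is not modified in place
    row = row.copy()
    address = fix_address(row['address'], default_region='New York, NY')
    location = geocode(address)
    # Return a row in every case so that all results share the same columns;
    # failed lookups get NaN coordinates and are removed later with dropna
    if location is None:
        row['longitude'] = row['latitude'] = np.nan
    else:
        row['longitude'] = location.longitude
        row['latitude'] = location.latitude
    return row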
In [ ]:
import pandas as pd

address_table = pd.DataFrame([
    ['118 West 22nd Street'],
    ['415 E 71st St, New York, NY'],
    ['abcdefg'],
    ['65-60 Kissena Blvd, Flushing, NY'],
], columns=['address'])
address_table
In [ ]:
geolocated_table = address_table.apply(get_location, axis=1)
geolocated_table
In [ ]:
clean_table = geolocated_table.dropna(subset=['longitude', 'latitude'])
clean_table
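
When a table contains repeated addresses, geocoding row by row sends the same request more than once. The following is a minimal sketch of memoizing the geocoder with functools.lru_cache, assuming that addresses are plain strings and that a result can be reused within a session. The names make_cached_geocode and fake_geocode are hypothetical, and the stand-in returns placeholder values instead of real coordinates.

```python
from functools import lru_cache


def make_cached_geocode(geocode, maxsize=None):
    """Wrap geocode so that each distinct address is requested only once."""
    @lru_cache(maxsize=maxsize)
    def cached_geocode(address):
        return geocode(address)
    return cached_geocode


# Hypothetical stand-in that records every request it receives
requested_addresses = []


def fake_geocode(address):
    requested_addresses.append(address)
    # Placeholder result; None imitates an address that cannot be resolved
    return None if address == 'abcdefg' else (0.0, 0.0)


cached_geocode = make_cached_geocode(fake_geocode)
addresses = ['118 West 22nd Street', 'abcdefg', '118 West 22nd Street']
locations = [cached_geocode(x) for x in addresses]
```

Inside get_location you would call cached_geocode(address) in place of geocode(address). Note that a None result is cached as well, so an address that fails once is not requested again until the cache is cleared.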

Geocode Address Table using pandas.Series

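Although DataFrame.apply is recommended because it is more flexible, you can also build a pandas.Series from a dictionary keyed by row index when you want to define a new column but some values are missing. Assigning the series aligns it on the index, so rows without an entry receive NaN.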

In [ ]:
import pandas as pd
from shapely.geometry import Point

d = {}
for index, row in address_table.iterrows():
    address = row['address']
    location = geocode(address)
    if not location:
        continue
    geometry = Point(location.longitude, location.latitude)
    d[index] = geometry.wkt

address_table['wkt'] = pd.Series(d)
address_table
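
The point is built as Point(longitude, latitude), so each WKT string stores the longitude first. If you later need the coordinates back as numbers, the following minimal sketch parses such a string with the standard library only. The helper name parse_point_wkt is hypothetical, it handles only two-dimensional POINT strings, and the coordinates in the example are placeholder values.

```python
import re

POINT_PATTERN = re.compile(r'^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$')


def parse_point_wkt(value):
    """Return (longitude, latitude) from a POINT WKT string, else None."""
    # Rows that were skipped hold NaN (a float) instead of a string
    if not isinstance(value, str):
        return None
    match = POINT_PATTERN.match(value)
    if match is None:
        return None
    longitude, latitude = match.groups()
    return float(longitude), float(latitude)


# Placeholder values for illustration
coordinates = parse_point_wkt('POINT (-73.99 40.74)')
missing = parse_point_wkt(float('nan'))
```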