Prepare and Fit Spatial Regression Models 20190222




Notebook Creator: Roy Hyunjin Han

Train Model to Estimate Graduation Rate from Tree Count

Train Dummy Model

In [ ]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
In [ ]:
import pandas as pd
dataset = pd.DataFrame([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
], columns=['x1', 'x2', 'y'])
dataset
In [ ]:
X = dataset[['x1', 'x2']].values
X
In [ ]:
y = dataset['y'].values
y
In [ ]:
model.fit(X, y)
In [ ]:
model.predict([[8, 9]])
In [ ]:
model.predict([
    [0, 1],
    [8, 9],
])
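
To see what the dummy model learned, it can help to inspect its fitted parameters. The sketch below rebuilds the same toy dataset under separate names (`toy_X`, `toy_y` and `toy_model` are names chosen here so that the notebook's own variables are left alone). Because x1 and x2 are perfectly collinear in this toy data, many coefficient pairs fit equally well, so read the coefficients as one possible solution rather than the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same values as the dummy dataset above, under separate names
toy_X = np.array([[1, 2], [4, 5], [7, 8]])
toy_y = np.array([3, 6, 9])
toy_model = LinearRegression().fit(toy_X, toy_y)

# Learned parameters of the fitted line
print(toy_model.coef_, toy_model.intercept_)

# R^2 on the training data; the toy data is exactly linear
toy_r2 = toy_model.score(toy_X, toy_y)
print(toy_r2)

# In the toy data, y is always x2 + 1, so these should be close to 2 and 10
toy_predictions = toy_model.predict([[0, 1], [8, 9]])
print(toy_predictions)
```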

Save Dummy Model

In [ ]:
# Save using pickle
from pickle import dump
# Use a with-block so that the file is flushed and closed after saving
with open('dummy-model.pkl', 'wb') as f:
    dump(model, f)
In [ ]:
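# Save using joblib, which is another option
import subprocess
import sys
# Install joblib into the same Python environment that runs this notebook
subprocess.call([sys.executable, '-m', 'pip', 'install', 'joblib'])
from joblib import dump
dump(model, '/tmp/dummy-model.joblib')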

Load Dummy Model

In [ ]:
from pickle import load
# Use a with-block so that the file is closed after loading
with open('dummy-model.pkl', 'rb') as f:
    model = load(f)
model
In [ ]:
# Load using joblib, which is another option
# import subprocess
# subprocess.call('pip install joblib'.split())
from joblib import load
model = load('/tmp/dummy-model.joblib')
model
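
One way to check that saving and loading preserved the model is to compare predictions before and after a round trip. This is a minimal sketch that writes to a temporary folder; the file name `roundtrip-model.pkl` and the variable names are assumptions made for the illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small model on the same toy values as the dummy dataset
original = LinearRegression().fit([[1, 2], [4, 5], [7, 8]], [3, 6, 9])

# Save and reload inside a temporary folder that is removed afterwards
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'roundtrip-model.pkl'
    with open(path, 'wb') as f:
        pickle.dump(original, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original
samples = [[0, 1], [8, 9]]
same_predictions = np.allclose(
    original.predict(samples), restored.predict(samples))
print(same_predictions)
```

Note that pickle files should only be loaded from sources you trust, because unpickling can run arbitrary code.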

Train Example Model

Load Your Training Dataset

In [ ]:
import pandas as pd
t = pd.read_csv('example-dataset.csv')
t

Prepare Feature Matrices X1, X2 and Target Variable y

In [ ]:
X1 = t[[
    'Tree Count Within 100 Meters',
    'Sum of Distances from Trees Within 100 Meters',
    'Average Risk of Trees Within 100 Meters']].values
X1
In [ ]:
X2 = t[[
    'Tree Count Within 100 Meters',
    'Average Risk of Trees Within 100 Meters']].values
X2
In [ ]:
y = t['Graduation Rate']
y
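
Before fitting, it may be worth confirming that each feature matrix has one row per target value and that nothing is missing. The sketch below uses a few made-up rows with the same column names; the values are invented for illustration and are not taken from example-dataset.csv.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    'Tree Count Within 100 Meters': [3, 0, 7],
    'Sum of Distances from Trees Within 100 Meters': [120.5, 0.0, 410.2],
    'Average Risk of Trees Within 100 Meters': [0.2, 0.0, 0.5],
    'Graduation Rate': [0.81, 0.64, 0.90],
})
feature_columns = [
    'Tree Count Within 100 Meters',
    'Sum of Distances from Trees Within 100 Meters',
    'Average Risk of Trees Within 100 Meters',
]
sample_X = sample[feature_columns].values
sample_y = sample['Graduation Rate'].values

# Each row of features needs exactly one target value
rows_match = sample_X.shape[0] == sample_y.shape[0]
# Most scikit-learn estimators raise an error on missing values
has_missing = bool(
    sample[feature_columns + ['Graduation Rate']].isna().any().any())
print(sample_X.shape, sample_y.shape, rows_match, has_missing)
```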

Compare Models That Use Different Features and Algorithms

Choose a metric to evaluate the performance of each fitted model.

The right metric depends on whether you are performing classification, clustering or regression.

If the target variable that you want to predict is

  • a category (classification), then use a classification metric like f1
  • a number (regression), then use a regression metric like neg_mean_absolute_error

Scikit-learn scorers follow the convention that higher is better, so neg_mean_absolute_error is the mean absolute error multiplied by -1. A score closer to zero means less error.

See the scikit-learn documentation on model evaluation for more information about the available metrics.
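
The difference between the two kinds of metric can be sketched on synthetic data. Everything below (the random data, the `demo_` names, and the choice of LogisticRegression as the classifier) is an assumption made for illustration and is not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features, invented for this example
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(60, 2))

# A numeric target calls for a regression metric
demo_number = 3 * demo_X[:, 0] - 2 * demo_X[:, 1] + rng.normal(0, 1, size=60)
regression_scores = cross_val_score(
    LinearRegression(), demo_X, demo_number, cv=3,
    scoring='neg_mean_absolute_error')

# A categorical target calls for a classification metric
demo_category = (demo_X[:, 0] > 5).astype(int)
classification_scores = cross_val_score(
    LogisticRegression(), demo_X, demo_category, cv=3,
    scoring='f1')

# Regression scores are zero or negative; negate them to get the error
print(regression_scores, -regression_scores.mean())
# f1 scores lie between 0 and 1
print(classification_scores)
```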

In [ ]:
from sklearn.model_selection import cross_val_score
models = []
scores = []

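def train(model, X):
    # Fit on the full dataset so that the stored model is ready to use
    model.fit(X, y)
    models.append(model)
    # Estimate error with 3-fold cross-validation, which fits clones
    # of the model and leaves the stored model unchanged
    score = cross_val_score(
        model, X, y, cv=3,
        scoring='neg_mean_absolute_error',
    ).mean()
    # Scores are negated errors, so values closer to zero are better
    scores.append(score)
    return score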
In [ ]:
from sklearn.linear_model import LinearRegression
train(LinearRegression(), X1)
In [ ]:
train(LinearRegression(), X2)
In [ ]:
from sklearn.linear_model import BayesianRidge
train(BayesianRidge(), X1)
In [ ]:
train(BayesianRidge(), X2)
In [ ]:
from sklearn.svm import SVR
train(SVR(gamma='scale'), X1)
In [ ]:
train(SVR(gamma='scale'), X2)
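
When comparing several combinations, it can be easier to keep each score next to a label. The sketch below is one possible way to do that on synthetic data; the data, the labels and the `results` dictionary are assumptions for illustration and are separate from the `models` and `scores` lists above.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic data, invented for this example
rng = np.random.default_rng(0)
features = rng.uniform(0, 10, size=(30, 3))
target = 2 * features[:, 0] + features[:, 2] + rng.normal(0, 1, size=30)

feature_sets = {
    'all three columns': features,
    'first and last columns': features[:, [0, 2]],
}
algorithms = {
    'LinearRegression': LinearRegression,
    'BayesianRidge': BayesianRidge,
    'SVR': lambda: SVR(gamma='scale'),
}

# Score every combination of algorithm and feature set
results = {}
for algorithm_name, make_model in algorithms.items():
    for feature_name, matrix in feature_sets.items():
        results[(algorithm_name, feature_name)] = cross_val_score(
            make_model(), matrix, target, cv=3,
            scoring='neg_mean_absolute_error').mean()

# Print from best (closest to zero) to worst
for label, score in sorted(
        results.items(), key=lambda item: item[1], reverse=True):
    print(label, round(score, 3))
```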

Choose Model with Least Error

In [ ]:
import numpy as np
# Scores are negated errors, so the highest score has the least error
best_index = np.argmax(scores)
best_index
In [ ]:
best_model = models[best_index]
best_model
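
The reason `argmax` selects the model with the least error can be seen with a few made-up scores. The numbers below are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up negated errors, in the order the models were trained
example_scores = [-5.2, -3.1, -4.0]

# The highest negated error ...
best_by_score = int(np.argmax(example_scores))

# ... is the lowest error once the sign is flipped back
example_errors = [-score for score in example_scores]
best_by_error = int(np.argmin(example_errors))

print(best_by_score, best_by_error)
```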
In [ ]:
import pickle
# Use a with-block so that the file is flushed and closed after saving
with open('/tmp/model.pkl', 'wb') as f:
    pickle.dump(best_model, f)