Introduction to Computational Analysis




In [1]:
import numpy as np
from scripts import make_shirts
shirts = make_shirts()

Split testing analysis

Stacy runs an online custom t-shirt business. She is experimenting with layout design to increase sales. Here are the different tweaks she has tried:

  • Include a photo of the t-shirt.
  • Show a real person modeling the t-shirt.
  • Vary price.
  • Encourage more reviews.
  • Encourage longer reviews.
  • Advertise a t-shirt design on the homepage.
  • List a t-shirt design as being on sale.

Since her budget is limited, Stacy wants to focus on the layout enhancements that actually affect sales. Please rank the layout enhancements based on 500 product records.

Explore dataset

In [2]:
# Look at the first record
list(zip(shirts.feature_names, shirts.data[0]))
In [3]:
# Check whether the first product sold
print(shirts.target[0])

Count the number of shirts that sold.

In [4]:
# Type your solution here and press CTRL-ENTER
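
One possible solution, assuming shirts.target marks a sold shirt with 1 and an unsold shirt with 0 (as cell In [3] suggests):

# Sum the binary target vector to count sales
print(shirts.target.sum())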

Compare price histograms between shirts that sold and shirts that didn't sell.

In [5]:
# Type your solution here and press CTRL-ENTER
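
One way to approach it, as a sketch; the feature name 'price' is an assumption here, so check shirts.feature_names for the actual column:

import matplotlib.pyplot as plt

# Locate the price column (the exact feature name is an assumption)
priceIndex = list(shirts.feature_names).index('price')
prices = shirts.data[:, priceIndex]
sold = shirts.target == 1

# Overlay the price distributions for sold vs. unsold shirts
plt.hist(prices[sold], bins=20, alpha=0.5, label='Sold')
plt.hist(prices[~sold], bins=20, alpha=0.5, label='Did not sell')
plt.xlabel('Price')
plt.ylabel('# of shirts')
plt.legend()
plt.show()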

Select model

In [6]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_model(model):
    # Mean accuracy over three stratified folds; higher is better
    return np.mean(cross_val_score(
        model,
        shirts.data,
        shirts.target,
        scoring='accuracy',
        cv=StratifiedKFold(n_splits=3),
        n_jobs=-1))
In [7]:
from sklearn.naive_bayes import GaussianNB
evaluate_model(GaussianNB())
In [8]:
from sklearn.neighbors import KNeighborsClassifier
evaluate_model(KNeighborsClassifier())
In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = [
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    LogisticRegression(),
    GaussianNB(),
    SVC(),
]
bestScore = 0
bestModel = None
# Keep the classifier with the highest mean cross-validation accuracy
for model in models:
    score = evaluate_model(model)
    if score > bestScore:
        bestScore = score
        bestModel = model
print(bestModel)
print(bestScore)

Rank features

In [10]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

featureSelector = RFE(estimator=LogisticRegression(), n_features_to_select=1, step=1)
featureSelector.fit(shirts.data, shirts.target)
sorted(zip(featureSelector.ranking_, shirts.feature_names))
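
A ranking of 1 marks the feature that recursive feature elimination kept longest, i.e. the strongest predictor of a sale; higher numbers mark features that were eliminated earlier.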

Select features

In [11]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

featureSelector = RFECV(
    estimator=LogisticRegression(),
    step=1,
    cv=StratifiedKFold(n_splits=3),
    scoring='accuracy')
featureSelector.fit(shirts.data, shirts.target)

# Plot number of features against cross-validation accuracy
import matplotlib.pyplot as plt
meanScores = featureSelector.cv_results_['mean_test_score']
plt.figure()
plt.xlabel('# of features selected')
plt.ylabel('Mean cross-validation accuracy')
plt.plot(range(1, len(meanScores) + 1), meanScores)
plt.show()
In [12]:
print('Optimal number of features = %d' % featureSelector.n_features_)
print(sorted(zip(featureSelector.ranking_, shirts.feature_names))[:featureSelector.n_features_])
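
The features that survive the elimination are the layout enhancements most worth Stacy's limited budget.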