FHV_BUS_Trips




Pay Notebook Creator: Ning Wei0
Set Container: Numerical CPU with TINY Memory for 10 Minutes 0
Total0
In [1]:
#CrossCompute
monthly_for_hire_vehicle_count_table_path = 'monthly_FHV_count.csv'
target_folder = '/tmp'
In [2]:
from io import BytesIO
from zipfile import ZipFile
import requests

url = 'http://web.mta.info/persdashboard/perxml/MTA_Performance_Datall.zip'
content = requests.get(url)

#unzip the content
f = ZipFile(BytesIO(content.content))
f.namelist()
Out[2]:
['MTA_Performance_Agencies.csv',
 'MTA_Performance_LIRR.csv',
 'MTA_Performance_MNR.csv',
 'MTA_Performance_MTABUS.csv',
 'MTA_Performance_NYCT.csv',
 'MTA_Performance_TBTA.csv',
 'Performance-MetaData.xlsx',
 'Performance_LIRR.csv',
 'Performance_LIRR.xml',
 'Performance_MNRR.csv',
 'Performance_MNRR.xml',
 'Performance_MTABUS.csv',
 'Performance_MTABUS.xml',
 'Performance_NYCT.csv',
 'Performance_NYCT.xml',
 'Performance_schema.xsd',
 'Performance_TBTA.csv',
 'Performance_TBTA.xml']
In [6]:
import pandas as pd
df1 = pd.read_csv(f.open('Performance_MTABUS.csv'))
In [7]:
df1 = df1[["INDICATOR_NAME", "PERIOD_YEAR", "PERIOD_MONTH", "MONTHLY_ACTUAL"]]
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1813 entries, 0 to 1812
Data columns (total 4 columns):
INDICATOR_NAME    1813 non-null object
PERIOD_YEAR       1813 non-null int64
PERIOD_MONTH      1813 non-null int64
MONTHLY_ACTUAL    1813 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 56.7+ KB
In [8]:
mta_df = df1[((df1.PERIOD_YEAR > 2015))&(df1.INDICATOR_NAME == "Total Ridership - MTA Bus ")]
mta_df
Out[8]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
INDICATOR_NAME PERIOD_YEAR PERIOD_MONTH MONTHLY_ACTUAL
334 Total Ridership - MTA Bus 2016 1 9464783.00
335 Total Ridership - MTA Bus 2016 2 9650946.00
336 Total Ridership - MTA Bus 2016 3 11393621.00
337 Total Ridership - MTA Bus 2016 4 10566032.00
338 Total Ridership - MTA Bus 2016 5 11022447.00
339 Total Ridership - MTA Bus 2016 6 10778433.00
340 Total Ridership - MTA Bus 2016 7 10002577.00
341 Total Ridership - MTA Bus 2016 8 10491025.00
342 Total Ridership - MTA Bus 2016 9 10898004.00
343 Total Ridership - MTA Bus 2016 10 10831627.00
344 Total Ridership - MTA Bus 2016 11 10414870.00
345 Total Ridership - MTA Bus 2016 12 10102793.00
346 Total Ridership - MTA Bus 2017 1 9840105.00
347 Total Ridership - MTA Bus 2017 2 9289269.80
348 Total Ridership - MTA Bus 2017 3 10774447.00
349 Total Ridership - MTA Bus 2017 4 9868714.00
350 Total Ridership - MTA Bus 2017 5 10948781.97
351 Total Ridership - MTA Bus 2017 6 10415892.00
352 Total Ridership - MTA Bus 2017 7 9835319.00
353 Total Ridership - MTA Bus 2017 8 10182683.00
354 Total Ridership - MTA Bus 2017 9 10293093.00
355 Total Ridership - MTA Bus 2017 10 10879804.00
356 Total Ridership - MTA Bus 2017 11 10230691.00
357 Total Ridership - MTA Bus 2017 12 9592517.00
358 Total Ridership - MTA Bus 2018 1 9378094.00
359 Total Ridership - MTA Bus 2018 2 9153053.00
360 Total Ridership - MTA Bus 2018 3 10383080.00
361 Total Ridership - MTA Bus 2018 4 10145682.00
In [9]:
dates = pd.to_datetime(dict(year=mta_df.PERIOD_YEAR, month=mta_df.PERIOD_MONTH, day=1))
mta_df.loc[:,'Date'] = dates
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pandas/core/indexing.py:362: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/pandas/core/indexing.py:543: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
In [10]:
df2 = pd.read_csv(f.open('Performance_NYCT.csv'), encoding="iso-8859-1")
df2 = df2[["INDICATOR_NAME", "PERIOD_YEAR", "PERIOD_MONTH", "MONTHLY_ACTUAL"]]
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9213 entries, 0 to 9212
Data columns (total 4 columns):
INDICATOR_NAME    9213 non-null object
PERIOD_YEAR       9213 non-null int64
PERIOD_MONTH      9213 non-null int64
MONTHLY_ACTUAL    9213 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 288.0+ KB
In [16]:
df2['INDICATOR_NAME'].unique()
Out[16]:
array(['Employee Lost Time and Restricted Duty Rate ',
       'Mean Distance Between Failures - Subways',
       'Wait Assessment - Subways (Inactive, Historic Calculations)',
       'Customer Injury Rate - Subways',
       'Mean Distance Between Failures - Staten Island Railway ',
       'On-Time Performance - Staten Island Railway',
       'Total Ridership - Subways', 'Elevator Availability - Subways',
       'Escalator Availability - Subways',
       'On-Time Performance (Terminal)', 'OTP (Terminal) - 1 Line',
       'OTP (Terminal) - 2 Line', 'OTP (Terminal) - 3 Line',
       'OTP (Terminal) - 4 Line', 'OTP (Terminal) - 5 Line',
       'OTP (Terminal) - 6 Line', 'OTP (Terminal) - 7 Line',
       'OTP (Terminal) - S Line 42 St.', 'OTP (Terminal) - A Line',
       'OTP (Terminal) - B Line', 'OTP (Terminal) - C Line',
       'OTP (Terminal) - D Line', 'OTP (Terminal) - E Line',
       'OTP (Terminal) - F Line', 'OTP (Terminal) - S Fkln Line',
       'OTP (Terminal) - G Line', 'OTP (Terminal) - J Z Line',
       'OTP (Terminal) - L Line', 'OTP (Terminal) - M Line',
       'OTP (Terminal) - N Line', 'OTP (Terminal) - Q Line',
       'OTP (Terminal) - R Line', 'OTP (Terminal) - S Line Rock',
       'OTP (Terminal) - W Line', 'Subway Wait Assessment ',
       'Subway Wait Assessment - 1 Line',
       'Subway Wait Assessment - 2 Line',
       'Subway Wait Assessment - 3 Line',
       'Subway Wait Assessment - 4 Line',
       'Subway Wait Assessment - 5 Line',
       'Subway Wait Assessment - 6 Line',
       'Subway Wait Assessment - 7 Line',
       'Subway Wait Assessment - S 42 St',
       'Subway Wait Assessment - A Line',
       'Subway Wait Assessment - B Line',
       'Subway Wait Assessment - C Line',
       'Subway Wait Assessment - D Line',
       'Subway Wait Assessment - E Line',
       'Subway Wait Assessment - F Line',
       'Subway Wait Assessment - G Line',
       'Subway Wait Assessment - S Rock ',
       'Subway Wait Assessment - J Z Line',
       'Subway Wait Assessment - L Line',
       'Subway Wait Assessment - M Line',
       'Subway Wait Assessment - N Line',
       'Subway Wait Assessment - Q Line',
       'Subway Wait Assessment - R Line',
       'Subway Wait Assessment - S Fkln',
       'Subway Wait Assessment - W Line',
       'Mean Distance Between Failures - NYCT Bus',
       'Total Paratransit Ridership - NYCT Bus',
       'Customer Accident Injury Rate - NYCT Bus',
       'Total Ridership - NYCT Bus ',
       'Collisions with Injury Rate - NYCT Bus',
       '% of Completed Trips - NYCT Bus',
       'East New York Depot - % of Completed Trips',
       'Flatbush Depot - % of Completed Trips',
       'Jackie Gleason Depot - % of Completed Trips',
       'Ulmer Park Depot - % of Completed Trips',
       'Fresh Pond Depot - % of Completed Trips',
       'Yukon Depot - % of Completed Trips',
       '126th Street Depot - % of Completed Trips ',
       'Casey Stengel Depot - % of Completed Trips',
       'Jamaica Depot - % of Completed Trips',
       'Castleton Depot - % of Completed Trips',
       'Queens Village Depot - % of Completed Trips',
       'Kingsbridge Depot - % of Completed Trips',
       'Manhattanville Depot - % of Completed Trips',
       'Gun Hill Depot - % of Completed Trips',
       'Michael J. Quill Depot - % of Completed Trips',
       'Charleston Depot - % of Completed Trips',
       'Meredith Avenue Depot - % of Completed Trips',
       '100th Street Depot - % of Completed Trips',
       'West Farms Depot - % of Completed Trips',
       'Grand Avenue Depot - % of Completed Trips',
       ' Bus Passenger Wheelchair Lift Usage - NYCT Bus'], dtype=object)
In [15]:
df2.groupby('PERIOD_MONTH').mean()
Out[15]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
PERIOD_YEAR MONTHLY_ACTUAL
PERIOD_MONTH
1 2013.084399 2.448283e+06
2 2012.951444 2.420027e+06
3 2012.952694 2.791283e+06
4 2012.955204 2.703473e+06
5 2012.928477 2.742173e+06
6 2012.789203 2.596588e+06
7 2012.799228 2.494102e+06
8 2012.802835 2.447313e+06
9 2012.802835 2.568728e+06
10 2012.892287 2.797948e+06
11 2012.811054 2.488300e+06
12 2012.871863 2.668256e+06
In [20]:
df3 = df2[df2['INDICATOR_NAME'] == 'Total Ridership - NYCT Bus ']
In [21]:
df3
Out[21]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
INDICATOR_NAME PERIOD_YEAR PERIOD_MONTH MONTHLY_ACTUAL
6399 Total Ridership - NYCT Bus 2008 1 60889506.0
6400 Total Ridership - NYCT Bus 2008 2 57631961.0
6401 Total Ridership - NYCT Bus 2008 3 64147814.0
6402 Total Ridership - NYCT Bus 2008 4 63701847.0
6403 Total Ridership - NYCT Bus 2008 5 65582363.0
6404 Total Ridership - NYCT Bus 2008 6 62609260.0
6405 Total Ridership - NYCT Bus 2008 7 61878187.0
6406 Total Ridership - NYCT Bus 2008 8 58465618.0
6407 Total Ridership - NYCT Bus 2008 9 64956000.0
6408 Total Ridership - NYCT Bus 2008 10 67390000.0
6409 Total Ridership - NYCT Bus 2008 11 59213000.0
6410 Total Ridership - NYCT Bus 2008 12 60348000.0
6411 Total Ridership - NYCT Bus 2009 1 58200405.0
6412 Total Ridership - NYCT Bus 2009 2 56959039.0
6413 Total Ridership - NYCT Bus 2009 3 65574537.0
6414 Total Ridership - NYCT Bus 2009 4 62183189.0
6415 Total Ridership - NYCT Bus 2009 5 63714187.0
6416 Total Ridership - NYCT Bus 2009 6 61470181.0
6417 Total Ridership - NYCT Bus 2009 7 58621866.0
6418 Total Ridership - NYCT Bus 2009 8 55813560.0
6419 Total Ridership - NYCT Bus 2009 9 61277073.0
6420 Total Ridership - NYCT Bus 2009 10 65187884.0
6421 Total Ridership - NYCT Bus 2009 11 59500722.0
6422 Total Ridership - NYCT Bus 2009 12 57968949.0
6423 Total Ridership - NYCT Bus 2010 1 55536876.0
6424 Total Ridership - NYCT Bus 2010 2 50967260.0
6425 Total Ridership - NYCT Bus 2010 3 63734422.0
6426 Total Ridership - NYCT Bus 2010 4 61770630.0
6427 Total Ridership - NYCT Bus 2010 5 61939306.0
6428 Total Ridership - NYCT Bus 2010 6 60096854.0
... ... ... ... ...
6493 Total Ridership - NYCT Bus 2015 11 52985130.0
6494 Total Ridership - NYCT Bus 2015 12 58610335.0
6495 Total Ridership - NYCT Bus 2016 1 49049885.0
6496 Total Ridership - NYCT Bus 2016 2 50327673.0
6497 Total Ridership - NYCT Bus 2016 3 58720672.0
6498 Total Ridership - NYCT Bus 2016 4 54318126.0
6499 Total Ridership - NYCT Bus 2016 5 56608696.0
6500 Total Ridership - NYCT Bus 2016 6 54222089.0
6501 Total Ridership - NYCT Bus 2016 7 50453231.0
6502 Total Ridership - NYCT Bus 2016 8 52055737.0
6503 Total Ridership - NYCT Bus 2016 9 54848336.0
6504 Total Ridership - NYCT Bus 2016 10 54558371.0
6505 Total Ridership - NYCT Bus 2016 11 52389723.0
6506 Total Ridership - NYCT Bus 2016 12 50830973.0
6507 Total Ridership - NYCT Bus 2017 1 49629065.0
6508 Total Ridership - NYCT Bus 2017 2 46731666.0
6509 Total Ridership - NYCT Bus 2017 3 53491942.0
6510 Total Ridership - NYCT Bus 2017 4 49647496.0
6511 Total Ridership - NYCT Bus 2017 5 54521405.0
6512 Total Ridership - NYCT Bus 2017 6 51235902.0
6513 Total Ridership - NYCT Bus 2017 7 47820447.0
6514 Total Ridership - NYCT Bus 2017 8 48720905.0
6515 Total Ridership - NYCT Bus 2017 9 50399666.0
6516 Total Ridership - NYCT Bus 2017 10 54167068.0
6517 Total Ridership - NYCT Bus 2017 11 49790212.0
6518 Total Ridership - NYCT Bus 2017 12 46440136.0
6519 Total Ridership - NYCT Bus 2018 1 45323007.0
6520 Total Ridership - NYCT Bus 2018 2 44686764.0
6521 Total Ridership - NYCT Bus 2018 3 49761212.0
6522 Total Ridership - NYCT Bus 2018 4 48618792.0
<p>124 rows × 4 columns</p>
In [25]:
#df3.groupby('PERIOD_MONTH').mean()  # Average over years
df4 = df3.groupby('PERIOD_MONTH').mean() 
In [26]:
df4 = df4[['MONTHLY_ACTUAL']]
In [28]:
df3[:3]
Out[28]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
INDICATOR_NAME PERIOD_YEAR PERIOD_MONTH MONTHLY_ACTUAL
6399 Total Ridership - NYCT Bus 2008 1 60889506.0
6400 Total Ridership - NYCT Bus 2008 2 57631961.0
6401 Total Ridership - NYCT Bus 2008 3 64147814.0
In [ ]:
 
In [11]:
nyct_df = df2[((df2.PERIOD_YEAR > 2015))&(df2.INDICATOR_NAME == "Total Ridership - NYCT Bus ")]
nyct_df.head()
Out[11]:
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
INDICATOR_NAME PERIOD_YEAR PERIOD_MONTH MONTHLY_ACTUAL
6495 Total Ridership - NYCT Bus 2016 1 49049885.0
6496 Total Ridership - NYCT Bus 2016 2 50327673.0
6497 Total Ridership - NYCT Bus 2016 3 58720672.0
6498 Total Ridership - NYCT Bus 2016 4 54318126.0
6499 Total Ridership - NYCT Bus 2016 5 56608696.0
In [12]:
dates = pd.to_datetime(dict(year=nyct_df.PERIOD_YEAR, month=nyct_df.PERIOD_MONTH, day=1))
nyct_df.loc[:,'Date'] = dates
In [13]:
mta_df['Ridership'] = mta_df['MONTHLY_ACTUAL'].values + nyct_df['MONTHLY_ACTUAL'].values
bus_df = mta_df[['Ridership','Date']]
/home/user/.virtualenvs/crosscompute/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [22]:
bus_df.Date
Out[22]:
334   2016-01-01
335   2016-02-01
336   2016-03-01
337   2016-04-01
338   2016-05-01
339   2016-06-01
340   2016-07-01
341   2016-08-01
342   2016-09-01
343   2016-10-01
344   2016-11-01
345   2016-12-01
346   2017-01-01
347   2017-02-01
348   2017-03-01
349   2017-04-01
350   2017-05-01
351   2017-06-01
352   2017-07-01
353   2017-08-01
354   2017-09-01
355   2017-10-01
356   2017-11-01
357   2017-12-01
358   2018-01-01
359   2018-02-01
360   2018-03-01
361   2018-04-01
Name: Date, dtype: datetime64[ns]
In [23]:
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.plot(bus_df.Date, bus_df.Ridership)
fig.
  File "<ipython-input-23-372b4ae0a5dc>", line 5
    fig.
        ^
SyntaxError: invalid syntax
In [14]:
df3 = pd.read_csv("https://data.cityofnewyork.us/api/views/2v9c-2k7f/rows.csv?accessType=DOWNLOAD")
df3 = df3[["Year", "Month", "Total Dispatched Trips"]]
fhv_df = df3[((df3.Year == 2016) | (df3.Year == 2017) | (df3.Year == 2018))]
fhv_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20100 entries, 2 to 23415
Data columns (total 3 columns):
Year                      20100 non-null int64
Month                     20100 non-null int64
Total Dispatched Trips    20100 non-null int64
dtypes: int64(3)
memory usage: 628.1 KB
In [15]:
fhv_df['Date'] = pd.to_datetime(dict(year=fhv_df.Year, month=fhv_df.Month, day=1))
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [16]:
fhv_df = fhv_df.groupby(['Date']).sum()
fhv_df = fhv_df['Total Dispatched Trips']
In [17]:
fhv_df.plot()

plt.title("Monthly FHV Ridership")
plt.xlabel("Month")
plt.ylabel("FHV Ridership")
plt.show()
In [72]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
model = LinearRegression()
x = fhv_df[0:28].values
y = bus_df.Ridership.values
In [73]:
from datetime import timedelta
x1 = pd.DataFrame(data=fhv_df[0:28])
x1['b'] = fhv_df[1:29].values
x1['c'] = fhv_df[2:30].values
x1.columns = ['Month - 2','Month - 1','Current Month']
x1 = x1.reset_index()
In [74]:
x1['Month'] = x1['Date'].apply(lambda x: x.month)
In [75]:
x1 = x1.drop(columns = 'Date')
In [77]:
x = x.reshape(28,1)
y = y.reshape(28,1)
model.fit(X=x,y=y)
plt.scatter(x,y)
plt.xlabel("FHV Trips")
plt.ylabel("MTA Bus Trips")
plt.plot(x,model.predict(x),color="red")
plt.show()
In [84]:
from sklearn.model_selection import cross_val_score
cross_val_score(model,x,y,scoring = 'neg_mean_absolute_error').mean()
C:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
Out[84]:
-2943038.9585895427
In [97]:
model = LinearRegression()
model.fit(x1,y)
Out[97]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [98]:
cross_val_score(model,x1,y,scoring = 'neg_mean_absolute_error').mean()
C:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
Out[98]:
-2710456.008361019
In [117]:
import statsmodels.api as sm
x = sm.add_constant(x)
In [118]:
reg = sm.OLS(y,x).fit()
In [119]:
print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.259
Model:                            OLS   Adj. R-squared:                  0.230
Method:                 Least Squares   F-statistic:                     9.086
Date:                Thu, 31 Jan 2019   Prob (F-statistic):            0.00569
Time:                        17:01:33   Log-Likelihood:                -460.46
No. Observations:                  28   AIC:                             924.9
Df Residuals:                      26   BIC:                             927.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       6.893e+07   2.62e+06     26.325      0.000    6.35e+07    7.43e+07
x1            -0.5278      0.175     -3.014      0.006      -0.888      -0.168
==============================================================================
Omnibus:                        1.381   Durbin-Watson:                   1.677
Prob(Omnibus):                  0.501   Jarque-Bera (JB):                0.937
Skew:                          -0.016   Prob(JB):                        0.626
Kurtosis:                       2.105   Cond. No.                     5.95e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.95e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

model.fit(x=x,y=y)