Problem Statement

Mission
You are working with the government to transform your city into a smart city. The vision is a digital, intelligent city that delivers services to its citizens more efficiently. One of the problems the government faces is traffic. You are a data scientist working to manage the city's traffic better and to provide input on future infrastructure planning.

The government wants to implement a robust traffic system for the city by preparing for traffic peaks. To do so, they want to understand the traffic patterns at the four junctions of the city. Traffic patterns on holidays and other special occasions during the year differ from those on normal working days, which is important to account for in your forecasting.

Your task
Predict traffic patterns at each of these four junctions for the next 4 months.

Data

The sensors at each of these junctions began collecting data at different times, so you will see traffic data from different time periods. To add to the complexity, some of the junctions have provided limited or sparse data, requiring thoughtfulness when creating future projections. From 20 months of historical data, the government is looking to you to deliver accurate traffic projections for the coming four months. Your algorithm will become the foundation of a larger transformation to make your city smart and intelligent.

The evaluation metric for the competition is RMSE. The public/private leaderboard split is 25:75.
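Since RMSE drives the leaderboard, here is a minimal reference implementation for local validation (the rmse name is mine):

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted vehicle counts."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))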

Data Dictionary
Variable ---- Definition
ID ----------- Unique ID
DateTime ----- Hourly Datetime Variable
Junction ----- Junction Type
Vehicles ----- Number of Vehicles (Target)

source: https://datahack.analyticsvidhya.com/contest/mckinsey-analytics-hackathon/

In [152]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import time
from datetime import datetime
In [5]:
df_train = pd.read_csv('/home/datadavis/Desktop/Projects/Mckinsey & Co Data Hackathon/train.csv')
In [6]:
df_train.describe()
Out[6]:
           Junction      Vehicles            ID
count  48120.000000  48120.000000  4.812000e+04
mean       2.180549     22.791334  2.016330e+10
std        0.966955     20.750063  5.944854e+06
min        1.000000      1.000000  2.015110e+10
25%        1.000000      9.000000  2.016042e+10
50%        2.000000     15.000000  2.016093e+10
75%        3.000000     29.000000  2.017023e+10
max        4.000000    180.000000  2.017063e+10
In [49]:
df_train['DateTime'].unique()
Out[49]:
array(['2015-11-01 00:00:00', '2015-11-01 01:00:00', '2015-11-01 02:00:00',
       ..., '2017-06-30 21:00:00', '2017-06-30 22:00:00',
       '2017-06-30 23:00:00'], dtype=object)
In [9]:
df_train['Junction'].value_counts()
Out[9]:
3    14592
2    14592
1    14592
4     4344
Name: Junction, dtype: int64
In [90]:
time_elem = []

# Parse each DateTime string into a time.struct_time
# (itertuples: row[0] is the index, row[1] the DateTime column)
for row in df_train.itertuples():
    time_elem.append(time.strptime(row[1], '%Y-%m-%d %H:%M:%S'))

# https://docs.python.org/3/library/time.html#time.strptime
# https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas
In [139]:
hour_meas = []

# row[5] is the time_elem struct_time; index 3 is tm_hour (0-23)
for row in df_train.itertuples():
    hour_meas.append(row[5][3])
In [129]:
wkday_meas = []

# index 6 of the struct_time is tm_wday (0 = Monday ... 6 = Sunday)
for row in df_train.itertuples():
    wkday_meas.append(row[5][6])
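As an aside, the three itertuples loops above could be collapsed into vectorized pandas operations; a sketch assuming the same df_train (equivalent in intent, though untested against this run):

# Hypothetical vectorized equivalent of the loops above.
dt = pd.to_datetime(df_train['DateTime'], format='%Y-%m-%d %H:%M:%S')
df_train['hour_meas'] = dt.dt.hour      # matches struct_time.tm_hour (0-23)
df_train['wkday_meas'] = dt.dt.weekday  # matches tm_wday (0 = Monday ... 6 = Sunday)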
In [92]:
df_train['time_elem'] = time_elem
In [140]:
df_train['hour_meas'] = hour_meas
In [130]:
df_train['wkday_meas'] = wkday_meas
In [141]:
df_train.tail(24)
Out[141]:
DateTime Junction Vehicles ID time_elem hour_meas wkday_meas
48096 2017-06-30 00:00:00 4 9 20170630004 (2017, 6, 30, 0, 0, 0, 4, 181, -1) 0 4
48097 2017-06-30 01:00:00 4 7 20170630014 (2017, 6, 30, 1, 0, 0, 4, 181, -1) 1 4
48098 2017-06-30 02:00:00 4 5 20170630024 (2017, 6, 30, 2, 0, 0, 4, 181, -1) 2 4
48099 2017-06-30 03:00:00 4 6 20170630034 (2017, 6, 30, 3, 0, 0, 4, 181, -1) 3 4
48100 2017-06-30 04:00:00 4 5 20170630044 (2017, 6, 30, 4, 0, 0, 4, 181, -1) 4 4
48101 2017-06-30 05:00:00 4 5 20170630054 (2017, 6, 30, 5, 0, 0, 4, 181, -1) 5 4
48102 2017-06-30 06:00:00 4 6 20170630064 (2017, 6, 30, 6, 0, 0, 4, 181, -1) 6 4
48103 2017-06-30 07:00:00 4 4 20170630074 (2017, 6, 30, 7, 0, 0, 4, 181, -1) 7 4
48104 2017-06-30 08:00:00 4 5 20170630084 (2017, 6, 30, 8, 0, 0, 4, 181, -1) 8 4
48105 2017-06-30 09:00:00 4 7 20170630094 (2017, 6, 30, 9, 0, 0, 4, 181, -1) 9 4
48106 2017-06-30 10:00:00 4 15 20170630104 (2017, 6, 30, 10, 0, 0, 4, 181, -1) 10 4
48107 2017-06-30 11:00:00 4 14 20170630114 (2017, 6, 30, 11, 0, 0, 4, 181, -1) 11 4
48108 2017-06-30 12:00:00 4 9 20170630124 (2017, 6, 30, 12, 0, 0, 4, 181, -1) 12 4
48109 2017-06-30 13:00:00 4 11 20170630134 (2017, 6, 30, 13, 0, 0, 4, 181, -1) 13 4
48110 2017-06-30 14:00:00 4 10 20170630144 (2017, 6, 30, 14, 0, 0, 4, 181, -1) 14 4
48111 2017-06-30 15:00:00 4 14 20170630154 (2017, 6, 30, 15, 0, 0, 4, 181, -1) 15 4
48112 2017-06-30 16:00:00 4 16 20170630164 (2017, 6, 30, 16, 0, 0, 4, 181, -1) 16 4
48113 2017-06-30 17:00:00 4 16 20170630174 (2017, 6, 30, 17, 0, 0, 4, 181, -1) 17 4
48114 2017-06-30 18:00:00 4 17 20170630184 (2017, 6, 30, 18, 0, 0, 4, 181, -1) 18 4
48115 2017-06-30 19:00:00 4 11 20170630194 (2017, 6, 30, 19, 0, 0, 4, 181, -1) 19 4
48116 2017-06-30 20:00:00 4 30 20170630204 (2017, 6, 30, 20, 0, 0, 4, 181, -1) 20 4
48117 2017-06-30 21:00:00 4 16 20170630214 (2017, 6, 30, 21, 0, 0, 4, 181, -1) 21 4
48118 2017-06-30 22:00:00 4 22 20170630224 (2017, 6, 30, 22, 0, 0, 4, 181, -1) 22 4
48119 2017-06-30 23:00:00 4 12 20170630234 (2017, 6, 30, 23, 0, 0, 4, 181, -1) 23 4

Thought Process

  • Calculate total daily demand for each junction
  • Spread each day's total across the 24 hours using the junction's average hourly demand profile (the assumption is that every day follows a similar within-day pattern)
In [194]:
t1 = df_train[df_train['Junction'] == 1]
t2 = df_train[df_train['Junction'] == 2]
t3 = df_train[df_train['Junction'] == 3]
t4 = df_train[df_train['Junction'] == 4]
In [143]:
t1['Vehicles'].groupby(t1['hour_meas']).mean()

# https://chrisalbon.com/python/pandas_group_rows_by.html
Out[143]:
hour_meas
0     45.738487
1     39.156250
2     33.907895
3     29.430921
4     25.654605
5     24.067434
6     26.080592
7     29.526316
8     32.735197
9     39.003289
10    49.527961
11    56.187500
12    57.254934
13    51.075658
14    54.733553
15    54.218750
16    51.911184
17    51.851974
18    55.411184
19    58.804276
20    57.401316
21    54.537829
22    52.963816
23    50.088816
Name: Vehicles, dtype: float64
In [158]:
# per-junction: hour labels, x positions, and mean hourly demand
objects1 = t1['hour_meas'].unique()
y_pos1 = np.arange(len(objects1))
performance1 = t1['Vehicles'].groupby(t1['hour_meas']).mean()

objects2 = t2['hour_meas'].unique()
y_pos2 = np.arange(len(objects2))
performance2 = t2['Vehicles'].groupby(t2['hour_meas']).mean()

objects3 = t3['hour_meas'].unique()
y_pos3 = np.arange(len(objects3))
performance3 = t3['Vehicles'].groupby(t3['hour_meas']).mean()

objects4 = t4['hour_meas'].unique()
y_pos4 = np.arange(len(objects4))
performance4 = t4['Vehicles'].groupby(t4['hour_meas']).mean()
In [168]:
# four subplots, the axes array is 1-d
f, axarr = plt.subplots(4, sharex=True)

axarr[0].set_title('Hourly Demand by Junction')
axarr[0].bar(y_pos1, performance1, 0.5, color='red')
axarr[1].bar(y_pos2, performance2, 0.5, color='green')
axarr[2].bar(y_pos3, performance3, 0.5, color='blue')
axarr[3].bar(y_pos4, performance4, 0.5, color='gray')

# https://matplotlib.org/examples/pylab_examples/subplots_demo.html
# https://pythonspot.com/en/matplotlib-bar-chart/
# https://stackoverflow.com/questions/27678906/4-subplot-barcharts-in-python
Out[168]:
[Figure: four stacked bar charts of mean hourly Vehicles for Junctions 1-4 (red, green, blue, gray), sharing the hour-of-day x-axis]
In [171]:
t1['Vehicles'].groupby(t1['hour_meas']).mean()[0]
Out[171]:
45.73848684210526
In [170]:
t1_hr_sum
Out[170]:
1081.2697368421054
In [180]:
t1_hr_mean = t1['Vehicles'].groupby(t1['hour_meas']).mean()
t1_hr_sum = t1_hr_mean.sum()

# normalize the mean hourly demand so the 24 weights sum to 1
t1_hr_weight = [t1_hr_mean[hr] / t1_hr_sum for hr in range(24)]

t1_weights = pd.DataFrame(t1_hr_weight, columns=['wt'])
#t1_weights
In [181]:
t2_hr_mean = t2['Vehicles'].groupby(t2['hour_meas']).mean()
t2_hr_sum = t2_hr_mean.sum()
t2_hr_weight = [t2_hr_mean[hr] / t2_hr_sum for hr in range(24)]

t2_weights = pd.DataFrame(t2_hr_weight, columns=['wt'])
#t2_weights
In [182]:
t3_hr_mean = t3['Vehicles'].groupby(t3['hour_meas']).mean()
t3_hr_sum = t3_hr_mean.sum()
t3_hr_weight = [t3_hr_mean[hr] / t3_hr_sum for hr in range(24)]

t3_weights = pd.DataFrame(t3_hr_weight, columns=['wt'])
#t3_weights
In [183]:
t4_hr_mean = t4['Vehicles'].groupby(t4['hour_meas']).mean()
t4_hr_sum = t4_hr_mean.sum()
t4_hr_weight = [t4_hr_mean[hr] / t4_hr_sum for hr in range(24)]

t4_weights = pd.DataFrame(t4_hr_weight, columns=['wt'])
#t4_weights
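Since the four cells above differ only in the junction number, the weights could also be computed in one pass; a sketch assuming the tN dataframes from In [194] (the junctions and weights_by_junction names are mine):

# Hypothetical consolidation of the four weight cells above.
junctions = {1: t1, 2: t2, 3: t3, 4: t4}
weights_by_junction = {}
for j, tj in junctions.items():
    hr_mean = tj.groupby('hour_meas')['Vehicles'].mean()
    weights_by_junction[j] = pd.DataFrame({'wt': (hr_mean / hr_mean.sum()).values})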
In [253]:
t4_weights.head()
Out[253]:
wt
0 0.039777
1 0.032570
2 0.027809
3 0.024825
4 0.022158

Thought Process

  • Now that we have the hourly profile for each junction, it's time to forecast the daily data
  • We will do this by summing the data by day, forecasting with Facebook's Prophet engine, then applying our hourly weights for a best estimate of the future (a sketch of that last step follows this list)
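That final weight-application step is not shown in the cells below, so here is a hedged sketch of how it might look once a daily forecast exists (hourly_from_daily is my name; t1_weights comes from In [180] and forecast1 from In [256] below):

# Hypothetical: expand a daily Prophet forecast into hourly predictions by
# multiplying each day's predicted total by the junction's hourly weight profile.
def hourly_from_daily(forecast, weights):
    rows = []
    for _, day in forecast[['ds', 'yhat']].iterrows():
        for hr in range(24):
            rows.append({'ds': day['ds'], 'hour': hr,
                         'Vehicles': day['yhat'] * weights['wt'][hr]})
    return pd.DataFrame(rows)

# e.g. hourly1 = hourly_from_daily(forecast1, t1_weights)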
In [184]:
!pip install fbprophet #https://facebook.github.io/prophet/docs/installation.html#installation-in-python
Requirement already satisfied: fbprophet in /home/datadavis/anaconda3/lib/python3.6/site-packages
In [201]:
from fbprophet import Prophet
In [191]:
t_date = []

# Build a date object for each row from the struct_time fields
# (row[5] is time_elem: index 0 = year, 1 = month, 2 = day)
for row in df_train.itertuples():
    t_date.append(datetime(row[5][0], row[5][1], row[5][2]).date())

# http://www.pythonforbeginners.com/basics/python-datetime-time-examples
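The same vectorized shortcut applies here (a one-liner sketch, untested against this run):

# Hypothetical equivalent of the loop above.
df_train['ds'] = pd.to_datetime(df_train['DateTime']).dt.date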
In [193]:
df_train['ds'] = t_date

Go back and re-run the junction split (In [194]) so the four dataframes pick up the new ds column:
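t1 = df_train[df_train['Junction'] == 1]
t2 = df_train[df_train['Junction'] == 2]
t3 = df_train[df_train['Junction'] == 3]
t4 = df_train[df_train['Junction'] == 4]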

In [198]:
t1.head(7)
Out[198]:
DateTime Junction Vehicles ID time_elem hour_meas wkday_meas ds
0 2015-11-01 00:00:00 1 15 20151101001 (2015, 11, 1, 0, 0, 0, 6, 305, -1) 0 6 2015-11-01
1 2015-11-01 01:00:00 1 13 20151101011 (2015, 11, 1, 1, 0, 0, 6, 305, -1) 1 6 2015-11-01
2 2015-11-01 02:00:00 1 10 20151101021 (2015, 11, 1, 2, 0, 0, 6, 305, -1) 2 6 2015-11-01
3 2015-11-01 03:00:00 1 7 20151101031 (2015, 11, 1, 3, 0, 0, 6, 305, -1) 3 6 2015-11-01
4 2015-11-01 04:00:00 1 9 20151101041 (2015, 11, 1, 4, 0, 0, 6, 305, -1) 4 6 2015-11-01
5 2015-11-01 05:00:00 1 6 20151101051 (2015, 11, 1, 5, 0, 0, 6, 305, -1) 5 6 2015-11-01
6 2015-11-01 06:00:00 1 9 20151101061 (2015, 11, 1, 6, 0, 0, 6, 305, -1) 6 6 2015-11-01
In [256]:
train1 = t1[['ds', 'Vehicles']].groupby(t1['ds']).sum().reset_index()
#train1['Vehicles'] = np.log(train1['Vehicles'])
train1.columns = ['ds', 'y'] # Prophet expects the columns to be named ds and y

m1 = Prophet()
m1.fit(train1)

days_ahead = 4*30 + 5 # 4 months ahead
future1 = m1.make_future_dataframe(periods=days_ahead)
#future.tail()

forecast1 = m1.predict(future1)
#forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

m1.plot(forecast1)

# https://facebook.github.io/prophet/docs/quick_start.html
# https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
INFO:fbprophet.forecaster:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[256]:
[Figure: Prophet forecast of daily vehicle totals at Junction 1: observed points with the fitted trend and uncertainty band]
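Note the INFO lines: with just under two years of history, Prophet disables yearly seasonality by default. Given the 20 months available, overriding that may be worth trying, as the log itself suggests (a variant to experiment with, not what was run here):

# Hypothetical variant: force yearly seasonality on despite < 2 years of data.
m1 = Prophet(yearly_seasonality=True)
m1.fit(train1)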

Thought Process

  • Interesting to see the separation in data points over time; I wonder if the other junctions will show a similar pattern
  • It makes me think that some days do not include 24 hours' worth of data. Let's examine further (a per-day check is sketched below; In [212] then looks at the hour distribution)
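A direct per-day completeness check (a sketch using the ds column added earlier; rows_per_day is my name):

# Hypothetical: hourly rows recorded per calendar day at Junction 1.
rows_per_day = t1.groupby('ds').size()
print(rows_per_day.value_counts())  # a single count of 24 would mean every day is complete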
In [212]:
t4['hour_meas'].value_counts()
Out[212]:
23    181
19    181
4     181
8     181
12    181
16    181
20    181
1     181
5     181
9     181
13    181
17    181
21    181
2     181
6     181
10    181
14    181
18    181
22    181
3     181
7     181
11    181
15    181
0     181
Name: hour_meas, dtype: int64

The samples are evenly distributed across the hours at Junction 4 (181 observations per hour), so let's see whether the pattern persists in the other junctions' forecasts.

In [213]:
train2 = t2[['ds', 'Vehicles']].groupby(t2['ds']).sum().reset_index()
#train2['Vehicles'] = np.log(train2['Vehicles'])
train2.columns = ['ds', 'y'] # Prophet expects the columns to be named ds and y

m2 = Prophet()
m2.fit(train2)

days_ahead = 4*30 + 5 # 4 months ahead
future2 = m2.make_future_dataframe(periods=days_ahead)
#future.tail()

forecast2 = m2.predict(future2)
#forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

m2.plot(forecast2)
INFO:fbprophet.forecaster:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[213]:
[Figure: Prophet forecast of daily vehicle totals at Junction 2]
In [214]:
train3 = t3[['ds', 'Vehicles']].groupby(t3['ds']).sum().reset_index()
#train3['Vehicles'] = np.log(train3['Vehicles'])
train3.columns = ['ds', 'y'] # Prophet expects the columns to be named ds and y

m3 = Prophet()
m3.fit(train3)

days_ahead = 4*30 + 5 # 4 months ahead
future3 = m3.make_future_dataframe(periods=days_ahead)
#future.tail()

forecast3 = m3.predict(future3)
#forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

m3.plot(forecast3)
INFO:fbprophet.forecaster:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[214]:
[Figure: Prophet forecast of daily vehicle totals at Junction 3]