In this post I am going to produce a forecast for an airline company using an ARIMA model. To do so, I will stationarize the series and study its seasonality, autocorrelations and partial autocorrelations to determine the optimal model and parameters. Then I will fit the model and produce a passenger forecast for the next two years.
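For reference, an ARIMA(p, d, q) model fits the series after differencing it d times, using p autoregressive lags and q moving-average lags: $y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$, where $y_t$ is the differenced series and $\varepsilon_t$ the error terms.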
In [1]:
# In this exercise I am going to forecast the passengers of an airline company using ARIMA.
# To do so, I will stationarize the series and study the autocorrelations, partial autocorrelations
# and the appropriate lags to be included in the forecasting equation.
# I will decide whether an AR or an MA model is more appropriate,
# then fit the selected model and check its residuals.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
#Importing data
airline = pd.read_csv('airline_passengers.csv',index_col="Month")
airline.dropna(inplace=True)
# Converting the dates into a DatetimeIndex
airline.index = pd.to_datetime(airline.index)
In [3]:
airline.tail()
Out[3]:
In [4]:
# Plotting the series along with 6-month and 12-month moving averages.
# We can see an upward trend with seasonality.
airline['6-month-SMA']=airline['Thousands of Passengers'].rolling(window=6).mean()
airline['12-month-SMA']=airline['Thousands of Passengers'].rolling(window=12).mean()
airline.plot(figsize=(10,5));
In [5]:
# Taking a look at the characteristics of the series
airline.describe().transpose()
Out[5]:
In [6]:
### ETS decomposition
### I am going to visualize the trend and the seasonal component to see how big the influence of
### seasonality is
#airline.drop('12-month-SMA', axis=1, inplace=True)
#airline.drop('6-month-SMA', axis=1, inplace=True)
# Importing from statsmodels the function to decompose the trend and seasonality
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(airline['Thousands of Passengers'], period=12)  # 'period' replaced the deprecated 'freq' argument
fig = decomposition.plot()
fig.set_size_inches(15, 8);
# We can still see traces of seasonality in the residuals.
# This is because the strength of the seasonal pattern changes over the series, while this decomposition assumes it is constant.
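If needed, the individual components can also be inspected on their own; a minimal sketch (seasonal_decompose returns a DecomposeResult whose parts are attributes):
decomposition.trend.plot(figsize=(10,3));
decomposition.seasonal.plot(figsize=(10,3));
decomposition.resid.plot(figsize=(10,3));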
In [7]:
# We will use the following function to check whether the series is stationary
from statsmodels.tsa.stattools import adfuller
In [8]:
result = adfuller(airline['Thousands of Passengers'])
result
# With a p-value of 0.9918 > 0.05 we cannot reject the null hypothesis: the series is not stationary; indeed it has both seasonality and trend
Out[8]:
In [9]:
# As we will need to run this test many times while differencing away
# the trend and seasonality, we will wrap it in a function
def adf_check(time_series):
    """
    Pass in a time series; prints an ADF test report.
    """
    result = adfuller(time_series)
    print('Augmented Dickey-Fuller Test:')
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']
    for value,label in zip(result,labels):
        print(label+' : '+str(value))
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis; reject it. The data has no unit root and is stationary.")
    else:
        print("Weak evidence against the null hypothesis; the series has a unit root, indicating it is non-stationary.")
In [10]:
# Repeating the check, this time with the function we just created
adf_check(airline['Thousands of Passengers'])
In [11]:
# We are going to difference the series with the objective of eliminating
# the trend and seasonality, and of exploring with how many lags the seasonality
# is eliminated through differencing.
# First difference:
# we use a lag of 1 month; this operation eliminates the trend.
airline['Airline first Difference'] = airline['Thousands of Passengers'] - airline['Thousands of Passengers'].shift(1)
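As an aside, pandas offers an equivalent shorthand for this shift-and-subtract:
# Equivalent pandas one-liner (a sketch):
# airline['Airline first Difference'] = airline['Thousands of Passengers'].diff(1)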
In [12]:
# Checking whether the differenced series is stationary
adf_check(airline['Airline first Difference'].dropna())
# The p-value is very close to 0.05 but still above it, so we should take further differences.
In [13]:
airline['Airline first Difference'].plot(figsize=(10,5),legend=True);
airline['Airline first Difference'].rolling(window=12).mean().plot(figsize=(10,5),label='12-MA',legend=True)
airline['Airline first Difference'].rolling(window=12).std().plot(figsize=(10,5),legend=True,label='rolling 12-std');
## Looking at the plot we can see that the trend has disappeared.
# The chart shows heteroscedasticity, as the variance depends on time,
# but the covariance does not seem to depend on time.
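A common way to tame this kind of growing variance (not applied here; just a hedged sketch) is to model the log of the series, which turns the multiplicative seasonality into an additive one:
# Sketch: the first difference of the log series should show a much more stable variance
log_passengers = np.log(airline['Thousands of Passengers'])
log_passengers.diff(1).plot(figsize=(10,5), title='First difference of log passengers');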
In [14]:
# Second difference
# Sometimes a second difference is necessary. In this case, although the trend has disappeared,
# we can see that we still have some seasonality.
airline['Airline Second Difference'] = airline['Airline first Difference'] - airline['Airline first Difference'].shift(1)
In [15]:
# Checking now if the series is stationary
adf_check(airline['Airline Second Difference'].dropna())
# The p-value is now lower than 0.05, so the test no longer detects a unit root.
# Nevertheless, as I said before, the lag we should really difference at is 12, because the
# 12-period seasonality is quite clear.
In [16]:
# Plotting the second difference
airline['Airline Second Difference'].plot(figsize=(16,6),legend=True);
airline['Airline Second Difference'].rolling(window=12).mean().plot(figsize=(10,5),label='12-MA',legend=True)
airline['Airline Second Difference'].rolling(window=12).std().plot(figsize=(10,5),legend=True,label='rolling 12-std');
# Although I have taken the second difference, we can still see some heteroscedasticity, as the rolling std depends on time
In [17]:
# I am going to repeat the differencing, but this time with a lag of 12 months to eliminate the seasonality first
#SEASONAL DIFFERENCE
plt.figure(figsize=(16,6))
airline['Seasonal Difference'] = airline['Thousands of Passengers'] - airline['Thousands of Passengers'].shift(12)
airline['Seasonal Difference'].plot(legend=True);
airline['Seasonal Difference'].rolling(window=12).mean().plot(figsize=(10,5),label='12-MA',legend=True)
airline['Seasonal Difference'].rolling(window=12).std().plot(figsize=(10,5),legend=True,label='rolling 12-std');
# With this approach we can see that we get rid of the heteroscedasticity problem
In [18]:
# Seasonal Difference by itself was enough to make the series stationary
adf_check(airline['Seasonal Difference'].dropna())
# The p-value is lower than 0.05, so the series is stationary.
# If, after taking the seasonal difference, the test had told us that the series was still not stationary,
# we would have had to take first differences over the seasonal differences.
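For illustration only (the seasonal difference alone already passed the test), that fallback would look like this:
# Hypothetical fallback: first difference of the seasonal difference
seasonal_first = airline['Seasonal Difference'] - airline['Seasonal Difference'].shift(1)
adf_check(seasonal_first.dropna())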
In [19]:
# In this part I am going to plot the autocorrelations and partial autocorrelations
# to figure out how to fit the model: with AR or with MA terms
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
In [20]:
# To see the differences I will plot the autocorrelation of both the first difference (lag of 1) and the seasonal difference (lag of 12)
# Differences with a lag of 1 period
fig_first = plot_acf(airline["Airline first Difference"].dropna())
In [21]:
# Autocorrelation plot with 1 difference, lag of 12 months.
# We can see that this series behaves better, because the autocorrelation coefficients die out earlier.
fig_seasonal_first = plot_acf(airline["Seasonal Difference"].dropna())
In [22]:
# We can also draw the autocorrelation plot with pandas.
# Pandas does not have a partial autocorrelation plot, and we need it to decide whether to use
# AR or MA models and their order, so we will keep using statsmodels.
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(airline['Seasonal Difference'].dropna());
# As the first autocorrelation is positive and the autocorrelations decay gradually towards 0, I will use an AR model.
In [23]:
# I am now going to plot the partial autocorrelation to determine the order
# to be used in the AR model
result = plot_pacf(airline["Seasonal Difference"].dropna())
# As the partial autocorrelations drop sharply after 2 lags, the order of the AR model will be 2.
In [24]:
# As a summary, I will use an AR(2) model with 1 seasonal difference (lag of 12 periods)
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(airline['Seasonal Difference'].iloc[13:], lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(airline['Seasonal Difference'].iloc[13:], lags=40, ax=ax2)
# We have 2 significant spikes in the partial autocorrelation and a smooth decay in the autocorrelation, so the correct parameters are AR(2).
# We can see how from lag 12 onwards the coefficients are negative or close to 0, because the series
# had a seasonality of 12 periods.
In [25]:
# FITTING THE ARIMA MODEL
# The plain ARIMA class is meant for non-seasonal data; since our series is seasonal we will fit a SARIMAX model below.
# Importing the ARIMA model
from statsmodels.tsa.arima_model import ARIMA
# Note: in statsmodels >= 0.13 this module was removed; the replacement is statsmodels.tsa.arima.model.ARIMA
In [38]:
# We can take a look at the ARIMA function documentation
# help(ARIMA)
In [27]:
# Modelling the series with a seasonal ARIMA (SARIMAX) and fitting the model.
# order=(2,1,0) gives the AR(2) with one regular difference; seasonal_order=(1,1,1,12) adds seasonal AR and MA terms with one seasonal difference at lag 12.
model = sm.tsa.statespace.SARIMAX(airline['Thousands of Passengers'],order=(2,1,0), seasonal_order=(1,1,1,12))
results = model.fit()
print(results.summary())
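As a sanity check on the chosen order (a sketch, not part of the original selection procedure), one could compare the AIC of a few candidate specifications and keep the lowest:
import itertools
for p, q in itertools.product(range(3), range(3)):
    try:
        candidate = sm.tsa.statespace.SARIMAX(airline['Thousands of Passengers'],
                                              order=(p, 1, q),
                                              seasonal_order=(1, 1, 1, 12)).fit(disp=False)
        print('ARIMA({},1,{})x(1,1,1,12) - AIC: {:.1f}'.format(p, q, candidate.aic))
    except Exception:
        continue  # some combinations may fail to converge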
In [28]:
results.resid.plot(figsize=(16, 4));
results.resid.rolling(window=12).mean().plot(figsize=(16,4),label='rolling 12-mean',legend=True)
results.resid.rolling(window=12).std().plot(figsize=(16,4),label='rolling 12-std',legend=True)
# We can see that the residuals do not depend on time
Out[28]:
In [29]:
results.resid.plot(kind='kde',figsize=(8, 4));
# The residual distribution seems to be centered around 0
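A more formal residual diagnostic is the Ljung-Box test for leftover autocorrelation (a sketch; the return format of acorr_ljungbox varies across statsmodels versions):
from statsmodels.stats.diagnostic import acorr_ljungbox
# p-values well above 0.05 would mean no significant autocorrelation remains in the residuals
print(acorr_ljungbox(results.resid, lags=[12], return_df=True))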
In [30]:
# Predicting known values
# I am going to use the fitted model to get to know how much error it conveys:
# I will predict over known data to see how well the model fits the series.
airline['forecast'] = results.predict(start = 130, end= 145, dynamic= True)
airline[['Thousands of Passengers','forecast']].plot(figsize=(16,8));
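To put a number on that fit, one can compute the RMSE over the window where the in-sample forecast overlaps the actual values (a minimal sketch):
comparison = airline[['Thousands of Passengers', 'forecast']].dropna()
rmse = np.sqrt(((comparison['Thousands of Passengers'] - comparison['forecast']) ** 2).mean())
print('RMSE over the evaluation window:', rmse)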
In [31]:
# Forecasting
# We need to add more time periods to do the forecast
airline.tail()
Out[31]:
In [32]:
# Creating the time periods to do the forecast
from pandas.tseries.offsets import DateOffset
future_dates = [airline.index[-1] + DateOffset(months=x) for x in range(0,25)]  # 24 future months plus the last observed date; range(0,24) would leave only 23 new periods
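An equivalent, more compact alternative (assuming the index uses month-start dates, as in this dataset) builds the 24 future months directly, with no need to slice off the first element later:
# future_dates = pd.date_range(airline.index[-1] + DateOffset(months=1), periods=24, freq='MS')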
In [33]:
future_dates
Out[33]:
In [34]:
#Creating a dataframe with the index
future_dates_df = pd.DataFrame(index=future_dates[1:],columns=airline.columns)
In [35]:
#Concatenating both dataframes
airline_forecast = pd.concat([airline,future_dates_df])
In [36]:
# Checking that the extended dataframe is right before running the forecast
print(len(airline), '->', len(airline_forecast))
airline_forecast.tail()
Out[36]:
In [37]:
# Predicting the future values
# Applying the fitted model to the future dates.
airline_forecast['forecast'] = results.predict(start = 143, end = 167, dynamic= True)
airline_forecast[['Thousands of Passengers', 'forecast']].plot(figsize=(16, 8));
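SARIMAX results also expose get_forecast, which returns prediction intervals alongside the point forecast; a sketch of the same 24-month horizon (assuming the date index carries a monthly frequency, as here):
forecast_res = results.get_forecast(steps=24)
conf_int = forecast_res.conf_int()  # lower/upper bounds, 95% by default
ax = airline['Thousands of Passengers'].plot(figsize=(16, 8), label='observed', legend=True)
forecast_res.predicted_mean.plot(ax=ax, label='forecast', legend=True)
ax.fill_between(conf_int.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], alpha=0.2);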