Removing common trends from a set of time series to highlight their differences

TL;DR

I divide the EU industry production index time series for each country by the smoothed EU average time series to bring out the countries’ individual development for further modeling.

Long Description

Using a chain of pandas methods to obtain a rolling-mean average, I smooth the EU average time series of the industry production index. This curve contains the development that the different countries have in common. Dividing the time series for each country element-wise by the EU-average curve thus removes this common part. The remaining normalized production index reflects the countries’ individual development compared with the EU average and is the basis for the growth modeling in the next project.

Table of contents

Project Background

From the exploratory data analysis that I performed on the EU industry production data in the previous project, I know that the time series for the different countries show different long-term trends (increasing or decreasing production index) but also conspicuous parallel features such as a sudden drop and recovery in the years 2008–2011.

As I would like to compare the countries’ performance, I want to (i) remove these common trends to distill their individual devolopment and (ii) then use some simple model to extract some quantity from the time series that can serve as a measure for the countries’ performance.

In this project, I normalize the individual time series by the EU average to achieve goal (i). The next project will then deal with goal (ii).

Recap

Reading in the data

As before, I read in the tidy dataframe with clean values from a pickled file, using the read_pickle() method and show its structure:

import pandas as pd
df = pd.read_pickle('EU_industry_production_dataframe_clean.pkl')
df.info()
print(df.head())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 2 columns):
production_index    8744 non-null float64
flags               27936 non-null category
dtypes: category(1), float64(1)
memory usage: 333.9+ KB

                         production_index flags
time       country_code                        
1953-01-01 AT                         NaN      
           BA                         NaN      
           BE                         NaN      
           BG                         NaN      
           CY                         NaN

Data visualization

From the exploratory data analysis that I did in the previous project I know that the individual countries showed a sudden drop in the production index values in 2008. Here is how I plotted the curves for all the countries:

import matplotlib.pyplot as plt
import numpy as np

# Select data for years 2000-2017 and store it in a new dataframe:
df_late = df.loc[(slice('2000','2017'),slice(None)),:].copy()   # Note: the copy() is needed to avoid getting a view

# Create figure and axes objects:
fig, ax = plt.subplots()

# Reset line properties to avoid ambiguity between different lines:
ax.set_prop_cycle(color=plt.cm.nipy_spectral(np.linspace(0,1,36)),linestyle=['-','--',':']*12)

# Create the plot:
ax = df_late.unstack(level=1).plot(y='production_index',ax=ax)
plt.xlabel('Year')
plt.ylabel('Production index')
ax.legend(ncol=6, fontsize=7)   # Adjust shape (four columns instead of one) and font size of legend
plt.show()

The drop is clearest in the average over all 28 EU countries:

eu_avg_late = df_late.loc[(slice(None),'EU28'),:].unstack(level=1)  # Select EU average for years 2000-2017

ax = eu_avg_late.plot(y='production_index')  # Plot
plt.xlabel('Year')
plt.ylabel('Production index')
plt.show()

In addition to the drop and recovery, there is a moderate long-term increase of the production index.

This curve can be used to normalize the time series for each country, to bring out the individual performance, i.e. if a country has developed faster or slower than the EU average.

Smoothing the average curve

Rolling mean for the selected years…

Before appying the normalization, I first smooth the EU average to get rid of the short-time fluctuations. There are different ways to accomplish this (e.g., Savitzky-Golay filters), with different side effects. Here, I just want a rough correction and use a simple rolling mean over five consecutive time steps (five months):

eu_avg_late_smooth = eu_avg_late.rolling(window=5).mean()  # Smooth EU average

ax = eu_avg_late_smooth.plot(y='production_index')  # Plot
plt.xlabel('Year')
plt.ylabel('Production index')
plt.show()

This curve is smoother indeed. Note, however, that I lose four data points at the beginning (January 2000 through April 2000) because the rolling mean of five points needs the current point plus the previous four data points to output a value.

…and for all the years

To prevent this from happening, I first apply the rolling mean on the full time series first and only then select the period 2000–2017:

eu_avg = df.loc[(slice(None),'EU28'),:].unstack(level=1)  # Select EU average for all years (1953-2017)
eu_avg_smooth = eu_avg.rolling(window=5).mean()           # Smooth EU average
eu_avg_late_smooth_corrected = eu_avg_smooth.loc[slice('2000','2017'),:]  # Restrict to years 2000-2017

ax = eu_avg_late_smooth_corrected.plot(y='production_index')  # Plot
plt.xlabel('Year')
plt.ylabel('Production index')
plt.show()

Normalizing the time series

Dividing by the EU average

Now I take the smoothed curve and divide each time series in df_late by it, element-wise. This is done with the pandas div() method. Before applying it, I select the production_index (since I cannot sensibly divide the flag strings by numbers) and unstack the country_code from the MultiIndex (turn each country into its own column).

In the resulting pandas series, I stack up the country_code values in the index again to have the same structure as in the original dataframe:

eu_avg_late_smooth_corrected_series = eu_avg_late_smooth_corrected['production_index','EU28']  # Select pandas series

# Unstack country_code, divide each column by smoothed series, stack country_code again:
production_index_norm_series = df_late['production_index'].unstack(level=1).div(eu_avg_late_smooth_corrected_series,axis=0).stack()
print(production_index_norm_series.head())

time        country_code
2000-01-01  AT              0.763119
            BE              0.659628
            BG              0.638720
            CY              1.076730
            CZ              0.609450
dtype: float64

This series contains the ratio of a country’s production index and the EU average production index, so it is easy to see how well a country has performed compared to others in the given period.

A glimpse on the normalized time series

Let’s plot the curves and see if the dip after the start of the financial crisis is gone:

fig, ax = plt.subplots()        # Create figure and axes objects

# Reset line properties to avoid ambiguity between different lines:
ax.set_prop_cycle(color=plt.cm.nipy_spectral(np.linspace(0,1,36)),linestyle=['-','--',':']*12)

# Create the plot:
ax = production_index_norm_series.unstack(level=1).plot(ax=ax)
plt.xlabel('Year')
plt.ylabel('Production index [relative to EU28 avg.]')
ax.legend(ncol=6, fontsize=7)   # Adjust shape (four columns instead of one) and font size of legend
plt.show()

The correction is far from perfect, but removes the biggest part of the dip in the years 2008–2011.

Store the normalized time series in the dataframe

I integrate the normalized industry production values into the dataframe:

df_late.loc[:,'production_index_norm'] = production_index_norm_series
print(df_late.info())
print(df_late.head())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 7632 entries, (2000-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 3 columns):
production_index         7412 non-null float64
flags                    7632 non-null category
production_index_norm    7404 non-null float64
dtypes: category(1), float64(2)
memory usage: 156.8+ KB
None

                         production_index flags  production_index_norm
time       country_code                                               
2000-01-01 AT                        73.0                     0.763119
           BA                         NaN                          NaN
           BE                        63.1                     0.659628
           BG                        61.1                     0.638720
           CY                       103.0                     1.076730

The normalized time series are now ready to be used for modeling.

For using the dataframe in the next project, I store it on disk as a pickled file:

df_late.to_pickle('EU_industry_production_dataframe_normalized_2000-2017.pkl')

Conclusion

I have smoothed the EU-average production index time series using a rolling-mean average and divided the time series for each country by this smooth average curve. This processing has mostly removed the common features of all the time series, in particular the drop and subsequent rise of the index in the wake of the financial crisis in 2008. The resulting normalized production index time series shows the individual development of the production index for each country with respect to the common development.

They can thus be used as a starting point for modeling the growth dynamics of the production index for each country.

In this project, I had to pay closer attention to whether pandas creates a view or a copy of a dataframe selection when I attempted to assign values to the dataframe. Especially the distinction between chaining (df['a']['b']) and locator-based access (df.loc['b','a']) is important: Chaining might create either a view or a copy (depending on arcane implementation details of pandas), whereas loc will create a view. So the latter should be used for the assignment. If the first option is used instead, pandas issues a warning (SettingWithCopyWarning, see pandas documentation) because it might be that not the original dataframe is altered, but a copy that has no reference.

Code

The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.3 kernel and Anaconda 5.0.1.

The Jupyter notebook can be found on Github.

Bio

I am a data scientist with a background in solar physics, with a long experience of turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.

Contact details

Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/

PhD thesis

Impressum