TL;DR
I use pandas dataframe methods to bring EU industry production data into a tidy format to facilitate further analysis.
Long Description
Here I use the Python packages SQLAlchemy and pandas to read a subset of the EU industry production dataset from a local PostgreSQL database into a dataframe. I apply pandas methods like melt and set_index to reorganize the dataframe into a tidy form that is convenient for efficient data analysis and visualization, saving a lot of time and headaches further down the road. With pandas, the reorganization turns out to be quick and straightforward.
Project background
For efficient and easy data visualization and analysis, it is vital to have the data in a tidy format. To get an idea of what this means, let us consider the definition by Hadley Wickham, who devoted an entire paper to the matter:
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This simple, “tidy” data form is optimized for automated processing rather than for human readers who process information visually. In the following, the tidy form will be the guideline for and goal of reorganizing the EU industry production dataset.
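As a minimal, made-up illustration of the difference: a table with one column per month spreads the time variable over many columns, whereas its tidy counterpart has one row per observation (the country codes and numbers here are invented for demonstration):

import pandas as pd

# Made-up example: one column per month violates "each variable forms a column"
messy = pd.DataFrame({'country': ['DE', 'FR'],
                      '2017M01': [105.3, 98.7],
                      '2017M02': [106.1, 99.2]})

# Tidy counterpart: one row per (country, month) observation
tidy = pd.melt(messy, id_vars='country', var_name='month', value_name='value')
print(tidy)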
Reading in the dataset
To start, I read a subset of the EU industry production dataset from a local PostgreSQL database into a pandas dataframe using SQLAlchemy, as described in the previous project part.
Using an SQL query, I select the manufacturing (category “C”) production index data for all countries, adjusted for calendar and season:
import pandas as pd
import sqlalchemy
# Specify the SQL query for selecting the desired subset
query = "SELECT * FROM industry_production WHERE nace_r2='C' AND s_adj='SCA' AND unit='I10'"
# Establish connection to PostgreSQL database
engine = sqlalchemy.create_engine("postgresql://postgres:xfkLVeMj@localhost/production")
# Execute SQL query and store result as dataframe
df = pd.read_sql_query(query, engine)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Columns: 781 entries, indic_bt to 1953M01
dtypes: object(781)
memory usage: 219.7+ KB
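As an aside, the escaped quotes in the query string can be avoided by handing the filter values to the database driver separately. A minimal sketch of this alternative, using SQLAlchemy's text() construct with named placeholders (equivalent to the query above):

from sqlalchemy import text

# Named placeholders (:nace etc.) are bound by the driver, no manual quoting
query = text("SELECT * FROM industry_production "
             "WHERE nace_r2 = :nace AND s_adj = :adj AND unit = :unit")
df = pd.read_sql_query(query, engine,
                       params={'nace': 'C', 'adj': 'SCA', 'unit': 'I10'})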
Preliminary cleaning
Inspection
Let’s inspect the dataframe a bit closer:
df.head()
  indic_bt nace_r2 s_adj unit geo\time 2017M08  2017M07  2017M06  2017M05  \
0     PROD       C   SCA  I10       UK       :    104.5    103.9    103.9
1     PROD       C   SCA  I10       TR       :  135.2 p  132.4 p  132.8 p
2     PROD       C   SCA  I10       SK       :  145.9 p    154.8    155.6
3     PROD       C   SCA  I10       SI       :  124.9 p  124.6 p  125.1 p
4     PROD       C   SCA  I10       SE       :    101.5    102.2    101.2

   2017M04 ... 1953M10 1953M09 1953M08 1953M07 1953M06 1953M05  \
0    104.0 ...       :       :       :       :       :       :
1  135.6 p ...       :       :       :       :       :       :
2    148.0 ...       :       :       :       :       :       :
3  123.2 p ...       :       :       :       :       :       :
4     98.1 ...       :       :       :       :       :       :

  1953M04 1953M03 1953M02 1953M01
0       :       :       :       :
1       :       :       :       :
2       :       :       :       :
3       :       :       :       :
4       :       :       :       :

[5 rows x 781 columns]
The first four columns do not offer any additional information because there is only a single value in each column, based on the WHERE filter of the query.
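This is easy to verify by counting the distinct values in each of these columns:

# Each of the four columns should contain exactly one distinct value
print(df[['indic_bt', 'nace_r2', 's_adj', 'unit']].nunique())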
Dropping redundant columns
Let us get rid of them with the drop() method:
df.drop(['indic_bt','nace_r2','s_adj','unit'], axis=1, inplace=True)
print(df.head())
  geo\time 2017M08  2017M07  2017M06  2017M05  2017M04  2017M03  2017M02  \
0       UK       :    104.5    103.9    103.9    104.0    104.0    104.6
1       TR       :  135.2 p  132.4 p  132.8 p  135.6 p  132.2 p  130.3 p
2       SK       :  145.9 p    154.8    155.6    148.0    160.3    153.0
3       SI       :  124.9 p  124.6 p  125.1 p  123.2 p  124.1 p  122.3 p
4       SE       :    101.5    102.2    101.2     98.1    100.6    100.6

   2017M01  2016M12 ... 1953M10 1953M09 1953M08 1953M07 1953M06  \
0    105.0    106.2 ...       :       :       :       :       :
1  131.3 p  129.4 p ...       :       :       :       :       :
2    153.4    154.1 ...       :       :       :       :       :
3  117.6 p  121.8 p ...       :       :       :       :       :
4    100.1     96.6 ...       :       :       :       :       :

  1953M05 1953M04 1953M03 1953M02 1953M01
0       :       :       :       :       :
1       :       :       :       :       :
2       :       :       :       :       :
3       :       :       :       :       :
4       :       :       :       :       :

[5 rows x 777 columns]
The parameter axis=1 tells pandas that the things to be dropped are columns (instead of rows), and inplace=True applies the deletion to the dataframe df itself (instead of a copy).
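Equivalently, the result can be assigned back instead of modifying the dataframe in place; newer pandas versions (0.21 and later) also accept a columns keyword that makes the intent explicit:

# Equivalent without inplace: drop() returns a copy that is assigned back
df = df.drop(columns=['indic_bt', 'nace_r2', 's_adj', 'unit'])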
Fixing a column name
As a next step, I change the messed-up name of the geo\time column to something more meaningful. The rename() method takes a dictionary with entries of the form 'old_name': 'new_name' for its columns parameter:
df.rename(columns={'geo\\time':'country_code'}, inplace=True)
print(df.head())
  country_code 2017M08  2017M07  2017M06  2017M05  2017M04  2017M03  2017M02  \
0           UK       :    104.5    103.9    103.9    104.0    104.0    104.6
1           TR       :  135.2 p  132.4 p  132.8 p  135.6 p  132.2 p  130.3 p
2           SK       :  145.9 p    154.8    155.6    148.0    160.3    153.0
3           SI       :  124.9 p  124.6 p  125.1 p  123.2 p  124.1 p  122.3 p
4           SE       :    101.5    102.2    101.2     98.1    100.6    100.6

   2017M01  2016M12 ... 1953M10 1953M09 1953M08 1953M07 1953M06  \
0    105.0    106.2 ...       :       :       :       :       :
1  131.3 p  129.4 p ...       :       :       :       :       :
2    153.4    154.1 ...       :       :       :       :       :
3  117.6 p  121.8 p ...       :       :       :       :       :
4    100.1     96.6 ...       :       :       :       :       :

  1953M05 1953M04 1953M03 1953M02 1953M01
0       :       :       :       :       :
1       :       :       :       :       :
2       :       :       :       :       :
3       :       :       :       :       :
4       :       :       :       :       :

[5 rows x 777 columns]
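Note that the backslash has to be escaped in the dictionary key ('geo\\time'), since 'geo\time' would contain a tab character ('\t'). A raw string literal is an equivalent alternative:

# Raw string: the backslash is taken literally, so no escaping is needed
df.rename(columns={r'geo\time': 'country_code'}, inplace=True)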
While this dataframe structure is quite convenient for a human to read (ignoring its sheer size for a moment…), it is not optimal for data analysis purposes: the time series is spread over many columns, and the timestamps are strings in a non-standard format.
Identifying variables and observations
To tidy up the dataframe, I first need to know what its (i) variables, (ii) observations, and (iii) types of observational units are.
Point (iii) is simple: I only selected the production index from the PostgreSQL table, so there is just one type of observational unit and hence one table. The variables are the country code, the time, and the production index. The observations are the production index values for a given country and month.
A tidy version of the dataframe would thus have three columns (time, country_code, production_index) and many rows. The columns are typically ordered like this: fixed variables, i.e. variables that define the experimental setup, come first (time, country_code); measured variables (production_index) come later.
Reorganizing the dataset
Melting the time columns
In the current structure of the dataframe, country_code is already a column, but time is not. Instead, the time values (months) are spread over separate columns. I can gather the column names into a time column using the pandas melt() function:
df_melted = pd.melt(df, id_vars=['country_code'], var_name='time', value_name='production_index')
df_melted.info()
print(df_melted.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27936 entries, 0 to 27935
Data columns (total 3 columns):
country_code        27936 non-null object
time                27936 non-null object
production_index    27936 non-null object
dtypes: object(3)
memory usage: 654.8+ KB

  country_code     time production_index
0           UK  2017M08                :
1           TR  2017M08                :
2           SK  2017M08                :
3           SI  2017M08                :
4           SE  2017M08                :
The melt() function takes all columns that are not listed in the id_vars argument, puts the column names in a 'variable' column, and puts the values in a 'value' column. I have changed these default column names to 'time' and 'production_index' using the var_name and value_name arguments.
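As a quick sanity check of the new length: the dataframe going into melt() had 36 rows (one per country) and, besides country_code, 776 month columns, and 36 × 776 = 27936, which matches the RangeIndex above.

# Sanity check: every (country, month) pair becomes exactly one row
assert len(df_melted) == 36 * 776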
Using a Datetime Index…
The current shape of the dataframe after applying the melt() function is already much closer to the desired tidy form. However, the time values are still strings (non-null object in the info() output), so Python does not “know” that the dataset is a time series. This can be changed by transforming the time column into a DatetimeIndex. First, I convert the time column from strings to datetime format with the to_datetime() function. To help the parser understand the format, I replace the 'M' in each string with a '-' before passing the column to the conversion function:
df_melted['time'] = pd.to_datetime(df_melted['time'].str.replace('M','-'))
df_melted.info()
print(df_melted.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27936 entries, 0 to 27935
Data columns (total 3 columns):
country_code        27936 non-null object
time                27936 non-null datetime64[ns]
production_index    27936 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 654.8+ KB

  country_code       time production_index
0           UK 2017-08-01                :
1           TR 2017-08-01                :
2           SK 2017-08-01                :
3           SI 2017-08-01                :
4           SE 2017-08-01                :
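Since all time strings follow the same pattern, an explicit format string should work as well; it skips the format guessing and is typically much faster. A minimal sketch of this alternative, which would replace the conversion line above:

# Alternative to the replace-and-parse line above: the literal 'M' in the
# format string is matched as-is, so '2017M08' parses without preprocessing
df_melted['time'] = pd.to_datetime(df_melted['time'], format='%YM%m')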
Now I can use the time column to obtain a DatetimeIndex, using the set_index() method:
df_tidy = df_melted.set_index('time')
df_tidy.info()
print(df_tidy.head())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 27936 entries, 2017-08-01 to 1953-01-01
Data columns (total 2 columns):
country_code        27936 non-null object
production_index    27936 non-null object
dtypes: object(2)
memory usage: 654.8+ KB

           country_code production_index
time
2017-08-01           UK                :
2017-08-01           TR                :
2017-08-01           SK                :
2017-08-01           SI                :
2017-08-01           SE                :
This looks much better!
…or rather a Multi-Index
There is, however, one more thing to consider: not only time, but also country_code is a fixed variable of the dataset. The dataset is thus two-dimensional. In pandas, this can be reflected by a multi-dimensional index, or “MultiIndex” for short. It is constructed simply by providing a list of column names to the set_index() method instead of a single column name:
df_tidy = df_melted.set_index(['time','country_code'])
df_tidy.info()
print(df_tidy.head())
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (2017-08-01 00:00:00, UK) to (1953-01-01 00:00:00, AT)
Data columns (total 1 columns):
production_index    27936 non-null object
dtypes: object(1)
memory usage: 306.5+ KB

                        production_index
time       country_code
2017-08-01 UK                          :
           TR                          :
           SK                          :
           SI                          :
           SE                          :
This structure is a much more natural representation of the dataset for data analysis with pandas dataframes.
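For example, a single observation can now be looked up by passing an index tuple to loc; a small sketch, assuming 'DE' is among the country codes (as it should be for an EU dataset):

# Look up the production index for one (month, country) pair
print(df_tidy.loc[(pd.Timestamp('2017-07-01'), 'DE'), 'production_index'])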
Sorting the index
As a last step, I sort the index (dates in chronological order, country codes in alphabetical order):
df_tidy.sort_index(inplace=True)
df_tidy.info()
print(df_tidy.head())
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 1 columns):
production_index    27936 non-null object
dtypes: object(1)
memory usage: 306.5+ KB

                        production_index
time       country_code
1953-01-01 AT                          :
           BA                          :
           BE                          :
           BG                          :
           CY                          :
Besides bringing the dataframe into a neat order, this is also necessary for slicing it (e.g., extracting a certain date range) in future project parts.
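For instance, with the index sorted, a date range can be extracted by slicing on the first index level; a brief sketch (the bounds here are arbitrary):

# Extract all observations from the 1990s; this requires a sorted index
nineties = df_tidy.loc[pd.Timestamp('1990-01-01'):pd.Timestamp('1999-12-01')]
print(nineties.head())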
Conclusion
I have reorganized an EU industry production data (sub-)set, read in from a PostgreSQL database, to bring it into a tidy form. At the beginning, the pandas dataframe was in a form suitable for human reading: the two dimensions of the dataset (time and country) were reflected by the two dimensions of the dataframe (columns and rows). Using pandas methods, I restructured the dataframe so that the two dimensions form a hierarchical index (MultiIndex) and the production index values form a value column. This is a form that pandas “understands” and that will thus be beneficial for analyzing and visualizing the dataset.
This exercise also showed how powerful pandas dataframe methods are. With only a couple of lines of code and no explicit loops, hundreds of columns could be parsed, converted, and reorganized. The datetime parser in particular, while slow, is remarkably good at correctly identifying the date elements in strings, even when they are incomplete.
Code
The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.1 kernel and Anaconda 4.4.0.
The Jupyter notebook can be found on GitHub.
Bio
I am a data scientist with a background in solar physics and extensive experience in turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.
Contact details
Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/