TL;DR
I use pandas dataframe methods to bring EU industry production data into a tidy format to facilitate further analysis.
Long Description
Here I use the Python packages SQLAlchemy and pandas to read a subset of the EU industry production dataset from a local PostgreSQL database into a dataframe. I apply pandas methods like melt and set_index to reorganize the dataframe into a tidy form that is convenient for efficient data analysis and visualization, saving a lot of time and headaches further down the road. With pandas, the reorganization turns out to be quick and straightforward.
Project background
For efficient and easy data visualization and analysis, it is vital to have the data in a tidy format. To get an idea of what this means, let us consider the definition by Hadley Wickham, who devoted an entire paper to the matter:
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This simple, “tidy” data form is optimized for automated processing rather than for human readers who process information visually. In the following, the tidy form will be the guideline for and goal of reorganizing the EU industry production dataset.
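As a minimal, made-up illustration of the difference: a table with one column per month spreads the time variable over many columns, whereas its tidy counterpart has one row per observation (the country codes and numbers here are invented for demonstration):

import pandas as pd

# Made-up example: one column per month violates "each variable forms a column"
messy = pd.DataFrame({'country': ['DE', 'FR'],
                      '2017M01': [105.3, 98.7],
                      '2017M02': [106.1, 99.2]})

# Tidy counterpart: one row per (country, month) observation
tidy = pd.melt(messy, id_vars='country', var_name='month', value_name='value')
print(tidy)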
Reading in the dataset
To start, I read a subset of the EU industry production dataset from a local PostgreSQL database into a pandas dataframe using SQLAlchemy, as described in the previous project part.
Using an SQL query, I select the manufacturing (category “C”) production index data for all countries, adjusted for calendar and season:
import pandas as pd
import sqlalchemy
# Specify the SQL query for selecting the desired subset
query = "SELECT * FROM industry_production WHERE nace_r2='C' AND s_adj='SCA' AND unit='I10'"
# Establish connection to PostgreSQL database
engine = sqlalchemy.create_engine("postgresql://postgres:xfkLVeMj@localhost/production")
# Execute SQL query and store result as dataframe
df = pd.read_sql_query(query, engine)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Columns: 781 entries, indic_bt to 1953M01
dtypes: object(781)
memory usage: 219.7+ KB
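As an aside, the escaped quotes in the query string can be avoided by handing the filter values to the database driver separately. A minimal sketch of this alternative, using SQLAlchemy's text() construct with named placeholders (equivalent to the query above):

from sqlalchemy import text

# Named placeholders (:nace etc.) are bound by the driver, no manual quoting
query = text("SELECT * FROM industry_production "
             "WHERE nace_r2 = :nace AND s_adj = :adj AND unit = :unit")
df = pd.read_sql_query(query, engine,
                       params={'nace': 'C', 'adj': 'SCA', 'unit': 'I10'})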
Preliminary cleaning
Inspection
Let’s inspect the dataframe a bit closer:
df.head()
  indic_bt nace_r2 s_adj unit geo\time 2017M08  2017M07  2017M06  2017M05  \
0     PROD       C   SCA  I10       UK       :    104.5    103.9    103.9
1     PROD       C   SCA  I10       TR       :  135.2 p  132.4 p  132.8 p
2     PROD       C   SCA  I10       SK       :  145.9 p    154.8    155.6
3     PROD       C   SCA  I10       SI       :  124.9 p  124.6 p  125.1 p
4     PROD       C   SCA  I10       SE       :    101.5    102.2    101.2

   2017M04 ... 1953M10 1953M09 1953M08 1953M07 1953M06 1953M05  \
0    104.0 ...       :       :       :       :       :       :
1  135.6 p ...       :       :       :       :       :       :
2    148.0 ...       :       :       :       :       :       :
3  123.2 p ...       :       :       :       :       :       :
4     98.1 ...       :       :       :       :       :       :

  1953M04 1953M03 1953M02 1953M01
0       :       :       :       :
1       :       :       :       :
2       :       :       :       :
3       :       :       :       :
4       :       :       :       :

[5 rows x 781 columns]
The first four columns do not offer any additional information because there is only a single value in each column, based on the WHERE filter of the query.
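This is easy to verify by counting the distinct values in each of these columns:

# Each of the four columns should contain exactly one distinct value
print(df[['indic_bt', 'nace_r2', 's_adj', 'unit']].nunique())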
Dropping redundant columns
Let us get rid of them with the drop() method:
df.drop(['indic_bt','nace_r2','s_adj','unit'], axis=1, inplace=True)
print(df.head())
  geo\time 2017M08  2017M07  2017M06  2017M05  2017M04  2017M03  2017M02  \
0       UK       :    104.5    103.9    103.9    104.0    104.0    104.6
1       TR       :  135.2 p  132.4 p  132.8 p  135.6 p  132.2 p  130.3 p
2       SK       :  145.9 p    154.8    155.6    148.0    160.3    153.0
3       SI       :  124.9 p  124.6 p  125.1 p  123.2 p  124.1 p  122.3 p
4       SE       :    101.5    102.2    101.2     98.1    100.6    100.6

   2017M01  2016M12 ... 1953M10 1953M09 1953M08 1953M07 1953M06  \
0    105.0    106.2 ...       :       :       :       :       :
1  131.3 p  129.4 p ...       :       :       :       :       :
2    153.4    154.1 ...       :       :       :       :       :
3  117.6 p  121.8 p ...       :       :       :       :       :
4    100.1     96.6 ...       :       :       :       :       :

  1953M05 1953M04 1953M03 1953M02 1953M01
0       :       :       :       :       :
1       :       :       :       :       :
2       :       :       :       :       :
3       :       :       :       :       :
4       :       :       :       :       :

[5 rows x 777 columns]
The parameter axis=1 tells pandas that the things to be dropped are columns (instead of rows), and inplace=True applies the deletion to the dataframe df itself (instead of a copy).
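Equivalently, the result can be assigned back instead of modifying the dataframe in place; newer pandas versions (0.21 and later) also accept a columns keyword that makes the intent explicit:

# Equivalent without inplace: drop() returns a copy that is assigned back
df = df.drop(columns=['indic_bt', 'nace_r2', 's_adj', 'unit'])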
Fixing a column name
As a next step, I change the messed-up name of the geo\time column to something more meaningful. The rename() method takes a dictionary with entries of the form 'old_name': 'new_name' for its columns parameter:
df.rename(columns={'geo\\time':'country_code'}, inplace=True)
print(df.head())
  country_code 2017M08  2017M07  2017M06  2017M05  2017M04  2017M03  2017M02  \
0           UK       :    104.5    103.9    103.9    104.0    104.0    104.6
1           TR       :  135.2 p  132.4 p  132.8 p  135.6 p  132.2 p  130.3 p
2           SK       :  145.9 p    154.8    155.6    148.0    160.3    153.0
3           SI       :  124.9 p  124.6 p  125.1 p  123.2 p  124.1 p  122.3 p
4           SE       :    101.5    102.2    101.2     98.1    100.6    100.6

   2017M01  2016M12 ... 1953M10 1953M09 1953M08 1953M07 1953M06  \
0    105.0    106.2 ...       :       :       :       :       :
1  131.3 p  129.4 p ...       :       :       :       :       :
2    153.4    154.1 ...       :       :       :       :       :
3  117.6 p  121.8 p ...       :       :       :       :       :
4    100.1     96.6 ...       :       :       :       :       :

  1953M05 1953M04 1953M03 1953M02 1953M01
0       :       :       :       :       :
1       :       :       :       :       :
2       :       :       :       :       :
3       :       :       :       :       :
4       :       :       :       :       :

[5 rows x 777 columns]
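Note that the backslash has to be escaped in the dictionary key ('geo\\time'), since 'geo\time' would contain a tab character ('\t'). A raw string literal is an equivalent alternative:

# Raw string: the backslash is taken literally, so no escaping is needed
df.rename(columns={r'geo\time': 'country_code'}, inplace=True)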
While this dataframe structure is quite convenient for a human to read (ignoring its sheer size for a moment…), it is not optimal for data analysis purposes: the time series is spread over many columns, and the timestamps are strings in a non-standard format.
Identifying variables and observations
To tidy up the dataframe, I first need to know what its (i) variables, (ii) observations, and (iii) types of observational units are.
Point (iii) is simple: I only selected the production index from the PostgreSQL table, so there is just one type of observational unit and hence one table. The variables are the country code, the time, and the production index. The observations are the production index values for a given country and month.
A tidy version of the dataframe would thus have three columns (time, country_code, production_index) and many rows. The columns are typically ordered like this: fixed variables, i.e. variables that define the experimental setup, come first (time, country_code); measured variables (production_index) come later.
Reorganizing the dataset
Melting the time columns
In the current structure of the dataframe, country_code is already a column, but time is not. Instead, the time values (months) are spread over separate columns. I can gather the column names into a time column using the pandas melt() function:
df_melted = pd.melt(df, id_vars=['country_code'], var_name='time', value_name='production_index')
df_melted.info()
print(df_melted.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27936 entries, 0 to 27935
Data columns (total 3 columns):
country_code        27936 non-null object
time                27936 non-null object
production_index    27936 non-null object
dtypes: object(3)
memory usage: 654.8+ KB

  country_code     time production_index
0           UK  2017M08                :
1           TR  2017M08                :
2           SK  2017M08                :
3           SI  2017M08                :
4           SE  2017M08                :
The melt() function takes all columns that are not listed in the id_vars argument, puts the column names in a 'variable' column, and puts the values in a 'value' column. I have changed these default column names to 'time' and 'production_index' using the var_name and value_name arguments.
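As a quick sanity check of the new length: the dataframe going into melt() had 36 rows (one per country) and, besides country_code, 776 month columns, and 36 × 776 = 27936, which matches the RangeIndex above.

# Sanity check: every (country, month) pair becomes exactly one row
assert len(df_melted) == 36 * 776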
Using a Datetime Index…
The current shape of the dataframe after applying the melt() function is already much closer to the desired tidy form. However, the time values are still strings (non-null object in the info() output), so Python does not “know” that the dataset is a time series. This can be changed by transforming the time column into a DatetimeIndex. First, I convert the time column from strings to datetime format with the to_datetime() function. To help the parser understand the format, I replace the 'M' in each string with a '-' before passing the column to the conversion function:
df_melted['time'] = pd.to_datetime(df_melted['time'].str.replace('M','-'))
df_melted.info()
print(df_melted.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27936 entries, 0 to 27935
Data columns (total 3 columns):
country_code        27936 non-null object
time                27936 non-null datetime64[ns]
production_index    27936 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 654.8+ KB

  country_code       time production_index
0           UK 2017-08-01                :
1           TR 2017-08-01                :
2           SK 2017-08-01                :
3           SI 2017-08-01                :
4           SE 2017-08-01                :
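Since all time strings follow the same pattern, an explicit format string should work as well; it skips the format guessing and is typically much faster. A minimal sketch of this alternative, which would replace the conversion line above:

# Alternative to the replace-and-parse line above: the literal 'M' in the
# format string is matched as-is, so '2017M08' parses without preprocessing
df_melted['time'] = pd.to_datetime(df_melted['time'], format='%YM%m')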
Now I can use the time column to obtain a DatetimeIndex, using the set_index() method:
df_tidy = df_melted.set_index('time')
df_tidy.info()
print(df_tidy.head())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 27936 entries, 2017-08-01 to 1953-01-01
Data columns (total 2 columns):
country_code        27936 non-null object
production_index    27936 non-null object
dtypes: object(2)
memory usage: 654.8+ KB

           country_code production_index
time
2017-08-01           UK                :
2017-08-01           TR                :
2017-08-01           SK                :
2017-08-01           SI                :
2017-08-01           SE                :
This looks much better!
…or rather a Multi-Index
There is, however, one more thing to consider: not only time, but also country_code is a fixed variable of the dataset. The dataset is thus two-dimensional. In pandas, this can be reflected by a multi-dimensional index, or “MultiIndex” for short. It is constructed simply by providing a list of column names to the set_index() method instead of a single column name:
df_tidy = df_melted.set_index(['time','country_code'])
df_tidy.info()
print(df_tidy.head())
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (2017-08-01 00:00:00, UK) to (1953-01-01 00:00:00, AT)
Data columns (total 1 columns):
production_index    27936 non-null object
dtypes: object(1)
memory usage: 306.5+ KB

                        production_index
time       country_code
2017-08-01 UK                          :
           TR                          :
           SK                          :
           SI                          :
           SE                          :
This structure is a much more natural representation of the dataset for data analysis with pandas dataframes.
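For example, a single observation can now be looked up by passing an index tuple to loc; a small sketch, assuming 'DE' is among the country codes (as it should be for an EU dataset):

# Look up the production index for one (month, country) pair
print(df_tidy.loc[(pd.Timestamp('2017-07-01'), 'DE'), 'production_index'])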
Sorting the index
As a last step, I sort the index (dates in chronological order, country codes in alphabetical order):
df_tidy.sort_index(inplace=True)
df_tidy.info()
print(df_tidy.head())
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 1 columns):
production_index    27936 non-null object
dtypes: object(1)
memory usage: 306.5+ KB

                        production_index
time       country_code
1953-01-01 AT                          :
           BA                          :
           BE                          :
           BG                          :
           CY                          :
Besides bringing the dataframe into a neat order, this is also necessary for slicing it (e.g., extracting a certain date range) in future project parts.
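For instance, with the index sorted, a date range can be extracted by slicing on the first index level; a brief sketch (the bounds here are arbitrary):

# Extract all observations from the 1990s; this requires a sorted index
nineties = df_tidy.loc[pd.Timestamp('1990-01-01'):pd.Timestamp('1999-12-01')]
print(nineties.head())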
Conclusion
I have reorganized an EU industry production data (sub-)set, read in from a PostgreSQL database, to bring it into a tidy form. At the beginning, the pandas dataframe was in a form suitable for human reading: the two dimensions of the dataset (time and country) were reflected by the two dimensions of the dataframe (columns and rows). Using pandas methods, I restructured the dataframe so that the two dimensions form a hierarchical index (MultiIndex) and the production index values form a value column. This is a form that pandas “understands” and that will thus be beneficial for analyzing and visualizing the dataset.
This exercise also showed how powerful pandas dataframe methods are. With only a couple of lines of code and no explicit loops, hundreds of columns could be parsed, converted, and reorganized. The datetime parser in particular, while slow, is remarkably good at correctly identifying the date elements in strings, even when they are incomplete.
Code
The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.1 kernel and Anaconda 4.4.0.
The Jupyter notebook can be found on GitHub.
Bio
I am a data scientist with a background in solar physics and extensive experience in turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.
Contact details
Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/