TL;DR
Using the linear regression model from Python’s scikit-learn package, I obtain the slope of the EU industry production time series for each country.
Long Description
I prepare the normalized EU industry production index dataset for the fit routine of the scikit-learn linear model by forcing the time stamps into a 2D numpy array and the production indices into one 1D array per country. For each country, I perform the linear regression with the fit method and store both the intercept and slope parameters of the fit lines in 1D numpy arrays. I also obtain the slope by simply subtracting the production index at the beginning of the time series from its value at the end and dividing by the time span. I store both slope estimates and the intercept, together with the country codes, in a new dataframe to simplify further analysis.
Project Background
After having defined the model to describe the EU industry production index time series in the previous project, I now fit the model with the help of the scikit-learn package.
While the package offers many different generalized linear models, I opt for standard linear regression, given the simplicity of the problem with only one independent variable.
Data preparation
Inspecting the data
I start by reading in the dataframe that I previously prepared for this project:
import pandas as pd
df = pd.read_pickle('EU_industry_production_dataframe_forregression.pkl')
df.info()
print(df.head())
<class 'pandas.core.frame.DataFrame'>
Float64Index: 212 entries, -10.0013689254 to 7.58110882957
Data columns (total 36 columns):
AT      211 non-null float64
BA      139 non-null float64
BE      211 non-null float64
BG      211 non-null float64
CY      211 non-null float64
CZ      211 non-null float64
DE      211 non-null float64
DK      211 non-null float64
EA19    211 non-null float64
EE      211 non-null float64
EL      211 non-null float64
ES      211 non-null float64
EU28    211 non-null float64
FI      211 non-null float64
FR      211 non-null float64
HR      211 non-null float64
HU      211 non-null float64
IE      211 non-null float64
IT      211 non-null float64
LT      211 non-null float64
LU      211 non-null float64
LV      211 non-null float64
ME       91 non-null float64
MK      211 non-null float64
MT      211 non-null float64
NL      211 non-null float64
NO      211 non-null float64
PL      211 non-null float64
PT      211 non-null float64
RO      211 non-null float64
RS      211 non-null float64
SE      211 non-null float64
SI      211 non-null float64
SK      211 non-null float64
TR      211 non-null float64
UK      211 non-null float64
dtypes: float64(36)
memory usage: 61.3 KB

country_code        AT  BA        BE        BG        CY        CZ        DE
time
-10.001369    0.763119 NaN  0.659628  0.638720  1.076730  0.609450  0.882291
-9.916496     0.804837 NaN  0.675563  0.650542  1.094662  0.628649  0.900751
-9.837098     0.799001 NaN  0.682480  0.650229  1.071577  0.636704  0.900957
-9.752225     0.806887 NaN  0.703174  0.646132  1.076540  0.650280  0.909562
-9.670089     0.819436 NaN  0.705168  0.643401  1.054149  0.660902  0.923410

country_code        DK      EA19        EE  ...        NO        PL
time                                        ...
-10.001369    1.108091  1.015053  0.577044  ...  0.993101  0.510140
-9.916496     1.097790  1.027940  0.606756  ...  0.988324  0.521268
-9.837098     1.114232  1.029963  0.587807  ...  0.989388  0.516022
-9.752225     1.093134  1.036092  0.602572  ...  0.984236  0.533084
-9.670089     1.179741  1.043854  0.593988  ...  0.957381  0.542516

country_code        PT        RO        RS        SE        SI        SK
time
-10.001369    1.266987  0.752666  0.963830  0.950240  0.822705  0.461008
-9.916496     1.256255  0.753753  1.083194  0.987281  0.845496  0.473311
-9.837098     1.190179  0.771952  1.087183  0.995630  0.857262  0.487932
-9.752225     1.176105  0.781995  1.097283  1.015350  0.854594  0.490562
-9.670089     1.220918  0.779288  1.089150  1.019148  0.866790  0.456043

country_code        TR        UK
time
-10.001369    0.614677  1.116454
-9.916496     0.631776  1.129066
-9.837098     0.616937  1.125676
-9.752225     0.643020  1.116988
-9.670089     0.644431  1.113856

[5 rows x 36 columns]
Note that Bosnia and Herzegovina ('BA') and Montenegro ('ME') have many missing values, since their time series do not start in 2000 but later.
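To see exactly how many values are missing per column, a quick check along these lines (not part of the original notebook) could be run:
# Count missing values per country and show the most affected columns:
print(df.isnull().sum().sort_values(ascending=False).head())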
Dropping NaN values
The fitting routine for the linear model in scikit-learn does not accept NaN values though, so I need to account for that. Because only two countries are affected (which are not even in the EU), I simply drop them:
df.drop(['BA','ME'], axis=1, inplace=True)
In addition, all the other time series have only 211 non-null entries out of 212 values in total. These single NaN entries are the latest data point for each country. I drop them as well and get nice NaN-free time series:
df.drop(df.index[-1], inplace=True)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Float64Index: 211 entries, -10.0013689254 to 7.49623545517
Data columns (total 34 columns):
AT      211 non-null float64
BE      211 non-null float64
BG      211 non-null float64
CY      211 non-null float64
CZ      211 non-null float64
DE      211 non-null float64
DK      211 non-null float64
EA19    211 non-null float64
EE      211 non-null float64
EL      211 non-null float64
ES      211 non-null float64
EU28    211 non-null float64
FI      211 non-null float64
FR      211 non-null float64
HR      211 non-null float64
HU      211 non-null float64
IE      211 non-null float64
IT      211 non-null float64
LT      211 non-null float64
LU      211 non-null float64
LV      211 non-null float64
MK      211 non-null float64
MT      211 non-null float64
NL      211 non-null float64
NO      211 non-null float64
PL      211 non-null float64
PT      211 non-null float64
RO      211 non-null float64
RS      211 non-null float64
SE      211 non-null float64
SI      211 non-null float64
SK      211 non-null float64
TR      211 non-null float64
UK      211 non-null float64
dtypes: float64(34)
memory usage: 57.7 KB
None
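As a side note, the two dropping steps could also be combined into a single chained call. A minimal sketch, assuming the pickle file is loaded freshly:
# Drop the two sparse country columns, then drop any row that still contains a NaN:
df = pd.read_pickle('EU_industry_production_dataframe_forregression.pkl')
df = df.drop(['BA', 'ME'], axis=1).dropna()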
Reshaping the dataset for scikit-learn
Next, I import the linear_model module from scikit-learn:
from sklearn import linear_model
Before actually performing the linear regression, I need to bring the dataset into a form that scikit-learn recognizes. The fit method takes two parameters: the input X and the output y. X contains the independent variables, of which there is just one in this case, namely the time. y is the outcome, that is, the normalized industry production index.
The input X has to be a 2D array and is the same for all countries, so I assign it right away:
X = df.index.values # Get numpy array of index (time stamps in years)
print(X.shape)
X = X.reshape(-1, 1) # Transform numpy array from 1D to 2D
print(X.shape)
(211,)
(211, 1)
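As a purely stylistic alternative, the extra axis can also be added via None (i.e. numpy.newaxis) indexing instead of reshape, with the same result:
X = df.index.values[:, None]  # Equivalent to reshape(-1, 1): yields shape (211, 1)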
The output y is different for each country, so I will loop over all countries and perform a separate linear regression for each one.
I can get the country codes as a list by extracting all the column names from the dataframe:
country_list = df.columns.values # Get list of all country codes
print(country_list)
['AT' 'BE' 'BG' 'CY' 'CZ' 'DE' 'DK' 'EA19' 'EE' 'EL' 'ES' 'EU28' 'FI' 'FR' 'HR' 'HU' 'IE' 'IT' 'LT' 'LU' 'LV' 'MK' 'MT' 'NL' 'NO' 'PL' 'PT' 'RO' 'RS' 'SE' 'SI' 'SK' 'TR' 'UK']
Performing the linear regression
Now I loop over all countries.
In each step, I assign the normalized industry production values to y, perform the linear regression, and store the slope and intercept values in numpy arrays:
import numpy as np
slope = np.empty(len(country_list)) # Create empty array for slopes from linear regression
intercept = np.empty(len(country_list)) # Create empty array for intercept from linear regression
# Loop over all countries:
for i, country in enumerate(country_list):
    y = df[country].values                  # Get industry production index values for selected country
    regr = linear_model.LinearRegression()  # Instantiate linear regression object
    regr.fit(X, y)                          # Perform linear regression (fit straight line)
    slope[i] = regr.coef_[0]                # Store slope in array (first and only coefficient)
    intercept[i] = regr.intercept_          # Store intercept in array
Let’s have a look at the parameter values that I obtained from the linear regression:
print(slope)
print(intercept)
[ 1.69755062e-02  2.27397515e-02  2.33266960e-02 -2.73036128e-02
  3.01791696e-02  1.10648741e-02 -2.54430214e-03 -2.06856083e-03
  4.36360429e-02 -3.08404991e-02 -2.61402287e-02  3.36143245e-05
 -3.74731167e-03 -1.69121515e-02  8.05852423e-04  2.80566577e-02
  3.92839139e-02 -2.34791885e-02  3.66338076e-02 -6.37034263e-03
  2.77396054e-02  1.52164653e-02 -8.32050455e-03  5.89524088e-03
  6.97164004e-03  4.64402255e-02 -2.25316217e-02  3.60324141e-02
 -6.70048827e-04 -9.21470009e-03  9.93230550e-03  5.91275559e-02
  3.89830067e-02 -7.94040227e-03]
[ 0.97951101  0.93506694  1.00117521  0.90635968  0.98434731  0.98850886
  1.07562782  1.00401122  1.06813739  1.01401165  1.03302358  1.00149184
  0.97096251  1.03319856  0.95866234  0.98838774  1.04417995  1.00206516
  1.02512619  0.99293733  1.06671794  1.07534489  0.9701639   0.96239976
  0.96680512  0.92858528  1.04249824  1.02776252  1.01763429  0.974905
  0.97854261  0.99558271  0.98038008  1.00006333]
The slope values range between plus/minus a few percent per year, which appears sensible (judging from the plots in the previous project).
It also makes sense that the intercept values are close to one, which is the reference value for zero time (the year 2010).
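Besides eyeballing the parameter values, one could also quantify how well the straight lines describe each time series, for example by recording the coefficient of determination R² inside the loop. The following is a sketch of such an extra check, not part of the original analysis:
r_squared = np.empty(len(country_list))  # Coefficient of determination for each country

for i, country in enumerate(country_list):
    y = df[country].values
    regr = linear_model.LinearRegression()
    regr.fit(X, y)
    r_squared[i] = regr.score(X, y)  # R^2 of the fitted line on the data

print(r_squared.round(2))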
An alternative measure for the slope
As a simple check of the robustness of the slope values, I also calculate the slope with an alternative method: the difference between the last and the first production index values of the time series, divided by the length of the time series in years.
To reduce the influence of short-term fluctuations, I average over the first and last five values instead of taking just one value each:
slope_alt = np.empty(len(country_list)) # Create empty array for slopes from difference end of time series minus beginning
# Loop over all countries:
for i, country in enumerate(country_list):
    y = df[country].values  # Get industry production index values for selected country
    slope_alt[i] = (y[-5:].mean() - y[:5].mean()) / (X[-1, 0] - X[0, 0])  # Store slope in array
Let’s also print the alternative slope values:
print(slope_alt)
[ 0.01540618  0.0203894   0.02858958 -0.01814415  0.03178366  0.00758084
  0.00022828 -0.00215899  0.04310238 -0.02575114 -0.02014504 -0.00021196
 -0.0016687  -0.01616444  0.00645503  0.0299987   0.04527303 -0.02057385
  0.03709763 -0.00015542  0.03415784  0.00874983 -0.0112554   0.0037613
 -0.00294682  0.04093304 -0.01623621  0.03449866  0.00147644 -0.00471612
  0.01575481  0.05198007  0.03306995 -0.01023881]
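To quantify how close the two estimates are, one could for instance look at their largest absolute deviation and their correlation. This is just a quick sanity check beyond the original notebook:
print(np.abs(slope - slope_alt).max())      # Largest deviation between the two slope estimates
print(np.corrcoef(slope, slope_alt)[0, 1])  # Correlation between the two estimates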
Storing the slopes in a new dataframe
Let’s now create a new dataframe containing the two slope estimates and the intercept from the linear regression:
# Put data in dictionary:
slope_dict = {'country_code':country_list, 'slope':slope, 'intercept':intercept, 'slope_alt':slope_alt}
# Construct dataframe from dictionary:
df_slopes = pd.DataFrame(slope_dict)
# Choose country_code column as index:
df_slopes.set_index('country_code', inplace=True)
print(df_slopes.info())
print(df_slopes.head())
<class 'pandas.core.frame.DataFrame'>
Index: 34 entries, AT to UK
Data columns (total 3 columns):
intercept    34 non-null float64
slope        34 non-null float64
slope_alt    34 non-null float64
dtypes: float64(3)
memory usage: 1.1+ KB
None
              intercept     slope  slope_alt
country_code
AT             0.979511  0.016976   0.015406
BE             0.935067  0.022740   0.020389
BG             1.001175  0.023327   0.028590
CY             0.906360 -0.027304  -0.018144
CZ             0.984347  0.030179   0.031784
This dataframe contains the slopes for each country. I will visually explore and analyze them in the next project.
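As a small usage example of this dataframe, the countries with the strongest upward and downward trends could be previewed right away (again, just a sketch beyond the original notebook):
print(df_slopes.sort_values('slope', ascending=False).head(3))  # Strongest growth
print(df_slopes.sort_values('slope').head(3))                   # Strongest decline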
Conclusion
I have prepared the normalized EU industry production index dataset as a set of NaN-free numpy arrays that can be used as input for the linear regression method of the scikit-learn package. Looping over all countries, I obtained both slope and intercept of the fit lines and stored them in numpy arrays. As an alternative way of obtaining the slope (which will be useful for assessing the robustness of the slopes), I have also computed the difference of the normalized production index between the end and the beginning of the time series.
I have put the slopes and intercepts, together with the country codes, into a new dataframe that I will use for visualization and further analysis.
Note that I did not strictly have to throw away the (shorter) time series for Bosnia and Herzegovina and Montenegro. Dropping them was the simplest measure here and did not cost me much, but other datasets might contain many more missing values. To avoid throwing away data with NaN values, different strategies can be employed, e.g. imputing single NaN entries with the mean or with the following value. In the case of the two mentioned countries in the EU industry production dataset, I could have restricted the linear regression to the shorter (NaN-free) time period to obtain slope and intercept values.
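As a sketch of that last option (assuming the original pickle file is re-read so that the dropped columns are available again; df_full, X_short, and series are just illustrative names), the regression could be restricted to the NaN-free part of each short time series:
df_full = pd.read_pickle('EU_industry_production_dataframe_forregression.pkl')  # Re-read full dataset

for country in ['BA', 'ME']:
    series = df_full[country].dropna()            # Keep only the NaN-free part of the time series
    X_short = series.index.values.reshape(-1, 1)  # Time stamps as 2D array
    regr = linear_model.LinearRegression()
    regr.fit(X_short, series.values)
    print(country, regr.coef_[0], regr.intercept_)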
Code
The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.3 kernel and Anaconda 5.0.1.
The Jupyter notebook can be found on Github.
Bio
I am a data scientist with a background in solar physics and long experience in turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.
Contact details
Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/