Using the linear model from Python’s scikit-learn package, I obtain the slopes in the EU industry production time series for each country.

I prepare the normalized EU industry production index dataset for the fit routine of the scikit-learn linear model by converting the time stamps into a 2D numpy array and the production indices into one 1D array per country. For each country, I perform the linear regression with the fit method and store both the intercept and slope parameters of the fit lines in 1D numpy arrays. As a cross-check, I also estimate the slope by subtracting the production index at the beginning of the time series from its value at the end and dividing by the time span. I store both slope estimates and the intercept together with the country codes in a new dataframe, which simplifies further analysis.

Project Background

After having defined the model to describe the EU industry production index time series in the previous project, I now fit the model with the help of the scikit-learn package.

While the package offers many different generalized linear models, I opt to remain with standard linear regression, given the simplicity of the problem with only one independent variable.

Data preparation

Inspecting the data

I start by reading in the dataframe that I previously prepared for this project:

import pandas as pd
df = pd.read_pickle('EU_industry_production_dataframe_forregression.pkl')

Note that Bosnia and Herzegovina ('BA') and Montenegro ('ME') have many missing values, since their time series do not start in 2000, but later.

Dropping NaN values

The fitting routine for the linear model in scikit-learn does not accept NaN values though, so I need to account for that. Because only two countries are affected (which are not even in the EU), I simply drop them:

df.drop(['BA','ME'], axis=1, inplace=True)

In addition, all the other time series have only 211 non-null entries out of 212 values in total. These single NaN entries are the latest data point for each country. I drop them as well and get nice NaN-free time series:

df.drop(df.index[-1], inplace=True)
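
Before dropping anything, it can help to count the missing values per column. The following is a minimal sketch on a made-up toy dataframe; the column names, values and the variable names toy, nan_counts and clean are hypothetical and only illustrate the two drop steps used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three "countries"; BB starts late,
# and every series is missing its latest data point
toy = pd.DataFrame(
    {"AA": [0.9, 1.0, 1.1, np.nan],
     "BB": [np.nan, np.nan, 1.0, np.nan],
     "CC": [1.2, 1.1, 1.0, np.nan]},
    index=[2000.0, 2000.5, 2001.0, 2001.5],
)

# Count missing values per column to see which series are affected
nan_counts = toy.isna().sum()

# Drop the incomplete column first, then the trailing all-NaN row
clean = toy.drop(columns=["BB"]).drop(toy.index[-1])

print(nan_counts)
print(clean)
```

This version returns a new dataframe instead of using inplace=True, so the original data stays available for comparison.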

Reshaping the dataset for scikit-learn

Next, I import the linear_model module from scikit-learn:

from sklearn import linear_model

Before actually performing the linear regression, I need to bring the dataset into a form that is recognized by scikit-learn. The fitting method takes two parameters: the input X and the output y.

X holds the independent variables, of which there is just one in this case, namely the time.

y is the outcome, that is the normalized industry production index.

The input X has to be a 2D array and is the same for all countries, so I assign it right away:

X = df.index.values            # Get numpy array of index (time stamps in years)

X = X.reshape(-1, 1)           # Transform numpy array from 1D to 2D
print(X.shape)                 # Check the resulting shape
(211, 1)
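
To make the reshaping step concrete, here is a small sketch with three made-up time stamps (the names t and X_demo are hypothetical). The -1 tells numpy to infer the number of rows, and the 1 fixes a single column, which is the (n_samples, n_features) layout that scikit-learn expects.

```python
import numpy as np

t = np.array([2000.0, 2000.5, 2001.0])   # hypothetical time stamps, shape (3,)
X_demo = t.reshape(-1, 1)                # one row per sample, one column per feature

print(t.shape)        # shape of the 1D array
print(X_demo.shape)   # shape of the 2D array
```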

The output y is different for each country, so I will loop over all countries and perform a separate linear regression for each one.

I can get the country codes as a list by extracting all the column names from the dataframe:

country_list = df.columns.values   # Get list of all country codes

Performing the linear regression

Now I loop over all countries.

In each step, I assign the normalized industry production values to y, perform the linear regression, and store the slope and intercept values in numpy arrays:

import numpy as np

slope = np.empty(len(country_list))         # Create empty array for slopes from linear regression
intercept = np.empty(len(country_list))     # Create empty array for intercept from linear regression

# Loop over all countries:
for i, country in enumerate(country_list):
    y = df[country].values                  # Get industry production index values for selected country
    regr = linear_model.LinearRegression()  # Instantiate linear regression object
    regr.fit(X, y)                          # Perform linear regression (fit straight line)
    slope[i] = regr.coef_[0]                # Store slope in array (first and only coefficient)
    intercept[i] = regr.intercept_          # Store intercept in array
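
As a possible alternative to the loop, LinearRegression also accepts a 2D output array with one column per target, so all series could be fitted in a single call. The sketch below uses two made-up, exactly linear series (t, X_demo and Y_demo are hypothetical) and checks that the per-column loop and the single multi-target fit give the same slopes.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical data: two exactly linear series sampled twice per year
t = np.arange(-2.0, 2.0, 0.5)
X_demo = t.reshape(-1, 1)
Y_demo = np.column_stack([1.0 + 0.02 * t, 1.0 - 0.01 * t])   # shape (8, 2)

# One fit per column, as in the loop above
loop_slopes = np.array([
    linear_model.LinearRegression().fit(X_demo, Y_demo[:, j]).coef_[0]
    for j in range(Y_demo.shape[1])
])

# Single multi-target fit: coef_ has shape (n_targets, n_features)
multi = linear_model.LinearRegression().fit(X_demo, Y_demo)
multi_slopes = multi.coef_[:, 0]
multi_intercepts = multi.intercept_

print(loop_slopes, multi_slopes, multi_intercepts)
```

With the real data this would correspond to passing df.values as the output, assuming the dataframe is NaN-free. The explicit loop remains easier to read and to extend per country.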

Let’s have a look at the parameter values that I obtained from the linear regression:

The slope values range between plus/minus a few percent per year, which appears sensible (judging from the plots in the previous project).

It also makes sense that the intercept values are close to one, which is the reference value for zero time (the year 2010).

An alternative measure for the slope

As a simple check of the robustness of the slope values, I also calculate the slope using an alternative method, namely the difference between the last and the first production index values of the time series, divided by the length of the time series in years.

To reduce the influence of short-term fluctuations, I average over the first five and the last five values instead of just taking one value each:

slope_alt = np.empty(len(country_list))     # Create empty array for slopes from difference end of time series minus beginning

# Loop over all countries:
for i, country in enumerate(country_list):
    y = df[country].values                  # Get industry production index values for selected country
    slope_alt[i] = (y[-5:].mean() - y[:5].mean())/(X[-1,0] - X[0,0])      # Store slope in array
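
One detail of this estimate is worth illustrating: averaging over five points at each end moves the effective end points inward, so dividing by the full time span slightly underestimates the slope. The sketch below uses a made-up, exactly linear series (t, y_demo and k are hypothetical) and compares the formula from the text with a variant that divides by the distance between the two window centres. This variant is only one possible refinement, not part of the original analysis.

```python
import numpy as np

t = np.arange(0.0, 10.0, 0.5)    # hypothetical time stamps in years (20 points)
y_demo = 1.0 + 0.03 * t          # exactly linear series with slope 0.03

k = 5                            # number of points averaged at each end
rise = y_demo[-k:].mean() - y_demo[:k].mean()

# As in the text: divide by the full time span
slope_full_span = rise / (t[-1] - t[0])

# Variant: divide by the distance between the centres of the two windows
slope_mean_span = rise / (t[-k:].mean() - t[:k].mean())

print(slope_full_span, slope_mean_span)
```

For a long, densely sampled series the difference is small, because the averaging windows are short compared to the full time span.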

Let’s also print the alternative slope values:

Storing the slopes in a new dataframe

Let’s now create a new dataframe containing the two slope estimates and the intercept from the linear regression:

# Put data in dictionary:
slope_dict = {'country_code':country_list, 'slope':slope, 'intercept':intercept, 'slope_alt':slope_alt}

# Construct dataframe from dictionary:
df_slopes = pd.DataFrame(slope_dict)

# Choose country_code column as index:
df_slopes.set_index('country_code', inplace=True)

df_slopes.info()
<class 'pandas.core.frame.DataFrame'>
Index: 34 entries, AT to UK
Data columns (total 3 columns):
intercept    34 non-null float64
slope        34 non-null float64
slope_alt    34 non-null float64
dtypes: float64(3)
memory usage: 1.1+ KB
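
Once the country code is the index, individual countries can be looked up by label and the table can be sorted by slope. A minimal sketch with made-up numbers follows; the codes and values are hypothetical, and the dataframe is built with the index set directly in the constructor as one possible shortcut.

```python
import numpy as np
import pandas as pd

codes = np.array(["AA", "BB", "CC"])        # hypothetical country codes
slopes = np.array([0.02, -0.01, 0.00])      # hypothetical fit results
intercepts = np.array([1.00, 1.01, 0.99])

# Build the dataframe with the country codes as a named index
demo = pd.DataFrame(
    {"slope": slopes, "intercept": intercepts},
    index=pd.Index(codes, name="country_code"),
)

bb_slope = demo.loc["BB", "slope"]                    # look up one country by label
ranked = demo.sort_values("slope", ascending=False)   # steepest growth first

print(bb_slope)
print(ranked)
```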

This dataframe contains the slopes for each country. I will visually explore and analyze them in the next project.


I have prepared the normalized EU industry production index dataset as a set of NaN-free numpy arrays that can be used as input for the linear regression method of the scikit-learn package. Looping over all countries, I obtained both slope and intercept of the fit lines and stored them in numpy arrays. As an alternative way of obtaining the slope (that will be useful for assessing the robustness of the slopes), I have also computed the difference of the normalized production index between the end and the beginning of each time series, divided by the time span.

I have combined the slopes and intercepts with the country codes in a new dataframe that I will use for visualization and further analysis.

Note that I did not have to discard the (shorter) time series for Bosnia and Herzegovina and Montenegro. Dropping them was the simplest measure here and did not cost me much, but other datasets might contain many more missing values. To avoid throwing away data with NaN values, different strategies can be employed, e.g. filling isolated NaN values with the mean or with the following valid value. For the two countries mentioned above, I could instead have restricted the linear regression to the shorter (NaN-free) time period to obtain slope and intercept values.
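
Restricting the fit to the NaN-free period could look roughly like the sketch below. It uses a made-up series whose first values are missing (t, y_demo and mask are hypothetical names) and selects only the valid samples with a boolean mask before fitting.

```python
import numpy as np
from sklearn import linear_model

t = np.arange(0.0, 5.0, 0.5)     # hypothetical time stamps in years
y_demo = 1.0 + 0.04 * t          # exactly linear series with slope 0.04
y_demo[:4] = np.nan              # the series starts late: first four values missing

mask = ~np.isnan(y_demo)         # True where a valid value exists

# Fit only on the valid samples; X must still be 2D
regr_demo = linear_model.LinearRegression()
regr_demo.fit(t[mask].reshape(-1, 1), y_demo[mask])

print(mask.sum(), regr_demo.coef_[0], regr_demo.intercept_)
```

Slopes obtained this way describe a shorter period than those of the other countries, which should be kept in mind when comparing them.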


The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.3 kernel and Anaconda 5.0.1.

The Jupyter notebook can be found on GitHub.


I am a data scientist with a background in solar physics and long experience in turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.

