In this tutorial, you will learn how to calculate the correlation between two or more variables in Python, using my very own Pingouin package.
To install Pingouin, you need to have Python 3 installed on your computer. If you are using a Mac or Windows, I strongly recommend installing Python via the
To install Pingouin, just open a terminal and type the following command:
pip install --upgrade pingouin
Once Pingouin is installed, you can simply load it in a Python script, an IPython console, or JupyterLab:
import pingouin as pg
The correlation coefficient (sometimes referred to as Pearson's correlation coefficient, Pearson's product-moment correlation, or simply r) measures the strength of the linear relationship between two variables. It is indisputably one of the most commonly used metrics in both science and industry. In science, it is typically used to test for a linear association between two dependent variables, or measurements.
In industry, specifically in a machine-learning context, it is used to discover collinearity between features, which may undermine the quality of a model.
The correlation coefficient is directly linked to the beta coefficient in a linear regression (= the slope of a best-fit line), but has the advantage of being standardized between -1 and 1; the former meaning a perfect negative linear relationship, and the latter a perfect positive linear relationship. In other words, no matter what the original units of the two variables are, the correlation coefficient will always be in the range of -1 to 1, which makes it very easy to work with.
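To make the link between the slope and the correlation coefficient concrete, here is a minimal sketch in plain Python. The height and weight values are made up for illustration, and the helper functions (`centered_sums`, `pearson_r`, `slope`) are hypothetical names that are not part of Pingouin. It shows that changing the units changes the slope but not r, and that rescaling the slope by the ratio of the two standard deviations gives back r.

```python
import math


def centered_sums(x, y):
    """Return the sums of squares and cross-products around the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    sxx, syy, sxy = centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def slope(x, y):
    """Least-squares slope of the line predicting y from x."""
    sxx, _, sxy = centered_sums(x, y)
    return sxy / sxx


# Hypothetical values, for illustration only
height_cm = [150, 160, 165, 172, 180, 185]
weight_kg = [52, 58, 66, 63, 80, 78]

# The same measurements expressed in other units
height_m = [h / 100 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]

r_metric = pearson_r(height_cm, weight_kg)
r_other = pearson_r(height_m, weight_lb)
beta_metric = slope(height_cm, weight_kg)
beta_other = slope(height_m, weight_lb)

# Rescaling the slope by sd_x / sd_y recovers the correlation coefficient
sxx, syy, _ = centered_sums(height_cm, weight_kg)
r_from_slope = beta_metric * math.sqrt(sxx / syy)

print(f"r: {r_metric:.3f} (cm, kg) vs {r_other:.3f} (m, lb)")
print(f"slope: {beta_metric:.3f} kg/cm vs {beta_other:.1f} lb/m")
print(f"slope rescaled by sd_x / sd_y: {r_from_slope:.3f}")
```

The two r values are identical, whereas the two slopes differ by the unit conversion factors.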
Finally, the correlation coefficient can be used to do hypothesis testing, in which case a so-called correlation test will return not only the correlation coefficient (the r value) but also the p-value, which, in short, quantifies the statistical significance of the test. For more details on this, I recommend reading the excellent book “Statistical Thinking for the 21st Century” by Stanford's Professor Russ Poldrack.
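To build some intuition for what the p-value of a correlation test means, here is a rough sketch based on a permutation test. Please note that this is a swapped-in technique chosen because it only needs the standard library; it is not how the p-values shown later in this tutorial are computed. The data and the function names (`pearson_r`, `permutation_pvalue`) are hypothetical. The idea is to shuffle one variable many times, which breaks any association, and to count how often the shuffled data yield a correlation at least as strong as the observed one.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the hypothesis of no correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # breaks any pairing between x and y
        if abs(pearson_r(x, y_shuffled)) >= observed:
            n_extreme += 1
    # Add one to both counts so that the p-value is never exactly zero
    return (n_extreme + 1) / (n_perm + 1)


# Hypothetical, strongly related measurements
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.3, 13.8, 16.1, 18.2, 19.9, 22.1, 24.0]

p_value = permutation_pvalue(x, y)
print(f"r = {pearson_r(x, y):.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value here means that shuffled data almost never produce a correlation as strong as the observed one.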
Load the data
For the sake of the example, I generated a fake dataset that comprises the results of personality tests of 200 individuals, together with their age, height, weight and IQ. Please note that these data are randomly generated and not representative of real individuals. The personality dimensions follow the Big Five model (or OCEAN), and are typically measured on a 1 to 5 scale:
- Openness to experience (inventive/curious vs. consistent/cautious)
- Conscientiousness (efficient/organized vs. easy-going/careless)
- Extraversion (outgoing/energetic vs. solitary/reserved)
- Agreeableness (friendly/compassionate vs. challenging/detached)
- Neuroticism (sensitive/nervous vs. secure/confident)
import pandas as pd

df = pd.read_csv('data_corr.csv')
print('%i subjects and %i columns' % df.shape)
df.head()
Simple correlation between two columns
First, let's start by calculating the correlation between two columns of our dataframe. For instance, let's calculate the correlation between height and weight. Well, this is definitely not the most exciting research idea, but it is certainly one of the most intuitive to understand! For the sake of statistical testing, our hypothesis here will be that weight and height are indeed correlated. In other words, we expect that the taller someone is, the larger their weight is, and vice versa.
This can be done simply by calling the pingouin.corr function:
import pingouin as pg

pg.corr(x=df['Height'], y=df['Weight'])
Let's take a moment to analyze the output of this function:
- n is the sample size, i.e. how many observations were included in the calculation of the correlation coefficient
- r is the correlation coefficient, 0.45 in this case, which is quite high.
- CI95% is the 95% confidence interval around the correlation coefficient
- r2 and adj_r2 are the r-squared and adjusted r-squared, respectively. As its name implies, r-squared is simply the squared r, which is a measure of the proportion of the variance in the first variable that is predictable from the second variable.
- p-val is the p-value of the test. The general rule is that you can reject the hypothesis that the two variables are not correlated if the p-value is below 0.05, which is the case here. We can therefore say that there is a significant correlation between the two variables.
- BF10 is the Bayes Factor of the test, which is a Bayesian alternative to the p-value. It directly measures the strength of evidence in favor of our initial hypothesis that weight and height are correlated. Since this value is very large, it indicates that there is very strong evidence that the two variables are indeed correlated. While they are conceptually different, the Bayes Factor and the p-value will in practice often lead to the same conclusion.
- power is the achieved power of the test, which is the probability that we will detect an effect when there is indeed an effect to be detected. The higher this value is, the more robust our test is. In this case, a value of 1 means that we can be highly confident in our ability to detect the effect.
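To give a feel for where some of these numbers come from, here is a minimal sketch that derives r-squared and an approximate 95% confidence interval from r and n alone. It assumes n = 200 (the number of individuals in the dataset), uses the rounded r = 0.45 reported above, and relies on the classical Fisher z-transform. The function name `fisher_ci95` is hypothetical, and the values returned by Pingouin may differ slightly in the last decimals.

```python
import math


def fisher_ci95(r, n):
    """Approximate 95% confidence interval of r using the Fisher z-transform."""
    z = math.atanh(r)            # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    z_crit = 1.959964            # 97.5th percentile of the standard normal
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Assumed values: rounded r from the output above, n = 200 individuals
r, n = 0.45, 200

ci_low, ci_high = fisher_ci95(r, n)
r2 = r ** 2  # proportion of shared variance

print(f"r2 = {r2:.4f}")
print(f"CI95% = [{ci_low:.2f}, {ci_high:.2f}]")
```

Note that the interval is not symmetric around r, because the back-transformation from the z scale compresses values as they approach -1 or 1.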
How does the correlation look visually? Let's plot this correlation using the Seaborn package:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='white', font_scale=1.2)
# Scatter plot with a regression line, plus marginal histograms
g = sns.JointGrid(data=df, x='Height', y='Weight', xlim=(140, 190), ylim=(40, 100), height=5)
g = g.plot_joint(sns.regplot, color="xkcd:muted blue")
# sns.histplot replaces sns.distplot, which is deprecated in recent Seaborn releases
g = g.plot_marginals(sns.histplot, kde=False, bins=12, color="xkcd:bluey grey")
g.ax_joint.text(145, 95, 'r = 0.45, p < .001', fontstyle='italic')
plt.tight_layout()
Pairwise correlation between several columns at once
What about the correlation between all the other columns in our dataframe? It would be a bit tedious to manually calculate the correlation between each pair of columns in our dataframe (= pairwise correlation). Fortunately, Pingouin has a very convenient pairwise_corr function:
pg.pairwise_corr(df).sort_values(by=['p-unc'])[['X', 'Y', 'n', 'r', 'p-unc']].head()
For simplicity, we only display the most important columns and the most significant correlations. This is done by sorting the output table on the p-value column (lowest p-value first) and using .head() to only display the first rows of the sorted table.
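Conceptually, a pairwise correlation is nothing more than a loop over all the unique pairs of columns. The sketch below illustrates this in plain Python on a tiny made-up dataset; it is not Pingouin's implementation, it does not compute p-values, and the data and the `pearson_r` helper are hypothetical. With k columns there are k(k-1)/2 unique pairs, which is why doing this by hand quickly becomes tedious.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mini-dataset: one list of values per column
data = {
    'Height': [150, 160, 165, 172, 180, 185],
    'Weight': [52, 58, 66, 63, 80, 78],
    'IQ': [100, 95, 110, 102, 98, 105],
}

# One row (X, Y, n, r) per unique pair of columns
rows = []
for col_x, col_y in combinations(data, 2):
    r = pearson_r(data[col_x], data[col_y])
    rows.append((col_x, col_y, len(data[col_x]), r))

# Strongest correlations first (sorted on |r| here, since no p-value is computed)
rows.sort(key=lambda row: abs(row[3]), reverse=True)

for col_x, col_y, n, r in rows:
    print(f"{col_x:>6} vs {col_y:<6} n={n} r={r:.3f}")
```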
The pairwise_corr function is very flexible and has several optional arguments. To illustrate that, the code below shows how to calculate the non-parametric Spearman correlation coefficient (which is more robust to outliers in the data) on a subset of columns:
# Calculate the pairwise Spearman correlation
corr = pg.pairwise_corr(df, columns=['O', 'C', 'E', 'A', 'N'], method='spearman')
# Sort the correlation by p-values and display the first rows
corr.sort_values(by=['p-unc'])[['X', 'Y', 'n', 'r', 'p-unc']].head()
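Why is the Spearman coefficient more robust to outliers? Because it is simply the Pearson correlation computed on the ranks of the values rather than on the values themselves, so an extreme value can only move by a few ranks at most. The sketch below illustrates this in plain Python with made-up data; the helper names (`pearson_r`, `ranks`, `spearman_r`) are hypothetical and this is not Pingouin's code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block while the next sorted value is tied with this one
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data, then the same data with one extreme value
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 3, 5, 4, 6, 7, 9, 8]
y_outlier = y[:-1] + [100]

print(f"Pearson:  {pearson_r(x, y):.3f} -> {pearson_r(x, y_outlier):.3f}")
print(f"Spearman: {spearman_r(x, y):.3f} -> {spearman_r(x, y_outlier):.3f}")
```

The single outlier pulls the Pearson coefficient down substantially, while the Spearman coefficient barely changes.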
One can also easily calculate a one-vs-all correlation, as illustrated in the code below.
# Calculate the Pearson correlation between IQ and the personality dimensions
corr = pg.pairwise_corr(df, columns=[['IQ'], ['O', 'C', 'E', 'A', 'N']], method='pearson')
corr.sort_values(by=['p-unc'])[['X', 'Y', 'n', 'r', 'p-unc']].head()
There are many more facets to the pairwise_corr function, and this was just a short introduction. If you want to learn more, please refer to Pingouin's documentation and the example Jupyter notebook on Pingouin's GitHub page.
As the number of columns increases, it can become really hard to read and interpret the output of the pairwise_corr function. A better alternative is to calculate, and optionally plot, a correlation matrix. This can be done using Pandas and Seaborn:
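import numpy as np

corrs = df.corr()
# Mask the upper triangle (diagonal included) so that each pair is shown only once
mask = np.zeros_like(corrs, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corrs, cmap='Spectral_r', mask=mask, square=True, vmin=-.4, vmax=.4)
plt.title('Correlation matrix')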
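If the masking step looks a bit cryptic, the short sketch below shows what it does on a small 3 x 3 matrix. The correlation values are made up for illustration. Since a correlation matrix is symmetric, the upper triangle and the diagonal carry no extra information, and hiding them makes the heatmap easier to read.

```python
import numpy as np

# Hypothetical 3 x 3 correlation matrix
corrs = np.array([
    [1.00, 0.45, 0.10],
    [0.45, 1.00, -0.20],
    [0.10, -0.20, 1.00],
])

# True marks the cells to hide: the upper triangle, diagonal included
mask = np.zeros_like(corrs, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Replace the hidden cells with NaN to see what remains visible
visible = np.where(mask, np.nan, corrs)

print(mask)
print(visible)
```

Only the three values below the diagonal remain, one per unique pair of variables.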
The only issue with these functions, however, is that they do not return the p-values, but only the correlation coefficients. Here again, Pingouin has a very convenient method, rcorr, that shows a similar correlation matrix with the r-values on the lower triangle and the p-values on the upper triangle. For instance, focusing on a subset of columns, and highlighting the significant correlations with stars:
df[['O', 'C', 'E', 'A', 'N']].rcorr()
Voilà! I hope that this tutorial was helpful. Please contact me if you have any questions or comments. And remember that we've only scratched the surface of the correlation-based functionalities that Pingouin offers. More advanced functions include partial correlation (controlling for one or more covariates), robust correlations, and adjustment of p-values for multiple comparisons. If you are interested, make sure to have a look at the API documentation of Pingouin.