How To Calculate Correlation Coefficient in SQL
By Max Werner
On Jan 28, 2025
Lots of times, when we build out customer 360 profile tables, business users are interested in learning which traits drive certain business outcomes like purchasing, churning, or defaulting on payments. There's no shortage of fancy AI companies that promise to uncover hidden relationships, but for most use cases I have encountered, that is plain and simple overkill. In this post I will show you how to easily calculate the correlation coefficient between two variables with plain old SQL.
The Tech Used
- Tuberculosis Burden by Country sample dataset
- ClickHouse Data Warehouse
While I am using ClickHouse as the data warehouse backend in this example, this will work in most databases like Redshift, Snowflake, BigQuery, etc. Some of the function names may vary, but that's about it. For example, ClickHouse's function to calculate the standard deviation of a sample is called `stddevSamp`, while AWS Redshift calls it `stddev_samp`.
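For instance, here's what computing the sample standard deviation of a hypothetical `amount` column in a hypothetical `orders` table would look like in each dialect:
-- ClickHouse: camelCase aggregate name
select stddevSamp(amount) from orders;
-- Redshift: snake_case
select stddev_samp(amount) from orders;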
What is a Correlation Coefficient
Much smarter people than I have written in-depth explanations of the what and how behind correlation, and this is not a statistics tutorial, so for our purposes we'll just define correlation as:
- A decimal number from -1 to +1 that expresses how strongly an increase in one variable goes along with an increase (or decrease) in another
- Correlation doesn't equal causation. Just because two variables are correlated doesn't mean one causes the other.
For more details feel free to check out Wikipedia's page on the Pearson correlation coefficient.
This is the formula we'll be using to calculate the correlation coefficient for two variables from a sample:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \; \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

The bars above $X$ and $Y$ refer to the averages of $X$ and $Y$.
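To make the formula concrete, here is a tiny worked example with made-up numbers (not from the dataset below): take $X = (1, 2, 3)$ and $Y = (2, 4, 6)$, so $\bar{X} = 2$ and $\bar{Y} = 4$. The numerator is $(-1)(-2) + (0)(0) + (1)(2) = 4$, the denominator is $\sqrt{2} \cdot \sqrt{8} = 4$, and so $r = 1$: a perfect positive correlation, which makes sense because $Y$ is exactly $2X$.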
Step 1 - Calculating Averages / Means
In this example we'll be calculating the correlation coefficient between the "Estimated Prevalence of TB in all forms" and the "Estimated Total Population" from our sample dataset, which I imported into the `default` schema and `tb_burden_country` table. I didn't change anything in the data, just cleaned up the column names to make them more SQL friendly (e.g. `Estimated prevalence of TB (all forms)` becomes `estimated_prevalence_of_tb_all_forms`).
select
avg(t.estimated_total_population_number) as avg_val,
avg(t.estimated_prevalence_of_tb_all_forms) as avg_goal
from `default`.tb_burden_country t
To make things easier going forward, I will be referring to the TB number as our "goal" and the population number as our "value".
This returns `29156711.61484375` as the `avg_val` and `66543.2783203125` as the `avg_goal`.
Step 2 - Calculating Xi and Yi Differences and Squared Differences
The formula requires us to calculate the difference (and squared difference) between each value and its average. This is easy enough to do with a cross join!
with
avg_mean as (
select
avg(t.estimated_total_population_number) as avg_val,
avg(t.estimated_prevalence_of_tb_all_forms) as avg_goal
from `default`.tb_burden_country t
)
select
(t.estimated_total_population_number-avg_val) as val_dist,
pow(t.estimated_total_population_number-avg_val,2) as val_dist_sq,
(t.estimated_prevalence_of_tb_all_forms-avg_goal) as goal_dist,
pow(t.estimated_prevalence_of_tb_all_forms-avg_goal,2) as goal_dist_sq
from `default`.tb_burden_country t, avg_mean a
I'll skip the output here, but you can see how easy it is to calculate the (squared) differences between values and their means.
Step 3 - Putting it all Together
Now that we have this all prepped, we can do the last step of the formula, which consists of dividing the sum-product of `val_dist` and `goal_dist` by the product of the square roots of the sums of squared distances:
with
avg_mean as (
select
avg(t.estimated_total_population_number) as avg_val,
avg(t.estimated_prevalence_of_tb_all_forms) as avg_goal
from `default`.tb_burden_country t
)
, math_prep as (
select
(t.estimated_total_population_number-avg_val) as val_dist,
pow(t.estimated_total_population_number-avg_val,2) as val_dist_sq,
(t.estimated_prevalence_of_tb_all_forms-avg_goal) as goal_dist,
pow(t.estimated_prevalence_of_tb_all_forms-avg_goal,2) as goal_dist_sq
from `default`.tb_burden_country t, avg_mean a
)
select
(
sum(val_dist*goal_dist)
/
(
SQRT(sum(val_dist_sq))
*
SQRT(sum(goal_dist_sq))
)
) as val_goal_correl
from math_prep
The output of this is quite simple: one column called `val_goal_correl` with one row that contains `0.8446697479811314` as the value.
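As a quick sanity check (not needed for the walkthrough), most warehouses also ship a built-in Pearson correlation aggregate you can compare the hand-rolled result against; ClickHouse calls it `corr`, and Redshift, Snowflake, and BigQuery spell it `CORR`. It should return the same value, give or take floating-point noise:
-- built-in aggregate, for comparison with the manual calculation
select corr(
    t.estimated_total_population_number,
    t.estimated_prevalence_of_tb_all_forms
) as val_goal_correl
from `default`.tb_burden_country t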
Conclusions, Notes, and Caveats
So this is all that's needed to calculate the correlation between two variables in plain SQL. Of course, this was just to illustrate the process. `0.8446697479811314` is a completely meaningless number here, since I more or less just picked two numeric columns from a public, pre-aggregated sample dataset.
Variable Data Types
Both variables have to be floating point numbers for this to work. If a column you want to use is an integer column, make sure that your average calculation comes out as a decimal; casting is your friend here. This is a SQL thing, not a stats thing: since we're calculating averages, if your SQL environment of choice calculates the average of 1, 5, and 34 as `13` instead of `13.3333333`, you will lose a ton of accuracy in your final calculation.
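As a minimal sketch (the `some_int_col` and `some_table` names are hypothetical), casting before averaging looks like this in ClickHouse; ANSI-flavored warehouses would use `cast(... as double precision)` instead of `Float64`:
select
    -- cast the integer column to a float so the average keeps its
    -- fractional part in dialects that would otherwise truncate it
    avg(cast(t.some_int_col as Float64)) as avg_val
from `default`.some_table t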
Categorical Data
You can use categorical data such as booleans or countries, but it has to be transformed into floats. For booleans this is easy: `false` becomes `0.0` and `true` becomes `1.0`. Categories like country, marital status, gender, etc. have to be split into n-1 columns. For example, if you have 5 countries in your countries list, you have to turn them into 4 columns (`is_country_a`, `is_country_b`, `is_country_c`, `is_country_d`), each `0.0` or `1.0` for whether or not your source column matches that value (there's a sketch of this below). You can skip the 5th column, because if columns 1 through 4 are 0, it has to be country E by default (not accounting for null values here). There are proper ways to test categorical data, such as the Chi-Squared test, but that's out of scope for this post.
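A minimal sketch of the country split described above, using a hypothetical `country` column (plain CASE expressions, so it ports across warehouses):
select
    -- 5 countries become 4 dummy float columns; country E is
    -- implied when all four are 0.0
    case when t.country = 'Country A' then 1.0 else 0.0 end as is_country_a,
    case when t.country = 'Country B' then 1.0 else 0.0 end as is_country_b,
    case when t.country = 'Country C' then 1.0 else 0.0 end as is_country_c,
    case when t.country = 'Country D' then 1.0 else 0.0 end as is_country_d
from `default`.some_table t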
So Why Bother With All This?
There are quite a few benefits to doing this yourself in your data warehouse instead of relying on a third-party service, or "AI" to get insight into your data and its relationships:
- You own it -- If you don't send your data to service A or B, it stays in your control, and nobody can use it to train their own machine learning models or AI snake oil, or sell it without you even knowing about it!
- Cost -- Sure, you need a basic understanding of statistics to translate these formulas into SQL, plus a bit of extra prep work on your data sets, but that is an investment instead of an ongoing monthly bill for some service.
- Speed -- Data warehouses are really good at doing these kinds of calculations. The TB sample data set is only 5120 rows, but the above query executed on my local computer in 0.03 seconds.
- Proven Technology -- The Pearson correlation coefficient formula was published in 1844 and is still around today. It works well, and you can look under the hood of your math and logic to see what might be influencing things. Can you get third-party services to show their work, or do you just have to take them at their word?
Afterthoughts / Bonus Content
You can take the above SQL and turn it into a dbt macro very easily for reuse across your datasets.
{% macro correlation(schema, table, col_a, col_b) %}
with
avg_mean as (
select
avg({{col_a}}) as avg_val,
avg({{col_b}}) as avg_goal
from {{schema}}.{{table}} t
)
, math_prep as (
select
({{col_a}}-avg_val) as val_dist,
pow({{col_a}}-avg_val,2) as val_dist_sq,
({{col_b}}-avg_goal) as goal_dist,
pow({{col_b}}-avg_goal,2) as goal_dist_sq
from {{schema}}.{{table}} t, avg_mean a
)
select
(
sum(val_dist*goal_dist)
/
(
SQRT(sum(val_dist_sq))
*
SQRT(sum(goal_dist_sq))
)
) as val_goal_correl
from math_prep
{% endmacro %}
Now you can make simple dbt models that just call this macro, like this:
--models/correl_model_a.sql
{{correlation('default', 'some_table', 'some_col', 'some_other_col')}}
and
--models/correl_model_b.sql
{{correlation('default', 'some_table', 'some_col', 'some_different_col')}}
and lastly even output them together like so:
select 'some_col_x_some_other_col' as variables, val_goal_correl from {{ref('correl_model_a')}}
UNION ALL
select 'some_col_x_some_different_col' as variables, val_goal_correl from {{ref('correl_model_b')}}
This produces a final table/view you can plug into a dashboard:
| variables | val_goal_correl |
|-------------------------------|-----------------|
| some_col_x_some_other_col | 0.123 |
| some_col_x_some_different_col | 0.643 |
Enjoy!