How To Calculate Correlation Coefficient in SQL
By Max Werner
On Jan 28, 2025
Lots of times, when we build out customer 360 profile tables, business users are interested in learning which traits drive certain business outcomes like purchasing, churning, or defaulting on payments. There's no shortage of fancy AI companies that promise to uncover hidden relationships, but for most use cases I have encountered, that is plain and simple overkill. In this post I will show you how to easily calculate the correlation coefficient between two variables with plain old SQL.
The Tech Used
- Tuberculosis Burden by Country sample dataset
- ClickHouse Data Warehouse
While I am using ClickHouse as the data warehouse backend in this example, this will work in most databases like Redshift, Snowflake, BigQuery, etc. Some of the function names may vary, but that's about it. For example, ClickHouse's function to calculate the standard deviation of a sample is called `stddevSamp`, while AWS Redshift calls it `stddev_samp`.
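For instance, here's what computing the sample standard deviation of a hypothetical `amount` column in a hypothetical `orders` table would look like in each dialect:
-- ClickHouse: camelCase aggregate name
select stddevSamp(amount) from orders;
-- Redshift: snake_case
select stddev_samp(amount) from orders;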
What is a Correlation Coefficient
Much smarter people than I have written in-depth explanations of the what and how behind correlation, and this is not a statistics tutorial, so for our purposes we'll just define correlation as:
- A decimal number from -1 to +1 that expresses how strongly an increase in one variable goes along with an increase (or decrease) in another
- Correlation doesn't equal causation. Just because two variables are correlated doesn't mean one causes the other.
For more details feel free to check out Wikipedia's page on the Pearson correlation coefficient.
This is the formula we'll be using to calculate the correlation coefficient for two variables from a sample:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \; \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

The bars above $X$ and $Y$ refer to the averages of $X$ and $Y$.
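To make the formula concrete, here is a tiny worked example with made-up numbers (not from the dataset below): take $X = (1, 2, 3)$ and $Y = (2, 4, 6)$, so $\bar{X} = 2$ and $\bar{Y} = 4$. The numerator is $(-1)(-2) + (0)(0) + (1)(2) = 4$, the denominator is $\sqrt{2} \cdot \sqrt{8} = 4$, and so $r = 1$: a perfect positive correlation, which makes sense because $Y$ is exactly $2X$.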
Step 1 - Calculating Averages / Means
In this example we'll be calculating the correlation coefficient between the "Estimated Prevalence of TB in all forms" and the "Estimated Total Population" from our sample dataset, which I imported into the `default` schema and `tb_burden_country` table. I didn't change anything in the data, just cleaned up the column names to make them more SQL friendly (e.g. `Estimated prevalence of TB (all forms)` becomes `estimated_prevalence_of_tb_all_forms`).
select
avg(t.estimated_total_population_number) as avg_val,
avg(t.estimated_prevalence_of_tb_all_forms) as avg_goal
from `default`.tb_burden_country t
To make things easier going forward, I will be referring to the TB number as our "goal" and the population number as our "value".
This returns `29156711.61484375` as the `avg_val` and `66543.2783203125` as the `avg_goal`.
Step 2 - Calculating Xi and Yi Differences and Squared Differences
The formula requires us to calculate the difference (and squared difference) between each value and its average. This is easy enough to do with a cross join!
with
avg_mean as (
select
avg(t.estimated_total_population_number) as avg_val,
avg(t.estimated_prevalence_of_tb_all_forms) as avg_goal
from `default`.tb_burden_country t
)
select
(t.estimated_total_population_number-avg_val) as val_dist,
pow(t.estimated_total_population_number-avg_val,2) as val_dist_sq,
(t.estimated_prevalence_of_tb_all_forms-avg_goal) as goal_dist,
pow(t.estimated_prevalence_of_tb_all_forms-avg_goal,2) as goal_dist_sq
from `default`.tb_burden_country t, avg_mean a
I'll skip the output here, but you can see how easy it is to calculate the (squared) differences between values and their means.
Step 3 - Putting it all Together
Now that we have this all prepped, we can do the last step of the formula, which consists of dividing the sum-product of `val_dist` and `goal_dist` by the product of the square roots of the sums of squared distances:
with
avg_mean as (
select
avg(t.estimated_total_population_number) as avg_val,
avg(t.estimated_prevalence_of_tb_all_forms) as avg_goal
from `default`.tb_burden_country t
)
, math_prep as (
select
(t.estimated_total_population_number-avg_val) as val_dist,
pow(t.estimated_total_population_number-avg_val,2) as val_dist_sq,
(t.estimated_prevalence_of_tb_all_forms-avg_goal) as goal_dist,
pow(t.estimated_prevalence_of_tb_all_forms-avg_goal,2) as goal_dist_sq
from `default`.tb_burden_country t, avg_mean a
)
select
(
sum(val_dist*goal_dist)
/
(
SQRT(sum(val_dist_sq))
*
SQRT(sum(goal_dist_sq))
)
) as val_goal_correl
from math_prep
The output of this is quite simple: one column called `val_goal_correl` with one row that contains `0.8446697479811314` as the value.
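As a quick sanity check (not needed for the walkthrough), most warehouses also ship a built-in Pearson correlation aggregate you can compare the hand-rolled result against; ClickHouse calls it `corr`, and Redshift, Snowflake, and BigQuery spell it `CORR`. It should return the same value, give or take floating-point noise:
-- built-in aggregate, for comparison with the manual calculation
select corr(
    t.estimated_total_population_number,
    t.estimated_prevalence_of_tb_all_forms
) as val_goal_correl
from `default`.tb_burden_country t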
Conclusions, Notes, and Caveats
So this is all that's needed to calculate the correlation between two variables in plain SQL. Of course, this was just to illustrate the process. `0.8446697479811314` is a completely meaningless number here, since I more or less just picked two numeric columns from a public, pre-aggregated sample dataset.
Variable Data Types
Both variables have to be floating point numbers for this to work. If a column you want to use is an integer column, make sure that your average calculation comes out as a decimal; casting is your friend here. This is a SQL thing, not a stats thing: since we're calculating averages, if your SQL environment of choice calculates the average of 1, 5, and 34 as `13` instead of `13.3333333`, you will lose a ton of accuracy in your final calculation.
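As a minimal sketch (the `some_int_col` and `some_table` names are hypothetical), casting before averaging looks like this in ClickHouse; ANSI-flavored warehouses would use `cast(... as double precision)` instead of `Float64`:
select
    -- cast the integer column to a float so the average keeps its
    -- fractional part in dialects that would otherwise truncate it
    avg(cast(t.some_int_col as Float64)) as avg_val
from `default`.some_table t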
Categorical Data
You can use categorical data such as booleans or countries, but it has to be transformed into floats. For booleans this is easy: `false` becomes `0.0` and `true` becomes `1.0`. Categories like country, marital status, gender, etc. have to be split into n-1 columns. For example, if you have 5 countries in your countries list, you have to turn them into 4 columns (`is_country_a`, `is_country_b`, `is_country_c`, `is_country_d`), each `0.0` or `1.0` for whether or not your source column matches that value (there's a sketch of this below). You can skip the 5th column, because if columns 1 through 4 are 0, it has to be country E by default (not accounting for null values here). There are proper ways to test categorical data, such as the Chi-Squared test, but that's out of scope for this post.
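A minimal sketch of the country split described above, using a hypothetical `country` column (plain CASE expressions, so it ports across warehouses):
select
    -- 5 countries become 4 dummy float columns; country E is
    -- implied when all four are 0.0
    case when t.country = 'Country A' then 1.0 else 0.0 end as is_country_a,
    case when t.country = 'Country B' then 1.0 else 0.0 end as is_country_b,
    case when t.country = 'Country C' then 1.0 else 0.0 end as is_country_c,
    case when t.country = 'Country D' then 1.0 else 0.0 end as is_country_d
from `default`.some_table t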
So Why Bother With All This?
There are quite a few benefits to doing this yourself in your data warehouse instead of relying on a third-party service, or "AI" to get insight into your data and its relationships:
- You own it -- If you don't send your data to service A or B, it stays in your control, and nobody can use it to train their own machine learning models or AI snake oil, or sell it without you even knowing about it!
- Cost -- Sure, you need a basic understanding of statistics to translate these formulas into SQL, plus a bit of extra prep work on your data sets, but that is an investment instead of an ongoing monthly bill for some service.
- Speed -- Data warehouses are really good at doing these kinds of calculations. The TB sample data set is only 5120 rows, but the above query executed on my local computer in 0.03 seconds.
- Proven Technology -- The Pearson correlation coefficient formula was published in 1844 and is still around today. It works well, and you can look under the hood of your math and logic to see what might be influencing things. Can you get third-party services to show their work, or do you just have to take them at their word?
Afterthoughts / Bonus Content
You can take the above SQL and turn it into a dbt macro very easily for reuse across your datasets.
{% macro correlation(schema, table, col_a, col_b) %}
with
avg_mean as (
select
avg({{col_a}}) as avg_val,
avg({{col_b}}) as avg_goal
from {{schema}}.{{table}} t
)
, math_prep as (
select
({{col_a}}-avg_val) as val_dist,
pow({{col_a}}-avg_val,2) as val_dist_sq,
({{col_b}}-avg_goal) as goal_dist,
pow({{col_b}}-avg_goal,2) as goal_dist_sq
from {{schema}}.{{table}} t, avg_mean a
)
select
(
sum(val_dist*goal_dist)
/
(
SQRT(sum(val_dist_sq))
*
SQRT(sum(goal_dist_sq))
)
) as val_goal_correl
from math_prep
{% endmacro %}
Now you can make simple dbt models that just call this macro, like this:
--models/correl_model_a.sql
{{correlation('default', 'some_table', 'some_col', 'some_other_col')}}
and
--models/correl_model_b.sql
{{correlation('default', 'some_table', 'some_col', 'some_different_col')}}
and lastly even output them together like so:
select 'some_col_x_some_other_col' as variables, val_goal_correl from {{ref('correl_model_a')}}
UNION ALL
select 'some_col_x_some_different_col' as variables, val_goal_correl from {{ref('correl_model_b')}}
This produces a final table/view you can plug into a dashboard:
| variables | val_goal_correl |
|-------------------------------|-----------------|
| some_col_x_some_other_col | 0.123 |
| some_col_x_some_different_col | 0.643 |
Enjoy!