Skip to content

SQL – Correlation Calculation

Regrettably SQL Server does not come with a built in Correlation calculation function, but any data analyst with self respect has written a correlation function (alternative we can move the data to R or similar tools).

Below, we are sharing the correlation function that NilssonHedge is using. The function below is free to reuse, if you find it useful. This function is one of the tools we use to identify diverging or converging programs. We use a number of other quality checking tools, but correlation is a quick way to identify highly correlated return streams.

Create FUNCTION [dbo].[ProgramCorrelation]
(
	@programId1 int,
	@programId2 int
)
RETURNS decimal(8,4)
AS
BEGIN
	
	DECLARE @count int
	DECLARE @sumxy float
	DECLARE @sumx float
	DECLARE @sumx2 float
	
	DECLARE @sumy float
	DECLARE @sumy2 float
	

	DECLARE  @temp TABLE (
		date	date,
		ror1	numeric(8,6),
		ror2	numeric(8,6)
	)	
	
	-- Create temp table with ror data to speed up calculations	
	INSERT INTO @temp 
		SELECT A.date, A.ror, B.ror
		FROM dbo.ManagerRoRAUM A
		INNER JOIN dbo.ManagerRoRAUM B on A.date=b.date
		WHERE A.id= @programId1 AND B.id=@programId2
	-- temp variables
	SELECT @count =  COUNT(*) FROM @temp
	SELECT @sumxy = sum(ror1*ror2) FROM @temp
	SELECT @sumx = sum(ror1) FROM @temp
	SELECT @sumy = sum(ror2) FROM @temp
	SELECT @sumx2 = sum(ror1*ror1) FROM @temp
	SELECT @sumy2 = sum(ror2*ror2) FROM @temp
	-- Pearson correlation formula
	if @count <12
		RETURN -98
	IF (SQRT( (@count*@sumx2 -@sumx*@sumx)*(@count*@sumy2 -@sumy*@sumy)) <> 0) 
			RETURN (@count*@sumxy-@sumx*@sumy)/(SQRT( (@count*@sumx2 -@sumx*@sumx)*(@count*@sumy2 -@sumy*@sumy)))
	RETURN -99 -- Catch divide by zero
END

The function is pretty straightforward. It fetches synced data from a table and then it applies the Pearson correlation formula and produces a correlation number. Note that if there are fewer than twelve months of data, the function returns an error.

It uses a temporary ror table to select the returns from the database. This works better than to select two return series from a large database table.