SQL – Ordinary Least Squares

SQL Server and T-SQL comes with few statistical tools and if you want to solve basic statistical problems, you have to construct user-defined functions or stored procedures in order to perform the necessary calculations. This article presents a solution, in the form of a user-defined function, for performing an n-day rolling regression on financial data. We will focus on the simplest case, where there is only one variable to fit. Multiple regression can be performed, but given the difficulties using a matrix approach inside T-SQL, the solution becomes too long and switching to a more professional software is probably warranted. More specifically, this approach uses the OLS way to minimize the error term.

Introduction

Linear regression is probably one of the first regression techniques a student runs across. It is useful for explaining (and perhaps forecasting) data that has a linear nature, but where the underlying process is unknown. This function creates the best fit line, by minimizing the squared error (squared distance from the estimated line). Linear regressions can be performed in Excel, by adding on of the user-defined trend lines. Likewise, many other statistical and charting packages contain ways to display linear regression in charts.

For an in-depth article about the linear regression, please refer to the Wikipedia Entry, that explains the underlying theory and equations. This solution focuses on the practical aspects of finding the best trend line for a financial time series. Moreover, this solution presents a simple solution to a relatively complex problem.

In the end, the formulas can be found in most textbooks, however, to the best of my knowledge, this is the first rolling regression recipe that has been presented for T-SQL. There is a number of books that covers the assumptions and derivations of the formulas used for the estimation process. That material is outside the scope of the article since this is mostly focused on the practical solution.

An implementation that works can be found below. The input needs to be adapted to your own data structure.

CREATE

FUNCTION [dbo].[LineReg] (@market varchar(100)='@ES.D',@startdate datetime= '1900-01-01',@enddate datetime = '2100-01-01',@lookback int = 30 )
RETURNS @Result Table (
[date] datetime, SX numeric(18,6), SY numeric(18,6), sxy numeric(18,6), alpha numeric(18,6),
Beta1 numeric(18,6), R2 numeric(18,6))
AS
BEGIN
-- Create ordered list of dates
Declare @datelist table(
id int identity(1,1),
[date] datetime)

Insert into @datelist ([date]) SELECT A.[date] from dbo.[Market-data] A
WHERE A.market = @market AND A.date between @Startdate and @Enddate
ORDER BY A.[date]

DECLARE @Sx numeric(18,6) -- Sum X
DECLARE @Sy numeric(18,6) -- Sum Y
DECLARE @Sxy numeric(18,6) -- Sum X*Y
DECLARE @Sxx numeric(18,6) -- Sum X*X
DECLARE @Syy numeric(18,6) -- Sum Y*Y
DECLARE @Beta1 numeric(18,6) -- First coeff
DECLARE @alpha numeric(18,6) -- Alpha
DECLARE @R2 numeric(18,6) -- R Square

Set @Sx = (@lookback * (@lookback +1)) / 2 -- sum of 1 2 3 4 ...n
set @Sxx = (@lookback * ( @lookback+1 ) * ( 2*@lookback+1 )) / 6 -- sum of 1 4 9 16...n*n

declare @cursor cursor 
declare @date datetime
declare @id int

Set @cursor = CURSOR FOR SELECT [date] FROM dbo.[market-data] where market = @market and date between @Startdate and @Enddate order by [date]
OPEN @cursor

FETCH NEXT From @cursor INTO @date
WHILE @@FETCH_STATUS = 0 BEGIN
Declare @tempData table ( id int , [close] numeric(18,6) )

SELECT @ID = id from @datelist where date = @date
insert into @tempData (id , [close])
select row_number() over (ORDER BY B.id), [close] from dbo.[market-data] A
INNER JOIN @datelist B on A.date = B.date and B.id between @ID - @lookback +1 and @ID 
WHERE A.market=@market 
ORDER by B.id

SELECT @Sy = sum([close]) from @tempData A
SELECT @Sxy = sum(id * [close]) from @tempData A 
SELECT @Syy = sum([close]*[close]) from @tempData A

if @lookback = 0 set @lookback =1

SET @beta1 = (@lookback * @Sxy - @Sx * @Sy) / (@lookback *@Sxx -@SX*@SX+0.001)
SET @alpha = (@Sy / @lookback) - @beta1 * @Sx / @lookback
SET @R2 = power(@lookback * @Sxy - @Sx*@Sy,2) / (( @lookback*@Sxx-@Sx*@Sx+0.001)*(@lookback *@Syy-@Sy*@Sy+0.001))

if @id >= @lookback
INSERT INTO @RESULT (date, Sx, Sy, sxy, alpha, beta1, r2)
SELECT @date, @SX,@SY, @Sxy, @alpha, @beta1, @r2
delete from @tempdata
FETCH NEXT From @cursor INTO @date
END

Return

END
%d bloggers like this: