SQL Server and T-SQL comes with few statistical tools and if you want to solve basic statistical problems, you have to construct user-defined functions or stored procedures in order to perform the necessary calculations. This article presents a solution, in the form of a user-defined function, for performing an n-day rolling regression on financial data. We will focus on the simplest case, where there is only one variable to fit. Multiple regression can be performed, but given the difficulties using a matrix approach inside T-SQL, the solution becomes too long and switching to a more professional software is probably warranted. More specifically, this approach uses the OLS way to minimize the error term.
Introduction
Linear regression is probably one of the first regression techniques a student runs across. It is useful for explaining (and perhaps forecasting) data that has a linear nature, but where the underlying process is unknown. This function creates the best fit line, by minimizing the squared error (squared distance from the estimated line). Linear regressions can be performed in Excel, by adding on of the user-defined trend lines. Likewise, many other statistical and charting packages contain ways to display linear regression in charts.
For an in-depth article about the linear regression, please refer to the Wikipedia Entry, that explains the underlying theory and equations. This solution focuses on the practical aspects of finding the best trend line for a financial time series. Moreover, this solution presents a simple solution to a relatively complex problem.
In the end, the formulas can be found in most textbooks, however, to the best of my knowledge, this is the first rolling regression recipe that has been presented for T-SQL. There is a number of books that covers the assumptions and derivations of the formulas used for the estimation process. That material is outside the scope of the article since this is mostly focused on the practical solution.
An implementation that works can be found below. The input needs to be adapted to your own data structure.
CREATE FUNCTION [dbo].[LineReg] (@market varchar(100)='@ES.D',@startdate datetime= '1900-01-01',@enddate datetime = '2100-01-01',@lookback int = 30 ) RETURNS @Result Table ( [date] datetime, SX numeric(18,6), SY numeric(18,6), sxy numeric(18,6), alpha numeric(18,6), Beta1 numeric(18,6), R2 numeric(18,6)) AS BEGIN -- Create ordered list of dates Declare @datelist table( id int identity(1,1), [date] datetime) Insert into @datelist ([date]) SELECT A.[date] from dbo.[Market-data] A WHERE A.market = @market AND A.date between @Startdate and @Enddate ORDER BY A.[date] DECLARE @Sx numeric(18,6) -- Sum X DECLARE @Sy numeric(18,6) -- Sum Y DECLARE @Sxy numeric(18,6) -- Sum X*Y DECLARE @Sxx numeric(18,6) -- Sum X*X DECLARE @Syy numeric(18,6) -- Sum Y*Y DECLARE @Beta1 numeric(18,6) -- First coeff DECLARE @alpha numeric(18,6) -- Alpha DECLARE @R2 numeric(18,6) -- R Square Set @Sx = (@lookback * (@lookback +1)) / 2 -- sum of 1 2 3 4 ...n set @Sxx = (@lookback * ( @lookback+1 ) * ( 2*@lookback+1 )) / 6 -- sum of 1 4 9 16...n*n declare @cursor cursor declare @date datetime declare @id int Set @cursor = CURSOR FOR SELECT [date] FROM dbo.[market-data] where market = @market and date between @Startdate and @Enddate order by [date] OPEN @cursor FETCH NEXT From @cursor INTO @date WHILE @@FETCH_STATUS = 0 BEGIN Declare @tempData table ( id int , [close] numeric(18,6) ) SELECT @ID = id from @datelist where date = @date insert into @tempData (id , [close]) select row_number() over (ORDER BY B.id), [close] from dbo.[market-data] A INNER JOIN @datelist B on A.date = B.date and B.id between @ID - @lookback +1 and @ID WHERE A.market=@market ORDER by B.id SELECT @Sy = sum([close]) from @tempData A SELECT @Sxy = sum(id * [close]) from @tempData A SELECT @Syy = sum([close]*[close]) from @tempData A if @lookback = 0 set @lookback =1 SET @beta1 = (@lookback * @Sxy - @Sx * @Sy) / (@lookback *@Sxx -@SX*@SX+0.001) SET @alpha = (@Sy / @lookback) - @beta1 * @Sx / @lookback SET @R2 = power(@lookback * @Sxy - @Sx*@Sy,2) / (( @lookback*@Sxx-@Sx*@Sx+0.001)*(@lookback *@Syy-@Sy*@Sy+0.001)) if @id >= @lookback INSERT INTO @RESULT (date, Sx, Sy, sxy, alpha, beta1, r2) SELECT @date, @SX,@SY, @Sxy, @alpha, @beta1, @r2 delete from @tempdata FETCH NEXT From @cursor INTO @date END Return END