Linear Regression with TA-Lib
Understanding how the Linear Regression functions work in the TA-Lib Python library
For those of you new to TA-Lib and technical analysis with Python, the next lines paraphrase the introduction you can find on the website where the TA-Lib Python library is available for download.
In short, TA-Lib is used by trading software developers who need to perform technical analysis of financial market data.
It includes 150+ indicators such as ADX, MACD, RSI, Stochastic and Bollinger Bands, candlestick pattern recognition, and an open-source API for C/C++, Java, Perl, Python and 100% Managed .NET.
Having said that, the objective of this story is to support those trying to understand how the Linear Regression functions work in that library. The available documentation is minimal, and the few hints you can find are in the comments of the source code, which don’t help much anyway.
So the point is not to explain what linear regression is (I hope you already know it, at least conceptually) but how the functions implementing linear regression work in TA-Lib.
The available functions are described one by one below, starting with the main one:
ta.LINEARREG(series of values, timeperiod)
If you understand how this function works, you understand all the others as well.
Most of us using TA-Lib work with pandas dataframes, because we’re used to downloading stock data from online services like Yahoo Finance, so my examples assume dataframe columns. This function takes two parameters as input:
- a column of the dataframe with values that we want to regress
- a timeperiod parameter.
and returns a series of values where each value is the linear regression equation evaluated at x=timeperiod-1. What does that mean?
The first thing to understand is that the ta.LINEARREG function considers neither Date fields nor the sequential index of the rows in the dataframe.
It takes the column of values you want to regress (the dependent variable y, let’s say) and the timeperiod value, then automatically forms from that column all possible sets of timeperiod consecutive values, using a new index (from 0 to timeperiod-1) for every subset of data.
So for example if we have 20 values (n=20) in our dataframe column of y values (typically indexed from 0, the oldest, to 19, the newest) and we set timeperiod=10, the function will form (n-timeperiod+1)=20-10+1=11 sets, each one of ‘timeperiod’ length and each one with a temporary index from 0 to 9 (=> timeperiod-1).
At that point, for every one of the 11 sets in our example, the function calculates the regression line y=ax+b using the 10 numbers of that set. Then it takes the last index of the subset (in our example x=9, because numbering goes from 0 to 9 if timeperiod=10) and evaluates the regression equation at x=9 (y=a*9+b). Each result is stored on the row of the newest value of its set, so the result for the most recent set sits at the bottom of the series. At the end of the loop we have a new series made up of 11 values, but in order to keep it aligned with the shape of the columns in the dataframe, the function fills the remaining timeperiod-1=9 positions at the top (rows 0 to 8, the oldest ones) with NaN.
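The steps above can be sketched in plain Python. This is not TA-Lib code: `fit_line` and `rolling_linearreg` are hypothetical helpers written only to mirror the behaviour described in this story, and the prices are made up.

```python
import math


def fit_line(ys):
    """Least-squares line y = a*x + b through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def rolling_linearreg(values, timeperiod):
    """Hypothetical helper: regression value at x = timeperiod - 1 for every window."""
    out = [math.nan] * len(values)  # rows without a full window stay NaN
    for end in range(timeperiod - 1, len(values)):
        # The window is implicitly re-indexed 0..timeperiod-1 by fit_line.
        window = values[end - timeperiod + 1 : end + 1]
        a, b = fit_line(window)
        out[end] = a * (timeperiod - 1) + b
    return out


# Made-up closing prices, oldest first (n = 20).
closes = [10.0, 11.0, 10.5, 12.0, 12.5, 12.0, 13.5, 14.0, 13.0, 15.0,
          15.5, 15.0, 16.5, 17.0, 16.0, 18.0, 18.5, 18.0, 19.5, 20.0]
fitted = rolling_linearreg(closes, timeperiod=10)
valid = [v for v in fitted if not math.isnan(v)]
print(len(valid), "values,", len(fitted) - len(valid), "NaN")  # 11 values, 9 NaN
```

If you have TA-Lib installed, comparing `fitted` with the output of ta.LINEARREG on the same data is a quick way to check whether this reading of the function matches your version.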
As a picture is worth more than a thousand words, look at the picture below for an overall view of how ta.LINEARREG (and its sister functions) work.
ta.LINEARREG_SLOPE(column with y values, timeperiod)
ta.LINEARREG_INTERCEPT(column with y values, timeperiod)
ta.LINEARREG_ANGLE(column with y values, timeperiod)
If you have understood the main LINEARREG function and the picture described above, these 3 functions work exactly the same way. Remember that they don’t look at the Dates column or the sequential index of the dataframe.
ta.LINEARREG_SLOPE returns a new series of numbers where each one is the slope (coefficient a in y=ax+b) of the linear regression equation obtained from one of the subsets of timeperiod length (in the example above I used a dataframe with 20 values and timeperiod=10, so the function forms 11 subsets of 10 numbers each). The unavailable values at the top are filled with NaN. If you plot this new series you get the history of how the slope, calculated over a window of timeperiod values, changes in time. Something like this:
ta.LINEARREG_ANGLE returns a new series of numbers where each one is the angle in degrees of the slope (the arctangent of the slope, converted to degrees) of the linear regression equation calculated on one of the subsets of timeperiod length (in the example there are 11). As usual, unavailable values are filled with NaN at the top.
ta.LINEARREG_INTERCEPT returns a series of numbers where each one is the intercept (coefficient b in y=ax+b, that is the value at x=0 relative to each subset of timeperiod length) of the linear regression equation obtained from one of the subsets of timeperiod length (in the example there are 11). Unavailable values are filled, as usual, with NaN at the top.
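To make the three outputs concrete, here is a small plain-Python sketch, again with hypothetical helpers (`fit_line`, `rolling_fit`) that only imitate the behaviour described here and are not TA-Lib code. The input rises by exactly 1 per row, so the slope should be 1 and the angle 45 degrees in every window.

```python
import math


def fit_line(ys):
    """Least-squares line y = a*x + b through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rolling_fit(values, timeperiod):
    """Hypothetical helper: slope, intercept and angle (degrees) for every window."""
    n = len(values)
    slopes = [math.nan] * n
    intercepts = [math.nan] * n
    angles = [math.nan] * n
    for end in range(timeperiod - 1, n):
        a, b = fit_line(values[end - timeperiod + 1 : end + 1])
        slopes[end] = a
        intercepts[end] = b  # value of the line at the window's own x = 0
        angles[end] = math.degrees(math.atan(a))
    return slopes, intercepts, angles


values = [float(i) for i in range(20)]  # 0, 1, ..., 19: rises by 1 per row
slopes, intercepts, angles = rolling_fit(values, timeperiod=10)
print(slopes[-1], intercepts[-1], angles[-1])
```

Note the intercept of the last window: it is close to 10, not 0, because x=0 refers to the first row of that window (the value 10), not to the first row of the dataframe.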
ta.TSF(column with y values, timeperiod)
TSF stands for Time Series Forecast (the function is exposed simply as TSF, without the LINEARREG_ prefix), but the name is misleading imho. This function predicts one future value at x=timeperiod (in our example timeperiod=10).
This means that it conceptually does the same work as ta.LINEARREG, but instead of evaluating the regression equation at x=timeperiod-1, it extrapolates one tick into the future by evaluating it at x=timeperiod.
In this case too the function returns a series of numbers, each one representing the forecast based on the linear regression of a set of values of timeperiod length. Of course all of the forecasts are old forecasts except the last one (at the bottom of the series), so in my opinion only the last one is of practical use, because it is a real forecast for something that has theoretically not happened yet.
Don’t get confused by the index of the returned series. Consider every number of the returned series for what it is: the result of evaluating one regression equation at x=timeperiod. So the last number in the series is the forecast for the 11th item of the last window (=> y=a*timeperiod+b), that is the value that would follow the newest row.
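The only difference between the two evaluations is the x at which the line is evaluated, which a short plain-Python sketch makes visible. As before, `fit_line` and `rolling_eval` are hypothetical helpers imitating the described behaviour, not TA-Lib code; the input is a perfect line y=3x+2 so the results are easy to check by hand.

```python
import math


def fit_line(ys):
    """Least-squares line y = a*x + b through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rolling_eval(values, timeperiod, x_eval):
    """Hypothetical helper: evaluate each window's regression line at x_eval."""
    out = [math.nan] * len(values)
    for end in range(timeperiod - 1, len(values)):
        a, b = fit_line(values[end - timeperiod + 1 : end + 1])
        out[end] = a * x_eval + b
    return out


timeperiod = 10
values = [3.0 * i + 2.0 for i in range(20)]  # rows 0..19 of a perfect line

fitted = rolling_eval(values, timeperiod, x_eval=timeperiod - 1)  # LINEARREG-style
forecast = rolling_eval(values, timeperiod, x_eval=timeperiod)    # TSF-style

# On the last row the fitted value reproduces row 19 (59), while the
# forecast is the value a 21st row would have (3*20 + 2 = 62).
print(round(fitted[-1], 6), round(forecast[-1], 6))
```

On every row the forecast equals the fitted value plus the slope of that window, which is another way of saying "one tick in the future".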
ta.CORREL(column with x values, column with y values, timeperiod)
In this case you have to explicitly provide the two columns of x and y. In the previous cases you provided only the y column, because the function determined the x values automatically from the timeperiod.
This difference is useful because you can evaluate the correlation between any two columns, for example between ‘high’ and ‘open’.
The result, as you can expect, is the (Pearson) correlation coefficient between the two columns.
Of course ta.CORREL becomes useful if you want to check whether it makes sense at all to use the linear regression functions described above. Always keep in mind that you have to give this function all of the x and all of the y values, but the correlation is not calculated on the entire set: it is calculated for every subset of timeperiod length. So a plot of the result tells you how the correlation changes in time across all the windows of timeperiod length.
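The windowed correlation can be sketched the same way. `pearson` and `rolling_correl` below are hypothetical plain-Python helpers that imitate the behaviour described here, not TA-Lib code, and the two columns are made up so that ‘high’ is an exact linear function of ‘open’.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rolling_correl(xs, ys, timeperiod):
    """Hypothetical helper: correlation of xs and ys over every window."""
    out = [math.nan] * len(xs)  # rows without a full window stay NaN
    for end in range(timeperiod - 1, len(xs)):
        start = end - timeperiod + 1
        out[end] = pearson(xs[start : end + 1], ys[start : end + 1])
    return out


# Made-up columns: highs is an exact linear function of opens,
# so every full window should have a correlation of (almost exactly) 1.
opens = [10.0, 11.0, 12.5, 12.0, 13.0, 14.5, 14.0, 15.0]
highs = [2.0 * o + 1.0 for o in opens]
corr = rolling_correl(opens, highs, timeperiod=5)
print([round(c, 6) for c in corr])
```

With real price columns the values move between -1 and 1 from window to window, and that movement is what the plot of the series shows.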
I hope you appreciated my effort to fill the gap in the documentation: using these functions is easy, but understanding the results is sometimes a bit more complex than one might think.