Guide to Linear Regressions in Python
This article outlines the essentials of building a linear regression in Python using NumPy.
A linear regression is one of the simplest and oldest types of model available for predictive and inferential data analysis, and it remains incredibly important. With origins dating to the late 19th century, it is well tested and widely used. In fact, the concepts behind linear regression are foundational to machine learning, including advanced methods like deep learning and neural networks.
Put simply, a linear regression is a “best fit line” used to approximate a trend in a graph. In predictive applications, it’s used to provide an estimate, or “best guess” when a direct measurement is not taken or cannot be made.
What a great concept! Find the best-fit line for a given set of data points, and just like that, estimates can be made. To do this, a criterion must be chosen that defines what “best fit” means.
As is the case for the great majority of linear regressions, the criterion used is known as the “least squares” method. In other words, “squares” are defined, and a method is found to minimize them (find the “least” of the “squares”).
What exactly is getting “squared”? The vertical distances from the data points to the regression line. Squaring these distances quantifies the error in the estimate. The line with the least total squared error can then be taken as the best estimate under this criterion.
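To make the criterion concrete, here is a small sketch of the squared-error calculation described above. The helper name `sum_squared_error` and the toy arrays are illustrative only, not from the original walkthrough:

```python
import numpy as np

def sum_squared_error(X, Y, m, b):
    """Sum of squared vertical distances from the points (X, Y) to the line y = m*x + b."""
    residuals = Y - (m * X + b)
    return np.sum(residuals ** 2)

X_toy = np.array([1, 2, 3], dtype=np.float64)
Y_toy = np.array([2, 4, 6], dtype=np.float64)

# the line y = 2x passes through all three points, so its error is zero
print(sum_squared_error(X_toy, Y_toy, 2.0, 0.0))  # 0.0
# a worse candidate line accumulates a larger total squared error
print(sum_squared_error(X_toy, Y_toy, 1.0, 0.0))  # 1 + 4 + 9 = 14.0
```

The least-squares line is simply the choice of m and b that makes this quantity as small as possible.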
Regressions require two types of data, used in tandem: dependent and independent variables. For instance, time is independent: it marches forward at a (relatively) constant rate. Distance traveled, on the other hand, can be a dependent variable: the distance may change relative to time, but time will march tenaciously forward regardless of the distance traveled.
As always, begin with essential imports, in this case NumPy for manipulating data, and Matplotlib for visualizations:
import numpy as np
import matplotlib.pyplot as plt
Add the data of interest. In this case, I manufacture some:
# create independent and dependent variables
X = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.float64)
Y = np.array([6,6,8,10,10,12,12,15,13,16], dtype=np.float64)
Just looking at the arrays, there seems to be a nice correlation; however, it’s wise to visualize your data with the tools at your disposal. Make a Matplotlib scatter plot to get an idea of existing trends:
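The plotting snippet isn’t reproduced here, so a minimal version is sketched below (the axis labels are my own additions):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float64)
Y = np.array([6, 6, 8, 10, 10, 12, 12, 15, 13, 16], dtype=np.float64)

# scatter the raw points to eyeball any linear trend
plt.scatter(X, Y, color='#003F72')
plt.xlabel('X (independent)')
plt.ylabel('Y (dependent)')
plt.show()
```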
Returns: see figure below (scatter plot of the data).
Indeed there is a strong positive correlation. The data points can be used to calculate the slope, m, of the linear regression using the formula below.
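For reference, the least-squares slope formula implemented in the next snippet can be written as follows, where a bar denotes the mean of the quantity beneath it:

```latex
m = \frac{\bar{X}\,\bar{Y} - \overline{XY}}{\bar{X}^{2} - \overline{X^{2}}}
```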
Define a function to calculate the slope based on input arrays:
def calc_slope(X, Y):
    # calculate the slope using the least-squares formula above
    m = (((np.mean(X) * np.mean(Y)) - np.mean(X * Y)) /
         ((np.mean(X) ** 2) - np.mean(X * X)))
    return m
Traditionally, lines are defined in slope-intercept form; perhaps y = mx + b rings a bell. Here, x is given in the form of the values in the X array, and m can be calculated with the function defined above, so what remains is to find b, the y-intercept of the regression.
def best_fit(X, Y):
    # find the intercept by solving the slope-intercept equation for b
    m = calc_slope(X, Y)
    b = np.mean(Y) - m * np.mean(X)
    return m, b
With these pieces, a regression function may be defined. It takes the slope and intercept from the best_fit function and computes the values of the regression line for the given data:
def reg_line(m, b, X):
    return [(m * x) + b for x in X]
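The whole pipeline can be sanity-checked in one self-contained snippet. The formulas are restated inline so it runs on its own, and np.polyfit (NumPy’s built-in degree-1 least-squares fit) is used here purely as a cross-check; it is not part of the original walkthrough:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float64)
Y = np.array([6, 6, 8, 10, 10, 12, 12, 15, 13, 16], dtype=np.float64)

# least-squares slope and intercept, as derived above
m = ((np.mean(X) * np.mean(Y) - np.mean(X * Y)) /
     (np.mean(X) ** 2 - np.mean(X * X)))
b = np.mean(Y) - m * np.mean(X)

regression_line = [m * x + b for x in X]

# np.polyfit(X, Y, 1) fits the same degree-1 least-squares line
m_ref, b_ref = np.polyfit(X, Y, 1)
print(np.isclose(m, m_ref), np.isclose(b, b_ref))  # True True
```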
Now, simply plot the regression line alongside the data!
m, b = best_fit(X, Y)
plt.scatter(X, Y, color='#003F72', label="Input Data")
plt.plot(X, reg_line(m, b, X), color='r', label="Regression Line")
plt.show()
Returns: see figure below (data with the fitted regression line).
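Finally, the predictive use mentioned at the start: the fitted line can estimate the dependent value at a point that was never measured. The snippet is self-contained, and the query point x = 8.5 is chosen arbitrarily for illustration:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float64)
Y = np.array([6, 6, 8, 10, 10, 12, 12, 15, 13, 16], dtype=np.float64)

# least-squares slope and intercept, as derived above
m = ((np.mean(X) * np.mean(Y) - np.mean(X * Y)) /
     (np.mean(X) ** 2 - np.mean(X * X)))
b = np.mean(Y) - m * np.mean(X)

# estimate the dependent value at an unmeasured point
x_new = 8.5  # chosen arbitrarily for illustration
y_pred = m * x_new + b
print(round(y_pred, 2))  # 14.11
```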