A Simple Guide to Linear Regressions with Polynomial Features
Machine learning is a fundamental tool for any data scientist. It falls into two broad categories, supervised and unsupervised. Supervised learning simply means the data comes with labels: we know what the model is drawing conclusions about. Perhaps the most rudimentary type of supervised learning is linear regression, which looks at data and returns a “best fit line” used to predict the qualities new data will have based on your sample. The extension of this is fitting the data with a polynomial, which just means the best fit line no longer has to be straight; it can curve with our data. This article outlines how to run a linear regression on data, then how to improve the model by adding polynomial features.
The data I’m working with is a set of observations of numerous galaxies in the observable universe. Specifically, I’ll be estimating the redshift of a galaxy. Here’s a link if you’re interested in checking out the data yourself:
There are plenty of tools available for manipulating data, creating visualizations, and fitting linear regressions, including polyfit() from numpy. However, to make the transition to machine learning clearer, I’ll be using sklearn to create the regressions. In addition, I’ll be manipulating data with numpy and pandas, with visualizations left to the OG matplotlib.
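For reference, numpy’s polyfit() does a least-squares polynomial fit in a single call; here’s a minimal sketch on made-up toy data (not the galaxy data used below):
# least-squares polynomial fit with numpy's polyfit on toy data
import numpy as np
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # made-up predictor values
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # made-up target values
# deg=1 fits a straight line; deg=2 lets the fit curve
line_coeffs = np.polyfit(x, y, deg=1)   # [slope, intercept]
curve_coeffs = np.polyfit(x, y, deg=2)  # [a, b, c] for a*x^2 + b*x + c
y_pred = np.polyval(line_coeffs, x)     # evaluate the fitted line at x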
For the exhaustive list of packages and modules used, refer to the import section of the example code.
Begin with importing our packages:
# import packages
# pandas and numpy, standard for loading and manipulating data
import pandas as pd
import numpy as np
# visualization imports
# matplotlib is a ubiquitous visualization package
import matplotlib.pyplot as plt
# machine learning imports
# to split your data in order to get an accurate model
from sklearn.model_selection import train_test_split
# models to assess your data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
Hint: if you encounter errors here, it’s likely you need to pip install or conda install one or more of these packages
Next we load the data into a pandas DataFrame. This loads locally stored data into an object which can be manipulated:
# load data into pandas dataframe
# create string of relative or absolute path to data file
filename = 'data/COMBO17.csv'
data = pd.read_csv(filename)
# check dataframe
data.head()
####################################################################
Returns:
See picture below
Now for some data cleaning. This is an essential step after loading data: always make sure you clean your data!
# clean data
# check for duplicate entries in data set
print(data.duplicated().any())
# check for na's (missing information)
data.isna().any().any()
####################################################################
Returns:
False
True
The above code returns “False” then “True”. This means there is no duplicated data; however, we have found some entries with missing information. This requires attention, since otherwise those rows can’t be used to create the model. Below I check which columns have missing information and how much is missing.
# check columns and number of na's
for col in data.columns:
    if sum(data[col].isna()):
        print(col, sum(data[col].isna()))
####################################################################
Returns:
VnMAG 1
e.VbMAG 1
S280MAG 24
e.S280MA 24
While the meaning of these columns is esoteric, there are up to 50 rows containing missing data. 50 seems like it could be an issue, so let’s check the size of our dataframe.
# check dimensions of dataframe
data.shape
####################################################################
Returns:
(3462, 65)
The return of this .shape attribute is (3462, 65). This means there are over 3400 rows, and from earlier we know there are 65 columns. Given there are at most 50 rows missing information, we can say with confidence that dropping them won’t skew our data in any meaningful way.
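To see just how small a share of the data that is, here’s a quick sketch (reusing the data DataFrame loaded above) that counts rows containing at least one missing value:
# count rows with at least one missing value and their share of all rows
rows_with_na = data.isna().any(axis=1).sum()
print(rows_with_na, rows_with_na / len(data))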
# drop rows missing information
data.dropna(inplace=True)
# returns False if all na's dropped
print(data.isna().any().any())
# check dimensions of dataframe
data.shape
####################################################################
Returns:
False
(3438, 65)
Looks like there were only 24 rows missing information. Now I’ve successfully dropped the na’s and am ready to continue.
Since I’m interested in redshift, the column that most closely approximates it is labelled ‘Mcz’. Let’s find columns with a high correlation to ‘Mcz’. In this case, I used 65% correlation as my filter.
# when checking for red-shift we're interested in Mcz
# we look to find high correlations with this column
# instantiate empty list to fill with predictor columns
predictors = []
# loop through columns to find those with high correlations
for ind in data.corr()['Mcz'].index:
    if abs(data.corr()['Mcz'][ind]) > 0.65:
        # append columns meeting parameter to list
        predictors.append(ind)
# see table of correlations
data[predictors].corr()
####################################################################
Returns:
See picture below
I found the columns with very high correlations with Mcz. One might be tempted to take the highest correlation, but upon some digging in the documentation, I found that column is simply another estimate of redshift. Instead, I took S280MAG, which has the second highest correlation. (Note: we’re looking for the highest magnitude, so we ignore the negative sign.)
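If you’d rather rank the candidates directly, here’s a minimal sketch (again reusing the data DataFrame) that sorts the correlations with Mcz by absolute value:
# rank columns by the magnitude of their correlation with Mcz
corr_with_mcz = data.corr()['Mcz']
print(corr_with_mcz.abs().sort_values(ascending=False).head(10))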
Now that I’ve chosen S280MAG as the predictor, I need to separate the data.
# separate data
x = data['S280MAG']
y = data.Mcz
# we consider Mcz our "target", what we want to predict
# reshape predictor for agreement with regressor
predictor = x.values.reshape(-1, 1)
target = y
####################################################################
Returns:
No return, final line is an assignment
Before I run the regression, it’s a good idea to visualize the data. I did this using matplotlib.
# plot a scatter plot with matplotlib.pyplot to visualize
# correlation
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(predictor, target);
plt.title('Galaxy Redshift Correlation: S280MAG vs. Mcz')
plt.xlabel('S280MAG (redshift predictor)')
plt.ylabel('Mcz (estimated redshift magnitude)')
####################################################################
Returns:
See picture below
Now I’ll implement functions from sklearn. When training a model, it’s wise to have something to test it against; otherwise there’s no way to approximate how our model will perform on unseen data. After all, the main purpose of creating a predictive model is to predict real-world phenomena where we want to approximate the answer.
The way this is done is by using sklearn’s train_test_split. This takes the data and sets aside a certain portion to test our model on.
# split data here. splitting data allows more accurate assessment of
# model's performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(predictor, target, random_state=144)
####################################################################
Returns:
No return, final line is an assignment
Now that I have data to train the model on, I use LinearRegression from sklearn.linear_model to fit and score the model.
# Linear regression
# create instance of Linear Regression class
linreg = LinearRegression()
# feed model training data
linreg.fit(X_train, y_train)
# check how linear model fits on train data
linreg.score(X_train, y_train)
####################################################################
Returns:
0.5062190758933518
The decimal returned above is the R² value of our regression line on the data. However, this is the score for how well it did on the training data; I still need to check the test data.
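For reference, that R² value can be reproduced by hand from the model’s predictions; here’s a minimal sketch reusing the linreg, X_train, and y_train objects from above:
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_pred = linreg.predict(X_train)
ss_res = ((y_train - y_pred) ** 2).sum()
ss_tot = ((y_train - y_train.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # matches linreg.score(X_train, y_train)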
# check how linear model works on test data
linreg.score(X_test, y_test)
####################################################################
Returns:
0.4361796390722199
A decent R² score considering it’s a linear fit on clearly non-linear tornado-looking data. However, the model can improve. Let’s add Polynomial Features.
# add higher order polynomial features to linear regression
# create instance of polynomial regression class
poly = PolynomialFeatures(degree=2)
# create new training data with polynomial features instance
X_train_poly = poly.fit_transform(X_train)
# fit with features using linear model
poly_fit = LinearRegression().fit(X_train_poly, y_train)
# check how polynomial (2nd order) model works on train data
poly_fit.score(X_train_poly, y_train)
####################################################################
Returns:
0.5070155216633236
Again, I check how this does on the testing data.
# transform test data with the same poly instance (transform, not fit_transform)
X_test_poly = poly.transform(X_test)
# check how polynomial (2nd order) model works on test data
poly_fit.score(X_test_poly, y_test)
####################################################################
Returns:
0.4376592266134638
This does better, but not much better. Let’s see how far we can push the model by raising the degree. After some iterations (a sketch of the degree sweep follows the 7th order results below), it looks like 7th order performs best.
# add higher order polynomial features to linear regression
# create instance of polynomial regression class
poly = PolynomialFeatures(degree=7)
# create new training data with polynomial features instance
X_train_poly = poly.fit_transform(X_train)
# fit with features using linear model
poly_fit = LinearRegression().fit(X_train_poly, y_train)
# check how polynomial (7th order) model works on train data
poly_fit.score(X_train_poly, y_train)
####################################################################
Returns:
0.5460465265443504
Upon checking test data:
# transform test data with the poly instance -
# DO NOT fit_transform
X_test_poly = poly.transform(X_test)
# check how polynomial (7th order) model works on test data
poly_fit.score(X_test_poly, y_test)
####################################################################
Returns:
0.49943705606320155
I see the 7th order model fits the training data best, and seeing as it also performs better (returns a higher R² value) on the test data, there’s evidence this improvement isn’t just over-fitting.
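Here’s a minimal sketch of the degree sweep mentioned above, reusing the imports and the X_train, X_test, y_train, y_test split from earlier; the range of degrees tried is my own arbitrary choice:
# compare train and test R^2 across polynomial degrees
for degree in range(1, 10):
    poly = PolynomialFeatures(degree=degree)
    X_tr = poly.fit_transform(X_train)
    X_te = poly.transform(X_test)
    model = LinearRegression().fit(X_tr, y_train)
    print(degree, model.score(X_tr, y_train), model.score(X_te, y_test))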
Voilà! I’ve completed a linear regression, added 2nd order features, then 7th order features for good measure. There are many more methods of modelling, and within this method plenty of room for improvement, for instance using “cross validation” or “K-folds” to improve how we train our data, doing further hyper-parameter tuning, implementing things like GridSearchCV, or even running classifiers on this data (as we know there’s plenty of it). However, I’ll leave those for another blog post.
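As a teaser for that follow-up, here’s a minimal sketch of K-fold cross validation on the plain linear model, reusing the predictor and target arrays from earlier; cv=5 is an arbitrary choice:
# 5-fold cross validation of the plain linear model
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearRegression(), predictor, target, cv=5)
print(scores, scores.mean())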