This article is a brief overview of what a data pipeline is, why it is useful, and how to create one for a classification model. The intended audience has experience building simple classification models in Python with sklearn.
What is a pipeline? The word has quite a bit of history and several definitions, some of which can stir up controversy. There is little controversy, however, over the pipeline in today's article: the data pipeline.
When building a model, there are several steps between importing the data and training the model: handling null values, splitting the data, identifying the target and predictors, and so on. Some of these must be done manually for each dataset and model, but pipelines can automate many of them. The result is cleaner, more modular code, with other advantages as well, such as helping prevent data leakage.
As with any modeling task, begin by importing the essentials:
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
The first two imports are familiar tools for any data scientist. The rest are the scikit-learn classes and functions used for splitting the data, scaling, modeling, and (of course) building pipelines.
Next comes another familiar step: loading and cleaning the data. In this case, I'm using the well-known red wine quality dataset:
# load and clean data
df = pd.read_csv('winequality-red.csv')

# split the predictor and target
y = df['quality']
X = df.drop('quality', axis=1)

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=2021)
The wine data is relatively easy to work with; there are only a few preprocessing steps to take care of before training the model. In this case, that means scaling the numeric features before they reach the model.
Rather than applying this transformation to the data directly, put it in the pipeline:
# Build a pipeline that scales data
scaled_pipeline = Pipeline([('ss', StandardScaler()),
                            ('RF', RandomForestClassifier())])
See? Very simple. I’m using a random forest classifier, but this works with any classifier in the toolkit.
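For example, a version of the same pipeline built around logistic regression would only differ in its final step. A quick sketch (this variant is just for illustration and isn't used in the rest of the walkthrough):
# the same pattern with a different classifier
from sklearn.linear_model import LogisticRegression
logreg_pipeline = Pipeline([('ss', StandardScaler()),
                            ('LR', LogisticRegression(max_iter=1000))])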
Now, simply fit the model and judge the performance:
# fit model with pipeline
scaled_pipeline.fit(X_train, y_train)

# score performance on the test set
scaled_pipeline.score(X_test, y_test)
Returns:
0.6725
This result means the model is correctly classifying the quality about 67% of the time. That's decent for a simple model, but there are additional ways to take advantage of this pipeline. For instance, it can be passed to GridSearchCV to explore models with different hyperparameter settings. Since a random forest is being used, parameters such as the maximum depth and the splitting behavior can be searched over. Note that each parameter name is prefixed with the pipeline step name and a double underscore ('RF__'), which tells GridSearchCV which step the parameter belongs to:
# define the parameter "grid" for the random forest step
grid = [{'RF__max_depth': [5, 15],
         'RF__min_samples_split': [2, 5],
         'RF__min_samples_leaf': [1, 3]}]
Now the pipeline can be passed as the estimator to GridSearchCV:
# grid search
gridsearch = GridSearchCV(estimator=scaled_pipeline,
                          param_grid=grid,
                          scoring='accuracy',
                          cv=5)
GridSearchCV is treated like a model itself, so now simply fit and score:
# fit model
gridsearch.fit(X_train, y_train)

# find accuracy on the test set
gridsearch.score(X_test, y_test)
Returns:
0.68
With a bit of tuning, the model improved marginally. There's still plenty of room to grow, though; expanding the grid with additional parameters and values would likely push the score higher.
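It's also worth checking which settings the search landed on. Something like the following works, using the standard best_params_ and best_score_ attributes of a fitted GridSearchCV object (this inspection step isn't shown in the original walkthrough):
# inspect the best combination found by the search
print(gridsearch.best_params_)   # winning hyperparameter values
print(gridsearch.best_score_)    # mean cross-validated accuracy for that combination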
This shows how much simpler pipelines can make model analysis, and how easy they are to create. That said, the operations added to a pipeline are not limited to simple scaling; custom operations can be defined when necessary. For instance, there are times when an operation needs to be applied to some columns and not others. That calls for a "column transformer," which is incredibly useful and deserves an article of its own.
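As a teaser, a minimal sketch of that idea might look something like this, where num_cols and cat_cols are placeholder lists of column names (the wine data is entirely numeric, so these don't come from the example above):
# scale numeric columns, one-hot encode categorical ones, then classify
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_cols = ['some_numeric_column']       # placeholder numeric column names
cat_cols = ['some_categorical_column']   # placeholder categorical column names

preprocessor = ColumnTransformer([('num', StandardScaler(), num_cols),
                                  ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)])

ct_pipeline = Pipeline([('prep', preprocessor),
                        ('RF', RandomForestClassifier())])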