How Scikit-learn Pipelines Make Your Life So Much Easier
During my latest machine learning project, I took some time to explore Scikit-learn pipelines. A pipeline is an object that allows you to preprocess/transform data, train a model, and make predictions, all in one easy tool. Below I will talk about some of the cool things you can do with pipelines, and how they can be a HUGE time-saver when building and validating models.
Building a Basic Pipeline
Building a pipeline is simple. Let’s say you need to do the following steps in building your model:
- Scale data to a standardized scale
- Fit a Random Forest Classifier
We can take these same steps and throw them in a pipeline. Essentially, every step before the last should be a preprocessing step, meaning it has both fit and transform methods. The final step should be an estimator with a scoring method. In this case, our scaler is our preprocessing step, and our Random Forest is the estimator. You can add any number of preprocessing steps to your pipeline before the estimator, but for this example, I will keep it simple. Check out the code below to define the pipeline:
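Here is a minimal sketch of that pipeline (the step names 'scaler' and 'rf' are labels I chose, nothing special):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Each step is a (name, object) tuple; every step before the last
# must implement fit/transform, and the last must be an estimator.
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])
```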
Notice that the steps in your pipeline are defined as a list of tuples, with the name of the step as a string first and the transformer or estimator object second.
Now you can use this pipeline just like you would use an estimator object (a short sketch follows this list):
- First, you need to fit it to the training data
- Call the .score() method, .predict(), etc.
- Access attributes and methods of the steps within your pipeline by indexing the step (this was something that I learned halfway through my project after lots of Googling… remember to go back to basics with indexing!)
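Here is a quick sketch of that usage, assuming X_train, X_test, y_train, and y_test come from your own train/test split:

```python
pipe.fit(X_train, y_train)                  # fit the whole pipeline

print(pipe.score(X_test, y_test))           # score through the final estimator
preds = pipe.predict(X_test)                # predict on new data

# Access a step's attributes by indexing the pipeline, or by name:
print(pipe[1].feature_importances_)
print(pipe.named_steps['rf'].n_estimators)
```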
Using Pipeline and GridSearchCV
Another amazing thing you can do with pipelines is integrate a grid search! GridSearchCV exhaustively searches every combination of parameters within a parameter grid you define for your model, allowing you to find the best tuning parameters in just a few lines of code. And because GridSearchCV cross-validates each candidate as it searches, your pipeline handles model tuning and validation at the same time.
In order to do this, you simply define the pipeline the same way as above, then define your parameter grid. The parameter grid is a dictionary with parameter names as keys and lists of values to test as values. The keys must follow the string format "PipelineStepName__ParameterName" (note the double underscore). Finally, you run GridSearchCV with your pipeline as the estimator and your grid as the parameter grid.
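A sketch of what that looks like with the pipeline from earlier; the specific parameters and values here are illustrative choices, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV

# Keys are 'StepName__ParameterName'; 'rf' is the name we gave
# the Random Forest step when defining the pipeline.
param_grid = {
    'rf__max_depth': [3, 5, 10],
    'rf__n_estimators': [100, 200],
    'rf__min_samples_split': [2, 5, 10],
}

# 3 * 2 * 3 = 18 combinations, each cross-validated 5 times (cv=5)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best combination found
print(grid.best_score_)             # its mean cross-validated score
best_model = grid.best_estimator_   # keep the refit pipeline for later
```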
In the example above, you can see the best parameters and the best score for the best-performing model. You can even store that best estimator in another variable to reference later on. In this way, you can add another step to your streamlined pipeline workflow: model tuning. It is important to remember that GridSearchCV is an *exhaustive* search, meaning it will try every single combination of parameters. This can get expensive very quickly as you add more parameters and/or cross-validation folds. The example above would fit 90 total models (3 × 2 × 3 parameter combinations × 5 folds), each an ensemble of 100 or 200 decision trees, so be very careful about which parameters you try and how they affect total runtime.
Dealing with Class Imbalance in Your Pipeline
If you’ve worked with pipelines before, you may have run into issues when trying to deal with class imbalance. Class imbalance is when the class you are trying to predict appears far less frequently than the other classes (e.g., a rare disease). To deal with this imbalance, you can use class weights, oversample the minority class, or undersample the majority class. One popular oversampling technique is SMOTE (Synthetic Minority Oversampling Technique) from Python’s imblearn library.
When trying to add SMOTE to my pipeline in my project, I hit an error. The issue is that sklearn’s pipeline has no concept of a sampling step: SMOTE resamples the data rather than transforming it, and you only ever want to oversample the training data, never the validation or test sets. To fix this, imblearn has a pipeline that is built on top of sklearn’s pipeline, meaning it functions almost exactly the same way. However, the imblearn pipeline applies the sampling step only while fitting and skips it when you call the .predict() or .score() methods, solving this issue.
In this way, dealing with class imbalance can be a breeze, because you can just add that step to your pipeline, like the example below.
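Here is a minimal sketch of such a pipeline; the step names and the random_state value are my own choices:

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

imb_pipe = ImbPipeline(steps=[
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),  # resamples only during fit
    ('rf', RandomForestClassifier())
])

imb_pipe.fit(X_train, y_train)           # SMOTE oversamples the training data
print(imb_pipe.score(X_test, y_test))    # the SMOTE step is skipped here
```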
Customizing Classes to Add to Your Pipeline
One of the things I explored in my most recent machine learning project is how to build a custom class for my pipeline. Why would you want to do this? Sometimes OneHotEncoder or StandardScaler is not enough preprocessing to transform your raw data into something ready to feed into your estimator. To make pipelines work most efficiently, you want them to include all of the data preprocessing, so that making predictions on new data is as simple as feeding it right into your pipeline. For example, what if you want to transform text data in a specific way, or engineer new, more complicated features? There are two ways to customize these classes:
The first way is to use Object-Oriented Programming (OOP). Sklearn provides base classes, such as BaseEstimator and TransformerMixin, that your custom class can inherit from. Remember that in order to be used as a preprocessing step in the pipeline, the class must have both .fit() and .transform() methods, while the final step must be an estimator with a scoring method. Therefore, we can build custom classes that have these methods and our pipeline will recognize them as valid steps. Check out the example below to see a very simple example:
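Below is a simple sketch of the two transformers described next, reconstructed from their descriptions; the class names and the assumption of a pandas DataFrame input are mine:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Keep only the specified columns of a DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self          # nothing to learn

    def transform(self, X):
        return X[self.columns]

class YesNoEncoder(BaseEstimator, TransformerMixin):
    """Map 'yes'/'no' values in the given columns to 1/0."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()         # avoid mutating the caller's DataFrame
        for col in self.columns:
            X[col] = X[col].map({'yes': 1, 'no': 0})
        return X
```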
The first class will allow me to choose which columns I want to use, so I can try a variety of feature sets in my grid search. The second class will transform two of the features in my dataset from ‘yes’ and ‘no’ to 1 and 0, respectively (this could be done with OneHotEncoder too, but I just wanted to show an easy example). Now I can go through a similar grid search and scoring as in the previous examples, while doing all of my data preprocessing inside my pipeline.
The second way to do this is arguably easier, but less customizable: sklearn’s FunctionTransformer. This amazing tool wraps any function (even a lambda function) in a transformer object that the pipeline can understand and use.
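For instance, here is a simple sketch wrapping a log transform (the choice of np.log1p is just for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# FunctionTransformer turns an ordinary function into a pipeline step.
log_step = FunctionTransformer(np.log1p)

pipe = Pipeline(steps=[
    ('log', log_step),
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])
```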
In conclusion…
Pipelines are your best friend when you’re building machine learning models! They make it super easy to put all of your preprocessing steps, your model, and even model tuning into an organized, streamlined workflow. If your goal is to work smarter, not harder, then pipelines can help you get there.