How to Use Pipeline to Chain Together Multiple Steps

A Pipeline is a way to organize complex machine learning workflows by combining data pre-processing steps with machine learning algorithms in a single unified model. It helps streamline the process of developing and deploying machine learning models and also makes the code more concise and easier to understand.

In machine learning, it is common to apply a sequence of transformations to the input data before feeding it into a model. For example, we might want to scale the features to a common range, impute missing values, or apply feature selection. Using scikit-learn's Pipeline class, we can easily chain together multiple steps into a single object that can be fit on training data and used to transform both training and test data. In this thread, we will look at the steps for creating a pipeline in Python. If you want to learn more about scikit-learn's techniques, follow the threads below:

  1. Applying different imputation techniques to handle missing values.
  2. Handling unknown categories using OneHotEncoder.
  3. Encoding categorical features using sklearn.
  4. Selecting columns when transforming columns.
  5. Difference between fit and transform with example.
  6. Applying ColumnTransformer to dataframe columns.

Step 1: Selecting a dataset to work on

  • The first step in any machine learning process is to load a dataset into the correct format.
  • We are using scikit-learn’s built-in breast cancer dataset, loaded with load_breast_cancer.
  • We have converted this dataset into a DataFrame so it is easy to view.
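A minimal sketch of this step might look like the following (the variable names are our own choices, not from the original code):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the built-in breast cancer dataset
data = load_breast_cancer()

# Convert it into a DataFrame so the features are easy to view
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

print(df.shape)  # (569, 31): 30 features plus the target column
```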

Step 2: Adding missing values and changing scales

  • This step is not strictly necessary, but the dataset we’ve loaded is almost perfect; since we are going to use pre-processing tools, we want them to have something to fix.
  • We’ve added some missing values and changed the scales of a few features.

Step 3: Splitting the data into training and testing sets

  • This step is essential because, when applying pre-processing tools, we want them to learn only from the training set and never from the testing set.
  • We’ve split the data such that the testing set contains 20% of the complete data.
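The split can be done with train_test_split; the random_state value below is our own choice, added so the example is reproducible:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```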

Step 4: Making a pipeline

  • We have first imported all the functions we will be using, i.e., one for making a pipeline, one for scaling the features, one for handling missing values, and lastly, a model that will be trained on our data.
  • The Pipeline class takes sequential steps in the form of tuples where the first item in the tuple is the name for the step, and the second item in the tuple specifies the object of the tool or the model we will apply.
  • We have defined a pipeline such that it will first impute missing values in the features using SimpleImputer, then scale them using StandardScaler, and finally fit a LinearRegression model on the newly processed data.
  • After all this, we use the fitted pipeline to make predictions on the test data (X_test) with the predict function; the pipeline automatically applies the same imputation and scaling before predicting.
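Putting these pieces together, the pipeline described above might look like this; the missing values and the split are re-created here so the sketch runs on its own:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Simulate missing values, as done earlier in this thread
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each step is a (name, estimator) tuple; every step except the last
# must be a transformer, and the last one can be any estimator
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X_train, y_train)     # imputer and scaler are fit on X_train only
y_pred = pipe.predict(X_test)  # the same transformations are applied to X_test
```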

Step 5: Evaluating the model's performance

  • The last and final step is to see how well our model performed. There are many different metrics we can use here.
  • Since this is a regression problem, we’ve used mean_squared_error which measures the average of the squared differences between the predicted and actual values, and r2_score which measures the proportion of the variance in the target variable that is explained by the model.
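A self-contained sketch of the evaluation is below; the pipeline from the previous step is rebuilt here so the example runs on its own, and the seed and split are our own assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # simulated missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Average of the squared differences between predicted and actual values
mse = mean_squared_error(y_test, y_pred)

# Proportion of variance in the target explained by the model
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.3f}, R^2: {r2:.3f}")
```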