Saving a trained model or pipeline in Scikit-learn is a crucial step in machine learning projects. It saves time and computational resources by avoiding retraining, and it makes it easier to share the model with others and deploy it in production environments. A saved model can also be reused when automating data processing, creating predictive models, and building intelligent systems. In this thread, we will discuss different methods for saving machine learning models or pipelines.
Creating a "model" or "pipeline":
We import some necessary libraries, generate some sample data, and create a pipeline that performs scaling and linear regression on the data. The pipeline is then fitted to the sample data. Once the pipeline is fitted, it can be used to make predictions on new, unseen data using the predict() method.
Let’s see the code below to gain a better understanding:
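The steps described above can be sketched as follows. This is a minimal sketch, not the original example: the sample data (five points on the line y = 2x + 1) and the step names "scaler" and "regressor" are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative sample data (an assumption): points on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Pipeline that scales the features and then fits a linear regression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# Fit the whole pipeline on the sample data
pipeline.fit(X, y)

# Predict the target for a new, unseen data point
X_new = np.array([[6.0]])
prediction = pipeline.predict(X_new)
print(prediction)
```

Because the scaler lives inside the pipeline, the new data point is scaled with the statistics learned during fitting before it reaches the regressor.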
Now, let’s discuss some methods of saving a model.
1. Using "pickle" module:
Pickle is a Python module used for serializing and de-serializing Python objects.
- It can be used to save trained machine learning pipelines for later use without having to retrain the model.
- Reusing the saved pipeline saves time and resources, and makes trained models easy to share and deploy.
Let’s see the example below to learn how pickle helps us save the model.
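A minimal sketch of saving and reloading a fitted pipeline with pickle. The sample data and the file name model.pkl are assumptions made for illustration; the pipeline is the same scaling plus linear regression setup described earlier.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative sample data (an assumption): points on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)

# Serialize the fitted pipeline to disk (binary mode is required)
with open("model.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Later: load the pipeline back without retraining
with open("model.pkl", "rb") as f:
    loaded_pipeline = pickle.load(f)

# The loaded pipeline predicts exactly like the original one
X_new = np.array([[6.0]])
prediction = loaded_pipeline.predict(X_new)
print(prediction)
```

A pickled model is best loaded with the same Scikit-learn version that created it, and only from files you trust, since unpickling can execute arbitrary code.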
2. Using "Joblib" module:
Joblib is a Python library that can save and load machine learning models and pipelines.
- It is a separate package that Scikit-learn depends on, so it is installed alongside Scikit-learn.
- Compared to Pickle, Joblib can handle large NumPy arrays more efficiently and can reduce disk space usage by compressing the saved file.
Let’s see the example below to learn how Joblib helps us save the model.
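A minimal sketch of the Joblib approach, using the file name model.joblib. The sample data is an assumption made for illustration, and the pipeline is the same scaling plus linear regression setup described earlier.

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative sample data (an assumption): points on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)

# Save the fitted pipeline to a file
joblib.dump(pipeline, "model.joblib")

# Load the pipeline back from the file
loaded_pipeline = joblib.load("model.joblib")

# Predict a target value for a new data point
X_new = np.array([[6.0]])
prediction = loaded_pipeline.predict(X_new)
print(prediction)
```

To compress the saved file, joblib.dump() accepts an optional compress argument, for example joblib.dump(pipeline, "model.joblib", compress=3).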
In the above code, a pipeline object is saved to a file named ‘model.joblib’ using the joblib.dump() function. To load the saved pipeline object from the file, the joblib.load() function is used, and a new data point is used to predict a target value using the loaded pipeline. The predicted value is printed using the print() function.
3. Using "yaml" module:
YAML is a human-readable data serialization format that can be used to save machine learning pipelines as readable plain text.
- YAML files are platform-independent, making them easy to share across different operating systems. Note that a pipeline dumped as a Python object contains Python-specific tags, so in practice it can only be loaded back from Python.
Let’s see the example below to learn how YAML helps us save the model.
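A minimal sketch of the YAML approach, assuming the PyYAML package and the file name model.yaml. The sample data is an assumption made for illustration. The sketch loads the file with yaml.UnsafeLoader, because rebuilding a Python object from YAML needs a loader that can construct arbitrary Python objects; that loader choice is an assumption of this sketch and should only be used on files you trust.

```python
import numpy as np
import yaml
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative sample data (an assumption): points on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)

# Dump the fitted pipeline as YAML; PyYAML writes Python-specific tags
# (such as !!python/object) to describe the objects
with open("model.yaml", "w") as f:
    yaml.dump(pipeline, f)

# Load the pipeline back; UnsafeLoader can rebuild arbitrary Python
# objects, so use it only on trusted files
with open("model.yaml") as f:
    loaded_pipeline = yaml.load(f, Loader=yaml.UnsafeLoader)

# Predict a target value for a new data point
X_new = np.array([[6.0]])
prediction = loaded_pipeline.predict(X_new)
print(prediction)
```

The resulting file is text, but the fitted NumPy arrays are stored as encoded binary blobs, so it is less readable than a hand-written YAML config.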
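In the above code, a pipeline object is saved to a file named ‘model.yaml’ using the yaml.dump() function. To load the saved pipeline object from the file, the yaml.load() function is used along with a loader class from the yaml.loader module. Note that BaseLoader only produces plain strings, lists, and dictionaries, so rebuilding the pipeline object requires a loader that can construct Python objects, such as UnsafeLoader, which should only be used on trusted files. A new data point is used to predict a target value using the loaded pipeline, and the predicted value is printed using the print() function.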