Imputing missing categorical feature values

Imputation is the process of replacing missing data with estimated values. For a categorical feature, this means replacing each missing entry with an estimated category drawn from the categories already present in that feature. Scikit-learn, a popular Python library for machine learning, provides several imputers that can be applied to categorical features. Here are a few options:

1. "SimpleImputer":

SimpleImputer is a scikit-learn class that provides basic strategies for imputing missing values. To use SimpleImputer for categorical features, you can set the strategy parameter to ‘most_frequent’. This will replace missing values with the most frequent value in the column.

Here’s an example below:
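A minimal sketch of this approach follows. The values in the ‘SocialApps’ column are made up for illustration, and the column is passed as a one-column DataFrame because the imputer expects two-dimensional input.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data (values are assumptions): one categorical column with gaps
data = pd.DataFrame(
    {'SocialApps': ['WhatsApp', 'Instagram', np.nan, 'WhatsApp',
                    'Facebook', np.nan, 'WhatsApp']},
    dtype=object,
)

# Replace each missing entry with the most frequent category in the column
imp = SimpleImputer(strategy='most_frequent')

# fit_transform expects 2D input, so select the column with double brackets;
# ravel() flattens the (n_rows, 1) result back into a single column
data['SocialApps'] = imp.fit_transform(data[['SocialApps']]).ravel()

print(data['SocialApps'])
```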

In the above code,

  • First, we import the necessary libraries: pandas, numpy, and SimpleImputer from sklearn.impute.

  • We then create a small dataset called ‘data’ with a column named ‘SocialApps’ and some missing values represented by np.nan.

  • Next, we create a SimpleImputer object named ‘imp’ with a strategy parameter set to ‘most_frequent’, which means that the most frequent value in the ‘SocialApps’ column will be used to impute the missing values.

  • To impute the missing values, we call the imputer’s ‘fit_transform’ method and pass it the ‘SocialApps’ column of the ‘data’ DataFrame (as a one-column DataFrame, since the imputer expects two-dimensional input). The ‘fit_transform’ method fits the imputer on the data and applies the transformation (imputation) to the ‘SocialApps’ column.

  • Finally, we print the ‘SocialApps’ column to verify that the missing values have been imputed with the most frequent value in the column.

2. "KNNImputer":

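KNNImputer is another scikit-learn class that imputes missing values using k-nearest neighbors. It works on numeric data only, so categorical features must be encoded as numbers before they are passed to it. Each missing entry is filled with the average of that feature over the nearest neighbors, which means an imputed category code may need to be rounded back to a valid code. To use KNNImputer, you set the number of neighbors (n_neighbors) and the weights, for example ‘uniform’.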

Here’s an example:
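A minimal sketch of this approach follows. The column names and values are assumptions made for illustration. ‘AppCode’ stands in for a categorical feature that has already been encoded as integers, and the rounding step at the end is an addition for turning neighbor averages back into valid codes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative data (assumed): 'AppCode' is a categorical feature encoded as 0/1
data = pd.DataFrame({
    'Age':         [22, 25, 47, 52, 46, 23, 51, 24],
    'HoursPerDay': [5.0, 4.5, 1.0, 1.5, 1.2, 5.5, 0.8, 4.8],
    'AppCode':     [0, 0, 1, 1, np.nan, 0, 1, np.nan],
})

# Impute from the 5 nearest neighbors, each weighted equally
imp = KNNImputer(n_neighbors=5, weights='uniform')

# fit_transform returns a NumPy array; wrap it to keep the column names
imputed_data = pd.DataFrame(imp.fit_transform(data), columns=data.columns)

# Neighbor averages can be fractional, so round back to the nearest code
imputed_data['AppCode'] = imputed_data['AppCode'].round().astype(int)

print(imputed_data)
```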

In the above code,

  • First, we import the necessary libraries: pandas, numpy, and KNNImputer from sklearn.impute.

  • We then create a small dataset called ‘data’ with some missing values represented by np.nan.

  • Next, we create a KNNImputer object named ‘imp’ with the n_neighbors parameter set to 5 and the weights parameter set to ‘uniform’. This means that the imputation will be based on the 5 nearest neighbors, each weighted equally.

  • To impute the missing values, we call the imputer’s ‘fit_transform’ method and pass it ‘data’. The ‘fit_transform’ method fits the imputer on the data and applies the KNN algorithm to fill in the missing values based on the values of the nearest neighbors.

  • The imputed data is stored in a new variable called ‘imputed_data’.

  • Finally, we print the ‘imputed_data’ to verify that the missing values have been imputed using the KNN algorithm.

3. "IterativeImputer":

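`IterativeImputer` is a scikit-learn class that imputes missing values by modeling each feature with missing values as a function of the other features, cycling through the features over several rounds. It is an experimental feature and has to be enabled explicitly by importing `enable_iterative_imputer` from `sklearn.experimental`. To use `IterativeImputer` for categorical features, you can set the estimator parameter to a classifier such as `RandomForestClassifier`. The imputer works on numeric arrays, so the categories must be encoded as numbers first.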

Here’s an example:
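A minimal sketch of this approach follows. The column names and values are assumptions made for illustration. Every column holds integer category codes, because the classifier is fitted on each column in turn and would reject a continuous target. Note that scikit-learn documents the estimator as a regressor, so treat the use of a classifier here as a workaround that depends on this encoding rather than as an officially supported mode.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

# Illustrative data (assumed): all columns are integer-coded categories
data = pd.DataFrame({
    'Device':  [0, 1, 0, 1, 0, 1, 0, 1],
    'Region':  [0, 0, 1, 1, 0, 0, 1, 1],
    'AppCode': [0, 1, 0, np.nan, 0, 1, np.nan, 1],
})

# Classifier used to predict the missing category codes
estimator = RandomForestClassifier(n_estimators=50, random_state=0)

# Iteratively model each column from the others using the classifier
imputer = IterativeImputer(estimator=estimator, random_state=0)

# fit_transform returns a NumPy array; wrap it to keep the column names
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

print(imputed_data)
```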

In the above code,

  • First, we import the necessary libraries: pandas, numpy, enable_iterative_imputer from sklearn.experimental, IterativeImputer from sklearn.impute, and RandomForestClassifier from sklearn.ensemble.

  • We then create a small dataset called ‘data’ with some missing values represented by np.nan.

  • Next, we create a RandomForestClassifier object named ‘estimator’ to be used as the imputer’s estimator. This classifier will be used to predict the missing values.

  • We create an IterativeImputer object named ‘imputer’ with the estimator parameter set to ‘estimator’ and a random_state parameter set to 0.

  • To impute the missing values, we call the imputer’s ‘fit_transform’ method and pass it ‘data’. The ‘fit_transform’ method fits the imputer on the data and applies the iterative imputation algorithm to fill in the missing values with the values predicted by the RandomForestClassifier.

  • The imputed data is stored in a new variable called ‘imputed_data’.

  • Finally, we print the ‘imputed_data’ to verify that the missing values have been imputed using the IterativeImputer algorithm.