Most machine learning algorithms require input data in numerical format, and since many datasets contain categorical features, we must know the different techniques for handling them in order to get good model results.
One common way to encode categorical features into a numerical format is with scikit-learn, a popular Python library for machine learning. Scikit-learn provides several classes for encoding categorical features, including OneHotEncoder, OrdinalEncoder, and many others, each with its own advantages and limitations. This thread focuses on some of the most common and efficient techniques, which are discussed below.
Before we explore the different classes you can use to encode categorical columns, we first load a simple sample dataframe. We selected one column of this dataframe, Embarked, to be encoded by every technique. Since the different encodings are applied to the same column, you'll get a clear idea of what each class does by comparing the results.
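The thread's original dataframe isn't reproduced here, so as a stand-in, here is a minimal hypothetical dataframe with an Embarked column; the values ('S', 'C', 'Q') mirror the Titanic dataset, which the column name suggests:

```python
import pandas as pd

# Minimal stand-in for the thread's sample dataframe; the Embarked values
# ('S', 'C', 'Q') are an assumption based on the Titanic dataset
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Embarked": ["S", "C", "Q", "S", "C"],
})
print(df)
```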
1. Using "OneHotEncoder()" method:

OneHotEncoder transforms categorical features into a numerical format that can be used as input for machine learning algorithms. It creates a binary vector representation for each unique category in the original feature. In the example code below, there are 3 unique values in the Embarked column, so the OneHotEncoder class creates 3 binary columns, and the value 1 indicates which category is present in each data sample. The toarray() method simply converts the sparse matrix returned by fit_transform() into a 2D array that can be easily interpreted.
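A minimal sketch of the above, assuming a small stand-in dataframe (the thread's original data isn't shown; the 'S'/'C'/'Q' values follow the Titanic dataset):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample dataframe standing in for the thread's original one
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

encoder = OneHotEncoder()
# fit_transform expects 2-D input, hence the double brackets
encoded = encoder.fit_transform(df[["Embarked"]])

print(encoder.categories_)  # the 3 unique values found in the column
print(encoded.toarray())    # sparse result converted to a dense 2-D array
```

Each row of the dense array has exactly one 1, marking that row's category.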
2. Using "OrdinalEncoder()" method:
The OrdinalEncoder class converts each unique category in a categorical feature into a numerical value. By default the categories are sorted alphabetically (not taken in order of appearance), although you can supply an explicit ordering through the categories parameter. Unlike OneHotEncoder, it does not create a separate binary column for each unique value. The advantage of OrdinalEncoder is that it can preserve the ordinal relationship between categories when you specify their order, which is useful when there is a natural ranking to the categories. Additionally, OrdinalEncoder is computationally efficient and can handle large datasets with many categories. However, it is important to note that the integer values assigned to the categories imply an ordering that may not be meaningful, which can mislead certain machine learning algorithms.
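A short sketch of both behaviors, again on a hypothetical stand-in dataframe: the default alphabetical ordering, and an explicit ordering passed via the categories parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample dataframe (not the thread's original one)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Default: categories are sorted alphabetically, so C -> 0, Q -> 1, S -> 2
enc = OrdinalEncoder()
print(enc.fit_transform(df[["Embarked"]]))

# To impose a specific order, pass it explicitly
enc_ordered = OrdinalEncoder(categories=[["S", "C", "Q"]])
print(enc_ordered.fit_transform(df[["Embarked"]]))  # S -> 0, C -> 1, Q -> 2
```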
3. Using "LabelEncoder()" method:

LabelEncoder transforms categorical labels into numerical labels, assigning a unique integer value to each class in the input. It differs from OrdinalEncoder in that it transforms only a single 1-D array at a time (it is intended for target labels), not a 2-D array of features. The advantage of LabelEncoder is that it is a simple and computationally efficient method for encoding categorical labels. However, like OrdinalEncoder with default settings, it assigns integers in sorted order, so the resulting values carry an arbitrary ranking. For that reason, LabelEncoder should be used mainly for the target variable, or when the categorical variable is ordinal or has only two categories. In cases with more than two unordered categories, it is generally better to use OneHotEncoder, or OrdinalEncoder with an explicit category order, to avoid imposing an arbitrary order on the categories.
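A minimal sketch of LabelEncoder on the same hypothetical stand-in column, highlighting the 1-D input it expects:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample dataframe (not the thread's original one)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

le = LabelEncoder()
# Unlike OrdinalEncoder, LabelEncoder takes a single 1-D column
labels = le.fit_transform(df["Embarked"])

print(le.classes_)  # classes are stored in sorted order
print(labels)       # one integer code per row

# inverse_transform recovers the original categories
print(le.inverse_transform(labels))
```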
If you want to learn how to apply all these classes and transformers much more efficiently on multiple columns at once, go through the threads on: