How can I deal with a non-numerical feature?

safa · August 10, 2023, 6:34pm

I’m learning more about DecisionTreeClassifier. When I tried to fit it to my data, I got a ValueError. The classifier rejects my 'Name' feature because its values are strings, as the code below shows:

# Loading libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Creating a sample DataFrame
data = {'Name': ['Aliza', 'Bazil', 'Aliza', 'Champ'],
        'Try': [1, 2, 1, 3],
        'Score': [0, 1, 0, 1]}
df = pd.DataFrame(data)

# Building and fitting a classifier
X_cat = df[['Name', 'Try']]
y_cat = df['Score']
clf_cat = DecisionTreeClassifier()
clf_cat.fit(X_cat, y_cat)

Error:

ValueError: could not convert string to float: 'Aliza'

Can someone please explain the mistake I’m making and how to fix it?

Replies

01:
You can use LabelEncoder. Label encoding transforms the values of a categorical feature into integer codes. However, use caution: label encoding can suggest an ordinal relationship in the data that isn’t actually there.
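
Here is a minimal sketch of label encoding on the DataFrame from the question. The names `le`, `X_le`, and `clf_le` are made up for illustration, and `random_state=0` is only there to make the run repeatable. Note that the scikit-learn documentation describes LabelEncoder as a tool for target labels (`y`); OrdinalEncoder is the equivalent meant for feature columns, so treat this as one possible approach rather than the recommended one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same sample data as in the question
df = pd.DataFrame({'Name': ['Aliza', 'Bazil', 'Aliza', 'Champ'],
                   'Try': [1, 2, 1, 3],
                   'Score': [0, 1, 0, 1]})

# Replace each distinct name with an integer code.
# Codes follow the sorted order of the names, which is an arbitrary ordering.
le = LabelEncoder()
X_le = df[['Name', 'Try']].copy()
X_le['Name'] = le.fit_transform(X_le['Name'])

# All columns are numeric now, so fit() no longer raises ValueError
clf_le = DecisionTreeClassifier(random_state=0)
clf_le.fit(X_le, df['Score'])

print(dict(zip(le.classes_, le.transform(le.classes_))))
```

The integer codes imply an order among the names (the caution mentioned above), and a tree can split on that order even though it carries no meaning.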

02:
The best way is to use one-hot encoding. It generates a binary column for each category, so it makes no assumptions about any ordering among the values.
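
One way to try one-hot encoding on the question's data is `pandas.get_dummies`. This is a sketch, not the only option (scikit-learn's OneHotEncoder does the same job); the names `X_oh` and `clf_oh` are made up for illustration, and `dtype=int` is an optional choice to get 0/1 columns instead of booleans.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same sample data as in the question
df = pd.DataFrame({'Name': ['Aliza', 'Bazil', 'Aliza', 'Champ'],
                   'Try': [1, 2, 1, 3],
                   'Score': [0, 1, 0, 1]})

# Expand 'Name' into one 0/1 column per distinct value; 'Try' is kept as it is
X_oh = pd.get_dummies(df[['Name', 'Try']], columns=['Name'], dtype=int)
print(X_oh.columns.tolist())

# Every column is numeric, so the classifier accepts the data
clf_oh = DecisionTreeClassifier(random_state=0)
clf_oh.fit(X_oh, df['Score'])
```

Keep in mind that `get_dummies` builds its columns from whatever values are present, so new data has to be encoded into the same set of columns before calling `predict`.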

03:
You can also use ColumnTransformer. It applies different transformations to different columns, so you can encode 'Name' and leave the numeric 'Try' column as it is.
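
A rough sketch of how that could look for the data in the question, with ColumnTransformer and the classifier wrapped in a Pipeline. The step names ('prep', 'onehot', 'tree'), the variable names, and the choice of `handle_unknown='ignore'` are assumptions for illustration, not something from the original posts.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same sample data as in the question
df = pd.DataFrame({'Name': ['Aliza', 'Bazil', 'Aliza', 'Champ'],
                   'Try': [1, 2, 1, 3],
                   'Score': [0, 1, 0, 1]})
X = df[['Name', 'Try']]
y = df['Score']

# One-hot encode 'Name'; remainder='passthrough' keeps 'Try' unchanged
prep = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['Name'])],
    remainder='passthrough',
)

# The pipeline encodes the data and then fits the tree in a single fit() call
model = Pipeline(steps=[('prep', prep),
                        ('tree', DecisionTreeClassifier(random_state=0))])
model.fit(X, y)

# New rows go through the same encoding automatically.
# A name not seen during fit is encoded as all zeros because of handle_unknown='ignore'.
new_rows = pd.DataFrame({'Name': ['Aliza', 'Dana'], 'Try': [1, 2]})
print(model.predict(new_rows))
```

Putting the encoder inside the pipeline means the raw DataFrame can be passed to both `fit` and `predict`, which avoids encoding training and new data differently by accident.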