What is one-hot encoding?

Categorical data refers to variables whose values are labels rather than numbers. For example, a “color” variable could have the values “red”, “blue”, and “green”. Think of the values as different categories, which sometimes have a natural ordering (such as “small”, “medium”, and “large”) and sometimes, as with colors, do not.

Some machine learning algorithms, such as decision trees, can work directly with categorical data, depending on the implementation, but most require all input and output variables to be numeric. This means that categorical data must first be converted to numbers, typically by mapping each category to an integer.
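
As a rough illustration of that mapping step, here is a minimal sketch in plain Python. The sample list and the variable names (colors, label_to_int, encoded) are assumptions made up for this example, and sorting the labels is just one convenient way to make the mapping reproducible.

```python
# Hypothetical sample data: a few observations of the "color" variable.
colors = ["red", "blue", "green", "blue"]

# Give each distinct label an integer. Sorting the labels first keeps the
# mapping the same from run to run (a choice made for this sketch).
label_to_int = {label: i for i, label in enumerate(sorted(set(colors)))}

# Replace every label with its integer code.
encoded = [label_to_int[c] for c in colors]

print(label_to_int)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)       # [2, 0, 1, 0]
```

Note that the integers are arbitrary: nothing about the data says that red (2) is "greater than" blue (0), even though the numbers might suggest it.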

One-hot encoding is one method of converting categorical data into a form an algorithm can use, which can help it make better predictions. With one-hot encoding, we create a new binary column for each categorical value and assign a 1 or 0 to those columns. Each integer value is thus represented as a binary vector: all entries are zero except the one at the index of that category, which is marked with a 1.
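
The idea can be sketched in a few lines of plain Python. The one_hot function below is a hypothetical helper written for this illustration, not part of any library, and it assumes the categories are ordered alphabetically.

```python
def one_hot(values):
    """Return (categories, vectors) for a list of category labels."""
    # One column per distinct category, in sorted order (an assumption of this sketch).
    categories = sorted(set(values))
    index = {label: i for i, label in enumerate(categories)}

    vectors = []
    for value in values:
        row = [0] * len(categories)  # start with all zeros
        row[index[value]] = 1        # mark the position of this category with a 1
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot(["red", "blue", "green", "blue"])
print(categories)  # ['blue', 'green', 'red']
for row in vectors:
    print(row)
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]
# [1, 0, 0]
```

Each row contains exactly one 1, which is where the name "one-hot" comes from. In practice you would more likely reach for an existing tool such as pandas.get_dummies or scikit-learn's OneHotEncoder rather than writing this by hand.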

Have a look at this chart for a better understanding: