Categorical data refers to variables that are made up of label values. For example, a “color” variable could have the values “red”, “blue”, and “green”. Think of the values as different categories, which sometimes have a natural ordering (for example, “low”, “medium”, and “high”) and sometimes, as with colors, do not.

Some machine learning algorithms, such as decision trees, can work directly with categorical data, depending on the implementation, but most require input and output variables to be numeric. This means that categorical data must be converted to numbers, for example by mapping each label to an integer.
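
As a rough illustration of that mapping, here is a minimal sketch in plain Python. The sample `colors` list and the names `label_to_int` and `encoded` are made up for this example, and sorting the labels alphabetically is just one arbitrary way to choose the integers.

```python
# Hypothetical sample data: a "color" column with three distinct labels.
colors = ["red", "green", "blue", "green"]

# Give each distinct label an integer. Sorting makes the mapping reproducible,
# but the resulting order (blue < green < red) carries no real meaning.
label_to_int = {label: i for i, label in enumerate(sorted(set(colors)))}

# Replace every label with its integer code.
encoded = [label_to_int[c] for c in colors]

print(label_to_int)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)       # [2, 1, 0, 1]
```

Note that the integers suggest an ordering that the colors do not actually have, which some algorithms may misread as meaningful.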

**One-hot encoding** is one method of converting categorical data into a form an algorithm can use, and it can lead to better predictions because it does not imply an ordering among the categories. With one-hot encoding, we convert each categorical value into a new binary column and assign a 1 or 0 to those columns. Each integer value is represented as a binary vector: every entry is zero except the one at the index of that value, which is marked with a 1.
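
To make the idea concrete, the following is a small sketch in plain Python rather than any particular library's implementation. The sample `colors` list and the helper `one_hot` are hypothetical names chosen for this example, and the column order (alphabetical) is an assumption.

```python
# Hypothetical sample data: a "color" column with three distinct labels.
colors = ["red", "green", "blue", "green"]

# One output column per distinct category, in alphabetical order.
categories = sorted(set(colors))  # ['blue', 'green', 'red']

def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

# Encode every row of the column.
one_hot_rows = [one_hot(c, categories) for c in colors]

for color, row in zip(colors, one_hot_rows):
    print(color, row)
# red [0, 0, 1]
# green [0, 1, 0]
# blue [1, 0, 0]
# green [0, 1, 0]
```

Each row contains exactly one 1, so no category appears larger or smaller than another. In practice, libraries such as pandas and scikit-learn provide ready-made encoders for this.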

Have a look at this chart for a better understanding: