ML Feature Engineering: Dealing with Categorical Features

Most traditional ML algorithms, namely those based on statistical equations, work best with numerical values. However, categorical features are often important when building a model to solve a problem.

So, the question is: How do we process these categorical features to make them work with most of the algorithms?

To answer this question, we need a way to represent categorical features as numbers while keeping their natural meaning.

The types of categorical encoding covered in this article are:

  1. Ordinal Encoding
  2. One-hot Encoding
  3. Binary Encoding

To start, we first need to understand what a categorical feature is.

Suppose that we are the owner of a gym, and we are gathering the data of our gym members such as height and weight.

  • As height and weight are measured in numbers, such as 180cm or 80kg, these values are continuous numbers, so they are numerical features.

Categorical features, on the other hand, are features such as the gender of a person, the number of car doors, etc.

Straightforward?

Ordinal Encoding

Some categorical features are ordinal, meaning that their values have a natural ordering or ranking.

For example, in a survey, we might ask a person to rate their satisfaction using options like: Least, Medium, and High.

In order to preserve the ordering relationship, we normally encode the values such that:

  • 1 to represent Least
  • 2 to represent Medium
  • 3 to represent High

With this type of encoding, the natural ordering is kept. Thus, the meaning of the feature is not lost.
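
The mapping above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a definitive implementation: the names SATISFACTION_ORDER and ordinal_encode, as well as the sample responses, are made up for this example and are not part of any library.

```python
# Hypothetical mapping taken from the survey example: Least < Medium < High.
SATISFACTION_ORDER = {"Least": 1, "Medium": 2, "High": 3}


def ordinal_encode(values, mapping):
    """Replace each category with its rank; fail loudly on unknown categories."""
    encoded = []
    for value in values:
        if value not in mapping:
            raise ValueError(f"Unknown category: {value!r}")
        encoded.append(mapping[value])
    return encoded


# Made-up survey answers, used only for illustration.
responses = ["High", "Least", "Medium", "High"]
encoded = ordinal_encode(responses, SATISFACTION_ORDER)
print(encoded)  # [3, 1, 2, 3]
```

Because the integers follow the ranking, comparisons such as "High is greater than Medium" still hold after encoding. Raising an error on an unknown category is a design choice for this sketch; a real pipeline might map unseen values to a reserved code instead.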
