Effectively Handling Categorical Features

Real-world datasets often contain categorical attributes. Some models support them natively (decision trees, for example), but others need to be modified in some way before they can handle them. This post focuses on techniques for converting categorical attributes into continuous ones so that they can be used with models, like neural networks, that don't handle categorical attributes well natively.

1-hot Encodings

There exists a fairly simple mechanism for converting a categorical attribute to a continuous one that has been used extensively in the machine learning community for many years: the categorical attribute is encoded as a 1-hot (or 1-of-k) vector of continuous values. The fundamental principle is to give each unique categorical value its own dimension. The value in each dimension is either 0 (when the instance does not take that value) or 1 (when it does).

For example, suppose we were given a categorical attribute Fruit = {Apple, Orange, Banana}. All instances of Apple are encoded as (1, 0, 0), all instances of Orange are encoded as (0, 1, 0), and all instances of Banana are encoded as (0, 0, 1). Simple enough, right?
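
To make this concrete, here is a minimal sketch of the encoding in NumPy; the category ordering and the helper name one_hot are just illustrative choices, not part of any particular library:

```python
import numpy as np

# Fixed (but arbitrary) ordering of the possible values.
categories = ["Apple", "Orange", "Banana"]
index = {value: i for i, value in enumerate(categories)}

def one_hot(value):
    """Return the 1-hot vector for a single categorical value."""
    vector = np.zeros(len(categories))
    vector[index[value]] = 1.0
    return vector

print(one_hot("Orange"))  # [0. 1. 0.]
```

Libraries such as scikit-learn (OneHotEncoder) and pandas (get_dummies) provide the same operation out of the box, but the underlying idea is exactly this.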

Hopefully it’s apparent that when a 1-hot encoding is used, we increase the dimensionality of the problem. The one Fruit dimension was turned into 3 continuous dimensions, one for each of the possible values of Fruit. What if the categorical attribute we started with had more values, say 1,000? Then we would have to replace 1 dimension with 1,000! That’s… not ideal.

The problem is exacerbated when we start working with categorical attributes that have even more possible values. For example, according to the U.S. census (and Quora), there are approximately 20,000 cities in the United States. If I had to represent a single city in the US, I would then need 20,000 dimensions. If I had to represent an origin / destination pair, I would need 40,000 (20,000 for each)! Furthermore, most datasets have more than 1 or 2 attributes. We very well might have even more categorical data that needs to be 1-hot encoded, which would add to the complexity. This particular problem is sometimes called dimensionality explosion, to really drive home the point that we have blown up our input space.

Pruning The Input Space

Since many models, neural networks included, can have a difficult time working with more than a few dozen inputs, many different techniques have been proposed to deal with dimensionality explosion. The simplest is to drop some of the possible values entirely, especially those that do not appear very frequently in the dataset or that have been determined to be less relevant than the others. Dropping values wouldn't work for our simple Fruit example, but in some domains, like NLP, it can be used to great effect.
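
As a rough sketch of frequency-based pruning (the threshold, bucket name, and helper below are my own illustrative choices), one option is to collapse all rare values into a single catch-all bucket before encoding:

```python
from collections import Counter

def prune_rare_values(values, min_count=50, other="Other"):
    """Replace values seen fewer than min_count times with a single catch-all bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]
```

The pruned column can then be 1-hot encoded as before, with one extra dimension reserved for the catch-all bucket. Relevance-based pruning is more domain specific, as the NLP example below illustrates.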

For example, suppose we are trying to build a very simple word disambiguator. We want to know whether the word "bank" refers to a financial institution or to the land alongside a river or lake. To do this, we will examine the context of the word as it appears in some corpus (a collection of text). A very simple context is the word (or words) immediately before and immediately after "bank", so we might have two attributes: WordBefore and WordAfter. Virtually any English word could appear in either attribute, and there are roughly 60,000 English words in common use, so we could encode both of these attributes as 1-hot vectors with 60,000 dimensions each.

Practically, working with 120,000 features is a little bit crazy, so one reasonable course of action is to try to prune the space a bit. Suppose the word “the” comes before “bank” in this corpus. Does that tell us anything about which meaning of “bank” we should use? The answer is “probably not”, which means we could theoretically remove it from consideration completely. The same goes for most other articles like “a” or “an”. In fact, the vast majority of all English words would likely be unhelpful for this particular problem. If we could narrow the list of useful words down to say 1,000 words, or even 200 words, that would help to make the problem significantly more tractable.
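
A hedged sketch of what dropping uninformative context words might look like; the stopword list and the tiny example contexts below are made up purely for illustration:

```python
# Hypothetical stopword list and context pairs; a real system would use a
# curated list and a much larger corpus.
stopwords = {"the", "a", "an", "of", "to", "and"}

# (WordBefore, WordAfter) pairs observed around the word "bank".
contexts = [("the", "loaned"), ("river", "flooded"), ("a", "teller")]

# Drop context words that are unlikely to help disambiguate.
filtered = [tuple(w for w in pair if w not in stopwords) for pair in contexts]
print(filtered)  # [('loaned',), ('river', 'flooded'), ('teller',)]
```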

In practice, an NLP application like this one likely would not use a context window of 1 word on either side, and all words in the English language would probably not be considered in the first place, but this example shows how the general principle of reducing the possible input space can be useful in the right situation.

Combining Categories

An alternative solution to the dimensionality explosion problem is to merge some of the categories together. For example, I could take Fruit = {Apple, Orange, Banana} and turn it into Fruit = {JuicyFruits, Banana} by combining Apple and Orange into a single category. We still use a 1-hot representation, but as there are now fewer possible values, we don’t have to add as many dimensions. This makes the problem easier from a computational point of view.

This particular example does not provide good motivation for the general technique, so let's try another one. Suppose your client is a large retail chain. We'll call them All-mart, as they seem to sell everything. All-mart wants to estimate how long the average customer spends looking at each individual product they sell so they can move the products people seem to care about closer to the front of their store. (This problem is a little contrived, but it does provide a good example of this technique.)

Assuming you are provided with adequate training data, you still face the immediate problem that All-mart sells approximately 100,000 different items. Many models are likely to struggle with that many inputs, so you will need to shrink the input space somehow. In this case, pruning the space will not work, as All-mart really does want estimates for all of its products, but we do know that some items are quite similar to one another. We could therefore combine several individual items into categories and evaluate those categories instead, applying each category's estimate to all of the items within it.

For example, All-mart might sell 200 different types of candy, 100 different types of cereal, and 42 types of golf tees. If we could reduce those to {Candy, Cereal, Golf Tees}, we’ve gone from 342 dimensions down to 3, which indeed does make the problem much easier to manage. As a side benefit, grouping the products makes it quite simple to create a hierarchical structure, so you could give All-mart results on a wide range of granularities, which might actually be more useful to them.
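
In code, combining categories usually amounts to nothing more than applying a lookup table before encoding. A sketch, with made-up item names:

```python
# Hypothetical item-to-category lookup table; a real one would cover all 100,000 items.
item_to_category = {
    "GummyBears": "Candy",
    "ChocolateBar": "Candy",
    "BranFlakes": "Cereal",
    "WoodenTee": "Golf Tees",
}

def coarsen(items):
    """Map individual items onto broader categories before 1-hot encoding."""
    return [item_to_category[item] for item in items]

print(coarsen(["BranFlakes", "GummyBears"]))  # ['Cereal', 'Candy']
```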

Transforming the Input Space

A third category of solutions relies on transforming the input space by performing dimensionality reduction. Dimensionality reduction is a large field with many established techniques, but the core principle is to represent the input in a way that requires fewer dimensions.

One of the more popular techniques is Principal Component Analysis (PCA). It effectively calculates a new coordinate system for the original data in which the first axis captures the largest share of the variance, the second axis the next largest, and so on. By limiting the number of axes that are kept, the data is forced into a lower-dimensional space, which accomplishes our original goal. Other common approaches include Multiple Correspondence Analysis (better suited to categorical inputs) and Latent Semantic Analysis (used for textual data).
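
As a sketch of how PCA might be applied here with scikit-learn (the stand-in data and the choice of 50 components are my own, not a recommendation):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a 1-hot encoded attribute with 500 possible values and 1,000 rows.
X = np.eye(500)[np.random.randint(0, 500, size=1000)]

# Keep only the 50 directions that capture the most variance.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (1000, 50)
```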

I’m going to forgo detailed explanations of these methods as they are beyond the scope of the article, but I should mention that no dimensionality reduction algorithm is a catch-all. Reducing the dimensionality of the input can be seen as a type of lossy compression, meaning that at some level, data is usually destroyed by the transformation process. Intelligent algorithms, like PCA, will try to ensure that the data that is lost is not important, but it is still something that should be kept in mind by the practitioner.

An Optimization for Neural Networks

The methods and techniques listed above are usually agnostic to the type of model being used, but there are others that are specific to certain models. In the case of neural networks, there is a particularly nice way to work with high-dimensional categorical features if you can use a simple linear input layer: use a sparse input encoding.

For a typical 1-hot representation of a categorical feature with \(N\) dimensions, we will have \(N - 1\) zeros and a single 1. When a linear input layer is used, the product \(xW\) amounts to selecting one row of \(W\), which is significantly cheaper to compute than the full vector-matrix multiplication.

For example, assume:

$$x = \begin{bmatrix}0 & 0 & 1 & 0\end{bmatrix}, W = \begin{bmatrix}a & b & c\\d & e & f \\g & h & i \\j & k & l \end{bmatrix}$$

The product is:
$$y = xW = \begin{bmatrix} g & h & i\end{bmatrix}$$

With batches, this becomes a selection of multiple rows of \(W\):

$$x = \begin{bmatrix}0 & 0 & 1 & 0\\1 & 0 & 0 & 0\\0 & 0 & 0 & 1\end{bmatrix}$$

$$Y = xW = \begin{bmatrix}g & h & i\\a & b & c\\j & k & l\end{bmatrix}$$
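
A quick NumPy check confirms that the full product and the simple row selection agree; the variable names below mirror the matrices above, with a stand-in numeric \(W\):

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(4, 3)   # stands in for the 4x3 weight matrix
x = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1]], dtype=float)      # batch of three 1-hot rows
indices = np.array([2, 0, 3])                  # the category each row encodes

full = x @ W            # the ordinary matrix product
selected = W[indices]   # simply pick the corresponding rows of W
print(np.allclose(full, selected))  # True
```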

In both cases, the forward propagation step becomes trivial. During the backpropagation step, when \(\frac{\partial E}{\partial W}\) is calculated, we can also exploit the sparsity of the inputs:

$$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial W} = x^T\frac{\partial E}{\partial Y}$$

$$\frac{\partial E}{\partial Y} = \begin{bmatrix}\delta_{11}&\delta_{12}&\delta_{13}\\\delta_{21}&\delta_{22}&\delta_{23}\\\delta_{31}&\delta_{32}&\delta_{33}\end{bmatrix}$$

$$\frac{\partial E}{\partial W} =\begin{bmatrix}0&1&0\\0&0&0\\1&0&0\\0&0&1\end{bmatrix}\begin{bmatrix}\delta_{11}&\delta_{12}&\delta_{13}\\\delta_{21}&\delta_{22}&\delta_{23}\\\delta_{31}&\delta_{32}&\delta_{33}\end{bmatrix} = \begin{bmatrix}\delta_{21}&\delta_{22}&\delta_{23}\\0&0&0\\\delta_{11}&\delta_{12}&\delta_{13}\\\delta_{31}&\delta_{32}&\delta_{33}\end{bmatrix}$$

So just as with the forward propagation step, calculating the gradient in this case requires simply selecting the appropriate rows from the error matrix, which is considerably cheaper than the associated matrix-matrix multiplication. In general, though, we will still have to do a few additions, as a batch will likely contain several inputs that represent the same class, so their error rows must be accumulated into the same row of the gradient. A sparse matrix multiplication implementation should take care of these details while avoiding the unnecessary multiplications, which will generally improve performance significantly.
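
The gradient can be checked the same way: scattering the rows of the error matrix into the right positions (and accumulating when an index repeats) matches the explicit matrix product. A sketch with stand-in numbers:

```python
import numpy as np

x = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1]], dtype=float)
indices = np.array([2, 0, 3])
dE_dY = np.random.randn(3, 3)   # stand-in for the error flowing back into this layer

dense = x.T @ dE_dY             # the full matrix-matrix product

sparse = np.zeros((4, 3))
np.add.at(sparse, indices, dE_dY)   # scatter-add: repeated indices accumulate
print(np.allclose(dense, sparse))   # True
```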

Note that \(W\) is a dense matrix, so all its values do still have to be stored. We can take a shortcut for the forward and backward passes to speed up computation, but memory consumption can still be an issue, especially for larger networks. In those cases, it might be wise to consider using one of the earlier approaches in concert with the sparse inputs approach.