Nominalia, a region where the priests conduct Strange Rites
This article is originally published at https://robertgrantstats.wordpress.com
You have a nominal predictor variable with many values. That is to say, there are many categories and they do not have an innate order. Perhaps they are towns and villages in Hampshire, and you are using that alongside other data in a survey to look at the impact of perceived road surface condition on bicycle use (a causal question). Perhaps they are countries in the EU, and you want to predict voting patterns (a predictive question). They don’t have to relate to locations, though these examples do.
How do you analyse them? For statisticians, they simply have to be turned into numbers somehow. For machine learning people, other ideas might hover in the background like text analysis by deep learning or anisotropic projections like support vector machines. However, core “machine learning” methods such as logistic regression will still, for the most part, expect to receive numbers as inputs. You might also be tempted to say that it doesn’t matter as long as the model is complex enough (but, if so, I hope you one day learn to reflect on the bias-variance tradeoff and the central role of feature engineering).
If they are categories and have no order, then we can’t say that A is innately closer to B than it is to C; there simply is no metric. It might be that the data on other variables for members of A is quite similar to that for members of B, and quite different to that for members of C, but that’s a different matter, one of similarity in this sample, not inherent in the variable’s definition.
If the difference is not obvious, consider a Likert item: how much do you agree or disagree with the statement national security agents should read all our emails? Strongly disagree / Disagree / Neither agree nor disagree / Agree / Strongly agree. There are 5 categories there, and Agree is closer to Strongly Agree than it is to Strongly Disagree, even before you collect any data. These are ordinal data. But nominal data have no order like that. Even though Owslebury is geographically closer to Colden Common than to Whitchurch, that is not relevant to the analysis; we simply want to recognise that people who live in different places are exposed to different roads and that might skew the results if we don’t adjust for it. Now, here’s the thing: numbers have an order. So, how can you convert nominal data into numbers without distorting the relationships and getting incorrect results back? (I set aside here the option of integrating over them in a REML stylee, because that approach to hierarchical modelling dodges the challenge of encoding them as predictors.)
The classic response, taught in stats school as the only real option, is called dummy variables or indicator variables. In computer science and machine learning, they call it one-hot encoding. There is a column — a new variable, or feature — for every village except one. If the respondent lives in Owslebury, then they get a 1 in the Owslebury column, otherwise they get a zero. One of the villages, perhaps Abbot’s Barton, is chosen as a baseline and its residents are identified by having zeros in all the columns. This creates k-1 new variables for k categories; in regression, each gets its own coefficient, and perhaps terms for heteroscedasticity, random slopes, interactions and so on, thus introducing a lot of parameters and eating up a lot of degrees of freedom. But if you want the computer to infer those differences for each of the categories, you’d better have enough data to feed it. It’s a maximal model but it treats every category on its own merits. You can reduce the complexity by combining categories, but that has to be justified because you throw away some information along the way.
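As a minimal sketch of the dummy coding just described, using only the Python standard library: the helper name `dummy_code` and the five example respondents are assumptions made for illustration, not data from any survey.

```python
# Dummy (indicator) coding with a baseline category.
# `dummy_code` and the respondent list are hypothetical, for illustration only.

def dummy_code(values, baseline):
    """Return (column_names, rows) with k-1 indicator columns for k categories."""
    levels = sorted(set(values))
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} not found in the data")
    # One column per category except the baseline.
    columns = [level for level in levels if level != baseline]
    # Each row has at most a single 1; baseline residents get all zeros.
    rows = [[1 if value == column else 0 for column in columns] for value in values]
    return columns, rows


respondents = ["Owslebury", "Abbot's Barton", "Whitchurch", "Colden Common", "Owslebury"]
columns, rows = dummy_code(respondents, baseline="Abbot's Barton")

print(columns)
for village, row in zip(respondents, rows):
    print(f"{village:15s} {row}")
```

Four villages give three columns, and each column would get its own coefficient in a regression, which is exactly where the parameters pile up.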
Now, you may be tempted to try to do more, to encode in a different way so that the model is more parsimonious, adds fewer parameters but still captures the patterns across all the categories. It’s reasonable to wish for that, even if it is unattainable without strong assumptions. Nevertheless, the temptation of E-Z analytics has proven too strong, mixed with the heady liquor of ignorance, for some.
Welcome to Nominalia, where the usual rules no longer apply. Here, if it can be made to look like a number, it is a number. People have indeed done some strange things to those poor categories. Chop them up, ram them back together. And I say that considering that my own ancestors back in the day did some truly weird stuff.
One of the more popular is to assign arbitrary integer codes and then just treat it as a continuous variable. Here’s one I saw a while back. The author is so busy larking about with complex TensorFlow code, he doesn’t notice that variables are being represented by effectively random hashes and then represented as integers, and then hey! TF can run this so it must be good. (You might notice he also used passenger ID as a continuous variable “because it’s already an integer”.)
Let’s try this with EU countries. Austria is number 1 and UK 28, and when you ask for them to be plotted in 2 dimensions via multi-dimensional scaling (MDS), you get an idea of the implicit distances:
and that’s just Austria to UK, lined up. No surprises there. If you include this as a covariate/input/predictor/independent/exogenous variable, you will not get a parameter telling you how different one country is to another, other predictors being equal. You will get a slope that tilts this line-up in its best efforts to predict your (completely unrelated) target/outcome/dependent/endogenous variable. Good luck with that.
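To put numbers on that line-up, here is a minimal sketch. It assumes the 28 member states are coded 1 to 28 in alphabetical order (an ordering chosen here for illustration), and the slope value is arbitrary rather than estimated from any data.

```python
# Implied distances when arbitrary integer codes are treated as continuous.
# Assumption: EU-28 members in alphabetical order, coded 1..28.
EU28 = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic",
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
    "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
    "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
    "Spain", "Sweden", "United Kingdom",
]
codes = {country: k for k, country in enumerate(EU28, start=1)}


def implied_distance(a, b):
    """Distance between two countries on the one-dimensional integer scale."""
    return abs(codes[a] - codes[b])


print(implied_distance("Austria", "Belgium"))
print(implied_distance("Austria", "United Kingdom"))

# A single slope makes every country's contribution a multiple of its code.
slope = 0.5  # arbitrary illustrative value, not a fitted coefficient
effect = {country: slope * k for country, k in codes.items()}
print(effect["Belgium"] - effect["Austria"])
print(effect["United Kingdom"] - effect["Austria"])
```

Whatever the slope turns out to be, the model is obliged to say that the UK differs from Austria by 27 times as much as Belgium does, purely because of the alphabet.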
Here’s another recently brought up on Twitter. Countries were given integer codes and then used as a number. I like the clear tone of the letter criticising it:
But when they included their categorically-coded country (1 = US, 2 = Canada, and so on) in their models, it was entered not as fixed effects, with dummy variables for all of the countries except one, but as a continuous measure. This treats the variable as a measure of ‘country-ness’ (for example, Canada is twice as much a country as the US) instead of providing the fixed effects they explicitly intended.
Here’s part of the distance matrix. I will not comment on whether Canada is twice the country America is.
You can understand how novices make these slips, even if it is not acceptable. But what about this next macabre ritual? Assign arbitrary integer codes, represent these as binary numbers and then include each digit as a dichotomous variable. With 28 categories you need five binary digits, so Category number 13 becomes Category 01101, which gets a 0 in the first column, 1 in the second column, 1 in the third column, 0 in the fourth column, and 1 in the fifth column. Now you have reduced k-1=27 variables in boring old one-hot into an exciting 5-parameter version, which tightens those flabby confidence intervals and speeds up your model fit! Win-win!
Here’s what the data would look like:
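The binary-digit coding can be sketched as follows. This assumes alphabetical codes 1 to 28, which is an illustrative ordering and not necessarily the one behind the original table; `binary_code` is a hypothetical helper name.

```python
# Binary-digit coding of arbitrary integer codes.
# Assumption: EU-28 members in alphabetical order, coded 1..28.
EU28 = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic",
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
    "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
    "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
    "Spain", "Sweden", "United Kingdom",
]

# Number of binary digits needed to write the largest code.
n_bits = len(EU28).bit_length()


def binary_code(k, width):
    """Return the integer k as a list of 0/1 digits, left-padded to `width`."""
    return [int(digit) for digit in format(k, f"0{width}b")]


table = {country: binary_code(k, n_bits) for k, country in enumerate(EU28, start=1)}

print(f"{'country':16s} code  " + " ".join(f"b{i + 1}" for i in range(n_bits)))
for k, (country, digits) in enumerate(table.items(), start=1):
    print(f"{country:16s} {k:4d}  " + "  ".join(str(d) for d in digits))
```

Each digit column then goes into the model as if it were a meaningful yes/no attribute of the country.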
The dude is not alone in doing this sort of thing, as we see in popular Q&A forums — though this one at least stopped to ask.
We can look at these five EU binary variables and combine them into either an L1 (Manhattan) distance or an L2 (Euclidean):
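A rough sketch of those two distances, working directly on integer codes 1 to 28 written as five binary digits. The helper names `bits`, `l1` and `l2` are assumptions made for illustration.

```python
import math


def bits(k, width=5):
    """Integer code k as a list of 0/1 digits, left-padded to `width`."""
    return [int(digit) for digit in format(k, f"0{width}b")]


def l1(u, v):
    """Manhattan distance: for 0/1 vectors, the number of differing digits."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2(u, v):
    """Euclidean distance: for 0/1 vectors, the square root of the L1 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


codes = range(1, 29)
l1_matrix = [[l1(bits(i), bits(j)) for j in codes] for i in codes]
l2_matrix = [[l2(bits(i), bits(j)) for j in codes] for i in codes]

# Category 1 (00001) is nearer to category 3 (00011) than to category 2 (00010).
print(l1(bits(1), bits(2)), l1(bits(1), bits(3)))
# Neighbouring codes 15 (01111) and 16 (10000) differ in every digit.
print(l1(bits(15), bits(16)), l2(bits(15), bits(16)))
```

The distances depend entirely on which integer each category happened to receive, so a different arbitrary ordering would give a different geometry.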
And if we put these through MDS, we get strange hexagons. In Nominalia, we still find these arcane symbols on the sides of houses, where they are drawn to ward off Inconvenient Parameters:
The central blob contains Estonia, Finland, Austria and Portugal (EFAP). Everyone is a bit like EFAP. They are a hub around which more remote EU states exhibit unique and exotic properties. Well, let’s hope that’s true, because that’s what you’ve just forced your model to include. And you’ve just forced the EFAP quartet to be identical, along with everyone else and their respective uneasy bedfellows in your numeric farrago.
To clarify, suppose you have three categories: A, B and C. With each of the bizarre rituals we’ve seen, these are the implicit distances, and the only way to locate them in 2-D:
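The implicit distances for three categories can be worked out in a few lines. In this sketch the integer codes 1, 2, 3 and their two-digit binary forms are assumed assignments, and a full one-hot coding is included only as a point of comparison.

```python
import math
from itertools import combinations

# Coordinates each coding scheme assigns to categories A, B and C.
# Assumption: A, B, C receive the arbitrary integer codes 1, 2, 3.
encodings = {
    "integer codes": {"A": (1,), "B": (2,), "C": (3,)},
    "binary digits": {"A": (0, 1), "B": (1, 0), "C": (1, 1)},
    "one-hot (for comparison)": {"A": (1, 0, 0), "B": (0, 1, 0), "C": (0, 0, 1)},
}


def euclidean(u, v):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


distances = {}
for scheme, coords in encodings.items():
    distances[scheme] = {
        first + second: euclidean(coords[first], coords[second])
        for first, second in combinations("ABC", 2)
    }
    summary = ", ".join(f"{pair}={d:.3f}" for pair, d in distances[scheme].items())
    print(f"{scheme:26s} {summary}")
```

Integer codes put B exactly halfway between A and C, binary digits make C the near neighbour of both A and B, and only the one-hot coding leaves all three pairs the same distance apart.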
It is all very simple: if it seems too good to be true, it probably is. You can throw away information if you like, but don’t imagine that you can do things like represent a lot of information with a little and still get the right answer out. At best, you will add uncertainty, and at worst, you will hopelessly bias the outcome, while everything looks rosy on the surface. For a bottom line, I shall pastiche Leviticus (we’ve already had William Burroughs in this post, so I might as well): thou shalt not throw away information; it is confusion.