The Entropy of First Names
I was recently working on a coding challenge for a job that involved analyzing the first names of people born in the USA. The data set came from the Social Security Administration and can be downloaded here. There is one text file per state, and every entry lists a name along with the associated gender, year, state, and the count of people born that year with that name. For example, here are the first few lines of the Arkansas file:
While looking for inspiration for this challenge I found an interesting article that referenced the entropy of baby names. Entropy measures the amount of information contained in a choice from a distribution. I decided to explore the entropy concept further and the results are below.
The question, “How much information is in a baby’s name?” can sound odd to those not used to thinking about information in bits. Put briefly, the uncertainty of a (fair) coin flip is 1 bit: heads or tails. 0 or 1. The uncertainty of the outcome of a fair die is about 2.6 bits. The uncertainty is larger in the case of the die because there are six possible outcomes instead of only two. So the larger the number of equally probable choices, the larger the uncertainty in the outcome, which is precisely what entropy measures.
However, it’s not just the number of choices that matters, but also the details of the probability distribution. If the possible outcomes are X and Y, but X occurs 99% of the time and Y only 1% of the time, you’re less uncertain than you would be with a 50/50 split. That means the 50/50 process has more entropy than the 99/1 process.
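The numbers in the last two paragraphs can be checked with a few lines of Python. This is only a minimal sketch of the standard Shannon entropy formula, H = -Σ p·log2(p); the function name entropy_bits is my own choice for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Outcomes with zero probability contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # two equally likely outcomes
fair_die = entropy_bits([1 / 6] * 6)      # six equally likely outcomes
skewed = entropy_bits([0.99, 0.01])       # X 99% of the time, Y 1%

print(f"fair coin:  {fair_coin:.2f} bits")
print(f"fair die:   {fair_die:.2f} bits")
print(f"99/1 split: {skewed:.2f} bits")
```

The fair coin comes out at exactly 1 bit and the die at about 2.58 bits, while the 99/1 split carries only about 0.08 bits, far less than the 50/50 case.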
Entropy in Names
So, how has the entropy of names changed throughout time? The figure below shows a rapid increase in name entropy over the last 25 years. For perspective, guessing a name at 10 bits of entropy is similar to calling 10 coin flips in a row correctly. That’s about a 1 in 1,000 chance of getting it right (1 in 1,024, to be exact). 8 bits, however, is like 8 coin flips and corresponds to a chance of 1 in 256.
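A per-year entropy like this could be computed roughly as follows. This is a sketch, not the code used for the analysis: the rows are made-up toy data, the (state, gender, year, name, count) tuple layout is an assumption based on the fields described earlier, and pooling counts across states and genders by name is one possible choice among several.

```python
import math
from collections import defaultdict

# Hypothetical toy rows; real counts would come from the state files.
rows = [
    ("AR", "F", 1930, "Dorothy", 60),
    ("AR", "F", 1930, "Mary", 40),
    ("AR", "F", 2014, "Emma", 25),
    ("AR", "F", 2014, "Olivia", 25),
    ("AR", "M", 2014, "Noah", 25),
    ("AR", "M", 2014, "Liam", 25),
]

def entropy_by_year(rows):
    """Return {year: entropy in bits of the name distribution for that year}."""
    # Pool counts for each (year, name) pair across states and genders.
    counts = defaultdict(lambda: defaultdict(int))
    for _state, _gender, year, name, count in rows:
        counts[year][name] += count

    result = {}
    for year, name_counts in counts.items():
        total = sum(name_counts.values())
        result[year] = -sum(
            (c / total) * math.log2(c / total) for c in name_counts.values()
        )
    return result

yearly = entropy_by_year(rows)
for year in sorted(yearly):
    print(year, round(yearly[year], 3))
```

In the toy data, 1930 has two names split 60/40 and 2014 has four equally common names, so the later year has the higher entropy.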
Another way to think about it is that if we had a perfectly efficient encoding scheme for mapping numbers to names (i.e. no correlation between the digits), then we would need a binary number of 11 digits to keep track of everyone’s name until around 2025.
Intuitively, we know that some names are more common in some years than others. For instance, we might expect Dorothy to be more common in 1930 than in 2014. How can we quantify the amount of information we have about someone’s name if we’re given their birth year?
Mutual information is such a quantity. I(X;Y) quantifies the amount of information you have about Y if you know X, or vice versa. In this case, the mutual information between names and years is about 0.85 bits. The entropy of a name without any information about the year is 9.90 bits. This means that if you’re guessing someone’s name and are told the year they were born, your uncertainty drops from 9.90 bits to 9.05 bits.
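One way to compute mutual information from a table of counts is the identity I(X;Y) = H(X) + H(Y) - H(X,Y). The sketch below uses that identity on a tiny made-up table; the function names and the numbers are hypothetical and are not the values from the real data set.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Entropy in bits of the distribution obtained by normalizing the counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def mutual_information(joint_counts):
    """I(X;Y) in bits from a dict mapping (x, y) pairs to counts."""
    x_counts, y_counts = Counter(), Counter()
    for (x, y), c in joint_counts.items():
        x_counts[x] += c  # marginal counts for X (here: names)
        y_counts[y] += c  # marginal counts for Y (here: years)
    h_x = entropy_from_counts(x_counts.values())
    h_y = entropy_from_counts(y_counts.values())
    h_xy = entropy_from_counts(joint_counts.values())
    return h_x + h_y - h_xy

# Hypothetical toy table: Dorothy dominates 1930, Emma dominates 2014.
joint = {
    ("Dorothy", 1930): 45,
    ("Emma", 1930): 5,
    ("Dorothy", 2014): 5,
    ("Emma", 2014): 45,
}

mi = mutual_information(joint)
print(f"I(name; year) = {mi:.3f} bits")
```

Here the name alone carries 1 bit, and knowing the year removes roughly half of it. If names were spread identically across both years, the mutual information would be zero.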
The ratio of the mutual information to the total (unconditional) entropy is called the uncertainty coefficient. In our case it is equal to 0.09 and represents the fraction of “name” bits we could predict given the year. Put simply, if we’re given the year then we’ve eliminated about 9% of the original uncertainty.
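The arithmetic behind these figures is short enough to spell out. The two input values are the ones quoted above; the variable names are just for illustration.

```python
h_name = 9.90        # bits: entropy of a name with no other information
mi_name_year = 0.85  # bits: mutual information between name and year

# Remaining uncertainty about the name once the year is known.
h_name_given_year = h_name - mi_name_year

# Fraction of the name's entropy explained by the year.
uncertainty_coefficient = mi_name_year / h_name

print(f"H(name | year) = {h_name_given_year:.2f} bits")
print(f"uncertainty coefficient = {uncertainty_coefficient:.3f}")
```

The coefficient works out to roughly 0.086, which rounds to the 0.09 quoted above.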
This type of analysis could be extended to quantify the mutual information contained in the gender and state variables. We could then easily compare the effectiveness of variables in predicting outcomes. For instance, we could answer the question: “If we know the state, gender, and age range of the people coming to a website, how many visitors do we need before we can guess someone’s name correctly?” If we could get the uncertainty down to 5 bits, for example, we could guess the names of roughly 1 in 32 people, just like flipping 5 heads in a row.
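The coin-flip comparisons used throughout treat H bits of entropy as equivalent to 2^H equally likely choices. A small sketch of that conversion follows; the helper name equivalent_choices is hypothetical, and the equivalence is exact only when every outcome is equally likely.

```python
def equivalent_choices(bits):
    """Number of equally likely outcomes that would carry `bits` of entropy."""
    return 2 ** bits

for bits in (5, 8, 10):
    print(f"{bits} bits -> 1 in {equivalent_choices(bits)}")
```

For a skewed distribution such as real names, always guessing the most common name succeeds at least this often, so the 1-in-2^H figure is best read as a rough rule of thumb.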
Generally this is a nice way of understanding the predictive ability of models. It has obvious implications for feature selection and the evaluation of feature redundancy.