Category Archives: Information Theory

Deriving the Normal Distribution, Part 2


In Part 1 of this series we set out learn why the Gaussian/Normal distribution is such a pervasive choice in modeling random variables.  We claim that this distribution is the least biased choice possible when the only information we have is the average and variance of outcomes from a population.  Our strategy was to establish a metric of uncertainty and then maximize this metric while simultaneously adhering to our known equation for variance from an average.

Read more

Deriving the Normal Distribution, Part 1


When dealing with probabilities or building mathematical models, the de-facto distribution is the normal/gaussian.  The bell curve.  But where does this come from?  Why is it a better choice than any other distribution that centers on the average?  It turns out that the Guassian distribution is the one that is the least biased choice if the only information you have are 1) the average and 2) the variance of a set of outcomes.

To get to this result, we will first establish a measure of our uncertainty.  This will be useful because we will then (Part 2) choose a distribution to model our process that both maximizes this uncertainty function and simultaneously satisfies the constraints of our average/variance knowledge.

Criteria for a Measure of Uncertainty

How do you measure the amount of uncertainty in a probability distribution?  We’ll start by establishing a few characteristics that would be required for a useful uncertainty function.  From these we will derive the functional form that meets the criteria.

Suppose we have the possible outcomes of the process encoded by values represented by (x_1, ... , x_n). For instance, if it were a 6-sided die, the values would be 1, 2, 3, 4, 5, and 6. One way to mathematically model the process of rolling the die would be to assign probabilities to each of the possible outcomes.  In the case of a fair die, each probability is equal to 1/6.  We assign a corresponding probability (p_1, ... , p_n) for each outcome (x_1, ... , x_n) and these probabilities represent our knowledge of the process. Let’s call our unknown uncertainty function “H”.  We expect H to depend on our probabilities since we are trying to quantify our uncertainty in our knowledge and its our probabilities that represent that knowledge: H=H(p_1, ..., p_n)

So which criteria does H have to meet so that it is useful and in line with our intentions?

  1. It must be a continuous function of the p_i.  This means a small change in one of the probabilities can’t give a huge change in our knowledge state.
  2. In the case that all outcomes are equally probable (as in the case of a fair 6-sided die), the uncertainty should increase as the number of possible outcomes increases.  In other words, the uncertainty function monotonically increases with the number of equiprobable choices.
    1. This just means that if you have more choices, you’re more uncertain than if you had fewer choices.  Guessing the outcome of a coin-flip is easier than picking the lottery numbers.
  3. No matter how you count the probabilities, you should arrive at the same value for the same state of knowledge.
    1. For instance, let’s say that instead of having to guess the outcome of a fair six sided die roll directly, you first guess whether it’s in the set \{1,2,3\} or \{4,5,6\} and then guess the final number from the smaller list.  The first choice has two equal outcomes, so your probability of success is 1/2.  The second choice has three equal outcomes, so your probability of success is 1/3.  Your total probability of success is the product of these two and is 1/6.  Essentially this is another way of modeling the exact same process.  The requirement above just means that no matter how we model a process and break it up into sub-decisions, we should get one consistent value for H, whatever it ends up being.

The Uncertainty Function

It turns out, the function that satisfies these criteria is H(p_1,...,p_n)=-K\sum p_i \log p_i where K is a positive constant that determines the units.  If you’re interested in the derivation, see the Deriving the Uncertainty Function section below.

An Example

Let’s use a simple case where there are two outcomes with probabilities p and q = 1-p.  The uncertainty becomes

H(p, 1-p)=-K[p\log p + (1-p)\log(1-p)]

= -K[p\log\frac{p}{1-p}+\log(1-p)]

Plotting this function using K=1 and a logarithm of base-2, the following plot is generated with units of bits on the y-axis.


How does this fit our intuition?  If we are certain in the outcome (i.e. p=0 or p=1) then our uncertainty function H=0.  We have no uncertainty.  If, on the other hand, p=q=\frac{1}{2}, then our function is maximized at H=1 bit.  As we would expect, we are maximally uncertain when the two outcomes are equally probable.

Deriving the Uncertainty Function

Using the criteria stated above, let’s now derive our uncertainty function.

Uncertainty Function of Equally Probable Outcomes

Let’s assume we are dealing with equally probable events at first, like a fair die.  We’ll simplify our notation so that instead of writing the uncertainty for n equally probable events as H(\frac{1}{n},\frac{1}{n},...,\frac{1}{n}), we will just use A(n).

According to requirement 3 above, we can deduce that to decompose a choice from n = s^m equally likely possibilities, you can transform it to be m choices from s equally likely possibilities:  A(s^m) = mA(s).  In other words, if you have n = 9 = 3^2 equally likely choices, you can decompose that into two subsequent choices, each from three possibilities.  This decomposition is similar to what we did in the sub-point of requirement 3.

Now, for any specified number t^n, we can find an m to satisfy the following equation: s^m\leq t^n\leq s^{m+1}.  So if t^n = 3^2 = 9, we could choose s^m = 2^3 = 8 and then s^{m+1} = 2^4 = 16, which satisfies the inequality.

By taking the logarithms of each term, we get m\log(s) \leq n\log(t) \leq (m+1)\log(s).  Dividing by n\log(s), we get \frac{m}{n} \leq \frac{\log(t)}{\log(s)} \leq \frac{m}{n} + \frac{1}{n}.  Subtracting \frac{m}{n} from each term we end up showing: \frac{\log(t)}{\log(s)} - \frac{m}{n} \leq \frac{1}{n}.  And since n can be arbitrarily large, |\frac{\log(t)}{\log(s)} - \frac{m}{n}| < \epsilon.  So what?

Since requirement 2 says that our uncertainty function is monotonic that means the original inequality and subsequent forms of it apply:  A(s^m) \leq A(t^n) \leq A(s^{m+1}).  Using our decomposition rule just derived: mA(s) \leq nA(t) \leq (m+1)A(s).  Continuing with the same division/subtraction steps we just undertook we end up with the relation:  |\frac{A(t)}{A(s)} - \frac{m}{n}| < \epsilon.

These inequalities essentially state the following:  The ratios of \frac{A(t)}{A(s)} and \frac{\log(t)}{\log(s)} are both within a distance of \epsilon from the quantity \frac{m}{n}.  That means the furthest they can be from one another is 2\epsilon.  Or translating that to an inequality, |\frac{A(t)}{A(s)}-\frac{\log(t)}{\log(s)}| < 2\epsilon.

So what does this tell us?  We have shown that the difference between a ratio of our unknown functions \frac{A(t)}{A(s)} are arbitrarily close to a ratio of known functions \frac{\log(t)}{\log(s)}.  This leads one to naturally conclude that the unknown functions are the known functions within some multiplicative constant.  How else could they be arbitrarily close for any s and t?  Therefore:

A(t) = K\log(t).

Uncertainty of General Probability Distributions

So that’s the function that quantifies the uncertainty of a process with t equally probable outcomes.  But what about processes that don’t have equally probable outcomes?  Using requirement 3 we see that the way you decompose a process is that you take the uncertainty of the first choice and add the uncertainties of the additional choices that are weighted by their probability of occurrence.  So in the case of our game in the example of requirement 3, the uncertainty of the process of choosing 1 of 6 equally probable sides is the same as the sum of the uncertainties of the two step process: 1) choosing one of the two subgroups \{1,2,3\} or \{4,5,6\} and 2) choosing among the three elements of each subgroup multiplied by the probabilities of getting that subgroup in the first place.

A(6) = H(\frac{1}{2},\frac{1}{2})+\frac{1}{2}H(\frac{1}{3},\frac{1}{3}, \frac{1}{3})+\frac{1}{2}H(\frac{1}{3},\frac{1}{3}, \frac{1}{3})

Now imagine a process where you have n =\sum n_i outcomes where n_i is the relative occurrence of the i^{th} outcome.  For instance, a 6 sided die where there are a single green, a single red, two blues and two yellows.  The probabilities for each outcome are then straight-forward:  p_i = \frac{n_i}{\sum n_i}.  This also means that if you know you are on outcome i, this is like choosing from n_i equally probable events since there are n_i occurrences of i.  To break down this process of n =\sum n_i outcomes we have A(\sum n_i) = H(p_1,...,p_n) + \sum p_i A(n_i).  Let’s solve this for our unknown uncertainty function H(p_1,...,p_n) .

H(p_1,...,p_n)=A(\sum n_i)-\sum p_i A(n_i)

= \sum p_i A(\sum n_i)- \sum p_i A(n_i) = \sum p_i [A(\sum n_i) - A(n_i)]

= K\sum p_i [\log(\sum n_i)-\log(n_i)]

= K\sum p_i \log\frac{\sum n_i}{n_i}

= K\sum p_i \log\frac{1}{p_i}

= -K\sum p_i \log p_i

The only thing we did was recognize that a function weighted by its probabilities and summed is equal to the function itself (line 2) and substituted our equation of the A(n) that we found above (line 3).

So, finally, we have solved our problem.  The uncertainty associated with a process that’s characterized by the set of probabilities \{p_i\} is given by the function

H(p_1,...,p_n)=-K\sum p_i \log p_i

where K is a positive constant.


Our original goal was to find the least biased distribution to use to model a process when we only have knowledge of the average and variance of a sub-population.  To do this we started by coming up with a metric for measuring our uncertainty when given a set of probabilities.  In Part 2 of this treatment we will maximize this uncertainty function while using our knowledge of average/variance as constraints on the probabilities.  In this way we work backwards from this uncertainty function and ultimately arrive at a functional form of our least-biased probability distribution.


This derivation and treatment follows almost exactly that found in Claud Shannon’s original work A Mathematical Theory of Communication.  The image above was found on the wikipedia page on information entropy and can be found here.

The Entropy of First Names

I was recently working on a coding challenge for a job related to analysis of the first names of people born in the USA.  The data set given was from the Social Security Administration and can be downloaded here.  There is a text file for each state and for every name entry there is the associated gender, year, state, and count of people born that year with the stated name.  For example, here are the first few lines of the Arkansas file:


While looking for inspiration for this challenge I found an interesting article that referenced the entropy of baby names.  Entropy measures the amount of information contained in a choice from a distribution.  I decided to explore the entropy concept further and the results are below.

Read more