Over the past couple of weeks we've been talking a lot about statistics and calculating the probability of hash collisions.

Last week, we took some time to calculate the probability that, in a group of people, at least two individuals share a birthday. In computer science, this is a hash collision - where two random events (i.e. hash calculations) happen to coincide exactly.

Today, we'll instead work out the probability that, in a given group of individuals, at least three share a birthday.

The Three-way Birthday Problem

If you remember from last week, when we have a group of 50 individuals, there's a 97.04% chance that two[ref]Or more ...[/ref] members of the group share a birthday.

The next step of the calculation is to figure out the probability of exactly two people sharing a birthday in the group. Unlike the two-or-more problem earlier, we have to work this out based on the number of potential combinations of individuals within the group.

We can choose 2 individuals from a population of 50 in 1,225 ways.

The probability that exactly 2 individuals share a birthday is then calculated as:

$latex P=\frac{1225\times365\times364\times\dots\times317}{365^{50}}=11.18\%$

To review, this means that, for a population of 50, the probability of no one sharing a birthday is 2.96% and the probability of exactly two sharing a birthday is 11.18%.

The probability, then, that three or more members of the group share a birthday is:

$latex P=1-(0.0296+0.1118)=0.8586-85.86\%$

Consider this, in a room of 50 people, there's a 97.04% chance that two of them share a birthday. There's also an 85.86% chance that three of them share a birthday.

Implications for Security

The birthday problem is a convenient way to visualize the behavior of hashes. Out of a population of over 7 billion, the population of the planet can still be summarized in a hash table consisting only of 365 discrete hashes. It's the mapping of a very large data set onto a much smaller one.

Standard hashes in computing do the same thing. A typical checksum will reduce a possibly infinite set of data down to just ten potential values. The potential for hash collisions is enormous.

Simple hashing algorithms like MD5 also raise the potential for large numbers of collisions. If a stored password is hashed with MD5, a potential attacker doesn't need to know your actual password to bypass security, only a string that hashes to the same output.

The smaller the potential set output by the hash function, the higher the likelihood of hash collisions.