Last week I briefly mentioned the idea of hash collisions, using the likelihood of multiple lottery winners as an example. Let's take some time to build up some more rigorous background in statistics to support that discussion.

The Birthday Paradox

There’s a paradox in statistics that states, in a room of 23 people, the chance that two people have the same birthday is 50%.

It's a non-intuitive statement, and was one of the first problems I was assigned when studying statistics in college.[ref]I was taking a course on political analysis. Understanding basic statistics was a prerequisite to running more advanced regressions and analyses on polling data.[/ref] Our assignment seemed simple:

"How many students do you need to gather in a room such that the probability that two of them share the same birthday reaches 50%?"

Well, it seemed simple.

I might have a math degree, but this problem still took me a week to figure out. It's fairly straight-forward once you've solved the problem once, but for first-time approaches it becomes tricky.

How it Works

Let's walk through an example.

If only one person is in a room, we don't care about probabilities (the chance they share a birthday with themselves is a nonsensical thing to calculate). Let's move on instead.

The chances that the second person shares a birthday with the first is $latex \frac{1}{365}$. Not very high.

Add a third person, and things begin to become confusing. The chances that any two people share the same birthday becomes $latex \frac{1}{365}+\frac{1}{365}+\frac{1}{365}=\frac{3}{365}=0.82\%$. This is the sum of the chances of each permutation of potentially shared birthdays: Persons 1 & 2, Persons 1 & 3, Persons 2 & 3.[ref]For a more rigorous calculation, we also have to factor in the probability that all of the participants share the same birthday. My point here isn't to provide a solid additive solution, but to illustrate the snowballing complexity of calculating our probabilities in this fashion.[/ref]

Continuing on with this pattern becomes unnecessarily cumbersome very quickly as the available permutations add up. Instead, it's easier to calculate the probability instead that, among all members of the group, there are no shared birthdays.

Exclusivity

Assuming there are 365 available birthdays in a year,[ref]We're ignoring leap years for simplicity.[/ref] if a person is alone in their room we calculate the exclusivity of their birthday as $latex \frac{365}{365} = 100\%$.

When a second person enters the room, there are now only 364 birthdays remaining for theirs to be exclusive. This means the probability of mutually exclusive birthdays in a population of two is $latex \frac{365}{365}\times\frac{364}{365}=99.73\%$.

The pattern continues with a third person: $latex \frac{365}{365}\times\frac{364}{365}\times\frac{363}{365}=99.20\%$.

And a fourth: $latex \frac{365}{365}\times\frac{364}{365}\times\frac{363}{365}\times\frac{362}{365}=98.36\%$.

When we reach 10 people in a room, the probability that none of them share a birthday becomes $latex 88.31\%$.

Fifteen people becomes $latex 74.71\%$.

By the time twenty people have entered the room, the probability that their birthdays are all unique has already dropped to $latex 58.86\%$. It's not much more of a jump to $latex 49.27\%$, the probability of having absolutely unique birthdays among a group of 23 people.

Flipping the probability back around to our original question (looking at the probability at least two people share a birthday) shows that, once the population reaches 23, the likelihood that two people in the group share a birthday is $latex 50.73\%$.

My politics professor would be proud.

Extending the Pattern

What's more, we can extend the pattern a bit, and quickly see that the likelihood of duplicate birthdays grows rapidly as the population increases. In fact, we reach a probability where it's nearly certain that two people share a birthday. With just 50 people in a population, there's an insanely high chance two of them share a birthday: $latex P(50) = 1-\frac{365}{365}\times\frac{364}{365}\times\cdots\times\frac{316}{365}=97.04\%$.[ref]Remember, our original exclusivity approach calculates the probability of having completely unique birthdays in a group. To find the probability of a collision, we have to subtract from 1.[/ref]

In a population of 70, the probability is so close to 100%, I'd be surprised if you didn't find two people with the same birthday.

Think about this the next time you're at a WordCamp ...

Implications in Computing

When we deal with security, we aren't talking in terms of birthdays but in terms of hashes.

Hash functions, whether they're cryptographically secure or not, are merely functions that map an infinite set of potential inputs to a finite space of potential outputs.

MD5, for example, only presents 120,892,581,961,462,917,4706,176 different potential hashes (that's $latex 32^{16}$ combinations). This might seem like a rather large number, but when you take the statistics above into account, the complexity for generating a collision is only $latex 2^{64}$.

Put another way, generating a collision for MD5 "is the equivalent of an exhaustive key search of 64 bits."[ref]MD5 Discussion on Information Security Stack Exchange[/ref]

In order for a one-way hash to be secure, it needs to derive a very large key set, otherwise a birthday attack (looking for potential collisions as a way to crack the hash) renders it useless.