Learning Haskell by working my way through the Real World Haskell book and the huge variety of online resources recently has reignited my curiosity about “proper” computer science. The sort with maths. I’ve picked up most of my maths as a side effect of hacking code, so I have been slowly working my way through “Sets, Logic and Maths for Computing” by David Makison, which is a reasonably thorough undergraduate-level introduction biased more towards set theory than some texts. I think that it has been worth doing.
In the chapter on probability and combinatorics a classic question is posed, which has for various reasons, rather occupied my mind for the last couple of days.
There are 25 people in a room. What is the probability that at least two are born on the same dayof the year (but not necessarily the same year)? For simplicity, assume that each year has 365 days, and that the people’s birthdays are randomly distributed through the year (both assumptions being, of course, rough approximations)
Now there are a number of solutions, but most of them require some big maths. Which is to say numbers like 36525, dividing factorials. As it turns out Haskell, whilst being able to handle mega big Integers, isn’t quite as nice when you want arbitrary precision floats. Of which more anon. But first I need to explain how I solved the problem. There are other solutions — see the Birthday Problem entry for example — but the one I got to seems to be the standard way of doing it.
How to solve the problem
The text gives three extremely helpful hints that help you get to a solution.
- Be careful with your choice of sample space
- First find the probability that no 2 distinct people have the same birthday
- You will need a pocket calculator
Well my pocket calculator can’t handle 36525. Maybe I need a better one. Enter GHCi, which said
GHCi, version 7.4.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
So, back to hint 1, thinking about the size of my sample space. Well, the sample space has to be based on the 365 days of the year. The order is significant and we allow ourselves to repeat days. Which means that we use permutations with repetition allowed, the formula for which is just nk. In our case, that means our old pal 36525. Which is just saying 365*365*365… and so on until we’ve multiplied 365 by itself 25 times.
The second hint suggests that we look for the set of all people with different birthdays. Again, the order is significant but this time repetitions are very much not allowed — remember that we are looking for distinct birthdays. So, this is a case of permutations. If you write this all down mathematically you get something like
Image, used under creative commons, from wikipedia
If you are still confused at this point, have a quick look at this youtube video, from "Maths made almost bearable" which explains things quite nicely.
To get to this point took me a while. I’m not a mathemetician, but I did get there in the end. However, the Haskell implementation turned out to be harded than I had expected. Possibly because I was trying to make a one liner. But mostly because I struggled to find a way to divide large Integers together.
I started by looking for the built in factorial function. Which it turned out wasn’t built in. There is, however, a product function, which multiplies together the elements of a list. Combining that with the fact that one can generate a list from a range (1..5) gave me a way to write fac (as I decided to call my factorial function) nicely
let fac n = product [1..n]
Now that I could do factorials, I could implement a permutation fuction. Easy enough to do like this
let perm n' = fac 365 `div` fac (365-n');The n' becomes relevant later, but it is just an Haskellish name for a variable that is somehow related to n, rather than calling it n2 or something as one might in another language. We take the factorial of 365 and divide it by the factorial of 365 minus another integer. To solve the question n' would be 25. This is probably a case of bad naming, as mathemeticians would usually refer to what I am calling n' as k and to 365 as n. If this bothers you, you should change it. Here is a test run anyhow.
Prelude>let fac n = product [1..n]
Prelude> let perm n' = fac 365 `div` fac (365-n');
Prelude> perm 2
Prelude> perm 25
Now that I had those bits, what I wanted to say was
bd n = 100 - 100 * perm n / 365^nI multiplied the probability of all the birthdays being distinct by 100 to get a percentage. Then I took that percentage off 100 to find out the percentage probability that all the birthdays are not distinct. That is at least 2 people share a birthday.
However, dividing big integers like that can, it seems, overflow boundaries of the Double that Haskell comes with. I had a dig about for ways round this. I would have thought it would be widely documented, but as it turns out that was not the case. The HMPFR library came up during my search, but the solution I went with in the end was somewhere in a mammoth thread on Haskell Café entitled about integer and float operations. Who knew? I implemented a function that does division by first constructing a Rational, then converting to floating point using fromRational. Note that I had to import Data.Ratio to get the % operator.
x /. y = > fromRational (x % y)
This mostly does as you would expect. Mostly. And usually for big numbers, which is what we care about.
Prelude> import Data.Ratio
Prelude Data.Ratio> let x /. y = fromRational (x % y)
Prelude Data.Ratio> fac 777 /. fac 778
Prelude Data.Ratio> fac 777 /. fac 771
I now had all the components I needed to make a nice little one liner to solve the birthday problem for whatever number I wanted.
Prelude Data.Ratio> let bd n = 100.0 - 100.0 * perm n /. (365^n) where perm n' = fac 365 `div` fac (365-n'); fac x = product [1..x]; x /. y = fromRational (x % y)
Prelude Data.Ratio> bd 25
At this stage I could answer the original question and say that the probability that at least 2 of the 25 people in the room were born on the same day of the year is 56.87 percent (ish). Which may or may not surprise you.
The implementation could stand to be a little more efficient. I do a lot of work in the fac function. I can improve performance by changing how I calculate the permutation so that I need to calculate fewer factorials. How? Well fac 365 / fac 365 - n is the same as fac 365 - (n - 1) and I was able to take advantage of this fact like this
Prelude> fac 365 `div` fac 362 48228180
Prelude> product [363..365] 48228180That made me happy, because I could avoid a fair bit of redundant calculation. You can observe the effect of redundant calculation when the figures are an order of magnitude bigger. You could do the following.
Prelude> fac 36500 `div` fac 36300 1664794314530650901156157427362126079623718218858338320926253293194738029255822028880804606888391620622216346830718850489163512882032403154323479420940484945108180108150817321610145580902741060457892438267562280504387599793310194282194436244175253336388945446577578317455324514816275134418826320949533614256936277001924393246993319975850825529842665801562179315307492480192204683979570635723886832740975026116663456784309654170151889459039390520911997827868496881384821437627268818497913193773276823414991643772715776993626808085451928663301632257485173243595486303454386605026994637467439070225888511938467908860108173806269061218085426994717143254299411207555930798536689399855217030247250194784241041929395588519421076007220176675600893925794424367074488857667687578194574907273296310755718552967233481418597816809213004129158878860590102578587316264240572006400000000000000000000000000000000000000000000000000 Prelude> product [36301..36500] 1664794314530650901156157427362126079623718218858338320926253293194738029255822028880804606888391620622216346830718850489163512882032403154323479420940484945108180108150817321610145580902741060457892438267562280504387599793310194282194436244175253336388945446577578317455324514816275134418826320949533614256936277001924393246993319975850825529842665801562179315307492480192204683979570635723886832740975026116663456784309654170151889459039390520911997827868496881384821437627268818497913193773276823414991643772715776993626808085451928663301632257485173243595486303454386605026994637467439070225888511938467908860108173806269061218085426994717143254299411207555930798536689399855217030247250194784241041929395588519421076007220176675600893925794424367074488857667687578194574907273296310755718552967233481418597816809213004129158878860590102578587316264240572006400000000000000000000000000000000000000000000000000 The same result. But I can count to 9 whilst the first calculation runs, but not even get to 1 when the second does. Charlies like massive speed improvements! Even if, in this case, it is barely noticeable when I call the bd function.
Here is the improved and more efficient code
Prelude> let bd n = 100.0 - 100.0 * perm n /. (365^n) where perm n' = product [(366-n')..365]; x /. y = fromRational (x % y) Prelude> bd 25 56.86997039694639There is no doubt still plenty of room for improvement, but this code solves the problem as required.
It took a fair while to get such a tiny bit of code to work, but the Wikipedia entry had another, to my mind more elegant (though less accurate) method to approximate the probability.
The probability of any two people not having the same birthday is 364/365. In a room containing n people, there are C(n, 2)=n(n-1)/2 pairs of people, i.e. C(n, 2) events. The probability of no two people sharing the same birthday can be approximated by assuming that these events are independent and hence by multiplying their probability together. In short 364/365 can be multiplied by itself C(n, 2) times.
Implementing that as a one liner starts with the C(n,2) part. That is maths speak for combinations, meaning that order is not significant and repetition is not allowed. The formula I know is n! / k!(n-k)! More of those factorials. I wrote a little combine function, which, unlike the perm function implements combinations generally
Prelude Data.Ratio> let combine n' k = round $ fac n' / (fac k * fac (n'-k) )
Prelude Data.Ratio> combine 10 6
As before, however, I needed to guard against overflows by using the /. division trick. Otherwise I could end up with
Prelude Data.Ratio> combine 180 6
Prelude Data.Ratio> combine 190 6
Prelude Data.Ratio>Not what I wanted to see. Compare and contrast with
Prelude Data.Ratio> let combine n' k = round $ fac n' /. (fac k * fac (n'-k) )
Prelude Data.Ratio> combine 190 6
Prelude Data.Ratio> combine 180 6
Prelude Data.Ratio> combine 170 6
With the division issue once again resolved, I was in a position to implement the exponent approximation method of calculating birthday probabilities thus.
Prelude Data.Ratio> let bd' n = 100 - 100 * (364/365) ^ (combine n 2) where combine n' k = floor $ fac n' / (fac k * fac (n'-k) ); x /. y = fromRational x % y
Prelude Data.Ratio> bd' 25
Prelude Data.Ratio> -- compared to
Prelude Data.Ratio> bd 25
So, there you have it. Two answers to the question, and I also learned one way to handle arbitrary precision floating point division in Haskell. It is probably time to move on to the next chapter. But before I finally leave the birthday problem behind, here is one way to get the whole problem completely muddled. You can read a great explanation of why Carson got so mixed up in the New York times blog.
Clip courtesy of Cornell University