Let’s try to think about this question: What is the probability that a random positive integer starts with a 1?
Without thinking about it too much, we might be inclined to say 1/9. After all, for each , there are the same number of -digit integers starting with a 1 as with any other nonzero digit.
Let’s see a possible way to we could naturally be led to the answer 1/9. We write down a bunch of digits at random, making sure we don’t start with a 0, and we keep going until we get bored.
Since I don’t know how to turn that into math, I’ll replace it with a slightly simpler model: we choose a random integer from 1 to 999999. Again, 1/9 of these numbers start with a 1.
It’s easy to find a problem with this approach, though. Why should we stop at 999999? Why don’t we stop somewhere else? How about 199999? What happens then?
Now, the picture is completely different: just over half of these numbers are at least 100000, and all of these start with 1. In fact, roughly 5/9 of them start with a 1.
There are two lessons we should learn from this. The first is that there is no way of picking a random positive integer with a uniform distribution (so that every possibility is equally likely). (There’s no uniform measure on .)
The second is a bit more subtle. The point is that we should be careful not to confuse a number with its decimal expansion, or its digits. When we chose to stop at 999999, we did so only because that was convenient since we work in base 10 most of the time. But a number has almost nothing to do with its digits; if we were using a different base (for example, if we had only four fingers per hand), we’d still talk about all the same numbers. We’d write them differently, but they’d still be the same numbers.
Okay, so perhaps the question is not a very good one: if we stop at some place, we get 1/9; if we stop at another place, we get around 5/9. If we stopped at other points, we’d probably get something in between. We could just agree that this is an ill-posed question and move on with our lives.
Surprisingly, though, there is a right answer to this question. Let’s divide the numbers we see around us into two categories: the first consists of numbers that are chosen for their digits and aren’t really numbers; the second consists of numbers that don’t depend on base.
It’s probably not entirely obvious what I mean by that, so let me explain a little. Certain numbers that we think about frequently, like social security numbers and telephone numbers, don’t really have anything to do with numbers. We just write them down as numbers because it’s convenient to do so. But they lack any numerical properties: for example, in the very hypothetical world in which I used telephones, what would it mean if my telephone number was lower than someone else’s? Nothing at all. Such “numbers” don’t count anything.
On the other hand, some numbers we see regularly do have numerical meaning. For instance, it does mean something if my bank balance is lower than someone else’s. I think that distinguishing between the two categories should not present any difficulties.
Numbers in the first category presumably do start with a one 1/9 of the time, since there’s no reason for that not to be the case. We’re essentially using the same model as we did when we picked a random number from 1 to 999999.
So, we’re only interested in the second type of number, the ones that are actually numbers. How do we model those?
This is not so obvious, but here’s a way that makes me happy. Pick a big number . Now, pick a random number from 1 to . Now pick a random number from 1 to . That’s our final answer.
This model is similar to the previous one, except that we have an intermediate number in the middle before we get our final answer. This serves to dampen the effect of the choice of .
Now what’s the probability that begins with a 1? It still depends on . Here’s a Python script I wrote to test this, running 100000 times on each of various choices of .
import random def benf2(nmax,trials): s = 0 for i in range(0,trials): n1 = random.randrange(1,nmax+1) n2 = random.randrange(1,n1+1) if str(n2) == '1': s = s + 1 i = i + 1 return s for i in range(1,10): print 100000*i, benf2(100000*i,100000)
And here are the results:
100000 24014 200000 31031 300000 35997 400000 34595 500000 32877 600000 30602 700000 28810 800000 27186 900000 25521
Okay, so that’s better, but there’s still a fairly wide spread. Can we do better?
Of course; we just have to put more dampening factors in the middle. Once we’ve chosen , pick at random between 1 and . We can keep doing this a few more times. Each time, the results get better and better. Here is some code to test this:
def benk(nmax,k): n = nmax for i in range(0,k): n = random.randrange(1,n+1) return n def count(nmax,trials,k): s = 0 for i in range(0,trials): t = benk(nmax,k) if str(t)=='1': s = s + 1 return s
We can test this with two intermediate values, again running 100000 trials for each of various values of :
for i in range(1,10): print 100000*i,count(100000*i,100000,3)
And here are the results:
100000 30214 200000 27825 300000 30215 400000 31389 500000 32121 600000 31963 700000 31877 800000 31214 900000 30679
So, it’s getting better. By making larger, we get ever better results. Here is a test with .
for i in range(1,10): print 100000*i,count(100000*i,100000,6) 100000 30888 200000 30506 300000 30550 400000 30287 500000 30073 600000 30138 700000 30383 800000 30240 900000 30278
All in all, we see that the results converge to some number around .3. The right answer, it turns out, is , or around .3010. This is a (slightly) special case of a heuristic called Benford’s Law.
Slightly more generally, Benford’s Law say that the probability that a random number begins with a digit is , and in base , the probability that a random number begins with a digit is . Not quite (but very close to) an immediate consequence of this, then, is that the probability that the first two digits of a random base 10 number are 12 is .
Why should we expect this to be true? One reason is that it’s scale invariant. To see what this means, let’s look at income distributions. What’s the probability that a random person’s bank (in dollars) balance begins with a 1? We’d expect that probability not to change when we convert to a different currency. The Benford distribution does that. Let’s take a look at how this works.
As I write this, a dollar is worth almost exactly 200 Hungarian forints. Scale invariance suggests that the probability that a random person has a bank balance in dollars beginning with a 1 should be the same as the probability done in forints. When does a bank balance in dollars begin with a 1 when considered in forints? Exactly when it begins with a 5, 6, 7, 8, or 9. Benford’s Law tells us that the probability that the balance in dollars begins with a 1 is . On the other hand, the probability in forints is
Of course, this works in full generality as well.
Indeed, logarithmic functions are the only “reasonable” functions that have this scale invariance. So, if we expect the distribution to be modeled by a reasonable function that’s scale invariant, this is the only choice.
In practice, Benford’s Law works pretty well. For instance, if we select a random month (really, meaning a number from 1 to 12), the probability that it begins with a 1 is 4/12, or .3333. If we select a random day from a month (say, a 30-day month), the probability that it begins with a 1 is 11/30, or .3667. If we look at a list of countries by population (which includes a few non-countries, but that isn’t very important), there are 63 out of a total of 224 that begin with a 1, or .2813.