What is a random positive integer?


Let’s try to think about this question: What is the probability that a random positive integer starts with a 1?

Without thinking about it too much, we might be inclined to say 1/9. After all, for each n, there are the same number of n-digit integers starting with a 1 as with any other nonzero digit.

Let’s look at one way we could naturally be led to the answer 1/9. We write down a bunch of digits at random, making sure we don’t start with a 0, and we keep going until we get bored.

Since I don’t know how to turn that into math, I’ll replace it with a slightly simpler model: we choose a random integer from 1 to 999999. Again, 1/9 of these numbers start with a 1.

It’s easy to find a problem with this approach, though. Why should we stop at 999999? Why don’t we stop somewhere else? How about 199999? What happens then?

Now, the picture is completely different: just over half of these numbers are at least 100000, and all of these start with 1. In fact, roughly 5/9 of them start with a 1.
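
To make this concrete, here’s a quick check (a minimal sketch; the helper function leading_one_fraction is mine, not anything standard):

def leading_one_fraction(n):
    # Fraction of the integers 1..n whose decimal expansion starts with '1'
    return sum(1 for m in range(1, n + 1) if str(m)[0] == '1') / n

print(leading_one_fraction(999999))  # 0.1111... (exactly 1/9)
print(leading_one_fraction(199999))  # 0.5555... (roughly 5/9)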

There are two lessons we should learn from this. The first is that there is no way of picking a random positive integer with a uniform distribution (so that every possibility is equally likely). (There’s no uniform measure on \mathbb{N}: each integer would have to get the same probability p, but countably many copies of p sum to 0 if p=0 and to \infty otherwise, never to 1.)

The second is a bit more subtle. The point is that we should be careful not to confuse a number with its decimal expansion, or its digits. When we chose to stop at 999999, we did so only because that was convenient since we work in base 10 most of the time. But a number has almost nothing to do with its digits; if we were using a different base (for example, if we had only four fingers per hand), we’d still talk about all the same numbers. We’d write them differently, but they’d still be the same numbers.

Okay, so perhaps the question is not a very good one: if we stop at some place, we get 1/9; if we stop at another place, we get around 5/9. If we stopped at other points, we’d probably get something in between. We could just agree that this is an ill-posed question and move on with our lives.

Surprisingly, though, there is a right answer to this question. Let’s divide the numbers we see around us into two categories: the first consists of labels that are written using digits but aren’t really numbers; the second consists of genuine quantities, which don’t depend on the base we write them in.

It’s probably not entirely obvious what I mean by that, so let me explain a little. Certain numbers that we think about frequently, like social security numbers and telephone numbers, don’t really have anything to do with numbers. We just write them down as numbers because it’s convenient to do so. But they lack any numerical properties: for example, in the very hypothetical world in which I used telephones, what would it mean if my telephone number was lower than someone else’s? Nothing at all. Such “numbers” don’t count anything.

On the other hand, some numbers we see regularly do have numerical meaning. For instance, it does mean something if my bank balance is lower than someone else’s. I think that distinguishing between the two categories should not present any difficulties.

Numbers in the first category presumably start with a 1 about 1/9 of the time, since there’s no reason for that not to be the case. We’re essentially using the same model as we did when we picked a random number from 1 to 999999.

So, we’re only interested in the second type of number, the ones that are actually numbers. How do we model those?

This is not so obvious, but here’s a way that makes me happy. Pick a big number N_0. Now, pick a random number N_1 from 1 to N_0. Now pick a random number N_2 from 1 to N_1. That’s our final answer.

This model is similar to the previous one, except that we pass through an intermediate number before arriving at our final answer. This serves to dampen the effect of the choice of N_0.

Now what’s the probability that N_2 begins with a 1? It still depends on N_0. Here’s a Python script I wrote to test this, running 100000 times on each of various choices of N_0.

import random

def benf2(nmax, trials):
    # Count how many of the trials produce an N_2 whose first digit is 1
    s = 0
    for _ in range(trials):
        n1 = random.randrange(1, nmax + 1)  # N_1 is uniform on 1..N_0
        n2 = random.randrange(1, n1 + 1)    # N_2 is uniform on 1..N_1
        if str(n2)[0] == '1':
            s += 1
    return s

for i in range(1, 10):
    print(100000 * i, benf2(100000 * i, 100000))

And here are the results:

100000 24014
200000 31031
300000 35997
400000 34595
500000 32877
600000 30602
700000 28810
800000 27186
900000 25521

Okay, so that’s better, but there’s still a fairly wide spread. Can we do better?

Of course; we just have to put more dampening factors in the middle. Once we’ve chosen N_2, pick N_3 at random between 1 and N_2. We can keep doing this a few more times. Each time, the results get better and better. Here is some code to test this:

def benk(nmax, k):
    # Dampen k times: repeatedly replace n by a uniform random pick from 1..n
    n = nmax
    for _ in range(k):
        n = random.randrange(1, n + 1)
    return n

def count(nmax, trials, k):
    # Count how many of the trials yield a number whose first digit is 1
    s = 0
    for _ in range(trials):
        if str(benk(nmax, k))[0] == '1':
            s += 1
    return s

We can test this with two intermediate values, again running 100000 trials for each of various values of N_0:

for i in range(1, 10):
    print(100000 * i, count(100000 * i, 100000, 3))

And here are the results:

100000 30214
200000 27825
300000 30215
400000 31389
500000 32121
600000 31963
700000 31877
800000 31214
900000 30679

So, it’s getting better. By making k larger, we get ever better results. Here is a test with k=6.

for i in range(1, 10):
    print(100000 * i, count(100000 * i, 100000, 6))

100000 30888
200000 30506
300000 30550
400000 30287
500000 30073
600000 30138
700000 30383
800000 30240
900000 30278

All in all, we see that the results converge to some number around .3. The right answer, it turns out, is \log_{10}(2), or around .3010. This is a (slightly) special case of a heuristic called Benford’s Law.
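
As a quick sanity check (a minimal sketch), we can compare the k = 6 counts above with the theoretical expectation:

import math

print(math.log10(2))            # 0.30102999...
print(math.log10(2) * 100000)   # about 30103, close to every count in the k = 6 table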

Slightly more generally, Benford’s Law says that the probability that a random number begins with the digit d is \log_{10}(d+1)-\log_{10}(d), and, in base b, the probability that a random number begins with the digit d is \log_b(d+1)-\log_b(d). Not quite (but very close to) an immediate consequence of this is that the probability that the first two digits of a random base-10 number are 12 is \log_{10}(13)-\log_{10}(12).
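
We can tabulate these probabilities in a couple of lines (a minimal sketch); note that the nine probabilities telescope to \log_{10}(10)-\log_{10}(1)=1, as any probability distribution must:

import math

for d in range(1, 10):
    # Benford probability that the leading digit is d (base 10)
    print(d, math.log10(d + 1) - math.log10(d))

print(sum(math.log10(d + 1) - math.log10(d) for d in range(1, 10)))  # 1.0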

Why should we expect this to be true? One reason is that it’s scale invariant. To see what this means, let’s look at bank balances. What’s the probability that a random person’s bank balance (in dollars) begins with a 1? We’d expect that probability not to change when we convert to a different currency, and the Benford distribution has exactly that property. Let’s take a look at how this works.

As I write this, a dollar is worth almost exactly 200 Hungarian forints. Scale invariance suggests that the probability that a random person’s bank balance begins with a 1 when measured in dollars should be the same as the probability that it begins with a 1 when measured in forints. When does a balance begin with a 1 once it’s converted to forints? Exactly when the dollar amount begins with a 5, 6, 7, 8, or 9. Benford’s Law tells us that the probability that the balance in dollars begins with a 1 is \log_{10}(2). On the other hand, the probability in forints is

(\log_{10}(6)-\log_{10}(5))+(\log_{10}(7)-\log_{10}(6))+(\log_{10}(8)-\log_{10}(7))+(\log_{10}(9)-\log_{10}(8))+(\log_{10}(10)-\log_{10}(9))=\log_{10}(10)-\log_{10}(5)=\log_{10}(2).

Of course, this works in full generality as well.
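
We can also check this numerically with the dampened model from above (a minimal sketch, reusing the benk function defined earlier; the factor of 200 is the assumed exchange rate):

trials = 100000
ones_dollars = 0
ones_forints = 0
for _ in range(trials):
    n = benk(900000, 6)                # a "random number" from the dampened model
    if str(n)[0] == '1':
        ones_dollars += 1              # leading digit 1 in dollars
    if str(200 * n)[0] == '1':
        ones_forints += 1              # leading digit 1 after converting to forints
print(ones_dollars, ones_forints)      # both should land near 30000, as in the tables above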

Indeed, logarithmic functions are the only “reasonable” functions that have this scale invariance. So, if we expect the distribution to be modeled by a reasonable function that’s scale invariant, this is the only choice.
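
Here’s a sketch of one way to see this, under my assumption (not spelled out above) that a “random number” is spread out with density proportional to 1/x, which is the scale-invariant choice, since substituting x \mapsto cx leaves dx/x unchanged. Integrating this density over the part of a decade with leading digit d gives

\int_{d \cdot 10^j}^{(d+1) \cdot 10^j} \frac{dx}{x} = \log\left(\frac{(d+1) \cdot 10^j}{d \cdot 10^j}\right) = \log\left(\frac{d+1}{d}\right),

and dividing by the total mass of a decade, \log(10), gives exactly \log_{10}(d+1)-\log_{10}(d), the Benford probability.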

In practice, Benford’s Law works pretty well. For instance, if we select a random month (really, meaning a number from 1 to 12), the probability that it begins with a 1 is 4/12, or .3333. If we select a random day from a month (say, a 30-day month), the probability that it begins with a 1 is 11/30, or .3667. If we look at a list of countries by population (which includes a few non-countries, but that isn’t very important), there are 63 out of a total of 224 that begin with a 1, or .2813.
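
The first two of these fractions are easy to recompute with the leading_one_fraction helper from the beginning of the post (a minimal sketch):

print(leading_one_fraction(12))   # months: 4/12 = 0.3333...
print(leading_one_fraction(30))   # days in a 30-day month: 11/30 = 0.3666...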

I’d recommend reading Terence Tao’s article on Benford’s Law and other phenomena of a similar flavor for further insights.
