Benford's Law and file sizes

Bruce Ediger
July 2013

Benford's Law

Wikipedia's entry for Benford's Law describes the distribution of leading digits in a collection of samples: in this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently, with 9 as the first digit less than 5% of the time.
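For reference, the "about 30%" and "less than 5%" figures follow from the usual logarithmic statement of Benford's Law. The formula below is the standard textbook form, not something derived from the measurements in this article; d stands for the leading digit.

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, \dots, 9
\\[4pt]
P(1) = \log_{10} 2 \approx 0.301, \qquad
P(9) = \log_{10}\tfrac{10}{9} \approx 0.046
```

The formula has no entry for a leading digit of 0, because a leading 0 can only come from a value of exactly zero.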

But does this "law" hold for Linux file sizes? Many references claim that "computer file sizes" follow Benford's Law, but only if you don't account for "file type". I wrote a program to gather data on Linux file sizes and test that assertion.

Digit            A       B      C      D
0             7901    1236   1233    265
1            63691   91489  25857  11524
2            50970   80713  15732   6791
3            24497   42356  11096   5472
4            18566   32831   8731   4510
5            14738   27335   6390   2787
6            12079   24884   4989   2188
7            12737   22963   4025   1875
8            12578   22362   5503   1781
9            10716   18541   3247   2168
File count  228473  364710  86803  39361
    Table 1 graphical representation

Conclusion

Benford's Law appears to be true for Linux file sizes.

The Program

I wrote a C-language program so as to get numbers in a finite amount of time. It uses the C standard library function sprintf() to create a string representation of a file's size, then counts the first character of the string representation.
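The string-based approach described above might look roughly like the following. This is a minimal sketch, not the author's code: the names leading_digit() and tally_size() are hypothetical, and it uses snprintf(), the bounds-checked sibling of sprintf(), to stay within the buffer.

```c
#include <stdio.h>

/* Return the leading decimal digit (0-9) of a file size, or -1 for a
 * negative value. A size of 0 yields 0, so zero-length files land in
 * their own "0" bucket. (Hypothetical helper, not from the original.) */
int leading_digit(long long size)
{
    char buf[32];

    if (size < 0)
        return -1;
    /* Render the size as decimal text; its first character is the leading digit. */
    snprintf(buf, sizeof buf, "%lld", size);
    return buf[0] - '0';
}

/* Add one file size to a 10-element histogram indexed by leading digit.
 * Negative sizes are ignored. (Hypothetical helper.) */
void tally_size(unsigned long counts[10], long long size)
{
    int d = leading_digit(size);

    if (d >= 0)
        counts[d]++;
}
```

Calling tally_size() once per file and then printing the ten counters would give one column of a table like Table 1.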

The program uses the GNU version of the ftw() C library function, as the POSIX-compatible version will skip files and directory trees. Don't use the POSIX-compatible ftw().
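A tree walk built on ftw() could be sketched as below. This assumes a glibc system whose <ftw.h> declares ftw(); whether it behaves like the variant the original program used is not known, so treat it as an illustration only. The names digit_counts, count_entry() and print_digit_histogram() are hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 500   /* ask the headers to declare ftw() */
#endif
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Histogram of leading digits, indexed 0-9. It lives at file scope
 * because the ftw() callback has no user-data argument. */
static unsigned long digit_counts[10];

/* Called by ftw() once for each entry it visits. */
static int count_entry(const char *path, const struct stat *sb, int typeflag)
{
    char buf[32];

    (void)path;                               /* the name itself is not needed */
    if (typeflag != FTW_F || !S_ISREG(sb->st_mode))
        return 0;                             /* tally regular files only */
    snprintf(buf, sizeof buf, "%lld", (long long)sb->st_size);
    digit_counts[buf[0] - '0']++;             /* first character is the leading digit */
    return 0;                                 /* 0 tells ftw() to keep walking */
}

/* Walk the tree rooted at `root` and print one "digit count" line per
 * leading digit. Returns 0 on success, -1 if the walk did not complete. */
int print_digit_histogram(const char *root)
{
    if (ftw(root, count_entry, 16) != 0)      /* 16: max directories held open */
        return -1;
    for (int d = 0; d < 10; d++)
        printf("%d %lu\n", d, digit_counts[d]);
    return 0;
}
```

Note that ftw() follows symbolic links, so a file reachable through a link may be counted more than once.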