Wikipedia has an entry for Benford's Law, which describes the distribution of leading digits in a collection of samples. In this distribution, the digit 1 occurs as the leading digit about 30% of the time, while larger digits occur in that position less frequently: 9 is the first digit less than 5% of the time.
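For reference, the usual statement of Benford's Law gives the expected frequency of each leading digit d (from 1 to 9) as a base-10 logarithm. The symbol P(d) is just notation chosen here:

$$
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9
$$

That works out to P(1) = log10(2), or about 30.1%, and P(9) = log10(10/9), or about 4.6%, which matches the figures quoted above.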
But does this "law" hold for Linux file sizes? Many references claim that "computer file sizes" follow Benford's Law, but only if you don't account for "file type". To test that claim, I wrote a program to gather data on Linux file sizes.
Benford's Law appears to be true for Linux file sizes.
I wrote a C-language program so as to get numbers in a finite amount of time. It uses the C standard library function sprintf() to create a string representation of each file's size, then tallies the first character of that string.
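The string-based digit extraction described above might look roughly like this. It is a minimal sketch, not the original program: the names `leading_digit` and `tally_size` are made up for illustration, it uses snprintf() (the bounds-checked sibling of sprintf()), and skipping zero-byte files is my assumption, since 0 has no place in Benford's distribution.

```c
#include <stdio.h>

/* Hypothetical helper: return the leading decimal digit of a file size.
 * Returns 0 for a size of 0 and -1 for a negative size. */
int leading_digit(long long size)
{
    char buf[32];   /* plenty of room for a 64-bit decimal number */

    if (size < 0)
        return -1;
    snprintf(buf, sizeof buf, "%lld", size);
    return buf[0] - '0';   /* first character of the string is the leading digit */
}

/* Hypothetical helper: add one file size to a 10-slot tally.
 * Slot 0 stays unused because zero-byte files are skipped (an assumption). */
void tally_size(long long size, long counts[10])
{
    int d = leading_digit(size);

    if (d >= 1)
        counts[d]++;
}
```

Calling `tally_size()` once per file and then dividing each of `counts[1]` through `counts[9]` by their sum gives the observed frequencies to compare against Benford's Law.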
The program uses the GNU version of ftw(), the file-tree-walk function declared in <ftw.h> (it comes from the system C library rather than the ISO C standard library), as the POSIX-compatible version will skip files and directory trees. Don't use the POSIX-compatible ftw().
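To show the general shape of a tree walk that tallies leading digits, here is a sketch built on plain ftw() from <ftw.h>. It does not reproduce whatever compile-time settings select the GNU behaviour mentioned above, and it is not the original program. The names `tally_file` and `count_tree` are hypothetical, the limit of 16 open descriptors is arbitrary, and counting only non-empty regular files is my assumption.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations from the headers */
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* ftw() callbacks take no user-data pointer, so the tally lives in a static. */
static long digit_counts[10];

/* Hypothetical callback: called by ftw() once per object in the tree. */
static int tally_file(const char *path, const struct stat *sb, int typeflag)
{
    (void)path;   /* the path itself is not needed, only the stat data */

    /* Count only regular files with a non-zero size (an assumption). */
    if (typeflag == FTW_F && S_ISREG(sb->st_mode) && sb->st_size > 0) {
        char buf[32];
        snprintf(buf, sizeof buf, "%lld", (long long)sb->st_size);
        digit_counts[buf[0] - '0']++;
    }
    return 0;     /* returning 0 tells ftw() to keep walking */
}

/* Hypothetical wrapper: walk the tree at root and copy the tallies out.
 * Returns whatever ftw() returns (0 on success, -1 on error). */
int count_tree(const char *root, long counts_out[10])
{
    int rc;

    memset(digit_counts, 0, sizeof digit_counts);
    rc = ftw(root, tally_file, 16);   /* 16 = max directories held open */
    memcpy(counts_out, digit_counts, sizeof digit_counts);
    return rc;
}
```

A caller would pass a starting directory such as "/usr" to `count_tree()` and then print `counts_out[1]` through `counts_out[9]`. Because of the static tally, this sketch is not safe to call from several threads at once.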