Practical Techniques in Serial Number Analysis

Here’s a 70-year-old paper that’s worth a closer look:

Leo A. Goodman (1954) Some Practical Techniques in Serial Number Analysis, Journal of the American Statistical Association, 49:265, 97-112

It’s a little hard to find a PDF of this paper, as Taylor & Francis Group jealously guard access to the Journal of the American Statistical Association. As of this writing, JSTOR has it, and you may be able to get access via a public library, or some college or university library. T&F want $64.00 for it.

The author, Leo A. Goodman, wrote a ton of papers over a 70-year academic career. He wrote his doctoral thesis in 1948, so this isn’t even his first big paper. It was written just as digital computers were coming into any kind of use at all.

I’m not knowledgeable about statistics. I was an Aerospace Engineering major in college, most of my college math was calculus, and since college I’ve been interested in foundations of math. I’m groping my way through this.

I see Goodman explaining his practical techniques in sections 3.1 and 3.2. Section 3.1 deals with the case where the initial serial number is known, and section 3.2 with the case where it is unknown. The algebra appears simple enough, but Goodman has a reluctance to do calculations, which seems very odd to the 21st Century eye, which is attached to someone with access to lots of numerical calculation power. Goodman is only OK at expository writing, which I guess I should be grateful for, but his lack of numbered equations and his references to tables in papers from the bibliography make him harder to follow.

A lot of papers that have impenetrable mathematical prose can be understood through the example(s) in them. The example that Leo A. Goodman gives is pretty simple. Thirty-one serial numbers from bookshelves and desks, “pieces of equipment” in use in his department at the University of Chicago.

Example section of Practical Techniques paper

Goodman’s data is 31 serial numbers in 3 lines of text,

83, 135, 274, 380, 668, 895, 955, 964, 1113, 1174, 1210, 1344, 1387, 1414,
1610, 1668, 1689, 1756, 1865, 1874, 1880, 1936, 2005, 2006, 2065, 2157, 2220,
2224, 2396, 2543, 2787

Goodman figures out two basic things about the data:

  1. How many serial numbered items were purchased?
  2. How close to a uniform distribution is this group of serial numbers?

After some re-reading, I’ve decided that Goodman is not assuming the initial serial number is 1 or 0. He uses the methods outlined in section 3.2.1 to calculate these things. Those methods do not assume a lower bound on serial numbers.

How many serial numbered items?

Goodman doesn’t start with a symbolic equation, note what each symbol means, fill in values, then do arithmetic. He just dives in with d 32/30 = (2787 - 83)32/30 = ... = 2884.31. Goodman mentions that we see this from Sec 3.2.1 of the paper. Section 3.2.1 starts on page numbered 105 in the PDF I have. Footnote 2 on page numbered 106 appears to be where we see this.

p = d(n+1)/(n-1)

  • p is the “total production”, the number of serial numbered items.
  • d is the difference between the maximum and minimum serial numbers
  • n is the number of serial numbers.

p = (2787 - 83)(31 + 1)/(31 - 1) = (2787 - 83)32/30

Footnote 2 indicates that this formula is from Goodman’s reference 3, Goodman, Leo A., “Serial number analysis,” Journal of the American Statistical Association, 47 (1952), 622-34. I guess in the 1950s, you kept your back issues of the Journal on the shelf, handy for references like this.

The closest I can find to the equation for p above is on the page numbered 626 of “Serial number analysis”.

f = d(n + 1)/(n - 1) - 1

In that paper, Goodman calls f the minimum-variance unbiased estimator of p, the “total production”.

d is the difference between the maximum and minimum serial numbers, and there are 31 serial numbers in Goodman’s data, which is the value of n.

f = (2787 - 83)(31 + 1)/(31 - 1) - 1 = (2787 - 83)32/30 - 1 = 2883.3

Goodman says that the University of Chicago had 2885 serial numbered pieces of equipment. His estimate is remarkably close.
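The arithmetic for both forms of the estimate is easy to check by machine. Below is a minimal Go sketch, not taken from either paper, that evaluates p = d(n+1)/(n-1) and f = d(n+1)/(n-1) - 1 on the 31 serial numbers above. The names goodmanSerials, spread, estimateP and estimateF are mine, picked for illustration.

```go
package main

import "fmt"

// goodmanSerials is the 31-number sample quoted from Goodman (1954).
var goodmanSerials = []int{
	83, 135, 274, 380, 668, 895, 955, 964, 1113, 1174, 1210, 1344, 1387, 1414,
	1610, 1668, 1689, 1756, 1865, 1874, 1880, 1936, 2005, 2006, 2065, 2157, 2220,
	2224, 2396, 2543, 2787,
}

// spread returns d, the difference between the largest and smallest serial
// number. It assumes the slice is not empty.
func spread(serials []int) int {
	lo, hi := serials[0], serials[0]
	for _, s := range serials[1:] {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	return hi - lo
}

// estimateP evaluates p = d(n+1)/(n-1), the form used in the 1954 paper.
// It assumes at least two serial numbers.
func estimateP(serials []int) float64 {
	n := float64(len(serials))
	return float64(spread(serials)) * (n + 1) / (n - 1)
}

// estimateF evaluates f = d(n+1)/(n-1) - 1, the form found in the 1952 paper.
func estimateF(serials []int) float64 {
	return estimateP(serials) - 1
}

func main() {
	fmt.Printf("d = %d, n = %d\n", spread(goodmanSerials), len(goodmanSerials))
	fmt.Printf("p = %.2f\n", estimateP(goodmanSerials))
	fmt.Printf("f = %.2f\n", estimateF(goodmanSerials))
}
```

Evaluated this way, p comes out at about 2884.27 and f at about 2883.27. That is a few hundredths away from the 2884.31 printed in the paper, and I can’t say where the difference comes from.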

How close to a uniform distribution is this group of serial numbers?

Back to Goodman (1954). In Section 5, An Application, Goodman writes:

From Figure 5.1 we see that the maximum absolute difference between the two cumulative distributions is (9.65 - 5)/29 =.16. If the serial numbers obtained are a random sample from a population of uniformly distributed serial numbers, then there is more than a 1-.68280 =.3172 probability of obtaining a maximum absolute difference of .16 or larger (see page 428, Table 1, N = 29, in [1]).

Figure 5.1, Goodman, 1954

There’s a lot going on here. As near as I can tell from Figure 5.1, the “cumulative distribution” is the (0-indexed) position of a serial number in a sorted list of serial numbers. That is, the X-axis values are serial numbers, and the Y-axis values are the count of serial numbers that appeared before a given serial number in the sorted list.

The way I read Goodman’s application, 9.65 is the vertical distance from the “step” in the cumulative distribution that has Y-value 5 to the diagonal line. Serial number 895 is the 6th serial number, with 5 before it in the list.

The diagonal line runs from (83,0) to (2787,30), so its slope is 30/(2787 - 83) = 0.0110947 and its equation is y = 0.0110947x - 0.920858

At x = 895, y = 0.0110947*895 - 0.920858 = 9.01, not 9.65.

I can’t explain this. I know that people were really into analog methods in the 1950s, but I hesitate to measure the distance from the serial number 895 to the line with a ruler. I’d be measuring a laser print of a scan of a typeset graph that has a boo boo in the Y-axis scale. That’s no way to go through life.

If 9.65 is a mistake, Goodman’s calculation becomes: (9.01 - 5)/29 = .14
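To take hand rounding out of the picture, here is a minimal Go sketch that builds the diagonal from its two endpoints and evaluates it at a given serial number. The helper name lineThrough is mine, and the endpoints (83,0) and (2787,30) are my reading of Figure 5.1, not something Goodman states.

```go
package main

import "fmt"

// lineThrough returns y(x) for the straight line through (x0, y0) and (x1, y1).
// It assumes x0 != x1.
func lineThrough(x0, y0, x1, y1 float64) func(float64) float64 {
	slope := (y1 - y0) / (x1 - x0)
	return func(x float64) float64 { return y0 + slope*(x-x0) }
}

func main() {
	// Assumed endpoints: smallest serial at height 0, largest at height 30.
	diagonal := lineThrough(83, 0, 2787, 30)

	// 895 is the serial number discussed in the text; 955 is the next one
	// in the sorted list, included only for comparison.
	for _, x := range []float64{895, 955} {
		y := diagonal(x)
		fmt.Printf("x = %4.0f  y = %.2f  (y - 5)/29 = %.2f\n", x, y, (y-5)/29)
	}
}
```

With the slope computed directly as 30/2704, the line sits at roughly 9.01 at x = 895, still well short of 9.65. At x = 955, the next serial number, it sits near 9.67, and (9.67 - 5)/29 rounds to .16. Whether that is the point Goodman actually measured is only a guess.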

The next part is more involved:

... then there is more than a 1-.68280 =.3172 probability of obtaining a maximum absolute difference of .16 or larger (see page 428, Table 1, N = 29, in [1]).

Reference 1 is Birnbaum, Z. W., “Numerical tabulation of the distribution of Kolmogorov’s statistic for finite sample size”, Journal of the American Statistical Association, 47 (1952), 425-41.

This paper is harder to find, but I scrounged up a copy.

Fragment of Table 1, Birnbaum (1952)

That’s the part of Table 1 that has the column for N = 29. You can see a faintly highlighted value of .68280 in the row for c = 5. I know where the value .68280 comes from.

Birnbaum’s 1952 paper is odd to the 21st Century eye. It’s a tabulation of some runs of a computer program. The computer in question was fairly primitive, the SWAC. There must have been some value in this tabulation, otherwise valuable time on the SWAC wouldn’t have been allotted. Birnbaum doesn’t do a great job of writing down exactly what the program did, but he’s careful to say how he eliminated rounding errors and other numerical concerns.

The c = 5 row is a potentially confusing choice. There are two possibilities for what c stands for, because the serial number 895 is at step 5, and 9.65 - 5 = 4.65, which is almost 5. Birnbaum also does not define c. From a close read, I think Birnbaum’s c is the difference between the cumulative distribution and the continuous distribution. The examples in Birnbaum’s paper do some interpolating. I think that Goodman is using the c = 5 row because he got a maximum of 4.65 between the continuous (the line) and cumulative distributions, not because that maximum is at the serial number with a cumulative distribution function value of 5.

Goodman’s choice of the N = 29 section of Birnbaum’s table vs using N = 31 also puzzled me. I think this is because Goodman uses two of the serial numbers, as the points (83,0) and (2787,30), to define the line of the theoretical continuous distribution, so they don’t count as “samples” in some sense. Section 3.2.1 also subtracts 2 from the number of samples, but so casually that it must be some widely-known statistical thing.

Birnbaum uses “Empirical distribution function” where Goodman uses “cumulative distribution”. Birnbaum’s description uses “1-indexing”, (83,1) instead of (83,0). But that’s OK: Goodman’s assumed continuous distribution uses (83,0) as one point, so it’s not in the empirical (cumulative) distribution. (135, 1) is the first point in the cumulative distribution.

That leaves how Goodman decided his cumulative distribution was uniformly distributed.

In section 3.2.4, page numbered 108, Goodman writes:

For example if n =31, the sample cumulative distribution of the n-2=29 adjusted serial numbers obtained can be graphed. The maximum absolute difference between this sample cumulative and the cumulative of the uniform distribution (the diagonal line) can then be determined. From Table 1 (N = 29) on page 428 of [1], we note that the probability is .98076 that this maximum absolute difference between the cumulatives will be less than 8/29. Hence, if a test is to be performed at the .01924 level of significance, the hypothesis of randomness and consecutive (uniformly distributed) serial numbers will be accepted whenever the maximum absolute difference between the cumulatives is less than 8/29.

Reference 1 is the Birnbaum paper.

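In other words, for 31 serial numbers Goodman accepts the hypothesis of uniformly distributed serial numbers whenever the largest difference between the diagonal line and the cumulative distribution steps is less than 8 (that is, 8/29 on his normalized scale).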
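Here is a minimal Go sketch of that check as I read it: sort the sample, draw the diagonal from (min, 0) to (max, n-1), and compare the largest vertical gap against 8. The function name maxStepDiff and the 0-indexed step convention are my assumptions. Goodman works with n-2 = 29 adjusted serial numbers, so this may not match his construction exactly.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// maxStepDiff sorts a copy of serials, draws the diagonal from (min, 0) to
// (max, n-1), and returns the largest absolute gap between the diagonal and
// the 0-indexed position of each serial number, along with the serial number
// where that gap occurs. It assumes at least two distinct serial numbers.
func maxStepDiff(serials []int) (float64, int) {
	s := append([]int(nil), serials...)
	sort.Ints(s)
	n := len(s)
	lo, hi := float64(s[0]), float64(s[n-1])

	best, at := 0.0, s[0]
	for i, v := range s {
		onLine := float64(n-1) * (float64(v) - lo) / (hi - lo)
		if gap := math.Abs(onLine - float64(i)); gap > best {
			best, at = gap, v
		}
	}
	return best, at
}

func main() {
	goodman := []int{
		83, 135, 274, 380, 668, 895, 955, 964, 1113, 1174, 1210, 1344, 1387, 1414,
		1610, 1668, 1689, 1756, 1865, 1874, 1880, 1936, 2005, 2006, 2065, 2157, 2220,
		2224, 2396, 2543, 2787,
	}

	// Goodman's 8/29 cutoff, expressed in steps (an assumption about units).
	const threshold = 8.0

	gap, at := maxStepDiff(goodman)
	fmt.Printf("max gap %.2f at serial %d, accepted as uniform: %t\n",
		gap, at, gap < threshold)
}
```

Under this convention Goodman’s own sample has its largest gap, about 4, at serial number 895, comfortably inside the cutoff.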

Can this method really detect non-uniform distributions of serial numbers?

I wrote a program that simulates a sample of serial numbers from a range using the Go standard package math/rand. I gave it one option to choose “serial numbers” from a uniform distribution, one option to choose from a normal distribution.

Code repo

Uniform Serial Number Distribution

Example simulated serial number sample

The figure above contains a plot of cumulative and continuous distributions for one run of my simulator. I had it choose 31 “serial numbers” at random, values up to 2885. This particular run had a maximum distance from cumulative to continuous distribution of 4.354 at serial number 981. That’s well under the value of 8 below which Goodman accepts a sample as uniformly distributed. Using Goodman’s equation, the estimated total production is 3013.33 pieces of equipment. There’s a green vertical line where the maximum difference between cumulative and continuous distributions occurs.

another re-creation of Goodman’s figure 5.1

Above, for comparison, is Goodman’s figure 5.1 to the same scale. Another vertical green line shows where Goodman calculated the maximum difference between cumulative and continuous distributions.

If I have my program do 30 repetitions of a 31-serial-number sample with a maximum serial number of 2885, I get a mean estimated total production of 2858.6, with a minimum of 2459 and a maximum of 3044. The mean difference between continuous and cumulative distributions is 3.99, with a minimum of 1.69 and a maximum of 9.30. That maximum is the sole simulation that had a difference of more than 8. Lots of other repetitions have convinced me that very few simulations exceed the value of 8.

Normal Serial Number Distribution

I decided to use a normal serial number distribution to see if Goodman’s criteria for uniform distribution would fail. Picking different means and standard deviations for the normal distribution lets me simulate choosing serial numbers from the lower or upper ends of the range at will.

Example simulated normal distribution serial number sample

Above, a simulation of 31 serial numbers in a sample, drawn from a normal distribution with a mean of 500 and a standard deviation of 500. X and Y axes are the same scale as the example uniform distribution above. The estimated total production is 1838.93. The maximum difference between distributions is 8.06, at serial number 955, so this sample does not pass Goodman’s criterion.

One hundred repetitions of the simulation with a mean serial number of 500 and a standard deviation of 500 give me a mean estimated total production of 1533.5, with a low of 1153 and a high of 2266. The mean difference between distributions is 6.70. Thirty-three of the simulations have a difference between distributions above 8, failing Goodman’s criterion for a uniform distribution.

The same sort of thing is true for other choices of mean and standard deviation. The estimated total production is low (relative to 2885), and a good many of the samples pass Goodman’s test for being a uniform distribution sample.

My conclusion is that either the Kolmogorov statistic test isn’t great in this situation, or that Goodman picked a distribution difference that’s too large.

The only other alternative is that my knowledge of statistics isn’t up to this task.

Connections

Goodman himself was too young to have participated in WW2: he was born in 1928 and had turned 17 a few days before V-J Day. There’s no way he was involved in any parallel construction of data performed to avoid disclosing that the UK had cracked Enigma encryption. Goodman doesn’t cite or mention Ruggles and Brodie in this paper, but he does cite his own 1952 paper, “Serial number analysis”, which uses Ruggles and Brodie as motivation.