Estimating the total production count from a few random serial numbers

Posted on

Problem

This is basically my attempt to the German Tank Problem in which we are given a few serial numbers and expected to return an estimation of the total production count. Such as, if 4 tanks are captured with serial numbers 19, 37, 42 and 60, return the total number of German tanks there.

For a given number of serial numbers it basically works as follows:

  • Take a case of 1000 units production and get the given serials array’s length many random serial numbers.
  • Sort them in ascending order and map the difference of the serial numbers so [19, 37, 42, 60] would result [19, 18, 5, 18] (differentiation)
    • Calculate the average of differences
    • Divide 1000 by the average diff and get a multiplier
    • Repeat the above routine 1000 times and calculate an average multiplier
    • Apply the obtained multiplier to the given serial numbers array

As per this example we should expect Germans to have 80 tanks in total or so. Do you think it’s a reasonable approach?

function estimateTotalTankCount(serials){

  Array.prototype.shuffle = function(){
    var i = this.length,
        j;
    while (i > 1) {
      j = ~~(Math.random()*i--);
      this[i] = [this[j],this[j]=this[i]][0];
    }
    return this;
  };

  var getDiff = m => Array(1000).fill()                            // make an array of size 1000 and fill with undefined
                                .map((_,i) => i+1)                 // fill the array with serial numbers from 1 to 1000
                                .shuffle()                         // shuffle the array
                                .slice(0,m)                        // get first `m` many serial numbers
                                .sort((a,b) => a-b)                // sort them ascending
                                .map((e,i,a) => e - (a[i-1] || 0)) // map the difference of serial number with the previous
                                .reduce((p,c) => p + c/m,0),       // calculate the average of differences
   multiplier = m => Array(1000).fill()                            // create 1000 test cases
                                .map(e => 1000/getDiff(m))         // find multipliers of each tests
                                .reduce((p,c) => p + c/1000,0);    // calculate the average of the multipliers
   
  return serials.sort((a,b) => a - b)                              // sort them in ascending order
                .map((e,i,a) => e - (a[i-1] || 0))                 // map the difference of serial number with the previous
                .reduce((p,c,i,a) => p + c/a.length,0)             // calculate average of serial number differences
                *multiplier(serials.length);                       // get multiplier for serials.length many serials and multiply by avg diff
}

console.log(Math.round(estimateTotalTankCount([19, 37, 42, 60])));
console.log(Math.round(estimateTotalTankCount([117, 232, 122, 173, 167, 12, 168, 204, 4, 229 ])));

I have decided to add the functional code as well. It’s a pretty simple algorithm. We are given a sorted [s1, s2, s3, s4, s5] array and we will get the differential of it [d1, d2, d3, d4, d5] as follows;

 <---d1---><--d2--><-d3-><---d4---><d5><---dn--->
[.........s1......s2....s3........s4..s5........n]

Then we get the mean of the diff array. After that to reach the “n” figure we multiply it with serials.length + 1.

function estimateTotalProductionCount(serials){
   return Math.round(serials.sort((a,b) => a - b)                  // sort them in ascending order
                            .map((e,i,a) => e - (a[i-1] || 0))     // map the difference of serial number with the previous
                            .reduce((p,c,i,a) => p + c/a.length,0) // calculate average of serial number differences
                            *(serials.length+1));                  // the multiplier is serials.length + 1
}

console.log(estimateTotalProductionCount([117, 232, 122, 173, 167, 12, 168, 204, 4, 229 ]));

As you see the functional one gives 255 while the statistical one weighs towards 258.

Solution

Comma notation

It is not recommended to use the comma notation for the following reasons:

  • Adding or removing variables requires you to fix notation in the neighbourhood of that variable.
  • Messing this up, you drop variables into the global scope.

You can mitigate some of those risks by using strict mode ("use strict";), but the maintenance overhead is still there. If we look at your syntax for defining multiplier and getDiff, the comma behind .reduce(..) is there to carry over the var declaration 7 lines earlier. That is scary, because someone maintaining this needs to keep track of the entire chain, which increases the chance of messing something up somewhere.

let and const

You are using the fat arrow syntax for functions. Keep in mind that this syntax is not supported in any version of Internet Explorer. Since you use ES6 syntax, consider using let and const instead of var. let does not get hoisted, and does not escape the context of, for example, for-loops. const is for constants, which means you can’t accidentally assign a new value to such a variable.

Shuffle

This looks like an attempt for code golf, rather than an attempt to make this code run as fast as possible.

  • Using pre/post increment/decrement may cause issues for people trying to understand the code. Instead write it on two lines. Space does not come at a premium.
  • Your usage of a list, and list lookup to swap two items performs worse than a readable alternative where we store one of the values in a temporary variable instead.
  • Your usage of ~~(val) to floor a number is correct in this case, and performs slightly better than Math.floor(..). Keep in mind that this implementation breaks when an Array contains 231 or more elements, as ~~(Math.pow(2,31)) === -2147483648. Values of 232 and up will always return the last 32 bits interpreted as a signed 32-bit integer.

The following alternative runs in 77% of the time than your codegolfed version does in my tests (average of 100 runs on array with 100.000 items). i and j must be declared as var for optimal performance, at least when I tested it. You can declare tmp inside the while-loop with let for a similar performance.

Array.prototype.shuffle = function(){
  var i = this.length;
  var j;
  var tmp;
  while (i > 1) {
    j = ~~(Math.random()*i);
    i -= 1;
    tmp = this[j];
    this[j] = this[i];
    this[i] = tmp;
  }
  return this;
};

Keep in mind that Math.random() does not return cryptographically secure random numbers. This means that your shuffled arrays might be somewhat predictable, skewing results in a statistical analysis. You might want to look into window.crypto, even though that seems to be a pseudo random number generator too.

Do you think it’s a reasonable approach..?

I have no idea. The only way to find it out is to write tests. Your code only contains one test and it doesn’t tell you much.

Testing this is not rocket since. Create sets of unique random numbers up to an upper bound which you’d consider the exact result. Then shove those numbers into your algorithm. What does the result look like? What’s the error between the result and the correct value?

What’s the standard deviation of the error when you try this with 10000 samples?
How does the standard deviation change depending on the total tank number? How does it change depending on the number of known serial numbers?

How stable is the algorithm against change of distributions of serial numbers? What if you pick your serial numbers with something other than an even distribution? Say a normal distribution? Or some other distribution that picks preferably serial numbers only from certain range of serial numbers?

The review comments are inline after ///. A summary of the review follows the code

/// Overall:
/// - Use let/const over var. Declare functions as const.
/// - Bring the nested functions out at module scope. 
///   This reduces function size and noise. Also has small perf benefits. 
///   Note that this doesn't apply to functions requiring scope closure.

/// Add JSDoc to help understand the signature of the functions.
function estimateTotalTankCount(serials){
  /// Add strict mode, 'use strict';

  /// Prefer keeping shuffle as a module level function instead of 
  /// overloading native types. This is a contained exercise, however
  /// actual usage would have leaked this function to the global
  /// prototype, as a cleanup is absent in the code.
  Array.prototype.shuffle = function(){
    /// One declaration per line helps readability and prevents
    /// some hard to catch typo induced errors.
    var i = this.length,
        j;
    while (i > 1) {
      /// Please explain this trick, preferably with a MDN article link.
      j = ~~(Math.random()*i--);
      /// Prefer array de-structuring for swapping elements.
      this[i] = [this[j],this[j]=this[i]][0];
    }
    return this;
  };
  /// Prefer a summary comment over method-level explanations.
  /// Code Comments should answer the Why over What (JSDoc
  /// answers the What). E.g. the below could be summarized as -
  /// Prepare the sample 1000 serials and calculate the average
  /// difference.
  var getDiff = m => Array(1000).fill()                            // make an array of size 1000 and fill with undefined
                                .map((_,i) => i+1)                 // fill the array with serial numbers from 1 to 1000
                                .shuffle()                         // shuffle the array
                                .slice(0,m)                        // get first `m` many serial numbers
                                .sort((a,b) => a-b)                // sort them ascending
                                .map((e,i,a) => e - (a[i-1] || 0)) // map the difference of serial number with the previous
                                .reduce((p,c) => p + c/m,0),       // calculate the average of differences
   multiplier = m => Array(1000).fill()                            // create 1000 test cases
                                .map(e => 1000/getDiff(m))         // find multipliers of each tests
                                .reduce((p,c) => p + c/1000,0);    // calculate the average of the multipliers
  /// Please add a nullability check for serials. Prefer ?. operator.
  return serials.sort((a,b) => a - b)                              // sort them in ascending order
                .map((e,i,a) => e - (a[i-1] || 0))                 // map the difference of serial number with the previous
                .reduce((p,c,i,a) => p + c/a.length,0)             // calculate average of serial number differences
                *multiplier(serials.length);                       // get multiplier for serials.length many serials and multiply by avg diff
}

/// Math.round should be done within the estimateTotalTankCount
/// function prior to returning the results.
console.log(Math.round(estimateTotalTankCount([19, 37, 42, 60])));
console.log(Math.round(estimateTotalTankCount([117, 232, 122, 173, 167, 12, 168, 204, 4, 229 ])));

Overall:

  • Use let/const over var. Declare functions as const.
  • Bring the nested functions out at module scope. This reduces function size and noise. Also has small performance benefits. Note that this doesn’t apply to functions requiring scope closure.

Add JSDoc to help understand the signature of the functions.

function estimateTotalTankCount(serials){  

Add strict mode

‘use strict’;

Prefer Shuffle at the Module Level

Prefer keeping shuffle as a module level function instead of overloading native types. This is a contained exercise, however actual usage would have leaked this function to the global prototype, as a cleanup is absent in the code.

Prefer Summary Comments

Prefer a summary comment over method-level explanations. Code Comments should answer the Why over What (JSDoc
answers the What).

Location of use of Math.round

Math.round should be done within the estimateTotalTankCount function prior to returning the results.

Leave a Reply

Your email address will not be published. Required fields are marked *