Streaming variable length integers from a file

Problem

I wrote this script as a proof of concept for reading a file from disk as variable-length quantities (integers encoded 7 bits per byte, across as many bytes as required). I'm mostly making use of the file system ReadStream to read byte buffers from the file, and the varint package to decode the variable-length values.
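For context, a varint carries seven payload bits per byte and uses the high bit as a "more bytes follow" flag. This is a minimal sketch of that scheme, purely for illustration; it is not the varint package's actual implementation:

```javascript
// Encode an unsigned integer as a varint: low 7 bits per byte,
// high bit (0x80) set on every byte except the last.
function encodeVarint(n) {
  var bytes = [];
  while (n >= 0x80) {
    bytes.push((n & 0x7f) | 0x80); // low 7 bits, continuation bit set
    n = Math.floor(n / 128);
  }
  bytes.push(n); // final byte, high bit clear
  return Buffer.from(bytes);
}

// Decode a varint starting at `offset`, reading until the high bit is clear.
function decodeVarint(buf, offset) {
  var result = 0, shift = 0, byte;
  do {
    byte = buf[offset++];
    result += (byte & 0x7f) * Math.pow(2, shift); // avoids 32-bit overflow
    shift += 7;
  } while (byte & 0x80);
  return result;
}
```

Small values fit in one byte; for example, 300 (binary 100101100) encodes to the two bytes 0xac 0x02.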

This is my first attempt at working with Node.js. When I tested it on a 1 GB file read from an SSD, it took about 7 minutes, during which it used about 50 MB of RAM and saturated one CPU core.

So my question is: is there anything I'm doing here that could be optimised to reduce the CPU load?

var fs, vi, sys, rs, ints, total, remainder;
fs = require('fs');
vi = require('varint');
sys = require('sys');

rs = fs.createReadStream('/path/to/datafile');
total = 0;

// a buffer for carrying over incomplete values between chunks
remainder = new Buffer(0);

rs.on('data', function(chunk) {
  var buffer, value, byteLen;
  buffer = Buffer.concat([remainder, chunk]);
  // reset so a stale remainder is not prepended to later chunks as well
  remainder = new Buffer(0);

  while(buffer.length){
    // decode the value and check how many bytes it required
    value = vi.decode(buffer);
    byteLen = vi.decode.bytesRead;

    if (byteLen <= buffer.length){
      // keep the result
      total += value;
      // shift decoded bytes off the buffer
      buffer = buffer.slice(byteLen);
    } else {
      // value is incomplete, store remainder to be prepended to the next chunk
      remainder = new Buffer(buffer.length);
      buffer.copy(remainder);
      break;
    }
  }
});

// display grand total after the whole file has been read
rs.on('end', function() {
  sys.puts(total);
});

Solution

It turns out it was possible to increase the speed about 12 times by using the offset argument of varint.decode to traverse the buffer, rather than slicing the decoded bytes off the front of the buffer for each value.

rs.on('data', function(chunk) {
  var buffer, value, byteLen, offset;

  if (remainder) {
    buffer = Buffer.concat([remainder, chunk]);
    remainder = null;
  } else {
    buffer = chunk;
  }

  offset = 0;

  while(offset < buffer.length){
    // decode the value and check how many bytes it required
    value = vi.decode(buffer, offset);
    byteLen = vi.decode.bytesRead;

    if (offset + byteLen <= buffer.length){
      // keep the result and advance the position in the buffer
      total += value;
      offset += byteLen;
    } else {
      // value is incomplete, store remainder to be prepended to the next chunk
      remainder = new Buffer(buffer.length - offset);
      buffer.copy(remainder, 0, offset);
      break;
    }
  }
});

Much better.

UPDATED to incorporate suggestions.

As you found, the key is to reduce the number of buffer manipulations. You could reduce them further by inserting the remainder of the previous buffer at the beginning of the new chunk, assuming Buffer supports that operation and that concat doesn't already optimise this case by keeping the two buffers separate and bridging the gap during decode.

I’m not familiar with any of these libraries so I won’t attempt to code it up exactly.

on data
    if there is a remainder
        insert remainder before chunk
    process chunk
    calculate remainder but don't copy it to a new buffer
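One way that sketch could look in plain Node. The decodeVarint helper here is a hypothetical stand-in for the varint module (returning null for a value that runs past the end of the buffer), and buffer.slice returns a zero-copy view, so the remainder is never copied into a fresh buffer; the only copy happens in the concat at the top of the next call:

```javascript
// Stand-in decoder: returns { value, bytes }, or null if the varint is
// truncated at the end of the buffer.
function decodeVarint(buf, offset) {
  var result = 0, shift = 0, i = offset, byte;
  do {
    if (i >= buf.length) return null; // incomplete value
    byte = buf[i++];
    result += (byte & 0x7f) * Math.pow(2, shift);
    shift += 7;
  } while (byte & 0x80);
  return { value: result, bytes: i - offset };
}

var remainder = null, total = 0;

function onData(chunk) {
  // prepend the previous chunk's leftover bytes, if any
  var buffer = remainder ? Buffer.concat([remainder, chunk]) : chunk;
  remainder = null;

  var offset = 0;
  while (offset < buffer.length) {
    var d = decodeVarint(buffer, offset);
    if (d === null) {
      // slice() is a zero-copy view over the same memory, so nothing
      // is copied here -- we just remember where the partial value starts
      remainder = buffer.slice(offset);
      break;
    }
    total += d.value;
    offset += d.bytes;
  }
}
```

Whether this beats the copying version depends on chunk sizes: the view keeps the whole previous chunk's memory alive until the next concat, which is the usual trade-off with zero-copy slices.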
