Transcoding of a large histogram

Posted on

Problem

I’m currently working on transcoding a large histogram so that the lessened file can be used within a viewer style GUI. The transcoding works by using Tokens to define runs of bins.

For example, Token C (below) defines any number above 255 (or 8 bits) up to 16 bit) or 63535) of 0 height bins in a row.

At the moment, my choke point is the transcoding of the data elements I have defined from the histogram.

Firstly, a large histogram is turned into a list of encodingData objects Which describe ‘runs’ of bins on the histogram.

Code for the EncodingData class:

/** Holds data for the run-length encoding
*
*/
public class EncodingData {


/**
 *  The number of bins in the run
 */
private int length;

/**
 *  the frequency of the bins in the run
 */
private int frequency;

public EncodingData(int length, int frequency) {
    this.length = length;
    this.frequency = frequency;
}

public int getLength() {
    return length;
}

public void incrementLength() {
    this.length++;
}

public int getFrequency() {
    return frequency;
}

public void setFrequency(int frequency) {
    this.frequency = frequency;
}
}

The Transcoder class takes in the large list of encoding data objects and turns it into a single string which is token-based. The first method determines how to handle the runs of encodingData objects. I have omitted some of the details of this class, so don’t worry about the inputs.

private StringBuilder specBuilder;

 public String transcode(Resolution resolution, List<EncodingData> encodingData, String rname, int startPOS, ParsedAPUCommand command, int lastPOS) {

     specBuilder = new StringBuilder();

    //loop through -> bin > 1 or a single bin of 1 -> use regular characters
    //runs of all 1's -> if all the H's are less than 255, use R
    //runs of all 1's -> if all the H's are less than 65535, use S.

    int posTracker = startPOS;
    int byteCount = 0;

    List<EncodingData> run = new ArrayList<>();
    for (EncodingData data : encodingData) {
        byteCount += data.getLength();

        if (byteCount >= 1000) {
            //append a new W element and reset the byteCount;
            posTracker += byteCount;
            byteCount = 0;
            specBuilder.append(String.format("W000000000000000000%010d", posTracker));
        }

        if (data.getLength() == 1) {
            //can continue the run.
            run.add(data);
        } else {
            //must stop the run
            if (run.size() == 1) { //encode the single run = 1 element
                transcodeUsingRegularElements(run.get(0));
                run.clear();
            }else{
                transcodeSpecialOrMultipleRegularElements(run);
                run.clear();
            }

        //transcode the single greater than 1 bin run.
        transcodeUsingRegularElements(data);
        }


    }

    //encode last if run is not empty
    if (run.size() > 1) {
        //must encode the current run list and then encode the single greater than 1 list
        //count the heights, if less than 255, special R, if less than 65535 but more than 255 use S.
        //if greater than 65535 -> split the list until it fits and encode in multiple Special characters
        transcodeSpecialOrMultipleRegularElements(run);
        run.clear();
    } else if (run.size() == 1){ //encode the single run = 1 element
        transcodeUsingRegularElements(run.get(0));
        run.clear();
    }

    //append the escape sequence
    specBuilder.append("000000000000000000");


    return specBuilder.toString();
}


  /** Takes in a run of multiple encoding data points and finds the most efficient way of writing it
 * @param run list of encoding data points
 * @return String to append to the chunk
 */
private void transcodeSpecialOrMultipleRegularElements(List<EncodingData> run) {

    int SPECIAL_SIZE = 0;
    int SPECIAL_SWITCH = 0;

    //calculate the space required to encode the run using just special elements R and S
    //loop through an and see if all of the heights are less than 255, or less than 65535
    boolean lessThanOrEqual255 = true;
    boolean lessThanOrEqual65535 = true;

    for(EncodingData data : run){
        if(data.getFrequency() > 255) lessThanOrEqual255 = false;
        if(data.getFrequency() > 65535) lessThanOrEqual65535 = false;
    }

    //set the switch
    if(lessThanOrEqual255){
        SPECIAL_SIZE = 2 + run.size();
        SPECIAL_SWITCH = 1;
    }else if(lessThanOrEqual65535){
        SPECIAL_SIZE = 2 + 2*(run.size());
        SPECIAL_SWITCH = 2;
    }

    //Calculate the space required using just elements A, E,I, M and Z -> sizes (0,1,1,2,3) * (8 bits) respectively

    int REGULAR_SIZE = 0;

    if(SPECIAL_SWITCH != 0){ //if 0, frequencies greater than 65535, special not possible

        for(EncodingData data: run){
            int frequency = data.getFrequency();
            if(frequency <= 1) REGULAR_SIZE ++;
            else if(frequency <= 255) REGULAR_SIZE +=2;
            else if(frequency <= 65535) REGULAR_SIZE +=3;
            else REGULAR_SIZE += 7;
        }
    }

    //Determine and do the encoding
    if(SPECIAL_SWITCH == 0){
        //encode all in regular.
        for(EncodingData data : run){
            transcodeUsingRegularElements(data);
        }
    }else {
        if(SPECIAL_SIZE < REGULAR_SIZE){
            //encode in special
            if(SPECIAL_SWITCH == 1){
                //encode using a single R
                transcodeUsingSpecialElements(run, 'R');
            }else{
                //encode using a single S
                transcodeUsingSpecialElements(run, 'S');
            }
        }else{
            //encode all in regular
            for(EncodingData data : run){
                transcodeUsingRegularElements(data);
            }
        }
    }
}

/** Transcodes the run of encoding data using char element
 * @param run List of encoding data
 * @param element special element to use
 * @return String the transcoded string
 */
private void transcodeUsingSpecialElements(List<EncodingData> run, char element) {

    if(element != 'R' && element != 'S' ) return;

    int runCounter = 0;
    int sizeCounter = run.size();
    if(element == 'R'){
        for(EncodingData data : run){

            if(runCounter == 0){
                if(sizeCounter < 255){
                    specBuilder.append('R');
                    specBuilder.append(String.format("%03d", sizeCounter));
                }else{
                    specBuilder.append('R');
                    specBuilder.append("255");
                }
            }

            specBuilder.append(String.format("%03d", data.getFrequency()));
            runCounter++;

            if(runCounter == 255){
                runCounter = 0;
                sizeCounter -=255;
            }
        }
    }else { //S
        for (EncodingData data : run) {
            if (runCounter == 0) {
                if (sizeCounter < 255) {
                    specBuilder.append("S");
                    specBuilder.append(String.format("%03d", sizeCounter));
                } else {
                    specBuilder.append("S");
                    specBuilder.append("255");
                }
            }

            specBuilder.append(String.format("%05d", data.getFrequency()));
            runCounter++;

            if (runCounter == 255) {
                runCounter = 0;
                sizeCounter -= 255;
            }
        }
    }
}

/** Transcodes the encoding data point, spanning one or many bins
 * @param data encoding data object
 * @return String to append
 */
private void transcodeUsingRegularElements(EncodingData data) {

    int length = data.getLength();
    int height = data.getFrequency();

    //work out bits needed for both in
    int lengthBits = 0;
    int heightBits = 0;

    if(length == 0) lengthBits = 0;
    else if(length == 1) lengthBits = 1;
    else if(length <= 255) lengthBits = 8;
    else if(length <= 65535) lengthBits = 16;
    else if(length > 65536) lengthBits = 32;

    if(height == 0)  heightBits = 0;
    else if(height == 1)  heightBits = 1;
    else if(height <= 255) heightBits = 8;
    else if(height <= 65535) heightBits = 16;
    else if(height > 65536)  heightBits = 32;

    try {
        char token = determineToken(heightBits, lengthBits);
        specBuilder.append(token);
    }catch(InvalidTokenException ite){
        System.out.println(ite.getMessage());
        System.exit(0);
    }

    //get the maximum numbers for amount of bits - i.e for 16, 65335 (2^ bits - 1)

    specBuilder.append((length == 0 || length == 1) ? "" : String.format("%0"  + String.valueOf((int)(Math.pow(2, lengthBits))).length() + "d", length));
    specBuilder.append((height == 0 || height == 1) ? "" : String.format("%0" + String.valueOf((int)(Math.pow(2, heightBits))).length() + "d",  height));

}

/** Determines the token to use from the token enum
 * @param heightBits bits from the value
 * @param lengthBits bits from the length
 * @return char - the token
 * @throws InvalidTokenException token not found
 */
private char determineToken(int heightBits, int lengthBits) throws InvalidTokenException{

    //if height is 1 or 0, rewrite the bit size since we won't write that in the transcoding
    //find the correct token
    for(Token token : Token.values()){
        if(token.getHeight() == heightBits && token.getLength() == lengthBits) return token.name().charAt(0); //only one char per name.
    }

    throw new InvalidTokenException("Token does not exist");
}

the Token enum is defined below:

/** All lengths and heights in bits.
* All 1's are to be ignored in writing
* i.e 1 - 0 is transcoded as A.
* 1 -1 is transcoded as E
* 1 - 209 is transcoded as I209
* 1 - 2 is transcoded as I002
* 1 - 40000 is transcoded as M40000
* 1 - 290 is transcoded as M00290
*/
public enum Token {

A (1, 0),
B (8, 0),
C (16,0),
D (32,0),
E (1, 1),
F (8, 1),
G (16,1),
H (32,1),
I (1 ,8),
J (8, 8),
K (16,8),
L (32,8),
M (1,16),
N (8,16),
O (16,16),
P (32,16),
Q (16,32),
Z (1,32);


private final int length;
private final int height;


Token(int length, int height) {
    this.length = length;
    this.height = height;
}

public int getLength() {
    return length;
}

public int getHeight() {
    return height;
}

}

My issue is that significant amount of time of the process is spent looping around the list of encoding data elements in the transcode method.
Are there any obvious mistakes here which may be slowing the whole process down? I’ve been thwarted in the past by expensive String operations, so have tried to avoid these by holding a global StringBuilder and appending to that, to avoid String creation.

The code works, it’s just very slow for large files. An example of the size, is a histogram of 140,000,000 bins. This corresponds to ~ a list of 6,000,000 EncodingData objects.

If I can help explain anything further, please let me know.

Solution

First of all, I don’t think using a global StringBuilder variable is going to make a significant difference in performance (if at all), because it does not inhibit the creation of String objects per se, it just doesn’t cause them to be passed from one method to another. But you are still creating String objects with all your invocations of String.format(String, Object...), and whether you pass them to one of your own methods or directly to StringBuilder.append(String) is, in the end, only a matter of readability. Your javadoc comments reveal that you once had the transcoding methods return a String instead of void. I would go back to this approach, because then the methods would only access state that is directly passed to them as arguments, which I think would be a clearer code design than if global state is involved (by the way, your method transcode still has a local specBuilder variable that hides the field of the same name).

I know your question was about performance, but your code is so extremely complicated, probably much more complicated than it needs to be, that it would most likely be more than helpful to first clean up and simplify the code before thinking about performance, because then it will be much easier to spot opportunities where you can save performance.

The first thing I would rewrite is your implementation of the various tokens and the method to obtain the appropriate token for an EncodingData, because it doesn’t depend on anything else in your code. There is a system to your assignment of the letters to the various ranges, at least until the letter P, so instead of hard-coding every single token that falls into this system, you can write an algorithm that calculates the letter based on the system. The last two tokens are exceptions (for whatever reason), so you would need to handle them separately.

public class Token {

    private static final int[] upperBoundaries;

    static {
        upperBoundaries = new int[4];
        upperBoundaries[0] = (1 << 0) - 1;  //0
        upperBoundaries[1] = (1 << 1) - 1;  //1
        upperBoundaries[2] = (1 << 8) - 1;  //255
        upperBoundaries[3] = (1 << 16) - 1; //65535
    }

    /*
     * Gets the index of the upper boundary of the range
     * that the passed value falls into, i.e. 0 for 0 (and
     * all negative values), 1 for 1, 2 for 2-255, 3 for
     * 256-65535, and 4 for all values greater than 65535
     */
    private static int getIndexOfUpperBoundary(int value) {
        int index = Arrays.binarySearch(upperBoundaries, value);
        if (index < 0) {
            index = -(index + 1);
        }
        return index;
    }

    public static char determineToken(int height, int length) {
        if (height < 0) {
            throw new IllegalArgumentException("Invalid height: " + height);
        } else if (length < 1) {
            throw new IllegalArgumentException("Invalid length: " + length);
        } else if (height <= upperBoundaries[3]) {
            int heightIndex = getIndexOfUpperBoundary(height);
            int lengthIndex = getIndexOfUpperBoundary(length) - 1; //length cannot be 0, hence -1
            return (char) ('A' + heightIndex * 4 + lengthIndex);
        } else if (length > upperBoundaries[2] && length <= upperBoundaries[3]) {
            return 'Q';
        } else if (length == 1) {
            return 'Z';
        } else {
            throw new InvalidTokenException("Token does not exist");
        }
    }
}

This admittedly reduces Token to a utility class, but it also makes the method transcodeUsingRegularElements(EncodingData data) a lot more compact, because the responsibility of detecting the range in which height and length fall that determines the token would then lie in the Token class, which means that the transcoding method can simply pass the height and length as a parameter without having to worry about in which range these values fall. It might also be faster than your code, because it doesn’t have to loop through several enum values in order to find the correct one, but calculates the character directly from the height and length.

The next thing I would do is redesign the methods transcodeSpecialOrMultipleRegularElements(List<EncodingData>) and transcodeUsingSpecialElements(List<EncodingData>, char). If I understand your code correctly, these methods are only meant to operate on EncodingData objects that represent one-length-bin-runs. This means that the length of the EncodingData objects should never be passed to these methods in the first place, because the methods not only don’t depend on the lengths, but they would even produce incorrect results should an EncodingData with a length other than 1 be passed to these methods. So instead, I would make these two methods accept a List<Integer> rather than a List<EncodingData>, where the List<Integer> represents the frequencies of the single bins. This is only for readability, it doesn’t impact performance, but as I said, your code is so complicated that every opportunity to simplify it can help prevent headaches.

Now, let’s look at the method transcodeSpecialOrMultipleRegularElements(List<EncodingData>) in detail. I would omit the two boolean flags lessThanOrEqual255 and lessThanOrEqual65535 because they are interdependent, meaning that it is possible that they contradict each other (namely if lessThanOrEqual255 == true and lessThanOrEqual65535 == false). Using one integer as a switch, as you did, is sufficient. However, for the sake of readability, I would redesign this switch to store the maximum upper boundary, which could either be 255 or 65535, because right now, it is impossible to tell what the variable SPECIAL_SWITCH is for without reading through the whole method.

private static final NavigableSet<Integer> upperBoundaries;

static {
    upperBoundaries = new TreeSet<>();
    upperBoundaries.add((1 << 8) - 1);  //255
    upperBoundaries.add((1 << 16) - 1); //65535
}

/*
 * Calculates the binary logarithm rounded up to an integer
 */
private static int log2ceil(int value) {
    if (value <= 0) {
        throw new IllegalArgumentException("value must be positive");
    } else {
        return 32 - Integer.numberOfLeadingZeros(value - 1);
    }
}

/*
 * Let's assume the method accepts a List<Integer>
 * instead of a List<EncodingData>
 */
private String transcodeSpecialOrMultipleRegularElements(List<Integer> frequencies) {

    Integer maximumUpperBoundary = Integer.MIN_VALUE; //can be null, hence Integer and not int
    for (int frequency : frequencies) {
        if (frequency > maximumUpperBoundary) {
            maximumUpperBoundary = upperBoundaries.ceiling(frequency);
            if (maximumUpperBoundary == null) {
                break;
            }
        }
    }

    OptionalInt specialSize;
    if (maximumUpperBoundary == null) {
        specialSize = OptionalInt.empty();
    } else {
        int bytesNeededPerElement = (int) Math.ceil(log2ceil(maximumUpperBoundary + 1) / 8.0);
        specialSize = OptionalInt.of(2 + frequencies.size() * bytesNeededPerElement);
    }

    //...
}

Note that the loop determining the maximum upper boundary is terminated once a frequency greater than 65535 is encountered, because any subsequent frequency cannot affect the maximum upper boundary, whereas your code iterates through the whole list of EncodingData objects, even if there already has been an element with a frequency greater than 65535, so you can also save performance here. I also made specialSize an OptionalInt, so that it only contains a value if it is relevant (besides, uppercase variable names are conventionally only used for constants, i.e. immutable static final fields, so I changed the variable name to specialSize).

Next, about the method transcodeUsingSpecialElements(List<EncodingData> run, char element). Apart from the horrendous code duplication (the if(element == 'R') and the else block differ only in the token character and the width of the format String), I would redesign the method so that the width is also passed as a parameter, just like the character. That way, ascertaining the validity of the combination of the character and format String width is solely the responsibility of the caller, whereas right now, this responsibility is duplicated in both transcodeUsingSpecialElements and transcodeSpecialOrMultipleRegularElements, which makes the code confusing and fragile (fragile in the sense that, should you ever wish to swap R and S, you have to rewrite both methods, whereas if you pass the width as a parameter, you only have to rewrite transcodeSpecialOrMultipleRegularElements).

Also, there are some instances where small changes to the code can make it more concise and thereby easier to read. For example, this code:

runCounter++;

if(runCounter == 255){
    runCounter = 0;
    sizeCounter -=255;
}

Can be replaced with this:

runCounter = (runCounter + 1) % 255;
sizeCounter--;

Or this code:

if(runCounter == 0){
    if(sizeCounter < 255){
        specBuilder.append('R');
        specBuilder.append(String.format("%03d", sizeCounter));
    }else{
        specBuilder.append('R');
        specBuilder.append("255");
    }
}

Can be simplified to:

if (runCounter == 0) {
    specBuilder.append('R');
    specBuilder.append(String.format("%03d", Math.min(sizeCounter, 255)));
}

Besides, I would suggest renaming the variable sizeCounter to something like elementsRemaining, which would describe its purpose more clearly.

Finally, I think there is still a bug in the transcode method. By putting the line

transcodeUsingRegularElements(data);

at the end of the loop through the elements, single-bin-elements will be transcoded twice, because they are added to run and transcoded in the aforementioned line.

I hope this helps, even if most of my suggestions were not directly related to performance.

According to this, String.format() is a costly operation, which makes sense since the format has to be parsed (each time it is called). I would seek the help of some libraries like Apache StringUtils or Guava to do the necessary padding and then replace format with concatanation.

Also, I did not analyze the code, but if the big list can be split and parallel processing applied to it (perhaps using fork-join DP) then that would provide a big boost

Leave a Reply

Your email address will not be published. Required fields are marked *