# Karp-Rabin with bitwise hash

Posted on

Problem

As a Rags-to-Riches re-implementation of the Karp-Rabin question here, from @tsleyson, I thought it would be interesting to understand the algorithm better, and to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.

Karp-Rabin is a text-search algorithm that computes a rolling hash of the current text, and only if the rolling hash matches the search term’s hash, does it double-check that the search actually matches. Because the hash is computed in O(1)

$O\left(1\right)$

$O(1)$ time for each advance through the text, it is essentially an O(n)

$O\left(n\right)$

$O(n)$ search algorithm with an additional O(m)

$O\left(m\right)$

$O(m)$ confirmation check for each potential match that is found. The best case is O(n)

$O\left(n\right)$

$O(n)$ for no matches, and worst case is O(nm)

$O\left(nm\right)$

$O(nm)$ if everything matches (for example, searching for “aa” in “aaaaaaaaaaaaa”).

The description for the algorithm suggests using a ‘base’ prime number as the foundation for a hashing algorithm that allows you to ‘rotate’ an out-of-scope character off of the hash, and then add a new member to it. As you ‘roll’ the hash down the search text, if the hash matches the hash of your pattern, you can then double-check to see if it is just a collision, or a real match.

Instead of using a system requiring repeated multiplications by prime numbers, I decided to use a computationally simpler bit-shifting algorithm (though the logic may be more complicated, the computations should be simpler). In essence, the algorithm I use uses a long value (64-bit) as a hash, and it rotates characters on to the hash, and uses XOR bitwise logic to ‘add’ or ‘remove’ the characters.

I am looking for a review of the usability, and any performance suggestions.

import java.util.Arrays;

public class KarpRabin {

private static final int[] EMPTY = new int;

// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
private static final int SHIFT = 31;
// bits in the long hash value
private static final int BITS = 64;

private final char[] patternChars, textChars;
private final long patternHash;
private final int size;
private final int endPosition;
private final int tailLeftShift, tailRightShift;

private long textHash = 0L;
private int pos = 0;

/*
* Private constructor, only accessible from public static entry methods.
*/
private KarpRabin(String pattern, String text) {
// inputs pre-validated.
// inputs not null, not empty, and pattern is shorter than text.
patternChars = pattern.toCharArray();
textChars = text.toCharArray();
size = this.patternChars.length;
endPosition = this.textChars.length - size;

// calculate the bit-shifts needed to remove the tail of the search
// since the hash values are 'rotated' each time, we can count the
// rotations that will be needed (size - 1) before a char is at the
// tail, and that will be the relative left/right shifts needed.
// note that left and right shifts perform a 'rotation' when combined,
// essentially rotating the character by tailLeft bits to the left.
tailLeftShift = ((size - 1) * SHIFT) % BITS;
tailRightShift = BITS - tailLeftShift;

// Compute the hash of the pattern
// use a temp variable ph because patternHash is final
long ph = 0L;
for (char c : this.patternChars) {
}
patternHash = ph;

// seed the hash for the first X chars in the text to search.
textHash = 0L;
for (int i = 0; i < size; i++) {
}

// advance the search to the first 'hit' (if there is one).
// check to make sure we are not already at a match first.
if (textHash != patternHash || !confirmed()) {
}
}

/*
* Shift the existing hash, and XOR in the new char.
*/
private final long addHash(long base, char c) {
return (base << SHIFT | base >>> (BITS - SHIFT)) ^ c;
}

/*
* Shift the char to remove to the place it would be at the tail, and XOR it
* off.
*/
private final long removeHash(long base, char c) {
long ch = c;
// this is essentially a rotation in a 64-bit space.
ch = ch << tailLeftShift | ch >>> tailRightShift;
return base ^ ch;
}

/*
* Return the next match, and advance the search if needed.
*/
private int next() {
if (pos > endPosition) {
return -1;
}
int ret = pos;
return ret;
}

/*
* move the position to the next 'hit', or the end of the search text.
*/
while (++pos <= endPosition) {
// remove the tail char from the hash
textHash = removeHash(textHash, textChars[pos - 1]);
textHash = addHash(textHash, textChars[pos + size - 1]);
// check to see if we have a match.
if (textHash == patternHash && confirmed()) {
return;
}
}

}

/*
* Double-check a hash-hit. Assumes the hash computes are equal.
*/
private boolean confirmed() {
for (int i = 0; i < size; i++) {
if (patternChars[i] != textChars[pos + i]) {
return false;
}
}
return true;
}

/**
* Identify whether the pattern appears in the given text.
* <p>
* Null input values will return false, and an empty-string pattern will
* return true (the empty string is a match in any other non-null string)
*
* @param pattern
*            the pattern to search for
* @param text
*            the text to search in
* @return true if the pattern exists in the text.
*/
public static boolean match(final String pattern, final String text) {
if (pattern == null || text == null || pattern.length() > text.length()) {
return false;
}
if (pattern.isEmpty()) {
return true;
}

KarpRabin kr = new KarpRabin(pattern, text);
return kr.next() >= 0;
}

/**
* Identify all locations where the pattern exists in the search text.
* <p>
* Null input values will result in no matches, and the empty search pattern
* will be located at all positions in the search text.
*
* @param pattern
*            the pattern to search for
* @param text
*            the text to search in
* @return the integer positions of each found occurrence. An empty array
*         will be returned if there are no matches.
*/
public static int[] allMatches(String pattern, String text) {
if (pattern == null || text == null || pattern.length() > text.length()) {
return EMPTY;
}
if (pattern.isEmpty()) {
// empty string is found everywhere
int[] ret = new int[text.length()];
for (int i = 0; i < ret.length; i++) {
ret[i] = i;
}
return ret;
}

KarpRabin kr = new KarpRabin(pattern, text);
// guess the size first, and expand/trim it as needed.
int[] possibles = new int[text.length() / pattern.length()];
int count = 0;
int pos = 0;
while ((pos = kr.next()) >= 0) {
if (count >= possibles.length) {
// expand the matches (a lot). This indicates overlapping results.
// which is unusual.
possibles = Arrays.copyOf(possibles, possibles.length * 2 + 2);
}
possibles[count] = pos;
count++;
}
// trim the matches.
return Arrays.copyOf(possibles, count);
}
}


I have been testing it against some input data similar to @tsleyson’s, and then added in a performance test too, that checks the scalability of a non-matching case.

private static final void scalabilityTests() {
String input = "abcdefghijklmnopquestuvwxyz";
long[] times = new long;
for (int i = 0; i < times.length; i++) {
int sum = 0;
long nanos = System.nanoTime();
for (int j = 0; j < 5000; j++) {
sum += allMatches("abcdefghx", input).length;
}
times[i] = sum + System.nanoTime() - nanos;
// double the times....
input += input;
}
System.out.println("Scalability (should double each time): " + Arrays.toString(times));
}

public static void main(String[] args) {
String[] inputs = { "", null, "abr", "abra", "abrabrabra",
"abbbbrrraaabbbccrrabbba" };
for (String inp : inputs) {
int[] matches = allMatches("abra", inp);
System.out.printf("Matches %s %s at %s%n", inp, match("abra", inp), Arrays.toString(matches));
}
// run three times for warmups....
scalabilityTests();
scalabilityTests();
scalabilityTests();
}


Solution

to use bitwise-shifting instead of prime multiplication, to compute the rolling hash.

This is an optimization, I’m not sure about. It’s a bit faster (1 instead of 3 or 4 cycles latency, but probably the same throughput on modern Intel), but the hash quality is worse.

This assumes the use of rotations optimized by the JIT, which is not the case for the reviewed code. It’s quite possible that a hand-made rotation is worse than multiplication.

You need no prime, any odd number would do.

// Use a prime number as the shift, which means it takes a while to reuse
// bit positions in the hash. Creates a good hash distribution in the
private static final int SHIFT = 31;


For int, the rotation distance of 31 would be equivalent to -1, which is not good as the next char uses about the same bits. For long, it’s not bad.

You’re speaking about primes, but the shift is modulo 64 and there’s no such thing as primes in the ring Z/64. Again, what you need is a number co-prime to 64, i.e., odd. This assures that it takes 64 operations to land in the same position.

I don’t think 31 is very good, as after two operations, you get 62, i.e., -2. So using three ASCII chars, a collision can be easily produced as their bits overlap.

When sticking with ASCII, 7 should be fine, but for general strings, I’d suggest at least 16. FWIW, with 17, you’re guaranteed that there’ll no collisions for up 3 steps. After 4 steps, it totals to 68, i.e., 4, which is not perfect, so I’d go for 19 instead. A measurement with real data could tell us more, while I guess, the answer is that it doesn’t matter much.

(base << SHIFT | base >>> (BITS - SHIFT))


Wrong! Use Long.rotateLeft. It does exactly the same, but is clearer and much faster as it’s an intrinsic. Whenever the JIT sees Long.rotateLeft, it uses the rotation instruction; when it sees an equivalent code, it doesn’t care.

Actually, you need no BITS as the distance is taken modulo 64 (guaranteed in Java, undefined in C).

public static boolean match(final String pattern, final String text)


See my comment on the linked CR concerning indexOf and contains.

Concerning naming, you’re looking for a pattern in a text, that’s fine, but maybe searching a needle in a haystack would be even clearer.

if (pattern == null || text == null || pattern.length() > text.length()) {
return false;
}


Don’t allow nulls. Just throw as every null means a problem and the earlier you fail, the better. See Guava’s arguments.

I’d probably go for

if (pattern.length() >= text.length()) {
return pattern.equals(text);
}


and simply let it blow on any null.

Someone less lazy would add explicit

import static com.google.common.base.Preconditions.checkNotNull;

checkNotNull(pattern);
checkNotNull(text);


int[] possibles =


I’d call it result.

int pos = 0;
while ((pos = kr.next()) >= 0) {


This should be a for loop.