Rabin-Karp String matching

Posted on

Problem

I am learning algorithms and have implemented the Rabin-Karp string search algorithm. I didn’t implement a rolling HashCode, but it’s judged correct by this online judge.

It’s slower than the str.indexOf() implementation (7ms, compared to 4 ms). Can this be improved? What is the time complexity for this?

class Solution {
public int strStr(String haystack, String needle) {
    if (haystack == null || needle == null)
        return -1;
    int needleHash = hashCode(needle);
    int plen = needle.length();
    boolean t = true;
    int tlen = haystack.length() - needle.length() + 1;
    for (int i = 0; i < tlen; i++) {
        String sub = haystack.substring(i, plen + i);
        int haystackHash = hashCode(sub);
        if (haystackHash == needleHash) {
            for (int j = 0; j < plen; j++) {
                char[] charN = needle.toCharArray();
                char[] charH = sub.toCharArray();
                if (charN[j] != charH[j]) {
                    t = false;
                }
            }
            if (t == true) {
                return i;
            }
        }
    }
    return -1;
}

public int hashCode(String value) {
    int h = 777;
    if (value.length() > 0) {
        char val[] = value.toCharArray();

        for (int i = 0; i < val.length; i++) {
            h = 31 * h + val[i];
        }
    }
    return h;
  }
}

Solution

You’re missing the essential benefit of using a rolling hash:

for (int i = 0; i < tlen; i++) {
    String sub = haystack.substring(i, plen + i);
    int haystackHash = hashCode(sub);

Here, we examine plen characters of haystack at every character position, which is O(mn) where m is the haystack length and n the needle length.

By using a rolling hash, we only need to examine two characters each time around the loop (one to add to the hash, and one to remove from it). That is of course O(m), which greatly improves the efficiency when n is large.

Only inspect the substring when there’s a hash match, which should be infrequent.


Other issues:

  • Why convert needle and sub to char arrays to compare them? You certainly don’t want to do that every time around the j loop; you don’t even need that loop if you write if (sub == needle) return i;.

  • if (t == true) is more simply written if (t) since that’s the only true value that t can hold. Again, not needed when you eliminate the j loop.


Improved code

You meant to write something more like this:

public int strStr(final String haystack, final String needle)
{
    if (haystack == null || needle == null)
        return -1;

    final int nlen = needle.length();
    final int hlen = haystack.length();
    if (nlen > hlen)
        return -1;

    int needleHash = 0;
    int hash_remove = 1;
    for (char c: needle.toCharArray()) {
        needleHash = addHash(needleHash, c);
        hash_remove = addHash(hash_remove, '');
    }

    int haystackHash = 0;
    for (int i = 0;  i < nlen - 1;  ++i) {
        haystackHash = addHash(haystackHash, haystack.charAt(i));
    }

    for (int i = 0;  i + nlen < hlen;  ++i) {
        haystackHash = addHash(haystackHash, haystack.charAt(i + nlen));
        if (haystackHash == needleHash && haystack.substring(i, i+nlen) == needle)
            return i;
        haystackHash -= haystack.charAt(i) * hash_remove;
    }

    return -1;
}

private int addHash(int h, char c)
{
    // calculation may overflow, but that's fine
    return 31 * h + c;
}

The only magic in there is the computation and use of hash_remove, to undo the effect of characters that have moved out of the match window.

Leave a Reply

Your email address will not be published. Required fields are marked *