Multiple regex matches to correct OCR output

Posted on

Problem

Using OCR, my program reads the current selection for each of these categories: format, invert color? and storage. The OCR result will always have inconsistent error here and there, which is why I need a Regex match and return the correct wording for each selection:

//format
private Regex bmp8 = new Regex(".bmp.8....");
private Regex bmp24 = new Regex(".bmp.24....");
private Regex png24 = new Regex(".png.24....");
private Regex raw = new Regex("raw");
private Regex ts = new Regex("....setup");
private Regex csv = new Regex(".csvd...");

//invert color?
private Regex yes = new Regex("y.s");
private Regex no = new Regex("n.");

//storage
private Regex intrnl = new Regex(".nte...");
private Regex erase = new Regex("e.ase");
private Regex transfer = new Regex("t.ansf.");
private Regex usbdev = new Regex("usbdev.");

private string CheckStorage(string input)
{
    if (intrnl.IsMatch(input))
    {
        return "Internal";
    }

    else if (erase.IsMatch(input))
    {
        return "Erase";
    }

    else if (transfer.IsMatch(input))
    {
        return "Transfer";
    }

    else
    {
        return "USB Device";
    }
}

private string CheckInvertColor(string input)
{
    if (yes.IsMatch(input))
    {
        return "Yes";
    }

    else
    {
        return "No";
    }
}

private string CheckFormat(string input)
{
    if (bmp8.IsMatch(input))
    {
        return "bmp(8-bit)";
    }

    else if (bmp24.IsMatch(input))
    {
        return "bmp(24-bit)";
    }

    else if (png24.IsMatch(input))
    {
        return "png(24-bit)";
    }

    else if (raw.IsMatch(input))
    {
        return "Raw";
    }

    else if (ts.IsMatch(input))
    {
        return "Trace & Setup";
    }

    else
    {
        return "csv Data";
    }
}

I am sure there is a better way to tidy up this code, perhaps using List<>?

Solution

I see something that bugs me in your code, that being default return values and some unused regex expressions declared, for example in the following function:

private string CheckInvertColor(string input)
{
    if (yes.IsMatch(input))
    {
        return "Yes";
    }

    else
    {
        return "No";
    }
}

Why do you only check if the yes regex is a match, what if the input is different from the one that matches the no answer? What do you expect the code to do in this case? You return a “No” in any case that the yes regex does not match, then why have the no regex declared at all?

This also applies to the “USB device” and “csv data” of their respecting functions, you are not actually checking those values yet have them declared still.

I stress this because in my opinion, you should prevent unexpected inputs from returning a value for which you are expecting a specific input for. Instead the code should throw an exception or a value which then you can manage elsewhere to notify that an unexpected input has been fed to the function.

In regards to tiding up your code, there are some options you can use but the one that suits your needs is the following one:

Dictionaris

Although you may already know, a dictionary is a class that returns a value if the key supplied exists in said dictionary, with it your regex declarations would look like in the following:

private Dictionary<Regex, string> formatDictionary = new Dictionary<Regex, string>()
{
    {new Regex(".bmp.8...."),"bmp(8-bit)"},
    {new Regex(".bmp.24...."),"bmp(24-bit)"},
    {new Regex(".png.24...."),"png(24-bit)"},
    {new Regex("raw"),"Raw"},
    {new Regex("....setup"),"Trace & setup"},
    {new Regex(".csvd...."),"csv data"},
}

private Dictionary<Regex, string> inversionDictionary = new Dictionary<Regex, string>()
{
    {new Regex("y.s"),"Yes"},
    {new Regex("n."),"No"}
}

private Dictionary<Regex, string> storageDictionary = new Dictionary<Regex, string>()
{
    {new Regex(".nte..."),"Internal"},
    {new Regex("e.ase"),"Erase"},
    {new Regex("t.ansf."),"Transfer"},
    {new Regex("usbdev."),"USB Device"}
}

This way, with a LINQ query, you can check the validity of the input:

if (!storageDictionary.Any(x => x.Key.IsMatch(input)))
    throw new Exception("the input is not valid");

This way you can ensure that the input matches at least one of the Regex you defined, now you can recover the corresponding value:

return storageDictionary[storageDictionary.Where(x => x.Key.IsMatch(input)).First().Key];

Your functions would end up looking like this:

private string CheckInvertColor(string input)
{
    if (!storageDictionary.Any(x => x.Key.IsMatch(input)))
        throw new Exception("the input is not valid");

    return storageDictionary[storageDictionary.Where(x => x.Key.IsMatch(input)).First().Key];
}

If you go with regex then in addition to what Oscar already said in his answer I wouldn’t hardcode regex (and their replacements), a configuration file would be better if, for example, printed documents from one specific customer have a very strange font with a weird t. No need to compile a specific version for him, just change local configuration.

However I wouldn’t use regex at all. What if bmp is recognized as bnip? Do you want to add more and more regex for each case you find? This method will soon become impracticable. I’d perform this using a string distance function:

From Wikipedia:

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance (“inverse similarity”) between two text strings for approximate string matching or comparison and in fuzzy string searching.

If you have this list: yes, no, maybe and you have an input recognized as ye$ your regex will fail but an approximate string matching will correctly recognize ye$ as yes because it’s nearest match. You may also set a threshold to reject invalid (because too different) inputs. There are many functions, a simple and well-known one is Levenshtein distance. Your code will then be:

private static readonly string[] KnownStorages =
    new string[] { "Internal", "External", "Transfer", "USB Device" };

private static string FindClosestMatch(string[] availableValues, string value)
{
    return availableValues
        .OrderByDescending(x => Levenshtein(x, value))
        .First();
}

It will be used like this:

string storage = FindClosestMatch(KnownStorages, recognizedStorage);

C# implementation for Levenshtein() function is easy to find (also on Wikipedia article there is one simple implementation), note that you may want a case-insensitive comparison because it reduces possible mismatches. Also experiment with different distance functions, many of them are easy to implement and you can check where you balance speed and accuracy.

Side note: there are C# implementations of fuzzy regex engines however I don’t think you need such complexity (unless you want to match full sentences instead of single words).

For much more complex scenario you may want to use – in English – a soundex matching, for a possible implementation you may start from Soundex algorithm implementation in Java here on Code Review. EDIT: This kind of matching is not a direct replacement of distance function or regex (because matching doesn’t count differences but match by sound) but it may be a viable solution where you have a huge dictionary and input with very bad quality (where a simple distance function may return completely unrelated words). In that case it probably works best used in conjunction with distance matching to pre-filter dictionary.

Leave a Reply

Your email address will not be published. Required fields are marked *