Versatile string validation

Posted on

Problem

I passed a technical test the other day, one part included a cheap string validation and I am wondering this can be further improved to be more versatile to requirement changes.

The requirements were something like that

Create a Validate method, which accepts a string and returns true if it’s valid and false if it’s not.

A string is valid if it satisfies the rules below:

  • The string must be at least 6 characters long and not exceed 16 characters.
  • The string must contain only letters, numbers and optionally one hyphen (-).
  • The string must start with a letter, and must not end with a hyphen.
    For example, validate("Michelle Belle"); would return false because it contains a space.

My solution was like that:

public static class ComparableExtensions
{
    public static bool IsStrictlyLowerThan<TComparable>(this TComparable comparable, TComparable value)
        where TComparable : IComparable<TComparable>
    {
        return comparable.CompareTo(value) < 0;
    }

    public static bool IsStrictlyGreaterThan<TComparable>(this TComparable comparable, TComparable value)
        where TComparable : IComparable<TComparable>
    {
        return comparable.CompareTo(value) > 0;
    }

    public static bool IsStrictlyNotBetween<TComparable>(this TComparable comparable, TComparable lowerBound, TComparable upperBound)
        where TComparable : IComparable<TComparable>
    {
        if (lowerBound.IsStrictlyGreaterThan(upperBound))
        {
            throw new ArgumentOutOfRangeException(nameof(lowerBound) + nameof(upperBound));                
        }

        return comparable.IsStrictlyLowerThan(lowerBound) || comparable.IsStrictlyGreaterThan(upperBound);
    }
}

public static class CharExtensions
{
    public static bool IsLetterOrDigit(this char c)
    {
        return char.IsLetterOrDigit(c);
    }

    public static bool IsLetter(this char c)
    {
        return char.IsLetter(c);
    }

    public static bool IsHyphen(this char c)
    {
        return c == '-';
    }
}

public class Test
{
    public static bool Validate(string str)
    {
        if (str.Length.IsStrictlyNotBetween(6, 16))
        {
            return false;
        }

        if (!str.First().IsLetter() || str.Last().IsHyphen())
        {
            return false;
        }

        var hyphenCount = 0;

        for (var i = 1; i < str.Length - 1; i++)
        {
            if (str[i].IsLetterOrDigit())
            {
                continue;
            }
            if (str[i].IsHyphen())
            {
                hyphenCount++;
                if (hyphenCount > 1)
                {
                    return false;
                }
            }
            else
            {
                return false;
            }
        }

        return true;
    }
}

I purposefully decided to not go with Regular Expressions to keep the logic readable and I am wondering if my code can be further refactored to incorporate new business rules.

Solution

Not much to say about the extensions methods, as they are mostly wrappers.

However if you’re looking for ways to make the algorithm more readable, LINQ is your friend. You can replace most of your logic with a one-liner:

var hyphenCount = 0;

for (var i = 1; i < str.Length - 1; i++)
{
    if (str[i].IsLetterOrDigit())
    {
        continue;
    }
    if (str[i].IsHyphen())
    {
        hyphenCount++;
        if (hyphenCount > 1)
        {
            return false;
        }
    }
    else
    {
        return false;
    }
}

return true;

Like this:

return str.All(c => c.IsHyphen() || c.IsLetterOrDigit()) && str.Count(c => c.IsHyphen()) <= 1;

Which more clearly explains your intent, you can also move those expression to their separate methods to make it as readable as possible, this way you can keep adding new conditions, without modifying the existing logic (unless they interfere with one another that is).

  • There’s a bug: you’re not checking if the last character is a letter or digit, only that it isn’t a hyphen, so this fails to reject "abcdef&". Denis’ solution may be less efficient (2 iterations instead of 1), but with at most 16 characters that’s not much of a concern, and it’s both easier to read and it works correctly.
  • The first two rules read very nicely. I especially like that the business rules are translated to code almost 1:1, that will make updating easier. However, I do think those extension methods are over-engineered. str.Length < 6 || str.Length > 16 and !char.IsLetter(str.First()) || str.Last() == '-' is already quite readable, and that doesn’t require extra code that needs to be understood and maintained.
  • You can use => syntax for methods with single-expression bodies.

Little bit late but let me add another alternative solution, t3chb0t already took care of your code.

You have extension methods for IComparable<T> and Char, this forces you to write your business logic in code instead of a (pseudo) high-level language which resembles your business rules. If you do not want to use a regular expression (I’d use it) then you should try to match 1:1 the language of your requirements:

StringValidator.Validate(str)
    .LengthIsInRange(6, 16)
    .ContainsOnly(Matches.LetterOrDigits, Matches.Hyphen).HasAtMost(1, Matches.Hyphen)
    .StartsWith(Matches.Letter)
    .DoesNotEndWith(Matches.Hyphen);

This is simple enough to be self-describing. Does it hurt performance? Maybe but it’s seldom a problem in LoB applications and business rules can become incredibly convoluted. Is it a problem? Measure it, for performance critical code it’s easy to write an extension method (for StringValidator) to perform a specific task.

Note that I’m using an hypothetical Matches.LetterOrDigits() function with string -> bool signature instead of, let’s say, Char.IsLetterOrDigit() with char -> bool. Why? Because not every character (OK, I won’t repeat this again…) is a single Char then you have to compare strings (few examples: à or or ).

Your requirements are, probably, vague and your code broken regardless the interpretation you picked. What do they mean with letter? An US-ASCII Latin character? In that case your code is wrong because ā (LATIN SMALL LETTER A MACRON) is accepted. Is is a letter everything in the Letter Unicode category (or even just in the Latin Extended Unicode block)? Your code is wrong again because (LATIN SMALL LETTER A + COMBINING MACRON) is rejected. This very example can be solved normalizing the string before starting the validation but it’s not always that easy.

Imagine that instead of a single validation rule you have…1,000 of them. Or 10,000. I’m not a fan of extension method for primitive types (especially when they’re trivial) but the mistake (IMHO) is to work with a too low level of abstraction: you do not want to extend String.Length to replicate what > does: you want to extend String to support high level composable and chainable assertions.

Note: ContainsOnly() may even accept regexes, pseudo code:

static ContainsOnly(this StringValidator validator, ...string[] matches) => ...

static class Matches {
    public static string LetterOrDigit = "a-zA-Z0-9";
}

Composition should now be fast enough even when you cannot really ignore performance (and you can always have a MatchRegEx() extension method).


How to extend this? In previous example I assumed a pseudo-implementation like this:

public StringValidator LengthInRange(int min, int max) {
    if (IsValid) {
        IsValid = Value.Length >= min && Value.Length <= max;
    }

    return this;
}

You can easily extend it to run all validators to generate a list of errors:

var result = StringValidator.Validate(str)
    .LengthIsInRange(6, 16)
    .StartsWith(Matches.Letter)
    .DoesNotEndWith(Matches.Hyphen);

if (result.HasErrors)
    Console.WriteLine(String.Join(Environment.NewLine, result.Errors));

Or to throw an exception:

StringValidator.Validate(str)
    .LengthIsInRange(6, 16)
    .StartsWith(Matches.Letter)
    .DoesNotEndWith(Matches.Hyphen)
    .ThrowIfInvalid();

I think you’ve got the point.

I think you’ve seen enough alternative solutions so I’ll just review your code.

  • You should not invent new terms for x < 0 like IsStrictlyLowerThan as this makes things unnecessarily more difficult to understand. Instead, you should stick to the well known naming. In this case that would be IsLessThan or in case of x <= 0 it would be IsLessThanOrEqual – Everyone who write code would understand these without any further explanation. Or in case of IsStrictlyNotBetween should be IsNotInRangeExclusive etc.
  • There are voices in comments that suggest using == '-' instead of your extensions. I don’t agree with that. Your extensions are much better because they make the code look clean and consistent. Having extensions for some operations and not for others would make it look dirty.
  • I don’t however agree with your decision for not using Regex. With regex it’s usually simpler and easier to express complex validations. It’s also easier to make your validations case-insensitive or match other patterns or use such services as regex101.com to test them. It might be enough for now but sooner or later you will need it so don’t be so strict about always consider other solutions.
  • Keep also in mind that your IsStrictlyLowerThan APIs are currently case sensitive so you might consider using IEqualityComparer<string> as the last argument and not the TComparable that doesn’t offer you any additional functionality in this context. So your API should have signatures similar to this one:

    public static bool IsLessThan(this string value, string other, IComparer<string> comaprer = default)
    {
        return (comaprer ?? StringComparer.Ordinal).Compare(value, other) < 0;
    }
    

    Then you can use it with other comparers if necessary, e.g.:

    "FOO".IsLessThan("foo", StringComparer.OrdinalIgnoreCase)
    

N.B. I thought I’d answer the original test question myself in F# as a Kata, so this is a functional suggestion based on what I found.

I’ll only talk about the versatility part of the solution, others have touched on using Linq and I think your code is already quite readable.


If you can divide the original requirement into a number of smaller rules, where each rule has the same “Shape”.

This “Shape” is take a sequence of characters and return a Boolean.

So in C# we’d have:

bool LengthIsValid(string s) => s.Length >= 6 && s.Length <= 16;

bool Rule2(string s) => true; //e.g must start with Letter, or doesn't with with '-'

Now that we have each rule in the same “Shape”, you can create a composite set of rules and validate them all together:

var rules = new List<Func<string, bool>> 
{ 
    LengthIsValid, 
    Rule2,
    //etc.
};

Checking the rules is then just a matter of applying those rules to the input string:

bool CheckAllRules (string s) =>
    rules
    .All(rule => rule(s));

With this approach you get versatility from been able to create any number of composition of rules.

e.g. Creating Rules where the powerUsers don’t need to check the length (I can be root).

IEnumerable<Func<string, bool>> CreateRules(bool powerUser)
{
    if (!powerUser)
        yield return LengthIsValid;

    yield return Rule2;
}


The F# version if you’re interested in here

Leave a Reply

Your email address will not be published. Required fields are marked *