Function to convert “path” into regular expression

Problem

This function converts a string like:
/{?lang(2-5):lowercase}/{page(2+):lowercase}/{article}-{id:integer}*.html*
Into the following regular expression:
@^(?J)/(?:(?P<lang>[a-z\d\-_\W]{2,5})/)?(?P<page>[a-z\d\-_\W]{2,})/(?P<article>[\w\W]*?)-(?P<id>[1-9]*?).*?\.html.*?$@
Ready to be used somewhere else, using preg_match().

The basic structure is as follows:

  • Everything is between braces ({...})
  • The next character can be a ?, to mark the item as optional
  • Then comes the name of the “item” to capture
  • Between parentheses, you can specify the length. Lengths can be a single number (meaning “exactly this long”), an interval (2-5) or an interval without an upper limit (2+).
  • You can optionally specify a type. It can be integer, 0-integer, lowercase or uppercase and, if unspecified, defaults to case-insensitive text.

This function converts the info there into a regular expression, usable with preg_match.
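As a rough illustration of the length syntax described above, the sketch below maps a length spec to the regex quantifier it stands for. Note that lengthToQuantifier is a hypothetical helper written only for this example. It is not part of the function that follows, which additionally special-cases a length of 1 and integer types.

```php
<?php
// Hypothetical helper for illustration only; it is not part of convertPathToRegex().
function lengthToQuantifier(string $spec): string
{
    // Accepts "(n)", "(n-m)" or "(n+)", mirroring the length syntax described above.
    if (!preg_match('/^\(([1-9]\d*)(\+|-[1-9]\d*)?\)$/', $spec, $m)) {
        throw new InvalidArgumentException("Unrecognised length spec: $spec");
    }

    $min = $m[1];
    $max = $m[2] ?? ''; // absent when only a single number was given

    if ($max === '') {
        return '{' . $min . '}';   // exactly this long
    }
    if ($max === '+') {
        return '{' . $min . ',}';  // no upper limit
    }

    // interval: drop the leading "-" from the captured maximum
    return '{' . $min . ',' . substr($max, 1) . '}';
}

foreach (['(3)', '(2-5)', '(2+)'] as $spec) {
    echo $spec, ' => ', lengthToQuantifier($spec), PHP_EOL;
}
```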

<?php

function convertPathToRegex($path){
    // (?J) -> repeated names in capturing groups (the J modifier is only for PHP 7.2+)
    static $regex = '@(?J)\{
        (?P<optional>\?)? #defined if it is optional
        (?P<item>[a-z]\w*) #item name
        (?:\(
            (?P<length>[1-9]\d*) # fixed length (default), or minimum length
            (?P<length_max>
                \+ # no maximum length
                |-[1-9]\d* # specific maximum length
            )?
        \))?
        (?:
            # types are used as :<type>
            :(?P<type>
                (?:0-)?int(?:eger)? # treats as an integer (starting from 0 or 1)
                |num(?:ber)? # same as 0-int
                |[lu]c(?:ase)? # [l]ower or [u]pper case
                |(?:low|upp)er(?:case)? # lower, upper, lowercase, uppercase
            )
        )?
    \}
    |(?P<item>
        \* # any case insensitive text
    )@x';

    // default options
    static $default = array(
        'optional' => '',
        'item' => '*',
        'type' => 'ci',
        'length' => 0,
        'length_max' => 0
    );

    // types to be used on {name:type}
    static $types = array(
        '0-int' => '\d',
        'int' => '[1-9]',
        'ci' => '[\w\W]',
        'lc' => '[a-z\d\-_\W]',
        'uc' => '[A-Z\d\-_\W]',
    );

    // alternative names for $types
    static $types_map = array(
        '' => 'ci',
        'integer' => 'int',
        '0-integer' => '0-int',
        'num' => '0-int',
        'number' => '0-int',
        'lcase' => 'lc',
        'lower' => 'lc',
        'lowercase' => 'lc',
        'ucase' => 'uc',
        'upper' => 'uc',
        'uppercase' => 'uc'
    );

    // will contain all the info about the {items}
    $items = array();

    $format = preg_replace_callback($regex, function($matches)use(&$default, &$types, &$types_map, &$items){
        $item = array_merge($default, $matches);

        // the default is to select any text
        if($item['item'] === $default['item'])
        {
            $items[] = '.*?';

            // return %s to be used later with sprintf
            return '%s';
        }

        $regex = '(?P<' . $item['item'] . '>';
        $piece = isset($types_map[$item['type']])
            ? $types[$types_map[$item['type']]]
            : $types[$item['type']];

        if($item['type'] === 'int')
        {
            if($item['length'] >= 2)
            {
                // must subtract 1 from length and length_max to compensate for the [1-9] (1 char) at the beginning
                $piece .= '\d{' . ($item['length'] - 1) . (
                    $item['length_max']
                        ? ',' . (
                            $item['length_max'] !== '+'
                                ? abs($item['length_max'] - 1)
                                : ''
                        )
                        : ''
                    ) . '}';
            }
            else
            {
                /*
                    if a length exists, it must be lower than 2 (1 char)
                        so, nothing else needs to be done ($piece contains [1-9], which matches 1 char)

                    if no length is provided, match all the numbers ahead
                */
                $piece .= $item['length'] ? '' : '\d*';
            }
        }
        else if($item['length'] >= 2 || ($item['length_max'] && $item['length_max'] !== '+'))
        {
            /*
                only give it a length specification if and only if the length is 2 or higher
                    or if there's a maximum length
                this means that (1) and (1+) are skipped, but (1-5) returns {1,5} (regex)
            */
            $piece .= '{' . $item['length'] . (
                $item['length_max']
                    ? ',' . (
                        $item['length_max'] !== '+'
                            ? abs($item['length_max'])
                            : ''
                    )
                    : ''
                ) . '}';
        }
        else if(!$item['length'] || ($item['length'] === '1' && $item['length_max'] === '+'))
        {
            // if no length is specified (or is 1+), it means "all"
            $piece .= '+';
        }
        /*
            length of 1 doesn't need any treatment
            this is because $piece contains the specification for 1 character already
        */

        $regex .= $piece . ')';

        $items[] = $item['optional'] ? '(?:' . $regex . ')?' : $regex;

        // returns %s to be used with sprintf
        return '%s';
    }, $path);

    // all arguments must be in the same array, can't do $format, $items
    $new_regex = call_user_func_array(
        'sprintf',
        array_merge(
            array(preg_quote($format, '@')), // protects special chars, like periods and slashes
            $items
        )
    );

    return '@^(?J)' . str_replace(')?/', '/)?', $new_regex) . '$@';
}

It is pretty complicated and quite massive.

I’ve decided not to implement any memoization scheme, as this function is part of a larger project and the result will be cached outside of it.

As far as I can tell from my testing, this function works as intended.

Is there anything I can improve in this function?

Solution

There is only one thing that jumps out at me as whacky…

[a-z\d\-_\W]

Regex101 breakdown

I think this means to match a lowercase substring, but that’s not what it is doing.

Since \W is the inverse of \w, and \w represents A-Za-z0-9_, I think it is strange that this subpattern is used to replace the lowercase placeholder.

As is, your pattern can be expanded to the following equivalent:

(?:[a-z0-9_-]|[^A-Za-z0-9_])

This is far, far more characters than a-z. If I were new to your system, I would expect lowercase to mean exclusively [a-z].
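To make the difference concrete, here is a small sketch comparing the class as posted with a strict [a-z]. The sample strings are hypothetical inputs picked for illustration, and the D modifier is only there so that $ cannot match before a trailing newline.

```php
<?php
// The lowercase class as posted, and the stricter class a newcomer might expect.
$asPosted = '/^[a-z\d\-_\W]+$/D';
$strict   = '/^[a-z]+$/D';

// Hypothetical sample inputs chosen for this illustration.
$samples = ['abc', 'a1-_', 'a b/c!', 'Abc'];

$results = [];
foreach ($samples as $sample) {
    $results[$sample] = [
        'asPosted' => preg_match($asPosted, $sample) === 1,
        'strict'   => preg_match($strict, $sample) === 1,
    ];
    printf(
        "%-8s asPosted=%s strict=%s\n",
        $sample,
        var_export($results[$sample]['asPosted'], true),
        var_export($results[$sample]['strict'], true)
    );
}
```

Only the last sample, which contains an uppercase letter, is rejected by the posted class. Digits, spaces, slashes and punctuation all pass.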

I mean, if you were simply trying to deny uppercase substrings (and allow everything else) at that position, why wouldn’t you use a negated character class, [^A-Z]?
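A quick way to check whether the negated class would be a drop-in replacement is to compare both classes character by character. The sketch below does that for the 128 ASCII characters only, assuming single-byte input and no u modifier. Behaviour outside ASCII is not covered here.

```php
<?php
// Compare the posted class with the negated class over every ASCII character.
$asPosted = '/[a-z\d\-_\W]/';
$negated  = '/[^A-Z]/';

$mismatches = [];
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);

    // Record any character on which the two classes disagree.
    if (preg_match($asPosted, $char) !== preg_match($negated, $char)) {
        $mismatches[] = $code;
    }
}

echo $mismatches === []
    ? 'No differences found for ASCII input' . PHP_EOL
    : 'Differing character codes: ' . implode(', ', $mismatches) . PHP_EOL;
```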

And as I say that, I wonder whether the placeholder itself is flawed. It might be more intuitive to add a negating keyword/placeholder, written as notupper, not:upper or maybe !upper, if you need such functionality.

I guess what I am saying is, you should either adjust your placeholders’ respective patterns, or change the placeholder terminology.

Less of a concern, but perhaps worth sharing: most patterns that intend to match any character (including newlines) will use either [\S\s] or . with the s pattern modifier. Your [\w\W] works the same, it is just not commonly used.
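The following sketch shows the three spellings side by side on a string containing a newline, with a plain . (no s modifier) included for contrast. The test string and labels are made up for this example.

```php
<?php
$text = "a\nb"; // a newline sits between the two letters

$patterns = [
    '[\w\W]'      => '/^a[\w\W]b$/',
    '[\S\s]'      => '/^a[\S\s]b$/',
    '. with s'    => '/^a.b$/s',
    '. without s' => '/^a.b$/',
];

$matched = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match and 0 otherwise
    $matched[$label] = preg_match($pattern, $text) === 1;
    echo str_pad($label, 12), $matched[$label] ? 'matches' : 'no match', PHP_EOL;
}
```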
