Problem
This function converts a string like:
/{?lang(2-5):lowercase}/{page(2+):lowercase}/{article}-{id:integer}*.html*
Into the following regular expression:
@^(?J)/(?:(?P<lang>[a-z\d\-_\W]{2,5})/)?(?P<page>[a-z\d\-_\W]{2,})/(?P<article>[\w\W]*?)-(?P<id>[1-9]*?).*?\.html.*?$@
Ready to be used somewhere else, using preg_match().
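To illustrate how such an expression is consumed, here is a minimal sketch. The pattern is hand-written in the same style and is an assumption made for the example, not the function's actual output; the URLs are made up too. Named groups come back as string keys in the match array.

```php
<?php
// Hand-written pattern in the style of the generated one (an assumption, not real output).
$pattern = '@^/(?:(?P<lang>[a-z]{2,5})/)?(?P<page>[a-z]{2,})/(?P<article>[\w-]+?)-(?P<id>[1-9]\d*)\.html$@';

$withLang = [];
$withoutLang = [];
$okWith = preg_match($pattern, '/en/blog/hello-world-42.html', $withLang);
$okWithout = preg_match($pattern, '/blog/hello-world-42.html', $withoutLang);

// Named groups are available by name.
echo $withLang['lang'], ' ', $withLang['page'], ' ', $withLang['article'], ' ', $withLang['id'], PHP_EOL;

// An optional group that did not take part in the match is reported as ''
// (unless PREG_UNMATCHED_AS_NULL is passed to preg_match).
var_dump($withoutLang['lang'] === '');
```

Because the optional group is wrapped together with its trailing slash, the same pattern accepts URLs with and without the language segment.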
The basic structure is as follows:
- Everything is between curly brackets ({...})
- The next character can be a ?, to define it as optional
- Then comes the name of the “item” to capture
- Between parentheses, you can specify the length. Lengths can be a single number (meaning “exactly this long”), an interval (2-5) or an interval without an upper limit (2+).
- You can specify the type you want. It can be integer, 0-integer, lowercase or uppercase and, if unspecified, defaults to case insensitive.
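The parts listed above can be pulled out of a single placeholder with one expression. The following is a simplified sketch, not the function from this post: `parsePlaceholder` and the loose `[\w-]+` type group are assumptions made for illustration.

```php
<?php
// Simplified placeholder grammar: {?name(min-max):type}
$placeholder = '/\{(?P<optional>\?)?(?P<item>[a-z]\w*)'
    . '(?:\((?P<length>[1-9]\d*)(?P<length_max>\+|-[1-9]\d*)?\))?'
    . '(?::(?P<type>[\w-]+))?\}/';

// Hypothetical helper: returns the five named parts of one placeholder.
function parsePlaceholder(string $pattern, string $text): array
{
    preg_match($pattern, $text, $m);
    // Keep only the named groups; groups that did not match default to ''.
    $parts = [];
    foreach (['optional', 'item', 'length', 'length_max', 'type'] as $name) {
        $parts[$name] = $m[$name] ?? '';
    }
    return $parts;
}

print_r(parsePlaceholder($placeholder, '{?lang(2-5):lowercase}'));
print_r(parsePlaceholder($placeholder, '{article}'));
```

The `?? ''` default matters because PHP omits trailing groups that did not participate in the match, so `{article}` yields no `length`, `length_max` or `type` keys at all.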
This function converts the info there into a regular expression, usable with preg_match.
<?php
function convertPathToRegex($path){
// (?J) -> repeated names in capturing groups (the J modifier is only for PHP 7.2+)
static $regex = '@(?J)\{
(?P<optional>\?)? #defined if it is optional
(?P<item>[a-z]\w*) #item name
(?:\(
(?P<length>[1-9]\d*) # fixed length (default), or minimum length
(?P<length_max>
\+ # no maximum length
|-[1-9]\d* # specific maximum length
)?
\))?
(?:
# types are used as :<type>
:(?P<type>
(?:0-)?int(?:eger)? # treats as an integer (starting from 0 or 1)
|num(?:ber)? # same as 0-int
|[lu]c(?:ase)? # [l]ower or [u]pper case
|(?:low|upp)er(?:case)? # lower, upper, lowercase, uppercase
)
)?
\}
|(?P<item>
\* # any case insensitive text
)@x';
// default options
static $default = array(
'optional' => '',
'item' => '*',
'type' => 'ci',
'length' => 0,
'length_max' => 0
);
// types to be used on {name:type}
static $types = array(
'0-int' => '\d',
'int' => '[1-9]',
'ci' => '[\w\W]',
'lc' => '[a-z\d\-_\W]',
'uc' => '[A-Z\d\-_\W]',
);
// alternative names for $types
static $types_map = array(
'' => 'ci',
'integer' => 'int',
'0-integer' => '0-int',
'num' => '0-int',
'number' => '0-int',
'lcase' => 'lc',
'lower' => 'lc',
'lowercase' => 'lc',
'ucase' => 'uc',
'upper' => 'uc',
'uppercase' => 'uc'
);
// will contain all the info about the {items}
$items = array();
$format = preg_replace_callback($regex, function($matches)use(&$default, &$types, &$types_map, &$items){
$item = array_merge($default, $matches);
// the default is to select any text
if($item['item'] === $default['item'])
{
$items[] = '.*?';
// return %s to be used later with sprintf
return '%s';
}
$regex = '(?P<' . $item['item'] . '>';
$piece = isset($types_map[$item['type']])
? $types[$types_map[$item['type']]]
: $types[$item['type']];
if($item['type'] === 'int')
{
if($item['length'] >= 2)
{
// must subtract 1 from length and length_max to compensate for the [1-9] (1 char) at the beginning
$piece .= '\d{' . ($item['length'] - 1) . (
$item['length_max']
? ',' . (
$item['length_max'] !== '+'
? abs($item['length_max'] - 1)
: ''
)
: ''
) . '}';
}
else
{
/*
if a length exists, it must be lower than 2 (1 char)
so, nothing else needs to be done ($piece contains [1-9], which matches 1 char)
if no length is provided, match all the numbers ahead
*/
$piece .= $item['length'] ? '' : '\d*';
}
}
else if($item['length'] >= 2 || ($item['length_max'] && $item['length_max'] !== '+'))
{
/*
only give it a length specification if and only if the length is 2 or higher
or if there's a maximum length
this means that (1) and (1+) are skipped, but (1-5) returns {1,5} (regex)
*/
$piece .= '{' . $item['length'] . (
$item['length_max']
? ',' . (
$item['length_max'] !== '+'
? abs($item['length_max'])
: ''
)
: ''
) . '}';
}
else if(!$item['length'] || ($item['length'] === '1' && $item['length_max'] === '+'))
{
// if no length is specified (or is 1+), it means "all"
$piece .= '+';
}
/*
length of 1 doesn't need any treatment
this is because $piece contains the specification for 1 character already
*/
$regex .= $piece . ')';
$items[] = $item['optional'] ? '(?:' . $regex . ')?' : $regex;
// returns %s to be used with sprintf
return '%s';
}, $path);
// all arguments must be in the same array, can't do $format, $items
$new_regex = call_user_func_array(
'sprintf',
array_merge(
array(preg_quote($format, '@')), // protects special chars, like periods and slashes
$items
)
);
return '@^(?J)' . str_replace(')?/', '/)?', $new_regex) . '$@';
}
It is pretty complicated and quite massive.
I’ve decided not to implement any memoization scheme, as this function is part of a larger project and the result will be cached outside.
This function works as intended, as far as I could tell from my testing.
Is there anything I can improve in this function?
Solution
There is only one thing that jumps out at me as whacky…
[a-z\d\-_\W]
I think this means to match a lowercase substring, but that’s not what it is doing.
Since \W is the inverse of \w, and because \w represents A-Za-z0-9_, I think it is strange that this subpattern is used to replace the lowercase placeholder.
As is, your pattern can be expanded to the following equivalent:
(?:[a-z0-9_-]|[^A-Za-z0-9_])
This is far, far more characters than a-z. If I were new to using your system, I would expect lowercase to exclusively mean [a-z].
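A quick way to see the gap is to run both classes against a few strings. This is only a sketch: the sample strings are made up, and the strict `[a-z]` pattern is the expectation described above, not something the function currently produces.

```php
<?php
$loose  = '/\A[a-z\d\-_\W]+\z/'; // the class currently used for "lowercase"
$strict = '/\A[a-z]+\z/';        // what a newcomer might expect "lowercase" to mean

$samples = ['hello', 'hello world!', 'abc-123', 'Hello'];
foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    printf("%-14s loose=%d strict=%d\n", $s, preg_match($loose, $s), preg_match($strict, $s));
}
```

The loose class accepts spaces, punctuation, digits and hyphens; the only thing it rejects among these samples is the string containing an uppercase letter.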
I mean, if you were simply trying to deny uppercase substrings (and allow everything else) at that position, why wouldn’t you use a negated character class, [^A-Z]?
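For single-byte ASCII input, the two classes can be compared character by character. The loop below is a sketch under that assumption (no u modifier, default character tables); it collects every code point where the current class and the negated class disagree.

```php
<?php
$current = '/\A[a-z\d\-_\W]\z/'; // class currently mapped to "lowercase"
$negated = '/\A[^A-Z]\z/';       // "anything except an uppercase letter"

$differences = [];
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    // Record the code point if exactly one of the two classes accepts it.
    if (preg_match($current, $char) !== preg_match($negated, $char)) {
        $differences[] = $code;
    }
}

var_dump($differences === []);
```

An empty list means the longer class buys nothing over `[^A-Z]` for ASCII text, which supports reading the current behaviour as "not uppercase" rather than "lowercase".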
And as I say that, I wonder whether the placeholder itself is flawed. Perhaps it would be more intuitive to add a "not" keyword/placeholder, written as notupper or not:upper or maybe !upper, if you need such functionality.
I guess what I am saying is, you should either adjust your placeholders’ respective patterns, or change the placeholder terminology.
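One possible shape for that, sketched under assumptions: the `!` prefix, the table contents and the `resolveType` helper are all hypothetical, and the placeholder grammar would also need to accept `!` in a type name.

```php
<?php
// Hypothetical type table: plain names are strict, "!" names are negations.
$types = [
    'lc'  => '[a-z]',
    'uc'  => '[A-Z]',
    '!lc' => '[^a-z]',
    '!uc' => '[^A-Z]',
];

// Hypothetical aliases, mirroring the idea of the existing $types_map.
$aliases = [
    'lower'  => 'lc',  'lowercase'  => 'lc',
    'upper'  => 'uc',  'uppercase'  => 'uc',
    '!lower' => '!lc', '!lowercase' => '!lc',
    '!upper' => '!uc', '!uppercase' => '!uc',
];

// Resolve a type name (or alias) to its character class, failing loudly on typos.
function resolveType(string $name, array $types, array $aliases): string
{
    $key = $aliases[$name] ?? $name;
    if (!isset($types[$key])) {
        throw new InvalidArgumentException("Unknown type: $name");
    }
    return $types[$key];
}

echo resolveType('lowercase', $types, $aliases), PHP_EOL;
echo resolveType('!upper', $types, $aliases), PHP_EOL;
```

With a table like this, `lowercase` means exactly what it says, and the permissive behaviour stays available under a name that describes it.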
Less of a concern, but perhaps something worth sharing, is that most patterns that intend to match any character (including newlines) will either use [\S\s] or . with the s pattern modifier. Your [\w\W] works the same; it is just not commonly used.
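A small comparison on a string containing a newline shows the three spellings side by side, with the plain dot as a baseline. The subject string is made up for the example.

```php
<?php
$subject = "a\nb";

// Each pattern must consume the whole subject, newline included.
$results = [
    'dot'     => preg_match('/\A.+\z/', $subject),       // . stops at the newline
    'dot + s' => preg_match('/\A.+\z/s', $subject),      // s modifier lets . match it
    '[\S\s]'  => preg_match('/\A[\S\s]+\z/', $subject),  // common "any character" idiom
    '[\w\W]'  => preg_match('/\A[\w\W]+\z/', $subject),  // same effect, less common
];

print_r($results);
```

Only the plain dot fails here; the other three are interchangeable, so the choice is about readability for whoever maintains the patterns.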