Problem
I’m getting user-submitted text from a <textarea></textarea>, and I want to edit it a bit before storing it in a database. Here are my 3 regexes, with explanations below them:
/(?:^((\pZ)+|((?!\R)\pC)+)(?1)*)|((?1)$)|(?:((?2)+|(?3)+)(?=(?2)|(?3)))/um
- For every line…
- Match all leading whitespace and unicode control characters (“WS/CC”) (EXCEPT LINE BREAKS).
- Match all trailing WS/CC (EXCEPT LINE BREAKS).
- Match all but one WS/CC between non-whitespace characters.
- regex101
- For every line…
/(\pZ+)|((?!\R)\pC)/u
- Match remaining WS/CC (EXCEPT LINE BREAKS).
- regex101
/(^\R+)|(\R+$)|(\R(?=\R{2}))/u
- Match all leading line breaks.
- Match all trailing line breaks.
- Match all but 2 consecutive line breaks.
- regex101
This is pretty much how I’m planning on using these regexes in my PHP script:
// ...
// [Get user input from $_POST
// check its length, etc.,
// if everything looks O.K., store it in $str]
// ...
// Trim unnecessary whitespace (leave line breaks)
$str = preg_replace("/(?:^((\pZ)+|((?!\R)\pC)+)(?1)*)|((?1)$)|(?:((?2)+|(?3)+)(?=(?2)|(?3)))/um", '', $str);
// Convert remaining whitespace to regular spaces (leave line breaks)
$str = preg_replace("/(\pZ+)|((?!\R)\pC)/u", ' ', $str);
// Trim line breaks
$str = preg_replace("/(^\R+)|(\R+$)|(\R(?=\R{2}))/u", '', $str);
// Sanitize for safe printing between html tags
$str = htmlspecialchars($str, ENT_HTML5, 'UTF-8');
// ...
// [Store $str in DB with prepared statements]
// ...
And here’s a test case which I’ve executed on regex101:
Input
(with some WS/CC in it):
a b c
d e
f g h i j
k l m n
o p
Output after 1st regex
(still has some WS/CC, but none consecutive, except for line breaks):
a b c
d e
f g h i j
k l m n
o p
Output after 2nd regex
(all previous WS/CC - except for line breaks - are now regular spaces):
a b c
d e
f g h i j
k l m n
o p
Output after 3rd regex
(max 2 consecutive line breaks, and no whitespace other than non-consecutive regular spaces):
a b c
d e
f g h i j
k l m n
o p
Based on my testing, this process seems to work as intended. I would however be very interested if you can think of any cases in which it would fail to behave as expected.
Also – in addition to possible fail-cases – I’d like to know if there are ways I could make these regexes perform better (faster), while still keeping the functionality as is. This is pretty much the first time I’ve ever done anything in regex, so I’m assuming there’s some room for optimizations.
And if you see something else that is done incorrectly/could be done better, do let me know.
EDIT: Added regex101 links
EDIT2: Also, I’ve heard there’s something called ReDoS, or Regex DoS. If you can tell that these regexes are susceptible to those, and which ones specifically, I’d like to know how to avoid them.
Solution
So I’ve toiled away at this for a while and although I am feeling a little out of my depth I’ll submit my approach which employs what I do understand about preg_replace and regex.
- preg_replace() allows mixed-type pattern and replacement parameters, so one call can execute multiple patterns/replacements.
- Speeding up regex can be achieved by minimizing capture groups and alternatives (pipes) as well as using character classes and negated character classes where appropriate.
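As a small illustration of the first point, here is a sketch with made-up patterns (not the ones used in this answer). When arrays are passed, the patterns are applied in array order, and each one works on the output of the previous one.

```php
<?php
// Hypothetical patterns, for illustration only.
$patterns = [
    '/\h+/u',   // runs of horizontal whitespace
    '/\n{3,}/', // three or more consecutive newlines
];
$replacements = [' ', "\n\n"];

$input  = "a  \t b\n\n\n\nc";
$output = preg_replace($patterns, $replacements, $input);

var_dump($output === "a b\n\nc"); // bool(true)
```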
I’ve prioritized accuracy and endeavored to provide the most robust patterns considering all possible character sequences that may be submitted from the textarea. Then I’ve attempted to write each pattern with greater efficiency (without impacting accuracy). Finally, I’ve tried to reduce the overall length and convolution of the regex patterns.
I want to be clear that I don’t have any experience with recursive patterns. I don’t know how/when to use them effectively and so none of my patterns employ them.
This is the sample input that I used which has whitespaces, newlines (single and greater than double), and control characters:
$text=" \n\n\1\n\r\n\13\27a ab\r\n\t\t\r\r\ncà\1ê߀ abcbc d\n\n\t\r\n\5 e\2\n\3\n\4\n";
This input string was designed to be as difficult as possible to best test my method.
$patt_repl=[
'/^[\s\pC]+|[\s\pC]+$/u'=>'', // aka mb_trim() -- remove all leading and trailing whitespace and control characters. DO NOT ADD m FLAG TO PATTERN, THAT WILL DAMAGE THE STRING
'/[\s\pC]*?(\R)[\s\pC]*?(\R)[\s\pC]*/u'=>"\n\n", // cleans 2 or more new lines (uninterrupted by non-whitespace characters)
'/(?!\R)[\h\pC]*\R[\h\pC]*(?!\R)/u'=>"\n", // cleans single new lines
'/\n+(*SKIP)(*FAIL)|[\h\pC]+/u'=>' ' // convert 1 or more non-newline whitespace and control characters to single space
];
$text=htmlspecialchars(preg_replace(array_keys($patt_repl),$patt_repl,$text), ENT_HTML5, 'UTF-8');
echo "This is the post-replace/result string dump:\n";
var_dump($text);
Yes, my method uses 4 separate patterns rather than 3 like the OP. This is for improved accuracy. I believe the OP’s method had some gaps in accuracy, so that is the justification for the additional pattern.
The output from the above method is:
This is the post-replace/result string dump:
string(28) "a ab

cà ê߀ abcbc d

e"
…which is what I believe the output should be.
- It trims all leading and trailing whitespace and control characters from the input string.
- It reduces two or more newline characters (which may have leading, in-between, trailing whitespace or control characters) to \n\n.
- It trims all leading and trailing horizontal whitespace or control characters from all single newline characters.
- It converts 1 or more consecutive horizontal whitespace or control characters to " " (a single space). This, during my development, required me to use the (*SKIP)(*FAIL) technique to disqualify newline characters because \h wasn’t doing what I expected. I have read on SO and elsewhere that there can be some quirky/untrustworthy behaviors when dealing with \h and \v while processing certain characters on certain systems, but I do not possess the necessary wisdom to explain this.
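To show the (*SKIP)(*FAIL) idea in isolation, here is a minimal sketch on a made-up input string. The first branch matches a run of newlines and then deliberately fails, which makes the engine skip past that run, so only the second branch ever produces a replacement.

```php
<?php
// Newline runs are matched first, then discarded with (*SKIP)(*FAIL),
// so only the other whitespace/control characters get replaced.
$pattern = '/\n+(*SKIP)(*FAIL)|[\h\pC]+/u';

$input  = "a\t\tb\n\nc\x01d"; // made-up sample: tabs, a blank line, one control char
$output = preg_replace($pattern, ' ', $input);

var_dump($output === "a b\n\nc d"); // bool(true)
```

One caveat worth keeping in mind: \pC itself includes \n, so the newline is only protected when a match attempt starts on it. A run such as "\t\n" would still be consumed by the second branch, which is why this pattern is meant to run after the earlier ones have cleaned up whitespace around newlines.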
For my own confidence, I wanted to have a way to confirm the existence/absence of whitespace and control characters in the string before and after the method. I found a very helpful function in the top comment at the PHP manual’s ord() page.
Here is a PHP demo that iterates the string and identifies each and every character to prove effectiveness.
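A check of that kind can be sketched as follows. This is not the function from the manual comment; list_invisibles() is a hypothetical helper that simply lists every separator or control/other character it finds as the hex of its UTF-8 bytes.

```php
<?php
// Hypothetical helper: list every separator (\pZ) or "other" (\pC)
// character in $s as the hex of its UTF-8 bytes.
function list_invisibles(string $s): array
{
    preg_match_all('/[\pZ\pC]/u', $s, $m);
    return array_map('bin2hex', $m[0]);
}

// Tab, no-break space (U+00A0) and line feed are reported; letters are not.
print_r(list_invisibles("a\tb\u{00A0}c\n"));
```

Running it on the string before and after the replacements gives a quick way to see which invisible characters remain.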
Here are a few relevant regex links that I visited during my research:
https://www.regular-expressions.info/unicode.html
https://stackoverflow.com/questions/3230623/filter-all-types-of-whitespace-in-php
https://stackoverflow.com/questions/5471644/what-are-the-whitespaces-matched-by-s-in-php
On the matter of ReDoS and Catastrophic Backtracking, the threat is minimal (if not non-existent) because of my use of character classes and the structure of my alternatives. In case you haven’t already visited it, here is a good page to read if you want to investigate further.
Finally, I humbly ask that if anyone has a string that breaks my method, please provide it to me so that I can update my answer. And if any of my assertions are incorrect, please correct me.
First step, start with a simple solution: divide everything into simple tasks, and keep the patterns short and easily understandable.
$pats = [
'~\A[\pZ\pC]+|[\pZ\pC]+\z~u', // trim the string
'~\R~u', // normalize newlines
'~\pZ+|[^\n\PC]+~u', // replace Z and C with space
'~^ +| +$| \K +~m', // trim lines, delete consecutive spaces
'~\n\n\K\n+~' // removes more than 2 consecutive newlines
];
$reps = [ '', "\n", ' ', '', '' ];
$result = preg_replace($pats, $reps, $text);
Whatever the result, it’s better to begin with that before trying anything else, particularly to avoid spending time on a set of complicated patterns that, in the end, might be slower. It is essential for future comparisons and quickly provides a working solution.
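To see the five simple steps work together, here is a small run on a made-up input (the input string is an assumption for illustration, not one from the question):

```php
<?php
$pats = [
    '~\A[\pZ\pC]+|[\pZ\pC]+\z~u', // trim the string
    '~\R~u',                      // normalize newlines
    '~\pZ+|[^\n\PC]+~u',          // replace Z and C with space
    '~^ +| +$| \K +~m',           // trim lines, delete consecutive spaces
    '~\n\n\K\n+~'                 // removes more than 2 consecutive newlines
];
$reps = [ '', "\n", ' ', '', '' ];

// Made-up input: padded ends, a tab between words, four CRLF line breaks.
$text   = "  a \t b \r\n\r\n\r\n\r\n c  ";
$result = preg_replace($pats, $reps, $text);

var_dump($result === "a b\n\nc"); // bool(true)
```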
A mix with classic functions is also possible:
$parts = preg_split('~^[\pC\pZ]+|[\pC\pZ]+$|\R(?:[\pC\pZ]*?(\R)[\pC\pZ]*)?~u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = implode("\n", array_map(function($i) { return trim(preg_replace('~[\pC\pZ]+~u', ' ', $i));}, $parts));
About pattern performances:
Remove everything that is useless: don’t capture when you don’t need to capture, and try to limit the number of groups and the number of branches in alternations.
Be careful with subpattern references such as (?1): even if the pattern looks shorter, it has a cost (the creation of a capture group, the call to the subpattern). Also, rewriting the subpattern is better than adding a quantifier to it, as in (?1)+ (one call per repetition).
One of the main costs in a pattern is alternation, in particular when it sits at the start of the pattern. The reason is easy to understand: in the worst case, every branch is tested at each position in the string that doesn’t match.
Obviously the best cure is to avoid alternations, but unfortunately that isn’t always possible. You can also try to reduce the number of branches. However, several tricks exist to reduce this cost:
- sorting the branches by probability of success.
- the first character discrimination: instead of alpha|[0-9]beta|@gamma, you can write (?=[a0-9@])(?:alpha|[0-9]beta|@gamma) or [a0-9@](?:(?<=a)lpha|(?<=[0-9])beta|(?<=@)gamma). This way, positions in the string without an a, a digit or an at sign fail quickly without testing the three alternatives. Obviously, you can try to extend this to more than one character if the entropy of the string is low or if you have many alternatives, but keep in mind that the characters factored out at the beginning are tested twice.
- Studying the pattern: gives good results when each alternative starts with a literal string. With the S modifier, a fast algorithm searches the string for positions of the literal parts for each branch. Once done, the regex engine only tests the pattern at the selected positions.
- Building a pattern that always succeeds! With this kind of pattern, you are sure that one of the branches will succeed.
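The first character discrimination trick can be checked with a quick sketch. The subject string is made up; the point is only that the three ways of writing the alternation find the same matches.

```php
<?php
$variants = [
    '~alpha|[0-9]beta|@gamma~',                           // plain alternation
    '~(?=[a0-9@])(?:alpha|[0-9]beta|@gamma)~',            // lookahead on the first character
    '~[a0-9@](?:(?<=a)lpha|(?<=[0-9])beta|(?<=@)gamma)~', // first character factored out
];
$subject = 'xx alpha 7beta @gamma beta'; // made-up sample

$results = [];
foreach ($variants as $pattern) {
    preg_match_all($pattern, $subject, $m);
    $results[] = $m[0];
    echo implode(',', $m[0]), "\n"; // alpha,7beta,@gamma (for each variant)
}
```

Whether the rewritten forms are actually faster depends on the input, so it is worth measuring on realistic data before adopting them.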
Note also that the escape sequence \R isn’t a character nor a character class but an alias for (?>\r\n|\n|\x0b|\f|\r|\x85) or (?>\r\n|\n|\x0b|\f|\r|\x85|\xe2\x80[\xa8\xa9]) (depending on the mode). In other words, \R is a hidden alternation.
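A short sketch of what that alias means in practice: the sequence "\r\n" is consumed as one newline rather than two, while a lone "\r" or "\n" also matches.

```php
<?php
// "\r\n" is matched as a single unit, so each line break becomes one "|".
$joined = preg_replace('~\R~', '|', "a\r\nb\rc\nd");
var_dump($joined === 'a|b|c|d'); // bool(true)

// Two CRLF pairs count as two newlines, not four.
$count = preg_match_all('~\R~', "\r\n\r\n");
var_dump($count); // int(2)
```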
I build a set of patterns/replacements that fits your requirements to illustrate these techniques:
$pats = [
// normalize newlines: Pattern studying
'~\r\n?|\x0b|\f|\xe2\x80[\xa8\xa9]|\x85~S',
// replace C and Z chars with a space and keep 1 or 2 newlines
// if any: always successful pattern
'~
[^\pZ\pC]+ \K # part to keep: all that isn t C or Z chars
\pZ* (?:[^\PC\n]+\pZ*)* # C and Z chars except newlines
(?: # keep newlines if any or trim the end of the string
(\n) # one
\pZ*+ (?:[^\PC\n]+\pZ*)*+
(?: (\n) [\pZ\pC]* )?+ # or two newlines (captured)
(?!\z) # fails if at the end of the string
|
[\pZ\pC]+ # end of the string
)? # eventually
|
[\pZ\pC]+ # start of the string
~ux',
// trim lines: first character discrimination
'~ (?:$|(?<=^ ))~m' ];
$reps = [ "\n", '$1$2 ', '' ];
$result = preg_replace($pats, $reps, $text);
Here you can find a test script for the different pattern sets, (you can also pass a callable to the constructor instead of an associative array of pattern/replacements) and useful functions:
About ReDoS attacks:
Unless you end up choosing a pathological pattern and don’t limit the size of posted data, I don’t think it’s possible to crash your script with your pattern set. It isn’t so expensive to run, no more than any other script that filters and validates form data.
Another thing: when a pattern is badly written, it quickly reaches the backtracking limit. The script doesn’t crash; the preg_ function returns null instead of a result, and you can retrieve the error using preg_last_error().
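A minimal sketch of such a check, assuming you prefer an exception over silently storing a null result. replace_or_fail() is a hypothetical wrapper, not part of PHP. Invalid UTF-8 combined with the u flag is used here as an easy way to provoke a PCRE error; hitting the backtracking limit is reported through the same mechanism.

```php
<?php
// Hypothetical wrapper: preg_replace() returns null when PCRE reports an
// error, so check for that before using the result.
function replace_or_fail($pattern, $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result;
}

echo replace_or_fail('~\h+~u', ' ', "a \t b"), "\n"; // a b

try {
    replace_or_fail('~\h+~u', ' ', "\xff"); // not valid UTF-8
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```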
If you want to increase the security against DoS, start at the server level, with the Apache settings.
$str = preg_replace("/^[\pZ]+|[\pZ]+$|([\pZ](?=[^\pL\pN\pS\pP]))/um", '',$str);
$str = preg_replace("/[^\pL\pN\pS\pP\pZ\n\r\t\f]/u", '', $str);
$str = preg_replace("/(^\R+)|(\R(?=\R{2}|$))/u", '', $str);
So here are some improvements:
//Trim unnecessary white space (leave line breaks)
$str = preg_replace("/^[\pZ]+|[\pZ]+$|([\pZ](?=[^\pL\pN\pS\pP]))/um", '', $str);
/^[\pZ]+|[\pZ]+$|([\pZ](?=[^\pL\pN\pS\pP]))/gum
(~2ms)
/(?:^((\pZ)+|((?!\R)\pC)+)(?1)*)|((?1)$)|(?:((?2)+|(?3)+)(?=(?2)|(?3)))/um
(~6ms)
// Convert remaining whitespace to regular spaces (leave line breaks)
$str = preg_replace("/[^\pL\pN\pS\pP\pZ\n\r\t\f]/u", '', $str);
/[^\pL\pN\pS\pP\pZ\n\r\t\f]/gum
(~0ms)
/(\pZ+)|((?!\R)\pC)/u
(~2ms) Also doesn’t match all whitespace characters
// Trim line breaks
I couldn’t really come up with a solution to this one, so I kinda simplified yours:
$str = preg_replace("/(^\R+)|(\R(?=\R{2}|$))/u", '', $str);
/(^\R+)|(\R(?=\R{2}|$))/u
(about 0.03ms faster)
Note: you can combine both #1 and #2 together like so:
$str = preg_replace("/^[\pZ]+|[\pZ]+$|[^\pL\pN\pS\pP\pZ\n\r\t\f]|([\pZ](?=[^\pL\pN\pS\pP]))/u", '', $str);
/^[\pZ]+|[\pZ]+$|[^\pL\pN\pS\pP\pZ\n\r\t\f]|([\pZ](?=[^\pL\pN\pS\pP]))/gum