php preg_replace kills cpu

Problem

I am reading some HTML files and returning them on request, but before they are returned I need to filter some content out of them. I'm using 20 or more str_replace and preg_replace calls to do that, and everything was working fine.

Until I added this:

$responseBody = preg_replace('/<script>((?!<)[\s\S])*googletagmanager\.com[\s\S]*?<\/script>/', '', $responseBody);

which is supposed to remove this part from the HTML:

<script>
(function (w, d, s, l, i) {
    w[l] = w[l] || []; w[l].push({
        'gtm.start':
        new Date().getTime(), event: 'gtm.js'
    }); var f = d.getElementsByTagName(s)[0],
    j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
    '//www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
})(window, document, 'script', 'dataLayer', 'GTM');</script>

It does that very well in a regex tester (http://regexr.com/3e8nb), but on the server the CPU usage just goes up and the server halts.
Why is this happening, and how can I improve this regex?

Solution

Simple as that:

/<script>[^<]*?googletagmanager\.com.*?<\/script>/s

What's killing performance for you is most likely the backtracking caused by the look-ahead in the first repeated group, which is re-checked at every character. Using a negated character class instead of a look-ahead, and making it non-greedy, should solve that issue.

The /s modifier allows you to use . instead of the explicit [\s\S].
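As a minimal sketch of how that pattern could be dropped into the original call: the sample $responseBody below is a shortened, hypothetical stand-in for the real HTML, and the null check is an added precaution, since preg_replace() returns null when PCRE gives up (for example on hitting its backtrack limit).

```php
<?php
// Hypothetical sample input; the real $responseBody comes from the HTML files being served.
$responseBody = "<p>before</p>\n<script>\n(function (w) {\n    w.src = '//www.googletagmanager.com/gtm.js?id=GTM';\n})(window);</script>\n<p>after</p>";

// Negated character class instead of a look-ahead; /s lets the dot match newlines.
$pattern = '/<script>[^<]*?googletagmanager\.com.*?<\/script>/s';

$cleaned = preg_replace($pattern, '', $responseBody);

if ($cleaned === null) {
    // preg_replace() returns null on error; fall back to the unfiltered body and log the PCRE error code.
    error_log('preg_replace failed, PCRE error code: ' . preg_last_error());
    $cleaned = $responseBody;
}

echo $cleaned;
```

Checking for null matters here because a failed replacement would otherwise silently turn the whole response body into an empty value.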

Since I assume you are not in control of the JavaScript code blocks being inserted in your code, you'd better play it safe and just use dots instead of negated character classes like [^<]. The reason is: Google may write a "less than" comparison between <script> and googletagmanager, and the match will fail.
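To illustrate that failure mode, here is a small sketch with a made-up script block (the `a < b` comparison and the `load()` call are hypothetical, not something Google is known to emit). The negated character class cannot step over the `<`, while the dot can.

```php
<?php
// Hypothetical script block with a "less than" comparison before the domain name.
$html = "<script>if (a < b) { load('//www.googletagmanager.com/gtm.js'); }</script>";

$withNegatedClass = '~<script>[^<]*?googletagmanager\.com.*?</script>~s';
$withDot          = '~<script>.*?googletagmanager\.com.*?</script>~s';

// preg_match() returns 1 on a match and 0 on no match.
$negatedClassResult = preg_match($withNegatedClass, $html);
$dotResult          = preg_match($withDot, $html);

var_dump($negatedClassResult); // int(0): [^<] stops at the "<" in "a < b"
var_dump($dotResult);          // int(1): the dot matches "<" as well
```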

As a side note, when your pattern contains the same symbol that you use as the pattern delimiter, you can avoid having to escape the in-pattern / by switching to a valid delimiter that does not occur in your pattern, like ~.
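A quick sketch of the two delimiter styles side by side, using a hypothetical one-line script block as input. Both patterns should behave identically; only the escaping of the closing tag differs.

```php
<?php
// Hypothetical one-line input for comparing the two delimiter styles.
$html = "<script>load('//www.googletagmanager.com/gtm.js');</script>";

// With / as the delimiter, the slash in </script> has to be escaped.
$slashDelimited = '/<script>.*?googletagmanager\.com.*?<\/script>/s';

// With ~ as the delimiter, </script> can be written as-is.
$tildeDelimited = '~<script>.*?googletagmanager\.com.*?</script>~s';

$fromSlash = preg_replace($slashDelimited, '', $html);
$fromTilde = preg_replace($tildeDelimited, '', $html);

var_dump($fromSlash === $fromTilde); // bool(true)
```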

Your regex demo link uses sample input that has two occurrences of googletagmanager. Only one of them is trailed by .com. If you want to match both occurrences, use this:

~<script>.*?googletagmanager.*?</script>~s

If you only want to match the .com, then use this:

~<script>(?:(?!</script>).)*?googletagmanager\.com.*?</script>~s

Note: the s flag that trails both patterns declares that dots should also match newline characters.
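The look-ahead in the .com pattern is there to stop the match from running past a closing </script> before the domain name is found. A rough sketch of the difference, using a hypothetical page with one unrelated script block ahead of the tag manager block:

```php
<?php
// Hypothetical input: an unrelated script block followed by a tag manager block.
$html = "<script>var keep = 1;</script><p>x</p><script>load('//www.googletagmanager.com/gtm.js');</script>";

// Plain lazy dot: starts at the first <script> and runs on until it finds the domain.
$plainDot = '~<script>.*?googletagmanager\.com.*?</script>~s';

// Look-ahead version: the leading part may not cross a </script>.
$guarded = '~<script>(?:(?!</script>).)*?googletagmanager\.com.*?</script>~s';

$plainResult   = preg_replace($plainDot, '', $html);
$guardedResult = preg_replace($guarded, '', $html);

var_dump($plainResult);   // string(0) "": the unrelated block was removed too
var_dump($guardedResult); // the unrelated block and the paragraph survive
```

That extra safety is what the higher step count of the .com pattern pays for.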

My non-.com pattern will successfully match both occurrences in 561 steps.
My .com pattern more than triples the step count at 1979 steps, but will be more reliable/accurate if Google shakes things up as I mentioned.

Ext3h’s pattern is also very fast at 572 steps and suitably matches .com only, but if it ever lets you down, you will know it can only be because Google added a < in just the right wrong place.

My assumption is that if you are going to bother to remove one googletagmanager script block, you probably want to remove all of them — in which case, you’ll want to use my first pattern for both accuracy and speed.
