Function to remove specified html tags from string


I recently wrote this function that, when supplied a string and an array of element types, will remove those element types from the string.

As you can imagine, this is a very simple function, but I wanted to get opinions on speed, functionality, and anything else that could improve it.


    $str = <<<STRING
        <b>Chocolate</b> is very <div class="vanilla">Chocolate</div>!
        < b>England</b> is very <div class="skyscraper">City</div>!
        <b>England</b> is my <div>city</div>
        <b>England</b> is my <div>city</div>
        <b>England</b> is my <span>city</span>
        < b>Name</b>
        <div class="myclass"></div>
    STRING;

    echo htmlentities(removeTags($str, array("b", "div", "span")));

    function removeTags($str, $tags) {

        $tagsString = "";
        foreach($tags as $key => $v) { 
            $tagsString .= $key == count($tags)-1 ? $v : "{$v}|"; 
        }

        $patterns = array("/(<\s*\b({$tagsString})\b[^>]*>)/i", "/(<\/\s*\b({$tagsString})\b\s*>)/i");
        $output = preg_replace($patterns, "", $str);

        return $output;
    }



Alright! You made me do it. I’m posting an answer 🙂

Parsing arbitrary HTML with regular expressions is nearly impossible. If you only run this on HTML that you control, it can work, because you can make sure the edge cases never occur. If you set it loose on arbitrary HTML from the internet, though, you will find that it fails in a number of cases. Your use case is a little simpler because you are only removing the tags themselves (and not their contents), but even so it can be surprisingly tricky. You probably need a straight-up HTML parser. There are a bajillion of them for PHP.

Other than the easy ones I mentioned in my initial comment, here are some more variations that I think will completely sink your current code. I have actually seen this first one in use on the intertubes, and it is technically valid HTML:

    <a href="#" class="tooltip" data-contents="<br>Hi<br>">Hover me</a>
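To make the problem concrete, here is a minimal sketch. The pattern and variable names are my own illustration (assumptions, not the asker's exact code). It shows that a tag-stripping regex has no idea it is looking inside a quoted attribute value:

```php
<?php
// Illustrative pattern (an assumption, not the asker's exact code):
// strips opening and closing <br> tags wherever the text matches.
$pattern = '/<\s*\/?\s*br\b[^>]*>/i';

$html = '<a href="#" class="tooltip" data-contents="<br>Hi<br>">Hover me</a>';

// The regex cannot tell markup from attribute text, so it rewrites the
// data-contents value even though there is no real <br> element here.
$result = preg_replace($pattern, '', $html);

echo $result, PHP_EOL;
```

The tooltip text loses its line breaks, which is probably not what anyone asked for. A real parser would treat the whole quoted value as one attribute and leave it alone.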

Another good one: your new regular expression allows for more flexibility in the tag contents at the expense of tag specificity. In other words, I’m pretty sure that if you use your script to try to remove <b> tags, you will also end up removing <br> tags.
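A quick sketch of that over-matching, using two hypothetical patterns of my own (neither is the asker's exact code). The only difference between them is a word boundary after the tag name:

```php
<?php
// Both patterns are illustrative assumptions, not the asker's exact code.
// $loose has no word boundary after the tag name; $strict adds \b.
$loose  = '/<\s*\/?\s*b[^>]*>/i';
$strict = '/<\s*\/?\s*b\b[^>]*>/i';

$html = '<b>bold</b><br>next line';

// Without \b, "<br>" matches as "<b" followed by "r>", so it is stripped too.
$looseResult = preg_replace($loose, '', $html);

// With \b, the name must end right after "b", so "<br>" is left alone.
$strictResult = preg_replace($strict, '', $html);

echo $looseResult, PHP_EOL;  // boldnext line
echo $strictResult, PHP_EOL; // bold<br>next line
```

The loose form would also eat `<body>` and `<blockquote>`, so if you stay with regular expressions, the boundary after the tag name matters.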

These, and the examples in my initial comment, are some exceptions I came up with just off the top of my head. In contrast, the internet cumulatively has a bajillion hours of experience writing hard-to-parse HTML strings. Sure, your code might even work 90% of the time, but if you stick with regular expressions it will definitely fail on a regular basis. It’s up to you to decide if the failure rate is low enough.

Especially important: If you are trying to use this as a security tool then definitely don’t. It is nowhere near reliable enough. Sorry. It’s not that your algorithm isn’t well thought out. It’s just that HTML is a full dialect that needs an actual parser: regular expressions just aren’t meant for this kind of thing. It’s like trying to build an actual house out of toothpicks.

See this similar question/answer:

Regex to remove inline javascript from string

First things first, everyone is going to tell you that your approach is a bad idea. You haven’t told us if you’re passing user-supplied markup through this for sanitization or if it’s doing something more controlled.

If this is your XSS protection or attempt at preventing bizarre formatting, you’re going to be far better off getting a real DOM parser to remove these elements for you (or just give your users some markup language like markdown or bbcode).

For the code itself, it’s pretty simple. You could compact the pattern to something like

    $output = preg_replace("/(<\s*\/?\s*\b($tagsString)\b[^>]*\/?\s*>)/i", "", $str);

That allows you to do the replacement of both open and closing tags in one pass. Plus it gets self-closing tags.
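Here is a rough sketch of how that single pattern could be wrapped up. The function name `removeTagsOnePass` and the `preg_quote()` step are my own additions (assumptions), and it has the same weaknesses as any regex approach:

```php
<?php
/**
 * Hypothetical helper: strip the opening, closing and self-closing forms
 * of the given tags in a single pass.
 */
function removeTagsOnePass(string $str, array $tags): string
{
    // With no tag names the alternation would be empty and match far too much.
    if (!$tags) {
        return $str;
    }

    // preg_quote() guards against regex metacharacters in a tag name.
    $quoted = array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $tags);
    $tagsString = implode('|', $quoted);

    return preg_replace("/(<\s*\/?\s*\b($tagsString)\b[^>]*\/?\s*>)/i", '', $str);
}

echo removeTagsOnePass('<b>Hi</b><br/><div class="x">there</div>', ['b', 'br', 'div']), PHP_EOL;
```

Using `implode('|', ...)` also replaces the manual loop that builds `$tagsString`, which is a little easier to read.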

The Risk

Consider the following input:

    <<div>img src="junk" onError="maliciousFunction();">

This is based on a real XSS attempt I’ve seen in the wild that got through someone’s naive sanitization measures. It loads an imaginary image, which causes an error. The onError event then executes arbitrary JavaScript (redirecting to a different site, inserting popups, etc).

Your regex only matches the <div> and then removes it, creating a valid <img> tag with the malicious script inside. You’re going to have to create ever-more complex patterns to catch every little possibility.
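You can reproduce this in a few lines. The pattern below is an illustrative blacklist regex of my own (an assumption), but the asker's patterns behave the same way on this input:

```php
<?php
// Illustrative blacklist pattern (an assumption) that strips <div> tags.
$pattern = '/<\s*\/?\s*\bdiv\b[^>]*\/?\s*>/i';

$input = '<<div>img src="junk" onError="maliciousFunction();">';

// Only the inner "<div>" matches. The leading "<" is left behind and
// joins up with "img ...", producing a working <img> tag.
$output = preg_replace($pattern, '', $input);

echo $output, PHP_EOL;
```

The "sanitized" string is now more dangerous than the input was.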

Another Alternative

If you really, really don’t want to go to a non-HTML markup language or to use a parser, you’re better off using a whitelist than a blacklist.

HTML-encode the entire input and then selectively decode the tags and attributes you want to allow. At that point, you can more easily strip out everything except the tag name or everything except the tag name and style.
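A minimal sketch of the encode-then-decode idea, assuming you only want to allow bare tags with no attributes at all. The function name `allowBareTags` is hypothetical, and this is a starting point rather than a vetted sanitizer:

```php
<?php
/**
 * Hypothetical helper: encode everything, then restore only bare,
 * attribute-free tags whose names are on the whitelist.
 */
function allowBareTags(string $input, array $allowed): string
{
    // Step 1: neutralize all markup.
    $encoded = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    if (!$allowed) {
        return $encoded;
    }

    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $allowed));

    // Step 2: decode only "&lt;name&gt;" and "&lt;/name&gt;".
    // Anything carrying attributes does not match and stays encoded.
    return preg_replace('/&lt;(\/?)(' . $names . ')&gt;/i', '<$1$2>', $encoded);
}

echo allowBareTags('<b>hi</b><img src=x onerror="a()">', ['b', 'i']), PHP_EOL;
```

Because the default is "encoded", anything the pattern does not explicitly recognize stays harmless text, which is the whole advantage of a whitelist.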

Don’t reinvent the wheel

This function already exists natively in PHP. strip_tags() does almost the same thing as your removeTags(), but its second parameter is inverted: it lists the tags to keep rather than the tags to remove.

So my idea was to get a list of tags that already exist in the string, see which ones aren’t in the list of tags to remove, and then let strip_tags() do the dirty work. Logically this should be two functions.
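For reference, this is how strip_tags() behaves on its own with well-formed markup (the sample string is mine):

```php
<?php
$html = '<b>Bold</b> and <i>italic</i>';

// The second argument lists the tags to KEEP; everything else is stripped.
$kept = strip_tags($html, '<b>');

echo $kept, PHP_EOL; // <b>Bold</b> and italic
```

So to remove only certain tags, you have to hand it every other tag that appears in the string.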

Getting tag names

I didn’t want to use regex because it’s notoriously bad at parsing HTML. I used simple string functions to get the tag names. My algorithm is as follows:

  1. explode() the HTML string on <, doing so will leave the tag name as the first word in each string in the array.
  2. Loop the array and explode each string on " " (space). Even if there is no space, this will result in an array of which the tag name is the first index, possibly followed by a > if there are no attributes.
  3. Get the tag name, trim it up and if it’s unique, add it to an array of tag names.

Here’s one possible implementation:

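    /**
     * Get a list of tag names in the provided HTML string
     * @return Array
     */
    function getAllTagNames($html){
        $tags = array();
        $part = explode("<", $html);
        array_shift($part); // whatever precedes the first "<" is text, not a tag
        foreach($part as $tag){
            $chunk = explode(" ", $tag);
            if(empty($chunk[0]) || $chunk[0][0] == "/") continue; // skip closing tags
            // without attributes the chunk looks like "b>text", so cut it at ">"
            $tag = trim(explode(">", $chunk[0])[0], " /");
            if($tag !== "" && !in_array($tag, $tags)) $tags[] = $tag;
        }
        return $tags;
    }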

Stripping tags (DEMO)

All that’s left to do is diff the array of tags to remove with the array of tags that exist to get an array of allowable tags. Then using a clever implode() we can generate the second parameter for strip_tags().

    /**
     * Strip only certain tags in the given HTML string
     * @return String
     */
    function removeTags($html, $tags){
        $existing_tags = getAllTagNames($html);
        $allowable_tags = '<'.implode('><', array_diff($existing_tags, $tags)).'>';
        return strip_tags($html, $allowable_tags);
    }


strip_tags() is somewhat picky about syntax. If your markup is not valid this will not work well as is. This means your example code will not work unless you correct the syntax first. While it would be trivial to write a function to make the required corrections in your example code, that is beyond the scope of the question.

I know this is not an answer regarding code quality, but instead of re-inventing the wheel, check this library for cleaning user inputted HTML:
