Validator and sanitizer regex for HTML5 attribute names according to the current HTML Living Standard

According to the HTML Living Standard, an attribute name is defined like this:

Attribute names must consist of one or more characters other than
controls, U+0020 SPACE, U+0022 ("), U+0027 ('), U+003E (>), U+002F
(/), U+003D (=), and noncharacters. In the HTML syntax, attribute
names, even those for foreign elements, may be written with any mix of
ASCII lower and ASCII upper alphas.

While creating a class that can validate or sanitize HTML5 attribute names, I ended up with the following code, in particular the following regex:

class AttributeNameValidator
{
    public const ATTRIBUTE_NAME_MATCHER = "/[\s\x{0000}\x{0020}\x{0022}\x{0027}\x{003E}\x{002F}\x{003D}\x{200B}-\x{200D}\x{FDD0}-\x{FDEF}\x{FEFF}[:cntrl:]]+/u";

    /**
     * Checks if a given string is a valid HTML attribute name.
     * @param string $attributeName
     * @return bool True if the given attribute name is a valid HTML attribute name.
     */
    public static function isAttributeNameValid(string $attributeName): bool
    {
        return (bool)preg_match(self::ATTRIBUTE_NAME_MATCHER, $attributeName);
    }

    /**
     * Sanitizes a string to be a valid HTML5 attribute name.
     * @param string $attributeName
     * @return string
     * @throws NonSanitizeableException
     */
    public static function sanitizeAttributeName(string $attributeName): string
    {
        $sanitizedAttributeName = preg_replace(self::ATTRIBUTE_NAME_MATCHER, '', $attributeName);
        if(!$sanitizedAttributeName) {
            throw new NonSanitizeableException("Failed to sanitize attribute name");
        }
        return $sanitizedAttributeName;
    }
}

My manual tests seem to work well, but I am still not sure whether the regex exactly matches the standard or whether I have forgotten something. Is there anything left to improve?


  • The \x{0020} (SPACE) character is already included in \s, so you can omit that part.
  • I prefer to use \p{Cc} over [:cntrl:] for consistency and because I don’t like to have nested square brackets in a character class.
  • I expect isAttributeNameValid() to check that the whole string contains NO blacklisted characters. If you want to match the entire string, you will need “start of string” and “end of string” anchors and a negated character class in your pattern. But wait: you have a regular character class, and you are returning true if one or more blacklisted characters are in the string, which seems precisely the opposite of what the method name suggests. Unless I am confused, you should replace (bool) with ! so that you return false when a match is made and true when no blacklisted characters are found.
  • I don’t know if I agree with the wording of Failed to sanitize attribute name; it seems more truthful to say Attribute name had no salvageable characters.
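
Putting these points together, a minimal sketch might look like the following. It is only an illustration under a few assumptions: the class name AttributeNameValidatorSketch and the constant FORBIDDEN_CHARACTERS are made up for this example, InvalidArgumentException stands in for the NonSanitizeableException class that is not shown in the question, and the noncharacter coverage is limited to U+FDD0 to U+FDEF as in the original pattern (the code points ending in FFFE or FFFF would still need to be added).

```php
<?php
declare(strict_types=1);

// Hypothetical class name, used only for this sketch.
final class AttributeNameValidatorSketch
{
    // Characters the quoted rule forbids: whitespace (\s already covers U+0020),
    // controls (\p{Cc}), the characters " ' > / = and the noncharacters
    // U+FDD0..U+FDEF. Single quotes keep the backslashes intact for PCRE.
    private const FORBIDDEN_CHARACTERS = '/[\s\p{Cc}"\'>\/=\x{FDD0}-\x{FDEF}]/u';

    // Valid means: not empty and no forbidden character anywhere in the string.
    // preg_match() returns 0 only when nothing matched; an error (for example
    // invalid UTF-8 input) yields false, which is treated as "not valid" here.
    public static function isAttributeNameValid(string $attributeName): bool
    {
        return $attributeName !== ''
            && preg_match(self::FORBIDDEN_CHARACTERS, $attributeName) === 0;
    }

    // Removes every forbidden character and throws if nothing usable is left.
    public static function sanitizeAttributeName(string $attributeName): string
    {
        $sanitized = preg_replace(self::FORBIDDEN_CHARACTERS, '', $attributeName);
        // Explicit checks, so that a name such as "0" is not rejected as falsy.
        if ($sanitized === null || $sanitized === '') {
            // Stand-in for the question's NonSanitizeableException (an assumption).
            throw new InvalidArgumentException('Attribute name had no salvageable characters');
        }
        return $sanitized;
    }
}
```

With this sketch, isAttributeNameValid('data-id') returns true and isAttributeNameValid('on click') returns false, while sanitizeAttributeName('on click="x"') returns onclickx. Testing for the absence of a forbidden character avoids the need for anchors and a negated character class.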
