Parsing annotation

Problem

I have implemented code for parsing annotation:

/**
 * @Route(path="sample n test",code,value,boolean,test)
 * @access(code=false)
 * @sample as asdad asd

 * asd
 */
function sample()
{

}

$refelection = new ReflectionFunction("sample");
$pattern = "/@(\w+?)(.*?)/U";

preg_match_all($pattern, $refelection->getDocComment(), $matches);

$matches = array_combine($matches[1], $matches[2]);

foreach ($matches as $key => $value)
{
    $params = array();
    $token = token_get_all("<?php " . trim($value) . "?>");

    if (substr($value, 0, 1) !== "(" || substr($value, -2, -1) !== ")")
    {
        continue;
    }
    echo $key;
    $limit = count($token) - 2;
    for ($i = 2; $i < $limit; $i++)
    {
        if (array_key_exists($i + 1, $token) && $token[$i + 1] == "=")
        {
            if (!is_array($token[$i + 2]))
            {
                die("invalid");
            }

            $params[$token[$i][1]] = $token[$i + 2][1];
            $i+=3;
        }
        else
        {
            if (!is_array($token[$i + 2]))
            {
                die("invalid");
            }

            $params[$token[$i][1]] = NULL;
            $i+=1;
        }

        if ($token[$i] !== "," && $token[$i] !== ")")
        {
            die("invalid");
        }
    }
    var_dump($params);
}

Please tell me if there are any cons, limitations, or bugs in this code, or any alternatives.

The last annotation in the doc comment is ignored, as you can see.

Solution

Focusing on the regex side only...

Applying the UNGREEDY modifier (U) to your pattern and then reversing it inside the pattern with the ? quantifier suffixes is odd.

$pattern = "/@(\w+?)(.*?)/U";

This can be written more simply as:

$pattern = "/@(\w+)(.*)/";

The above will work because the regex engine will favour matching the full \w+ before it starts on the .*. If you want to make it explicit (and I would, for the record), you can force a zero-width word-boundary anchor (\b) in there, and write the pattern as:

$pattern = "/@(\w+)\b(.*)/";
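
As a quick sanity check, the three patterns can be run against the doc comment from the question and their results compared. This is only a sketch: the $patterns and $results names are made up for the illustration, and the doc comment is pasted in as a string rather than read through reflection.

```php
<?php
// Doc comment copied from the question.
$doc = <<<'DOC'
/**
 * @Route(path="sample n test",code,value,boolean,test)
 * @access(code=false)
 * @sample as asdad asd

 * asd
 */
DOC;

// The original pattern and the two rewrites discussed above.
$patterns = [
    'original' => '/@(\w+?)(.*?)/U',
    'simpler'  => '/@(\w+)(.*)/',
    'explicit' => '/@(\w+)\b(.*)/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    preg_match_all($pattern, $doc, $m);
    // Map annotation name => rest of the line, as the question's code does.
    $results[$label] = array_combine($m[1], $m[2]);
}

var_dump($results['original'] === $results['simpler']);  // bool(true)
var_dump($results['simpler'] === $results['explicit']);  // bool(true)
```

All three capture the same name/value pairs for this input, so the simpler forms should be safe drop-in replacements here.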

Adding a second answer to address the parsing and PHP side of things.

You have some inconsistencies in the parsing of the $value. You have two specific places where you expect contradictory values. First, you have:

$token = token_get_all("<?php " . trim($value) . "?>");

where you trim(...) the value, which implies you expect white-space on it.

On the very next line you check that the first and last characters are actually the parentheses ( and ). You should trim the value before that check, because trailing whitespace is surprisingly common and would be legal. I cannot find a good reference on annotation syntax, but it appears that there cannot be a space between the annotation name and the opening parenthesis. Is this true?
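
A minimal sketch of "trim first, then check" might look like the following. The isParenthesised() helper is hypothetical and not part of the question's code. Note also that the original check uses substr($value, -2, -1), which inspects the second-to-last character; that may have been compensating for a trailing character such as a carriage return, which trimming makes unnecessary.

```php
<?php
// Hypothetical helper: normalise the raw capture once, then validate it.
function isParenthesised(string $raw): bool
{
    $value = trim($raw); // strips spaces, tabs, "\r", "\n" from both ends

    return strlen($value) >= 2
        && $value[0] === '('
        && substr($value, -1) === ')';
}

var_dump(isParenthesised("(code=false)"));      // bool(true)
var_dump(isParenthesised("(code=false)  \r"));  // bool(true)
var_dump(isParenthesised(" as asdad asd"));     // bool(false)
```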

Still, the correct solution for the whitespace problem is to solve it in the regex. This can also solve the parenthesis-checking problem. Consider the regex:

$pattern = "/@(\w+)\(\s*([^(]*?)\s*\)/";

I have tested this out on regex101, and it looks good, but I have had to add the g modifier to the expression. I am not sure whether this is required on the match statement.

By using the pattern above:

  • you will not need to trim the $value
  • you will only match annotations with parentheses, so there is no need to check for them
  • the value will be ‘trimmed’ as well.
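
To illustrate the points above, here is a small sketch that applies the pattern with preg_match_all(). On the modifier question: PHP's PCRE functions have no g modifier, and preg_match_all() already collects every match, so nothing extra should be needed. The doc comment is a slightly altered copy of the one in the question (extra spaces inside @access) to show the trimming effect.

```php
<?php
$doc = <<<'DOC'
/**
 * @Route(path="sample n test",code,value,boolean,test)
 * @access( code=false )
 * @sample as asdad asd
 */
DOC;

// Name in group 1, parenthesised body (without the parentheses and
// without surrounding whitespace) in group 2.
preg_match_all('/@(\w+)\(\s*([^(]*?)\s*\)/', $doc, $m);

$annotations = array_combine($m[1], $m[2]);
var_dump($annotations);
```

This captures Route and access with their bodies already stripped of parentheses and padding, and skips @sample because it has no parentheses. One caveat of this sketch: the body stops at the first ")", so a quoted value containing a parenthesis would be cut short.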

This means you will have to adjust the parsing part slightly so that it does not expect the “(” and “)” tokens.
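
One possible shape for that adjusted parser is sketched below. It is not the question's code: parseAnnotationBody() is a hypothetical helper, it throws exceptions where the original calls die(), and it filters out whitespace tokens so that spaces around "=" or "," do not matter. As in the original, string values keep their quotes because the raw token text is stored.

```php
<?php
/**
 * Hypothetical helper: parse "name=value, flag" pairs from an annotation
 * body that no longer includes the surrounding parentheses.
 */
function parseAnnotationBody(string $body): array
{
    $tokens = token_get_all('<?php ' . $body . ' ?>');

    // Drop the open/close tags and whitespace; keep only meaningful tokens.
    $tokens = array_values(array_filter($tokens, function ($t) {
        return !is_array($t)
            || !in_array($t[0], [T_OPEN_TAG, T_CLOSE_TAG, T_WHITESPACE], true);
    }));

    $params = [];
    $count = count($tokens);
    $i = 0;
    while ($i < $count) {
        // A parameter name must be a real token (array), not punctuation.
        if (!is_array($tokens[$i])) {
            throw new InvalidArgumentException('expected a parameter name');
        }
        $name = $tokens[$i][1];
        $i++;

        // Optional "= value" part; bare names map to null.
        $value = null;
        if ($i < $count && $tokens[$i] === '=') {
            if (!isset($tokens[$i + 1]) || !is_array($tokens[$i + 1])) {
                throw new InvalidArgumentException('expected a value after "="');
            }
            $value = $tokens[$i + 1][1];
            $i += 2;
        }
        $params[$name] = $value;

        // A separator must be a comma and must be followed by another parameter.
        if ($i < $count) {
            if ($tokens[$i] !== ',' || $i + 1 >= $count) {
                throw new InvalidArgumentException('expected "," followed by a parameter');
            }
            $i++;
        }
    }

    return $params;
}

var_dump(parseAnnotationBody('path="sample n test",code,value,boolean,test'));
```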

The rest of the parsing does look fine (to my untrained eye).
