Python – basis of a parser library, followup

Posted on


This is a followup to this question: link. As advised there, I’ve rewritten the State class, which also simplified some of the other code.

I’m writing a parser library. I would like to know if the basis of it is sound and whether it can be improved in any way. The entire code can be seen in this repo: link, in the frozen branch review-11-02-2018 (file I’ll post relevant portions below.

The library is written around two things: a State class, and a notion of an effect. State objects represent current state of a parser chain, and keep track of the input that’s left to parse and a portion of input parsed by the last parser. A parser is just a callable that takes a State object and returns a new one (State objects are immutable). Output of one parser is passed to the next parser in a chain. A parser may also fail by throwing a ParsingFailure exception. Any parser in the chain may register an effect – a callable that takes an arbitrary value as the first argument and a State object as the second. If the chain succeeds, all effects registered during the parsing run are applied sequentially to a seed (with the seed or return value of the previous effect being the first argument, and the state of the chain at the moment of effect registration being the second), and the return value of the last effect becomes the output of entire chain, along with the final state. Is the concept sane? It works, but is it a reasonable way to do this?

State class is just a named tuple with some extra methods and is defined as follows:

class State(namedtuple("State", "string effect left_start left_end parsed_start parsed_end")):
    State objects represent current state of a parser chain (or an individual

    State objects provide two views over the input string: 'left', which spans
    a substring between 'left_start' and 'left_end' and represents unparsed
    input left from the previous parser, and 'parsed', which spans a substring
    between 'parsed_start' and 'parsed_end' and represents a portion of input
    the previous parser has parsed. Windows may overlap, and usually
    'parsed_end' and 'left_start' are the same, but not always.

    A State object is immutable and has following fields:
    * string (str): the input the parser chain is supposed to parse.
    * effect ((value, state) -> value): if the chain is successful, this will
      be called in sequence with other effects from the chain to form the
      chain's output value.
    * left_start, left_end (int): see above about the 'left' window.
    * parsed_start, parser_end (int): see above about the 'parsed' window.

    State objects are just named tuples, so they support a very convenient
    '_replace' method. !Note!: to avoid duplicating effects accidentally,
    '_replace' treats lack of 'effect' in its arguments as 'effect=None'. So if
    you want to copy an effect from another parser, you have to do it

    State objects' constructor takes the following arguments:
    1. string - the input.
    2. effect=None - the effect, transformation to be performed on success of
       the last parser.
    3. start=0 - will be translated into 'left_start'
    4. end=None - will be translated into 'left_end'. If set to None,
      'left_end' will be set to the length of the input.
    State objects created via this constructor have both 'parsed_start' and
    'parsed_end' set to 'left_start'.

    State objects have several properties:
    * left - returns a slice of input that's left to parse.
    * left_len - returns the length of the above slice without computing the
      slice itself.
    * parsed - returns a slice of input that's been parsed.
    * parsed_len - returns the length of the above slice, again without
      computing the slice.

    Finally, State objects have following public methods:
    * consume(how_many) - move 'how_many' characters from the left window into
      the parsed window. Raise ValueError if more input was consumed than left.
    * split(at) - split the State in two (and return them). The first keeps
      the input up to, but not including, 'at' as its 'left' window, the second
      gets the rest. Both have their 'parsed' windows reset to an empty string.
      The first gets 'effect' of the original, the second gets None.

    __slots__ = []

    def __new__(cls, string, effect=None, start=0, end=None):
        if end is None:
            end = len(string)
        assert 0 <= start <= end <= len(string)
        return super().__new__(cls, string, effect, start, end, start, start)

    def _replace(self, **kwargs):
        if "effect" not in kwargs:
            return super()._replace(effect=None, **kwargs)
        return super()._replace(**kwargs)

    def consume(self, how_many):
        Return a new State object with 'how_many' characters consumed and moved
        to the 'parsed' window.

        Raise ValueError if 'how_many' is negative or if consuming more
        characters than left in the 'left' window.
        if how_many < 0:
            raise ValueError("Negative number of consumed characters")
        left_start = self.left_start + how_many
        parsed_start = self.left_start
        parsed_end = parsed_start + how_many
        if left_start > self.left_end:
            raise ValueError("Consumed more characters than fits in the 'left' window")
        return self._replace(left_start=left_start, parsed_start=parsed_start,

    def split(self, at):
        Split the State in two. The first one keeps a portion of input up to
        'at'th character (exclusive), the second one gets the rest. Both have
        'parsed' window reset to an empty string. First one gets the effect of
        the original, the second one gets None.
        split_point = self.left_start + at
        first = self._replace(effect=self.effect,
        second = self._replace(effect=None,
        return first, second

    def left(self):
        Return the portion of input the last parser hasn't consumed.
        return self.string[self.left_start:self.left_end]

    def left_len(self):
        Return the length of the portion of input the last parser hasn't
        return self.left_end - self.left_start

    def parsed(self):
        Return the string parsed by the last parser.
        return self.string[self.parsed_start:self.parsed_end]

    def parsed_len(self):
        Return the length of the string parsed by the last parser.
        return self.parsed_end - self.parsed_start

Hopefully docstrings are descriptive enough. If any further clarification is needed, please tell me, I’ll edit it in.

Another important thing is parse function, which the user is supposed to call with a parser instead of calling parsers directly. Here it is:

def parse(seed, state_or_string, parser, verbose=False):
    Run a given parser on a given state object or a string, then apply combined
    chain or parser's effects to 'seed' and return a tuple
    (seed after effects, final state).

    On failure, return None unless 'verbose' is truthy, in which case return
    the ParsingFailure exception that has terminated the parsing process.
    if isinstance(state_or_string, str):
        state = State(state_or_string)
        state = state_or_string
        after = parser(state)
        if after.effect is not None:
            return after.effect(seed, after), after
        return seed, after
    except ParsingFailure as failure:
        if verbose:
            return failure
        return None
    except ParsingEnd as end:
        if end.state.effect is not None:
            return end.state.effect(seed, end.state), end.state
        return seed, end.state

Another important thing is chain parser generator, which performs the chaining logic described above, but I don’t want to post it here, because a) the question is bloated already, b) it also deals with lookahead, which seems to me a matter worth asking a separate question.

If you’ve read this far, thank you! Any suggestions on improving the library?


Following up on the previous comments they made about the parser, I will focus on code legibility

You may consider moving some paragraphs of the class docstring to the relevant part of the code, like the paragraph about A State object is immutable and has following fields may suit better on the __new__ method, documenting as well the parameter

Also, don’t forget to document the parameters of methods/functions, will help a lot knowing what your parser is doing when invoking them

Parameters like at might be better named as index which is universally used for it.

Parameters like how_many may be better named as characters_count or consumed_characters_count or similar, simply by reading it gives better insight on what the parameter refers to

if verbose:
    return failure
return None

This can be translated to ternary operator

return failure if verbose else None

For the following small methods, I would consider better naming as well, like this one

def left(self):
    Return the portion of input the last parser hasn't consumed.

Why not calling it portion_not_consumed or something on that way?

Just imagine you don’t know anything about a library, and you find a method called left_len. Unless you read the docstring, is hard to tell what is doing

You can as well consider putting some spacing between the lines of a same function, to improve a bit readability between the sections of it

Keep the good work!

Leave a Reply

Your email address will not be published. Required fields are marked *