Python tokenizer in Haskell + Parsec

Problem

I wrote a tokenizer/lexer (is there a difference?) for Python in Haskell; the code is on GitHub.

I have already tested it on some of CPython’s standard-library scripts, and it doesn’t fail. Its output probably doesn’t conform to what CPython’s tokenizer produces, but I don’t care about that right now.

What I’m looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.

Significant code snippets:

parseTokens :: LexemeParser [PositionedToken]
parseTokens = do
  P.skipMany ignorable
  tokens <- concat <$> P.many parseLogicalLine
  P.eof
  endIndents <- length . filter (>0) <$> getIndentLevels
  dedents <- mapM position $ replicate endIndents Dedent
  endMarker <- position EndMarker
  return (tokens ++ dedents ++ [endMarker])


parseIndentation :: LexemeParser [PositionedToken]
parseIndentation = mapM position =<< go
  where
    go = do
      curr <- length <$> P.many (P.char ' ')
      prev <- getIndentLevel
      case curr `compare` prev of
        EQ -> return []
        GT -> return [Indent] <* pushIndentLevel curr
        LT -> do
          levels <- getIndentLevels
          when (not $ elem curr levels) (fail $ "indentation must be: " ++ show levels)
          let (pop, levels') = span (> curr) levels
          putIndentLevels levels'
          return $ replicate (length pop) Dedent


parseLogicalLine :: LexemeParser [PositionedToken]
parseLogicalLine = do
  P.skipMany ignorable
  indentation <- parseIndentation
  (indentation ++) <$> begin
  where
    begin = do
      tokens <- P.many1 parsePositionedToken
      let continue = (tokens ++) <$> begin

      implicitJoin <- (> 0) <$> getOpenBraces
      if implicitJoin then do
        whitespace
        parseComment <|> (P.optional backslash *> eol)
        P.skipMany ignorable
        whitespace
        continue
      else do
        whitespace
        explicitJoin <- P.optionMaybe (backslash *> eol)
        case explicitJoin of
          Nothing   -> ((\t -> tokens ++ [t]) <$> position NewLine) <* P.optional eol
          otherwise -> whitespace *> continue

(I was largely inspired by PureScript’s lexer, though I probably ended up with something a lot worse.)

Solution

for comparison

Here are some other ways I’ve found to parse indentation-sensitive languages:

  1. Use Text.Parsec.Indent. Here is a blog post containing an example: (link)

  2. Have a look at the language-python package. Its lexer is built using Alex: (link)

  3. PureScript handles indentation via the mark, checkIndentation, indented and same functions: (link) Grep the rest of the source code for these functions to see how they are used. Its lexer defines an Indent token (link), but from what I can tell it is never returned; see, for example, (link). The parser keeps track of the indentation level (link) via the mark function. A minimal sketch of this mark-and-check idea follows below.
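
Here is a minimal, self-contained sketch of that mark-and-check approach (my own illustration, not PureScript’s actual code; the names IndentParser, mark, indented and same only mirror the ones above): the parser’s user state holds the column of the current mark, mark runs a parser with the mark set to the current column, and indented/same compare the current column against it.

import Control.Monad (when)
import Text.Parsec

-- The user state is the column of the current indentation mark.
type IndentParser a = Parsec String Column a

-- Run a parser with the mark set to the current column,
-- restoring the previous mark afterwards.
mark :: IndentParser a -> IndentParser a
mark p = do
  oldMark <- getState
  putState . sourceColumn =<< getPosition
  result <- p
  putState oldMark
  return result

-- Succeed only if the current token starts deeper than the mark.
indented :: IndentParser ()
indented = do
  col <- sourceColumn <$> getPosition
  m   <- getState
  when (col <= m) (fail "expected deeper indentation")

-- Succeed only if the current token starts exactly at the mark.
same :: IndentParser ()
same = do
  col <- sourceColumn <$> getPosition
  m   <- getState
  when (col /= m) (fail "expected the same indentation")

A block-structured construct is then parsed as something like mark (header *> P.many (same *> statement)), with indented guarding continuation lines; runParser with an initial state of 1 kicks things off.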

returning a list of tokens?

I’m not sure there is a lot of value in parsers which return [PositionedToken] except, perhaps, for testing purposes.

Most of the time you want to run a token parser incrementally, but parseTokens has to consume the entire file before it can return anything.

There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have…
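
Concretely, the interface could have roughly this shape (a standalone illustration; Token', LexerState, nextToken and tokenize are made-up names, not code from the post):

-- Stand-ins for the real PositionedToken and lexer state.
newtype Token' = Token' String deriving Show
newtype LexerState = LexerState [String]

-- One step: a lexical error, end of input, or the next token
-- together with the state to continue from.
nextToken :: LexerState -> Either String (Maybe (Token', LexerState))
nextToken (LexerState [])     = Right Nothing
nextToken (LexerState (w:ws)) = Right (Just (Token' w, LexerState ws))

-- The whole-file tokenizer becomes a thin loop over nextToken,
-- and callers that don't need every token can simply stop early.
tokenize :: LexerState -> Either String [Token']
tokenize st = nextToken st >>= maybe (Right []) (\(t, st') -> (t :) <$> tokenize st')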

handling dedents at eof

I would find another way to handle implicit dedents at the end of the file. I’m sure this can be handled at the grammar level. The code to add them is extremely messy – don’t you agree?

Have a look at how the language-python package does it:

https://github.com/bjpop/language-python/blob/master/src/Language/Python/Version2/Parser/Lexer.x#L229-L242

The function lexToken returns just a single token.
The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is > 1, it decrements it and returns a dedentToken. Otherwise it returns an EOF token.
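
The same idea in a self-contained Haskell sketch (my illustration of the technique, not language-python’s actual code; Tok, LexSt, lexToken and lexAll are made-up names): the lexer’s state carries the indentation stack, and once the input is exhausted the single-token function pops one level per call and emits a Dedent each time, so the end-of-file dedents fall out of the ordinary token loop instead of being appended afterwards.

import Control.Monad.State

data Tok = Word String | Dedent | EOFTok
  deriving Show

data LexSt = LexSt
  { lexInput  :: [String]  -- stand-in for the remaining input
  , lexLevels :: [Int]     -- indentation stack, base level included
  }

-- Return exactly one token per call, like language-python's lexToken.
lexToken :: State LexSt Tok
lexToken = do
  st <- get
  case lexInput st of
    (w:ws) -> do
      put st { lexInput = ws }
      return (Word w)
    [] -> case lexLevels st of
      -- More than the base level is still open: close one, emit a Dedent.
      (_:rest@(_:_)) -> do
        put st { lexLevels = rest }
        return Dedent
      -- Only the base level remains: we are really done.
      _ -> return EOFTok

-- Driving the lexer to completion is a separate, trivial concern.
lexAll :: State LexSt [Tok]
lexAll = do
  t <- lexToken
  case t of
    EOFTok -> return [EOFTok]
    _      -> (t :) <$> lexAll

For example, evalState lexAll (LexSt ["pass"] [8, 4, 0]) yields [Word "pass", Dedent, Dedent, EOFTok].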
