I wrote a tokenizer/lexer (difference?) for Python in Haskell: this is my code on GitHub.
I already tested it out with some of CPython’s standard library scripts, and it doesn’t fail. However, it probably outputs non-conforming output. But I don’t care about this right now.
What I’m looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.
Significant code snippets:
parseTokens :: LexemeParser [PositionedToken] parseTokens = do P.skipMany ignorable tokens <- concat <$> P.many parseLogicalLine P.eof endIndents <- length . filter (>0) <$> getIndentLevels dedents <- mapM position $ replicate endIndents Dedent endMarker <- position EndMarker return (tokens ++ dedents ++ [endMarker]) parseIndentation :: LexemeParser [PositionedToken] parseIndentation = mapM position =<< go where go = do curr <- length <$> P.many (P.char ' ') prev <- getIndentLevel case curr `compare` prev of EQ -> return  GT -> return [Indent] <* pushIndentLevel curr LT -> do levels <- getIndentLevels when (not $ elem curr levels) (fail $ "indentation must be: " ++ show levels) let (pop, levels') = span (> curr) levels putIndentLevels levels' return $ replicate (length pop) Dedent parseLogicalLine :: LexemeParser [PositionedToken] parseLogicalLine = do P.skipMany ignorable indentation <- parseIndentation (indentation ++) <$> begin where begin = do tokens <- P.many1 parsePositionedToken let continue = (tokens ++) <$> begin implicitJoin <- (> 0) <$> getOpenBraces if implicitJoin then do whitespace parseComment <|> (P.optional backslash *> eol) P.skipMany ignorable whitespace continue else do whitespace explicitJoin <- P.optionMaybe (backslash *> eol) case explicitJoin of Nothing -> ((t -> tokens ++ [t]) <$> position NewLine) <* P.optional eol otherwise -> whitespace *> continue
(I was largely inspired by PureScript’s lexer, though I probably ended up with something a lot worse.)
Here are some other ways I’ve found to parse indentation sensitive language:
Text.Parsec.IndentHere is a blog post containing an example: (link)
Have a look at the language-python package. It’s lexer is built using
Purescript handles indentation via the
samefunctions: (link) Grep the rest of the source code for these functions to see how they are used. Its lexer defines a
Indenttoken (link) but from what I can tell it is never returned, i.e. see (link). The indentation level is kept track of by the parser (link) via the
returning a list of tokens?
I’m not sure there is a lot of value in parsers which return
[PositionedToken] except, perhaps, for testing purposes.
Most of the time you want to run a token parser incrementally, but
parseTokens has to consume the entire file before it can return anything.
There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have…
handling dedents at eof
I would find another way to handle implicit dedents at the end of the file. I’m sure this can be handled at the grammar level. The code to add them is extremely messy – don’t you agree?
Have a look at how the
language-python package does it:
lexToken returns just a single token.
The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is > 1, it decrements it and returns a
dedentToken. Otherwise it returns an EOF token.