Problem
I wrote a tokenizer/lexer (difference?) for Python in Haskell: this is my code on GitHub.
I already tested it out with some of CPython’s standard library scripts, and it doesn’t fail. However, it probably outputs non-conforming output. But I don’t care about this right now.
What I’m looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.
Significant code snippets:
parseTokens :: LexemeParser [PositionedToken]
parseTokens = do
P.skipMany ignorable
tokens <- concat <$> P.many parseLogicalLine
P.eof
endIndents <- length . filter (>0) <$> getIndentLevels
dedents <- mapM position $ replicate endIndents Dedent
endMarker <- position EndMarker
return (tokens ++ dedents ++ [endMarker])
parseIndentation :: LexemeParser [PositionedToken]
parseIndentation = mapM position =<< go
where
go = do
curr <- length <$> P.many (P.char ' ')
prev <- getIndentLevel
case curr `compare` prev of
EQ -> return []
GT -> return [Indent] <* pushIndentLevel curr
LT -> do
levels <- getIndentLevels
when (not $ elem curr levels) (fail $ "indentation must be: " ++ show levels)
let (pop, levels') = span (> curr) levels
putIndentLevels levels'
return $ replicate (length pop) Dedent
parseLogicalLine :: LexemeParser [PositionedToken]
parseLogicalLine = do
P.skipMany ignorable
indentation <- parseIndentation
(indentation ++) <$> begin
where
begin = do
tokens <- P.many1 parsePositionedToken
let continue = (tokens ++) <$> begin
implicitJoin <- (> 0) <$> getOpenBraces
if implicitJoin then do
whitespace
parseComment <|> (P.optional backslash *> eol)
P.skipMany ignorable
whitespace
continue
else do
whitespace
explicitJoin <- P.optionMaybe (backslash *> eol)
case explicitJoin of
Nothing -> ((t -> tokens ++ [t]) <$> position NewLine) <* P.optional eol
otherwise -> whitespace *> continue
(I was largely inspired by PureScript’s lexer, though I probably ended up with something a lot worse.)
Solution
for comparison
Here are some other ways I’ve found to parse indentation sensitive language:
-
Use
Text.Parsec.Indent
Here is a blog post containing an example: (link) -
Have a look at the language-python package. It’s lexer is built using
alex
: (link) -
Purescript handles indentation via the
mark
,checkIndentation
,indented
andsame
functions: (link) Grep the rest of the source code for these functions to see how they are used. Its lexer defines aIndent
token (link) but from what I can tell it is never returned, i.e. see (link). The indentation level is kept track of by the parser (link) via themark
function.
returning a list of tokens?
I’m not sure there is a lot of value in parsers which return [PositionedToken]
except, perhaps, for testing purposes.
Most of the time you want to run a token parser incrementally, but parseTokens
has to consume the entire file before it can return anything.
There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have…
handling dedents at eof
I would find another way to handle implicit dedents at the end of the file. I’m sure this can be handled at the grammar level. The code to add them is extremely messy – don’t you agree?
Have a look at how the language-python
package does it:
The function lexToken
returns just a single token.
The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is > 1, it decrements it and returns a dedentToken
. Otherwise it returns an EOF token.