String tokenizer

Posted on



  • Create a function called wordsAndSpaces that splits a string on groups of spaces, preserving each group of space characters as a single token.
    Example: wordsAndSpaces "extra spaces" gives ["extra", " ", "spaces"]


  • Input strings will contains only alphanumeric characters and space characters (no need to handle special characters)
  • Only Prelude functions are allowed (no imports e.g. Data.List.Split)


  • I’m looking specifically for feedback on how I can break thought patterns I have developed from working in imperative languages. My code likely has the telltale signs of a recovering Java programmer 🙂
wordsAndSpaces :: String -> [String]
wordsAndSpaces = wordsAndSpaces' []

wordsAndSpaces' :: [String] -> String -> [String]
wordsAndSpaces' tokens [] = tokens
wordsAndSpaces' tokens (char : xs)
  | shouldAddToLastToken char tokens = wordsAndSpaces' (addToLastToken char tokens) xs
  | otherwise = wordsAndSpaces' (tokens ++ [[char]]) xs

shouldAddToLastToken :: Char -> [String] -> Bool
shouldAddToLastToken c tokens
  | null tokens = False
  | otherwise =
    (isSpace c && (isSpace . last . last) tokens)
      || ((not . isSpace) c && (not . isSpace . last . last) tokens)

addToLastToken :: Char -> [String] -> [String]
addToLastToken char tokens = init tokens ++ [last tokens ++ [char]]

isSpace :: Char -> Bool
isSpace = (== ' ')


The big thing that jumped out to me was how you seem to be treating Haskell lists like Java… Arrays? (I haven’t used Java since… 2006?) If you’re repeatedly fiddling with the elements at the end of lists, you’re probably doing something wrong. I’m not sure what the complexity of this all is, but I think it might be something like O(m2n2)O(m2n2) where mm is the length of the original string and nn is the number of tokens. Haskell lists are linked lists, finding arbitrary elements (of which the last is one) is an O(n)O(n) operation, as is appending any element to the tail of a list (++).

I think largely as you gain familiarity with the Prelude you’ll begin to intuit the shape of efficient Haskell algorithms. Now that’s not to say Haskell doesn’t also have arrays, but we tend to use them differently. (How exactly that is being way out of the scope of this question.)

In this case the concept you’re looking for is a span. That is, the part of a list at the front that matches a predicate, and then the rest. In this case we care about the predicates isSpace and not . isSpace.

-- REPL session, if you're not familiar with REPLs yet coming from Java, check out GHCi
> span isSpace "   word word"
("   ","word word")
> span (not . isSpace) "wordword    word word"
("wordword","    word word")

span is in the Prelude, but it’s worth taking a crack at implementing it yourself to start building an intuition for living in Linked List World. Try it before you move on and I start to really blow your mind.

Here’s a solution using span. Spoiler’d so you can figure it out first.

wordsAndSpaces :: String -> [String]
wordsAndSpaces [] = []
wordsAndSpaces cs@(c:_) = case isSpace c of
    True  -> let (w, rest) = span isSpace cs  in w : wordsAndSpaces rest
    False -> let (w, rest) = break isSpace cs in w : wordsAndSpaces rest
                          -- break p = span (not . p), just a convenient name

In that solution you might notice how both tines of the fork (as it were) are very, very similar. Generally this is a hint that a handy library function exists or there’s some common trick that has a deeply unintuitive category theoretical name and/or requires a flash of abstractly reasoned insight to recognize. Luckily in this case you would just have to know about Data.List.groupBy :: (a -> a -> Bool) -> [a] -> [[a]].

Coming from outside the Prelude groupBy doesn’t fit your criteria for usage in this practice problem, but as with span it’s something worth knowing for the way we think about decomposing problems. If you didn’t immediately understand what groupBy is or does from its type, that’s more than alright for a beginner, but give it a think before reading on.

groupBy breaks a list into sublists based on which stretches of values are like one another. This probably makes most sense when you look at an example.

> groupBy (==) [1,1,2,2,2,3,4,5,5,5,5]

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
isSpace :: (a      -> Bool)

So the question is, how do we turn the unary predicate isSpace into a binary predicate to pass to groupBy? The first step to recognize is that the Bool return values mean different things. isSpace says whether or not a character is a space, groupBy’s predicate is asking whether two things are the same. So we can test the result of isSpace on two elements for equality.

predicate :: Char -> Char -> Bool
predicate c d = isSpace c == isSpace d

But it turns out we even do that so frequently that there’s a name for it also, Data.Function.on :: (b -> b -> c) -> (a -> b) -> a -> a -> c. Used like—

predicate = (==) `on` isSpace

on takes a binary operation (b -> b -> c) and a unary operation (a -> b) and lifts the binop to work on as, whatever those may be. groupBy is a good example, sorts are another where you might do something like compare `on` color, if color had a type like Shape -> Color. We would like to sort Shapes by Color, so you use a higher order function (on) to build a function to do that! Try to implement on.

So, here’s how a Haskeller would write wordsAndSpaces.

wordsAndSpaces = groupBy ((==) `on` isSpace)

With that (probably, hopefully, shocking) motivating example, now try to write groupBy. Think back to the solution using span, consider its symmetries, and try to go one level up the abstraction stack. Ultimately Haskell coding is about solving the smallest part of a problem that you can, recognizing which higher order function can plumb that into solving a slightly larger problem, then repeating that process over and over again until you reach the end.

In this case (though I’ve kind of meandered), first we had to have figured out how to say whether or not something is a space. Next you figure out identifying a single stretch of spaces or not-spaces. Then the trick is categorizing all of the maybe-or-not-spaces. Each step in the path is something small and tackle-able on its own, and we try to consider the most general possible solution to it so that we don’t have to think about the whole problem at once.

Leave a Reply

Your email address will not be published. Required fields are marked *