Haskell sentence segregation

I am trying to implement sentence segregation in Haskell. I have achieved a decent bulk of it using the NLP.FullStop library, but it doesn’t seem to account for sentences with full stops at the end of quotes like this." or like this.', or at the end of bracketed sentences like this.) I also want to treat the ” character in much the same way as ", since a lot of the content I am dealing with uses it. I’ve been unable to get a successful regex match on ”, so I have resorted to replacing it with " before the regex runs:

import qualified Data.ByteString.Char8 as BC
import           Data.List.Split
import qualified NLP.FullStop as FS

splitter :: String -> [String]
splitter = concatMap FS.segment . splitPunc
  where splitPunc = map unwords . split puncSplitter . words
        puncSplitter = keepDelimsR $ whenElt (\word -> BC.pack (splitPrep word) =~ puncExpr :: Bool)
        splitPrep = replace_ '”' '"'
        puncExpr = "\\.[)'\"][^\\w]?$" :: String

replace_ :: Eq b => b -> b -> [b] -> [b]
replace_ a b = map (\x -> if (a == x) then b else x)


While your code works and uses type signatures, it’s missing documentation. It’s not clear from your description or your code what splitter’s intended result is for a given input. Documentation and tests are therefore highly welcome.

Also, it’s not clear why you’ve added an underscore to replace_. Your code is also missing at least one import, namely the one that provides =~. I assume you just left that import line out of your question and that it is present in your actual code.

That being said, the fullstop library is, according to its own documentation, a placeholder library:

Note that this package is mostly a placeholder. I hope the Haskell/NLP
communities will run with it and upload a more sophisticated (family
of) segmenter(s) in its place. Patches (and new maintainers) would be
greeted with delight!

Your trouble with sentence endings also comes from segment, since it hard-codes the allowed stop punctuation:

-- https://hackage.haskell.org/package/fullstop-0.1.4/docs/src/NLP-FullStop.html#stopPunctuation
stopPunctuation :: [Char]
stopPunctuation = [ '.', '?', '!' ] -- <<<<

Unfortunately, you cannot simply add ) or " to stopPunctuation, since content in parentheses (like this) does not start a new sentence. Note also that .) and ." aren’t valid in some languages, which require ). and ". instead, so it’s not clear what you are trying to achieve there (see the remark about documentation above).
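If the goal is what I think it is, one option is to skip stopPunctuation entirely and let closing quotes and brackets trail the stop. The following is only a sketch under that assumption. It uses base alone, it is not how fullstop works internally, and segmentWithClosers, stopChars and closers are names I made up for illustration:

```haskell
import Data.Char (isSpace)

-- | Characters that end a sentence (the same three fullstop hard-codes).
stopChars :: [Char]
stopChars = ".?!"

-- | Closing characters that may directly follow a stop and still belong
-- to the sentence. This set is an assumption; adjust it to your content.
closers :: [Char]
closers = ")\"'”’"

-- | Split text into sentences. A sentence ends at a run of stop
-- characters plus any closers directly after it, but only if that is
-- followed by whitespace or the end of the input.
segmentWithClosers :: String -> [String]
segmentWithClosers = go . dropWhile isSpace
  where
    go [] = []
    go s  = let (sentence, rest) = takeSentence s
            in sentence : go (dropWhile isSpace rest)

    takeSentence :: String -> (String, String)
    takeSentence [] = ([], [])
    takeSentence (c:cs)
      | c `elem` stopChars =
          let (stops, afterStops) = span (`elem` stopChars) cs
              (trail, rest)       = span (`elem` closers) afterStops
          in if boundary rest
               then (c : stops ++ trail, rest)
               else let (more, rest') = takeSentence rest
                    in (c : stops ++ trail ++ more, rest')
      | otherwise =
          let (more, rest) = takeSentence cs
          in (c : more, rest)

    -- A sentence may only end before whitespace or at end of input,
    -- so "3.14" is not split in the middle.
    boundary []    = True
    boundary (x:_) = isSpace x
```

Note that this still splits after abbreviations such as "e.g. " and does nothing for languages that write ). or ". instead, which is exactly why the intended behaviour needs to be written down first.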

So all in all, the code is well written, but without additional explanation or documentation there is no way to check whether the function actually does what you want. I also suggest adding some tests.
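To make that suggestion concrete, here is a minimal, base-only sketch of what such tests could look like. The expectations in expectedBehaviour are my guess at what you want, not something I could derive from your code, and Case, failures and expectedBehaviour are hypothetical names:

```haskell
-- | A test case: the input text and the sentences expected back.
type Case = (String, [String])

-- | Run a segmenter over the cases and return every mismatch as
-- (input, expected, actual). An empty result means all cases passed.
failures :: (String -> [String]) -> [Case] -> [(String, [String], [String])]
failures seg cases =
  [ (input, expected, actual)
  | (input, expected) <- cases
  , let actual = seg input
  , actual /= expected
  ]

-- | Assumed expectations; replace them with the behaviour you intend.
expectedBehaviour :: [Case]
expectedBehaviour =
  [ ("One. Two.", ["One.", "Two."])
  , ("He said \"stop.\" Then he left.", ["He said \"stop.\"", "Then he left."])
  , ("(Like this.) Next.", ["(Like this.)", "Next."])
  ]
```

Evaluating failures splitter expectedBehaviour in GHCi would then list every input on which splitter disagrees with the written-down expectation. A table like this also doubles as the documentation that is currently missing.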
