I am trying to implement sentence segmentation in Haskell. I have achieved a decent bulk of it using the NLP.FullStop library, but this doesn’t seem to account for sentences with full stops at the end of quotes like this." or like this.', or at the end of bracketed sentences like this.) I also want to deal with the character ” in much the same way as ", as a lot of the content I am dealing with uses this character. I’ve been unable to get a successful regex match on this character, so I have resorted to replacing it with " before the regex…
    import qualified Data.ByteString.Char8 as BC
    import Data.List.Split
    import qualified NLP.FullStop as FS

    splitter :: String -> [String]
    splitter = concatMap FS.segment . splitPunc
      where
        splitPunc    = map unwords . split puncSplitter . words
        puncSplitter = keepDelimsR $ whenElt (\word -> BC.pack (splitPrep word) =~ puncExpr :: Bool)
        splitPrep    = replace_ '”' '"'
        puncExpr     = "\\.[)'\"][^w]?$" :: String

    replace_ :: Eq b => b -> b -> [b] -> [b]
    replace_ a b = map (\x -> if (a == x) then b else x)
While your code works and uses type signatures, it’s missing documentation. It’s not clear from your description or your code what splitter’s intended result is for a given input. Documentation and tests are therefore highly welcome.
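One lightweight way to pin down the intended behaviour is a table of inputs and expected sentences. The following is only a sketch using base: the names Case, cases and failures are hypothetical, and the expectations are my reading of the examples in the question, so they should be replaced with whatever splitter is really meant to return.

```haskell
-- | A test case: an input text and the sentences we expect back.
type Case = (String, [String])

-- | Expected behaviour. These expectations are assumptions derived from
-- the question's examples, not confirmed requirements.
cases :: [Case]
cases =
  [ ("He said \"stop.\" Then he left.", ["He said \"stop.\"", "Then he left."])
  , ("(It rained.) We stayed in.", ["(It rained.)", "We stayed in."])
  , ("A remark (like this) goes on.", ["A remark (like this) goes on."])
  ]

-- | Every case the given segmenter gets wrong, as
-- (input, expected output, actual output).
failures :: (String -> [String]) -> [(String, [String], [String])]
failures segmenter =
  [ (input, expected, actual)
  | (input, expected) <- cases
  , let actual = segmenter input
  , actual /= expected
  ]
```

In GHCi, evaluating failures splitter would then list each input on which splitter disagrees with the table; an empty list means all cases pass.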
Also, it’s not clear why you’ve added an underscore to replace_. And your code is missing at least one import, namely the one for =~. I assume that you just forgot to include that import line in your question and that it is in your actual code.
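As a sketch of what the documented helper could look like without the underscore: the name replace and the parameter names from and to are my choice, and the Haddock example doubles as a small test if you run doctest. The commented import is an assumption, since the question does not say which regex package provides =~.

```haskell
-- Assumption: the question's (=~) most likely comes from regex-posix,
-- in which case the missing line would be:
--   import Text.Regex.Posix ((=~))

-- | Replace every occurrence of one element with another.
--
-- >>> replace '”' '"' "like this.”"
-- "like this.\""
replace :: Eq a => a -> a -> [a] -> [a]
replace from to = map (\x -> if x == from then to else x)
```

If another module in scope already exports a function called replace, a qualified import avoids the clash without resorting to a trailing underscore.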
That being said, the fullstop library is, according to its own documentation, a placeholder library:
Note that this package is mostly a placeholder. I hope the Haskell/NLP
communities will run with it and upload a more sophisticated (family
of) segmenter(s) in its place. Patches (and new maintainers) would be
greeted with delight!
Your trouble with sentence endings also stems from segment, since it hard-codes the allowed punctuation:
    -- https://hackage.haskell.org/package/fullstop-0.1.4/docs/src/NLP-FullStop.html#stopPunctuation
    stopPunctuation :: [Char]
    stopPunctuation = [ '.', '?', '!' ] -- <<<<
Unfortunately, you cannot simply extend stopPunctuation, since content in parentheses (like this) does not lead to a new sentence. Note, though, that ." isn’t valid in some languages, which require ". instead, so it’s not clear what you are trying to achieve there (see the comment about documentation above).
So all in all, the code is well written, but without additional explanation or documentation there is no way to check whether the function actually does what you want. I also suggest that you add some tests.
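Coming back to the quote and bracket endings: if the goal is to treat a stop followed by closing quotes or brackets as a sentence end, one possible approach is to strip those trailing characters before checking for the stop. This is only a sketch using base, not how fullstop works. The names closers, stops, endsSentence and sentences are hypothetical, the set of closing characters is an assumption, and it makes no attempt at abbreviations such as "e.g.".

```haskell
import Data.List (dropWhileEnd)

-- | Characters that may follow the actual stop. This set is an
-- assumption for illustration; adjust it to your content.
closers :: [Char]
closers = "\"'”’)]"

-- | The same characters that fullstop lists in stopPunctuation.
stops :: [Char]
stops = ".?!"

-- | True if the word ends in a stop, optionally followed by closers.
-- A closer without a preceding stop, as in "this)", does not count.
endsSentence :: String -> Bool
endsSentence word =
  case dropWhileEnd (`elem` closers) word of
    []   -> False
    core -> last core `elem` stops

-- | Group whitespace-separated words into sentences, ending a sentence
-- after each word for which 'endsSentence' holds.
sentences :: String -> [String]
sentences = map unwords . go . words
  where
    go [] = []
    go ws = case break endsSentence ws of
      (pre, [])       -> [pre]
      (pre, w : rest) -> (pre ++ [w]) : go rest
```

Because the check works on Char values directly, the ” character needs no replacement step, and a word ending in ". is accepted as well, since its last character is already a stop.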