# Naive Bayes classifier


## Problem

I just started learning Haskell, and I must say it is unlike anything else. As my first not-entirely-trivial piece of code, I tried to write a simple Bayes classifier. It is in two parts: a classifier module and a CLI.

What could be done to improve it? Is it entirely awful? Is it too imperative? Any and all feedback appreciated. Here are the two files:

The Main Classifier:

```
module BayesClassifier where
-- Text Classifier Using Bayes Formula
import Data.List
import Data.Char

type Category = String

newtype Classifier = Classifier { training :: [(Category, [String])] } deriving (Eq, Show)

-- Get a new classifier with no training
classifier :: Classifier
classifier = Classifier []

-- Classifier probabilities
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfCategory :: Classifier -> Category -> Double

-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory (Classifier training) word category = let allInCategory = filter (\(cat, _) -> cat == category) training
                                                                     allInCategoryContainingWord = filter (\(_, text) -> word `elem` text) allInCategory
                                                                 in (fromIntegral $ length allInCategoryContainingWord + 1) / (fromIntegral $ length allInCategory + 1)

probabilityOfCategory (Classifier training) category = let allInCategory = filter (\(cat, _) -> cat == category) training
                                                       in (fromIntegral $ length allInCategory) / (fromIntegral $ length training)

-- Train a classifier
train :: Classifier -> String -> Category -> Classifier
train (Classifier training) text category = Classifier $ (category, cleanInput $ text) : training

-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify classifier text = fst $ head $ sortBy (\(_, a) (_, b) -> b `compare` a) $ probabilities classifier text

-- Get Probability for each Category
probabilities :: Classifier -> String -> [(Category, Double)]
probabilities classifier@(Classifier training) text = map (\cat -> (cat, probabilityForCategory classifier text cat)) $ nub $ map (\(cat, _) -> cat) training

-- Get Probability for a passage in a certain category
probabilityForCategory :: Classifier -> String -> Category -> Double
probabilityForCategory classifier text category = (+) (log $ probabilityOfCategory classifier category) (sum $ map (\word -> log $ probabilityOfWordInCategory classifier word category) $ cleanInput text)

-- Lowercase, Remove Punctuation
cleanInput :: String -> [String]
cleanInput text = filter (\w -> not (w `elem` stopWords)) $ words $ filter (`elem` ' ':['a'..'z']) $ map toLower text
    where stopWords = ["a", "about", {- ...more stop words... -} "yourself", "yourselves"]
```

and the Interface:

```
import System.IO
import BayesClassifier

interactionLoop myClassifier function = case function of
    "start" ->
        do
            putStrLn "Enter an action [train|classify]"
            action <- getLine
            interactionLoop myClassifier action
    "train" ->
        do
            putStr "Category: "
            category <- getLine
            putStr "Material: "
            material <- getLine
            interactionLoop (train myClassifier material category) "start"
    "classify" ->
        do
            putStr "Material: "
            material <- getLine
            putStrLn $ classify myClassifier material
            putStrLn . show $ probabilities myClassifier material
            putStrLn "\n\n\n\n"
            interactionLoop myClassifier "start"
    _ ->
        interactionLoop myClassifier "start"

main = do
    hSetBuffering stdout NoBuffering
    hSetBuffering stdin NoBuffering
    interactionLoop classifier "start"
```

## Solution

Rather nice attempt 🙂

A piece of advice first: do not use `Category` as a type name. It may confuse Haskell people, who might mistake it for the `Category` class from `Control.Category`, a completely different thing.
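For instance, a rename along these lines sidesteps the clash (`Label` is my name for it, not from the original code):

```haskell
-- A distinct name avoids any confusion with Control.Category's Category class.
type Label = String

newtype Classifier = Classifier { training :: [(Label, [String])] }
  deriving (Eq, Show)
```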

```
import Data.List
import Data.Char
import qualified Control.Monad as Ctl
import qualified Data.Function as F
import qualified Data.Set as Set
import Test.HUnit
```

I added the unit-test import here and the Set, plus qualified imports of Control.Monad and Data.Function, which are used below.

In Haskell it is generally good advice to look for opportunities to factor general definitions out of a more complicated expression. The idea is to build up a language that describes your problem, and then solve the problem by describing it in that language. So lots of tiny general-purpose functions are a very good thing.

```
onlyLetters = filter (\x -> isLetter x || isSpace x)
```

I prefer the version below, because `liftM2` follows naturally and is quite readable.

```
onlyLetters = filter (Ctl.liftM2 (||) isLetter isSpace)
```

`stopWords` is clearly a constant, so take its construction out. Note also that a Set is used: it makes the membership test cheaper.

```
stopWords = Set.fromAscList $ sort ["a", "about", "you", "yourself", "yourselves"]
```

Where possible, use function composition. In my experience it makes the essence of a function easier to grasp. Also, given a choice between composing with `(.)` and using parentheses or `$`, go for `(.)`; that will make refactoring easier later. Prefer currying to explicit lambdas.

```
cleanInput :: String -> [String]
cleanInput = filter (flip Set.notMember stopWords) . words . clean
  where clean = onlyLetters . map toLower
```

Run `tt` to execute the tests.

```
tests = TestList [TestLabel "clean" testClean]
tt = runTestTT tests

tInput = "I and you and elephant"
testClean = TestCase $ assertEqual
    "cleanInput"
    ["i", "and", "and", "elephant"] (cleanInput tInput)
```

Try to restrict your line width. First, it makes your code easier to read; second, it keeps you on the lookout for refactoring opportunities.

```
-- Get Probability for a passage in a certain category
probabilityForCategory :: Classifier -> String -> Category -> Double
probabilityForCategory classifier text cat = sum $ pfst cat : plst cat (cleanInput text)
  -- flip, because probabilityOfWordInCategory takes the word before the category
  where plst cat = map (log . flip (probabilityOfWordInCategory classifier) cat)
        pfst = log . probabilityOfCategory classifier
```

Note that the way you declared your data structure for `Classifier`, you get the accessor function `training` for free. So there is no point in extracting the list with an @-pattern as you did.

```
-- Get Probability for each Category
probabilities :: Classifier -> String -> [(Category, Double)]
probabilities classifier text = map (wrap text) $ nub $ map fst (training classifier)
  where wrap = wrapfn . probabilityForCategory classifier
```

Thus we take out the wrapfn because it is nice and generic.

```
wrapfn fn cat = (cat, fn cat)
```

We can do the same with `classify`:

```
-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify = ((fst . extMax snd) .) . probabilities
```

And our extract max is now nice and clean.

```
extMax :: Ord b => (x -> b) -> [x] -> x
extMax fn = maximumBy (compare `F.on` fn)
```

The `train` function is clean, but remove the extra `$`, and make use of your `training` accessor function, as in `classify`.

```
-- How to train a dragon
train :: Classifier -> String -> Category -> Classifier
train c text category = Classifier $ (category, cleanInput text) : training c
```

The changes here should be self evident now.

```
filterElem :: Eq a1 => (a -> [a1]) -> a1 -> [a] -> [a]
filterElem fn word = filter (elem word . fn)

filterEq :: Eq b => (a -> b) -> b -> [a] -> [a]
filterEq fn x = filter ((x ==) . fn)

allInCategory :: Eq a => a -> [(a, b)] -> [(a, b)]
allInCategory = filterEq fst

fl = fromIntegral . length

-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfWordInCategory (Classifier training) word category = succ a / succ b
  where a = fl $ filterElem snd word y
        b = fl y
        y = allInCategory category training

probabilityOfCategory :: Classifier -> Category -> Double
probabilityOfCategory (Classifier training) category = a / b
  where a = fl (allInCategory category training)
        b = fl training
```
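A quick numeric check of the Laplace-corrected estimate: with 2 of 3 category documents containing the word, `succ a / succ b` gives (2 + 1) / (3 + 1) = 0.75. A standalone sketch (the names here are mine, not the original's):

```haskell
fl :: [a] -> Double
fl = fromIntegral . length

-- Laplace-smoothed fraction of documents containing the word:
-- (matches + 1) / (documents + 1), so unseen words never get probability 0.
smoothed :: String -> [[String]] -> Double
smoothed w docs = succ (fl (filter (elem w) docs)) / succ (fl docs)
```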

Also, make use of operator precedence and avoid extra parentheses. They are ugly 🙂
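A small illustration of this (my own example): function application binds tighter than any infix operator, so the parentheses in the first version are redundant.

```haskell
-- Parenthesised version:
noisy :: Int -> Int
noisy n = (succ n) + (length [1, 2, 3])

-- Equivalent, relying on precedence of function application:
clean :: Int -> Int
clean n = succ n + length [1, 2, 3]
```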

The most obvious problem to me is formatting. Cut down on line width by moving the bodies of `let` and `case` expressions onto their own lines:

```
probabilityOfWordInCategory (Classifier training) word category =
    let allInCategory = filter (\(cat, _) -> cat == category) training
        allInCategoryContainingWord = filter (\(_, text) -> word `elem` text) allInCategory
    in (fromIntegral $ length allInCategoryContainingWord + 1) / (fromIntegral $ length allInCategory + 1)
```

Also, when defining helper functions for an equation, see if using `where` instead of `let` looks nicer:

```
probabilityOfWordInCategory (Classifier training) word category =
    (fromIntegral $ length allInCategoryContainingWord + 1) / (fromIntegral $ length allInCategory + 1)
  where
    allInCategory = filter (\(cat, _) -> cat == category) training
    allInCategoryContainingWord = filter (\(_, text) -> word `elem` text) allInCategory
```

Also, you can use point-free style to make `classify` more concise. Here’s the current form of `classify`:

```
classify a b = f1 $ f2 $ f3 $ f4 a b
```

We can eta-reduce by using the `(.)` operator:

```
classify a = f1 . f2 . f3 . f4 a
```

Now for the revised definition of `classify`:

```
classify classifier = fst . head . sortBy (\(_, a) (_, b) -> b `compare` a)
                    . probabilities classifier
```

If you don’t like writing out the transformations “backwards” like this, you can use the `>>>` operator from Control.Category:

```
classify classifier = probabilities classifier
                  >>> sortBy (\(_, a) (_, b) -> b `compare` a)
                  >>> head
                  >>> fst
```

Finally, the `sortBy` transformation can be made more concise using the handy `on` function from Data.Function:

```
sortBy (flip compare `on` snd)
```
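Putting those simplifications together, the selection step can be sketched generically (a standalone sketch; `pickBest` is my name for it):

```haskell
import Data.Function (on)
import Data.List (sortBy)

-- Highest-scoring key first, then take it: the sortBy-based classify's core.
pickBest :: Ord b => [(a, b)] -> a
pickBest = fst . head . sortBy (flip compare `on` snd)
```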