Calling a Haskell MD5 hashing function from C, and returning the result back to C land

Posted on

Problem

I’m starting to learn how to meld the worlds of C and Haskell. Looking for any feedback on this first function.

The function takes in a pointer to an array of unsigned chars and returns a pointer to 32 unsigned shorts. Or at least I would like it too :).

I am unsure about when the memory used to return the result is cleaned up. Is it possible that it will be garbage collected before I can use it on the C side?

Anyway here’s the code.

{-# LANGUAGE ForeignFunctionInterface #-}
module JSFLPlugin where
import Foreign.C.Types (CInt(..))   
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import Data.Serialize (encode)
import Foreign.ForeignPtr.Unsafe (unsafeForeignPtrToPtr)
import Data.Digest.Pure.MD5 (md5) 
import Data.ByteString (unpack)
import Data.ByteString.Internal (toForeignPtr, fromForeignPtr)
import Data.ByteString.Lazy.Internal (chunk)
import Data.ByteString.Lazy (empty)
import Data.Word (Word8, Word16)
import Data.Text.Format (hex)
import Control.Applicative ((<$>))
import Data.Text.Lazy.Builder (toLazyText)
import Data.Text.Lazy (toStrict)
import Data.Text.Foreign (asForeignPtr)
import Data.Monoid ((<>), mempty)

-- | Hash pointer to array of unsigned chars
hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
hash count addr =  do
    -- cast to ForeignPtr
    fptr <- newForeignPtr_ addr
    -- Make a strict ByteString
    let sbyte    = fromForeignPtr fptr 0 (fromIntegral count)
        -- Make a lazy ByteString from the strict one
        lbyte    = chunk sbyte empty
        -- Hash with md5 and encode as a strict ByteString
        digest   = encode . md5 $ lbyte
        -- Convert the each digest byte to a UTF-16 encoded hexdecimal character  
        hexBytes = toLazyText . foldr (x y -> y <> hex x) mempty . unpack $ digest 
    -- Convert to a strict Text and get the pointer to chars
    fmap (unsafeForeignPtrToPtr . fst) . asForeignPtr . toStrict $ hexBytes

foreign export ccall hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)

EDIT: New version per Joey’s suggestions (compiled but not tested).

{-# LANGUAGE ForeignFunctionInterface #-}
module JSFLPlugin where
import Foreign.C.Types (CInt(..))   
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import qualified Data.Serialize as Serialize
import Data.Digest.Pure.MD5 (MD5Digest) 
import Data.ByteString (ByteString)
import Data.ByteString.Internal (fromForeignPtr)
import Data.Word (Word8, Word16)
import Data.Text (Text)
import Data.Text.Foreign (unsafeCopyToPtr)
import qualified Data.ByteString.Base16 as Base16
import Data.Text.Encoding (decodeUtf8)
import Crypto.Classes (hash')

pureHash :: ByteString -> Text
pureHash = decodeUtf8 . Base16.encode . Serialize.encode . md5 where
    md5 s = hash' s :: MD5Digest

hash :: CInt -> Ptr Word8 -> Ptr Word16 -> IO ()
hash count input output =  do
    fptr <- newForeignPtr_ input
    let sbyte = fromForeignPtr fptr 0 $ fromIntegral count
    unsafeCopyToPtr (pureHash sbyte) output

foreign export ccall hash :: CInt -> Ptr Word8 -> Ptr Word16 -> IO ()

Solution

First, you have way too many comments. They clutter the code, and some of them are completely redundant.

-- | Hash pointer to array of unsigned chars
hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)

The comment provides no information beyond the function’s name and signature. But it doesn’t answer these questions:

  • What hashing algorithm is being used?

  • Why is the output Ptr Word16? UTF-16?

  • Is the output in hexadecimal, or raw bytes? If it’s hexadecimal, does it use capital or lowercase letters?

Before we dive into foreign API details, we can do a little refactoring. There’s a big chain of pure function application here. Let’s move it outside so it doesn’t clutter the scary foreign pointer stuff:

hashPure :: ByteString -> Text
hashPure sbyte =
    let lbyte    = chunk sbyte empty
        digest   = encode . md5 $ lbyte
        hexBytes = toLazyText . foldr (x y -> y <> hex x) mempty . unpack $ digest 
     in toStrict hexBytes

I think we can improve this a bit. Let’s first take a step back. Here’s all hashPure needs to do:

  • Hash the input string (a strict ByteString) using MD5

  • Convert it to a hex string (as Text)

To get there, we have to do a bunch of annoying conversions. But we can avoid these if we tweak a couple of our library choices.

Because of the Hash instance on MD5Digest, we can eliminate a strict to lazy ByteString conversion:

import Crypto.Classes (hash')
import Data.Digest.Pure.MD5 (MD5Digest)

hashPure :: ByteString -> Text
hashPure sbyte =
    let digest   = encode (hash' sbyte :: MD5Digest)
        hexBytes = toLazyText . foldr (x y -> y <> hex x) mempty . unpack $ digest 
     in toStrict hexBytes

Your code for converting to hex is bogus. To illustrate why:

>foldr (x y -> y <> hex x) mempty [1,2,3]
"321"

Namely:

  • Your use of foldr flips the order. What you meant was: foldr (x y -> hex x <> y)

  • The hex function does not pad the number to two characters.

To avoid the first problem, you could have just said mconcat (map hex). foldr/build fusion should optimize it.

To fix the second problem, and avoid two more conversions, you can use the base16-bytestring package. You could avoid yet another conversion if you make hash return an array of chars instead of wide chars, but I won’t do that yet.

hashPure :: ByteString -> Text
hashPure sbyte =
    let digest   = encode (hash' sbyte :: MD5Digest)
        hexBytes = Base16.encode digest
     in decodeUtf8 hexBytes

We can make it a little prettier using point-free style:

hashPure :: ByteString -> Text
hashPure =
    decodeUtf8 . Base16.encode . Serialize.encode . md5
  where
    md5 s = hash' s :: MD5Digest

Now on to the ForeignPtr stuff. I saved the best for last.

hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
hash count addr =  do
    fptr <- newForeignPtr_ addr
    let sbyte = fromForeignPtr fptr 0 (fromIntegral count)
    unsafeForeignPtrToPtr . fst <$> asForeignPtr (hashPure sbyte)

Your two points of interaction with C:

  • Convert a C array of bytes to a ByteString

  • Convert a Text to a block of wide chars, and return a pointer to it.

Your first conversion is sound, but I would use unsafePackCStringLen instead (which does the same thing):

hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
hash count addr =  do
    sbyte <- unsafePackCStringLen (castPtr addr, fromIntegral count)
    unsafeForeignPtrToPtr . fst <$> asForeignPtr (hashPure sbyte)

The second conversion is wrong. ForeignPtr is for managing C objects in Haskell, not the other way around. The returned pointer will be invalidated when the ForeignPtr returned by asForeignPtr is garbage collected.

You fix this, you would need to malloc a large enough buffer, then copy the text content into it. Unfortunately, Data.Text.Foreign does not provide a convenient way to do this without a needless allocation.

I would change the signature of the C function from this:

uint16_t *hash(int blksize, uint8_t *block);

To this:

void hash(uint8_t *block, size_t blksize, char out[33]);

Namely:

  • Write a string to a caller-supplied buffer of the correct size, rather than allocating a buffer that the caller has to free. A hex-encoded MD5 hash is always 32 characters long. The 33rd byte is for the zero terminator.

  • Output a C string instead of UTF-16. Unless you plan to throw the output into a system call or something that takes UTF-16, you shouldn’t be using UTF-16.

  • Flip pointer and size arguments. This is mainly a matter of taste.

See if you can implement this part yourself. This post is already too long.

Here is a compositional rewriting of hash:

hash :: CInt -> Ptr Word16 -> Ptr Word8 -> IO ()
hash inputLength output input 
    = (flip unsafeCopyToPtr output  . pureHash . copyInput <=< newForeignPtr_) input
    where copyInput fptr = fromForeignPtr fptr 0 $ fromIntegral inputLength

<=< is composition of kleisli arrows of a monad from Control.Monad. It has two advantages over >>= and =<<: it’s associative and data flows in the same direction as in ..

We get foreign pointer, copy data from it, calculate pure hash and copy result to output. This description is literally encoded in the pipeline, so I think the code is readable and maintainable.

Note that I flipped the arguments so if you want you can eta-reduce input.

Leave a Reply

Your email address will not be published. Required fields are marked *