Problem
I’m starting to learn how to meld the worlds of C and Haskell. Looking for any feedback on this first function.
The function takes in a pointer to an array of unsigned chars and returns a pointer to 32 unsigned shorts. Or at least I would like it too :).
I am unsure about when the memory used to return the result is cleaned up. Is it possible that it will be garbage collected before I can use it on the C side?
Anyway here’s the code.
{-# LANGUAGE ForeignFunctionInterface #-}
module JSFLPlugin where
import Foreign.C.Types (CInt(..))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import Data.Serialize (encode)
import Foreign.ForeignPtr.Unsafe (unsafeForeignPtrToPtr)
import Data.Digest.Pure.MD5 (md5)
import Data.ByteString (unpack)
import Data.ByteString.Internal (toForeignPtr, fromForeignPtr)
import Data.ByteString.Lazy.Internal (chunk)
import Data.ByteString.Lazy (empty)
import Data.Word (Word8, Word16)
import Data.Text.Format (hex)
import Control.Applicative ((<$>))
import Data.Text.Lazy.Builder (toLazyText)
import Data.Text.Lazy (toStrict)
import Data.Text.Foreign (asForeignPtr)
import Data.Monoid ((<>), mempty)
-- | Hash pointer to array of unsigned chars
hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
hash count addr = do
-- cast to ForeignPtr
fptr <- newForeignPtr_ addr
-- Make a strict ByteString
let sbyte = fromForeignPtr fptr 0 (fromIntegral count)
-- Make a lazy ByteString from the strict one
lbyte = chunk sbyte empty
-- Hash with md5 and encode as a strict ByteString
digest = encode . md5 $ lbyte
-- Convert the each digest byte to a UTF-16 encoded hexdecimal character
hexBytes = toLazyText . foldr (x y -> y <> hex x) mempty . unpack $ digest
-- Convert to a strict Text and get the pointer to chars
fmap (unsafeForeignPtrToPtr . fst) . asForeignPtr . toStrict $ hexBytes
foreign export ccall hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
EDIT: New version per Joey’s suggestions (compiled but not tested).
{-# LANGUAGE ForeignFunctionInterface #-}
module JSFLPlugin where
import Foreign.C.Types (CInt(..))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import qualified Data.Serialize as Serialize
import Data.Digest.Pure.MD5 (MD5Digest)
import Data.ByteString (ByteString)
import Data.ByteString.Internal (fromForeignPtr)
import Data.Word (Word8, Word16)
import Data.Text (Text)
import Data.Text.Foreign (unsafeCopyToPtr)
import qualified Data.ByteString.Base16 as Base16
import Data.Text.Encoding (decodeUtf8)
import Crypto.Classes (hash')
pureHash :: ByteString -> Text
pureHash = decodeUtf8 . Base16.encode . Serialize.encode . md5 where
md5 s = hash' s :: MD5Digest
hash :: CInt -> Ptr Word8 -> Ptr Word16 -> IO ()
hash count input output = do
fptr <- newForeignPtr_ input
let sbyte = fromForeignPtr fptr 0 $ fromIntegral count
unsafeCopyToPtr (pureHash sbyte) output
foreign export ccall hash :: CInt -> Ptr Word8 -> Ptr Word16 -> IO ()
Solution
First, you have way too many comments. They clutter the code, and some of them are completely redundant.
-- | Hash pointer to array of unsigned chars
hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
The comment provides no information beyond the function’s name and signature. But it doesn’t answer these questions:
-
What hashing algorithm is being used?
-
Why is the output
Ptr Word16
? UTF-16? -
Is the output in hexadecimal, or raw bytes? If it’s hexadecimal, does it use capital or lowercase letters?
Before we dive into foreign API details, we can do a little refactoring. There’s a big chain of pure function application here. Let’s move it outside so it doesn’t clutter the scary foreign pointer stuff:
hashPure :: ByteString -> Text
hashPure sbyte =
let lbyte = chunk sbyte empty
digest = encode . md5 $ lbyte
hexBytes = toLazyText . foldr (x y -> y <> hex x) mempty . unpack $ digest
in toStrict hexBytes
I think we can improve this a bit. Let’s first take a step back. Here’s all hashPure
needs to do:
-
Hash the input string (a strict ByteString) using MD5
-
Convert it to a hex string (as Text)
To get there, we have to do a bunch of annoying conversions. But we can avoid these if we tweak a couple of our library choices.
Because of the Hash instance on MD5Digest, we can eliminate a strict to lazy ByteString conversion:
import Crypto.Classes (hash')
import Data.Digest.Pure.MD5 (MD5Digest)
hashPure :: ByteString -> Text
hashPure sbyte =
let digest = encode (hash' sbyte :: MD5Digest)
hexBytes = toLazyText . foldr (x y -> y <> hex x) mempty . unpack $ digest
in toStrict hexBytes
Your code for converting to hex is bogus. To illustrate why:
>foldr (x y -> y <> hex x) mempty [1,2,3]
"321"
Namely:
-
Your use of
foldr
flips the order. What you meant was:foldr (x y -> hex x <> y)
-
The hex function does not pad the number to two characters.
To avoid the first problem, you could have just said mconcat (map hex)
. foldr/build fusion should optimize it.
To fix the second problem, and avoid two more conversions, you can use the base16-bytestring package. You could avoid yet another conversion if you make hash
return an array of chars instead of wide chars, but I won’t do that yet.
hashPure :: ByteString -> Text
hashPure sbyte =
let digest = encode (hash' sbyte :: MD5Digest)
hexBytes = Base16.encode digest
in decodeUtf8 hexBytes
We can make it a little prettier using point-free style:
hashPure :: ByteString -> Text
hashPure =
decodeUtf8 . Base16.encode . Serialize.encode . md5
where
md5 s = hash' s :: MD5Digest
Now on to the ForeignPtr stuff. I saved the best for last.
hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
hash count addr = do
fptr <- newForeignPtr_ addr
let sbyte = fromForeignPtr fptr 0 (fromIntegral count)
unsafeForeignPtrToPtr . fst <$> asForeignPtr (hashPure sbyte)
Your two points of interaction with C:
-
Convert a C array of bytes to a
ByteString
-
Convert a
Text
to a block of wide chars, and return a pointer to it.
Your first conversion is sound, but I would use unsafePackCStringLen instead (which does the same thing):
hash :: CInt -> Ptr Word8 -> IO (Ptr Word16)
hash count addr = do
sbyte <- unsafePackCStringLen (castPtr addr, fromIntegral count)
unsafeForeignPtrToPtr . fst <$> asForeignPtr (hashPure sbyte)
The second conversion is wrong. ForeignPtr is for managing C objects in Haskell, not the other way around. The returned pointer will be invalidated when the ForeignPtr returned by asForeignPtr is garbage collected.
You fix this, you would need to malloc a large enough buffer, then copy the text content into it. Unfortunately, Data.Text.Foreign does not provide a convenient way to do this without a needless allocation.
I would change the signature of the C function from this:
uint16_t *hash(int blksize, uint8_t *block);
To this:
void hash(uint8_t *block, size_t blksize, char out[33]);
Namely:
-
Write a string to a caller-supplied buffer of the correct size, rather than allocating a buffer that the caller has to free. A hex-encoded MD5 hash is always 32 characters long. The 33rd byte is for the zero terminator.
-
Output a C string instead of UTF-16. Unless you plan to throw the output into a system call or something that takes UTF-16, you shouldn’t be using UTF-16.
-
Flip pointer and size arguments. This is mainly a matter of taste.
See if you can implement this part yourself. This post is already too long.
Here is a compositional rewriting of hash
:
hash :: CInt -> Ptr Word16 -> Ptr Word8 -> IO ()
hash inputLength output input
= (flip unsafeCopyToPtr output . pureHash . copyInput <=< newForeignPtr_) input
where copyInput fptr = fromForeignPtr fptr 0 $ fromIntegral inputLength
<=<
is composition of kleisli arrows of a monad from Control.Monad
. It has two advantages over >>=
and =<<
: it’s associative and data flows in the same direction as in .
.
We get foreign pointer, copy data from it, calculate pure hash and copy result to output. This description is literally encoded in the pipeline, so I think the code is readable and maintainable.
Note that I flipped the arguments so if you want you can eta-reduce input
.