Archive for May 2007

More adventures in parsing

I received an interesting comment from Conal Elliott on my previous post on parsing. I have to admit I wasn’t sure I understood him at first, I’m still not sure I do, but I think I have an idea of what he means :-)

Basically my code is very sequential in that I use the do construct everywhere in the parsing code. Personally I thought that makes the parser very easy to read since the code very much mimics the structure of the maps file. I do realise the code isn’t very “functional” though so I thought I’d take Conal’s comments to heart and see what the result would be.

Let’s start with observation that every entity in a line is separated by a space. However some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:

thenChar c f = f >>= (\ r -> char c >> return r)

Since space is used as a separator so often I added a short-cut for that:

thenSpace  = thenChar ' '

Then I put that to use on parseAddress:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- thenChar '-' $ many1 hexDigit
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Modifying the other parsing functions using thenChar and thenSpace is straight forward.

I’m not entirely sure I understand what Conal meant with the part about liftM in his comment. I suspect his referring to the fact that I first read characters and then convert them in the “constructors”. By using liftM I can move the conversion “up in the code”. Here’s parseAddress after I’ve moved the calls to hexStr2Int:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

After modifying the other parsing functions in a similar way I ended up with this:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- liftM (== 'r') anyChar
        w <- liftM (== 'w') anyChar
        x <- liftM (== 'x') anyChar
        a <- liftM cA anyChar
        return $ Perms r w x a

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit
        min <- liftM hexStr2Int $ many1 hexDigit
        return $ Device maj min

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- thenSpace parseAddress
        perm <- thenSpace parsePerms
        offset <- liftM hexStr2Int $ thenSpace $ many1 hexDigit
        dev <- thenSpace parseDevice
        inode <- liftM Prelude.read $ thenSpace $ many1 digit
        path <- parsePath <|> string ""
        return $ MemRegion addr perm offset dev inode path

Is this code more “functional”? Is it easier to read? You’ll have to be the judge of that…

Conal, if I got the intention of your comment completely wrong then feel free to tell me I’m an idiot ;-)

Adventures in parsing

I’ve long wanted to dip my toes in the Parsec water. I’ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed attempts. Until now that is :-)

I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First a look at the man-page offers a good description of the format of a line:

address           perms offset  dev   inode      pathname
08048000-08056000 r-xp 00000000 03:0c 64593      /usr/sbin/gpm

So, I started putting together some datatypes. First off the address range:

data Address = Address { start :: Integer, end :: Integer }
    deriving Show

Then I decided that the ’s’/'p’ in the permissions should be called Access:

data Access = Shared | Private
    deriving Show

The basic permissions (rwx) are simply represented as booleans:

data Perms = Perms {
        read :: Bool,
        write :: Bool,
        executable :: Bool,
        access :: Access
    }
    deriving Show

The device is straightforward as well:

data Device = Device { major :: Integer, minor :: Integer }
    deriving Show

At last I tie it all together in a final datatype that represents a memory region:

data MemRegion = MemRegion {
        address :: Address,
        perms :: Perms,
        offset :: Integer,
        device :: Device,
        inode :: Integer,
        pathname :: String
    }
    deriving Show

All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.

Now, on to the actual “parsec-ing”. Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the maps file is so simple it’s easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- many1 hexDigit
        char '-'
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Since the addresses themselves are in hexadecimal and always are of at least length 1 I use many1 hexDigit to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use count 8 hexDigit but I haven’t tried it. I’ve found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I’ve found is the slightly less readable fst . (!! 0) . readHex. According to the man-page the addresses are separated by a single dash so I’ve hardcoded that in there.

Testing the function is fairly simple. Using gchi, first load the source file then use parse:

*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})

Seems to work well enough. :-)

Next up, parsing the permissions. This is so very straightforward that I don’t think I need to comment on it:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- anyChar
        w <- anyChar
        x <- anyChar
        a <- anyChar
        return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)

For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- many1 digit
        char ':'
        min <- many1 digit
        return $ Device (hexStr2Int maj) (hexStr2Int min)

Next is to tie it all together and create a MemRegion instance:

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- parseAddress
        char ' '
        perm <- parsePerms
        char ' '
        offset <- many1 hexDigit
        char ' '
        dev <- parseDevice
        char ' '
        inode <- many1 digit
        char ' '
        path <- parsePath <|> string ""
        return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path

The only little trick here is that there are lines that lack the pathname. Here’s an example from the man-page:

address           perms offset  dev   inode      pathname
08058000-0805b000 rwxp 00000000 00:00 0

It should be noted that it seems there is a space after the inode entry so I keep a char ' ' in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I’m lazy and just consume one or more. I’m not sure exactly what characters are allowed in the pathname itself so I’m lazy once more and just gobble up whatever I find.

To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.

getMemRegions pid = let
        fp = "/proc" </> show pid </> "maps"
        doParseLine' = parse parseRegion "parseRegion"
        doParseLine l = case (doParseLine' l) of
            Left _ -> error "Failed to parse line"
            Right x -> x
    in do
        mapContent <- liftM lines $ readFile fp
        return $ map doParseLine mapContent

The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:

*Main> getMemRegions 1

This produces a lot of output so while playing with it I limited the mapping to the four first lines by using take. The last line then becomes:

return $ map doParseLine (take 4 mapContent)

Now it’s easy to add a main that uses the first command line argument as the pid:

main = do
    pid <- liftM (Prelude.read . (!! 0)) getArgs
    regs <- getMemRegions pid
    mapM_ (putStrLn . show) regs

Well, that concludes my first adventure in parsing :-)

[Edit 27-05-2007 13:15]

I received an email asking for it so here are the import statements I ended up with:

import Control.Monad
import System
import System.FilePath
import Text.ParserCombinators.Parsec

I’m on a planet…

Yes, I’m now featured on a planet. This is a first for me so it took me a few days to work up the courage to post something to my own blog. I mean the pressure really is on now, right? Especially since I’m on Planet Haskell;-)

Oh, BTW, I’m all set to go to LugRadio Live in July. Drop me a line if you are going and feel like meeting in person for a chat, a beer, signing GPG keys, or all of the above.

M$ grasping at straws…

Types are funky…

Signals in Haskell

When playing with Haskell I sometimes get the feeling that I’m the last one in the world to figure out how some things work. It’s almost embarrassing to write about it at times, but I fear I’ll forget and end up redoing my trivial experiments if I don’ti. So, here goes…

Dealing with signals in Haskell is fairly straight forward. Well, at least as far as dealing with signals ever is straight forward. In System.Posix.Signals you find installHandler. The arguments are a little funky but they get a little clearer if you look up the man-page for sigaction.

Here’s a simple example that installs a handler for SIGINT:

module Main where

import System.Posix.Signals
import Control.Concurrent

handler :: IO ()
handler = putStrLn "In handler"

doNothing :: IO ()
doNothing = do
    threadDelay 1000000
    putStrLn "Repeating"

main :: IO ()
main = do
    installHandler sigINT (Catch handler) Nothing
    sequence_ $ repeat doNothing

Now that’s fairly straight forward, I think. However, it’d be nice to be able to let the main thread know that we’ve recieved a signal in some way. From perusing the Haskell Wiki it seems the way to communicate between threads, at least for simple cases, is by using MVars. So, extending the example above to also keep track of how many times SIGINT has been received could look like this:

module Main where

import System.Posix.Signals
import Control.Concurrent
import Control.Concurrent.MVar

handler :: MVar Int -> IO ()
handler mi = do
    i <- takeMVar mi
    putStrLn "In handler"
    putMVar mi (i + 1)

doNothing :: MVar Int -> IO ()
doNothing mi = do
    threadDelay 1000000
    i <- readMVar mi
    putStrLn $ "Repeating " ++ (show i)

main :: IO ()
main = do
    mi <- newMVar 0
    installHandler sigINT (Catch $ handler mi) Nothing
    sequence_ $ repeat $ doNothing mi

It should be noted that, and here it all hinges on me having understood things correctly, handler has more or less become a sort of critical section. Taking the value out of the MVar makes sure that the readMVar in doNothing blocks until a new value is put back. Probably not a very useful thing in this particular example, but it’s still worth keeping in mind for the future.

  1. Or maybe I shouldn’t compare myself to the highly (dare I say it?) academical entries usually found on the Haskell Planet. Maybe I should just consider my writing to be “Haskell for mortals”.[back]

Unescaping URLs in Python and Haskell

A while ago I decided that in order to learn Haskell I should make an attempt to stop reaching for Python whenever I needed to solve a “small problem” and use Haskell instead. This morning I found a need to unescape URLs inside Vim and I didn’t catch myself until after I had written the following Python code:

#! /usr/bin/env python

import urllib
import sys

for l in sys.stdin:
    print urllib.unquote(l),

After a little bit of cursing I started browsing Hoogle and after bit of searching I found the Network.URI module. My Haskell solution ended up looking like this:

module Main where

import Network.URI

main :: IO ()
main = interact $ unlines . (map unEscapeString) . lines

Short, sweet, and easy to understand, I think.

In case you want to go the other direction, i.e. to escape strings in Haskell then combining escapeURIString and isUnreserved is the right thing to do.