Posts tagged ‘parsec’

TagSoup, meet Parsec!

Recently I began writing a tool to scrape some information off a web site for some off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can’t you use Parsec for that?” It took me a second to realise that he didn’t mean that I should write my own XML parser but rather that Parsec allows writing parsers of a list of anything. So I thought I’d see just what it’d take to create a parser for [Tag].

A look at the string parser shipped with Parsec offered a lot of inspiration.

First the basic type, TagParser:

type TagParser = GenParser Tag

The basic function of Parsec is tokenPrim, basically that’s what other basic parsers use. Taking a cue from the string parser implementation I defined a function called satisfy:

satisfy f = tokenPrim
        show
        (\ pos t _ -> updatePosTag pos t)
        (\ t -> if (f t) then Just t else Nothing)

The positioning in a list of tags simply an increase of column, irrespective of what tag is processed:

updatePosTag s _ = incSourceColumn s 1

Now I have enough to create the first Tag parser—one that accepts a single instance of the specified kind:

tag t = satisfy (~== t) <?> show t

It’s important to stick the supplied tag on the right of (~==). See its documentation for why that is. The second parser is one that accepts any kind of tag:

anyTag = satisfy (const True)

So far so good. The next parser to implement is one that accepts any kind of tag out of a list of tags. Here I want to make use of the convenient behaviour of (~==) so I’ll need to implement a custom version of elem:

l `elemTag` r = or $ l `elemT` r
    where
        l `elemT` [] = [False]
        l `elemT` (r:rs) = (l ~== r) : l `elemT` rs

With that in place it’s easy to implement oneOf and noneOf:

oneOf ts = satisfy (`elemTag` ts)
noneOf ts = satisfy (\ t -> not (t `elemTag` ts))

So, as an example of what this can be used for here is a re-implementation of TagSoup’s partitions:

partitions t = liftM2 (:)
        (many $ noneOf [t])
        (many $ liftM2 (:) (tag t) (many $ noneOf [t]))

Of course the big question is whether I’ll rewrite my original code using Parsec. Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.

Adventures in parsing, part 4

I received a few comments on part 3 of this little mini-series and I just wanted to address them. While doing this I still want the main functions of the parser parseXxx to read like the maps file itself. That means I want to avoid “reversing order” like thenChar and thenSpace did in part2. I also don’t want to hide things, e.g. I don’t want to introduce a function that turns (a <* char ' ') <*> b into a <#> b.

So, first up is to do something about hexStr2Int <$> many1 hexDigit which appears all over the place. I made it appear in even more places by moving around a few parentheses; the following two functions are the same:

foo = a <$> (b <* c)
bar = (a <$> b) <* c

Then I scrapped hexStr2Int completely and instead introduced hexStr:

hexStr = Prelude.read . ("0x" ++) <$> many1 hexDigit

This means that parseAddress can be rewritten to:

parseAddress = Address <$>
    hexStr <* char '-' <*>
    hexStr

Rather than, as Conal suggested, introduce an infix operation that addresses the pattern (a <* char ' ') <*> b I decided to do something about a <* char c. I feel Conal’s suggestion, while shortening the code more than my solution, goes against my wish to not hide things. This is the definition of <##>:

(<##>) l r = l <* char r

After this I rewrote parseAddress into:

parseAddress = Address <$>
    hexStr <##> '-' <*>
    hexStr

The pattern (== c) <$> anyChar appears three times in parsePerms so it got a name and moved down into the where clause. I also modified cA to use pattern matching. I haven’t spent much time considering error handling in the parser, so I didn’t introduce a pattern matching everything else.

parsePerms = Perms <$>
    pP 'r' <*>
    pP 'w' <*>
    pP 'x' <*>
    (cA <$> anyChar)

    where
        pP c = (== c) <$> anyChar
        cA 'p' = Private
        cA 's' = Shared

The last change I did was remove a bunch of parentheses. I’m always a little hesitant removing parentheses and relying on precedence rules, I find I’m even more hesitant doing it when programming Haskell. Probably due to Haskell having a lot of infix operators that I’m unused to.

The rest of the parser now looks like this:

parseDevice = Device <$>
    hexStr <##> ':' <*>
    hexStr

parseRegion = MemRegion <$>
    parseAddress <##> ' ' <*>
    parsePerms <##> ' ' <*>
    hexStr <##> ' ' <*>
    parseDevice <##> ' ' <*>
    (Prelude.read <$> many1 digit) <##> ' ' <*>
    (parsePath <|> string "")

    where
        parsePath = (many1 $ char ' ') *> (many1 anyChar)

I think these changes address most of the comments Conal and Twan made on the previous part. Where they don’t I hope I’ve explained why I decided not to take their advice.

Adventures in parsing, part 3

I got a great many comments, at least by my standards, on my earlier two posts on parsing in Haskell. Especially on the latest one. Conal posted a comment on the first pointing me towards liftM and its siblings, without telling me that it would only be the first step towards “applicative style”. So, here I go again…

First off, importing Control.Applicative. Apparently <|> is defined in both Applicative and in Parsec. I do use <|> from Parsec so preventing importing it from Applicative seemed like a good idea:

import Control.Applicative hiding ( (<|>) )

Second, Cale pointed out that I need to make an instance for Control.Applicative.Applicative for GenParser. He was nice enough to point out how to do that, leaving syntax the only thing I had to struggle with:

instance Applicative (GenParser c st) where
    pure = return
    (<*>) = ap

I decided to take baby-steps and I started with parseAddress. Here’s what it used to look like:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

On Twan’s suggestion I rewrote it using where rather than let ... in and since this was my first function I decided to go via the ap function (at the same time I broke out hexStr2Int since it’s used in so many places):

parseAddress = do
    start <- return hexStr2Int `ap` (thenChar '-' $ many1 hexDigit)
    end <- return hexStr2Int `ap` (many1 hexDigit)
    return $ Address start end

Then on to applying some functions from Applicative:

parseAddress = Address start end
    where
        start = hexStr2Int <$> (thenChar '-' $ many1 hexDigit)
        end = hexStr2Int <$> (many1 hexDigit)

By now the use of thenChar looks a little silly so I changed that part into many1 hexDigit <* char '-' instead. Finally I removed the where part altogether and use <*> to string it all together:

parseAddress = Address <$>
    (hexStr2Int <$> many1 hexDigit <* char '-') <*>
    (hexStr2Int <$> (many1 hexDigit))

From here on I skipped the intermediate steps and went straight for the last form. Here’s what I ended up with:

parsePerms = Perms <$>
    ( (== 'r') <$> anyChar) <*>
    ( (== 'w') <$> anyChar) <*>
    ( (== 'x') <$> anyChar) <*>
    (cA <$> anyChar)

    where
        cA a = case a of
            'p' -> Private
            's' -> Shared

parseDevice = Device <$>
    (hexStr2Int <$> many1 hexDigit <* char ':') <*>
    (hexStr2Int <$> (many1 hexDigit))

parseRegion = MemRegion <$>
    (parseAddress <* char ' ') <*>
    (parsePerms <* char ' ') <*>
    (hexStr2Int <$> (many1 hexDigit <* char ' ')) <*>
    (parseDevice <* char ' ') <*>
    (Prelude.read <$> (many1 digit <* char ' ')) <*>
    (parsePath <|> string "")

    where
        parsePath = (many1 $ char ' ') *> (many1 anyChar)

I have to say I’m fairly pleased with this version of the parser. It reads about as easy as the first version and there’s none of the “reversing” that thenChar introduced.

More adventures in parsing

I received an interesting comment from Conal Elliott on my previous post on parsing. I have to admit I wasn’t sure I understood him at first, I’m still not sure I do, but I think I have an idea of what he means :-)

Basically my code is very sequential in that I use the do construct everywhere in the parsing code. Personally I thought that makes the parser very easy to read since the code very much mimics the structure of the maps file. I do realise the code isn’t very “functional” though so I thought I’d take Conal’s comments to heart and see what the result would be.

Let’s start with observation that every entity in a line is separated by a space. However some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:

thenChar c f = f >>= (\ r -> char c >> return r)

Since space is used as a separator so often I added a short-cut for that:

thenSpace  = thenChar ' '

Then I put that to use on parseAddress:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- thenChar '-' $ many1 hexDigit
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Modifying the other parsing functions using thenChar and thenSpace is straight forward.

I’m not entirely sure I understand what Conal meant with the part about liftM in his comment. I suspect his referring to the fact that I first read characters and then convert them in the “constructors”. By using liftM I can move the conversion “up in the code”. Here’s parseAddress after I’ve moved the calls to hexStr2Int:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

After modifying the other parsing functions in a similar way I ended up with this:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- liftM (== 'r') anyChar
        w <- liftM (== 'w') anyChar
        x <- liftM (== 'x') anyChar
        a <- liftM cA anyChar
        return $ Perms r w x a

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit
        min <- liftM hexStr2Int $ many1 hexDigit
        return $ Device maj min

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- thenSpace parseAddress
        perm <- thenSpace parsePerms
        offset <- liftM hexStr2Int $ thenSpace $ many1 hexDigit
        dev <- thenSpace parseDevice
        inode <- liftM Prelude.read $ thenSpace $ many1 digit
        path <- parsePath <|> string ""
        return $ MemRegion addr perm offset dev inode path

Is this code more “functional”? Is it easier to read? You’ll have to be the judge of that…

Conal, if I got the intention of your comment completely wrong then feel free to tell me I’m an idiot ;-)

Adventures in parsing

I’ve long wanted to dip my toes in the Parsec water. I’ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed attempts. Until now that is :-)

I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First a look at the man-page offers a good description of the format of a line:

address           perms offset  dev   inode      pathname
08048000-08056000 r-xp 00000000 03:0c 64593      /usr/sbin/gpm

So, I started putting together some datatypes. First off the address range:

data Address = Address { start :: Integer, end :: Integer }
    deriving Show

Then I decided that the ‘s’/'p’ in the permissions should be called Access:

data Access = Shared | Private
    deriving Show

The basic permissions (rwx) are simply represented as booleans:

data Perms = Perms {
        read :: Bool,
        write :: Bool,
        executable :: Bool,
        access :: Access
    }
    deriving Show

The device is straightforward as well:

data Device = Device { major :: Integer, minor :: Integer }
    deriving Show

At last I tie it all together in a final datatype that represents a memory region:

data MemRegion = MemRegion {
        address :: Address,
        perms :: Perms,
        offset :: Integer,
        device :: Device,
        inode :: Integer,
        pathname :: String
    }
    deriving Show

All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.

Now, on to the actual “parsec-ing”. Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the maps file is so simple it’s easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- many1 hexDigit
        char '-'
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Since the addresses themselves are in hexadecimal and always are of at least length 1 I use many1 hexDigit to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use count 8 hexDigit but I haven’t tried it. I’ve found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I’ve found is the slightly less readable fst . (!! 0) . readHex. According to the man-page the addresses are separated by a single dash so I’ve hardcoded that in there.

Testing the function is fairly simple. Using gchi, first load the source file then use parse:

*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})

Seems to work well enough. :-)

Next up, parsing the permissions. This is so very straightforward that I don’t think I need to comment on it:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- anyChar
        w <- anyChar
        x <- anyChar
        a <- anyChar
        return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)

For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- many1 digit
        char ':'
        min <- many1 digit
        return $ Device (hexStr2Int maj) (hexStr2Int min)

Next is to tie it all together and create a MemRegion instance:

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- parseAddress
        char ' '
        perm <- parsePerms
        char ' '
        offset <- many1 hexDigit
        char ' '
        dev <- parseDevice
        char ' '
        inode <- many1 digit
        char ' '
        path <- parsePath <|> string ""
        return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path

The only little trick here is that there are lines that lack the pathname. Here’s an example from the man-page:

address           perms offset  dev   inode      pathname
08058000-0805b000 rwxp 00000000 00:00 0

It should be noted that it seems there is a space after the inode entry so I keep a char ' ' in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I’m lazy and just consume one or more. I’m not sure exactly what characters are allowed in the pathname itself so I’m lazy once more and just gobble up whatever I find.

To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.

getMemRegions pid = let
        fp = "/proc" </> show pid </> "maps"
        doParseLine' = parse parseRegion "parseRegion"
        doParseLine l = case (doParseLine' l) of
            Left _ -> error "Failed to parse line"
            Right x -> x
    in do
        mapContent <- liftM lines $ readFile fp
        return $ map doParseLine mapContent

The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:

*Main> getMemRegions 1

This produces a lot of output so while playing with it I limited the mapping to the four first lines by using take. The last line then becomes:

return $ map doParseLine (take 4 mapContent)

Now it’s easy to add a main that uses the first command line argument as the pid:

main = do
    pid <- liftM (Prelude.read . (!! 0)) getArgs
    regs <- getMemRegions pid
    mapM_ (putStrLn . show) regs

Well, that concludes my first adventure in parsing :-)

[Edit 27-05-2007 13:15]

I received an email asking for it so here are the import statements I ended up with:

import Control.Monad
import System
import System.FilePath
import Text.ParserCombinators.Parsec