Recently I began writing a tool to scrape some information off a web site for some off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can’t you use Parsec for that?” It took me a second to realise that he didn’t mean that I should write my own XML parser but rather that Parsec allows writing parsers of a list of anything. So I thought I’d see just what it’d take to create a parser for [Tag].
A look at the string parser shipped with Parsec offered a lot of inspiration.
First the basic type, TagParser:
type TagParser = GenParser Tag
The basic function of Parsec is tokenPrim, basically that’s what other basic parsers use. Taking a cue from the string parser implementation I defined a function called satisfy:
satisfy f = tokenPrim
show(\ pos t _-> updatePosTag pos t)(\ t ->if(f t)then Just t else Nothing)
The positioning in a list of tags simply an increase of column, irrespective of what tag is processed:
updatePosTag s _= incSourceColumn s 1
Now I have enough to create the first Tag parser—one that accepts a single instance of the specified kind:
tag t = satisfy (~== t)<?>show t
It’s important to stick the supplied tag on the right of (~==). See its documentation for why that is. The second parser is one that accepts any kind of tag:
anyTag = satisfy (const True)
So far so good. The next parser to implement is one that accepts any kind of tag out of a list of tags. Here I want to make use of the convenient behaviour of (~==) so I’ll need to implement a custom version of elem:
l `elemTag` r =or$ l `elemT` r
where
l `elemT` []=[False]
l `elemT` (r:rs)=(l ~== r) : l `elemT` rs
With that in place it’s easy to implement oneOf and noneOf:
Of course the big question is whether I’ll rewrite my original code using Parsec. Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.
I received a few comments on part 3 of this little mini-series and I just wanted to address them. While doing this I still want the main functions of the parser parseXxx to read like the maps file itself. That means I want to avoid “reversing order” like thenChar and thenSpace did in part2. I also don’t want to hide things, e.g. I don’t want to introduce a function that turns (a <* char ' ') <*> b into a <#> b.
So, first up is to do something about hexStr2Int <$> many1 hexDigit which appears all over the place. I made it appear in even more places by moving around a few parentheses; the following two functions are the same:
foo = a <$> (b <* c)
bar = (a <$> b) <* c
Then I scrapped hexStr2Int completely and instead introduced hexStr:
Rather than, as Conal suggested, introduce an infix operation that addresses the pattern (a <* char ' ') <*> b I decided to do something about a <* char c. I feel Conal’s suggestion, while shortening the code more than my solution, goes against my wish to not hide things. This is the definition of <##>:
The pattern (== c) <$> anyChar appears three times in parsePerms so it got a name and moved down into the where clause. I also modified cA to use pattern matching. I haven’t spent much time considering error handling in the parser, so I didn’t introduce a pattern matching everything else.
parsePerms = Perms <$>
pP 'r' <*>
pP 'w' <*>
pP 'x' <*>
(cA <$> anyChar)
where
pP c = (== c) <$> anyChar
cA 'p' = Private
cA 's' = Shared
The last change I did was remove a bunch of parentheses. I’m always a little hesitant removing parentheses and relying on precedence rules, I find I’m even more hesitant doing it when programming Haskell. Probably due to Haskell having a lot of infix operators that I’m unused to.
I think these changes address most of the comments Conal and Twan made on the previous part. Where they don’t I hope I’ve explained why I decided not to take their advice.
I got a great many comments, at least by my standards, on my earlier twoposts on parsing in Haskell. Especially on the latest one. Conal posted a comment on the first pointing me towards liftM and its siblings, without telling me that it would only be the first step towards “applicative style”. So, here I go again…
First off, importing Control.Applicative. Apparently <|> is defined in both Applicative and in Parsec. I do use <|> from Parsec so preventing importing it from Applicative seemed like a good idea:
import Control.Applicative hiding ( (<|>) )
Second, Cale pointed out that I need to make an instance for Control.Applicative.Applicative for GenParser. He was nice enough to point out how to do that, leaving syntax the only thing I had to struggle with:
instance Applicative (GenParser c st) where
pure = return
(<*>) = ap
I decided to take baby-steps and I started with parseAddress. Here’s what it used to look like:
parseAddress = let
hexStr2Int = Prelude.read . ("0x" ++)
in do
start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
end <- liftM hexStr2Int $ many1 hexDigit
return $ Address start end
On Twan’s suggestion I rewrote it using where rather than let ... in and since this was my first function I decided to go via the ap function (at the same time I broke out hexStr2Int since it’s used in so many places):
parseAddress = do
start <- return hexStr2Int `ap` (thenChar '-' $ many1 hexDigit)
end <- return hexStr2Int `ap` (many1 hexDigit)
return $ Address start end
Then on to applying some functions from Applicative:
parseAddress = Address start end
where
start = hexStr2Int <$> (thenChar '-' $ many1 hexDigit)
end = hexStr2Int <$> (many1 hexDigit)
By now the use of thenChar looks a little silly so I changed that part into many1 hexDigit <* char '-' instead. Finally I removed the where part altogether and use <*> to string it all together:
I have to say I’m fairly pleased with this version of the parser. It reads about as easy as the first version and there’s none of the “reversing” that thenChar introduced.
I received an interesting comment from Conal Elliott on my previous post on parsing. I have to admit I wasn’t sure I understood him at first, I’m still not sure I do, but I think I have an idea of what he means
Basically my code is very sequential in that I use the do construct everywhere in the parsing code. Personally I thought that makes the parser very easy to read since the code very much mimics the structure of the maps file. I do realise the code isn’t very “functional” though so I thought I’d take Conal’s comments to heart and see what the result would be.
Let’s start with observation that every entity in a line is separated by a space. However some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:
thenChar c f = f >>= (\ r -> char c >> return r)
Since space is used as a separator so often I added a short-cut for that:
thenSpace = thenChar ' '
Then I put that to use on parseAddress:
parseAddress = let
hexStr2Int = Prelude.read . ("0x" ++)
in do
start <- thenChar '-' $ many1 hexDigit
end <- many1 hexDigit
return $ Address (hexStr2Int start) (hexStr2Int end)
Modifying the other parsing functions using thenChar and thenSpace is straight forward.
I’m not entirely sure I understand what Conal meant with the part about liftM in his comment. I suspect his referring to the fact that I first read characters and then convert them in the “constructors”. By using liftM I can move the conversion “up in the code”. Here’s parseAddress after I’ve moved the calls to hexStr2Int:
parseAddress = let
hexStr2Int = Prelude.read . ("0x" ++)
in do
start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
end <- liftM hexStr2Int $ many1 hexDigit
return $ Address start end
After modifying the other parsing functions in a similar way I ended up with this:
parsePerms = let
cA a = case a of
'p' -> Private
's' -> Shared
in do
r <- liftM (== 'r') anyChar
w <- liftM (== 'w') anyChar
x <- liftM (== 'x') anyChar
a <- liftM cA anyChar
return $ Perms r w x a
parseDevice = let
hexStr2Int = Prelude.read . ("0x" ++)
in do
maj <- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit
min <- liftM hexStr2Int $ many1 hexDigit
return $ Device maj min
parseRegion = let
hexStr2Int = Prelude.read . ("0x" ++)
parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
in do
addr <- thenSpace parseAddress
perm <- thenSpace parsePerms
offset <- liftM hexStr2Int $ thenSpace $ many1 hexDigit
dev <- thenSpace parseDevice
inode <- liftM Prelude.read $ thenSpace $ many1 digit
path <- parsePath <|> string ""
return $ MemRegion addr perm offset dev inode path
Is this code more “functional”? Is it easier to read? You’ll have to be the judge of that…
Conal, if I got the intention of your comment completely wrong then feel free to tell me I’m an idiot
I’ve long wanted to dip my toes in the Parsec water. I’ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed attempts. Until now that is
I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First a look at the man-page offers a good description of the format of a line:
All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.
Now, on to the actual “parsec-ing”. Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the maps file is so simple it’s easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:
parseAddress = let
hexStr2Int = Prelude.read . ("0x" ++)
in do
start <- many1 hexDigit
char '-'
end <- many1 hexDigit
return $ Address (hexStr2Int start) (hexStr2Int end)
Since the addresses themselves are in hexadecimal and always are of at least length 1 I use many1 hexDigit to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use count 8 hexDigit but I haven’t tried it. I’ve found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I’ve found is the slightly less readable fst . (!! 0) . readHex. According to the man-page the addresses are separated by a single dash so I’ve hardcoded that in there.
Testing the function is fairly simple. Using gchi, first load the source file then use parse:
*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})
Seems to work well enough.
Next up, parsing the permissions. This is so very straightforward that I don’t think I need to comment on it:
parsePerms = let
cA a = case a of
'p' -> Private
's' -> Shared
in do
r <- anyChar
w <- anyChar
x <- anyChar
a <- anyChar
return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)
For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:
parseDevice = let
hexStr2Int = Prelude.read . ("0x" ++)
in do
maj <- many1 digit
char ':'
min <- many1 digit
return $ Device (hexStr2Int maj) (hexStr2Int min)
Next is to tie it all together and create a MemRegion instance:
It should be noted that it seems there is a space after the inode entry so I keep a char ' ' in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I’m lazy and just consume one or more. I’m not sure exactly what characters are allowed in the pathname itself so I’m lazy once more and just gobble up whatever I find.
To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.
getMemRegions pid = let
fp = "/proc" </> show pid </> "maps"
doParseLine' = parse parseRegion "parseRegion"
doParseLine l = case (doParseLine' l) of
Left _ -> error "Failed to parse line"
Right x -> x
in do
mapContent <- liftM lines $ readFile fp
return $ map doParseLine mapContent
The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:
*Main> getMemRegions 1
This produces a lot of output so while playing with it I limited the mapping to the four first lines by using take. The last line then becomes:
return $ map doParseLine (take 4 mapContent)
Now it’s easy to add a main that uses the first command line argument as the pid:
main = do
pid <- liftM (Prelude.read . (!! 0)) getArgs
regs <- getMemRegions pid
mapM_ (putStrLn . show) regs
Well, that concludes my first adventure in parsing
[Edit 27-05-2007 13:15]
I received an email asking for it so here are the import statements I ended up with:
import Control.Monad
import System
import System.FilePath
import Text.ParserCombinators.Parsec