TagSoup, meet Parsec!
Recently I began writing a tool to scrape some information off a web site for some off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can’t you use Parsec for that?” It took me a second to realise that he didn’t mean that I should write my own XML parser but rather that Parsec allows writing parsers of a list of anything. So I thought I’d see just what it’d take to create a parser for [Tag].
A look at the string parser shipped with Parsec offered a lot of inspiration.
First the basic type, TagParser:
type TagParser = GenParser Tag
Parsec's most basic primitive is tokenPrim; the other basic parsers are built on top of it. Taking a cue from the string parser implementation I defined a function called satisfy:
satisfy f = tokenPrim show (\ pos t _ -> updatePosTag pos t) (\ t -> if (f t) then Just t else Nothing)
Positioning in a list of tags is simply an increment of the column, irrespective of which tag is processed:
updatePosTag s _ = incSourceColumn s 1
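To make the position handling concrete, here is a minimal sketch of the definitions so far with type signatures added. It assumes the parsec 3 API (Text.Parsec), String-based tags and no user state, so the TagParser synonym here is fully applied rather than the partially applied form above. failureColumn and example are hypothetical names used only for illustration.

```haskell
import Text.HTML.TagSoup (Tag (..), isTagClose, isTagOpen, isTagText)
import Text.Parsec (Parsec, parse, tokenPrim)
import Text.Parsec.Error (errorPos)
import Text.Parsec.Pos (SourcePos, incSourceColumn, sourceColumn)

-- Assumption: tags carry Strings and the parser has no user state.
type TagParser = Parsec [Tag String] ()

-- Every tag advances the column by one, whatever the tag is.
updatePosTag :: SourcePos -> Tag String -> SourcePos
updatePosTag pos _ = incSourceColumn pos 1

-- Accept a single tag for which the predicate holds.
satisfy :: (Tag String -> Bool) -> TagParser (Tag String)
satisfy f = tokenPrim show (\pos t _ -> updatePosTag pos t)
                      (\t -> if f t then Just t else Nothing)

-- Hypothetical helper: the column reported by a failed parse, if any.
failureColumn :: TagParser a -> [Tag String] -> Maybe Int
failureColumn p ts = case parse p "tags" ts of
  Left err -> Just (sourceColumn (errorPos err))
  Right _  -> Nothing

-- Expects open, text, close, but the third tag is a text tag.
example :: Maybe Int
example = failureColumn
  (satisfy isTagOpen >> satisfy isTagText >> satisfy isTagClose)
  [TagOpen "p" [], TagText "a", TagText "b"]
```

Since the column starts at 1 and grows by one per consumed tag, the column in an error message is the 1-based index of the offending tag in the list.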
Now I have enough to create the first Tag parser—one that accepts a single instance of the specified kind:
tag t = satisfy (~== t) <?> show t
It’s important to put the supplied tag on the right of (~==): the right-hand operand is treated as the pattern, and its empty fields match anything. See its documentation for the details. The second parser is one that accepts any kind of tag:
anyTag = satisfy (const True)
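As a rough illustration of how these two parsers combine, the sketch below runs them over the output of TagSoup's parseTags. It assumes parsec 3 and String-based tags; linkBody and runLinkBody are hypothetical names that are not part of the code above.

```haskell
import Text.HTML.TagSoup (Tag (..), parseTags, (~==))
import Text.Parsec (ParseError, Parsec, parse, tokenPrim, (<?>))
import Text.Parsec.Pos (incSourceColumn)

-- Assumption: tags carry Strings and the parser has no user state.
type TagParser = Parsec [Tag String] ()

satisfy :: (Tag String -> Bool) -> TagParser (Tag String)
satisfy f = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)
                      (\t -> if f t then Just t else Nothing)

-- The supplied tag is the pattern, so it goes on the right of (~==).
tag :: Tag String -> TagParser (Tag String)
tag t = satisfy (~== t) <?> show t

anyTag :: TagParser (Tag String)
anyTag = satisfy (const True)

-- Hypothetical example: an opening <a> (any attributes), one arbitrary
-- tag, then the closing </a>. The middle tag is returned.
linkBody :: TagParser (Tag String)
linkBody = do
  _ <- tag (TagOpen "a" [])
  body <- anyTag
  _ <- tag (TagClose "a")
  return body

runLinkBody :: String -> Either ParseError (Tag String)
runLinkBody = parse linkBody "tags" . parseTags
```

Because the pattern TagOpen "a" [] has an empty attribute list, it matches an opening a tag regardless of its attributes, which is the convenient behaviour of (~==) mentioned above.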
So far so good. The next parser to implement is one that accepts any kind of tag out of a list of tags. Here I want to make use of the convenient behaviour of (~==) so I’ll need to implement a custom version of elem:
l `elemTag` r = or $ l `elemT` r
    where
        l `elemT` [] = [False]
        l `elemT` (r:rs) = (l ~== r) : l `elemT` rs
With that in place it’s easy to implement oneOf and noneOf:
oneOf ts = satisfy (`elemTag` ts)
noneOf ts = satisfy (\ t -> not (t `elemTag` ts))
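A small sketch of oneOf and noneOf in use, again assuming parsec 3 and String-based tags. Here elemTag is written with any, which should behave the same as the explicit recursion above but is a swapped-in formulation, not the original one. headings, firstHeading and findHeading are hypothetical names for illustration.

```haskell
import Text.HTML.TagSoup (Tag (..), parseTags, (~==))
import Text.Parsec (ParseError, Parsec, many, parse, tokenPrim)
import Text.Parsec.Pos (incSourceColumn)

-- Assumption: tags carry Strings and the parser has no user state.
type TagParser = Parsec [Tag String] ()

satisfy :: (Tag String -> Bool) -> TagParser (Tag String)
satisfy f = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)
                      (\t -> if f t then Just t else Nothing)

-- True if the tag matches at least one pattern in the list.
-- The candidate stays on the left of (~==), each pattern on the right.
elemTag :: Tag String -> [Tag String] -> Bool
elemTag t = any (t ~==)

oneOf, noneOf :: [Tag String] -> TagParser (Tag String)
oneOf ts = satisfy (`elemTag` ts)
noneOf ts = satisfy (\t -> not (t `elemTag` ts))

-- Hypothetical patterns: any opening h1 or h2 tag.
headings :: [Tag String]
headings = [TagOpen "h1" [], TagOpen "h2" []]

-- Skip everything up to the first heading and return its opening tag.
firstHeading :: TagParser (Tag String)
firstHeading = many (noneOf headings) >> oneOf headings

findHeading :: String -> Either ParseError (Tag String)
findHeading = parse firstHeading "tags" . parseTags
```

If the input contains no heading at all, oneOf fails at the end of the tag list and findHeading returns a Left with a parse error.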
So, as an example of what this can be used for here is a re-implementation of TagSoup’s partitions:
partitions t = liftM2 (:) (many $ noneOf [t]) (many $ liftM2 (:) (tag t) (many $ noneOf [t]))
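The sketch below compares this parser with TagSoup's own partitions on a small input, assuming parsec 3 and String-based tags. The parser is renamed to the hypothetical partitionsP to avoid a name clash. One difference worth noting, as far as I can tell: the parser version returns the tags before the first match as its first element, whereas TagSoup's partitions drops them.

```haskell
import Control.Monad (liftM2)
import Text.HTML.TagSoup (Tag (..), parseTags, partitions, (~==))
import Text.Parsec (ParseError, Parsec, many, parse, tokenPrim, (<?>))
import Text.Parsec.Pos (incSourceColumn)

-- Assumption: tags carry Strings and the parser has no user state.
type TagParser = Parsec [Tag String] ()

satisfy :: (Tag String -> Bool) -> TagParser (Tag String)
satisfy f = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)
                      (\t -> if f t then Just t else Nothing)

tag :: Tag String -> TagParser (Tag String)
tag t = satisfy (~== t) <?> show t

noneOf :: [Tag String] -> TagParser (Tag String)
noneOf ts = satisfy (\t -> not (any (t ~==) ts))

-- The parser from the post, renamed (hypothetical name).
-- First element: tags before the first match. Remaining elements:
-- one group per match, each starting with the matching tag.
partitionsP :: Tag String -> TagParser [[Tag String]]
partitionsP t = liftM2 (:) (many $ noneOf [t])
                         (many $ liftM2 (:) (tag t) (many $ noneOf [t]))

sample :: [Tag String]
sample = parseTags "x<b>1</b>y<b>2</b>"

viaParsec :: Either ParseError [[Tag String]]
viaParsec = parse (partitionsP (TagOpen "b" [])) "tags" sample

viaTagSoup :: [[Tag String]]
viaTagSoup = partitions (~== (TagOpen "b" [] :: Tag String)) sample
```

Dropping the first element of the Parsec result should give the same groups as viaTagSoup for this input.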
Of course the big question is whether I’ll rewrite my original code using Parsec. Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.
Corey O'Connor:
blink
9 August 2008, 4:01 am
gears turn
Brilliant! That’s a sweet little hack.
Neil Mitchell:
That is awesome! I will be adding a link to this blog article from the user manual and from the website. Well done!
9 August 2008, 9:44 pm
Dmitry Golubovsky:
Interestingly enough, just a couple of days ago I came to the same combination of TagSoup and Parsec when I needed to parse the XML that edoc (the Erlang documentation tool) produces. The parser itself is too specialized to justify using something as large as HXT.
The sample code is here:
http://code.haskell.org/yc2erl/Language/Edoc/Xml2Hs/Parser.hs
and the data structures it creates are defined here:
http://code.haskell.org/yc2erl/Language/Edoc/Xml2Hs/Type.hs
10 August 2008, 3:51 pm
Pepe Iborra:
Is Parsec lazy?
11 August 2008, 2:48 pm
If so, this has potential to be light years better than what the TagTree beta module provides. A teaser: