TagSoup, meet Parsec!

Recently I began writing a tool to scrape some information off a web site for off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can’t you use Parsec for that?” It took me a second to realise that he didn’t mean I should write my own XML parser, but rather that Parsec can parse a list of any token type, not just characters. So I thought I’d see just what it’d take to create a parser for [Tag].

A look at the string parser shipped with Parsec offered a lot of inspiration.

First the basic type, TagParser:

type TagParser = GenParser Tag

Parsec’s fundamental primitive is tokenPrim; the other basic parsers are built on top of it. Taking a cue from the string parser implementation, I defined a function called satisfy:

satisfy f = tokenPrim
        show
        (\ pos t _ -> updatePosTag pos t)
        (\ t -> if (f t) then Just t else Nothing)

The position in a list of tags is simply advanced by one column, irrespective of which tag is processed:

updatePosTag s _ = incSourceColumn s 1

Now I have enough to create the first Tag parser—one that accepts a single instance of the specified kind:

tag t = satisfy (~== t) <?> show t

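It’s important to put the supplied tag on the right of (~==). The operator is asymmetric: its right-hand argument acts as a pattern in which blank fields match anything, while the left-hand argument is the tag being tested. See its documentation for the details. The second parser is one that accepts any kind of tag: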

anyTag = satisfy (const True)
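
To make the pieces so far easier to try out, here is a minimal sketch that collects them into one file. It assumes a current parsec 3 and a TagSoup version where Tag takes a string parameter, so it writes Tag String and a fully applied synonym rather than the post's `GenParser Tag`; the type signatures and the names sampleTags, patternOnRight, patternOnLeft and linkText are additions for illustration only.

```haskell
import Text.HTML.TagSoup (Tag (..), parseTags, (~==))
import Text.Parsec (ParseError, Parsec, parse, tokenPrim, (<?>))
import Text.Parsec.Pos (incSourceColumn)

-- Assumption: tags carry String payloads; the user state stays generic.
type TagParser st = Parsec [Tag String] st

-- Accept one tag if the predicate holds; every tag advances one column.
satisfy :: (Tag String -> Bool) -> TagParser st (Tag String)
satisfy f = tokenPrim
        show
        (\pos _ _ -> incSourceColumn pos 1)
        (\t -> if f t then Just t else Nothing)

-- The supplied tag is the pattern, so it goes on the right of (~==).
tag :: Tag String -> TagParser st (Tag String)
tag t = satisfy (~== t) <?> show t

anyTag :: TagParser st (Tag String)
anyTag = satisfy (const True)

-- Hypothetical input for the demonstration.
sampleTags :: [Tag String]
sampleTags = parseTags "<a href='x'>link</a>"

-- Pattern on the right: the empty attribute list matches any attributes (True).
patternOnRight :: Bool
patternOnRight = TagOpen "a" [("href", "x")] ~== (TagOpen "a" [] :: Tag String)

-- Pattern on the left: the href attribute is now required but absent (False).
patternOnLeft :: Bool
patternOnLeft = (TagOpen "a" [] :: Tag String) ~== (TagOpen "a" [("href", "x")] :: Tag String)

-- Match the opening <a>, then return whatever tag follows it.
linkText :: Either ParseError (Tag String)
linkText = parse (tag (TagOpen "a" []) >> anyTag) "sampleTags" sampleTags
```

Loading the file into GHCi and evaluating patternOnRight, patternOnLeft and linkText shows both the asymmetry of (~==) and the parser running over the output of parseTags.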

So far so good. The next parser to implement is one that accepts any kind of tag out of a list of tags. Here I want to make use of the convenient behaviour of (~==) so I’ll need to implement a custom version of elem:

-- True if the tag matches any of the patterns; patterns stay on the right of (~==)
t `elemTag` ts = any (t ~==) ts

With that in place it’s easy to implement oneOf and noneOf:

oneOf ts = satisfy (`elemTag` ts)
noneOf ts = satisfy (\ t -> not (t `elemTag` ts))

As an example of what this can be used for, here is a re-implementation of TagSoup’s partitions:

partitions t = liftM2 (:)
        (many $ noneOf [t])
        (many $ liftM2 (:) (tag t) (many $ noneOf [t]))
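
The following sketch runs this parser next to TagSoup's own partitions to see how closely the two agree. It assumes a current parsec 3 and a TagSoup with a parameterised Tag type; the type signatures and the names sample, chunks and reference are illustrative additions, and noneOf is written with `any`, which should behave the same as the elemTag version above.

```haskell
import Control.Monad (liftM2)
import Text.HTML.TagSoup (Tag (..), parseTags, (~==))
import qualified Text.HTML.TagSoup as TS
import Text.Parsec (ParseError, Parsec, many, parse, tokenPrim)
import Text.Parsec.Pos (incSourceColumn)

-- Assumption: tags carry String payloads; the user state stays generic.
type TagParser st = Parsec [Tag String] st

satisfy :: (Tag String -> Bool) -> TagParser st (Tag String)
satisfy f = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)
                      (\t -> if f t then Just t else Nothing)

tag :: Tag String -> TagParser st (Tag String)
tag t = satisfy (~== t)

noneOf :: [Tag String] -> TagParser st (Tag String)
noneOf ts = satisfy (\t -> not (any (t ~==) ts))

-- The parser from the post, unchanged apart from the signature.
partitions :: Tag String -> TagParser st [[Tag String]]
partitions t = liftM2 (:)
        (many $ noneOf [t])
        (many $ liftM2 (:) (tag t) (many $ noneOf [t]))

-- Hypothetical input: some text, then two paragraphs.
sample :: [Tag String]
sample = parseTags "intro<p>one</p><p>two</p>"

-- Parser version: the tags before the first <p> come back as a leading chunk.
chunks :: Either ParseError [[Tag String]]
chunks = parse (partitions (TagOpen "p" [])) "sample" sample

-- TagSoup's version takes a predicate and drops tags before the first match.
reference :: [[Tag String]]
reference = TS.partitions (~== (TagOpen "p" [] :: Tag String)) sample
```

On this input the parser version differs from TagSoup's in one respect: it returns the tags before the first match as an extra first chunk, so `fmap (drop 1) chunks` is the value to compare against `reference`.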

Of course the big question is whether I’ll rewrite my original code using Parsec. Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.

6 Comments

  1. Is Parsec lazy? If so, this has potential to be light years better than what the TagTree beta module provides. A teaser:

    childrenP = do
      open@(TagOpen name _) <- tagOpen
      content <- many (many1 tagText <|> childrenP)
      close <- tagCloseName name
      return (open : concat content ++ [close])
    
