<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>therning.org/ magnus &#187; tagsoup</title>
	<atom:link href="http://therning.org/magnus/archives/tag/tagsoup/feed" rel="self" type="application/rss+xml" />
	<link>http://therning.org/magnus</link>
	<description>Incoherent mumblings</description>
	<lastBuildDate>Thu, 12 Jan 2012 13:40:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>TagSoup, meet Parsec!</title>
		<link>http://therning.org/magnus/archives/367</link>
		<comments>http://therning.org/magnus/archives/367#comments</comments>
		<pubDate>Fri, 08 Aug 2008 22:48:09 +0000</pubDate>
		<dc:creator>Magnus</dc:creator>
				<category><![CDATA[Posts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parsec]]></category>
		<category><![CDATA[tagsoup]]></category>

		<guid isPermaLink="false">http://therning.org/magnus/?p=367</guid>
		<description><![CDATA[Recently I began writing a tool to scrape some information off a web site for some off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can&#8217;t you use Parsec for that?” It took me a second to realise that he didn&#8217;t mean that [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I began writing a tool to scrape some information off a web site for off-line processing.  After writing up the basics using <a href="http://www-users.cs.york.ac.uk/~ndm/tagsoup/">TagSoup</a> I showed what I had to a colleague.  His first comment was “Can&#8217;t you use Parsec for that?”  It took me a second to realise that he didn&#8217;t mean that I should write my own XML parser, but rather that Parsec can parse a list of any token type, not just characters.  So I thought I&#8217;d see just what it&#8217;d take to create a parser for <code>[Tag]</code>.</p>

<p>A look at the string parser shipped with Parsec offered a lot of inspiration.</p>

<p>First the basic type, <code>TagParser</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;"><span style="color: #06c; font-weight: bold;">type</span> TagParser <span style="color: #339933; font-weight: bold;">=</span> GenParser Tag</pre></div></div>


<p>Parsec&#8217;s fundamental primitive is <code>tokenPrim</code>; the other basic parsers are built on top of it.  Taking a cue from the string parser implementation I defined a function called <code>satisfy</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">satisfy f <span style="color: #339933; font-weight: bold;">=</span> tokenPrim
        <span style="font-weight: bold;">show</span>
        <span style="color: green;">&#40;</span>\ pos t <span style="color: #339933; font-weight: bold;">_</span> <span style="color: #339933; font-weight: bold;">-&gt;</span> updatePosTag pos t<span style="color: green;">&#41;</span>
        <span style="color: green;">&#40;</span>\ t <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #06c; font-weight: bold;">if</span> <span style="color: green;">&#40;</span>f t<span style="color: green;">&#41;</span> <span style="color: #06c; font-weight: bold;">then</span> Just t <span style="color: #06c; font-weight: bold;">else</span> Nothing<span style="color: green;">&#41;</span></pre></div></div>


<p>The position in a list of tags is simply advanced by one column per tag, irrespective of which tag is processed:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">updatePosTag s <span style="color: #339933; font-weight: bold;">_</span> <span style="color: #339933; font-weight: bold;">=</span> incSourceColumn s <span style="color: red;">1</span></pre></div></div>


<p>Now I have enough to create the first <code>Tag</code> parser&#8212;one that accepts a single instance of the specified kind:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">tag t <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span><span style="color: #339933; font-weight: bold;">~==</span> t<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">&lt;?&gt;</span> <span style="font-weight: bold;">show</span> t</pre></div></div>


<p>It&#8217;s important to put the supplied tag on the right of <code>(~==)</code>, because the right-hand operand is treated as the pattern and its empty fields match anything.  See <a href="http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-TagSoup.html#v%3A~%3D%3D">its documentation</a> for the details.  The second parser is one that accepts any kind of tag:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">anyTag <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span><span style="font-weight: bold;">const</span> True<span style="color: green;">&#41;</span></pre></div></div>


<p>So far so good.  The next parser to implement is one that accepts any tag matching one of a given list of tags.  Here I want to make use of the convenient behaviour of <code>(~==)</code>, so I&#8217;ll need to implement a custom version of <code>elem</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">l `elemTag` r <span style="color: #339933; font-weight: bold;">=</span> <span style="font-weight: bold;">or</span> <span style="color: #339933; font-weight: bold;">$</span> l `elemT` r
    <span style="color: #06c; font-weight: bold;">where</span>
        l `elemT` <span style="color: green;">&#91;</span><span style="color: green;">&#93;</span> <span style="color: #339933; font-weight: bold;">=</span> <span style="color: green;">&#91;</span>False<span style="color: green;">&#93;</span>
        l `elemT` <span style="color: green;">&#40;</span>r:rs<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">=</span> <span style="color: green;">&#40;</span>l <span style="color: #339933; font-weight: bold;">~==</span> r<span style="color: green;">&#41;</span> : l `elemT` rs</pre></div></div>
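

<p>As a side note, the same membership test can probably be written more compactly with <code>any</code>.  The following is only a sketch: it assumes a recent TagSoup where <code>Tag</code> takes the string type as a parameter (hence <code>Tag String</code>), and the type signature and the sample tags in <code>main</code> are my own additions for illustration:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">import Text.HTML.TagSoup (Tag (..), (~==))

-- True if the tag on the left matches any of the patterns on the right.
-- Each pattern ends up on the right of (~==), as the operator expects.
elemTag :: Tag String -&gt; [Tag String] -&gt; Bool
elemTag l = any (l ~==)

main :: IO ()
main = do
    -- Made-up patterns: an empty attribute list matches any attributes.
    let patterns = [TagOpen "p" [], TagOpen "div" []]
    print (TagOpen "p" [("class", "intro")] `elemTag` patterns)
    print (TagText "hello" `elemTag` patterns)</pre></div></div>


<p>The first check should succeed, since the pattern <code>TagOpen "p" []</code> does not constrain the attributes, while the text tag matches neither pattern.</p>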


<p>With that in place it&#8217;s easy to implement <code>oneOf</code> and <code>noneOf</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">oneOf ts <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span>`elemTag` ts<span style="color: green;">&#41;</span>
noneOf ts <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span>\ t <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="font-weight: bold;">not</span> <span style="color: green;">&#40;</span>t `elemTag` ts<span style="color: green;">&#41;</span><span style="color: green;">&#41;</span></pre></div></div>


<p>As an example of what this can be used for, here is a re-implementation of TagSoup&#8217;s <a href="http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-TagSoup.html#v%3Apartitions">partitions</a>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">partitions t <span style="color: #339933; font-weight: bold;">=</span> liftM2 <span style="color: green;">&#40;</span>:<span style="color: green;">&#41;</span>
        <span style="color: green;">&#40;</span>many <span style="color: #339933; font-weight: bold;">$</span> noneOf <span style="color: green;">&#91;</span>t<span style="color: green;">&#93;</span><span style="color: green;">&#41;</span>
        <span style="color: green;">&#40;</span>many <span style="color: #339933; font-weight: bold;">$</span> liftM2 <span style="color: green;">&#40;</span>:<span style="color: green;">&#41;</span> <span style="color: green;">&#40;</span>tag t<span style="color: green;">&#41;</span> <span style="color: green;">&#40;</span>many <span style="color: #339933; font-weight: bold;">$</span> noneOf <span style="color: green;">&#91;</span>t<span style="color: green;">&#93;</span><span style="color: green;">&#41;</span><span style="color: green;">&#41;</span></pre></div></div>
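

<p>To see the pieces working together, here is a minimal self-contained sketch.  It makes a few assumptions that go beyond the code above: it targets Parsec 3 (<code>Text.Parsec</code>) and a recent TagSoup with <code>Tag String</code>, the type signatures are mine, <code>notTag</code> is a hypothetical helper standing in for <code>noneOf [t]</code>, and the HTML snippet is made up:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">import Text.HTML.TagSoup (Tag (..), parseTags, (~==))
import Text.Parsec (Parsec, many, parse, tokenPrim, (&lt;?&gt;))
import Text.Parsec.Pos (incSourceColumn)

-- Assumption: parsers over a list of Tag String, with no user state.
type TagParser = Parsec [Tag String] ()

-- Accept one tag if it satisfies the predicate; each tag counts as one column.
satisfy :: (Tag String -&gt; Bool) -&gt; TagParser (Tag String)
satisfy f = tokenPrim show
        (\pos _ _ -&gt; incSourceColumn pos 1)
        (\t -&gt; if f t then Just t else Nothing)

-- Accept a single tag matching the given pattern.
tag :: Tag String -&gt; TagParser (Tag String)
tag t = satisfy (~== t) &lt;?&gt; show t

-- Hypothetical helper: accept a single tag that does not match the pattern.
notTag :: Tag String -&gt; TagParser (Tag String)
notTag t = satisfy (not . (~== t))

-- The leading run of non-matching tags, then one group per matching tag.
partitions :: Tag String -&gt; TagParser [[Tag String]]
partitions t = (:) &lt;$&gt; many (notTag t)
                   &lt;*&gt; many ((:) &lt;$&gt; tag t &lt;*&gt; many (notTag t))

main :: IO ()
main = print (parse (partitions (TagOpen "p" [])) "tags" tags)
    where
        -- Made-up input for illustration.
        tags = parseTags "&lt;h1&gt;title&lt;/h1&gt;&lt;p&gt;one&lt;/p&gt;&lt;p&gt;two&lt;/p&gt;"</pre></div></div>


<p>For this input the parser should return three groups: the <code>h1</code> tags that precede the first <code>p</code>, followed by one group per <code>p</code> element.  As far as I can tell TagSoup&#8217;s own <code>partitions</code> drops the tags before the first match, so the two differ in that first group.</p>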


<p>Of course the big question is whether I&#8217;ll rewrite my original code using Parsec.  Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.</p>
]]></content:encoded>
			<wfw:commentRss>http://therning.org/magnus/archives/367/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

