<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>therning.org/ magnus &#187; parsec</title>
	<atom:link href="http://therning.org/magnus/archives/tag/parsec/feed" rel="self" type="application/rss+xml" />
	<link>http://therning.org/magnus</link>
	<description>Incoherent mumblings</description>
	<lastBuildDate>Thu, 12 Jan 2012 13:40:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>TagSoup, meet Parsec!</title>
		<link>http://therning.org/magnus/archives/367</link>
		<comments>http://therning.org/magnus/archives/367#comments</comments>
		<pubDate>Fri, 08 Aug 2008 22:48:09 +0000</pubDate>
		<dc:creator>Magnus</dc:creator>
				<category><![CDATA[Posts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parsec]]></category>
		<category><![CDATA[tagsoup]]></category>

		<guid isPermaLink="false">http://therning.org/magnus/?p=367</guid>
		<description><![CDATA[Recently I began writing a tool to scrape some information off a web site for some off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can&#8217;t you use Parsec for that?” It took me a second to realise that he didn&#8217;t mean that [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I began writing a tool to scrape some information off a web site for some off-line processing.  After writing up the basics using <a href="http://www-users.cs.york.ac.uk/~ndm/tagsoup/">TagSoup</a> I showed what I had to a colleague.  His first comment was “Can&#8217;t you use Parsec for that?”  It took me a second to realise that he didn&#8217;t mean that I should write my own XML parser but rather that Parsec allows writing parsers of a list of anything.  So I thought I&#8217;d see just what it&#8217;d take to create a parser for <code>[Tag]</code>.</p>

<p>A look at the string parser shipped with Parsec offered a lot of inspiration.</p>

<p>First the basic type, <code>TagParser</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;"><span style="color: #06c; font-weight: bold;">type</span> TagParser <span style="color: #339933; font-weight: bold;">=</span> GenParser Tag</pre></div></div>


<p>The most basic building block in Parsec is <code>tokenPrim</code>; it&#8217;s what the other primitive parsers are built on.  Taking a cue from the string parser implementation I defined a function called <code>satisfy</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">satisfy f <span style="color: #339933; font-weight: bold;">=</span> tokenPrim
        <span style="font-weight: bold;">show</span>
        <span style="color: green;">&#40;</span>\ pos t <span style="color: #339933; font-weight: bold;">_</span> <span style="color: #339933; font-weight: bold;">-&gt;</span> updatePosTag pos t<span style="color: green;">&#41;</span>
        <span style="color: green;">&#40;</span>\ t <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="color: #06c; font-weight: bold;">if</span> <span style="color: green;">&#40;</span>f t<span style="color: green;">&#41;</span> <span style="color: #06c; font-weight: bold;">then</span> Just t <span style="color: #06c; font-weight: bold;">else</span> Nothing<span style="color: green;">&#41;</span></pre></div></div>


<p>The positioning in a list of tags is simply an increase of column, irrespective of what tag is processed:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">updatePosTag s <span style="color: #339933; font-weight: bold;">_</span> <span style="color: #339933; font-weight: bold;">=</span> incSourceColumn s <span style="color: red;">1</span></pre></div></div>


<p>Now I have enough to create the first <code>Tag</code> parser&#8212;one that accepts a single instance of the specified kind:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">tag t <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span><span style="color: #339933; font-weight: bold;">~==</span> t<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">&lt;?&gt;</span> <span style="font-weight: bold;">show</span> t</pre></div></div>


<p>It&#8217;s important to stick the supplied tag on the right of <code>(~==)</code>.  See <a href="http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-TagSoup.html#v%3A~%3D%3D">its documentation</a> for why that is.  The second parser is one that accepts any kind of tag:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">anyTag <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span><span style="font-weight: bold;">const</span> True<span style="color: green;">&#41;</span></pre></div></div>


<p>So far so good.  The next parser to implement is one that accepts any kind of tag out of a list of tags.  Here I want to make use of the convenient behaviour of <code>(~==)</code> so I&#8217;ll need to implement a custom version of <code>elem</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">l `elemTag` r <span style="color: #339933; font-weight: bold;">=</span> <span style="font-weight: bold;">or</span> <span style="color: #339933; font-weight: bold;">$</span> l `elemT` r
    <span style="color: #06c; font-weight: bold;">where</span>
        l `elemT` <span style="color: green;">&#91;</span><span style="color: green;">&#93;</span> <span style="color: #339933; font-weight: bold;">=</span> <span style="color: green;">&#91;</span>False<span style="color: green;">&#93;</span>
        l `elemT` <span style="color: green;">&#40;</span>r:rs<span style="color: green;">&#41;</span> <span style="color: #339933; font-weight: bold;">=</span> <span style="color: green;">&#40;</span>l <span style="color: #339933; font-weight: bold;">~==</span> r<span style="color: green;">&#41;</span> : l `elemT` rs</pre></div></div>
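
<p>As an aside, the same check can probably be written more compactly with <code>any</code> from the Prelude.  This is only a sketch: the type signature assumes a newer TagSoup in which <code>Tag</code> takes a string type parameter, which was not the case in the 0.6 release linked above.</p>

<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">import Text.HTML.TagSoup (Tag(..), (~==))

-- True if the tag on the left matches any of the patterns in the list.
-- The pattern stays on the right of (~==), as in the version above.
elemTag :: Tag String -&gt; [Tag String] -&gt; Bool
l `elemTag` rs = any (l ~==) rs</pre></div></div>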


<p>With that in place it&#8217;s easy to implement <code>oneOf</code> and <code>noneOf</code>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">oneOf ts <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span>`elemTag` ts<span style="color: green;">&#41;</span>
noneOf ts <span style="color: #339933; font-weight: bold;">=</span> satisfy <span style="color: green;">&#40;</span>\ t <span style="color: #339933; font-weight: bold;">-&gt;</span> <span style="font-weight: bold;">not</span> <span style="color: green;">&#40;</span>t `elemTag` ts<span style="color: green;">&#41;</span><span style="color: green;">&#41;</span></pre></div></div>


<p>So, as an example of what this can be used for here is a re-implementation of TagSoup&#8217;s <a href="http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-TagSoup.html#v%3Apartitions">partitions</a>:</p>


<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">partitions t <span style="color: #339933; font-weight: bold;">=</span> liftM2 <span style="color: green;">&#40;</span>:<span style="color: green;">&#41;</span>
        <span style="color: green;">&#40;</span>many <span style="color: #339933; font-weight: bold;">$</span> noneOf <span style="color: green;">&#91;</span>t<span style="color: green;">&#93;</span><span style="color: green;">&#41;</span>
        <span style="color: green;">&#40;</span>many <span style="color: #339933; font-weight: bold;">$</span> liftM2 <span style="color: green;">&#40;</span>:<span style="color: green;">&#41;</span> <span style="color: green;">&#40;</span>tag t<span style="color: green;">&#41;</span> <span style="color: green;">&#40;</span>many <span style="color: #339933; font-weight: bold;">$</span> noneOf <span style="color: green;">&#91;</span>t<span style="color: green;">&#93;</span><span style="color: green;">&#41;</span><span style="color: green;">&#41;</span></pre></div></div>
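
<p>To try the pieces above together, something like the following self-contained sketch should work.  A few things in it are assumptions rather than part of the code above: the explicit type signatures, a newer TagSoup in which <code>Tag</code> is parameterised over the string type, the <code>hiding</code> clause (Parsec exports its own <code>satisfy</code>, <code>oneOf</code> and <code>noneOf</code> for characters), and the made-up names <code>beforeFirstLink</code> and <code>example</code>.</p>

<div class="wp_syntax"><div class="code"><pre class="haskell" style="font-family:monospace;">import Text.HTML.TagSoup (Tag(..), parseTags, (~==))
import Text.ParserCombinators.Parsec hiding (satisfy, oneOf, noneOf)

-- Assumption: a TagSoup version where Tag takes a string type parameter.
type TagParser = GenParser (Tag String)

-- Accept one tag if the predicate holds; each tag advances the column by one.
satisfy :: (Tag String -&gt; Bool) -&gt; TagParser st (Tag String)
satisfy f = tokenPrim
        show
        (\ pos _ _ -&gt; incSourceColumn pos 1)
        (\ t -&gt; if f t then Just t else Nothing)

tag :: Tag String -&gt; TagParser st (Tag String)
tag t = satisfy (~== t) &lt;?&gt; show t

noneOf :: [Tag String] -&gt; TagParser st (Tag String)
noneOf ts = satisfy (\ t -&gt; not (any (t ~==) ts))

-- Hypothetical example parser: every tag before the first opening a-tag.
beforeFirstLink :: TagParser st [Tag String]
beforeFirstLink = many (noneOf [TagOpen "a" []])

example :: Either ParseError [Tag String]
example = parse beforeFirstLink "tags"
        (parseTags "&lt;p&gt;intro&lt;/p&gt;&lt;a href='x'&gt;link&lt;/a&gt;")</pre></div></div>

<p>Evaluating <code>example</code> in GHCi should give a <code>Right</code> holding the tags that precede the link.</p>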


<p>Of course the big question is whether I&#8217;ll rewrite my original code using Parsec.  Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.</p>
]]></content:encoded>
			<wfw:commentRss>http://therning.org/magnus/archives/367/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Adventures in parsing, part 4</title>
		<link>http://therning.org/magnus/archives/296</link>
		<comments>http://therning.org/magnus/archives/296#comments</comments>
		<pubDate>Tue, 05 Jun 2007 07:19:25 +0000</pubDate>
		<dc:creator>Magnus</dc:creator>
				<category><![CDATA[Posts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parsec]]></category>

		<guid isPermaLink="false">http://therning.org/magnus/archives/296</guid>
		<description><![CDATA[I received a few comments on part 3 of this little mini-series and I just wanted to address them. While doing this I still want the main functions of the parser parseXxx to read like the maps file itself. That means I want to avoid &#8220;reversing order&#8221; like thenChar and thenSpace did in part2. I [...]]]></description>
			<content:encoded><![CDATA[<p>I received a few comments on <a href="http://therning.org/magnus/archives/295">part 3</a> of this little mini-series and I just wanted to address them. While doing this I still want the main functions of the parser <code>parseXxx</code> to read like the <code>maps</code> file itself. That means I want to  avoid &#8220;reversing order&#8221; like <code>thenChar</code> and <code>thenSpace</code> did in <a href="http://therning.org/magnus/archives/290">part2</a>. I also don&#8217;t want to hide things, e.g. I don&#8217;t want to introduce a function that turns <code>(a &lt;* char ' ') &lt;*&gt; b</code> into <code>a &lt;#&gt; b</code>.</p>

<p>So, first up is to do something about <code>hexStr2Int &lt;$&gt; many1 hexDigit</code> which appears all over the place. I made it appear in even more places by moving around a few parentheses; the following two functions are the same:</p>

<pre><code>foo = a &lt;$&gt; (b &lt;* c)
bar = (a &lt;$&gt; b) &lt;* c
</code></pre>
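
<p>The equivalence isn&#8217;t specific to Parsec; it holds for any lawful <code>Applicative</code>. A small sketch using <code>Maybe</code> in place of a parser (the concrete values are made up for illustration):</p>

<pre><code>-- Both groupings apply the function to the result of the first action
-- and still require the second action to succeed.
foo, bar :: Maybe Int
foo = (+ 1) &lt;$&gt; (Just 41 &lt;* Just 'x')
bar = ((+ 1) &lt;$&gt; Just 41) &lt;* Just 'x'

-- If the second action fails, both fail.
fooFail, barFail :: Maybe Int
fooFail = (+ 1) &lt;$&gt; (Just 41 &lt;* (Nothing :: Maybe Char))
barFail = ((+ 1) &lt;$&gt; Just 41) &lt;* (Nothing :: Maybe Char)
</code></pre>

<p>Here <code>foo</code> and <code>bar</code> are both <code>Just 42</code>, and the two failing variants are both <code>Nothing</code>.</p>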

<p>Then I scrapped <code>hexStr2Int</code> completely and instead introduced <code>hexStr</code>:</p>

<pre><code>hexStr = Prelude.read . ("0x" ++) &lt;$&gt; many1 hexDigit
</code></pre>

<p>This means that <code>parseAddress</code> can be rewritten to:</p>

<pre><code>parseAddress = Address &lt;$&gt;
    hexStr &lt;* char '-' &lt;*&gt;
    hexStr
</code></pre>

<p>Rather than, as Conal suggested, introduce an infix operation that addresses the pattern <code>(a &lt;* char ' ') &lt;*&gt; b</code> I decided to do something about <code>a &lt;* char c</code>. I feel Conal&#8217;s suggestion, while shortening the code more than my solution, goes against my wish to not hide things. This is the definition of <code>&lt;##&gt;</code>:</p>

<pre><code>(&lt;##&gt;) l r = l &lt;* char r
</code></pre>
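
<p>Why this works without extra parentheses: an operator with no fixity declaration defaults to <code>infixl 9</code>, so <code>&lt;##&gt;</code> binds tighter than <code>&lt;$&gt;</code> and <code>&lt;*&gt;</code>, which are both <code>infixl 4</code>. Below is a minimal sketch that spells the fixity out. It assumes a current Parsec (the <code>Text.Parsec</code> modules) and a GHC whose Prelude exports the applicative operators, and it uses a plain pair in place of <code>Address</code> to stay self-contained; <code>addressRange</code> is a made-up name.</p>

<pre><code>import Text.Parsec
import Text.Parsec.String (Parser)

-- Stated explicitly here; leaving the declaration out gives the same fixity.
infixl 9 &lt;##&gt;

(&lt;##&gt;) :: Parser a -&gt; Char -&gt; Parser a
l &lt;##&gt; r = l &lt;* char r

hexStr :: Parser Integer
hexStr = read . ("0x" ++) &lt;$&gt; many1 hexDigit

-- Groups as ((,) &lt;$&gt; (hexStr &lt;##&gt; '-')) &lt;*&gt; hexStr.
addressRange :: Parser (Integer, Integer)
addressRange = (,) &lt;$&gt; hexStr &lt;##&gt; '-' &lt;*&gt; hexStr
</code></pre>

<p>With this, <code>parse addressRange "" "a-ff"</code> gives <code>Right (10,255)</code>.</p>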

<p>After this I rewrote <code>parseAddress</code> into:</p>

<pre><code>parseAddress = Address &lt;$&gt;
    hexStr &lt;##&gt; '-' &lt;*&gt;
    hexStr
</code></pre>

<p>The pattern <code>(== c) &lt;$&gt; anyChar</code> appears three times in <code>parsePerms</code> so I gave it a name and moved it down into the <code>where</code> clause. I also modified <code>cA</code> to use pattern matching. I haven&#8217;t spent much time considering error handling in the parser, so I didn&#8217;t add a catch-all pattern for everything else.</p>

<pre><code>parsePerms = Perms &lt;$&gt;
    pP 'r' &lt;*&gt;
    pP 'w' &lt;*&gt;
    pP 'x' &lt;*&gt;
    (cA &lt;$&gt; anyChar)

    where
        pP c = (== c) &lt;$&gt; anyChar
        cA 'p' = Private
        cA 's' = Shared
</code></pre>
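
<p>If error handling ever becomes a concern, one option is to let the parser itself reject unexpected characters instead of relying on a partial function. The following is a sketch of that alternative, not the code used in this series: the helper names <code>flag</code> and <code>parseAccess</code> are made up, the data types are repeated (with positional fields) only so the example stands alone, and it assumes a current Parsec.</p>

<pre><code>import Text.Parsec
import Text.Parsec.String (Parser)

data Access = Shared | Private deriving (Show, Eq)
data Perms = Perms Bool Bool Bool Access deriving (Show, Eq)

-- A permission flag is either its letter (set) or a dash (unset);
-- anything else is a parse error.
flag :: Char -&gt; Parser Bool
flag c = (True &lt;$ char c) &lt;|&gt; (False &lt;$ char '-')

-- Only 'p' and 's' are accepted, so no incomplete pattern match is needed.
parseAccess :: Parser Access
parseAccess = (Private &lt;$ char 'p') &lt;|&gt; (Shared &lt;$ char 's')

parsePerms :: Parser Perms
parsePerms = Perms &lt;$&gt; flag 'r' &lt;*&gt; flag 'w' &lt;*&gt; flag 'x' &lt;*&gt; parseAccess
</code></pre>

<p>With this version <code>parse parsePerms "" "r-xp"</code> succeeds, while an input such as <code>"r-xq"</code> produces a <code>Left</code> with a parse error rather than a runtime exception.</p>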

<p>The last change I made was to remove a bunch of parentheses. I&#8217;m always a little hesitant about removing parentheses and relying on precedence rules, and I find I&#8217;m even more hesitant when programming Haskell, probably because Haskell has <em>a lot</em> of infix operators that I&#8217;m not used to.</p>

<p>The rest of the parser now looks like this:</p>

<pre><code>parseDevice = Device &lt;$&gt;
    hexStr &lt;##&gt; ':' &lt;*&gt;
    hexStr

parseRegion = MemRegion &lt;$&gt;
    parseAddress &lt;##&gt; ' ' &lt;*&gt;
    parsePerms &lt;##&gt; ' ' &lt;*&gt;
    hexStr &lt;##&gt; ' ' &lt;*&gt;
    parseDevice &lt;##&gt; ' ' &lt;*&gt;
    (Prelude.read &lt;$&gt; many1 digit) &lt;##&gt; ' ' &lt;*&gt;
    (parsePath &lt;|&gt; string "")

    where
        parsePath = (many1 $ char ' ') *&gt; (many1 anyChar)
</code></pre>

<p>I think these changes address most of the comments Conal and Twan made on the previous part. Where they don&#8217;t I hope I&#8217;ve explained why I decided not to take their advice.</p>
]]></content:encoded>
			<wfw:commentRss>http://therning.org/magnus/archives/296/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Adventures in parsing, part 3</title>
		<link>http://therning.org/magnus/archives/295</link>
		<comments>http://therning.org/magnus/archives/295#comments</comments>
		<pubDate>Sun, 03 Jun 2007 00:24:08 +0000</pubDate>
		<dc:creator>Magnus</dc:creator>
				<category><![CDATA[Posts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parsec]]></category>

		<guid isPermaLink="false">http://therning.org/magnus/archives/295</guid>
		<description><![CDATA[I got a great many comments, at least by my standards, on my earlier two posts on parsing in Haskell. Especially on the latest one. Conal posted a comment on the first pointing me towards liftM and its siblings, without telling me that it would only be the first step towards &#8220;applicative style&#8221;. So, here [...]]]></description>
			<content:encoded><![CDATA[<p>I got a great many comments, at least by my standards, on my earlier <a href="http://therning.org/magnus/archives/290">two</a> <a href="http://therning.org/magnus/archives/289">posts</a> on parsing in Haskell. Especially on the latest one. Conal posted a comment on the first pointing me towards <code>liftM</code> and its siblings, without telling me that it would only be the first step towards &#8220;applicative style&#8221;. So, here I go again&#8230;</p>

<p>First off, importing <code>Control.Applicative</code>. Apparently <code>&lt;|&gt;</code> is defined in both <code>Applicative</code> and in <code>Parsec</code>. I do use <code>&lt;|&gt;</code> from <code>Parsec</code> so preventing importing it from <code>Applicative</code> seemed like a good idea:</p>

<pre><code>import Control.Applicative hiding ( (&lt;|&gt;) )
</code></pre>

<p>Second, Cale pointed out that I need to make an instance of <code>Control.Applicative.Applicative</code> for <code>GenParser</code>. He was nice enough to point out how to do that, leaving syntax as the only thing I had to struggle with:</p>

<pre><code>instance Applicative (GenParser c st) where
    pure = return
    (&lt;*&gt;) = ap
</code></pre>
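
<p>The same recipe works for any monad. Here is a self-contained sketch on a toy parser type, so that it doesn&#8217;t clash with an instance Parsec may already provide (newer Parsec releases define <code>Applicative</code> for their parser type themselves). The type <code>P</code> and the parser <code>item</code> are made up for illustration.</p>

<pre><code>import Control.Monad (ap, liftM)

-- A toy parser: consume a prefix of the input, or fail with Nothing.
newtype P a = P { runP :: String -&gt; Maybe (a, String) }

instance Functor P where
    fmap = liftM

-- Same shape as the GenParser instance above: (&lt;*&gt;) borrowed from the monad.
instance Applicative P where
    pure x = P (\ s -&gt; Just (x, s))
    (&lt;*&gt;) = ap

instance Monad P where
    P p &gt;&gt;= f = P $ \ s -&gt; case p s of
        Nothing -&gt; Nothing
        Just (a, rest) -&gt; runP (f a) rest

-- Consume a single character.
item :: P Char
item = P $ \ s -&gt; case s of
    [] -&gt; Nothing
    (c:cs) -&gt; Just (c, cs)
</code></pre>

<p>For example, <code>runP ((,) &lt;$&gt; item &lt;*&gt; item) "abc"</code> gives <code>Just (('a','b'),"c")</code>.</p>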

<p>I decided to take baby-steps and I started with <code>parseAddress</code>. Here&#8217;s what it used to look like:</p>

<pre><code>parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start &lt;- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end &lt;- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end
</code></pre>

<p>On Twan&#8217;s suggestion I rewrote it using <code>where</code> rather than <code>let ... in</code> and since this was my first function I decided to go via the <code>ap</code> function (at the same time I broke out <code>hexStr2Int</code> since it&#8217;s used in so many places):</p>

<pre><code>parseAddress = do
    start &lt;- return hexStr2Int `ap` (thenChar '-' $ many1 hexDigit)
    end &lt;- return hexStr2Int `ap` (many1 hexDigit)
    return $ Address start end
</code></pre>

<p>Then on to applying some functions from <code>Applicative</code>:</p>

<pre><code>parseAddress = liftA2 Address start end
    where
        start = hexStr2Int &lt;$&gt; (thenChar '-' $ many1 hexDigit)
        end = hexStr2Int &lt;$&gt; (many1 hexDigit)
</code></pre>

<p>By now the use of <code>thenChar</code> looks a little silly so I changed that part into <code>many1 hexDigit &lt;* char '-'</code> instead. Finally I removed the <code>where</code> part altogether and use <code>&lt;*&gt;</code> to string it all together:</p>

<pre><code>parseAddress = Address &lt;$&gt;
    (hexStr2Int &lt;$&gt; many1 hexDigit &lt;* char '-') &lt;*&gt;
    (hexStr2Int &lt;$&gt; (many1 hexDigit))
</code></pre>

<p>From here on I skipped the intermediate steps and went straight for the last form. Here&#8217;s what I ended up with:</p>

<pre><code>parsePerms = Perms &lt;$&gt;
    ( (== 'r') &lt;$&gt; anyChar) &lt;*&gt;
    ( (== 'w') &lt;$&gt; anyChar) &lt;*&gt;
    ( (== 'x') &lt;$&gt; anyChar) &lt;*&gt;
    (cA &lt;$&gt; anyChar)

    where
        cA a = case a of
            'p' -&gt; Private
            's' -&gt; Shared

parseDevice = Device &lt;$&gt;
    (hexStr2Int &lt;$&gt; many1 hexDigit &lt;* char ':') &lt;*&gt;
    (hexStr2Int &lt;$&gt; (many1 hexDigit))

parseRegion = MemRegion &lt;$&gt;
    (parseAddress &lt;* char ' ') &lt;*&gt;
    (parsePerms &lt;* char ' ') &lt;*&gt;
    (hexStr2Int &lt;$&gt; (many1 hexDigit &lt;* char ' ')) &lt;*&gt;
    (parseDevice &lt;* char ' ') &lt;*&gt;
    (Prelude.read &lt;$&gt; (many1 digit &lt;* char ' ')) &lt;*&gt;
    (parsePath &lt;|&gt; string "")

    where
        parsePath = (many1 $ char ' ') *&gt; (many1 anyChar)
</code></pre>

<p>I have to say I&#8217;m fairly pleased with this version of the parser. It reads about as easily as the first version and there&#8217;s none of the &#8220;reversing&#8221; that <code>thenChar</code> introduced.</p>
]]></content:encoded>
			<wfw:commentRss>http://therning.org/magnus/archives/295/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>More adventures in parsing</title>
		<link>http://therning.org/magnus/archives/290</link>
		<comments>http://therning.org/magnus/archives/290#comments</comments>
		<pubDate>Tue, 29 May 2007 22:50:52 +0000</pubDate>
		<dc:creator>Magnus</dc:creator>
				<category><![CDATA[Posts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parsec]]></category>

		<guid isPermaLink="false">http://therning.org/magnus/archives/290</guid>
		<description><![CDATA[I received an interesting comment from Conal Elliott on my previous post on parsing. I have to admit I wasn&#8217;t sure I understood him at first, I&#8217;m still not sure I do, but I think I have an idea of what he means Basically my code is very sequential in that I use the do [...]]]></description>
			<content:encoded><![CDATA[<p>I received an interesting comment from Conal Elliott on my <a href="http://therning.org/magnus/archives/289#comments">previous post on parsing</a>.  I have to admit I wasn&#8217;t sure I understood him at first, I&#8217;m still not sure I do, but I think I have an idea of what he means <img src='http://therning.org/magnus/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>

<p>Basically my code is very sequential in that I use the <code>do</code> construct everywhere in the parsing code. Personally I thought that makes the parser very easy to read since the code very much mimics the structure of the <code>maps</code> file. I do realise the code isn&#8217;t very &#8220;functional&#8221; though so I thought I&#8217;d take Conal&#8217;s comments to heart and see what the result would be.</p>

<p>Let&#8217;s start with the observation that every entity in a line is separated by a space. However, some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:</p>

<pre><code>thenChar c f = f &gt;&gt;= (\ r -&gt; char c &gt;&gt; return r)
</code></pre>

<p>Since space is used as a separator so often I added a short-cut for that:</p>

<pre><code>thenSpace  = thenChar ' '
</code></pre>
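
<p>For comparison, the same combinator can be expressed with the applicative operator <code>&lt;*</code>, which runs both parsers and keeps the result of the left one. This is a sketch assuming a current Parsec and a GHC whose Prelude exports <code>&lt;*</code>; the primed name is made up to keep the two versions apart.</p>

<pre><code>import Text.Parsec
import Text.Parsec.String (Parser)

-- The monadic definition from above, with a type signature added.
thenChar :: Char -&gt; Parser a -&gt; Parser a
thenChar c f = f &gt;&gt;= (\ r -&gt; char c &gt;&gt; return r)

-- The same behaviour written with (&lt;*).
thenChar' :: Char -&gt; Parser a -&gt; Parser a
thenChar' c f = f &lt;* char c
</code></pre>

<p>Both <code>parse (thenChar '-' (many1 digit)) "" "12-"</code> and the primed variant give <code>Right "12"</code>.</p>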

<p>Then I put that to use on <code>parseAddress</code>:</p>

<pre><code>parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start &lt;- thenChar '-' $ many1 hexDigit
        end &lt;- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)
</code></pre>

<p>Modifying the other parsing functions using <code>thenChar</code> and <code>thenSpace</code> is straightforward.</p>

<p>I&#8217;m not entirely sure I understand what Conal meant with the part about <code>liftM</code> in his comment. I suspect he&#8217;s referring to the fact that I first read characters and then convert them in the &#8220;constructors&#8221;. By using <code>liftM</code> I can move the conversion &#8220;up in the code&#8221;. Here&#8217;s <code>parseAddress</code> after I&#8217;ve moved the calls to <code>hexStr2Int</code>:</p>

<pre><code>parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start &lt;- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end &lt;- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end
</code></pre>

<p>After modifying the other parsing functions in a similar way I ended up with this:</p>

<pre><code>parsePerms = let
        cA a = case a of
            'p' -&gt; Private
            's' -&gt; Shared
    in do
        r &lt;- liftM (== 'r') anyChar
        w &lt;- liftM (== 'w') anyChar
        x &lt;- liftM (== 'x') anyChar
        a &lt;- liftM cA anyChar
        return $ Perms r w x a

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj &lt;- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit
        min &lt;- liftM hexStr2Int $ many1 hexDigit
        return $ Device maj min

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') &gt;&gt; (many1 $ anyChar)
    in do
        addr &lt;- thenSpace parseAddress
        perm &lt;- thenSpace parsePerms
        offset &lt;- liftM hexStr2Int $ thenSpace $ many1 hexDigit
        dev &lt;- thenSpace parseDevice
        inode &lt;- liftM Prelude.read $ thenSpace $ many1 digit
        path &lt;- parsePath &lt;|&gt; string ""
        return $ MemRegion addr perm offset dev inode path
</code></pre>

<p>Is this code more &#8220;functional&#8221;? Is it easier to read? You&#8217;ll have to be the judge of that&#8230;</p>

<p>Conal, if I got the intention of your comment completely wrong then feel free to tell me I&#8217;m an idiot <img src='http://therning.org/magnus/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://therning.org/magnus/archives/290/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Adventures in parsing</title>
		<link>http://therning.org/magnus/archives/289</link>
		<comments>http://therning.org/magnus/archives/289#comments</comments>
		<pubDate>Sun, 27 May 2007 01:32:50 +0000</pubDate>
		<dc:creator>Magnus</dc:creator>
				<category><![CDATA[Posts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parsec]]></category>

		<guid isPermaLink="false">http://therning.org/magnus/archives/289</guid>
		<description><![CDATA[I&#8217;ve long wanted to dip my toes in the Parsec water. I&#8217;ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve long wanted to dip my toes in the <code>Parsec</code> water. I&#8217;ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my <code>~/devo/test/haskell</code> directory tell the story of my failed attempts. Until now that is <img src='http://therning.org/magnus/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>

<p>I picked a nice and regular task for my first real attempt: parsing <code>/proc/&lt;pid&gt;/maps</code>. First a look at the man-page offers a good description of the format of a line:</p>

<pre><code>address           perms offset  dev   inode      pathname
08048000-08056000 r-xp 00000000 03:0c 64593      /usr/sbin/gpm
</code></pre>

<p>So, I started putting together some datatypes. First off the address range:</p>

<pre><code>data Address = Address { start :: Integer, end :: Integer }
    deriving Show
</code></pre>

<p>Then I decided that the &#8216;s&#8217;/&#8216;p&#8217; in the permissions should be called <code>Access</code>:</p>

<pre><code>data Access = Shared | Private
    deriving Show
</code></pre>

<p>The basic permissions (<code>rwx</code>) are simply represented as booleans:</p>

<pre><code>data Perms = Perms {
        read :: Bool,
        write :: Bool,
        executable :: Bool,
        access :: Access
    }
    deriving Show
</code></pre>

<p>The device is straightforward as well:</p>

<pre><code>data Device = Device { major :: Integer, minor :: Integer }
    deriving Show
</code></pre>

<p>At last I tie it all together in a final datatype that represents a memory region:</p>

<pre><code>data MemRegion = MemRegion {
        address :: Address,
        perms :: Perms,
        offset :: Integer,
        device :: Device,
        inode :: Integer,
        pathname :: String
    }
    deriving Show
</code></pre>

<p>All types derive <code>Show</code> (and receive default implementations of <code>show</code>, at least when using GHC) so that they are easy to print.</p>

<p>Now, on to the actual &#8220;parsec-ing&#8221;. Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the <code>maps</code> file is so simple it&#8217;s easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:</p>

<pre><code>parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start &lt;- many1 hexDigit
        char '-'
        end &lt;- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)
</code></pre>

<p>Since the addresses themselves are in hexadecimal and always are of at least length 1 I use <code>many1 hexDigit</code> to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use <code>count 8 hexDigit</code> but I haven&#8217;t tried it. I&#8217;ve found two ways of converting a string representation of a hexadecimal number into an <code>Integer</code>. Above I use the fact that <code>Prelude.read</code> interprets a string beginning with <code>0x</code> as a hexadecimal number. The other way I&#8217;ve found is the slightly less readable <code>fst . (!! 0) . readHex</code>. According to the man-page the addresses are separated by a single dash so I&#8217;ve hardcoded that in there.</p>
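
<p>The two conversions mentioned above can be compared side by side. This is a sketch with made-up function names; the third variant is an addition that returns <code>Nothing</code> on bad input instead of throwing an exception.</p>

<pre><code>import Numeric (readHex)

-- Relies on read accepting a "0x" prefix for hexadecimal literals.
hexViaRead :: String -&gt; Integer
hexViaRead = read . ("0x" ++)

-- Takes the first parse that readHex offers; fails on an empty result list.
hexViaReadHex :: String -&gt; Integer
hexViaReadHex = fst . (!! 0) . readHex

-- A total variant: succeeds only if the whole string is hexadecimal.
hexSafe :: String -&gt; Maybe Integer
hexSafe s = case readHex s of
    [(n, "")] -&gt; Just n
    _ -&gt; Nothing
</code></pre>

<p>Both <code>hexViaRead "ff"</code> and <code>hexViaReadHex "ff"</code> evaluate to <code>255</code>, while <code>hexSafe "xyz"</code> is <code>Nothing</code>.</p>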

<p>Testing the function is fairly simple. Using <code>ghci</code>, first load the source file then use <code>parse</code>:</p>

<pre><code>*Main&gt; parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main&gt; parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})
</code></pre>

<p>Seems to work well enough. <img src='http://therning.org/magnus/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>

<p>Next up, parsing the permissions. This is so very straightforward that I don&#8217;t think I need to comment on it:</p>

<pre><code>parsePerms = let
        cA a = case a of
            'p' -&gt; Private
            's' -&gt; Shared
    in do
        r &lt;- anyChar
        w &lt;- anyChar
        x &lt;- anyChar
        a &lt;- anyChar
        return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)
</code></pre>

<p>For parsing the device information I use the same strategy as for the address range above; this time, however, the separating character is a colon:</p>

<pre><code>parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj &lt;- many1 hexDigit
        char ':'
        min &lt;- many1 hexDigit
        return $ Device (hexStr2Int maj) (hexStr2Int min)
</code></pre>

<p>Next is to tie it all together and create a MemRegion instance:</p>

<pre><code>parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') &gt;&gt; (many1 $ anyChar)
    in do
        addr &lt;- parseAddress
        char ' '
        perm &lt;- parsePerms
        char ' '
        offset &lt;- many1 hexDigit
        char ' '
        dev &lt;- parseDevice
        char ' '
        inode &lt;- many1 digit
        char ' '
        path &lt;- parsePath &lt;|&gt; string ""
        return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path
</code></pre>

<p>The only little trick here is that there are lines that lack the pathname. Here&#8217;s an example from the man-page:</p>

<pre><code>address           perms offset  dev   inode      pathname
08058000-0805b000 rwxp 00000000 00:00 0
</code></pre>

<p>It should be noted that it seems there is a space after the inode entry so I keep a <code>char ' '</code> in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, <code>parsePath &lt;|&gt; string ""</code>. The pathname seems to be prefixed with a fixed number of spaces, but I&#8217;m lazy and just consume one or more. I&#8217;m not sure exactly what characters are allowed in the pathname itself so I&#8217;m lazy once more and just gobble up whatever I find.</p>
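
<p>Parsec also has a combinator for exactly this &#8220;try it, otherwise use a default&#8221; situation: <code>option x p</code> runs <code>p</code> and returns <code>x</code> if <code>p</code> fails without consuming input. Below is a sketch of the tail end of a line written that way. It assumes a current Parsec and a GHC whose Prelude exports the applicative operators, and <code>inodeAndPath</code> is a made-up name covering only the last two fields.</p>

<pre><code>import Text.Parsec
import Text.Parsec.String (Parser)

-- Inode, the single space that follows it, then an optional pathname.
inodeAndPath :: Parser (Integer, String)
inodeAndPath = (,) &lt;$&gt; (read &lt;$&gt; many1 digit) &lt;* char ' ' &lt;*&gt; option "" path
    where
        -- One or more padding spaces, then whatever is left on the line.
        path = many1 (char ' ') *&gt; many1 anyChar
</code></pre>

<p>Parsing <code>"64593      /usr/sbin/gpm"</code> with it gives <code>Right (64593,"/usr/sbin/gpm")</code>, and <code>"0 "</code> gives <code>Right (0,"")</code>.</p>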

<p>To exercise what I had so far I decided to write a function that reads the <code>maps</code> file for a specific process, based on its <code>pid</code>, parses the contents and collects all the <code>MemRegion</code> instances in a list.</p>

<pre><code>getMemRegions pid = let
        fp = "/proc" &lt;/&gt; show pid &lt;/&gt; "maps"
        doParseLine' = parse parseRegion "parseRegion"
        doParseLine l = case (doParseLine' l) of
            Left _ -&gt; error "Failed to parse line"
            Right x -&gt; x
    in do
        mapContent &lt;- liftM lines $ readFile fp
        return $ map doParseLine mapContent
</code></pre>
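
<p>Calling <code>error</code> throws away the details of what went wrong. A possible alternative, sketched here under the assumption of a current Parsec, is to keep the <code>Either</code> and let <code>traverse</code> stop at the first line that fails. The names <code>parseAll</code> and <code>number</code> are made up; <code>number</code> only stands in for <code>parseRegion</code> so the example is self-contained.</p>

<pre><code>import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse every line; the result is the first ParseError or all the values.
parseAll :: Parser a -&gt; [String] -&gt; Either ParseError [a]
parseAll p = traverse (parse p "maps")

-- Hypothetical stand-in for parseRegion: a line holding one decimal number.
number :: Parser Integer
number = read &lt;$&gt; many1 digit &lt;* eof
</code></pre>

<p>Here <code>parseAll number ["1", "22"]</code> is <code>Right [1,22]</code>, whereas a list containing <code>"x"</code> yields a <code>Left</code> carrying the position and the expected input.</p>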

<p>The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:</p>

<pre><code>*Main&gt; getMemRegions 1
</code></pre>

<p>This produces a lot of output so while playing with it I limited the mapping to the first four lines by using <code>take</code>. The last line then becomes:</p>

<pre><code>return $ map doParseLine (take 4 mapContent)
</code></pre>

<p>Now it&#8217;s easy to add a <code>main</code> that uses the first command line argument as the <code>pid</code>:</p>

<pre><code>main = do
    pid &lt;- liftM (Prelude.read . (!! 0)) getArgs
    regs &lt;- getMemRegions pid
    mapM_ (putStrLn . show) regs
</code></pre>

<p>Well, that concludes my first adventure in parsing <img src='http://therning.org/magnus/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>

<p><em>[Edit 27-05-2007 13:15]</em></p>

<p>I received an email asking for it so here are the import statements I ended up with:</p>

<pre><code>import Control.Monad
import System.Environment
import System.FilePath
import Text.ParserCombinators.Parsec
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://therning.org/magnus/archives/289/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

