In this mini-series I look at building a minimally functional XML parser; I build one first of all using Haskell’s Parsec library and then in Part Two I use Perl to build an analogous XML parser.
Introduction
I have recently been working my way through the Real World Haskell book, and had a “fun” time with chapter 16, which introduces Parsec, Haskell’s generic parsing library. It turns out that the book covers an older version, and I spent most of my brain energy learning why the examples didn’t work rather than what they were illustrating. This week I also went to a one-day tutorial with Damian Conway on understanding Perl regular expressions. So, my challenge is to build a toy XML parser to learn Parsec a little better, and then make one with Perl regexen using some of the ideas that Damian passed on.
This time I’ll build an XML parser with Parsec and Haskell; next time it’s Perl and regexen. It will be interesting to see how two very different approaches compare. Before we start, please note that this is just toy code for my own edification, and I seriously wouldn’t use it in the wild. arbn on the #haskell channel on freenode says that Haskell XML Toolbox (HXT) is a good choice for parsing XML: “It’s relatively new, is designed pretty well, and there’s a large family of additional modules that build on it”, and dibblego uses it for GPS track logs and OSM map data. So use HXT if you want to do cool stuff with XML and not just play with Parsec!
Getting started
I started out by thinking about what an XML document actually is. Sure, I could have read the XML spec, and if I were implementing a proper parser that is what I’d do. But for a weekend project my conceptual model can be a little more informal. In particular I will treat all character data as Strings, and I won’t care about CDATA blocks or doctype declarations. I will, however, deal with XML text declarations (i.e. <?xml version="x.y" encoding="ZZZ-Z" ?>) and self-closing tags (i.e. <mytag />).
The conceptual model that I built from was that XML is organized as a document, which is made up of a text declaration or XML element, followed by any number of XML elements. An XML element is either a self-closing tag with optional attributes, or a normal tag. Attributes are a series of key-value pairs, separated by an = and with the value enclosed in double quotes. A normal tag may also have attributes, and may also have child elements. The child elements of a normal tag may be either more XML elements, or a body: the text between the start and end of a tag. The contents of a normal tag must be followed by a closing end tag. Normal tags, as you can see, are recursive: a tag can have child tags, which can have child tags, and so on.
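Written out as an informal grammar, that model looks roughly like this (a sketch of my mental model, not the grammar from the XML spec):
document       ::= (decl | element) element*
element        ::= selfClosingTag | normalTag
selfClosingTag ::= '<' name attribute* '/>'
normalTag      ::= '<' name attribute* '>' (element | body)* '</' name '>'
attribute      ::= name '=' '"' value '"'
body           ::= any text up to the next '<'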
After importing Parsec and a couple of other libraries I will be needing later, I move on to defining my types. Haskell has a particularly powerful type system that generally does what you’d like it to. In this case I use “type synonyms” for the attribute names and values, and a data type with multiple value constructors for my core XML types: normal elements, tag bodies, self-closing tags, and XML text declarations. Elements carry a list of child XML values as well as attributes; Haskell’s type system is perfectly happy with recursively defined types. I also make a simple Attribute datatype wrapping a pair of my AttrName and AttrVal type synonyms. I call the module ReadXML. Because I lack imagination.
module ReadXML where

import Control.Applicative hiding ((<|>), many)
import Control.Monad
import Text.Parsec
import Text.Parsec.String

type AttrName = String
type AttrVal  = String

data Attribute = Attribute (AttrName, AttrVal) deriving (Show)

data XML = Element String [Attribute] [XML]
         | SelfClosingTag String [Attribute]
         | Decl String
         | Body String
         deriving (Show)
Starting with Parsec
With the basic structure out of the way I move on to using Parsec. I start with a top level document function. Each of the parsing functions implements part of the informal spec above.
-- The top level document, used in parsing
document = do
  spaces                    -- strip leading space
  y <- try xmlDecl <|> tag  -- we may start with an XML declaration or a tag
  spaces
  x <- many tag
  spaces
  return (y : x)
We start by telling Parsec to consume any leading space. Whilst it may look pretty to people, what can it profit the machine? These spaces may be followed by an XML text declaration. If that isn’t there then we consume the first tag. Note the use of Parsec’s try function, which tries the first parser and backtracks if it fails. In an earlier version I had written x <- many (try xmlDecl <|> tag). Because of the backtracking involved, using try can make things slow; the slightly more complex code in this version backtracks at most once, rather than for every tag. Anyhow, we consume our tag or XML text declaration and any trailing spaces. Then we consume as many tags as we can, then any trailing spaces. Finally we call return on a list of y followed by x. Imperative programmers should note that this return is not the type of return they are used to: it injects a value into a monadic type, rather than passing it back to the caller. See Learn You a Haskell for Great Good for a better explanation.
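Once the remaining parsers below are defined, a quick GHCi sanity check shows the plumbing at work; a minimal one-tag document comes back as a singleton list:
*ReadXML> parseTest document "<a/>"
[SelfClosingTag "a" []]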
Some more parsers
Now that I have an outline of my document I move on to developing the component parts. I start easy with an XML text declaration.
-- XML declaration eg. <?xml version="1.0" encoding="UTF-8"?>
xmlDecl :: Parser XML
xmlDecl = do
  string "<?xml"
  decl <- many (noneOf "?>")
  string "?>"
  return (Decl decl)
We consume the exact string <?xml, followed by any characters that are neither ? nor >, which we capture as decl. Next we consume ?> and return a new XML text declaration value, with decl as its associated string. I wasn’t that interested in text declarations, or I might have captured the attributes more elegantly!
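For instance, here is a sketch of what capturing them properly might look like, reusing the attribute parser defined further down; the name xmlDeclAttrs is my own and is not part of ReadXML. The try stops a failed attribute attempt from half-consuming the closing ?>.
-- Hypothetical: capture declaration attributes as real Attribute values
xmlDeclAttrs :: Parser [Attribute]
xmlDeclAttrs = do
  string "<?xml"
  spaces
  attrs <- many (try attribute)  -- e.g. version="1.0", encoding="UTF-8"
  string "?>"
  return attrs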
I promise that the next part of the code is as hard as it gets. It deals with tags. Originally I had used another try to distinguish between normal and self-closing tags. That turned out to be massively slow, hence this slightly ugly implementation, which only has to try the closing character(s) of a tag to distinguish between normal and self-closing tags.
-- A normal or self-closing tag. An earlier version parsed these with separate functions,
-- but that made performance unbelievably slow! Normal tags can have sub-elements or
-- a text body (cf: elementBody)
tag = do
  char '<'
  spaces
  name <- many (letter <|> digit)
  spaces
  attr <- many attribute
  spaces
  -- trying just the closing string of the tag bought me an enormous performance
  -- boost, enough to make the difference between being usable and not!
  close <- try (string "/>" <|> string ">")
  if length close == 2
    then return (SelfClosingTag name attr)
    else do
      elementBody <- many elementBody
      endTag name
      spaces
      return (Element name attr elementBody)
Hopefully the Parsec parsers (spaces, char, string) are starting to look familiar to you by now. I wrote the attribute parser too, and it is listed below. I look for an opening <, followed by zero or more spaces (to allow for gnarly XML), followed by a tag name made up of letters or digits, followed by some spaces, followed by zero or more attributes, followed by some spaces. I check whether the closing part of the tag matches "/>" or just ">" and capture it into close. If close is two characters long, it is "/>" and I can return a SelfClosingTag. Otherwise I consume the contents of the tag and capture them in elementBody, look for an end tag with the same name as this tag, followed by some optional spaces, and return the whole lot. Whew!
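A quick GHCi check of the self-closing branch (with the attribute parser from below in scope) shows the shape of the result:
*ReadXML> parseTest tag "<mytag att=\"1\" />"
SelfClosingTag "mytag" [Attribute ("att","1")]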
Gettin’ recursive
The three functions used by the tag function are a little simpler on the surface.
-- The body of an element, consumes any leading spaces; would be nice to not have the try here
elementBody = spaces *> try tag <|> text
-- End tag, assuming that we had a normal, non self-closing tag
endTag str = string "</" *> string str <* char '>'
-- Create a body XML element, from text up to the next tag
text = Body <$> many1 (noneOf "><")
elementBody is the most interesting of these. Recall that it is called from tag, but note too that it calls tag. This recursion allows the body of our tags to be tags in themselves. If the element body is not a tag, we try the text parser instead. Note that *> runs both parsers but throws away the result of the one on its left, so the call to spaces is just consuming any leading whitespace. endTag looks for the string "</", throws it away, matches the tag name it was given, then consumes the closing ">", which it throws away too. text wraps anything that doesn’t look like a tag in a Body value.
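The recursion is easiest to see in action; a tag nested inside another comes back as a child Element, with the innermost text as a Body:
*ReadXML> parseTest tag "<a><b>hi</b></a>"
Element "a" [] [Element "b" [] [Body "hi"]]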
At this point the parser is, believe it or not, nearly done. As I said earlier, I need to implement a parser that will deal with an attribute. This is applied many times up above.
-- Gets a single attribute from within a tag, called with many. Bit verbose, but
-- people can be funny about their use of spaces.
attribute = do
  name <- many (noneOf "= />")
  spaces
  char '='
  spaces
  char '"'
  value <- many (noneOf ['"'])
  char '"'
  spaces
  return (Attribute (name, value))
No surprises here. We look for a name, which is anything except a closing character, a space, or an equals sign. Then we look for ="value" and any trailing spaces. We construct an Attribute value with the name and value we have captured and we are done.
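Run on its own, the attribute parser gives back exactly the pair we expect:
*ReadXML> parseTest attribute "version=\"2.0\""
Attribute ("version","2.0")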
First tryout
I needed some test data, so I will use The Register’s headline feed, which is nice and clean with no ugly CDATA sections to mess us up. I fetch it with wget.
$ wget http://www.theregister.co.uk/headlines.rss
--2012-10-14 12:07:28-- http://www.theregister.co.uk/headlines.rss
Resolving www.theregister.co.uk... 92.52.96.89
Connecting to www.theregister.co.uk|92.52.96.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35327 (34K) [application/rss+xml]
Saving to: `headlines.rss'
100%[==============================================>] 35,327 118K/s in 0.3s
2012-10-14 12:07:28 (118 KB/s) - `headlines.rss' saved [35327/35327]
$ mv headlines.rss register.rss
I escape the quotes and remove newlines (for easy cut and pasting) with Perl.
$ cat register.rss | perl -pe 's/"/\\"/g;s/\n/ /;'
This results in a very big string, which I won’t post here. I did make a smaller, valid string for testing, though.
<?xml version=\"1.0\" encoding=\"UTF-8\"?> <rss version=\"2.0\"> <channel> <title>The Register</title> <link>http://www.theregister.co.uk/</link> <description>Biting the hand that feeds IT</description> <copyright>Copyright 2012, Situation Publishing</copyright> <image> <title>The Register</title> <width>88</width> <height>31</height> <url>http://www.theregister.co.uk/Design/graphics/Reg_default/The_Register_RSS.png</url> <link>http://www.theregister.co.uk/</link> </image> <language>en-GB</language> <webMaster>webmaster@theregister.co.uk</webMaster> <lastBuildDate>Fri, 12 Oct 2012 18:58:41 GMT</lastBuildDate> <ttl>120</ttl> <item><guid isPermaLink=\"false\">tag:theregister.co.uk,2005:story/2012/10/12/rbs_santander_incompatible_it_systems_collapse/</guid><title>Incompatible IT systems blamed for bank sale collapse</title><link>http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/rbs_santander_incompatible_it_systems_collapse/</link><description><h4>RBS £1.7bn branch sale to Santander is off</h4> <p><strong>Brief</strong> Royal Bank of Scotland's $1.7bn sale of 318 branches to Santander has gone titsup.…</p></description><pubDate>Fri, 12 Oct 2012 18:51:54 GMT</pubDate></item> </channel> </rss>
Parsec provides a simple parseTest function to try out your parser with, so we run it over our test string.
*ReadXML> parseTest document "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <rss version=\"2.0\"> <channel> <title>The Register</title> <link>http://www.theregister.co.uk/</link> <description>Biting the hand that feeds IT</description> <copyright>Copyright 2012, Situation Publishing</copyright> <image> <title>The Register</title> <width>88</width> <height>31</height> <url>http://www.theregister.co.uk/Design/graphics/Reg_default/The_Register_RSS.png</url> <link>http://www.theregister.co.uk/</link> </image> <language>en-GB</language> <webMaster>webmaster@theregister.co.uk</webMaster> <lastBuildDate>Fri, 12 Oct 2012 18:58:41 GMT</lastBuildDate> <ttl>120</ttl> <item><guid isPermaLink=\"false\">tag:theregister.co.uk,2005:story/2012/10/12/rbs_santander_incompatible_it_systems_collapse/</guid><title>Incompatible IT systems blamed for bank sale collapse</title><link>http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/rbs_santander_incompatible_it_systems_collapse/</link><description><h4>RBS £1.7bn branch sale to Santander is off</h4> <p><strong>Brief</strong> Royal Bank of Scotland's $1.7bn sale of 318 branches to Santander has gone titsup.…</p></description><pubDate>Fri, 12 Oct 2012 18:51:54 GMT</pubDate></item> </channel> </rss>"
[Decl " version=\"1.0\" encoding=\"UTF-8\"",Element "rss" [Attribute ("version","2.0")] [Element "channel" [] [Element "title" [] [Body "The Register"],Element "link" [] [Body "http://www.theregister.co.uk/"],Element "description" [] [Body "Biting the hand that feeds IT"],Element "copyright" [] [Body "Copyright 2012, Situation Publishing"],Element "image" [] [Element "title" [] [Body "The Register"],Element "width" [] [Body "88"],Element "height" [] [Body "31"],Element "url" [] [Body "http://www.theregister.co.uk/Design/graphics/Reg_default/The_Register_RSS.png"],Element "link" [] [Body "http://www.theregister.co.uk/"]],Element "language" [] [Body "en-GB"],Element "webMaster" [] [Body "webmaster@theregister.co.uk"],Element "lastBuildDate" [] [Body "Fri, 12 Oct 2012 18:58:41 GMT"],Element "ttl" [] [Body "120"],Element "item" [] [Element "guid" [Attribute ("isPermaLink","false")] [Body "tag:theregister.co.uk,2005:story/2012/10/12/rbs_santander_incompatible_it_systems_collapse/"],Element "title" [] [Body "Incompatible IT systems blamed for bank sale collapse"],Element "link" [] [Body "http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/rbs_santander_incompatible_it_systems_collapse/"],Element "description" [] [Body "<h4>RBS \163\&1.7bn branch sale to Santander is off</h4> <p><strong>Brief</strong> Royal Bank of Scotland's $1.7bn sale of 318 branches to Santander has gone titsup.\8230</p>"],Element "pubDate" [] [Body "Fri, 12 Oct 2012 18:51:54 GMT"]]]]]
Sweet, that all looks good.
Adding some convenience
Of course, as it stands this is a fairly useless parser; I would like to make it a little easier to use. I can’t be pushing stuff through Perl and copy-pasting when I want to process XML! I’ll start by making it easier to create a document, from a string or from a file.
-- Parse an XML document from a file
docFromFile filepath = do
  xml <- readFile filepath
  return (docFromString xml)
-- Parse XML document from a string, return in a more convenient form or empty list on failure
docFromString xml =
  case parse document "" xml of
    Right xml -> xml
    Left err  -> []
The docFromString function parses the string and pulls the results out of the Either, returning an empty list if we fail to parse, or our tree of XML elements if we succeed. docFromFile reads a file and calls docFromString with its content. Note that the result is wrapped in the IO monad, and so it requires some additional syntax to be useful.
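In GHCi that extra syntax is just a monadic bind. Assuming the register.rss file from above is in the current directory, a session might look like this; the document parses to a Decl followed by the rss Element, hence two items:
*ReadXML> doc <- docFromFile "register.rss"
*ReadXML> length doc
2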
Now we can parse documents, it would be nice to have some helper functions. I have made a few, but many others could be implemented. I shall start with something on the complex side that the others use: it brings back a list of all tags that match a name you specify, and it will be used by a couple of our convenience functions.
-- Given a name and a list of tags find all the tags with that name
fetchTagByNameFromElements name elements =
  fetch elements
  where
    fetch [] = []
    fetch ((Element nm at xml):es)
      | nm == name = Element nm at xml : fetch xml ++ fetch es
      | otherwise  = fetch xml ++ fetch es
    fetch ((SelfClosingTag nm at):es)
      | nm == name = SelfClosingTag nm at : fetch es
      | otherwise  = fetch es
    fetch ((Decl _):es) = fetch es
    fetch ((Body _):es) = fetch es
We recurse over our XML tree, pulling out elements that match name until there is nothing left to recurse over, then we return a list of the retrieved elements. I found this code kind of ugly and it was a little hard to debug, mostly because I get confused between : and ++ !
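For anyone similarly confused: (:) prepends a single element to a list, while (++) joins two lists, so both lines below build the same list.
*ReadXML> Body "x" : [Body "y"]
[Body "x",Body "y"]
*ReadXML> [Body "x"] ++ [Body "y"]
[Body "x",Body "y"]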
We can give people a nicer interface for fetching elements according to whether they are working on a String or a File.
-- Given a tag name and an xml file, return tags from the xml matching name
fetchTagByNameFromFile name path =
  liftM (fetchTagByNameFromElements name) (docFromFile path)

-- Given a tag name and an xml string, return tags matching name
fetchTagByNameFromString name doc =
  fetchTagByNameFromElements name (docFromString doc)
The liftM is required here to lift fetchTagByNameFromElements so that it can be applied inside the IO monad; remember that the elements were fetched from a file, and Haskell keeps actions that can have side effects in the world isolated in monads. Now I can get, say, all the link tags from my demo file. I have truncated the results.
fetchTagByNameFromFile "link" "register.rss"
[Element "link" [] [Body "http://www.theregister.co.uk/"],Element "link" [] [Body "http://www.theregister.co.uk/"],Element "link" [] [Body "http://go.theregister.com/feed/www.theregister.co.uk/2012/10/13/android_ondevice_malware_scanning/"],…
I often want to extract the content from particular elements, so I implement an extractBody function to allow me to do this.
-- Extract the Body from a tag if it has one, falling back to an empty string
extractBody (Element _ _ (b:_)) = bodyString b -- I assume the body will be the first (only) element
  where bodyString (Body str) = str
        bodyString _          = ""
extractBody _ = ""
So I can now extract all of the links from an RSS file into a string; again I’ve truncated and formatted the results a little.
liftM (unlines <$> map (extractBody)) $ fetchTagByNameFromFile "link" "register.rss"
"http://www.theregister.co.uk/
http://www.theregister.co.uk/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/13/android_ondevice_malware_scanning/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/13/spacex_satellite_fail/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/13/apple_misappropriates_swiss_photographers_image/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/myhrvold_patent_3d_printing/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/ftc_nearing_google_antitrust_decision/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/endeavor_trundles_through_the_streets_of_los_angeles/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/apple_licenses_swiss_clock/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/win8_preorder_pricing_details/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/rbs_santander_incompatible_it_systems_collapse/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/apple_to_drop_samsung_for_tsmc/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/toshiba_mq01abf/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/netapp_strategy_guy/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/active_defence/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/microsoft_suing_google/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/turtle_research/
http://go.theregister.com/feed/www.theregister.co.uk/2012/10/12/linux_foundation_uefi_boot_windows_8/
…
Final thoughts
Using Haskell and Parsec I was able to build the shell of a half decent XML parsing library in a little over 100 lines. As a newcomer to both there are probably mistakes that I made along the way, and ReadXML is very much a toy. But the parser was straightforward to build. I hit some major performance problems with using try; this may be unavoidable, but working around it took some serious reasoning about the recursion. I hardly had to think about Haskell’s type system at all, it “just worked”, which is no mean feat for a static type system like Haskell’s. There were some hairy moments in writing the recursive search function and in making the IO code play with everything else, but on the whole it felt rather fun to work with Parsec and Haskell.
Next time I shall try to build something similar with Perl using regular expressions. But computer time is over for today and the sunshine beckons! You can grab the complete ReadXML code if you want to try it out.
Update: Part Two is now online.