Charlie Harvey

Muffet: A Perl Web Spider

Download/view

Muffet is a web spider written in Perl with Moose, which I've put up on GitHub under the GPL.

I wrote it to spider newint.org and build a Xapian index file, so it's targeted at that use. However, it can also spit out XML for a Google sitemap, or raw text for debugging.

Bear in mind that it doesn't respect robots.txt. However, you can use an xpath_noindex in pages you want nofollowed. You can also specify the skip_urls parameter, which does a regex match and skips matching URLs. If people nag me I might add robots.txt support using WWW::Mechanize::Polite or something. What do you reckon, folks?
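
To give a rough idea of what regex-based URL skipping looks like, here is a minimal sketch in plain Perl. It is not Muffet's actual code: the should_skip helper, the example patterns and the example.org URLs are all made up for illustration, and how Muffet really takes and applies skip_urls may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Muffet): returns true when the URL
# matches any of the supplied skip patterns.
sub should_skip {
    my ($url, @skip_patterns) = @_;
    for my $pattern (@skip_patterns) {
        return 1 if $url =~ $pattern;
    }
    return 0;
}

# Example patterns, assumed for illustration only.
my @skip_urls = (qr{/login}, qr{\.pdf$}i);

# A pretend crawl queue.
my @queue = (
    'http://example.org/articles/1',
    'http://example.org/login?next=/',
    'http://example.org/report.PDF',
);

# Keep only the URLs that no skip pattern matches.
my @to_crawl = grep { !should_skip($_, @skip_urls) } @queue;

print "$_\n" for @to_crawl;
```

With these made-up patterns, only the articles URL survives the filter; the login page and the PDF are dropped before they would ever be fetched.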