November 6, 2007

Parsing HTML with Haskell

Roy wrote himself some Ruby code to parse Wikipedia edits because he was being too much of a geek to pay attention to the Redskins game. That's all fine and dandy, except he then put out an invitation for someone to post a Haskell version – damn him.

Next thing I know I'm looking around for a decent HTML parser in Haskell. The best I could find was the latest darcs version of TagSoup by Neil Mitchell. It works better than the released 0.1 version – but there looks to be some room for improvement, as it's not quite as easy to use as the Hpricot library in Ruby.

Well, no worries though, it gets the job done. See the code below in all its Haskellific glory.

Wikipedia actually checks the User-Agent header to decide whether to allow access to the edits page, so the save function won't really work with the site. But rather than clutter up this code example with Haskell's HTTP library, I opted to write the code to read from a cache file. Hit the Wikipedia URL in the code with your browser and save the page to a file. Then run the code on it.
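For what it's worth, reading the saved page back in needs nothing beyond base. This is only a rough sketch and not the code from this post: loadCachedPage is a made-up name, and it assumes you would rather get an error message than a crash when the file is missing.

```haskell
import Control.Exception (IOException, try)

-- | Read the cached page from disk. Hypothetical helper, not the
-- original post's code. Returns Left with a message if the file
-- cannot be opened, or Right with the page contents.
loadCachedPage :: FilePath -> IO (Either String String)
loadCachedPage path = do
  result <- try (readFile path) :: IO (Either IOException String)
  return $ case result of
    Left err   -> Left ("Could not read " ++ path ++ ": " ++ show err)
    Right html -> Right html
```

In GHCi, calling loadCachedPage with whatever file name you saved the page under gives back either the error message or the raw HTML, ready to hand to the parser.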

If everything worked, the output will be one line per edit: the editor name, a colon, and the comment in parentheses. Blank comments get a default "no comment" filler.

  • Angus Lepper: (Reverted 1 edit by identified as vandalism to last revision by Staecker. using TW)
  • (no comment)
  • Staecker: (Reverted edits by (talk) to last version by Ruakh)
  • (no comment)
  • Ruakh: (→Examples - typofix)
  • (→Examples)
  • SieBot: (robot Adding: bg:Haskell)
  • Eklitzke: (add gofer as a dialect of haskell)
  • (→See also - + Curry (programming language))
  • Tobias Bergemann: (→Examples - Linked to function composition operator)
  • (no comment)
  • (added Hungarian wiki version)
  • Ruakh: (Undid vandalism by (talk))
  • (no comment)
  • Ruakh: (it doesn't make sense to say that Gofer influenced Haskell)
  • (→History)
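
The output rule described above (editor name, colon, comment in parentheses, with a filler for blank comments) could be sketched roughly like this. It is not the code from this post: formatEdit is a hypothetical helper, and it assumes the comment text arrives without its surrounding parentheses.

```haskell
import Data.Char (isSpace)

-- | Format one edit as "editor: (comment)".
-- Hypothetical helper, not the original post's code. Assumes the
-- comment is passed in without surrounding parentheses.
formatEdit :: String -> String -> String
formatEdit editor comment = editor ++ ": (" ++ orDefault comment ++ ")"
  where
    -- An empty or whitespace-only comment counts as blank.
    orDefault c
      | all isSpace c = "no comment"
      | otherwise     = c
```

So formatEdit "SieBot" "robot Adding: bg:Haskell" gives SieBot: (robot Adding: bg:Haskell), and an empty comment for the same editor gives SieBot: (no comment).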
