An FP Web Scraper DSL + Engine

In July 2014 I decided to take a couple of months off from working for anyone and spend some time with my family and do some coding for myself, not for a client or a company. I got back into coding cool things for the fun of it. Over the time that elapsed I found myself to be the most productive I've been in a very long time. I cranked out a custom FRP library in Scala that was loosely modeled after Netflix's RxJava and Microsoft's Reactive Extensions. I then implemented an asynchronous HTTP server/client using the FRP library, and on top of that I built a complete MVC web framework. I wrote a JSON parser using parser combinators and even implemented a custom CRDT library. It was seriously fun stuff and it made me love programming all over again.

However, one of the things I worked on that I'm probably most proud of I never open sourced: an FP web scraping execution engine with a declarative DSL for building scraper state machines in Scala. I opened my personal GitHub account over the weekend (I have a separate one for work >_< I'll never make that mistake again) for probably the first time this entire month and saw the private repository for it.

Locked Jabba Repo

I had almost completely forgotten about it! I think most of the scrapers I wrote for it are a bit out of date by now, but perusing the source made me feel like there was some legitimately good stuff locked away in there that I just had to release and share with people. So that's what I'm doing with this blog post, and I plan on finally unlocking the repo as soon as this post is published.

The Engine in 15 Seconds

  • Jabba is the main wrapper class
  • Jabba contains a Transactor, a Ledger and a set of Scrapers
  • The Jabba wrapper class acts as a continuous reducer that folds the Transactor over the Ledger (see the sketch after this list)
  • The Transactor visits all new URLs in the Ledger and attempts to apply the page results to each scraper
  • Each Scraper self-selects which URLs are relevant to it and applies its scraper logic when appropriate
  • Every Scraper has a set of extractors that extract attributes/features from the page in addition to more links to crawl
  • The links that are extracted are fed back into the Ledger to be crawled and scraped, possibly by other scrapers
  • Some Scrapers exist only to feed links to other Scrapers (i.e. feed scrapers)
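
To make that core loop concrete, here is a rough sketch of how the fold might look. Everything below is illustrative: the trait, class names and signatures approximate the description above, not the repo's actual API.

// Hypothetical minimal scraper interface for this sketch; the real
// ScraperStateMachine DSL is shown later in the post.
trait Scraper {
  def accepts(url: String): Boolean      // self-select which URLs are relevant
  def scrape(url: String): List[String]  // fetch + extract, returning newly discovered links
}

case class Ledger(pending: List[String], visited: Set[String]) {
  def newUrls: List[String] = pending.filterNot(visited)
  def record(url: String, discovered: List[String]): Ledger =
    Ledger(pending.filterNot(_ == url) ++ discovered, visited + url)
}

class Transactor(scrapers: Set[Scraper]) {
  // Visit every new URL in the ledger, let each scraper self-select,
  // and feed the extracted links back into the ledger.
  def step(ledger: Ledger): Ledger =
    ledger.newUrls.foldLeft(ledger) { (acc, url) =>
      val links = scrapers.toList.filter(_.accepts(url)).flatMap(_.scrape(url))
      acc.record(url, links)
    }
}

class Jabba(transactor: Transactor, initial: Ledger) {
  // The "continuous reducer": keep folding the Transactor over the Ledger
  // until no new URLs remain.
  @annotation.tailrec
  final def run(ledger: Ledger): Ledger =
    if (ledger.newUrls.isEmpty) ledger
    else run(transactor.step(ledger))

  def run(): Ledger = run(initial)
}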

Scraper State Machines

Scrapers are implemented as state machines through a declarative scraper DSL. A Scraper can be in one of three states at any given moment: Pending, Running or Completed. All Scrapers start off in the Pending state, where they wait to fire off initial requests to a set of initial URLs to start their scraping from.
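
Before we get to the DSL itself, a rough mental model of those states might look like the snippet below. This is purely illustrative, not the repo's actual types.

object ScraperStates {
  sealed trait ScraperState
  case object Pending   extends ScraperState // waiting to fire off its initial requests
  case object Running   extends ScraperState // actively scraping fed and discovered URLs
  case object Completed extends ScraperState // no work left to do

  // Pending -> Running once the initial requests have fired;
  // Running -> Completed once there is nothing left to scrape.
  def advance(state: ScraperState, initialRequestsFired: Boolean, workRemaining: Boolean): ScraperState =
    state match {
      case Pending if initialRequestsFired => Running
      case Running if !workRemaining       => Completed
      case other                           => other
    }
}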

What follows is an example "feed scraper" for scraping links to posts (nodeLinks) and subsequent pages of the feed itself (nextLinks).

Svbtle Feed Scraper:

object SvbtleFeed {

  val machine = ScraperStateMachine(
    name    = "Svbtle_Feed",
    // Pending state: fire requests at these feed URLs, then transition to running.
    pending = PendingScraper(
      initialUrls = URLs(
        "http://500hats.com/",
        "http://justinkan.com/",
        "http://daltoncaldwell.com/"
      )
    ),
    // Running state: throttle requests and extract post links plus "next page" links.
    running = RunningScraper(
      sleep  = 150.seconds,
      scrape = FeedScraper(
        nodeLinks  = CssSelectorNodes(".article_title a"), // links to individual posts
        nodeTarget = SvbtleNode(),                          // scraper that consumes those links
        nextLinks  = RelNextLink(),                         // pagination links (rel="next")
        nextTarget = SvbtleFeed()                           // feed pages are fed back to this scraper
      )
    ),
    completed  = CompletedScraper(),
    // Sanity checks that the scraper is still able to scrape its targets.
    assertions = ScraperAssertions(
      MustContainTargets,
      MustContainTargetsFor(SvbtleNode())
    )
  )

  def apply(): ScraperStateMachine = machine

}

Here we can see that a scraper is a ScraperStateMachine object that contains a unique name, a specific set of scraper rules to apply in each state, and some scraper assertions that validate the scraper is still up to date and capable of scraping the targeted pages.

The pending scraper configuration has an optional set of URLs to begin its scraping from. These will all fire off before the scraper transitions into the running state.

The most interesting configuration is the running state configuration. It has a configured sleep duration so that we don't overwhelm our scraping targets. Most importantly, it also has a scrape parameter which takes either a FeedScraper or a NodeScraper. This instance takes a FeedScraper, which extracts nodeLinks to send to other scrapers (SvbtleNode in this case) and also extracts nextLinks that it feeds recursively back to itself.

Since this scraper is feeding links to SvbtleNode let's check that one out next.

object SvbtleNode {

  val machine = ScraperStateMachine(
    name = "Svbtle_Node",
    // No initialUrls: this scraper waits to be fed post links by SvbtleFeed.
    pending = PendingScraper(),
    running = RunningScraper(
      sleep  = 15.seconds,
      // Extract metadata and content attributes from each post page.
      scrape = NodeScraper(
        TextScraper("title", CssSelectorNodes(".article_title a")),
        AttributeScraper("date", "datetime", CssSelectorNodes(".article_time")),
        OptionalScraper(MetaContentScraper("twitter_author", CssSelectorNodes("meta[property='twitter:creator']"))),
        TimedScraper()
      )
    ),
    completed = CompletedScraper(),
    assertions = MustContainData
  )

  def apply(): ScraperStateMachine = machine

}

This node is an example of a node/content style scraper. It's meant to be used to extract metadata and targeted content from specific pages.

It's worth noting that the pending scraper state configuration for this node doesn't contain any initialUrls. Instead this scraper expects to be fed links from other scrapers.

The other important difference with this scraper is that it takes a NodeScraper for its running state configuration. A node scraper takes an arbitrary set of extractors that pull attributes out of the page at the given URL and store them as key/value pairs on the ledger. We have a simple composable DSL for writing these extractors.
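
To give a flavor of that composition, here is a sketch of the extractor shape implied above. It is deliberately simplified, taking a raw CSS selector string instead of CssSelectorNodes and assuming jsoup as the DOM library, so it is not the repo's actual API.

import org.jsoup.nodes.Document

trait Extractor {
  def extract(doc: Document): Map[String, String] // key/value pairs destined for the ledger
}

// Pulls the text of the first node matched by a CSS selector.
case class TextScraper(key: String, selector: String) extends Extractor {
  def extract(doc: Document): Map[String, String] =
    Option(doc.select(selector).first()) match {
      case Some(node) => Map(key -> node.text())
      case None       => Map.empty
    }
}

// Wraps another extractor and tolerates it finding nothing on the page.
case class OptionalScraper(inner: Extractor) extends Extractor {
  def extract(doc: Document): Map[String, String] =
    try inner.extract(doc) catch { case scala.util.control.NonFatal(_) => Map.empty }
}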

State of the System

At the time I left this system alone, all the scrapers were functioning and it was able to run to feed completion. Some notable things I never got around to: making the feed scrapers reentrant, so that after exhausting their initialUrls (or root URLs from the ledger), or reaching content they had already scraped, they would periodically cycle through them again. It would also be nice to have a smarter retry mechanism that learned how often content was updated in specific feeds and adjusted polling appropriately and automatically instead of by hand.
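
For what it's worth, the adaptive polling I had in mind would be along these lines: widen the interval when a feed hasn't changed since the last visit, tighten it when it has. None of this exists in the repo; it's just a sketch of the idea.

import scala.concurrent.duration._

case class PollState(interval: FiniteDuration, lastContentHash: Int)

object AdaptivePolling {
  val minInterval: FiniteDuration = 1.minute
  val maxInterval: FiniteDuration = 6.hours

  def next(state: PollState, contentHash: Int): PollState = {
    val changed = contentHash != state.lastContentHash
    val interval =
      if (changed) (state.interval / 2).max(minInterval) // content moved, poll sooner
      else (state.interval * 2).min(maxInterval)         // quiet feed, back off
    PollState(interval, contentHash)
  }
}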

For development purposes the ledger used was an in-memory append-only log, which is nice for rapid iteration but wouldn't work that well in a production environment. The code is currently structured in such a way that this can easily be swapped out for a persistent ledger, such as Kafka or Cassandra (leveraging wide row support and CQL), via run-of-the-mill polymorphic substitution.
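
The substitution point is roughly the shape below. The trait is paraphrased rather than the repo's exact signature, but the idea is that the engine only ever talks to the ledger through an interface, so a Kafka- or Cassandra-backed implementation slots in without touching the rest of the code.

trait LedgerStore {
  def append(url: String, data: Map[String, String]): Unit
  def newUrls: Seq[String]
  def markVisited(url: String): Unit
}

// Development implementation: an in-memory append-only log.
class InMemoryLedger extends LedgerStore {
  private var entries = Vector.empty[(String, Map[String, String])]
  private var visited = Set.empty[String]

  def append(url: String, data: Map[String, String]): Unit = entries = entries :+ (url -> data)
  def newUrls: Seq[String] = entries.map(_._1).filterNot(visited)
  def markVisited(url: String): Unit = visited += url
}

// A production implementation would persist the same operations to Kafka, or
// to Cassandra wide rows via CQL, behind this same interface.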

The now public repository can be found here: https://github.com/JosephMoniz/jabba

Things I Learned

1.) A surprising number of content websites out there actually implement Twitter Cards in their HTML payloads, as well as Schema.org markup. However, almost no one implements either of these consistently. It was nice to have all this extra metadata available about the content and the authors when it was there, but it was less than fun deciphering what exactly was meant on a per-website basis.

2.) Blogs and publishing sites are structured well enough that you can crank out scrapers for traversing their feeds and their content within 10-15 minutes of hacking around.

3.) Scraping sites that are JavaScript heavy can be tricky and isn't always worth it. I had to make lots of tradeoffs between performance and accuracy here, and went back and forth between using Selenium with real browsers, Selenium with headless browsers, and doing it all without Selenium.