Spade

Overview

Spade strives to be a simple and elegant service that takes away the cruft that you deal with when you want to scrape, or reuse, existing Web content.

Feel free to drop us a line at hello@spade.io and tell us about your experience with the docs.

Scraping: The Problem

We spend countless hours trying to grab content from the Web to build new products that aim to re-model information, extract insights and create better experiences based on content that’s out there.

We’re used to view-source’ing Web pages and building custom scrapers, but then the worst happens: the layout changes.

Spade: The Solution

Spade will throw away useless bits such as ads, irrelevant images, irrelevant text and HTML pieces.

We aim to intelligently turn an unstructured mess into semantic object structures that you can use with a simple API call.

Technology

Our stack is built on Scala, Clojure, and Node.js. Feel free to shoot us an email if you want to know more.

For the API transport protocol, we examined ZeroMQ, Thrift and plain HTTP. We chose HTTP for simplicity.

Authentication

Every call you make must be authenticated for it to succeed.

HTTP Basic Auth

We’re using HTTP Basic Auth because it is simple. We’re skipping the OAuth1/2 drama.

  • Provide your key as a username.
  • For your safety, all calls should be made over HTTPS.
  • You can also use HTTP for reduced latency [1].

$ curl http://<MY-AUTH-KEY>:@api.spade.io/v1/scrape?url=<URL>

  [1] Using plain HTTP with Basic Auth is generally insecure; don’t use it if you deal with sensitive data.
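If your HTTP client does not build the header for you, the scheme described here amounts to Base64-encoding the key followed by a colon (that is, an empty password). Here is a minimal Python sketch using only the standard library. The function name `basic_auth_header` is illustrative and not part of the Spade API.

```python
import base64


def basic_auth_header(api_key: str) -> dict:
    """Build a Basic Auth header with the key as username and an empty password."""
    # Basic Auth encodes "username:password"; the password part is left empty.
    credentials = f"{api_key}:".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# <MY-AUTH-KEY> is a placeholder, as in the curl example.
headers = basic_auth_header("<MY-AUTH-KEY>")
```

The resulting dictionary can be passed as request headers to whichever HTTP client you use.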

Errors

We’re fans of using HTTP status codes for the purposes they were made for. This is a list of errors and responses we’ll use:

  • A healthy response is 200 OK.
  • Assume anything other than 200 OK is an error. Specifically:
    • A 5xx-class error is a server error. It shouldn’t normally happen.
    • A 4xx-class error means your API call is probably wrong.

One last tip: when an error occurs, you can always ignore the HTTP body. However, we’ll try our best to provide useful troubleshooting instructions in the error body should you need them.
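The status-code rules in this section can be sketched as a small Python helper. The function name and the category labels are illustrative assumptions; the API itself only defines the status codes.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to the categories described in this section."""
    if status == 200:
        return "ok"
    if 400 <= status <= 499:
        return "client_error"  # the API call is probably wrong
    if 500 <= status <= 599:
        return "server_error"  # shouldn't normally happen
    # Anything other than 200 OK is treated as an error.
    return "error"
```

A caller might retry on `"server_error"`, fix the request on `"client_error"`, and read the response body only when it wants troubleshooting hints.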

Lastly, drop us a line at support@spade.io if you need assistance.

Buckets

We know that when you want to scrape a page, you’re looking for a high-value asset: the main image; then maybe a title, a description, and so on.

A single Spade call will fetch all of the important images, the main article title, and the article’s description.

With a Bucket, you could get the objects into one of many predefined templates. Here’s a previewbox bucket:

{ title: {obj}, thumb: {obj}, description: {obj}, src: {obj} }

By using the previewbox bucket, you just made a “Preview before posting” box like Facebook and Google+ have.

See how to do this in Scraping.

GET /scrape Scrape a page

Perform a simple scrape on a target URL.

$ curl https://api.spade.io/v1/scrape\
?url=http%3A%2F%2Ftravel.cnn.com/-[ ... ]olf-holiday \
  -u 7wlr7ys6qvds3bwxkks0qwsvm0:

API Spec

GET https://api.spade.io/v1/scrape

parameters
  url       required
  bucket    optional

response
  200 OK    Collection of detected objects

errors
  500       Server error
  401       Unauthorized

Parameters

url

Required. The target URL you would like to scrape.

bucket

Optional. The Bucket type you would like your data in. To learn what a bucket is, see Buckets.
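Because the target URL is itself passed as a query parameter, it needs to be percent-encoded. Here is a minimal Python sketch of assembling the request URL from the two parameters in this section. The helper name `build_scrape_url` is an assumption for illustration.

```python
from typing import Optional
from urllib.parse import urlencode

SCRAPE_ENDPOINT = "https://api.spade.io/v1/scrape"


def build_scrape_url(target_url: str, bucket: Optional[str] = None) -> str:
    """Return the scrape URL with `url` (required) and `bucket` (optional)."""
    params = {"url": target_url}
    if bucket is not None:
        params["bucket"] = bucket
    # urlencode percent-encodes the target URL so its own "?" and "&"
    # characters are not mistaken for parameters of the scrape call.
    return f"{SCRAPE_ENDPOINT}?{urlencode(params)}"
```

Leaving `bucket` out produces a plain scrape request; passing it adds the second parameter.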

Response

Sends back a collection of detected objects on the requested target URL.

[
  {
    "objects": [
      {
        "format": "jpg",
        "height": 360,
        "kind": "image",
        "orientation": "landscape",
        "ratio": 0.5625,
        "source": "http://edition.cnn.com/2013/04/21/travel/earth-day-best-wildlife-sites/index.html?hpt=hp_bn5",
        "uri": "http://i2.cdn.turner.com/cnn/dam/assets/130419173956-earth-day-nature-hikes-gaylor-lake-yosemite-story-top.jpg",
        "width": 640,
        "x-algo": "ogilansky"
      },
      {
        "format": "jpg",
        "height": 360,
        "kind": "image",
        "orientation": "landscape",
        "ratio": 0.5625,
        "source": "http://edition.cnn.com/2013/04/21/travel/earth-day-best-wildlife-sites/index.html?hpt=hp_bn5",
        "uri": "http://i2.cdn.turner.com/cnn/dam/assets/130419173956-earth-day-nature-hikes-gaylor-lake-yosemite-horizontal-gallery.jpg",
        "width": 640,
        "x-algo": "spivak"
      },
      ... more results below ...
    ]
  }
]
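To make use of the response, a client typically walks the `objects` list and filters by `kind`. The Python sketch below assumes the body has the shape of the sample (a top-level list whose entries carry an `objects` array) and that image objects always include `width` and `height`. Both helper names are illustrative.

```python
import json


def images_from_response(body: str) -> list:
    """Return every detected object whose kind is "image"."""
    payload = json.loads(body)
    images = []
    for entry in payload:  # assumed: the top level is a list, as in the sample
        for obj in entry.get("objects", []):
            if obj.get("kind") == "image":
                images.append(obj)
    return images


def largest_image(images: list):
    """Pick the image with the largest pixel area, or None if there are none."""
    return max(images, key=lambda o: o["width"] * o["height"], default=None)
```

In the sample, `ratio` equals `height` divided by `width` (360 / 640 = 0.5625), so it can also serve as a filter when a layout needs a particular shape.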

Errors

For error responses, see the general errors documentation.

Examples

For more examples, see Quickstarts.

Scraping with curl

Let’s build a share preview box service using curl.

First, a checklist

  • An API key: 7wlr7ys6qvds3bwxkks0qwsvm0
  • curl or a command line equivalent (wget, for example)
  • An example page you’d like to scrape: CNN on Google Glass

For a preview box we need the following bits:

  • A title
  • A source link
  • An image
  • A bit of text or description

Issuing a call

We could just call:

http://api.spade.io/v1/scrape?url=our_url

That would return multiple images, title candidates and description candidates, all of which made it through our validation algorithms and are good to use.

This time we’ll let Spade narrow it even further and pick the title, link, image and text for us.

We want this:

http://api.spade.io/v1/scrape?url=our_url&bucket=preview

So, utilizing curl, we can now issue the request pretty easily:

$ curl -G https://api.spade.io/v1/scrape \
    --data-urlencode "url=http://edition.cnn.com/2013/05/01/opinion/chertoff-wearable-devices/index.html?hpt=hp_c4" \
    --data-urlencode "bucket=preview" \
    -u 7wlr7ys6qvds3bwxkks0qwsvm0:

Bind directly onto the result object

Because you’ve used a bucket, you can continue from here with Ember.js or Angular.js (or whatever binding-friendly framework you prefer) and bind directly onto the result object, hooking up title, image, link and text within your template.