# Webhook receiving

To receive the data pushed by your crawler, you need to create a webhook endpoint on your server.

Your server webhook should...

  • Be publicly reachable from ProxyCrawl servers
  • Be ready to receive POST calls
  • Respond within 200ms with a status code 200, 201 or 204, without content
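One way to stay inside the 200ms budget is to acknowledge the POST immediately and defer any heavy processing. The following is a minimal sketch using only the Python standard library, not an official ProxyCrawl client. The names `WebhookHandler`, `deliveries` and `make_server` are hypothetical, and the in-memory queue stands in for whatever job system you actually use.

```python
import queue
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory hand-off; a worker would consume this queue later.
deliveries = queue.Queue()


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw (possibly gzip-compressed) body without processing it.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        deliveries.put((dict(self.headers.items()), body))
        # Acknowledge right away with 204 and no content.
        self.send_response(204)
        self.end_headers()

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(host="127.0.0.1", port=0):
    # Port 0 lets the OS pick a free port, which is convenient for local tests.
    return ThreadingHTTPServer((host, port), WebhookHandler)
```

Calling `make_server().serve_forever()` starts the listener. The sketch binds to `127.0.0.1` for local testing; a real deployment would need an address that is publicly reachable.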

The structure of the data depends on the format you specified with the format parameter when pushing the url: &format=html (the default) or &format=json.

The Crawler engine sends the data back to your callback endpoint with a POST request, using gzip compression.

Note: If you are using Zapier webhooks, the Crawler does not send the data compressed. Zapier hooks do not work with Gzip compression.
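Because the body may or may not arrive compressed, a receiver can decide based on the Content-Encoding header. The helper below is a small sketch of that check; the function name `decode_body` is hypothetical and the headers are assumed to be available as a plain dictionary.

```python
import gzip


def decode_body(headers: dict, raw: bytes) -> bytes:
    """Return the payload, decompressing it only when it is marked as gzip."""
    encoding = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-encoding":
            encoding = str(value).strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    return raw
```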

# Request examples

The following examples show what you can expect to receive from the ProxyCrawl Crawler at your server webhook.

# Format HTML

You will receive this when you call the API with &format=html.

Headers:
  "Content-Type" => "text/plain"
  "Content-Encoding" => "gzip"
  "Original-Status" => 200
  "PC-Status" => 200
  "rid" => "The RID you received in the push call"
  "url" => "The URL which was crawled"

Body:
  The HTML of the page
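As an illustration, the headers and body above could be collected into a single record like this. It is a sketch under a few assumptions: the headers are available as a dictionary, the page is decoded as UTF-8 (the real encoding depends on the crawled page), and the names `HtmlDelivery` and `parse_html_delivery` are hypothetical.

```python
import gzip
from dataclasses import dataclass


@dataclass
class HtmlDelivery:
    rid: str
    url: str
    pc_status: int
    original_status: int
    html: str


def parse_html_delivery(headers: dict, raw: bytes) -> HtmlDelivery:
    # Normalise header names, since HTTP header names are case-insensitive.
    h = {name.lower(): value for name, value in headers.items()}
    if str(h.get("content-encoding", "")).lower() == "gzip":
        raw = gzip.decompress(raw)
    return HtmlDelivery(
        rid=h["rid"],
        url=h["url"],
        pc_status=int(h["pc-status"]),
        original_status=int(h["original-status"]),
        # Assumption: UTF-8, with undecodable bytes replaced rather than raising.
        html=raw.decode("utf-8", errors="replace"),
    )
```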

# Format JSON

You will receive this when you call the API with &format=json.

Headers:
  "Content-Type" => "gzip/json"
  "Content-Encoding" => "gzip"

Body:
{
  "pc_status": 200,
  "original_status": 200,
  "rid": "The RID you received in the push call",
  "url": "The URL which was crawled",
  "body": "The HTML of the page"
}

Please note that pc_status and original_status must be checked. You can read more about them here and here respectively.
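A receiver for the JSON format might decompress the body, parse it, and then check both status fields. The sketch below assumes that the payload is valid JSON with the fields listed above and that a value of 200 in both fields means the crawl succeeded; the function names are hypothetical.

```python
import gzip
import json


def parse_json_delivery(raw: bytes, gzipped: bool = True) -> dict:
    """Decompress (when needed) and parse a JSON-format delivery."""
    if gzipped:
        raw = gzip.decompress(raw)
    return json.loads(raw)


def is_successful(payload: dict) -> bool:
    # Assumption: 200 in both fields is treated as a successful crawl.
    return (
        int(payload.get("pc_status", 0)) == 200
        and int(payload.get("original_status", 0)) == 200
    )
```

Pass `gzipped=False` for deliveries that arrive uncompressed, such as the Zapier case.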

# Testing integration

When creating your webhook, it can be helpful to see the exact response for a specific url.

To help with testing, you can configure ProxyCrawl Storage in your crawlers. You can see it here.

# Monitoring bot

The Crawler monitors your webhook url to know its status. If the webhook is down, the Crawler pauses, and it resumes automatically when your webhook comes back up.

Our monitoring bot will keep sending requests to your webhook endpoint. Make sure to ignore those requests and respond to them with a 200 status code.

  • Monitoring requests come as POST requests with a JSON body, just like the non-monitoring calls.
  • Monitoring requests come with the user agent ProxyCrawl Monitoring Bot 1.0, so you can easily identify them, skip processing, and reply with status 200.
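The user agent check described above can be sketched as a small predicate. Matching on the name without the version number is a design choice made here so that the check keeps working if the version changes; it is an assumption, not documented behavior. The function name is hypothetical.

```python
# Prefix of the user agent documented above, without the version number.
MONITORING_USER_AGENT_PREFIX = "ProxyCrawl Monitoring Bot"


def is_monitoring_request(headers: dict) -> bool:
    """Return True when the request comes from the monitoring bot."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "user-agent":
            return str(value).startswith(MONITORING_USER_AGENT_PREFIX)
    return False
```

When the predicate returns True, the webhook can reply with 200 straight away and skip any further processing.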

# Protecting your webhook

If you use a random endpoint like yourdomain.com/2340JOiow43djoqe21rjosi, it is unlikely to be discovered. In any case, you can protect the webhook endpoint with the following methods (or several of them combined):

  • Send a custom header with a token on your request, and check for it in your webhook.
  • Add a url parameter to your webhook url and check for it on the webhook request, for example: yourdomain.com/2340JOiow43djoqe21rjosi?token=1234
  • Only accept POST requests.
  • Check for some of the expected headers (for example Pc-Status, Original-Status, rid, etc).
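The token checks from the list above could look like the sketch below. It goes one step further than checking for existence and compares the value against the expected token in constant time. The header name `X-Webhook-Token` and the function name `token_is_valid` are hypothetical; use whatever custom header you configured.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def token_is_valid(path: str, headers: dict, expected: str,
                   header_name: str = "X-Webhook-Token") -> bool:
    """Accept the request if the token arrives in the url or in the header."""
    # Collect any ?token=... values from the request path.
    candidates = list(parse_qs(urlsplit(path).query).get("token", []))
    # Collect the custom header, matching its name case-insensitively.
    for name, value in headers.items():
        if name.lower() == header_name.lower():
            candidates.append(str(value))
    # compare_digest avoids leaking how many characters matched.
    return any(
        hmac.compare_digest(candidate.encode(), expected.encode())
        for candidate in candidates
    )
```

For example, `token_is_valid("/2340JOiow43djoqe21rjosi?token=1234", {}, "1234")` accepts the request, while a missing or different token is rejected.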

We don't recommend IP whitelisting as our crawlers can push from many different IPs and the IPs can change without prior notification.