# Parameters

The API accepts the following parameters. Only token and url are mandatory; the rest are optional.

# token

  • Required
  • Type string

This parameter is required for all calls.

This is your authentication token. You have two tokens: one for normal requests and another for JavaScript requests.

Use the JavaScript token when the content you need to crawl is generated via JavaScript, either because the page is built with JavaScript (React, Angular, etc.) or because the content is generated dynamically in the browser.

Normal token

_USER_TOKEN_

JavaScript token

_JS_TOKEN_

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# url

  • Required
  • Type string

This parameter is required for all calls.

You will need a url to crawl. Make sure it starts with http or https and that it is fully encoded.

For example, the url https://github.com/crawlbase?tab=repositories should be encoded as follows when calling the API: https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
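
As a minimal sketch of the encoding step described above, the following Python builds the request URL with the standard library only. The helper name `build_request_url` is hypothetical (it is not part of any official client), and `_USER_TOKEN_` is the placeholder used throughout this page.

```python
from urllib.parse import quote

API_ENDPOINT = "https://api.crawlbase.com/"  # endpoint taken from the curl example


def build_request_url(token: str, target_url: str) -> str:
    """Return the API URL with the target url fully percent-encoded.

    Illustrative helper only; the name and signature are assumptions.
    """
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("url must start with http or https")
    # safe="" makes quote() also encode ":", "/", "?" and "=",
    # so the whole target url travels as a single query value.
    encoded = quote(target_url, safe="")
    return f"{API_ENDPOINT}?token={token}&url={encoded}"


print(build_request_url("_USER_TOKEN_", "https://github.com/crawlbase?tab=repositories"))
```

Encoding with `safe=""` matters here: without it, the `?tab=repositories` part of the target url would be read as a parameter of the API call itself.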

# format

  • Optional
  • Type string

Indicates the response format, either json or html. Defaults to html.

If the html format is used, Crawlbase will send you back the response parameters in the headers (see HTML response below).

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories&format=json"

# pretty

  • Optional
  • Type boolean

If you're expecting a json response, you can make it easier to read by passing &pretty=true.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories&format=json&pretty=true"

# user_agent

  • Optional
  • Type string

If you want to make the request with a custom user agent, you can pass it here and our servers will forward it to the requested url.

We recommend NOT using this parameter and letting our artificial intelligence handle this.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&user_agent=Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_12_5%29+AppleWebKit%2F603.2.4+%28KHTML%2C+like+Gecko%29+Version%2F10.1.1+Safari%2F603.2.4&url=https%3A%2F%2Fpostman-echo.com%2Fheaders"
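
The encoded user agent in the request above can be reproduced with a short sketch like the one below. It assumes form-style encoding (spaces as `+`), which is what the curl example shows; the variable names are illustrative.

```python
from urllib.parse import quote_plus

# The raw user agent string that corresponds to the curl example above.
user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 "
    "(KHTML, like Gecko) Version/10.1.1 Safari/603.2.4"
)

# quote_plus() encodes spaces as "+" and reserved characters as %XX.
encoded_user_agent = quote_plus(user_agent)
encoded_target = quote_plus("https://postman-echo.com/headers")

request_url = (
    "https://api.crawlbase.com/"
    f"?token=_USER_TOKEN_&user_agent={encoded_user_agent}&url={encoded_target}"
)
print(request_url)
```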

# page_wait

  • Optional
  • Type number

If you are using the JavaScript token, you can optionally pass the page_wait parameter to wait a given number of milliseconds before the browser captures the resulting html code.

This is useful when the page takes a few seconds to render or when some ajax needs to load before the html is captured.

curl "https://api.crawlbase.com/?token=_JS_TOKEN_&page_wait=1000&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# ajax_wait

  • Optional
  • Type boolean

If you are using the JavaScript token, you can optionally pass the ajax_wait parameter to wait for the ajax requests to finish before getting the html response.

curl "https://api.crawlbase.com/?token=_JS_TOKEN_&ajax_wait=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
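
The two wait parameters can be combined in one JavaScript-token request. The sketch below is one possible way to assemble such a URL; `js_request_url` is a hypothetical helper, and writing booleans as lowercase `true` simply follows the examples on this page.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.crawlbase.com/"  # endpoint taken from the curl examples


def js_request_url(token, target_url, page_wait_ms=None, ajax_wait=False):
    """Build a JavaScript-token request URL (illustrative helper, not an official API)."""
    params = {"token": token}
    if page_wait_ms is not None:
        params["page_wait"] = int(page_wait_ms)  # milliseconds
    if ajax_wait:
        # str(True) would give "True"; the examples on this page use lowercase "true".
        params["ajax_wait"] = "true"
    params["url"] = target_url  # urlencode() percent-encodes the value
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(js_request_url("_JS_TOKEN_", "https://github.com/crawlbase?tab=repositories",
                     page_wait_ms=1000, ajax_wait=True))
```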

# css_click_selector

  • Optional
  • Type string

If you are using the JavaScript token, you can optionally pass the css_click_selector parameter to click an element on the page before the browser captures the resulting html code.

This parameter requires a fully specified and valid CSS selector. For instance, you could use an ID selector like #some-button, a class selector such as .some-other-button, or an attribute selector like [data-tab-item="tab1"]. Make sure that the CSS selector is properly encoded.

Please note that the request will fail with pc_status 595 if the selector is not found in the page. If you still want to get the response when the selector is not found, append a selector that is always found, such as body. For example: #some-button,body

curl "https://api.crawlbase.com/?token=_JS_TOKEN_&css_click_selector=%5Bdata-tab-item%3D%22overview%22%5D&page_wait=1000&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
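
A small sketch of the selector handling described above: it optionally appends the `body` fallback and then percent-encodes the result. The function name `encode_click_selector` and its `fallback_to_body` flag are illustrative assumptions.

```python
from urllib.parse import quote


def encode_click_selector(selector: str, fallback_to_body: bool = True) -> str:
    """Percent-encode a CSS selector for css_click_selector (hypothetical helper)."""
    if fallback_to_body:
        # "selector,body" still matches when the first selector is absent from the
        # page, which avoids the pc_status 595 failure described above.
        selector = f"{selector},body"
    return quote(selector, safe="")


print(encode_click_selector('[data-tab-item="overview"]', fallback_to_body=False))
print(encode_click_selector("#some-button"))
```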

# device

  • Optional
  • Type string

Optionally, if you don't want to specify a user_agent but you want the requests to come from a specific device, you can use this parameter.

There are two options available: desktop and mobile.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&device=mobile&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# get_cookies

  • Optional
  • Type boolean

Optionally, if you need to get the cookies that the original website sets on the response, you can use the &get_cookies=true parameter.

The cookies will come back in the header (or in the json response if you use &format=json) as original_set_cookie.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&get_cookies=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# get_headers

  • Optional
  • Type boolean

Optionally, if you need to get the headers that the original website sets on the response, you can use the &get_headers=true parameter.

The headers will come back in the response as original_header_name by default. When &format=json is passed, the headers will come back as original_headers.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&get_headers=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# request_headers

  • Optional
  • Type string

Optionally, if you need to send request headers to the original website, you can use the &request_headers=EncodedRequestHeaders parameter.

Example request headers: accept-language:en-GB|accept-encoding:gzip

Example encoded: &request_headers=accept-language%3Aen-GB%7Caccept-encoding%3Agzip

Please note that not all request headers are allowed by the API. We recommend that you test the headers sent using this testing url: https://postman-echo.com/headers

If you need to send some additional headers which are not allowed by the API, please let us know the header names and we will authorize them for your token.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&request_headers=accept-language%3Aen-GB%7Caccept-encoding%3Agzip&url=https%3A%2F%2Fpostman-echo.com%2Fheaders"
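
The `name:value|name:value` format shown above can be produced from a dictionary, as in this minimal sketch. The helper name `encode_request_headers` is hypothetical; the header names are the ones from the example.

```python
from urllib.parse import quote


def encode_request_headers(headers: dict) -> str:
    """Join headers as name:value pairs separated by "|" and percent-encode them.

    Illustrative helper; the format follows the example on this page.
    """
    joined = "|".join(f"{name}:{value}" for name, value in headers.items())
    return quote(joined, safe="")


encoded = encode_request_headers({"accept-language": "en-GB", "accept-encoding": "gzip"})
print(f"&request_headers={encoded}")
```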

# set_cookies

  • Optional
  • Type string

Optionally, if you need to send cookies to the original website, you can use the &cookies=EncodedCookies parameter.

Example cookies: key1=value1; key2=value2; key3=value3

Example encoded: &cookies=key1%3Dvalue1%3B%20key2%3Dvalue2%3B%20key3%3Dvalue3

We recommend that you test the cookies sent using this testing url: https://postman-echo.com/cookies

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&cookies=key1%3Dvalue1%3B%20key2%3Dvalue2%3B%20key3%3Dvalue3&url=https%3A%2F%2Fpostman-echo.com%2Fcookies"
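
The encoded cookie string above can be reproduced with a sketch like the following. The helper name `encode_cookies` is an assumption; note that it uses `quote` rather than `quote_plus`, so spaces become `%20` as in the example.

```python
from urllib.parse import quote


def encode_cookies(cookies: dict) -> str:
    """Join cookies as "key=value; key=value" and percent-encode the result.

    Illustrative helper; the format follows the example on this page.
    """
    joined = "; ".join(f"{key}={value}" for key, value in cookies.items())
    return quote(joined, safe="")


encoded = encode_cookies({"key1": "value1", "key2": "value2", "key3": "value3"})
print(f"&cookies={encoded}")
```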

# cookies_session

  • Optional
  • Type string

If you need to send the cookies that come back on every request to all subsequent calls, you can use the &cookies_session= parameter.

The &cookies_session= parameter can be any value. Simply send a new value to create a new cookies session. The cookies returned by each call will then be sent on the next API calls that use the same cookies session value. The value can be at most 32 characters long, and sessions expire 300 seconds after the last API call.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&cookies_session=1234abcd&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"
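
One possible way to manage session values on the client side is sketched below. The `CookiesSession` class is hypothetical; it only mirrors the two limits stated above (32 characters, 300 seconds) and tracks expiry locally, so it is an approximation of what the server decides.

```python
import time
import uuid

MAX_SESSION_ID_LENGTH = 32   # maximum value length stated above
SESSION_TTL_SECONDS = 300    # sessions expire this long after the last API call


class CookiesSession:
    """Client-side bookkeeping for a cookies_session value (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_call = None
        self.session_id = self._new_id()

    @staticmethod
    def _new_id() -> str:
        session_id = uuid.uuid4().hex  # 32 hexadecimal characters
        assert len(session_id) <= MAX_SESSION_ID_LENGTH
        return session_id

    def value_for_next_call(self) -> str:
        """Return the value to send, starting a new session if the old one likely expired."""
        now = self._clock()
        if self._last_call is not None and now - self._last_call >= SESSION_TTL_SECONDS:
            self.session_id = self._new_id()
        self._last_call = now
        return self.session_id


session = CookiesSession()
print(f"&cookies_session={session.value_for_next_call()}")
```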

# screenshot

  • Optional
  • Type boolean

If you are using the JavaScript token, you can optionally pass the &screenshot=true parameter to get a screenshot, in JPEG format, of the whole crawled page.

Crawlbase will send you back the screenshot_url in the response headers (or in the json response if you use &format=json).
The screenshot_url expires in one hour.

curl "https://api.crawlbase.com/?token=_JS_TOKEN_&screenshot=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# store

  • Optional
  • Type boolean

Optionally pass the &store=true parameter to store a copy of the API response in the Crawlbase Cloud Storage.

Crawlbase will send you back the storage_url in the response headers (or in the json response if you use &format=json).

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&store=true&url=https%3A%2F%2Fgithub.com%2Fcrawlbase%3Ftab%3Drepositories"

# scraper

  • Optional
  • Type string

Returns the information parsed according to the specified scraper. Check the list of all the available data scrapers to see which one to choose.

The response will come back as JSON.

Please note: Scraper is an optional parameter. If you don't use it, you will receive back the full HTML of the page so you can scrape it freely.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&scraper=amazon-product-details&url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB0B7CBZZ16"

# async

  • Optional
  • Type boolean
  • Currently only linkedin.com is supported using this parameter. Talk to us if you require other domains on async mode.

Optionally pass the &async=true parameter to crawl the requested URL asynchronously. Crawlbase will store the resulting page in the Crawlbase Cloud Storage.

When you make a call with async=true, Crawlbase will send you back the request identifier rid in the json response. You will need to store the RID to retrieve the document from the storage. With the RID, you can then use the Cloud Storage API to retrieve the resulting page.

You can use the async=true parameter in combination with other API parameters, for example &async=true&autoparse=true.

Example of request with async=true call:

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&async=true&url=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fcrawlbase"

Example of response with async=true call:

{ "rid": "1e92e8bff32c31c2728714d4" }
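
Since the RID has to be kept for the later retrieval, a client will typically pull it out of the json body right away. The sketch below shows one way to do that; `extract_rid` is a hypothetical helper, and the sample body is the response shown above.

```python
import json


def extract_rid(response_body: str) -> str:
    """Return the rid from an async=true json response (illustrative helper)."""
    payload = json.loads(response_body)
    rid = payload.get("rid")
    if not rid:
        raise ValueError("no rid found in the async response")
    return rid


rid = extract_rid('{ "rid": "1e92e8bff32c31c2728714d4" }')
print(rid)
```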

# autoparse

  • Optional
  • Type boolean

Optionally, if you need to get the scraped data of the page that you requested, you can pass the &autoparse=true parameter.

The response will come back as JSON. The structure of the response varies depending on the URL that you sent.

Please note: &autoparse=true is an optional parameter. If you don't use it, you will receive back the full HTML of the page so you can scrape it freely.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&autoparse=true&url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB0B7CBZZ16"

# country

  • Optional
  • Type string

If you want your requests to be geolocated from a specific country, you can use the &country= parameter, like &country=US (two-character country code).

Please take into account that specifying a country can reduce the number of successful requests you get back, so use it wisely and only when geolocation crawls are required.

Also note that some websites like Amazon are routed via different special proxies and all countries are allowed regardless of being in the list or not.

You have access to the following countries

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&country=US&url=https%3A%2F%2Fpostman-echo.com%2Fip"

# tor_network

  • Optional
  • Type boolean

If you want to crawl onion websites over the Tor network, you can pass the &tor_network=true parameter.

curl "https://api.crawlbase.com/?token=_USER_TOKEN_&tor_network=true&url=https%3A%2F%2Fwww.facebookcorewwwi.onion%2F"

# scroll

  • Optional
  • Type boolean

If you are using the JavaScript token, you can optionally pass &scroll=true to the API. By default, this will scroll for a scroll_interval of 10 seconds.

If you want to scroll for more than 10 seconds, send &scroll=true&scroll_interval=20. Those parameters instruct the browser to scroll for 20 seconds after loading the page. The maximum scroll interval is 60 seconds; after 60 seconds of scrolling, the system captures the data and returns it to you.

Every 5 seconds of successful scrolling counts as an extra JS request on the Crawling API. For example, if you send a scroll_interval of 20, our system tries to scroll the page for a maximum of 20 seconds; if it was only able to scroll for 10 seconds, only 2 extra requests are consumed instead of 4.

Note: Please make sure to keep your connection open up to 90 seconds if you are intending to scroll for 60 seconds.

curl "https://api.crawlbase.com/?token=_JS_TOKEN_&scroll=true&url=https%3A%2F%2Fwww.reddit.com%2Fsearch%2F%3Fq%3Dcrawlbase"
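
The scroll pricing rule above can be turned into a quick estimate, as in this sketch. The function name `extra_js_requests` is hypothetical, and rounding partial 5-second blocks down is an assumption; the page only gives the 10-second and 20-second cases.

```python
MAX_SCROLL_INTERVAL = 60        # maximum scroll interval stated above, in seconds
SECONDS_PER_EXTRA_REQUEST = 5   # every 5 seconds of successful scroll is one extra JS request


def extra_js_requests(scroll_interval: int, seconds_scrolled: float) -> int:
    """Estimate the extra JS requests consumed by a scroll (illustrative helper).

    Assumption: an incomplete 5-second block is not counted.
    """
    interval = min(scroll_interval, MAX_SCROLL_INTERVAL)
    # Only the time actually scrolled counts, and never more than the interval.
    counted_seconds = max(0.0, min(seconds_scrolled, interval))
    return int(counted_seconds // SECONDS_PER_EXTRA_REQUEST)


print(extra_js_requests(20, 10))  # the page scrolled for only 10 of the 20 seconds
print(extra_js_requests(20, 20))  # the page scrolled for the full interval
```

With a scroll_interval of 20, the two calls correspond to the 2 and 4 extra requests mentioned in the text.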