React.js, Angular, Vue, Meteor, or any other website that is built dynamically or uses Ajax to load its content.
This is a hands-on article, so if you want to follow along, make sure you have a ProxyCrawl account. Creating one is free and straightforward, so go ahead and create one here.
Upon registering in ProxyCrawl, you will see that there is no complex interface where you add the URLs you want to crawl. Instead, we built a simple, easy-to-use API that you can call at any time.
So let’s say we want to crawl and scrape information from the following page, which is built entirely with React.js. This is the URL we will use for demo purposes:
If you try to load that URL from your console or terminal, you will see that you don’t get all the HTML of the page. That is because the content is rendered on the client side by React, and with a plain curl command there is no browser, so that JavaScript is never executed.
You can do the test with the following command in your terminal:
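A minimal sketch of that test (the demo URL above is shown here as a placeholder; substitute the actual page you want to check):

```shell
# Fetch the page without a browser. Replace <demo-url> with the React
# page from above. The response contains only the initial HTML shell,
# not the React-rendered content, because curl does not run JavaScript.
curl "<demo-url>"
```

If you search the output for the text you see in your browser, you won’t find it, since it only appears after the JavaScript runs.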
For this tutorial, we will use the demo token 5aA5rambtJS2, but if you are following along, make sure to get your own from the My Account page.
First, we need to make sure we escape the URL, so that any special characters in it don’t collide with the rest of the API call.
For example, if we are using Ruby, we can escape the URL and see the result like this:
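A minimal sketch in Ruby, using the standard library’s CGI.escape (the URL below is a stand-in for the demo page):

```ruby
require 'cgi'

# A stand-in URL; use the demo page URL from above.
url = 'https://www.example.com/react-page'

escaped = CGI.escape(url)
puts escaped
# => https%3A%2F%2Fwww.example.com%2Freact-page
```

Every character that could be misread as part of the API call (`:`, `/`, `?`, `&`, and so on) is percent-encoded, so the whole URL can safely travel as a single query-string parameter.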
The ProxyCrawl API will do the browser rendering for us. We just have to make a request to the following URL:
You will need to replace YOUR_TOKEN with your token :) (remember, for this tutorial we are using 5aA5rambtJS2), and THE_URL with the URL we just encoded.
Let’s do it in Ruby!
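A minimal sketch with Ruby’s standard Net::HTTP, assuming the API call format described above (`https://api.proxycrawl.com/` with `token` and `url` query parameters; check the ProxyCrawl documentation for the current endpoint). The target URL is a stand-in for the demo page:

```ruby
require 'net/http'
require 'uri'
require 'cgi'

token = '5aA5rambtJS2'                        # demo token from this tutorial
url   = 'https://www.example.com/react-page'  # stand-in for the demo page

# Build the API call: the token plus the escaped target URL.
api_url = "https://api.proxycrawl.com/?token=#{token}&url=#{CGI.escape(url)}"

# ProxyCrawl renders the page in a real browser and returns the full HTML.
response = Net::HTTP.get_response(URI(api_url))
puts response.body
```

This time the response body contains the fully rendered HTML, starting with something like this: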
<html lang="en" class="gr__ahfarmer_github_io">
There is now only one part missing: extracting the actual content from the HTML.
This can be done in many different ways, and it depends on the language you are using to build your application. We always suggest using one of the many libraries that are already out there.
Here you have some open source libraries that can help you do the scraping part with the returned HTML:
We hope you enjoyed this tutorial and we hope to see you soon in ProxyCrawl. Happy crawling!