Are you planning to build your own web scraper? Whether you already have a plan or are still looking for the right tools, there is no need to search any further. This article will guide you through building a reliable web scraper from scratch with Node.js, one of the best tools for the job.

Why use Node.js?

So first, why do we recommend Node.js for web scraping? To answer that, let us look at what exactly Node.js is and what advantages it has over other platforms.

In a nutshell, Node.js is an open-source JavaScript runtime environment that runs outside the web browser. The creators of Node.js took JavaScript, which had mostly been restricted to the browser, and allowed it to run on your own computer. With the help of Google Chrome’s V8 engine, we can now run JavaScript on a local machine, where it can access files, listen to network traffic, and even receive HTTP requests and send back a file in response. Databases can also be accessed directly, much as you would with PHP or Ruby on Rails.

When it comes to coding, it is hard to avoid JavaScript. It is one of the most popular programming languages and is used on the client side of almost 95 percent of existing websites. With the launch of Node.js, JavaScript has also become a versatile language for full-stack development.

There are many reasons why Node.js has become so widely adopted. Companies like Netflix, eBay, and PayPal, to name a few, have integrated Node.js into their core systems. To give you a broader sense of why you may want to use Node.js, we have listed some of its advantages:

Processing speed - Node.js is fast, mostly thanks to Chrome’s V8 engine. Rather than only interpreting JavaScript, V8 compiles it into machine code at run time. Performance is further helped by the way Node.js handles concurrent requests with an event loop on a single thread. Because it is designed around non-blocking input/output, the thread does not sit idle waiting for one request to finish before it starts the next, which keeps overhead low when many requests are in flight at once.

Lightweight and highly scalable - Its ability to cope and perform well under a growing workload makes it a favorite among developers. Node.js applications are easier to update and maintain when they are split into decoupled parts, so you can add new components or fix existing ones without having to change the rest of your project. It is also possible to reuse and share code through modules, which act like individual building blocks.

Packages/Libraries - You will not be disappointed by the abundance of packages that can be used with Node.js. Very few programming languages enjoy such a rich ecosystem. Thousands of tools and libraries for JavaScript development are at your disposal via npm, an online registry for publishing open-source packages. With steady support from an ever-growing community, you are almost guaranteed to find packages that help with your specific needs.

Community support - Naturally, an open-source project like Node.js has a massive community of developers providing solutions and guidance all over the internet. Whether you search for repositories on GitHub or look for answers in an online community like Stack Overflow, you will usually find a clear route to resolving any issue you run into along the way.
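
To make the non-blocking model from the processing-speed point more concrete, here is a minimal sketch that uses timers in place of real network calls. The function names (fakeRequest, runAll) and page labels are made up for illustration. Because the three simulated requests wait at the same time on a single thread, the total time should be close to that of the slowest request rather than the sum of all three.

```javascript
// Simulate an I/O-bound task (such as an HTTP request) with a timer.
// The timer does not block the thread while it waits.
function fakeRequest(name, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(name), ms));
}

// Start all three "requests" at once and wait for every one to finish.
async function runAll() {
  const started = Date.now();
  const results = await Promise.all([
    fakeRequest('page-1', 100),
    fakeRequest('page-2', 100),
    fakeRequest('page-3', 100),
  ]);
  return { results, elapsed: Date.now() - started };
}

runAll().then(({ results, elapsed }) => {
  console.log(results, `finished in about ${elapsed} ms`);
});
```

Promise.all keeps the results in the order the tasks were passed in, regardless of which one finishes first.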

Why use ProxyCrawl for your web scraper

You can write the best code in town, but your scraper will only be as good as your proxies. If you are into web scraping, you probably know by now that a large proxy pool should be an integral part of a crawler. Using a pool of proxies will significantly increase your geolocation options, your number of concurrent requests, and, most importantly, your crawling reliability. However, maintaining such a pool yourself can prove difficult on a limited budget. Luckily, ProxyCrawl is an affordable and reliable option. By using the Crawling API, you will have instant access to thousands of residential and data center proxies. Combine this with Artificial Intelligence, and you have the best proxy solution for your project.

Building a web scraper using Node.js and ProxyCrawl

Now we come to the best part. Building your scraper in Node.js is easier than you might think; we only need to prepare a few things before we dive into coding. So, without further ado, let us go through the steps:

  1. Create a free ProxyCrawl account to use the Crawling API service.

  2. Open a terminal and create a new Node.js project, for example by running npm init in an empty folder.

  3. Install the ProxyCrawl module through the terminal by executing the following command:
    npm i proxycrawl

  4. Create a new .js file where we will write our code.

  5. Open the .js file and make good use of the ProxyCrawl Node library.

For the first two lines, bring in the dependencies by requiring the ProxyCrawl library and initializing the API with your ProxyCrawl request token, as shown below:

const { CrawlingAPI } = require('proxycrawl');
const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

  6. Perform a GET request, passing the URL that you wish to scrape along with any options you need from the parameters listed in the Crawling API documentation.

The code should now look like this:

const { CrawlingAPI } = require('proxycrawl');
const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

api
  .get('https://www.ebay.com/sch/i.html?_nkw=ryzen+cpu')
  .then((response) => {
    // Print the body only when both status fields report success
    if (response.statusCode === 200 && response.pcStatus === 200) {
      console.log(response.body);
    }
  })
  // Call console.error with the error so that a failed request is actually logged
  .catch((error) => console.error(error));
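
If the search keywords come from user input rather than being typed into the URL by hand, it may be safer to let Node.js build the query string. The following is a small sketch under that assumption; the helper name buildSearchUrl is hypothetical, and the base URL and the _nkw parameter are simply taken from the example above.

```javascript
// Base search URL taken from the example above.
const EBAY_SEARCH_URL = 'https://www.ebay.com/sch/i.html';

// Hypothetical helper: builds a search URL for the given keywords.
function buildSearchUrl(keywords) {
  const url = new URL(EBAY_SEARCH_URL);
  // searchParams takes care of encoding spaces and reserved characters.
  url.searchParams.set('_nkw', keywords);
  return url.toString();
}

console.log(buildSearchUrl('ryzen cpu'));
// https://www.ebay.com/sch/i.html?_nkw=ryzen+cpu
```

The resulting string can then be passed to api.get() in place of the hard-coded URL.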

You can also use any of the available data scrapers from ProxyCrawl so you can get back the scraped content of the page:

const { CrawlingAPI } = require('proxycrawl');
const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

api
  // The scraper option asks the API to return the scraped content of the page
  .get('https://www.ebay.com/sch/i.html?_nkw=ryzen+cpu', { scraper: 'ebay-serp' })
  .then((response) => {
    if (response.statusCode === 200 && response.pcStatus === 200) {
      console.log(response.body);
    }
  })
  // Call console.error with the error so that a failed request is actually logged
  .catch((error) => console.error(error));
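
When a data scraper is used, you will usually want to work with the result as an object rather than print it. The sketch below assumes the body may arrive either as a JSON string or as an already-parsed object; the exact shape of the scraper output is not shown in this article, so the helper name parseScraperBody and the sample data are made up for illustration.

```javascript
// Hypothetical helper: returns the scraper output as an object,
// or null when the body is not valid JSON (for example, plain HTML).
function parseScraperBody(body) {
  if (body !== null && typeof body === 'object') {
    return body; // already parsed
  }
  try {
    return JSON.parse(body);
  } catch (error) {
    return null;
  }
}

// Made-up sample body, used only to show the helper at work.
const sample = '{"products": [{"title": "Example CPU"}]}';
const data = parseScraperBody(sample);
console.log(data.products[0].title);
```

Returning null instead of throwing lets the calling code decide how to handle a response that was not JSON.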

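The code is complete. You can run it from the terminal with the node command followed by the name of your .js file, or by pressing F5 if your editor supports it (for example, Visual Studio Code on Windows).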

This simple code will crawl any URL through the Crawling API, which is built on top of millions of proxies, and, when a data scraper is specified, will return the results in JSON format. However, this guide would not be complete without showing you how to scrape specific pieces of information using other packages available for Node.js.

So, let us build another version of the scraper, but this time we will integrate Cheerio, a module available for Node.js that is well suited to web scraping. It gives us more freedom to select specific things from a web page using jQuery-style selectors.

In this example, we will try to get the product name and current price of a product on Newegg.

  1. Let us start by installing the Cheerio package: npm i cheerio

  2. At this point, you may choose to overwrite your previous code or create a new .js file and declare the constants again.

const { CrawlingAPI } = require('proxycrawl');
const cheerio = require('cheerio');

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

  3. Pass your target URL to the API again with a GET request, and use an if/else statement to handle both successful and failed responses.

api
  .get('https://www.newegg.com/samsung-860-evo-series-500gb/p/N82E16820147670?Item=9SIA12K6U07909')
  .then((response) => {
    if (response.statusCode === 200 && response.pcStatus === 200) {
      // Hand the HTML over to the parsing function defined in the next step
      parseHtml(response.body);
    } else {
      console.log('Failed: ', response.statusCode, response.originalStatus, response.pcStatus);
    }
  })
  // Call console.error with the error so that a failed request is actually logged
  .catch((error) => console.error(error));

  4. Lastly, create a function that uses Cheerio to parse the HTML and find the specific CSS selectors for the product name and price.

function parseHtml(html) {
  const $ = cheerio.load(html);
  // Find product name
  const Product = $('.product-wrap');
  const Title = Product.find('h1').text();
  console.log('Product name:', Title);
  // Find current price
  const pPrice = $('.product-price');
  const currentPrice = pPrice.find('.product-price > ul:nth-child(1) > li:nth-child(3)').text();
  console.log('Discounted Price:', currentPrice);
}
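
The strings returned by .text() may include currency symbols, line breaks, or extra spaces. As a rough sketch, the two helpers below tidy up such values before you store or compare them. The names cleanText and parsePrice are hypothetical, the sample strings are made up, and the price helper assumes US-style formatting (comma as thousands separator, dot as decimal point).

```javascript
// Collapse runs of whitespace into single spaces and trim the ends.
function cleanText(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Convert a price string such as "$1,059.99" into a number.
// Assumes US-style formatting; returns null when no number is found.
function parsePrice(text) {
  const cleaned = text.replace(/[^0-9.,]/g, '').replace(/,/g, '');
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

console.log(cleanText('  Samsung \n 860 EVO  ')); // Samsung 860 EVO
console.log(parsePrice(' $1,059.99 '));           // 1059.99
```

Inside parseHtml, these could be applied to Title and currentPrice before logging them.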

The complete scraper should now be like the following:

const { CrawlingAPI } = require('proxycrawl');
const cheerio = require('cheerio');

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

function parseHtml(html) {
  const $ = cheerio.load(html);
  // Find product name
  const Product = $('.product-wrap');
  const Title = Product.find('h1').text();
  console.log('Product name:', Title);
  // Find current price
  const pPrice = $('.product-price');
  const currentPrice = pPrice.find('.product-price > ul:nth-child(1) > li:nth-child(3)').text();
  console.log('Discounted Price:', currentPrice);
}

api
  .get('https://www.newegg.com/samsung-860-evo-series-500gb/p/N82E16820147670?Item=9SIA12K6U07909')
  .then((response) => {
    if (response.statusCode === 200 && response.pcStatus === 200) {
      parseHtml(response.body);
    } else {
      console.log('Failed: ', response.statusCode, response.originalStatus, response.pcStatus);
    }
  })
  // Call console.error with the error so that a failed request is actually logged
  .catch((error) => console.error(error));

Execute the code to print the product name and current price to the console.
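
If you prefer async/await over promise chains, the request step can be written along the following lines. This is only a sketch: fetchHtml and stubClient are hypothetical names, and the stub stands in for the real api object so that the example runs without a token or network access. The statusCode, pcStatus, and body fields are the same ones used in the scraper above.

```javascript
// `client` is any object with a get(url) method that returns a promise,
// such as the `api` instance created earlier.
async function fetchHtml(client, url) {
  const response = await client.get(url);
  if (response.statusCode !== 200 || response.pcStatus !== 200) {
    throw new Error(
      `Request failed: statusCode=${response.statusCode}, pcStatus=${response.pcStatus}`
    );
  }
  return response.body;
}

// Stand-in client that always succeeds, used here instead of the real API.
const stubClient = {
  get: async (url) => ({ statusCode: 200, pcStatus: 200, body: `<h1>${url}</h1>` }),
};

fetchHtml(stubClient, 'https://example.com')
  .then((html) => console.log(html))
  .catch((error) => console.error(error.message));
```

Passing the client in as an argument keeps the function easy to try out with a stub, and swapping in the real api object should only require changing the first argument.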

Conclusion

Hopefully, this article has made it clear that Node.js is well suited to web scraping. The simple scraper that we built shows how little code is needed to make an HTTP request and process the response, and the speed of the V8 engine will save you precious time when scraping for content. Node.js itself is lightweight and runs comfortably on most modern machines. It is also suitable for projects of any size, from a single scraping task like the one here to the large projects and infrastructures used by enterprises.

Cheerio is just one of the thousands of packages and libraries available for Node.js, so you are likely to find the right tool for any project you choose. You can use the example here as a starting point for your own scraper and adapt the selectors to whatever content you need. The Node.js ecosystem gives you a great deal of freedom. Perhaps the only limits right now are your creativity and willingness to learn.

Lastly, if you want an effective and efficient web scraper, it is best to use proxies so you can avoid blocks, CAPTCHAs, and any connection issues that you may encounter when crawling different websites. Using the crawling and scraping tools from ProxyCrawl will save you countless hours of searching for ways around blocked requests, so you can focus on your main goal. And with the help of ProxyCrawl’s Artificial Intelligence, you can be confident that each request you send to the API will return the best data outcome possible.