Python has been around for more than 20 years and is now one of the most popular programming languages. It is a general-purpose, object-oriented, interpreted language, which means most errors surface at runtime rather than in a separate compile step. Python is also open source and high level, and it is used for a wide array of tasks, including web development, artificial intelligence, big data, and scripting.

You do not need years of experience to start using Python. It is easy to pick up, which is why many software engineers recommend it as a starting point if you wish to learn to code. It was designed with readability in mind, so Python code tends to be simpler to read and often accomplishes the same task in fewer lines than other languages.

Java

public class Sample {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

Python

print("Hello World!")

Why use Python for web scraping?

So, if you are planning to crawl and scrape a website but don’t know which programming language to use, Python is a great place to start. If you are still not convinced, here are some of the key features that make Python a good fit for web scraping:

  1. More tasks with less code. We cannot stress this enough. Writing code in Python is much simpler, and if you are looking to scrape large amounts of data, you surely do not want to spend more time writing code than you have to. With Python, you can do more with less.
  2. Community support. Since Python is popular and widely considered a dependable language for scraping, you can easily get help with technical issues from thousands of community members on forums and most social media platforms.
  3. A multitude of libraries. Python has a large selection of libraries for web scraping, including Selenium, BeautifulSoup, and, of course, ProxyCrawl.

However, web scraping can be tricky at times, since some websites block your requests or even ban your IP address. A simple scraper written in Python may not be enough without proxies. So, to scrape data from the web reliably, you can use ProxyCrawl’s Crawling API, which lets you scrape most websites while avoiding blocked requests and CAPTCHAs.

Scraping websites with Python using ProxyCrawl

Now that we have given you some of the reasons to use Python and ProxyCrawl for web scraping, let us continue with a guide on how you can start building your own scraping tool.

First, here are the prerequisites for our simple scraping tool:

  1. ProxyCrawl account
  2. PyCharm or any code editor that you prefer
  3. Python 3.x
  4. ProxyCrawl Python Library

Make sure to take note of your ProxyCrawl token, which will serve as your authentication key for the Crawling API service.

Let us start by installing the library that we will use for this project. You can run the following command on your console:

pip install proxycrawl

Once everything is set, it is now time to write some code. First, import the ProxyCrawl API:

from proxycrawl import CrawlingAPI

Then initialize the API and enter your authentication token:

api = CrawlingAPI({'token': 'USER_TOKEN'})
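
If you would rather not hardcode the token in your script, one option is to read it from an environment variable. The snippet below is only a sketch: the helper load_token and the variable name PROXYCRAWL_TOKEN are our own choices for illustration and are not part of the ProxyCrawl library.

```python
import os


def load_token(env=os.environ, name="PROXYCRAWL_TOKEN"):
    """Return the API token stored in an environment variable.

    PROXYCRAWL_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first")
    return token


# Usage, assuming the proxycrawl package is installed:
# api = CrawlingAPI({'token': load_token()})
```

This keeps the token out of your source files, which is handy if you plan to share the script or put it under version control.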

After that, get your target URL or any website that you would like to scrape. For this guide, we will use Amazon as an example.

targetURL = 'https://www.amazon.com/AMD-Ryzen-3800XT-16-Threads-Processor/dp/B089WCXZJC'

The next part of our code downloads the full HTML source of the URL and, if the request succeeds, prints the result to your console or terminal:

response = api.get(targetURL)
if response['status_code'] == 200:
    print(response['body'])

As you can see, every request to ProxyCrawl comes with a response. Our code only shows the crawled HTML if the status is 200, which means success. Any other status, say 503 or 404, means that the crawl failed. However, the API uses thousands of proxies worldwide, which should give you the best results possible.
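
Since a failed request can simply be tried again, here is a minimal sketch of a retry helper. It is not part of the ProxyCrawl library: get_with_retry is a hypothetical function of our own, and it only assumes that the callable you pass in returns a dictionary with 'status_code' and 'body' keys, as api.get does in the snippet above.

```python
import time


def get_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until it reports status 200 or attempts run out.

    fetch is any callable returning a dict with 'status_code' and 'body',
    for example api.get from the snippet above.
    """
    response = None
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        if response['status_code'] == 200:
            return response
        if attempt < attempts:
            time.sleep(delay * attempt)  # wait a little longer each time
    return response  # last failed response, so the caller can inspect it
```

You would call it as get_with_retry(api.get, targetURL) and check the status code of the returned response as before.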

We have now successfully built a crawler. But what we want is a scraper, right? So, we will use the most convenient way to scrape a website, one that returns parsed data in JSON format.

One great feature of the Crawling API is that we can use the built-in data scrapers for supported sites, and luckily, Amazon is one of them.

To use the data scraper, simply pass the autoparse option as a parameter in our GET request. Our full code should now look like this:

from proxycrawl import CrawlingAPI

api = CrawlingAPI({'token': 'USER_TOKEN'})

targetURL = 'https://www.amazon.com/AMD-Ryzen-3800XT-16-Threads-Processor/dp/B089WCXZJC'

# 'autoparse' asks the API to return parsed data instead of raw HTML
response = api.get(targetURL, {'autoparse': 'true'})
if response['status_code'] == 200:
    print(response['body'])
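
Because the parsed data comes back as JSON text, you will probably want to turn it into a Python object before working with it. The sketch below uses the standard json module. The helper parse_autoparse_body is our own, the sample body is a stand-in, and we accept both bytes and str since we are assuming the body type may differ between library versions.

```python
import json


def parse_autoparse_body(body):
    """Decode an autoparse response body into Python data.

    Accepts bytes or str, since the body type can vary by library version.
    """
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    return json.loads(body)


# Stand-in body for illustration only; the real field names come from the API.
sample_body = b'{"example_field": "example value"}'
data = parse_autoparse_body(sample_body)
print(sorted(data.keys()))
```

In the real script you would pass response['body'] instead of sample_body and then read whichever fields the API returns for your target site.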

If everything goes well, the response body will contain the parsed product data in JSON format.

Scraping with BeautifulSoup and ProxyCrawl

Now, what if you want to get more specific data, say, just the product name and price? As mentioned earlier, Python has a large collection of libraries, including some that are specifically meant for scraping. BeautifulSoup is one of them: a popular Python package for parsing HTML and XML data. It is also simple to use, which makes it a good choice for beginners.

So, let us go ahead and build a simple scraper using the Crawling API and BeautifulSoup this time. Since we are using Python 3.x, let us install the latest BeautifulSoup package, commonly referred to as BS4:

pip install beautifulsoup4

Since we have already installed the ProxyCrawl library earlier, you can just create a new Python file and import both BS4 and ProxyCrawl.

from bs4 import BeautifulSoup
from proxycrawl import CrawlingAPI

Then, just as before, initialize the API and use a GET request to crawl your target URL:

api = CrawlingAPI({ 'token': 'USER_TOKEN' })

targetURL = 'https://www.amazon.com/AMD-Ryzen-3800XT-16-Threads-Processor/dp/B089WCXZJC'

response = api.get(targetURL)

Next, we need to pass the HTML source code to BeautifulSoup so we get an object we can query for specific data. We will use the lxml parser here, which is a separate package (pip install lxml) if you do not have it yet.

if response['status_code'] == 200:
    b_soup = BeautifulSoup(response['body'], 'lxml')

In this example, we will try to get the product name and price from an Amazon product page. The easiest way to do this is to use the find method and pass in arguments that select the specific text we need. To learn more about how to select a specific HTML element, you may check the BeautifulSoup documentation.

product_name = b_soup.find('span', class_='a-size-large product-title-word-break').text
product_price = b_soup.find('span', class_='a-size-medium a-color-price priceBlockBuyingPriceString').text
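
One thing to keep in mind is that find returns None when nothing matches, for example if the page layout changes, and reading .text from None raises an AttributeError. A small helper like the one below guards against that. It is only a sketch: text_or_default is a hypothetical name of our own, and it expects the kind of object BeautifulSoup gives you, with a find method whose results offer get_text.

```python
def text_or_default(soup, tag, css_class, default=None):
    """Return the stripped text of the first matching element.

    Falls back to default when no element matches, instead of raising
    AttributeError on a None result.
    """
    element = soup.find(tag, class_=css_class)
    if element is None:
        return default
    return element.get_text(strip=True)
```

With the b_soup object from above, you could write product_name = text_or_default(b_soup, 'span', 'a-size-large product-title-word-break') and check for None before printing.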

After that, we simply need to write a command to print the output.

print('name:', product_name)
print('price:', product_price)

The full code should now look like this:

from bs4 import BeautifulSoup
from proxycrawl import CrawlingAPI

api = CrawlingAPI({'token': 'USER_TOKEN'})

targetURL = 'https://www.amazon.com/AMD-Ryzen-3800XT-16-Threads-Processor/dp/B089WCXZJC'

response = api.get(targetURL)

if response['status_code'] == 200:
    # Parse the crawled HTML with the lxml parser
    b_soup = BeautifulSoup(response['body'], 'lxml')

    # Select the product title and price by their CSS classes
    product_name = b_soup.find('span', class_='a-size-large product-title-word-break').text
    product_price = b_soup.find('span', class_='a-size-medium a-color-price priceBlockBuyingPriceString').text
    print('name:', product_name)
    print('price:', product_price)

When it runs successfully, the script prints the product name and price, prefixed with name: and price:.

Conclusion

As simple as that. With about a dozen lines of code, our scraping tool is complete and ready to use. Of course, you can apply what you have learned here however you want to collect all sorts of data, already parsed. With the help of the Crawling API, you won’t need to worry as much about website blocks or CAPTCHAs, so you can focus on what is important for your project or business.

Remember that this is just a very basic scraping tool. Python can be used in many ways and on a much larger scale. Go ahead and experiment with different applications and modules. Perhaps you want to search for and download Google images, monitor product prices on shopping sites for daily changes, or even provide data extraction services to your clients.

The possibilities are endless, and using ProxyCrawl’s crawling and scraping API will help keep your scraper effective and reliable.