In this web scraping Java tutorial, we will dive into deep crawling: an advanced form of web scraping. This comprehensive guide will use deep crawling with Java Spring Boot to scrape the web.

Through deep crawling, even the most secluded sections of a website become accessible, revealing data that might otherwise go unnoticed.

What’s even more remarkable is that we’re not just talking theory – we will show you how to do it. Using Java Spring Boot and the Crawlbase Java library, we’ll teach you how to make deep crawling a reality. We’ll help you set up your tools, explain the difference between shallow and deep crawling (it’s not as complicated as it sounds!), and show you how to extract information from different website pages and store them on your side.

To understand the coding part of web scraping Java, you must have a basic understanding of Java Spring Boot and MySQL database. Let’s get started on how to build a web scraper in Java.

Table of Contents:

  1. Understanding Deep Crawling: The Gateway to Web Data
  2. Why do you need to build a Java Web Scraper
  3. How to do Web Scraping in Java
  4. Setting the Stage: Preparing Your Environment
  5. Simplify Spring Boot Project Setup with Spring Initializr
  6. Importing the Starter Project into Spring Tool Suite
  7. Understanding Your Project’s Blueprint: A Peek into Project Structure
  8. Starting the Coding Journey
  9. Running the Project and Initiating Deep Crawling
  10. Analyzing Output in the Database
  11. Conclusion
  12. Frequently Asked Questions

Understanding Deep Crawling: The Gateway to Web Data

Deep crawling, an advanced form of web scraping, is like digging deep into the internet to find lots of valuable information. In this part, we’ll talk about what deep crawling is, how it’s different from just skimming the surface of websites, and why it’s important for getting data.

Basically, deep crawling is a smart way of looking through websites and grabbing specific information from different parts of those sites. Unlike shallow crawling, which only looks at the surface stuff, deep crawling digs into the layers of websites to find hidden gems of data. This lets us gather all sorts of info, like prices of products, reviews from users, financial stats, and news articles with web scraping using Java.

Deep crawling helps us get hold of a bunch of structured and unstructured data that we wouldn’t see otherwise. By carefully exploring the internet, we can gather data that can help with business decisions, support research, and spark new ideas with Java web scraping.

Differentiating Between Shallow and Deep Crawling

Shallow crawling is like quickly glancing at the surface of a pond, just seeing what’s visible. It usually only looks at a small part of a website, like the main page or a few important ones. But it misses out on lots of hidden stuff.

On the other hand, deep crawling is like diving deep into the ocean, exploring every nook and cranny. It checks out the whole website, clicking through links and finding hidden gems tucked away in different sections. Deep crawling is super useful for businesses, researchers, and developers because it digs up a ton of valuable data that’s otherwise hard to find.

Shallow vs Deep Crawl

Exploring the Scope and Significance of Deep Crawling

The scope of deep crawling extends far beyond data extraction; it’s a gateway to understanding the web’s dynamics and uncovering insights that drive decision-making. From e-commerce platforms that want to monitor product prices across competitors’ sites to news organizations aiming to analyze sentiment across articles, the applications of deep crawling are as diverse as the data it reveals.

In research, deep crawling serves as the foundation for analyzing data to understand new trends, how people use the internet, and what content they like. It’s also important for complying with laws and rules, because companies need to think about the right way to gather data and respect the terms of the websites they’re getting it from.

In this tutorial, we will dig deep into web scraping with Java.

Why do you need to build a Java Web Scraper

You need a Java web scraper to gather and utilize website information. One such web scraper is Crawlbase Crawler, but what exactly is Crawlbase Crawler, and how does it work its magic?

What Is Crawlbase Crawler?

Crawlbase Crawler is a dynamic web data extraction tool that offers a modern and intelligent approach to collecting valuable information from websites. Unlike traditional scraping methods that involve constant polling, Crawlbase Crawler operates asynchronously. This means it can independently process requests to extract data, delivering it in real-time without the need for manual monitoring.

The Workflow: How Crawlbase Crawler Operates

Crawlbase Crawler operates on a seamless and efficient workflow that can be summarized in a few key steps (a minimal code sketch of the push step follows the list):

  1. URLs Submission: As a user, you initiate the process by submitting URLs to the Crawlbase Crawler using the Crawling API.
  2. Request Processing: The Crawler receives these requests and processes them asynchronously. This means it can handle multiple requests simultaneously without any manual intervention.
  3. Data Extraction: The Crawler visits the specified URLs, extracts the requested data, and packages it for delivery.
  4. Webhook Integration: Crawlbase Crawler integrates with webhook instead of requiring manual polling. This webhook serves as a messenger that delivers the extracted data directly to your server’s endpoint in real time.
  5. Real-Time Delivery: The extracted data is delivered to your server’s webhook endpoint as soon as it’s available, enabling immediate access without delays.
  6. Fresh Insights: By receiving data in real-time, you gain a competitive edge in making informed decisions based on the latest web content.
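
To make the workflow concrete, here is a minimal sketch of the URL submission step. It uses the same Crawlbase Java library calls that appear later in this tutorial's MainService; the token and crawler name are placeholders you would take from your Crawlbase dashboard.

import java.util.HashMap;

import com.crawlbase.*;

public class PushUrlSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder token and crawler name from your Crawlbase dashboard
        API api = new API("<Your_Crawlbase_Normal_Token>");

        HashMap<String, Object> options = new HashMap<>();
        options.put("crawler", "<Your_TCP_Crawler_Name>"); // which crawler should handle the URL
        options.put("callback", "true");                   // deliver the crawled output to your webhook

        // Submit a URL; the immediate response only contains the request ID (RID)
        api.get("http://www.3bfluidpower.com/", options);
        System.out.println(api.getBody()); // e.g. { "rid": "1e92e8bff32c31c2728714d4" }
    }
}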

The Benefits: Why Choose Crawlbase Crawler

While a crawler allows instant web scraping with Java, it also has some other benefits:

  1. Efficiency: Asynchronous processing eliminates the need for continuous monitoring, freeing up your resources for other tasks.
  2. Real-Time Insights: Receive data as soon as it’s available, allowing you to stay ahead of trends and changes.
  3. Streamlined Workflow: Webhook integration replaces manual polling, simplifying the data delivery process.
  4. Timely Decision-Making: Instant access to freshly extracted data empowers timely and data-driven decision-making.

To access the Crawlbase Crawler, you must create it within your Crawlbase account dashboard. You can opt for the TCP or JavaScript Crawler based on your specific needs. The TCP Crawler is ideal for static pages, while the JavaScript Crawler suits content generated via JavaScript, as in JavaScript-built pages or dynamically rendered browser content. Read here to know more about Crawlbase Crawler.

During creation, you will be asked to provide your webhook address, so we will create the crawler after we have successfully set up a webhook in our Spring Boot project. In the upcoming sections, we’ll dive deeper into the code and develop the required components to complete our project.

How to do Web Scraping in Java

Follow the steps below to learn web scraping in Java.

Setting the Stage: Preparing Your Environment

Before we embark on our journey into deep crawling, it’s important to set the stage for success. This section guides you through the essential steps to ensure your development environment is ready to tackle the exciting challenges ahead.

Installing Java on Ubuntu and Windows

Java is the backbone of our development process, and we have to make sure that it’s available on our system. If you don’t have Java installed on your system, you can follow the steps below as per your operating system.

Installing Java on Ubuntu:

  1. Open the Terminal by pressing Ctrl + Alt + T.
  2. Run the following command to update the package list:
sudo apt update
  3. Install the Java Development Kit (JDK) by running:
sudo apt install default-jdk
  4. Verify the JDK installation by typing:
java -version

Installing Java on Windows:

  1. Visit the official Oracle website and download the latest Java Development Kit (JDK).
  2. Follow the installation wizard’s prompts to complete the installation. Once installed, you can verify it by opening the Command Prompt and typing:
java -version

Installing Spring Tool Suite (STS) on Ubuntu and Windows:

Spring Tool Suite (STS) is an integrated development environment (IDE) specifically designed for developing applications using the Spring Framework, a popular Java framework for building enterprise-level applications. STS provides tools, features, and plugins that enhance the development experience when working with Spring-based projects; follow the steps below to install them.

  1. Visit the official Spring Tool Suite website at spring.io/tools.
  2. Download the appropriate version of Spring Tool Suite for your operating system (Ubuntu or Windows).

On Ubuntu:

  1. After downloading, navigate to the directory where the downloaded file is located in the Terminal.
  2. Extract the downloaded archive:
# Replace <version> and <platform> as per the archive name
tar -xvf spring-tool-suite-<version>-<platform>.tar.gz
  3. Move the extracted directory to a location of your choice:
# Replace <bundle> as per extracted folder name
mv sts-<bundle> /your_desired_path/

On Windows:

  1. Run the downloaded installer and follow the on-screen instructions to complete the installation.

Installing MySQL on Ubuntu and Windows

Setting up a reliable database management system is paramount to kick-start your journey into deep crawling and web data extraction. MySQL, a popular open-source relational database, provides the foundation for securely storing and managing the data you’ll gather through your crawling efforts. Here’s a step-by-step guide on how to install MySQL on both Ubuntu and Windows platforms:

Installing MySQL on Ubuntu:

  1. Open a terminal and run the following commands to ensure your system is up-to-date:
sudo apt update
sudo apt upgrade
  2. Run the following command to install the MySQL server package:
sudo apt install mysql-server
  3. After installation, start the MySQL service:
sudo systemctl start mysql.service
  4. Check if MySQL is running with the command:
sudo systemctl status mysql

Installing MySQL on Windows:

  1. Visit the official MySQL website and download the MySQL Installer for Windows.
  2. Run the downloaded installer and choose the “Developer Default” setup type. This will install MySQL Server and other related tools.
  3. During installation, you’ll be asked to configure MySQL Server. Set a strong root password and remember it.
  4. Follow the installer’s prompts to complete the installation.
  5. After installation, MySQL should start automatically. You can also start it manually from the Windows “Services” application.

Verifying MySQL Installation:

Regardless of your platform, you can verify the MySQL installation by opening a terminal or command prompt and entering the following command:

mysql -u root -p

You’ll be prompted to enter the MySQL root password you set during installation. If the connection is successful, you’ll be greeted with the MySQL command-line interface.

Now that you have Java, STS, and MySQL ready, you’re all set for the next phase of your deep crawling adventure. In the upcoming step, we’ll guide you through creating a Spring Boot starter project, setting the stage for your deep crawling endeavors. Let’s dive into this exciting phase of the journey!

Simplify Spring Boot Project Setup with Spring Initializr

Setting up a Spring Boot project can feel like navigating a tricky maze of settings. But don’t worry, Spring Initializr is here to help! It’s like having a smart helper online that makes the process way easier. You could do it manually, but that’s like a puzzle that takes a lot of time. Spring Initializr comes to the rescue by making things smoother right from the start. Follow these steps to create a Spring Boot project with Spring Initializr.

  1. Go to the Spring Initializr Website

Open your web browser and go to the Spring Initializr website. You can find it at start.spring.io.

  2. Choose Your Project Details

Here’s where you make important choices for your project. You have to choose the project type and the language you are going to use: select Maven as the project type and Java as the language. For the Spring Boot version, go for a stable one (like 3.1.2). Then, add details about your project, like its name and what it’s about. It’s easy – just follow the example in the picture.

  3. Add the Cool Stuff

Time to add special features to your project! It’s like giving it superpowers. Include Spring Web (that’s important for Spring Boot projects), Spring Data JPA, and the MySQL Driver if you’re going to use a database. Don’t forget Lombok – it’s like a magic tool that saves time. We’ll talk more about these in the next parts of the blog.

  4. Get Your Project

After picking all the good stuff, click “GENERATE.” Your Starter project will download as a zip file. Once it’s done, open the zip file to see the beginning of your project.

Spring Initializr Settings

By following these steps, you’re ensuring your deep crawling adventure starts smoothly. Spring Initializr is like a trusty guide that helps you set up. In the upcoming section, we’ll guide you through importing your project into the Spring Tool Suite you’ve installed. Get ready to kick-start this exciting phase of your deep crawling journey!

Importing the Starter Project into Spring Tool Suite

Alright, now that you’ve got your Spring Boot starter project all set up and ready to roll, the next step is to import it into Spring Tool Suite (STS). It’s like inviting your project into a cozy workspace where you can work your magic. Here’s how you do it:

  1. Open Spring Tool Suite (STS)

First things first, fire up your Spring Tool Suite. It’s your creative hub where all the coding and crafting will happen.

  2. Import the Project

Navigate to the “File” menu and choose “Import.” A window will pop up with various options – select “Existing Maven Projects” and click “Next.”

  3. Choose Project Directory

Click the “Browse” button and locate the directory where you unzipped your Starter project. Select the project’s root directory and hit “Finish.”

  4. Watch the Magic

Spring Tool Suite will work its magic and import your project. It appears in the “Project Explorer” on the left side of your workspace.

  5. Ready to Roll

That’s it! Your Starter project is now comfortably settled in Spring Tool Suite. You’re all set to start building, coding, and exploring.

Import in STS

Bringing your project into Spring Tool Suite is like opening the door to endless possibilities. Now you have the tools and space to make your project amazing. The following section will delve into the project’s structure, peeling back the layers to reveal its components and inner workings. Get ready to embark on a journey of discovery as we unravel what lies within!

Understanding Your Project’s Blueprint: A Peek into Project Structure

Now that your Spring Boot starter project is comfortably nestled within Spring Tool Suite (STS), let’s take a tour of its inner workings. It’s like getting to know the layout of your new home before you start decorating it.

Maven and pom.xml

At the core of your project lies a powerful tool called Maven. Think of Maven as your project’s organizer – it manages libraries, dependencies, and builds. The file named pom.xml is where all the project-related magic happens. It’s like the blueprint that tells Maven what to do and what your project needs. In our case, the pom.xml currently contains the following:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.1.2</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.example</groupId>
<artifactId>crawlbase</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>Crawlbase Crawler With Spring Boot</name>
<description>Demo of using Crawlbase Crawler with Spring Boot and how to do Deep Crawling</description>
<properties>
<java.version>17</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>com.mysql</groupId>
<artifactId>mysql-connector-j</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<excludes>
<exclude>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
</exclude>
</excludes>
</configuration>
</plugin>
</plugins>
</build>
</project>

Java Libraries

Remember those special features you added when creating the project? They’re called dependencies, like magical tools that make your project more powerful. You were actually adding these libraries when you included Spring Web, Spring Data JPA, MySQL Driver, and Lombok from the Spring Initializr. You can see those in the pom.xml above. They bring pre-built functionality to your project, saving you time and effort.

  • Spring Web: This library is your ticket to building Spring Boot web applications. It helps with things like handling requests and creating web controllers.
  • Spring Data JPA: This library is your ally if you’re dealing with databases. It simplifies database interactions and management, letting you focus on your project’s logic.
  • MySQL Driver: When you’re using MySQL as your database, this driver helps your project communicate with the database effectively.
  • Lombok: Say goodbye to repetitive code! Lombok reduces the boilerplate code you usually have to write, making your project cleaner and more concise.

Understand the Project Structure

As you explore your project’s folders, you’ll notice how everything is neatly organized. Your Java code goes into the src/main/java directory, while resources like configuration files and static assets reside in the src/main/resources directory. You’ll also find the application.properties file here – it’s like the control center of your project, where you can configure settings.

Project Structure

In the src/main/java directory, we will find a package containing a Java class with a main function. This file acts as the entry point when the Spring Boot project runs. In our case, it is the CrawlbaseApplication.java file with the following code.

package com.example.crawlbase;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
// Required for the @EnableAsync annotation below
import org.springframework.scheduling.annotation.EnableAsync;

@SpringBootApplication
// Add this to enable asynchronous processing in the project
@EnableAsync
public class CrawlbaseApplication {

public static void main(String[] args) {
SpringApplication.run(CrawlbaseApplication.class, args);
}

}

Now that you’re familiar with the essentials, you can confidently navigate your project’s landscape. Next, we’ll start coding and put the Crawlbase Crawler to work in our project. So, get ready to uncover the true power of the Crawler.

Starting the Coding Journey to Java Scraping

Now that the Java framework, libraries, and scraper tooling are set up, it’s time to dive into the coding. This section outlines the essential steps to create controllers, services, and repositories, and to update the properties file. Before getting into the nitty-gritty of coding, we need to lay the groundwork and introduce key dependencies that will empower our project.

Since we’re using the Crawlbase Crawler, it’s important to ensure that we can easily use it in our Java project. Luckily, Crawlbase provides a Java library that makes this integration process simpler. To add it to our project, we just need to include the appropriate Maven dependency in the project’s pom.xml file.

<dependency>
<groupId>com.crawlbase</groupId>
<artifactId>crawlbase-java-sdk-pom</artifactId>
<version>1.0</version>
</dependency>

After adding this dependency, a quick Maven Install will ensure that the Crawlbase Java library is downloaded from the Maven repository and ready for action.

Integrating JSoup Dependency

Given that we’ll be diving deep into HTML content, having a powerful HTML parser at our disposal is crucial. Enter JSoup, a robust and versatile HTML parser for Java. It offers convenient methods for navigating and manipulating HTML structures. To leverage its capabilities, we need to include the JSoup library in our project through another Maven dependency:

<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.16.1</version>
</dependency>
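
JSoup's API is compact: you parse a string (or fetch a URL) into a Document and then query it with selectors or tag lookups. As a small, self-contained taste of what we will do later when extracting links during deep crawling, the sketch below parses an HTML snippet and prints its hyperlinks:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"/products\">Products</a>"
                + "<a href=\"https://example.com/about\">About</a>"
                + "</body></html>";

        // Parse the HTML string into a Document we can query
        Document document = Jsoup.parse(html);

        // Grab every <a> tag and print its href attribute and link text
        for (Element link : document.getElementsByTag("a")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}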

Setting Up the Database

Before we proceed further, let’s lay the foundation for our project by creating a database. Follow these steps to create a MySQL database:

  1. Open the MySQL Console: If you’re using Ubuntu, launch a terminal window. On Windows, open the MySQL Command Line Client or MySQL Shell.
  2. Log In to MySQL: Enter the following command and input your MySQL root password when prompted:
mysql -u root -p
  3. Create a New Database: Once logged in, create a new database with the desired name:
# Replace database_name with your chosen name
CREATE DATABASE database_name;

Planning the Models

Before diving headfirst into model planning, let’s understand what the crawler returns when URLs are pushed to it and what response we receive at our webhook. When we send URLs to the crawler, it responds with a Request ID, like this:

{ "rid": "1e92e8bff32c31c2728714d4" }

Once the crawler has effectively crawled the HTML content, it forwards the output to our webhook. The response will look like this:

Headers:
"Content-Type" => "text/plain"
"Content-Encoding" => "gzip"
"Original-Status" => 200
"PC-Status" => 200
"rid" => "The RID you received in the push call"
"url" => "The URL which was crawled"

Body:
The HTML of the page

// Body will be gzip encoded

So, taking this into account, we can consider the following database structure.

Database Schema

We don’t need to create the database tables manually, as our Spring Boot project will initialize them automatically when we run it. Hibernate handles this for us through the spring.jpa.hibernate.ddl-auto=update property, which we will set later in application.properties.

Designing the Model Files

With the groundwork laid in the previous section, let’s delve into the creation of our model files. In the com.example.crawlbase.models package, we’ll craft two essential models: CrawlerRequest.java and CrawlerResponse.java. These models encapsulate the structure of our database tables, and to ensure efficiency, we’ll employ Lombok to reduce boilerplate code.

CrawlerRequest Model:

package com.example.crawlbase.models;

import jakarta.persistence.CascadeType;
import jakarta.persistence.Entity;
import jakarta.persistence.FetchType;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.OneToOne;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
public class CrawlerRequest {

@Id
@GeneratedValue
private Long id;

private String url;
private String type;
private Integer status;
private String rid;

@OneToOne(mappedBy = "crawlerRequest", cascade = CascadeType.ALL, fetch = FetchType.LAZY)
private CrawlerResponse crawlerResponse;

}

CrawlerResponse Model:

package com.example.crawlbase.models;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.OneToOne;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
public class CrawlerResponse {

@Id
@GeneratedValue
private Long id;

private Integer pcStatus;
private Integer originalStatus;

@Column(columnDefinition = "LONGTEXT")
private String pageHtml;

@OneToOne
@JoinColumn(name = "request_id")
private CrawlerRequest crawlerRequest;

}

Establishing Repositories for Both Models

Following the creation of our models, the next step is to establish repositories for seamless interaction between our project and the database. These repository interfaces serve as essential connectors, leveraging the JpaRepository interface to provide fundamental functions for data access. Hibernate, a powerful ORM tool, handles the underlying mapping between Java objects and database tables.

Create a package com.example.crawlbase.repositories and within it, create two repository interfaces, CrawlerRequestRepository.java and CrawlerResponseRepository.java.

CrawlerRequestRepository Interface:

package com.example.crawlbase.repositories;

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

import com.example.crawlbase.models.CrawlerRequest;

public interface CrawlerRequestRepository extends JpaRepository<CrawlerRequest, Long> {

// Find by column Name and value
List<CrawlerRequest> findByRid(String value);
}

CrawlerResponseRepository Interface:

package com.example.crawlbase.repositories;

import org.springframework.data.jpa.repository.JpaRepository;
import com.example.crawlbase.models.CrawlerResponse;

public interface CrawlerResponseRepository extends JpaRepository<CrawlerResponse, Long> {

}

Planning APIs and Request Body Mapper Classes

Harnessing the Crawlbase Crawler involves designing two crucial APIs: one for pushing URLs to the crawler and another serving as a webhook. To begin, let’s plan the request body structures for these APIs.

Push URL request body:

{
"urls": [
"http://www.3bfluidpower.com/",
.....
]
}

As for the webhook API’s request body, it must align with the Crawler’s response structure, as discussed earlier. You can read more about it here.

In line with this planning, we’ll create two request mapping classes in the com.example.crawlbase.requests package:

CrawlerWebhookRequest Class:

package com.example.crawlbase.requests;

import lombok.Builder;
import lombok.Data;

@Data
@Builder
public class CrawlerWebhookRequest {

private Integer pc_status;
private Integer original_status;
private String rid;
private String url;
private String body;

}

ScrapeUrlRequest Class:

package com.example.crawlbase.requests;

import java.util.List;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class ScrapeUrlRequest {

// List of URLs to push to the Crawler, matching the { "urls": [ ... ] } request body
private List<String> urls;

}

Creating a ThreadPool to Optimize the Webhook

If we don’t optimize our webhook to handle a large number of requests, it can run into hidden problems. This is where multi-threading helps. In Java, Spring’s ThreadPoolTaskExecutor manages a pool of worker threads for executing asynchronous tasks concurrently. This is particularly useful when you have tasks that can be executed independently and in parallel.

Create a new package com.example.crawlbase.config and create ThreadPoolTaskExecutorConfig.java file in it.

ThreadPoolTaskExecutorConfig Class:

package com.example.crawlbase.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ThreadPoolTaskExecutorConfig {

@Bean(name = "taskExecutor")
public ThreadPoolTaskExecutor taskExecutor() {
int cores = Runtime.getRuntime().availableProcessors();
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(cores);
executor.setMaxPoolSize(cores);
executor.setQueueCapacity(Integer.MAX_VALUE);
executor.setThreadNamePrefix("Async-");
executor.initialize();
return executor;
}
}

Creating the Controllers and their Services

Since we need two APIs and their business logic is quite different, we will implement them in separate controllers. Separate controllers mean we will have separate services. Let’s first create MainController.java and its service, MainService.java. We will implement the API for pushing URLs to the Crawler in this controller.

Create a new package com.example.crawlbase.controllers for controllers and com.example.crawlbase.services for services in the project.

MainController Class:

package com.example.crawlbase.controllers;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.example.crawlbase.requests.ScrapeUrlRequest;
import com.example.crawlbase.services.MainService;

import lombok.extern.slf4j.Slf4j;

@RestController
@RequestMapping("/scrape")
@Slf4j
public class MainController {

@Autowired
private MainService mainService;

@PostMapping("/push-urls")
public ResponseEntity<Void> pushUrlsToCrawler(@RequestBody ScrapeUrlRequest request) {
try {
if(!request.getUrls().isEmpty()) {
// Asynchronously Process The Request
mainService.pushUrlsToCrawler(request.getUrls(), "parent");
}
return ResponseEntity.status(HttpStatus.OK).build();
} catch (Exception e) {
log.error("Error in pushUrlsToCrawler function: " + e.getMessage());
return ResponseEntity.status(HttpStatus.BAD_REQUEST).build();
}
}

}

As you can see above, we have created a RESTful API, “@POST /scrape/push-urls”, which is responsible for handling requests to push URLs to the Crawler.

MainService Class:

package com.example.crawlbase.services;

import java.util.*;
import com.crawlbase.*;
import com.example.crawlbase.models.CrawlerRequest;
import com.example.crawlbase.repositories.CrawlerRequestRepository;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import lombok.extern.slf4j.Slf4j;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class MainService {

@Autowired
private CrawlerRequestRepository crawlerRequestRepository;

// Inject the values from the properties file
@Value("${crawlbase.token}")
private String crawlbaseToken;
@Value("${crawlbase.crawler}")
private String crawlbaseCrawlerName;

private final ObjectMapper objectMapper = new ObjectMapper();

@Async
public void pushUrlsToCrawler(List<String> urls, String type) {
HashMap<String, Object> options = new HashMap<String, Object>();
options.put("callback", "true");
options.put("crawler", crawlbaseCrawlerName);
options.put("callback_headers", "type:" + type);

API api = null;
CrawlerRequest req = null;
JsonNode jsonNode = null;
String rid = null;

for(String url: urls) {
try {
api = new API(crawlbaseToken);
api.get(url, options);
jsonNode = objectMapper.readTree(api.getBody());
rid = jsonNode.get("rid").asText();
if(rid != null) {
req = CrawlerRequest.builder().url(url).type(type).
status(api.getStatusCode()).rid(rid).build();
crawlerRequestRepository.save(req);
}
} catch(Exception e) {
log.error("Error in pushUrlsToCrawler function: " + e.getMessage());
}
}
}

}

In the above service, we created an @Async method to process the request asynchronously. The pushUrlsToCrawler function uses the Crawlbase library to push URLs to the Crawler and then saves the received RID and other attributes into the crawler_request table. To push URLs to the Crawler, we must use the “crawler” and “callback” parameters. We are also using “callback_headers” to send a custom header, “type”, which we will use to know whether a URL is one we submitted ourselves or one discovered during deep crawling. You can read more about these parameters and many others here.

Now we have to implement the API that we will use as a webhook. For this, create WebhookController.java in the com.example.crawlbase.controllers package and WebhookService.java in the com.example.crawlbase.services package.

WebhookController Class:

package com.example.crawlbase.controllers;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.example.crawlbase.services.WebhookService;

import lombok.extern.slf4j.Slf4j;

@RestController
@RequestMapping("/webhook")
@Slf4j
public class WebhookController {

@Autowired
private WebhookService webhookService;

@PostMapping("/crawlbase")
public ResponseEntity<Void> crawlbaseCrawlerResponse(@RequestHeader HttpHeaders headers, @RequestBody byte[] compressedBody) {
try {
if(!headers.getFirst(HttpHeaders.USER_AGENT).equalsIgnoreCase("Crawlbase Monitoring Bot 1.0") &&
"gzip".equalsIgnoreCase(headers.getFirst(HttpHeaders.CONTENT_ENCODING)) &&
headers.getFirst("pc_status").equals("200")) {
// Asynchronously Process The Request
webhookService.handleWebhookResponse(headers, compressedBody);
}
return ResponseEntity.status(HttpStatus.OK).build();
} catch (Exception e) {
log.error("Error in crawlbaseCrawlerResponse function: " + e.getMessage());
return ResponseEntity.status(HttpStatus.BAD_REQUEST).build();
}
}

}

In the above code, you can see that we have created a RESTful API, “@POST /webhook/crawlbase”, which is responsible for receiving the crawled output from the Crawler. Notice that we ignore calls whose USER_AGENT is “Crawlbase Monitoring Bot 1.0”, because the Crawler Monitoring Bot uses this user agent to check whether the callback is live and accessible; there is no need to process these requests, so we just return a successful response to the Crawler.

While working with Crawlbase Crawler, your server webhook should:

  • Be publicly reachable from Crawlbase servers
  • Be ready to receive POST calls and respond within 200ms
  • Respond within 200ms with a status code 200, 201 or 204 without content

WebhookService Class:

package com.example.crawlbase.services;

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpHeaders;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

import com.example.crawlbase.models.CrawlerRequest;
import com.example.crawlbase.models.CrawlerResponse;
import com.example.crawlbase.repositories.CrawlerRequestRepository;
import com.example.crawlbase.repositories.CrawlerResponseRepository;
import com.example.crawlbase.requests.CrawlerWebhookRequest;

import lombok.extern.slf4j.Slf4j;

@Slf4j
@Service
public class WebhookService {

@Autowired
private CrawlerRequestRepository crawlerRequestRepository;
@Autowired
private CrawlerResponseRepository crawlerResponseRepository;
@Autowired
private MainService mainService;

@Async("taskExecutor")
public void handleWebhookResponse(HttpHeaders headers, byte[] compressedBody) {
try {
// Unzip the gziped body
GZIPInputStream gzipInputStream = new GZIPInputStream(new ByteArrayInputStream(compressedBody));
InputStreamReader reader = new InputStreamReader(gzipInputStream);

// Process the uncompressed HTML content
StringBuilder htmlContent = new StringBuilder();
char[] buffer = new char[1024];
int bytesRead;
while ((bytesRead = reader.read(buffer)) != -1) {
htmlContent.append(buffer, 0, bytesRead);
}

// The HTML String
String htmlString = htmlContent.toString();

// Create the request object
CrawlerWebhookRequest request = CrawlerWebhookRequest.builder()
.original_status(Integer.valueOf(headers.getFirst("original_status")))
.pc_status(Integer.valueOf(headers.getFirst("pc_status")))
.rid(headers.getFirst("rid"))
.url(headers.getFirst("url"))
.body(htmlString).build();

// Save CrawlerResponse Model
List<CrawlerRequest> results = crawlerRequestRepository.findByRid(request.getRid());
CrawlerRequest crawlerRequest = !results.isEmpty() ? results.get(0) : null;
if(crawlerRequest != null) {
// Build CrawlerResponse Model
CrawlerResponse crawlerResponse = CrawlerResponse.builder().pcStatus(request.getPc_status())
.originalStatus(request.getOriginal_status()).pageHtml(request.getBody()).crawlerRequest(crawlerRequest).build();
crawlerResponseRepository.save(crawlerResponse);
}

// Only Deep Crawl Parent Url
if(headers.getFirst("type").equalsIgnoreCase("parent")) {
deepCrawlParentResponse(request.getBody(), request.getUrl());
}
} catch (Exception e) {
log.error("Error in handleWebhookResponse function: " + e.getMessage());
}

}

private void deepCrawlParentResponse(String html, String baseUrl) {
Document document = Jsoup.parse(html);
Elements hyperLinks = document.getElementsByTag("a");
List<String> links = new ArrayList<String>();

String url = null;
for (Element hyperLink : hyperLinks) {
url = processUrl(hyperLink.attr("href"), baseUrl);
if(url != null) {
links.add(url);
}
}

mainService.pushUrlsToCrawler(links, "child");
}

private String processUrl(String href, String baseUrl) {
try {
if (href != null && !href.isEmpty()) {
baseUrl = normalizeUrl(baseUrl);
String processedUrl = normalizeUrl(href.startsWith("/") ? baseUrl + href : href);
if (isValidUrl(processedUrl) &&
!processedUrl.replace("http://", "").replace("https://", "").equals(baseUrl.replace("http://", "").replace("https://", "")) &&
// Only considering the URLs with same hostname
Objects.equals(new URI(processedUrl).getHost(), new URI(baseUrl).getHost())) {

return processedUrl;
}
}
} catch (Exception e) {
log.error("Error in processUrl function: " + e.getMessage());
}
return null;
}

private boolean isValidUrl(String string) {
String urlRegex = "((http|https)://)(www.)?"
+ "[a-zA-Z0-9@:%._\\+~#?&//=]"
+ "{2,256}\\.[a-z]"
+ "{2,6}\\b([-a-zA-Z0-9@:%"
+ "._\\+~#?&//=]*)";
Pattern pattern = Pattern.compile(urlRegex);
Matcher matcher = pattern.matcher(string);
return matcher.matches();
}

private String normalizeUrl(String url) throws URISyntaxException {
url = url.replace("//www.", "//");
url = url.split("#")[0];
url = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
return url;
}
}

The WebhookService class serves a crucial role in efficiently handling webhook responses and orchestrating the process of deep crawling. When a webhook response is received, the handleWebhookResponse method is invoked asynchronously from the WebhookController’s crawlbaseCrawlerResponse function. This method starts by unzipping the compressed HTML content and extracting the necessary metadata and HTML data. The extracted data is then used to construct a CrawlerWebhookRequest object containing details like status, request ID (rid), URL, and HTML content.

Next, the service checks if there’s an existing CrawlerRequest associated with the request ID. If found, it constructs a CrawlerResponse object to encapsulate the pertinent response details. This CrawlerResponse instance is then persisted in the database through the CrawlerResponseRepository.

However, what sets this service apart is its ability to facilitate deep crawling. If the webhook response type indicates a “parent” URL, the service invokes the deepCrawlParentResponse method. In this method, the HTML content is parsed using the Jsoup library to identify hyperlinks within the page. These hyperlinks, representing child URLs, are processed and validated. Only URLs belonging to the same hostname and adhering to a specific format are retained.

The MainService is then employed to push these valid child URLs into the crawling pipeline, using the “child” type as a flag. This initiates a recursive process of deep crawling, where child URLs are further crawled, expanding the exploration to multiple levels of interconnected pages. In essence, the WebhookService coordinates the intricate dance of handling webhook responses, capturing and preserving relevant data, and orchestrating the complicated process of deep crawling by intelligently identifying and navigating through parent and child URLs.

Updating application.properties File

In the final step, we will configure the application.properties file to define essential properties and settings for our project. This file serves as a central hub for configuring various aspects of our application. Here, we need to specify database-related properties, Hibernate settings, Crawlbase integration details, and logging preferences.

Ensure that your application.properties file includes the following properties:

# Database Configuration
spring.datasource.url=jdbc:mysql://localhost:3306/<database_name>
spring.datasource.username=<MySQL_username>
spring.datasource.password=<MySQL_password>

spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.jpa.hibernate.ddl-auto=update

# Crawlbase Crawler Integration
crawlbase.token=<Your_Crawlbase_Normal_Token>
crawlbase.crawler=<Your_TCP_Crawler_Name>

logging.file.name=logs/<log-file-name>.log

You can find your Crawlbase TCP (normal) token here. Remember to replace the placeholders in the above configuration with your actual values, as determined in the previous sections. This configuration is vital for establishing database connections, synchronizing Hibernate operations, integrating with the Crawlbase API, and managing logging for your application. By carefully adjusting these properties, you’ll ensure seamless communication between the different components and services within your project.

Running the Project and Initiating Deep Crawling

With the coding phase complete, the next step is to set the project in motion. Spring Boot, at its core, employs an embedded Apache Tomcat server, which makes the transition from development to production smooth and integrates seamlessly with prominent platforms-as-a-service. Executing the project within Spring Tool Suite (STS) involves a straightforward process:

  • Right-click the project in the STS project structure tree.
  • Navigate to the “Run As” menu, and
  • Select “Spring Boot App”.

This action triggers the project to launch on localhost, port 8080.

Spring Boot Server Running

Making the Webhook Publicly Accessible

Since the webhook we’ve established resides locally on our system at localhost, port 8080, we need to grant it public accessibility. Enter Ngrok, a tool that creates secure tunnels, granting remote access without the need to manipulate network settings or router ports. Ngrok is executed on port 8080 to render our webhook publicly reachable.

Ngrok Server

Ngrok conveniently provides a public Forwarding URL, which we will later utilize with Crawlbase Crawler.

Creating the Crawlbase Crawler

Recall our earlier discussion on Crawlbase Crawler creation via the Crawlbase dashboard. Armed with a publicly accessible webhook through Ngrok, crafting the crawler becomes effortless.

Create New Crawler

In the example shown, the Ngrok forwarding URL is combined with the webhook path “/webhook/crawlbase” to form the callback, giving us a fully public webhook address. We name our crawler “test-crawler”, the name that will go into the project’s application.properties file, and select the TCP Crawler as planned. Upon hitting the “Create Crawler” button, the crawler is created with the specified configuration.

Initiating Deep Crawling by Pushing URLs

Following the creation of the crawler and the incorporation of its name into the application.properties file, we’re poised to interact with the “@POST /scrape/push-urls” API. Through this API, we send URLs to the crawler, triggering the deep crawl process. Let’s exemplify this by pushing the URL http://www.3bfluidpower.com/.

Postman Request

With this proactive approach, we set the wheels of deep crawling in motion, utilizing the power of Crawlbase Crawler to delve into the digital landscape and unearth valuable insights.
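
If you prefer code over Postman, the same request can be sent with the JDK's built-in HttpClient. This is just a convenience sketch; it assumes the Spring Boot application is running locally on port 8080, as described above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushUrlsApiClient {

    public static void main(String[] args) throws Exception {
        // Same body format the MainController expects: { "urls": [ ... ] }
        String json = "{\"urls\": [\"http://www.3bfluidpower.com/\"]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/scrape/push-urls"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        // The endpoint returns an empty 200 response; the crawling itself happens asynchronously
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println("Status: " + response.statusCode());
    }
}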

Analyzing Output in the Database

Upon initiating the URL push to the Crawler, a Request ID (RID) is returned—a concept elaborated on in prior discussions—marking the commencement of the page’s crawling process on the Crawler’s end. This strategic approach eliminates the wait time typically associated with the crawling process, enhancing the efficiency and effectiveness of data acquisition. Once the Crawler concludes the crawling, it seamlessly transmits the output to our webhook.

The Custom Headers parameter, particularly the “type” parameter, proves instrumental in our endeavor. Its presence allows us to distinguish between the URLs we pushed and those discovered during deep crawling. When the type is designated as “parent,” the URL stems from our submission, prompting us to extract fresh URLs from the crawled HTML and subsequently funnel them back into the Crawler—this time categorized as “child.” This strategy ensures that only the URLs we introduced undergo deep crawling, streamlining the process.

In our current scenario, considering a singular URL submission to the Crawler, the workflow unfolds as follows: upon receiving the crawled HTML, the webhook service stores it in the crawler_response table. Subsequently, the deep crawling of this HTML takes place, yielding newly discovered URLs that are then pushed to the Crawler.

crawler_request Table:

Crawler Request Table

As you can see above, our webhook service found 16 new URLs in the HTML of the page whose URL we pushed to the Crawler in the previous section and saved in the database with “type: parent”. We push all of these newly found URLs back to the Crawler to deep crawl the given URL. The Crawler crawls all of them and pushes the output to our webhook, and we save the crawled HTML in the crawler_response table.

crawler_response Table:

Crawler Response Table

As you can see in the above table view, all the information we receive at our webhook is saved in the table. Once the HTML is at your webhook, you can scrape any information you want from it. This detailed process highlights how deep crawling works, allowing us to discover important information from web content.
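
As an illustration, here is a minimal, hypothetical sketch of how you might scrape something specific, such as the page title, from HTML that has already been stored in the crawler_response table. It assumes a CrawlerResponse entity loaded through the CrawlerResponseRepository we created earlier.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.example.crawlbase.models.CrawlerResponse;

public class StoredHtmlScraper {

    // Hypothetical helper: parse the stored page_html and return the page title
    public static String extractTitle(CrawlerResponse crawlerResponse) {
        Document document = Jsoup.parse(crawlerResponse.getPageHtml());
        return document.title(); // any other JSoup selector works the same way on the stored HTML
    }
}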

Conclusion

Throughout this exploration of web scraping with Java and Spring Boot, we have navigated the critical steps of setting up a Java environment tailored for web scraping, selecting the appropriate libraries, and executing a deep crawling project from start to finish. This journey underscores Java’s versatility and robustness in extracting data from the web, with the Crawlbase Crawler and JSoup doing the heavy lifting in this tutorial, while tools such as Selenium and HtmlUnit remain strong options for dynamic, browser-rendered content. By equipping readers with the knowledge to tailor their web scraping endeavors to project-specific requirements, this article serves as a comprehensive guide to the complexities and possibilities of web scraping with Java.

As we conclude, it’s clear that mastering web scraping in Java opens up a plethora of opportunities for data extraction, analysis, and application. Whether the goal is to monitor market trends, aggregate content, or gather insightful data from across the web, the techniques and insights provided here lay a solid foundation for both novices and experienced developers alike. While challenges such as handling dynamic content and evading security measures persist, the evolving nature of Java web scraping tools promises continual advancements. Therefore, staying informed and adaptable will be key to harnessing the full potential of web scraping technologies in the ever-evolving landscape of the internet.

Thank you for joining us on this journey. You can find the full source code of the project on GitHub here. May your web data endeavors be as transformative as the tools and knowledge you’ve gained here. As the digital landscape continues to unfold, remember that the power to innovate is in your hands.

For more tutorials like these, follow our blog. Here are some Java tutorial guides you might be interested in:

E-commerce website crawling

Web Scrape Expedia

Web Scrape Booking.com

How to Scrape G2 Product Reviews

Playwright Web Scraping

Scrape Yahoo finance

Frequently Asked Questions

Q: Do I need to use Java to use the Crawler?

No, you do not need to use Java exclusively to use the Crawlbase Crawler. The Crawler provides multiple libraries for various programming languages, enabling users to interact with it using their preferred language. Whether you are comfortable with Python, JavaScript, Java, Ruby, or other programming languages, Crawlbase has you covered. Additionally, Crawlbase offers APIs that allow users to access the Crawler’s capabilities without relying on specific libraries, making it accessible to a wide range of developers with different language preferences and technical backgrounds. This flexibility ensures that you can seamlessly integrate the Crawler into your projects and workflows using the language that best suits your needs.

Q: Can you use Java for web scraping?

Yes, Java is a highly capable programming language that has been used for a variety of applications, including web scraping. It has evolved significantly over the years and supports various tools and libraries specifically for scraping tasks.

Q: Which Java library is most effective for web scraping?

For web scraping in Java, the most recommended libraries are JSoup, HtmlUnit, and Selenium WebDriver. JSoup is particularly useful for extracting data from static HTML pages. For dynamic websites that utilize JavaScript, HtmlUnit and Selenium WebDriver are better suited.

Q: Between Java and Python, which is more suitable for web scraping?

Python is generally preferred for web scraping over Java. This preference is due to Python’s simplicity and its rich ecosystem of libraries such as BeautifulSoup, which simplifies parsing and navigating HTML and XML documents.

Q: What programming language is considered the best for web scraping?

Python is considered the top programming language for web scraping tasks. It offers a comprehensive suite of libraries and tools like BeautifulSoup and Scrapy, which are designed to facilitate efficient and effective web scraping.