Among the plethora of search engines available today, Yandex stands as a prominent player, particularly in Russia and neighboring countries. Just as Google dominates many parts of the world, Yandex holds a significant market share in Russia, with estimates suggesting it captures over 50% of the country’s search engine market. Beyond search, Yandex’s ecosystem spans more than 20 products and services (including maps, mail, and cloud storage) catering to diverse user needs. By recent estimates, Yandex processes billions of search queries every month, making it a prime target for anyone looking to scrape Yandex for data, whether businesses, researchers, or data enthusiasts.

However, manually accessing and analyzing this wealth of data is time-consuming and labor-intensive. This is where web scraping Yandex comes into play. By utilizing Python and the Crawlbase Crawling API, we can automate the process of gathering Yandex search results, producing valuable insights and data points that can drive decision-making and research.

In this guide, we’ll delve deep into the world of web scraping, focusing specifically on how to scrape Yandex search results. Whether you want to understand Yandex’s structure, set up your environment to scrape Yandex efficiently, or store the scraped data for future analysis, this guide has you covered.

Table Of Contents

  1. Why Scrape Yandex Search Results?
  • Benefits of Yandex Search Results
  • Practical Uses of Yandex Data
  2. Understanding Yandex’s Structure
  • Layout and Structure of Yandex Search Pages
  • Key Data Points to Extract
  3. Setting up Your Environment
  • Installing Python and Required Libraries
  • Choosing the Right Development IDE
  • Crawlbase Registration and API Token
  4. Fetching and Parsing Search Results
  • Crafting the URL for Targeted Scraping
  • Making HTTP Requests using the Crawling API
  • Inspecting HTML to Get CSS Selectors
  • Extracting Search Result Details
  • Handling Pagination
  5. Storing the Scraped Data
  • Storing Scraped Data in CSV File
  • Storing Scraped Data in SQLite Database
  6. Final Words
  7. Frequently Asked Questions

Why Scrape Yandex Search Results?

You get many results when you type something into Yandex and hit search. Have you ever wondered what more you could do with these results? That’s where scraping comes in. Let’s dive into why scraping Yandex can be a game-changer.

Benefits of Yandex Search Results

  1. Spotting Trends: Yandex gives us a window into what’s buzzing online. We can determine what topics or products are becoming popular by studying search patterns. For example, if many people search for “winter jackets” in October, it hints that winter shopping trends are starting early.
  2. Knowing Your Competition: If you have a business or a website, you’d want to know how you stack up against others. By scraping Yandex, you can see which websites appear often in searches related to your field. This gives insights into what others are doing right and where you might need to catch up.
  3. Content Creation: Are you a blogger, vlogger, or writer? Knowing what people are looking for on Yandex can guide your content creation. If “easy cookie recipes” are trending, maybe it’s time to share your favorite cookie recipe or make a video about it.
  4. Boosting Your Own Website: Every website wants to appear on the first page of Yandex. Website owners can tweak their content by understanding search patterns and popular keywords. This way, they have a better chance of appearing when someone searches for related topics.

Practical Uses of Yandex Data

  • Comparing Prices: Many people check prices on different websites before buying something. You can gather this price data and make informed decisions by scraping Yandex.
  • Research and Learning: For students, teachers, or anyone curious, Yandex search data can be a goldmine. You can learn about the interests, concerns, and questions of people in different regions.
  • News and Reporting: Journalists and news outlets can use Yandex data to understand what news topics are gaining traction. This helps them prioritize stories and deliver content that resonates with readers.

To summarize, Yandex search results are more than just a list. They offer valuable insights into what people think, search, and want online. By scraping and analyzing this data, we can make smarter decisions, create better content, and stay ahead in the digital game.

Understanding Yandex’s Structure

When you visit Yandex and type in a search, the page you see isn’t random. It’s designed in a specific way. Let’s briefly examine how Yandex’s search page is put together and the essential things we can pick out from it.

Layout and Structure of Yandex Search Pages

Imagine you’re looking at a newspaper. There are headlines at the top, main stories in the middle, and some ads or side stories on the sides. Yandex’s search page is a bit like that.

Yandex Search Results
  • Search Bar: This is where you type what you’re looking for.
  • Search Results: After typing, you get a list of websites related to your search. These are the main stories, like the main news articles in a newspaper.
  • Side Information: Sometimes, there are extra bits on the side. These could be ads, related searches, or quick answers to common questions.
  • Footer: At the bottom, there might be links to other Yandex services or more information about privacy and terms.

Key Data Points to Extract

Now that we know how Yandex’s page looks, what information can we take from it?

  1. Search Results: This is the main thing we want. It’s a list of websites related to our search. If we’re scraping, we’d focus on getting these website links.
  2. Title of Websites: Next to each link is a title. This title gives a quick idea of what the website is about.
  3. Website Description: Under the title, there’s usually a small description or snippet from the website. This can tell us more about the website’s content without clicking on it.
  4. Ads: Sometimes, the first few results might be ads. These are websites that paid Yandex to show up at the top. Knowing which results are ads and which are organic (not paid for) is good.
  5. Related Searches: At the bottom of the page, other search suggestions might be related to what you typed. These can give ideas for more searches or related topics.

Understanding Yandex’s structure helps us know where to look and what to focus on when scraping. Knowing the layout and key data points allows us to gather the information we need more efficiently.

Setting up Your Environment

Before scraping Yandex search results, we must ensure our setup is ready. We must install the tools and libraries needed, pick the right IDE, and get the critical API credentials.

Installing Python and Required Libraries

  • The first step in setting up your environment is to ensure you have Python installed on your system. If you don’t have Python yet, download it from the official website at python.org.

  • Once you have Python installed, the next step is to make sure you have the required libraries for this project. In our case, we’ll need three main libraries:

    • Crawlbase Python Library: This library will be used to make HTTP requests to the Yandex search page using the Crawlbase Crawling API. To install it, you can use pip with the following command:
    pip install crawlbase
    • Beautiful Soup 4: Beautiful Soup is a Python library that makes it easy to scrape and parse HTML content from web pages. It’s a critical tool for extracting data from the web. You can install it using pip:
    pip install beautifulsoup4
    • Pandas: Pandas is a powerful data manipulation and analysis library in Python. We’ll use it to store and manage the scraped data. Install pandas with pip:
    pip install pandas

Choosing the Right Development IDE

An Integrated Development Environment (IDE) provides a coding environment with features like code highlighting, auto-completion, and debugging tools. While you can write Python code in a simple text editor, an IDE can significantly improve your development experience.

Here are a few popular Python IDEs to consider:

  1. PyCharm: PyCharm is a robust IDE with a free Community Edition. It offers features like code analysis, a visual debugger, and support for web development.

  2. Visual Studio Code (VS Code): VS Code is a free, open-source code editor developed by Microsoft. Its vast extension library makes it versatile for various programming tasks, including web scraping.

  3. Jupyter Notebook: Jupyter Notebook is excellent for interactive coding and data exploration. It’s commonly used in data science projects.

  4. Spyder: Spyder is an IDE designed for scientific and data-related tasks. It provides features like a variable explorer and an interactive console.

Crawlbase Registration and API Token

To use the Crawlbase Crawling API for making HTTP requests to Yandex, you must sign up for an account on the Crawlbase website. Now, let’s get you set up with a Crawlbase account. Follow these steps:

  1. Visit the Crawlbase Website: Open your web browser and navigate to the Crawlbase website Signup page to begin the registration process.
  2. Provide Your Details: You’ll be asked to provide your email address and create a password for your Crawlbase account. Fill in the required information.
  3. Verification: After submitting your details, you may need to verify your email address. Check your inbox for a verification email from Crawlbase and follow the instructions provided.
  4. Login: Once your account is verified, return to the Crawlbase website and log in using your newly created credentials.
  5. Access Your API Token: You’ll need an API token to use the Crawlbase Crawling API. You can find your API tokens in your Crawlbase account dashboard.

Note: Crawlbase offers two types of tokens, one for static websites and another for dynamic or JavaScript-driven websites. Since we’re scraping Yandex, we’ll opt for the Normal Token. Crawlbase generously offers an initial allowance of 1,000 free requests for the Crawling API, making it an excellent choice for our web scraping project.

With Python and the required libraries installed, the IDE of your choice set up, and your Crawlbase token in hand, you’re well-prepared to start scraping Yandex search results.

Fetching and Parsing Search Results

The process involves multiple steps when scraping Yandex search results, from crafting the right URL to handling dynamic content. This section will walk you through each step, ensuring you have a clear roadmap to fetch and parse Yandex search results successfully.

Crafting the URL for Targeted Scraping

Yandex, like many search engines, provides a straightforward method to structure URLs for specific search queries. By understanding this structure, you can tailor your scraping process to fetch exactly what you need.

  • Basic Structure: A typical Yandex search URL starts with the main domain followed by the search parameters. For instance:
# Replace your_search_query_here with the desired search term.
https://yandex.ru/search/?text=your_search_query_here
  • Advanced Parameters: Yandex offers various parameters that allow for more refined searches. Some common parameters include:
    • &lr=: Restricts results to a particular region; it takes Yandex’s numeric region ID.
    • &p=: For pagination, allowing you to navigate through different result pages.
  • Encoding: Ensure that the search query is properly encoded. This is crucial, especially if your search terms contain special characters or spaces. You can use Python’s urllib.parse module to handle this encoding seamlessly, as shown in the short example after this list.
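
For example, here is a minimal sketch of how you might assemble such a URL in Python. The text, lr, and p parameter names come from the list above; the helper name build_yandex_url and the sample values are purely illustrative:

from urllib.parse import urlencode

def build_yandex_url(query, region_id=None, page=None):
    # Build a Yandex search URL from the parameters described above
    params = {'text': query}  # urlencode handles escaping spaces and special characters
    if region_id is not None:
        params['lr'] = region_id  # optional numeric region ID
    if page is not None:
        params['p'] = page  # optional results page number
    return 'https://yandex.com/search/?' + urlencode(params)

# Example: second results page for "winter jackets"
print(build_yandex_url("winter jackets", page=1))
# https://yandex.com/search/?text=winter+jackets&p=1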

By mastering the art of crafting URLs for targeted scraping on Yandex, you empower yourself to extract precise and relevant data, ensuring that your scraping endeavors yield valuable insights.

Making HTTP Requests using the Crawling API

Once we have our URL, the next step is to fetch the HTML content of the search results page. Platforms like Yandex monitor frequent requests from the same IP, potentially leading to restrictions or bans. This is where the Crawlbase Crawling API shines, offering a solution with its IP rotation mechanism.

Let’s use “Winter Jackets” as our target search query. Below is a code snippet illustrating how to leverage the Crawling API:

from crawlbase import CrawlingAPI
from urllib.parse import quote

API_TOKEN = 'YOUR_CRAWLBASE_TOKEN'
crawling_api = CrawlingAPI({'token': API_TOKEN})

url = f'https://yandex.com/search/?text={quote("Winter Jackets")}'

response = crawling_api.get(url)

if response['headers']['pc_status'] == '200':
    html_content = response['body'].decode('utf-8')
    print(html_content)
else:
    print(f"Failed to fetch the page. Crawlbase status code: {response['headers']['pc_status']}")

Executing the Script

After ensuring your environment is set up and the necessary dependencies are installed, running the script becomes a breeze:

  1. Save the script with a .py extension, e.g., yandex_scraper.py.
  2. Launch your terminal or command prompt.
  3. Navigate to the directory containing the script.
  4. Execute the script using: python yandex_scraper.py.

When you run this script, it will query Yandex for “Winter Jackets” and print the HTML content of the results page in your terminal.

Output HTML Snapshot

Inspecting HTML to Get CSS Selectors

With the HTML content obtained from the search results page, the next step is to analyze its structure and pinpoint where the search result data lives. This is where the browser’s developer tools come to our rescue. Here’s how you can inspect the HTML structure and identify the CSS selectors you need:

Yandex Search Results Inspect
  1. Open the Web Page: Navigate to the Yandex search URL you intend to scrape and open it in your web browser.
  2. Right-Click and Inspect: Right-click the element you wish to extract and select “Inspect” or “Inspect Element” from the context menu. This opens the browser’s developer tools.
  3. Locate the HTML Source: Within the developer tools, you’ll see the page’s HTML source. Hover your cursor over elements in the HTML panel and the corresponding portions of the web page will highlight.
  4. Identify CSS Selectors: To extract data from a particular element, right-click on it within the developer tools and choose “Copy” > “Copy selector.” This copies the CSS selector for that element to your clipboard, ready to be used in your scraper.

Once you have these selectors, you can proceed to structure your Yandex scraper to extract the required information effectively.
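
Before wiring a copied selector into the full scraper, you can sanity-check it against a freshly fetched page. Here’s a small, optional sketch; it reuses the fetch pattern from the previous step and the .serp-item and h2.organic__url-text selectors used later in this guide:

from bs4 import BeautifulSoup
from crawlbase import CrawlingAPI
from urllib.parse import quote

crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})
response = crawling_api.get(f'https://yandex.com/search/?text={quote("Winter Jackets")}')
html_content = response['body'].decode('utf-8')

soup = BeautifulSoup(html_content, 'html.parser')

# Try the selector copied from the developer tools on the first search result
first_result = soup.select_one('.serp-item')
if first_result:
    title = first_result.select_one('h2.organic__url-text')
    print(title.get_text(strip=True) if title else "Selector matched a result, but no title was found")
else:
    print("Selector did not match anything - re-check it in the developer tools")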

Extracting Search Result Details

Python provides handy tools to navigate and understand web content, with BeautifulSoup being a standout choice.

Previously, we pinpointed the CSS selectors that act like markers, directing our program precisely to the data we need on a webpage. For each search result we want details like the title, URL, and description. The position of a result isn’t stored as a field in the HTML, so we simply record it ourselves as we iterate. Here’s how we can update our previous script and extract these details using BeautifulSoup:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import quote
import json

# Initialize the CrawlingAPI with your API token
API_TOKEN = 'YOUR_CRAWLBASE_TOKEN'
crawling_api = CrawlingAPI({'token': API_TOKEN})

def fetch_page_html(url):
    response = crawling_api.get(url)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode("utf-8")
    else:
        print(f"Request failed with Crawlbase status code {response['headers']['pc_status']}")
        return None

def scrape_yandex_search(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extracting search result details
    search_results = []
    for position, result in enumerate(soup.select('.serp-item'), start=1):
        title_element = result.select_one('h2.organic__url-text')
        url_element = result.select_one('a.organic__url')
        description_element = result.select_one('div.organic__content-wrapper')

        search_result = {
            'position': position,
            'title': title_element.get_text(strip=True) if title_element else None,
            'url': url_element['href'] if url_element else None,
            'description': description_element.get_text(strip=True) if description_element else None,
        }
        search_results.append(search_result)

    return search_results

def main():
    search_query = "Winter Jackets"
    url = f'https://yandex.com/search/?text={quote(search_query)}'
    html_content = fetch_page_html(url)

    if html_content:
        search_results = scrape_yandex_search(html_content)
        print(json.dumps(search_results, ensure_ascii=False, indent=2))

if __name__ == "__main__":
    main()

The fetch_page_html function sends an HTTP GET request to Yandex’s search results page using the CrawlingAPI library and a specified URL. If the response status code is 200, indicating success, it decodes the UTF-8 response body and returns the HTML content; otherwise, it prints an error message and returns None.

Meanwhile, the scrape_yandex_search function utilizes BeautifulSoup to parse the HTML content of the Yandex search results page. The function iterates through the search results, structures the extracted information, and appends it to the search_results list. Finally, the function returns the compiled list of search results.

The main function is like a control center, starting the process of getting and organizing Yandex search results for a particular search query. It then shows the gathered results in an easy-to-read JSON-style format.

Example Output:

[
{
"position": 1,
"title": "BestWinterJacketsof 2024 | Switchback Travel",
"url": "https://www.switchbacktravel.com/best-winter-jackets",
"description": "Patagonia Tres 3-in-1 parka (winterjacket) Category: Casual Fill: 4.2 oz. of 700-fill-power down Weight: 2 lb."
},
{
"position": 2,
"title": "Winterjacket— купить по низкой цене на Яндекс Маркете",
"url": "https://market.yandex.ru/search?text=winter%20jacket",
"description": "Купитьwinterjacket- 97 предложений - низкие цены, быстрая доставка от 1-2 часов, возможность оплаты в рассрочку...Куртка ASICS Lite ShowWinterJacket. 15 338 ₽."
},
{
"position": 3,
"title": "Amazon.com:WinterJackets",
"url": "https://www.amazon.com/Winter-Jackets/s?k=Winter+Jackets",
"description": "CAMEL CROWN Men's Mountain Snow Waterproof SkiJacketDetachable Hood Windproof Fleece Parka RainJacketWinterCoat."
},
{
"position": 4,
"title": "19 BestWinterJacketsfor Men and Women (2023 MASSIVE...",
"url": "https://www.thebrokebackpacker.com/best-winter-jackets/",
"description": "Quick Answer: These are the BestWinterJacketsof 2023. BestWinterJacketsof 2023. #1 – Best OverallWinterJacketfor Men."
},
{
"position": 5,
"title": "The 29 Best LuxuryWinterJacketBrands (2024)",
"url": "https://www.irreverentgent.com/best-luxury-winter-jacket-brands/",
"description": "If you’re ready to finally look and feel unstoppable during the cooler months, then read on to discover the absolute best luxury and designerwinterjacketbrands."
},
{
"position": 6,
"title": "The 14 BestWinterJacketsfor Extreme Cold in... - PureWow",
"url": "https://www.purewow.com/fashion/best-winter-jackets-for-extreme-cold",
"description": "For such occasions, you’ll definitely want to be sporting one of these ultra-warm coats . These high-tech toppers are the absolute bestwinterjacketsfor extreme cold..."
},
{
"position": 7,
"title": "The BestWinterJacketsof 2024",
"url": "https://gearjunkie.com/apparel/best-winter-jackets",
"description": "New for the 2023-2024winterseason, the Patagonia Stormshadow Parka ($899) is our new favorite all-aroundwinterjacket."
},
{
"position": 8,
"title": "Зимний пуховик длинныйWINTERJACKET173622027...",
"url": "https://www.WildBerries.ru/catalog/173622027/detail.aspx",
"description": "Похожие. Следующий слайд. Зимний пуховик длинныйWINTERJACKET....холлофайбер. Все характеристики и описание переехали сюда.WINTERJACKET.Читать ещёПохожие. Следующий слайд. Зимний пуховик длинныйWINTERJACKET. Цвет черный. Похожие. ... холлофайбер. Все характеристики и описание переехали сюда.WINTERJACKET. Зимний пуховик длинный. 240 оценок.СкрытьЦена47614 761₽"
},
{
"position": 9,
"title": "Мужские зимние куртки — купить в интернет-магазине...",
"url": "https://www.Lamoda.ru/c/3816/clothes-men-winter-jackets/",
"description": "Мужские зимние куртки с бесплатной доставкой в интернет-магазине Ламода, актуальные цены, в наличии большой ассортимент моделей."
},
{
"position": 10,
"title": "27 Best Men’sWinterJacketsof 2024, Tested and Reviewed",
"url": "https://www.esquire.com/style/mens-fashion/g2014/best-winter-coats/",
"description": "Whether you're in the market for a weather-ready parka, a warm puffer, or a cozy bomber we found the the 27 best and most stylishwintercoats to buy in 2024."
},
{
"position": 11,
"title": "The Best Men’sWinterJacketsof 2024, Tested and Reviewed",
"url": "https://www.travelandleisure.com/style/best-mens-winter-jackets-and-coats",
"description": "Our expert outdoor enthusiasts tested a range ofwinterjacketsto find the best ones on the market."
},
{
"position": 12,
"title": "Интернет-магазин Made-in-China. Поставщики из Китая",
"url": "https://yabs.yandex.ru/count/WY0ejI_zOoVX2Ldb0PKG09DUSoOQbKgbKga4mOJVzd9dpvPERUREdOVQ-VeThpVSuJu0WmY71aD94b2LKlIGkC8eyXawmKA8-aA24KZOC9gHHEA6ab0D8QKWAWaYew0Hr2AX6K5raL1jhLA9HYmXeTW1KpOjV4b0Zy9MwJTigwHfy0467SSfd681sYaSbGAqqfn7ca-V-7G20Mmvur_ELyQULm-7HlDdzFc4IgNFwTDsubFd3RNjMlfqHczS6hsTzmrrIkqASB0y040anWelv_CD4ysO9GhJbxEpBd1Q3Ri3kgxhDNgLdDqh3XxgOdD5ugemQ4CKcKAdfjMVx2nicLaPdSBMKz1X9fGf85E98ME0qdKOSXD6_7MOxsGGAZNd1HksCDqGJJuK7PMqNLDlkUVL_DACJIxDsDnfAta_F65n_s3wCEv_VTlXz_gsmy_rROT_dy0NvMkPZdu_otJqOvezTv5P_bLIc4-aFB6Y6sbwV03S6-DTdO_2AHhiWHJfetSDabDWz7QfwCl-B8h8e2Vx8KR2iFoOzc03eF1LrVal4bhuYIDQ-4dMQtqRbnoOp6T3JCagakMrX_ZcTO-4aHi5ZWXONpxcIgLIQHMYmiwEqJ7GKvJV3OK8JOf-P6115FmOHOD-1Di4B0LMGBcwoDiu0PU_avPIsIX-yOculirKKE8GdZPMV7mKURvKt9lzxy-YsW_y45FEE6RWxamqY9C9~2?etext=2202.PsLaMsAcnPGXqEYDfwE_3RxouxkWkc56UM4sxhHfKtZicGpncGZuY3d2Y25lc2pi.0b887a5f2d86e1deffcb91bcf2515e203905de3b&from=yandex.com%3Bsearch%26%23x2F%3B%3Bweb%3B%3B0%3B&q=winter+jackets",
"description": "Доставка прямо с китайского завода. Большой выбор. Гарантия качества. Заходите!·Различные способы оплаты. Поддержка. Качественная продукция. Гарантия"
}
]

Handling Pagination

Navigating through multiple pages is a common challenge when scraping Yandex search results. Understanding the structure of the HTML elements indicating page navigation is crucial. Typically, you’ll need to dynamically craft URLs for each page, adjusting parameters like page numbers accordingly. Implementing a systematic iteration through pages in your script ensures comprehensive data extraction. To optimize efficiency and prevent overloading Yandex’s servers, consider introducing delays between requests, adhering to responsible web scraping practices. Here’s how we can update our previous script to handle pagination:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import quote
import time

# Initialize the CrawlingAPI with your API token
API_TOKEN = 'YOUR_CRAWLBASE_TOKEN'
crawling_api = CrawlingAPI({'token': API_TOKEN})
position_start = 1

def fetch_page_html(url):
    response = crawling_api.get(url)
    if response['headers']['pc_status'] == '200':
        return response['body'].decode("utf-8")
    else:
        print(f"Request failed with Crawlbase status code {response['headers']['pc_status']}")
        return None

def scrape_yandex_search(html_content):
    global position_start
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extracting search result details
    search_results = []
    for position, result in enumerate(soup.select('.serp-item'), start=position_start):
        title_element = result.select_one('h2.organic__url-text')
        url_element = result.select_one('a.organic__url')
        description_element = result.select_one('div.organic__content-wrapper')

        search_result = {
            'position': position,
            'title': title_element.get_text(strip=True) if title_element else None,
            'url': url_element['href'] if url_element else None,
            'description': description_element.get_text(strip=True) if description_element else None,
        }
        search_results.append(search_result)

        # Remember where the next page's numbering should continue
        position_start = position + 1

    return search_results

def main():
    base_url = f'https://yandex.com/search/?text={quote("Winter Jackets")}&p='
    page_number = 0
    all_search_results = []

    # Limiting pagination depth to 6 pages
    # You can change the limit as per your needs
    while page_number <= 5:
        url = base_url + str(page_number)
        html_content = fetch_page_html(url)

        if html_content:
            search_results = scrape_yandex_search(html_content)
            all_search_results.extend(search_results)

        page_number += 1
        # Introduce a delay to respect the website's server
        time.sleep(2)

    # further process all_search_results

if __name__ == "__main__":
    main()

The script iterates through multiple pages using a while loop, fetching the HTML content for each page. To respect the website’s server, a 2-second delay is introduced between requests. Search results are then extracted and aggregated in the all_search_results list. This systematic approach ensures the script navigates through various pages, retrieves the HTML content, and accumulates search results, effectively handling pagination during the scraping process.

Storing the Scraped Data

After successfully scraping data from Yandex’s search results, the next crucial step is storing this valuable information for future analysis and reference. In this section, we will explore two common methods for data storage: saving scraped data in a CSV file and storing it in an SQLite database. These methods allow you to organize and manage your scraped data efficiently.

Storing Scraped Data in CSV File

CSV is a widely used format for storing tabular data. It’s a simple and human-readable way to store structured data, making it an excellent choice for saving your scraped Yandex search results data.

We’ll extend our previous web scraping script to include a step for saving the scraped data into a CSV file using the popular Python library, pandas. Here’s an updated version of the script:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import quote
import pandas as pd
import time

# Initialize the CrawlingAPI with your API token
API_TOKEN = 'YOUR_CRAWLBASE_TOKEN'
crawling_api = CrawlingAPI({'token': API_TOKEN})
position_start = 1

def fetch_page_html(url):
    # ... (unchanged)

def scrape_yandex_search(html_content):
    # ... (unchanged)

def main():
    base_url = f'https://yandex.com/search/?text={quote("Winter Jackets")}&p='
    page_number = 0
    all_search_results = []

    # Limiting pagination depth to 6 pages
    # You can change the limit as per your needs
    while page_number <= 5:
        url = base_url + str(page_number)
        html_content = fetch_page_html(url)

        if html_content:
            search_results = scrape_yandex_search(html_content)
            all_search_results.extend(search_results)

        page_number += 1
        # Introduce a delay to respect the website's server
        time.sleep(2)

    # Save scraped data as a CSV file
    df = pd.DataFrame(all_search_results)
    df.to_csv('yandex_search_results.csv', index=False)

if __name__ == "__main__":
    main()

In this updated script, we’ve introduced pandas, a powerful data manipulation and analysis library. After scraping and accumulating the search results in the all_search_results list, we create a pandas DataFrame from this data. Then, we use the to_csv method to save the DataFrame to a CSV file named “yandex_search_results.csv” in the current directory. Setting index=False ensures that we don’t save the DataFrame’s index as a separate column in the CSV file.
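
Once the file is written, you can load it back with pandas whenever you want to analyze the results. A small, optional sketch (the domain-counting line is just one illustrative check):

import pandas as pd

# Load the previously saved results back into a DataFrame
df = pd.read_csv('yandex_search_results.csv')

# A couple of quick checks: how many results were captured,
# and which domains appear most often
print(len(df), "results loaded")
print(df['url'].str.extract(r'https?://([^/]+)')[0].value_counts().head())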

yandex_search_results.csv File Snapshot:


Storing Scraped Data in SQLite Database

If you prefer a more structured and query-friendly approach to data storage, SQLite is a lightweight, serverless database engine that can be a great choice. You can create a database table to store your scraped data, allowing for efficient data retrieval and manipulation. Here’s how you can modify the script to store data in an SQLite database:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
from urllib.parse import quote
import sqlite3
import time

# Initialize the CrawlingAPI with your API token
API_TOKEN = 'YOUR_CRAWLBASE_TOKEN'
crawling_api = CrawlingAPI({'token': API_TOKEN})
position_start = 1

def fetch_page_html(url):
    # ... (unchanged)

def scrape_yandex_search(html_content):
    # ... (unchanged)

def initialize_database():
    # Create or connect to the SQLite database
    conn = sqlite3.connect('search_results.db')
    cursor = conn.cursor()

    # Create a table to store the search results
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS search_results (
            title TEXT,
            url TEXT,
            description TEXT,
            position INTEGER
        )
    ''')

    # Commit changes and close the database connection
    conn.commit()
    conn.close()

def insert_search_results(result_list):
    # Create or connect to the SQLite database
    conn = sqlite3.connect('search_results.db')
    cursor = conn.cursor()

    # Create a list of tuples from the data
    data_tuples = [(result['title'], result['url'], result['description'], result['position']) for result in result_list]

    # Insert data into the search_results table
    cursor.executemany('''
        INSERT INTO search_results (title, url, description, position)
        VALUES (?, ?, ?, ?)
    ''', data_tuples)

    conn.commit()
    conn.close()

def main():
    base_url = f'https://yandex.com/search/?text={quote("Winter Jackets")}&p='
    page_number = 0
    all_search_results = []

    # Initialize the database
    initialize_database()

    # Limiting pagination depth to 6 pages
    # You can change the limit as per your needs
    while page_number <= 5:
        url = base_url + str(page_number)
        html_content = fetch_page_html(url)

        if html_content:
            search_results = scrape_yandex_search(html_content)
            all_search_results.extend(search_results)

        page_number += 1
        # Introduce a delay to respect the website's server
        time.sleep(2)

    # Insert scraped data into the SQLite database
    insert_search_results(all_search_results)

if __name__ == "__main__":
    main()

The initialize_database() and insert_search_results(result_list) functions manage the SQLite database. initialize_database() creates or connects to a database file named search_results.db and defines the table structure used to store the search results. insert_search_results(result_list) inserts the scraped search results into that search_results table.
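
With the data in SQLite, you can query it directly. Here is a small sketch that reads back the first few rows from the search_results table created above (the ORDER BY and LIMIT values are arbitrary):

import sqlite3

# Connect to the database created by initialize_database()
conn = sqlite3.connect('search_results.db')
cursor = conn.cursor()

# Fetch the first ten results ordered by their position in the search results
cursor.execute('''
    SELECT position, title, url
    FROM search_results
    ORDER BY position
    LIMIT 10
''')
for position, title, url in cursor.fetchall():
    print(f"{position}. {title} - {url}")

conn.close()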

search_results Table Snapshot:


Final Words

This guide has provided the necessary insights to scrape Yandex search results utilizing Python and the Crawlbase Crawling API. As you continue your web scraping journey, remember the versatility of these skills extends beyond Yandex. Explore our additional guides for platforms like Google and Bing, broadening your search engine scraping expertise.

Here are some other web scraping python guides you might want to look at:

📜 How to scrape Expedia

📜 How to scrape Yelp

📜 How to scrape Producthunt

📜 How to scrape Images from DeviantArt

We understand that web scraping can present challenges, and it’s important that you feel supported. Therefore, if you require further guidance or encounter any obstacles, please do not hesitate to reach out. Our dedicated team is committed to assisting you throughout your web scraping endeavors.

Frequently Asked Questions

Q. What is Yandex?

Yandex is a leading search engine, often called the “Google of Russia.” It’s not just a search engine; it’s a technology company that provides various digital services, including but not limited to search functionality, maps, email services, and cloud storage. Originating in Russia, Yandex has expanded its services to neighboring countries and has become a significant player in the tech industry.

Q. Why would someone want to scrape Yandex search results?

There can be several reasons someone might consider scraping Yandex search results. Researchers might want to analyze search patterns, businesses might want to gather market insights, and developers might want to integrate search results into their applications. By scraping search results, one can understand user behavior, track trends, or create tools that rely on real-time search data.

Q. Is it legal to scrape Yandex search results?

The legality of web scraping depends on various factors, including the website’s terms of service. Yandex, like many other search engines, has guidelines and terms of service in place. It’s crucial to review and understand these terms before scraping. Always ensure that the scraping activity respects Yandex’s robots.txt file, doesn’t overload their servers, and doesn’t violate any copyrights or privacy laws. If in doubt, seeking legal counsel or using alternative methods to obtain the required data is advisable.

Q. How can I prevent my IP from getting blocked while scraping?

Getting your IP blocked is a common challenge when scraping websites. Tools like the Crawlbase Crawling API come in handy to mitigate this risk. The API offers IP rotation, automatically switching between multiple IP addresses. This ensures that only a few requests are sent from any single IP in a short period, reducing the chances of triggering security measures like IP bans. Additionally, it’s essential to incorporate delays between requests, vary your user agents, and respect any rate-limiting rules the website sets to maintain a smooth scraping process.
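
As a rough illustration of the “delays between requests” advice, here is a hypothetical helper that spaces out calls to the Crawling API and retries once on failure. The two-second delay and single retry are arbitrary choices, not Crawlbase requirements:

import time
from crawlbase import CrawlingAPI

crawling_api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_TOKEN'})

def polite_get(url, delay=2, retries=1):
    # Pause before each attempt so we don't hammer Yandex's servers
    for attempt in range(retries + 1):
        time.sleep(delay)
        response = crawling_api.get(url)
        if response['headers']['pc_status'] == '200':
            return response['body'].decode('utf-8')
        print(f"Attempt {attempt + 1} failed with status {response['headers']['pc_status']}")
    return None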