How to store scraped data on the cloud
Oct 5, 2020 · 11 min read
Saving your important files, be it personal, work, or business, on a local disk drive may look convenient at first. However, as the data piles up, or when you need to transfer files between machines, it can become troublesome and end up giving you more work than necessary. Not only that, what if something happens to your local storage? Power issues, firmware corruption, and human error are just a few of the things that can cause a hard disk failure. Issues like these may ultimately cost you countless unrecoverable work hours and have a significant negative impact on your business.
The “cloud” revolutionizes how we store our data and how we access it on a day-to-day basis. Cloud storage is a type of computer storage in which data is copied over the internet to a data server. These data servers are actual physical machines where companies store your files across multiple hard drives. Online storage solutions have come to replace conventional local disk storage. Unlike a single local hard disk drive, cloud storage keeps your data safe from loss, since your files are also backed up in another location, a practice often referred to as redundancy.
So, what are the advantages of cloud storage, especially from a professional and business standpoint? Well, there are plenty, and here are a few things to consider:
Information Management - Centralizing storage in the cloud creates enormous leverage for new use cases. Multiple people can quickly and seamlessly access data, making it convenient to share files with your colleagues. You can also perform robust information-management tasks, including automated tiering or locking down data to meet compliance requirements.
Ownership cost - With cloud storage, there’s no need to purchase physical hardware or provision storage in advance. You can add capacity on demand and pay only for the storage that you or your business actually use.
Time of Deployment - For enterprises and businesses that deal with huge amounts of data, infrastructure should never slow them down. Cloud storage enables a development team to quickly deploy exactly the amount of storage needed, right when it is needed. This lets your team focus on solving other important issues instead of spending time managing a storage system.
With all these being said, what are the real benefits of cloud storage in terms of web crawling and scraping?
If you’re a beginner trying out web scraping, you’ll notice that over time, storing your scraped data becomes a problem you have to solve yourself, for example by purchasing an extra hard drive to make sure the data is safely backed up and your precious scraped data is not lost. This takes time and resources that you could have invested in more important things, like doing the actual scraping or learning new ways to scrape data effectively. The same scenario plays out in small and big businesses that maintain their own databases, which is why online storage solutions are now an integral part of any business that deals with data.
The scalability and worry-free nature of cloud storage give it a major advantage in most cases, one that is simply hard to ignore.
ProxyCrawl Cloud Storage handles scaling, backing up, and managing cloud space securely, so you and your team can redirect your time and effort to what really matters for your business. It is an easy-to-use API where you can save your crawled or scraped data and screenshots in the cloud. Here, you can also effortlessly run a full-text search and add or delete data.
To access cloud storage, ProxyCrawl created an API that quickly and securely sends your data straight to our servers. It can be used with most ProxyCrawl products, such as the Crawling API and the Screenshot API, and can even be configured with your Crawler by using the Storage webhook endpoint.
If you already have a ProxyCrawl account and are using the Crawling API to crawl and scrape web pages, you are probably familiar with how to make a simple call and how to pass parameters. For starters, you simply need to add the &store=true parameter to send a copy of the data to your storage.
You can refer to the sample code below:

from urllib.request import urlopen
from urllib.parse import quote

# Use your own token and the URL of the page you wish to crawl.
token = '_USER_TOKEN_'
url = quote('https://www.example.com/', safe='')

# &store=true sends a copy of the crawled page to your cloud storage.
response = urlopen('https://api.proxycrawl.com/?token=' + token + '&url=' + url + '&store=true')
print(response.read())
For the example code given above, just make sure to use your own token and replace the URL with the page that you wish to crawl.
In some use cases, taking a screenshot of a webpage you are crawling is a more efficient way to keep track of visual changes. ProxyCrawl has an API dedicated to just that, and those screenshots can be sent directly to cloud storage as well.
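As a minimal sketch, assuming the Screenshot API’s /screenshots endpoint accepts the same token, url, and store parameters as the Crawling API (the token and target URL below are placeholders):

```python
from urllib.parse import urlencode

# Placeholder token and target; substitute your own values.
token = "_USER_TOKEN_"
target = "https://www.example.com/"

# store=true asks ProxyCrawl to keep a copy of the JPEG in cloud storage.
query = urlencode({"token": token, "url": target, "store": "true"})
api_url = "https://api.proxycrawl.com/screenshots?" + query
print(api_url)

# To actually fetch the screenshot, uncomment:
# from urllib.request import urlopen
# with urlopen(api_url) as resp, open("screenshot.jpg", "wb") as f:
#     f.write(resp.read())
```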
With these few examples, you can see how simple it is to send data to the cloud. That is also why corporate clients can quickly deploy this solution: the API is scalable and can be easily integrated into any existing app or program.
Of course, sending the data is just one part of it, and the convenience and flexibility of cloud storage don’t stop there. Storage can be managed without difficulty via the API or through the user’s web account. From the account, a dedicated dashboard for stored data lets you search for any saved data and shows all of the requests sent from the Crawling API, Crawler, and Screenshot API, including the request headers, with a quick view of each request.
If going to the dashboard is not your thing, or at least not possible in your workflow, ProxyCrawl has prepared some parameters that let you manage your storage via the API.
Any request sent to the Storage API should start with the following base part:

https://api.proxycrawl.com/storage
Each saved request has two identifiers, the URL and the RID, either of which can be used to manage (view or delete) your data.
To view or retrieve a crawled page (HTML or JSON), do an API call as shown below:
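A minimal sketch of such a retrieval call, with placeholder token and RID values (per the identifiers described above, the URL can be passed instead of the RID):

```shell
# Placeholder values; substitute your private token and a real RID.
TOKEN="_USER_TOKEN_"
RID="YOUR_RID"

# Quote the URL so the shell does not treat & as a background operator.
STORAGE_URL="https://api.proxycrawl.com/storage?token=${TOKEN}&rid=${RID}"
echo "$STORAGE_URL"

# To fetch the stored page:
# curl "$STORAGE_URL"
```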
Without looking at the dashboard, you can retrieve the request headers, which contain the URL and RID, by passing the &format= parameter, which accepts html or json as a value.
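For instance, a sketch with placeholder token and RID values, assuming the parameter value is passed in lowercase:

```python
from urllib.parse import urlencode

# Placeholder values; substitute your own token and RID.
params = {"token": "_USER_TOKEN_", "rid": "YOUR_RID", "format": "json"}

# format=json returns the stored result and its headers as JSON rather than raw HTML.
request_url = "https://api.proxycrawl.com/storage?" + urlencode(params)
print(request_url)
```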
Since storage space is limited, you may sometimes want to delete unwanted or old data from the cloud. This can be done quickly by sending a DELETE request with the correct RID and token.
curl -X DELETE "https://api.proxycrawl.com/storage?token=_USER_TOKEN_&rid=RID"
You will get the response below if it was deleted properly:
{ "success": "The Storage item has been deleted successfully" }
If you want to check the total count or the actual number of data saved on your storage, you may send this GET request that includes your private token:
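A sketch of that request with a placeholder token; the /storage/total_count path below is an assumption on my part, so verify it against the Storage API reference:

```shell
# Placeholder token; substitute your private token.
TOKEN="_USER_TOKEN_"

# Assumed endpoint for the stored-item count; check the Storage API docs.
COUNT_URL="https://api.proxycrawl.com/storage/total_count?token=${TOKEN}"
echo "$COUNT_URL"

# To send the GET request:
# curl "$COUNT_URL"
```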
By default, a maximum of 10,000 documents can be stored in the cloud, with retention of up to 14 days; this is currently free upon signing up. It should be plenty for starters or for clients who need to test the service. However, if you need to store more data and have longer data retention, you can opt for the Developer or Business plan. You can learn more about ProxyCrawl’s cloud storage pricing here.
To summarize, cloud storage has obvious advantages over local storage in terms of usability and accessibility. Your files are not just easier to access from anywhere; they also serve as the perfect backup for any project or business, since they are stored at a different location and can be retrieved at any time. It is a great platform that does not require a huge investment of either time or money. Users can count on additional cost savings because storage management, hardware purchases, and extra computational resources are not needed for storing data.
In this article, we showed that the Storage API can be used in conjunction with most ProxyCrawl products, including the Crawling API, the Crawler, and the Screenshot API. You’ve seen how easy it is to save HTML, JSON, or even JPEG results to the cloud with just a few lines of code. We’ve also covered how straightforward it is to manage the storage using the dashboard or through the API.
With ProxyCrawl’s cloud storage solution, you can always stay ahead of rapid storage growth propelled by new data sources and evolving technologies.