The web content provides a review of four powerful data collection tools suitable for various user needs in 2022.
Abstract
The article discusses the importance of data collection in the modern digital landscape, drawing an analogy from the HBO series "Silicon Valley" where a fictional app called "SeeFood" required extensive data to function. It then introduces four data collection tools: ScrapingAnt, ScraperAPI, Octoparse, and Bright Data, highlighting their features, ease of use, team support, and pricing models. ScrapingAnt and ScraperAPI are API-based solutions that support multiple programming languages and offer free plans. Octoparse caters to non-coders with its user-friendly interface and auto-detection features. Bright Data provides a comprehensive suite of data extraction and management tools, including a data collector and pre-collected datasets. The article also suggests that outsourcing data collection to services like ScrapeIt can be time-saving for businesses.
Opinions
The author finds ScrapingAnt comfortable for developers, citing its client library for Node.js and ease of integration.
Team support is emphasized for ScrapingAnt, with the team being described as supportive and capable of creating custom scraping scripts.
ScraperAPI is compared to ScrapingAnt, noting its different pricing models and ease of customization with API requests.
Octoparse is recommended for beginners in web scraping and for those who prefer a GUI-based tool over writing code.
The author experienced a learning curve with Octoparse for complex scraping tasks but suggests that seeking team support could mitigate this.
Bright Data is praised for its extensive data collection infrastructure and services, though it's noted that their Data Collector is limited to publicly available data.
The author suggests that while these tools are valuable, outsourcing to a service like ScrapeIt might be preferable for those looking to save time and effort.
The article concludes with an invitation to join the author's newsletter for exclusive content related to engineering, technology, and leadership.
Web Scraping
4 Data Collection Tools You Should Not Miss In 2022
Information is a key for both survival and growth. Here are some of the best tools in the market to get it.
Erlich pitches Jian-Yang’s app as a “Shazam for food” — HBO’s Silicon Valley (source)
While watching HBO’s Silicon Valley, I enjoyed the drama and comedy of the series. But what caught my attention the most were some ingenious product ideas, such as SeeFood.
During his pitch to a VC firm, Coleman Blair, Erlich realized that he was facing a serious problem. One of his programmers, Jian-Yang, wanted to talk about an app that promotes his grandmother’s eight secret octopus recipes (SeaFood), which did not match the VC firm’s expectation of a camera-based app.
“Like you take a photo of food, the app returns nutritional information or recipes or how it was sourced,” said the venture capitalist, explaining his expectation.
After hearing this last sentence, Erlich felt inspired. Without hesitation, he announced that their project was in fact SeeFood and not SeaFood: a “Shazam for food,” as in food that you can see.
That announcement saved them some hassle. But later, they realized they needed lots of data to train the machine-learning algorithm that would allow the app to recognize and identify a meal. They planned a large-scale task to scrape images from the web, but they struggled to achieve it with the budget and the short time they had.
If they had used the appropriate tools and infrastructure, they could have succeeded in their mission.
You’re probably not going to build the next “Shazam for food,” but just like Erlich and Jian-Yang, you may have faced, or may yet face, situations where you need to collect and analyze public data and draw meaningful insights from it, whether to make important decisions or to automate and boost your business.
To help you with such an endeavor, I’m going to share in this post my review of four powerful data collection products that offer distinct features to cover different end-user needs.
ScrapingAnt is a scraping API that runs a cluster of hundreds of Chrome browsers under the hood and offers a large proxy pool, including rotating and residential proxies, from all around the world, so that users can scrape websites without being blocked. Because it can scrape thousands of pages in parallel, it extracts data much faster than a single computer could.
It supports Python, JavaScript, and any programming language that can make API calls.
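To make the idea of scraping pages in parallel concrete, here is a minimal Python sketch of the client side: it fans a list of URLs out over a thread pool. The names scrape_in_parallel and fake_fetch are hypothetical and not part of any ScrapingAnt library; fake_fetch only stands in for whatever call sends one request through the scraping API, so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def scrape_in_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every URL concurrently and return a {url: content} mapping.

    `fetch` is a hypothetical callable that sends one request through the
    scraping API and returns the page body as text.
    """
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        bodies = list(pool.map(fetch, url_list))
    return dict(zip(url_list, bodies))


def fake_fetch(url: str) -> str:
    # Stand-in for a real API call (assumption, for illustration only).
    return f"<html>content of {url}</html>"


if __name__ == "__main__":
    pages = scrape_in_parallel(
        ["https://example.com/a", "https://example.com/b"], fake_fetch
    )
    for url, body in pages.items():
        print(url, len(body))
```

In a real project you would replace fake_fetch with a function that calls the API, and keep max_workers within the concurrency your plan allows.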
As a developer, I find ScrapingAnt very comfortable. I used it by installing its client library scrapingant-client in my Node.js project, then writing the code in my IntelliJ editor and running it locally in my terminal with the node command:
node my-scraper.js
In this video, you see an example of how I used it:
Team support
The team is very supportive and friendly and can even create custom scraping scripts for you if you’re not a developer. You just need to send them an email describing your requirements.
Price
Free plan with 10,000 API credits. This plan includes JavaScript rendering, custom cookies, output preprocessing, basic email support, documentation.
Enthusiast for $19/month: 100K API calls
Startup for $49/month: 500K API calls, expert assistance.
Business for $249/month: 3M API calls, expert guidance.
ScraperAPI is comparable to ScrapingAnt, with a slightly different pricing model. It supports Python, JavaScript, PHP, Ruby, and Java and offers an API that is easy to customize.
To enable JS rendering, IP geolocation, residential proxies, and more features, all you need to do is add parameters such as &render=true, &country_code=us, and &premium=true to your request.
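As a rough illustration of how those parameters end up in a request, the Python sketch below builds the request URL with the standard library only. The parameter names render, country_code, and premium come from the text above; the base endpoint, the api_key and url parameter names, and the helper build_request_url are my assumptions, so check them against the ScraperAPI documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base endpoint; verify it in the ScraperAPI documentation.
SCRAPERAPI_ENDPOINT = "http://api.scraperapi.com/"


def build_request_url(api_key, target_url, render=False,
                      country_code=None, premium=False):
    """Return the API URL for scraping `target_url` (hypothetical helper)."""
    # `api_key` and `url` are assumed parameter names.
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"              # enable JS rendering
    if country_code:
        params["country_code"] = country_code  # IP geolocation
    if premium:
        params["premium"] = "true"             # residential proxies
    # urlencode percent-encodes the target URL so it survives as one value.
    return SCRAPERAPI_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url("YOUR_API_KEY", "https://example.com/",
                            render=True, country_code="us"))
```

Keeping the optional features as keyword arguments means a plain request stays short, and each extra feature is one flag away.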
Price
Hobby for $29/month: 250K API calls, 10 concurrent threads.
Startup for $99/month: 1M API calls, 25 concurrent threads, geotargeting.
Business for $249/month: 3M API calls, 50 concurrent threads, geotargeting, residential proxies, JS rendering.
Custom pricing model for enterprise: unlimited API calls and concurrent threads, geotargeting, residential proxies, JS rendering, custom anti-bot bypasses.
There is a 10% discount for a sign-up with the coupon WEBENIUS10.
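To compare the two API providers on an equal footing, it can help to convert each plan into a price per 1,000 API calls. The short Python sketch below does that with the prices and quotas listed in this article; it ignores the coupon, the free plan, and any feature differences between plans, and the helper name cost_per_thousand_calls is mine.

```python
# Monthly price in USD and included API calls, as listed in this article.
PLANS = {
    "ScrapingAnt Enthusiast": (19, 100_000),
    "ScrapingAnt Startup": (49, 500_000),
    "ScrapingAnt Business": (249, 3_000_000),
    "ScraperAPI Hobby": (29, 250_000),
    "ScraperAPI Startup": (99, 1_000_000),
    "ScraperAPI Business": (249, 3_000_000),
}


def cost_per_thousand_calls(price_usd, api_calls):
    """Return the plan price per 1,000 API calls, in USD."""
    return price_usd / api_calls * 1000


if __name__ == "__main__":
    for name, (price, calls) in PLANS.items():
        print(f"{name}: ${cost_per_thousand_calls(price, calls):.3f} per 1K calls")
```

The unit price is only one factor; concurrency limits, geotargeting, and JS rendering differ between plans and may matter more for your use case.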
Octoparse is a great tool to start your web scraping journey, especially if you’re not a coder. If you’re a developer, it could save you time, which it did for me for some scraping tasks.
The tool is user-friendly and has a powerful auto-detection feature and a lot of guiding messages that are automatically adjusted after each action you take while you’re creating your scraping workflow.
If you have a complex scraping scenario, you may need more training with the tool to know how to correctly set up your data collection workflow and learn more details like:
How to scrape data that requires authentication.
How to add required fields that are not auto-detected.
How to auto-convert, format, and transform some fields with the “clean data” feature.
and so on
I had such a complex case. After struggling for two or three days to fix my scraping task (without asking the team for support), I realized that it would have been much easier and less time-consuming if I had used a scraping API, like ScrapingAnt, and written the required scraping code myself.
If you find yourself in such a situation and can’t afford to invest time in learning the tool, don’t repeat my mistake: ask the team for help.
Team support
The team is supportive. When I needed their help, I sent them a message from within the tool and they got back to me by email. The rest of the questions and answers, until my issue was solved, were also exchanged by email.
Bright Data (formerly Luminati) bills itself as the world’s #1 web data platform. It allows you to scrape public web data by offering a pool of tools and services that includes a data collection infrastructure and ready-made datasets.
Bright Data’s products can be divided into four main categories. Each category has its own characteristics and serves as the technological basis for the categories built on top of it:
This category contains an extensive network of IP addresses and 4 types of proxies: each type is unique and is suitable for specific tasks.
2. Data Extraction and Management
This category includes the Web Unlocker, the Proxy Manager, and the Search Engine Crawler.
These products allow you to manage your proxies and collect web data using unique technologies.
3. Data Collector
To collect data with this automated service, you can:
Choose and run one of the existing web scraping templates,
Request a new custom collector,
Or implement a new one yourself in the code editor of their web app. There, the source code is divided into different views, which I found less comfortable than using the ScrapingAnt API in my local code editor, which feels more natural to me as a developer.
This video shows a scenario of scraping the list of Top Writers from Medium using Bright Data’s Data Collector service.
Note: The Data Collector can retrieve only publicly available data that does not require authentication. For example, if you need a list of experts from LinkedIn, and LinkedIn asks for your username and password before showing that list, you may not be able to enter them with Bright Data’s collector, unlike with the other providers mentioned here.
4. Datasets
Datasets are pre-collected, large-scale snapshots of entire websites that cover a wide range of data, like companies, people, and social media influencers. This is useful for many use cases, such as identifying and analyzing trends, optimizing your eCommerce activity, or even getting data to feed your machine learning algorithms. If the existing collections are not suitable for your needs, you can request a new dataset.
Team support
The team is supportive. When I needed their help, I sent them a message from within their web app and they got back to me by email and Skype. The rest of the questions and answers, until my issue was solved, were exchanged over Skype.
Price
Bright Data offers slightly different pricing models from the other providers: pay as you go, yearly subscription, and monthly subscription.
The monthly option for the Data Collector includes the following plans:
Experimenting for $350/month: from $3.50/CPM, 100K page loads.
Starter for $750/month: from $3.00/CPM, 250K page loads.
Production for $1,250/month: from $2.50/CPM, 500K page loads.
Plus for $2K/month: from $2.00/CPM, 1M page loads.
Custom pricing model for enterprise
The rotating residential proxies are available to test with a 7-day free trial.
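The monthly Data Collector plans above can be cross-checked with a little arithmetic. Assuming CPM means the price per 1,000 page loads (my reading, not something Bright Data’s listing spells out here), each monthly price is simply the CPM rate multiplied by the included page loads divided by 1,000. The Python sketch below verifies this for the four listed plans; the helper name monthly_cost is hypothetical.

```python
# (CPM rate in USD, included page loads, listed monthly price in USD),
# taken from the plan list in this article.
PLANS = {
    "Experimenting": (3.50, 100_000, 350),
    "Starter": (3.00, 250_000, 750),
    "Production": (2.50, 500_000, 1250),
    "Plus": (2.00, 1_000_000, 2000),
}


def monthly_cost(cpm_rate_usd, page_loads):
    """Cost of `page_loads` at a given CPM rate.

    Assumes CPM is the price per 1,000 page loads.
    """
    return cpm_rate_usd * page_loads / 1000


if __name__ == "__main__":
    for name, (cpm, loads, listed) in PLANS.items():
        computed = monthly_cost(cpm, loads)
        print(f"{name}: computed ${computed:,.0f}, listed ${listed:,}")
```

Since the rates are listed as "from" prices, treat the result as the cost of the included volume; page loads beyond the quota may be billed differently.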
One More Thing
Although the above products are great ones to have in your toolbox as a software developer, a data scientist, a machine learning enthusiast, a business owner, a marketer, or even a content creator, you can always save your time and effort by outsourcing your data collection tasks to a web scraping service like ScrapeIt.
ScrapeIt offers:
Up to 100,000 rows for $220 if it’s one-time data delivery.
Up to 100,000 rows for $170/mo if it’s recurring data delivery.
From $1000 for a custom solution.
I hope this list of tools and services gives you some guidance and helps you to make the best decision regarding how to get the information you need to achieve your endeavors.
Want more?
I write about engineering, technology, and leadership for a community of smart, curious people 🧠💡. Join my free email newsletter for exclusive access or sign up for Medium here.