Ask HN: What are the best tools for web scraping in 2022?



1

puppeteer

Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium...

#Automated Testing #Browser Testing #Automation 106 social mentions

2

Google

Google Search, also referred to as Google Web Search or simply Google, is a web search engine developed by Google. It is the most used search engine on the World Wide Web.

#Search Engine #Internet Search #Web Search 3737 social mentions

3

WebAutomation.io

Quickly extract data from websites and turn it into APIs or CSV files without writing code.

If you are looking for pre-made tools and don't want to write any code, check out https://webautomation.io.

#Automation #B2B SaaS #Web Scraping 2 social mentions
4

Colly

Colly is a scraping framework to extract structured data from websites.
Pricing:
- Open Source
I’m not sure about “best” but I’ve been using Colly (written in Go) and it’s been pretty slick. Haven’t run into anything it can’t do. http://go-colly.org/.

#Browser Testing #Automated Testing #Development 9 social mentions
5

Hacker News

Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator.
Pricing:
- Open Source
For things like regular expressions, it's useful to know that Python has a "-c" option which can be passed a multi-line string as part of a CLI pipeline. You can do something like this: curl 'https://news.ycombinator.com/' | python -c '.
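
The command quoted above is cut off in the source, so it is not reconstructed here. As a separate, minimal sketch of the same idea (a regular expression applied to fetched HTML), the snippet below pulls absolute links out of a page. The pattern `LINK_RE`, the function name `extract_links`, and the sample markup are all assumptions made for illustration; a regex like this only suits quick one-off jobs, not robust HTML parsing.

```python
import re

# Hypothetical pattern: captures absolute http(s) URLs inside href="..." attributes.
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def extract_links(html):
    """Return absolute links in order of first appearance, without duplicates."""
    # dict.fromkeys keeps insertion order while dropping repeated URLs.
    return list(dict.fromkeys(LINK_RE.findall(html)))


# Made-up sample markup: one absolute link (repeated) and one relative link.
sample = (
    '<a href="https://example.com/a">A</a> '
    '<a href="/local">L</a> '
    '<a href="https://example.com/a">again</a>'
)
print(extract_links(sample))
```

In a pipeline, the same function could be fed `sys.stdin.read()` instead of the inline sample, so that the output of `curl` is piped straight into it.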

#Social Networks #Social News #Startups 659 social mentions

6

CommonCrawl

Common Crawl

You can also use the common crawl dataset. https://commoncrawl.org/.
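
One way to start with the dataset without downloading whole archives is to ask the Common Crawl URL index which captures exist for a site. The sketch below only builds such a query URL; it does not send it. The host `index.commoncrawl.org`, the crawl label `CC-MAIN-2022-33`, and the `url`, `output`, and `limit` parameters are assumptions based on the public index server and should be checked against the current Common Crawl documentation. The function name `build_index_query` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed base URL of the Common Crawl index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern, limit=None):
    """Build a query URL asking the index which captures match url_pattern."""
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        # Optional cap on the number of index records returned.
        params["limit"] = str(limit)
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


# The crawl label is an assumption; each crawl has its own label.
print(build_index_query("CC-MAIN-2022-33", "example.com/*", limit=5))
```

Each record in the response is expected to point at a location inside an archive file, which can then be fetched on its own instead of crawling the live site.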

#Search Engine #Internet Search #Web Search 97 social mentions

7

Apify

Apify is a web scraping and automation platform that can turn any website into an API.

I'm working on a personal project that involves A LOT of scraping, and through several iterations I've gotten some stuff that works quite well. Here's a quick summary of what I've explored (both paid and free):

* Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.
* I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order of magnitude less expensive than running directly on their platform.
* Bright Data (née Luminati) is excellent for cheap data center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude more expensive if you need anything more thorough than data center proxies.
* For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.
* If the site you're scraping is using any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many, many hours fighting anti-bot protection myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't really use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.

Always looking for new techniques and tools, so I'll monitor this thread closely.
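
The "ping the APIs directly" case from the list above can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not the commenter's code: the function name `fetch_json`, the `User-Agent` string, and the endpoint in the usage comment are all hypothetical.

```python
import json
import urllib.request


def fetch_json(url, user_agent="example-crawler/0.1", timeout=10.0):
    """Fetch a URL that is expected to return JSON and decode the body."""
    request = urllib.request.Request(
        url,
        headers={
            # Hypothetical identifier; use one that describes your own client.
            "User-Agent": user_agent,
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (hypothetical endpoint, not taken from the original comment):
#   items = fetch_json("https://api.example.com/v1/items?page=1")
```

When a site exposes a JSON endpoint like this, no HTML parsing or headless browser is involved, which is why the commenter skips the scraping stack for those crawls.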

#Web Scraping #Data Extraction #Web Crawling 26 social mentions
8

SerpApi

Scrape Google search results from our fast, easy, and complete API.
Pricing:
- Freemium
We've built https://serpapi.com. We invented what you are referring to as "data type specific APIs": APIs that abstract away all proxy issues, captcha solving, support for various layouts, even scraping-related legal issues, and much more, returning a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status. I think the next battle will still be a legal one, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this area, and we are proud to be a significant yearly contributor to the EFF.

#SEO #SEO Tools #APIs 76 social mentions
9

Playwright

Playwright is a Node.js library that automates Chromium, Firefox, and WebKit through a single API.
Pricing:
- Open Source
Node.js and cheerio are what came to my mind too. I heard the team behind Puppeteer moved from Google to Microsoft and started the Playwright project, which has a more ergonomic API and better cross-browser support (Chromium, WebKit, and Firefox). https://playwright.dev/.

#Development #Tool #Browser Testing 282 social mentions
10

Simple Scraper

Extract data from any website in seconds — download instantly, scrape in the cloud, or create an API.
Pricing:
- Freemium
- $30.0 / Monthly (6,000 credits)
A good no-code solution is https://simplescraper.io. Leans towards non-developers but there's an API too.

#Web Scraping #API Tools #Scraper 21 social mentions