Software Alternatives & Reviews

Ask HN: What are the best tools for web scraping in 2022?

puppeteer Google WebAutomation.io Colly Hacker News CommonCrawl Apify SerpApi Playwright Simple Scraper
  1. Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium...

    #Automated Testing #Browser Testing #Software Development 102 social mentions

  2. Google
    Google Search, also referred to as Google Web Search or simply Google, is a web search engine developed by Google. It is the most used search engine on the World Wide Web.

    #Search Engine #Internet Search #Web Search 3693 social mentions

  3. Quickly extract data from websites and turn it into APIs or CSV without writing code
    If you are looking for pre-made tools and don't want to write any code, check our https://webautomation.io.

    #Automation #B2B SaaS #Web Scraping 2 social mentions

  4. Colly
    Colly is a scraping framework to extract structured data from websites.
    Pricing:
    • Open Source
    I’m not sure about “best” but I’ve been using Colly (written in Go) and it’s been pretty slick. Haven’t run into anything it can’t do. http://go-colly.org/.

    #Web Scraping #Data Extraction #Data 9 social mentions

  5. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator.
    Pricing:
    • Open Source
    For things like regular expressions, it's useful to know that Python has a "-c" option which can be passed a multi-line string as part of a CLI pipeline. You can do something like this: curl 'https://news.ycombinator.com/' | python -c '.

    #Social Networks #Social News #Startups 500 social mentions

  6. Common Crawl
    You can also use the common crawl dataset. https://commoncrawl.org/.

    #Search Engine #Web Scraping #Data Extraction 91 social mentions

  7. Apify
    Apify is a web scraping and automation platform that can turn any website into an API.
    I'm working on a personal project that involves A LOT of scraping, and through several iterations I've gotten some stuff that works quite well. Here's a quick summary of what I've explored (both paid and free):
    * Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.
    * I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order of magnitude less expensive than running directly on their platform.
    * Bright Data (née Luminati) is excellent for cheap data center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude more expensive if you need anything more thorough than data center proxies.
    * For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.
    * If the site you're scraping is using any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many, many hours fighting anti-bot protection myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't really use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.
    Always looking for new techniques and tools, so I'll monitor this thread closely.

    #Web Scraping #Data Extraction #Data 21 social mentions

  8. Scrape Google search results from our fast, easy, and complete API.
    Pricing:
    • Open Source
    We've built https://serpapi.com. We invented the industry of what you're referring to as "data type specific APIs": APIs that abstract away proxy issues, CAPTCHA solving, support for various layouts, even scraping-related legal issues, and much more, into a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status. I think the next battle will still be legal, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this area, and we are proud to be a significant yearly contributor to the EFF.

    #SEO #SEO Tools #APIs 69 social mentions

  9. Playwright is a Node.js library for automating Chromium, Firefox, and WebKit with a single API.
    Pricing:
    • Open Source
    Node.js and cheerio is what came to my mind too. I heard the team behind Puppeteer moved from Google to Microsoft, and started the project Playwright, which has a more ergonomic API and better cross-browser support (Chromium, WebKit, and Firefox). https://playwright.dev/.

    #Development #Tool #Browser Testing 231 social mentions

  10. Extract data from any website in seconds — download instantly, scrape in the cloud, or create an API.
    Pricing:
    • Freemium
    • $30.0 / Monthly (6,000 credits)
    A good no-code solution is https://simplescraper.io. Leans towards non-developers but there's an API too.

    #Web Scraping #API Tools #Scraper 18 social mentions
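
The `python -c` tip quoted in the Hacker News entry above is cut off, so the following is only a sketch of the same idea, not the commenter's original snippet: read HTML from standard input and pull out links with a regular expression. The names `LINK_RE`, `extract_links`, and `main`, the file name `extract_links.py`, and the link pattern itself are assumptions made for illustration. A regex is a fragile way to read HTML, and a real parser is usually safer for anything beyond quick one-off pipelines.

```python
import re
import sys

# Assumed pattern: an <a> tag whose href is an absolute http(s) URL.
# Relative links (for example "item?id=1") are deliberately skipped.
LINK_RE = re.compile(r'<a\s[^>]*href="(https?://[^"]+)"[^>]*>(.*?)</a>', re.S)


def extract_links(html):
    """Return (url, text) pairs for each absolute link found in html."""
    pairs = []
    for url, inner in LINK_RE.findall(html):
        # Drop any nested tags so only the visible link text remains.
        text = re.sub(r"<[^>]+>", "", inner).strip()
        pairs.append((url, text))
    return pairs


def main(stream=sys.stdin):
    """Read HTML from a stream (stdin by default) and print one link per line."""
    for url, text in extract_links(stream.read()):
        print(f"{url}\t{text}")
```

Saved as the hypothetical `extract_links.py` in the current directory, it could slot into a pipeline as `curl -s 'https://news.ycombinator.com/' | python -c 'import extract_links; extract_links.main()'`.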
