Software Alternatives & Reviews

CommonCrawl Reviews and details

Social recommendations and mentions

We have tracked the following product recommendations or mentions on various public social media platforms and blogs. They can help you see what people think about CommonCrawl and what they use it for.
  • Ask HN: How does one implement web plagiarism?
    https://commoncrawl.org/ is a non-profit which offers a pre-crawled dataset. The specifics of individual tools probably vary. I imagine most tools would be based on academic datasets. - Source: Hacker News / 4 months ago
  • Things are about to get a lot worse for Generative AI
    Should the NYT not sue https://commoncrawl.org/ ? OpenAI just used the data from commoncrawl for training. - Source: Hacker News / 4 months ago
  • Indexing a Billion Pages
    What you’re likely referring to is Common Crawl: https://commoncrawl.org. - Source: Hacker News / 4 months ago
  • Interview with Viktor Lofgren from Marginalia Search
    > ... a project called "Nutch" would allow web users to crawl the web themselves. Perhaps that promise is similar to the promises being made about "AI" today. The project did not turn out to be used in the way it was predicted (marketed), or even used by web users at all. Actually Nutch is used to produce the Common Crawl[0] and 60% of GPT-3's training data was Common Crawl[1], so in a way it is being used... - Source: Hacker News / 5 months ago
  • Google's Plan to Stop Apple from Getting Serious About Search
    > Let's share the index as public data Common crawl[1] data has been in AWS for over a decade. [1]: https://commoncrawl.org. - Source: Hacker News / 6 months ago
  • OpenAI is too cheap to beat
    Common Crawl claims to have 82% of the tokens used to train GPT-3, and it's available to anyone. Add all the downloadable material at archive.org and you've got a formidable corpus. https://commoncrawl.org/. - Source: Hacker News / 7 months ago
  • So Far, AI Is a Money Pit That Isn't Paying Off
    With all these models trained on a corpus of texts from the Internet[1], what's going to happen in 5 years when the internet is full of text generated after January 2022, the current cutoff date for ChatGPT? As an increasing proportion of training data is either pre-2022 or LLM-generated confabulations, when will it all collapse in on itself? Do these companies have any plans for how they are going to get... - Source: Hacker News / 7 months ago
  • Ask HN: Where to find free public datasets besides Kaggle/BigQuery/HuggingFace?
    The one all the cool kids use: https://commoncrawl.org/. - Source: Hacker News / 8 months ago
  • How did OpenAI crawl the web so effectively without getting blocked?
    It didn't. Others did it, OpenAI used it. https://commoncrawl.org/. - Source: Hacker News / 9 months ago
  • OpenAI and Microsoft Sued for $3 Billion Over Alleged ChatGPT 'Privacy Violations'
    Interesting to know about this https://commoncrawl.org/ and https://huggingface.co/datasets/tiiuae/falcon-refinedweb. Thus, not only Open AI uses public data. I don't understand the point of the lawsuit. Source: 10 months ago
  • [FREE] Beware of scammer u/Zackkkbbb
    AI companies? They don't use reddit's API at all for AI scraping. They all use https://commoncrawl.org/, so as long as reddit is indexable by web browsers, it's being scraped for use by AI companies. Source: 10 months ago
  • Is it against the Reddit rules to scrape information from the way back machine?
    Commoncrawl CDX API is unreliable; currently mainly scanning WARC and ARC files from S3. Source: 10 months ago
  • Personal GPT: A tiny AI Chatbot that runs fully offline on your iPhone
    The hallucinations are coming from the LLM interpolating from the training data, substantial portions of which are scraped off of the internet. Because other peoples' prompts never leave their devices (this app makes no internet connections). Source: 10 months ago
  • Reddit removed moderators behind the latest protests before restoring a few
    I really need a TL;DR on everything that's happened over the past month! Is it correct that Reddit hiked its API prices in response to realising their data is worth a fortune (since it can be used to train/update LLMs)? Is this understanding correct? Wouldn't it be possible to do a 'Common Crawl' [1] just for reddit, or would the lag (of perhaps a few minutes) make third-party reddit apps too slow? (if LinkedIn... - Source: Hacker News / 10 months ago
  • What's Your Take on Reddit's Plan to Sell Your Posts and Comments as "Data Licensing" to Other Companies to Train AI Language Models?
    Haven’t heard of https://commoncrawl.org or did you crawl something special? Source: 10 months ago
  • What Reddit Got Wrong
    Reddit and StackOverflow notably. Most use https://commoncrawl.org/. - Source: Hacker News / 11 months ago
  • If you value the open-source movement, donate to Common Crawl
    I'm not affiliated at all, I just see the importance. This is one of the most important resources we need to maintain if we're going to keep the open-source AI movement alive. https://commoncrawl.org/. Source: 11 months ago
  • Building a ChatGPT-like chatbot for Bengali. Please suggest good sources for a lot of Bengali text data.
    How many tokens are you hoping for? Do you intend to train from scratch or finetune on some model? I would start with the data sources cited in this paper. There is also this monolingual corpus that I read about some time ago. Also, I would use common crawl and perhaps use fastText on it for language identification. There is also this thing called a Bangladesh National Corpus or something that someone wrote a... Source: 11 months ago
  • Reddit Laying Off About 90 Employees and Slowing Hiring Amid Restructuring: Moves aim to help social-media company break even next year
    Web scraping is protected by US laws; this is why AI companies all share a common prescraped trove called Common Crawl. Source: 11 months ago
  • Writers' Call for OTW and Ao3 Transparency, Clarity, and Stance on A.I.
    There are programs that go onto websites and basically open every page on the website, downloading all or some of the text, and storing them. This process is known as "scraping". When people talk about it in the context of AI, they're referring to https://commoncrawl.org/. Source: 12 months ago
  • ChatGPT Data Breach BreakDown - Why it Should be a Concern for Everyone!
    The dataset they use is the Common Crawl database, https://commoncrawl.org/; you can just ask ChatGPT and it will tell you. Source: 12 months ago

This is an informative page about CommonCrawl. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated as they help everyone in the community to make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.