https://commoncrawl.org/ is a non-profit which offers a pre-crawled dataset. The specifics of individual tools probably vary. I imagine most tools would be based on academic datasets. - Source: Hacker News / 4 months ago
Should the NYT not sue https://commoncrawl.org/ ? OpenAI just used the data from commoncrawl for training. - Source: Hacker News / 4 months ago
What you’re likely referring to is Common Crawl: https://commoncrawl.org. - Source: Hacker News / 4 months ago
> ... a project called "Nutch" would allow web users to crawl the web themselves. Perhaps that promise is similar to the promises being made about "AI" today. The project did not turn out to be used in the way it was predicted (marketed), or even used by web users at all. Actually Nutch is used to produce the Common Crawl[0] and 60% of GPT-3's training data was Common Crawl[1], so in a way it is being used... - Source: Hacker News / 5 months ago
> Let's share the index as public data Common Crawl[1] data has been in AWS for over a decade. [1]: https://commoncrawl.org. - Source: Hacker News / 6 months ago
Common Crawl claims to have 82% of the tokens used to train GPT-3, and it's available to anyone. Add all the downloadable material at archive.org and you've got a formidable corpus. https://commoncrawl.org/. - Source: Hacker News / 7 months ago
With all these models trained on a corpus of texts from the Internet[1], what's going to happen in 5 years when the internet is full of text generated after January 2022, the current cutoff date for ChatGPT? As an increasing proportion of training data is either pre-2022 or LLM-generated confabulations, when will it all collapse in on itself? Do these companies have any plans for how they are going to get... - Source: Hacker News / 7 months ago
The one all the cool kids use: https://commoncrawl.org/. - Source: Hacker News / 8 months ago
It didn't. Others did it, OpenAI used it. https://commoncrawl.org/. - Source: Hacker News / 9 months ago
Interesting to know about this https://commoncrawl.org/ and https://huggingface.co/datasets/tiiuae/falcon-refinedweb. So OpenAI is not the only one using public data. I don't understand the point of the lawsuit. Source: 10 months ago
AI companies? They don't use reddit's API at all for AI scraping. They all use https://commoncrawl.org/, so as long as reddit is indexable by web crawlers, it's being scraped for use by AI companies. Source: 10 months ago
The Common Crawl CDX API is unreliable—currently mainly scanning WARC and ARC files from S3. Source: 10 months ago
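For context on the comment above: the CDX index maps a URL to the byte offset and length of its capture inside a gzipped WARC file, so a single record can be fetched with an HTTP Range request instead of downloading the whole archive. A minimal sketch of that translation, using a hypothetical CDX record (the field names match what the index at index.commoncrawl.org returns; the values here are made up for illustration):

```python
import json

# One CDX index record as a JSON line (values are illustrative only).
CDX_LINE = json.dumps({
    "url": "https://commoncrawl.org/",
    "timestamp": "20230601000000",
    "status": "200",
    "offset": "1000",
    "length": "500",
    "filename": "crawl-data/CC-MAIN-2023-23/segments/example/warc/example.warc.gz",
})

def warc_range_request(cdx_line: str) -> dict:
    """Turn a CDX record into the HTTP Range request that fetches just
    that one gzipped WARC record from Common Crawl's public bucket."""
    rec = json.loads(cdx_line)
    offset, length = int(rec["offset"]), int(rec["length"])
    return {
        "url": "https://data.commoncrawl.org/" + rec["filename"],
        # HTTP byte ranges are inclusive on both ends.
        "headers": {"Range": f"bytes={offset}-{offset + length - 1}"},
    }
```

Scanning the WARC files directly from S3, as the commenter describes, skips the index entirely at the cost of reading far more data.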
The hallucinations are coming from the LLM interpolating from the training data, substantial portions of which are scraped off of the internet. Because other peoples' prompts never leave their devices (this app makes no internet connections). Source: 10 months ago
I really need a TL;DR on everything that's happened over the past month! Is it correct that Reddit hiked its API prices in response to realising their data is worth a fortune (since it can be used to train/update LLMs)? Is this understanding correct? Wouldn't it be possible to do a 'Common Crawl' [1] just for reddit, or would the lag (of perhaps a few minutes) make third-party reddit apps too slow? (if LinkedIn... - Source: Hacker News / 10 months ago
Haven’t heard of https://commoncrawl.org or did you crawl something special? Source: 10 months ago
Reddit and StackOverflow notably. Most use https://commoncrawl.org/. - Source: Hacker News / 11 months ago
I'm not affiliated at all, I just see the importance. This is one of the most important resources we need to maintain if we're going to keep the open-source AI movement alive. https://commoncrawl.org/. Source: 11 months ago
How many tokens are you hoping for? Do you intend to train from scratch or finetune on some model? I would start with the data sources cited in this paper. There is also this monolingual corpus that I read about some time ago. Also, I would use common crawl and perhaps use fastText on it for language identification. There is also this thing called a Bangladesh National Corpus or something that someone wrote a... Source: 11 months ago
Web scraping is protected by US law, which is why AI companies all share a common pre-scraped trove called Common Crawl. Source: 11 months ago
There are programs that go onto websites and basically open every page on the website, downloading all or some of the text, and storing them. This process is known as "scraping". When people talk about it in the context of AI, they're referring to https://commoncrawl.org/. Source: 12 months ago
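The scraping described above boils down to two steps: downloading a page and extracting its visible text. A minimal sketch of the extraction step using only Python's standard library (the fetch is omitted; the HTML string here is a stand-in for a downloaded page):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Stand-in for a fetched page.
page = ("<html><head><style>p{}</style></head>"
        "<body><h1>Hello</h1><p>World</p><script>x=1</script></body></html>")
extractor = TextExtractor()
extractor.feed(page)
print(" ".join(extractor.parts))  # → Hello World
```

Real crawlers layer politeness (robots.txt, rate limits) and deduplication on top of this; Common Crawl stores the raw responses in WARC files so everyone can do the extraction themselves.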
The dataset they use is the Common Crawl database (https://commoncrawl.org/); you can just ask ChatGPT and it will tell you. Source: 12 months ago
Do you know an article comparing CommonCrawl to other products?
Suggest a link to a post with product alternatives.
This is an informative page about CommonCrawl. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated, as they help everyone in the community make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.