https://commoncrawl.org/ is a non-profit which offers a pre-crawled dataset. The specifics of individual tools probably vary. I imagine most tools would be based on academic datasets. - Source: Hacker News / 4 months ago
Should the NYT not sue https://commoncrawl.org/ ? OpenAI just used the data from commoncrawl for training. - Source: Hacker News / 4 months ago
What you’re likely referring to is Common Crawl: https://commoncrawl.org. - Source: Hacker News / 4 months ago
> ... a project called "Nutch" would allow web users to crawl the web themselves. Perhaps that promise is similar to the promises being made about "AI" today. The project did not turn out to be used in the way it was predicted (marketed), or even used by web users at all. Actually Nutch is used to produce the Common Crawl[0] and 60% of GPT-3's training data was Common Crawl[1], so in a way it is being used... - Source: Hacker News / 5 months ago
> Let's share the index as public data Common Crawl[1] data has been in AWS for over a decade. [1]: https://commoncrawl.org. - Source: Hacker News / 6 months ago
Common Crawl claims to have 82% of the tokens used to train GPT-3, and it's available to anyone. Add all the downloadable material at archive.org and you've got a formidable corpus. https://commoncrawl.org/. - Source: Hacker News / 7 months ago
With all these models trained on a corpus of texts from the Internet[1], what's going to happen in 5 years when the internet is full of text generated after January 2022, the current cutoff date for ChatGPT? As an increasing proportion of training data is either pre-2022 or LLM-generated confabulations, when will it all collapse in on itself? Do these companies have any plans for how they are going to get... - Source: Hacker News / 7 months ago
The one all the cool kids use: https://commoncrawl.org/. - Source: Hacker News / 8 months ago
It didn't. Others did it, OpenAI used it. https://commoncrawl.org/. - Source: Hacker News / 9 months ago
Interesting to know about this https://commoncrawl.org/ and https://huggingface.co/datasets/tiiuae/falcon-refinedweb. So OpenAI is not the only one using public data. I don't understand the point of the lawsuit. Source: 10 months ago
AI companies? They don't use reddit's API at all for AI scraping. They all use https://commoncrawl.org/, so as long as reddit is indexable by web crawlers, it's being scraped for use by AI companies. Source: 10 months ago
The Common Crawl CDX API is unreliable—currently mainly scanning WARC and ARC files from S3. Source: 10 months ago
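For context on the comment above: the CDX index maps a URL to the byte offset and length of its capture inside a gzipped WARC file, so a single record can be fetched with an HTTP Range request instead of downloading the whole archive. A minimal sketch of that translation, using a hypothetical CDX record (the field names match what the index at index.commoncrawl.org returns; the values here are made up for illustration):

```python
import json

# One CDX index record as a JSON line (values are illustrative only).
CDX_LINE = json.dumps({
    "url": "https://commoncrawl.org/",
    "timestamp": "20230601000000",
    "status": "200",
    "offset": "1000",
    "length": "500",
    "filename": "crawl-data/CC-MAIN-2023-23/segments/example/warc/example.warc.gz",
})

def warc_range_request(cdx_line: str) -> dict:
    """Turn a CDX record into the HTTP Range request that fetches just
    that one gzipped WARC record from Common Crawl's public bucket."""
    rec = json.loads(cdx_line)
    offset, length = int(rec["offset"]), int(rec["length"])
    return {
        "url": "https://data.commoncrawl.org/" + rec["filename"],
        # HTTP byte ranges are inclusive on both ends.
        "headers": {"Range": f"bytes={offset}-{offset + length - 1}"},
    }
```

Scanning the WARC files directly from S3, as the commenter describes, skips the index entirely at the cost of reading far more data.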
The hallucinations are coming from the LLM interpolating from the training data, substantial portions of which are scraped off of the internet. Because other peoples' prompts never leave their devices (this app makes no internet connections). Source: 10 months ago
I really need a TL;DR on everything that's happened over the past month! Is it correct that Reddit hiked its API prices in response to realising their data is worth a fortune (since it can be used to train/update LLMs)? Is this understanding correct? Wouldn't it be possible to do a 'Common Crawl' [1] just for reddit, or would the lag (of perhaps a few minutes) make third-party reddit apps too slow? (if LinkedIn... - Source: Hacker News / 10 months ago
Haven’t heard of https://commoncrawl.org or did you crawl something special? Source: 10 months ago
Reddit and StackOverflow notably. Most use https://commoncrawl.org/. - Source: Hacker News / 11 months ago
I'm not affiliated at all, I just see the importance. This is one of the most important resources we need to maintain if we're going to keep the open-source AI movement alive. https://commoncrawl.org/. Source: 11 months ago
How many tokens are you hoping for? Do you intend to train from scratch or finetune on some model? I would start with the data sources cited in this paper. There is also this monolingual corpus that I read about some time ago. Also, I would use common crawl and perhaps use fastText on it for language identification. There is also this thing called a Bangladesh National Corpus or something that someone wrote a... Source: 11 months ago
Web scraping is protected by US law, which is why AI companies all share a common pre-scraped trove called Common Crawl. Source: 11 months ago
There are programs that go onto websites and basically open every page on the website, downloading all or some of the text, and storing them. This process is known as "scraping". When people talk about it in the context of AI, they're referring to https://commoncrawl.org/. Source: 12 months ago
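The scraping described above boils down to two steps: downloading a page and extracting its visible text. A minimal sketch of the extraction step using only Python's standard library (the fetch is omitted; the HTML string here is a stand-in for a downloaded page):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Stand-in for a fetched page.
page = ("<html><head><style>p{}</style></head>"
        "<body><h1>Hello</h1><p>World</p><script>x=1</script></body></html>")
extractor = TextExtractor()
extractor.feed(page)
print(" ".join(extractor.parts))  # → Hello World
```

Real crawlers layer politeness (robots.txt, rate limits) and deduplication on top of this; Common Crawl stores the raw responses in WARC files so everyone can do the extraction themselves.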
The dataset they use is the Common Crawl database (https://commoncrawl.org/); you can just ask ChatGPT and it will tell you. Source: 12 months ago
Do you know an article comparing CommonCrawl to other products?
Suggest a link to a post with product alternatives.
This is an informative page about CommonCrawl. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated, as they help everyone in the community make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.