There are programs that go onto websites and basically open every page on the website, downloading all or some of the text and storing it. This process is known as "scraping". When people talk about it in the context of AI, they're referring to https://commoncrawl.org/. - Source: Reddit / 17 days ago
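For readers curious what such a scraping loop actually looks like, here is a minimal sketch using only the Python standard library. It is illustrative only: the seed URL and page limit are placeholder assumptions, and real crawlers (including Common Crawl's) also deal with robots.txt, politeness delays, deduplication, and fault tolerance.

```python
# Toy breadth-first scraper: fetch a page, store its text, follow its links.
# Illustrative only; not how Common Crawl itself is implemented.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextExtractor(HTMLParser):
    """Collect hyperlinks and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seed_url, max_pages=10):
    """Visit up to max_pages pages starting from seed_url and store their text."""
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or non-HTML pages
        parser = LinkAndTextExtractor()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    for url, text in crawl("https://example.com").items():  # placeholder seed
        print(url, len(text), "characters of text")
```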
The dataset they use is the common crawl database https://commoncrawl.org/ you can just ask ChatGPT and it will tell you. - Source: Reddit / 23 days ago
It can do the opposite of what you want perfectly if you know what you're trying to prove in great detail. ChatGPT is very, very good at making plausible statements in accordance with what they feed it from Common Crawl and various similar sources on the internet. - Source: Reddit / 28 days ago
They aren’t just scraping the top ten websites or some shit. It’s a deep crawl of the entire internet. Petabytes of data. One of the main sources a lot of these LLMs use is this: https://commoncrawl.org. - Source: Reddit / about 1 month ago
Interesting, thanks. I was just basing my claim off the fact that GPT uses Common Crawl (which incidentally is hosted on AWS). However, I don't know how much of the entire internet is actually included in Common Crawl. - Source: Reddit / about 1 month ago
Not even "effectively". For a start, it's got nothing to do with Google search results (the main dataset is the open-data Common Crawl, which accounted for ~80% of the data used, along with WebText2, Wikipedia and other sources). - Source: Reddit / 2 months ago
Yes, and OP and everyone else can perhaps focus on backup on Common Crawl etc. - Source: Reddit / 2 months ago
What about Common Crawl? https://commoncrawl.org/. - Source: Hacker News / 3 months ago
Yeah I posted this in another thread but your claim is simply untrue. There is little to no need for anyone to scrape the web themselves; CommonCrawl does it and provides access to all that data for free. They are a charity similar to the Internet Archive: https://commoncrawl.org/ As for using that data to train a model, I've done it at my own company and it's not that expensive, certainly not the... - Source: Hacker News / 3 months ago
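As a rough sketch of how that free access works in practice: you can query Common Crawl's public CDX index for a URL and then fetch only the matching WARC record by byte range. The crawl label below is an assumption; substitute any crawl listed at https://index.commoncrawl.org/.

```python
# Sketch: look up a URL in Common Crawl's public CDX index, then pull the
# matching gzipped WARC record with an HTTP range request.
import gzip
import json

import requests

CRAWL = "CC-MAIN-2024-10"  # assumed crawl id; replace with a current crawl


def lookup(url):
    """Return index records describing where `url` is stored in this crawl."""
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url, "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_record(record):
    """Fetch one gzipped WARC record by byte range and return it decompressed."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        f"https://data.commoncrawl.org/{record['filename']}",
        headers={"Range": f"bytes={start}-{end}"},
        timeout=60,
    )
    resp.raise_for_status()
    return gzip.decompress(resp.content)


if __name__ == "__main__":
    records = lookup("example.com")
    if records:
        print(fetch_record(records[0])[:500].decode("utf-8", errors="replace"))
```

Pulling individual records this way is handy for spot checks; bulk training pipelines typically stream whole WARC/WET files instead.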
>There are only a handful of companies with enough data to be able to train artificial intelligence algorithms. https://commoncrawl.org/. - Source: Hacker News / 3 months ago
No-one paid to have it tweaked. It predicts the most probable sequence of word tokens based on your input (and its previous output if there was one). Its understanding of how similar words are to each other (word embeddings) was trained by predicting the next word in a sequence, mainly from a dataset of webpages called CommonCrawl (https://commoncrawl.org/). It doesn't "know" anything, it just has a... - Source: Reddit / 3 months ago
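To make "predicting the most probable next word" concrete, here is a toy bigram counter over a few made-up sentences. Real LLMs learn dense embeddings and use transformer layers, but the underlying objective is this same next-token prediction idea.

```python
# Toy bigram model: predict the most probable next word from observed counts.
from collections import Counter, defaultdict

corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1


def predict_next(word):
    """Return the most frequent next word and its estimated probability."""
    counts = following[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())


print(predict_next("the"))  # e.g. ('cat', 0.333...)
print(predict_next("sat"))  # ('on', 1.0)
```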
OpenAI then co-trained the two encoders with the multilingual laion5b dataset, which contains 5.85 billion image-text pairs: 2.2 billion of these pairs are labelled in 100+ non-English languages, with the rest in English or containing text that can’t be nailed down to any one language (like place names or other proper nouns). These are taken from a sampling of images and their HTML alt-text in the Common Crawl... - Source: dev.to / 4 months ago
Common Crawl is a good one to start off playing with. It's huge, covers a long span of time, and has both full and plaintext-only versions. https://commoncrawl.org/. - Source: Reddit / 4 months ago
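The plaintext-only version mentioned here is Common Crawl's WET format. A sketch of iterating over one WET file with the warcio library (pip install warcio) follows; the filename is a placeholder, since real paths come from each crawl's wet.paths.gz manifest.

```python
# Sketch: stream the plaintext ("WET") flavour of a Common Crawl file with
# warcio. The filename below is a placeholder.
from warcio.archiveiterator import ArchiveIterator

WET_FILE = "CC-MAIN-example.warc.wet.gz"  # placeholder filename

with open(WET_FILE, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET text records are 'conversion'
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, len(text), "characters")
```

The full (WARC) files work the same way, except the interesting records have rec_type "response" and contain raw HTTP responses rather than extracted text.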
In any case, this is part of the training data, I think: https://commoncrawl.org/. - Source: Reddit / 4 months ago
If all you want is a LM and it doesn't need to be trained by you or run on infrastructure you control, you could try to see whether ChatGPT already understands well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny. If you want or need to train on your own data, social media is a good bet for colloquial language. You could try... - Source: Hacker News / 4 months ago
Datasets are fairly easy to compile. 60% of GPT is from https://commoncrawl.org/ and using the. - Source: Reddit / 4 months ago
The goal of the prototype was simple. I "just" wanted to download 10,000 documents to understand how hard it is to collect and kind of archive them. The immediate problem was that I didn't know where I could get links for this many files. Sitemaps can be useful in similar scenarios. However, there are a couple of reasons why in this case they are not really a viable solution. Most of the time they don't contain... - Source: dev.to / 5 months ago
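For context on the sitemap approach this commenter found insufficient, here is a short sketch of collecting URLs from a sitemap.xml with the Python standard library. The sitemap URL is a placeholder, and the sketch ignores sitemap index files that point at several sub-sitemaps.

```python
# Sketch: collect document URLs from a sitemap.xml using only the standard
# library. The URL below is a placeholder.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP_URL, timeout=30) as resp:
    tree = ET.parse(resp)

urls = [loc.text for loc in tree.findall(".//sm:url/sm:loc", NS)]
print(f"found {len(urls)} links")
```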
AFAIK they used Common Crawl data as their training data. - Source: Reddit / 5 months ago
And StabilityAI didn't even do their own web scraping, they used the Common Crawl data. - Source: Reddit / 5 months ago
So a previous user got heated with some of my responses I guess and referenced that https://laion.ai/ uses https://commoncrawl.org/, which is a bot that scrapes data from the entire internet (petabytes of data). - Source: Reddit / 6 months ago
This is an informative page about CommonCrawl. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated, as they help everyone in the community make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.