-
Tesseract is an optical character recognition engine for various operating systems
> Does Android even have native OCR? Tesseract? https://github.com/tesseract-ocr/tesseract
#OCR #Image Recognition #PDF Editor 72 social mentions
-
Common Crawl
Common Crawl claims to have 82% of the tokens used to train GPT-3, and it's available to anyone. Add all the downloadable material at archive.org and you've got a formidable corpus. https://commoncrawl.org/
#Search Engine #Web Scraping #Data Extraction 90 social mentions