Ask HN: How to OCR a PDF and preserve whitespace?

This page summarizes and extends the software alternatives mentioned in the source post on Hacker News.

2024-06-07


1. Amazon Textract

Easily extract text and data from virtually any document using Amazon Textract. Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
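As a rough illustration of how Textract output could be used for the whitespace question in the title, here is a minimal sketch. It assumes the `boto3` package and configured AWS credentials, a single page image passed as bytes, and a response whose `LINE` blocks carry a normalized `Geometry.BoundingBox`. The helper names `detect_lines` and `lines_with_indent` and the 80-character page width are hypothetical choices for this example, not part of Textract.

```python
def detect_lines(image_bytes):
    """Call Textract on one page image and return the raw response (needs AWS credentials)."""
    import boto3  # imported lazily so the pure helper below works without it

    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})


def lines_with_indent(response, page_width_chars=80):
    """Approximate horizontal layout by indenting each LINE block by its left offset.

    Assumption: BoundingBox "Left" and "Top" are fractions of the page size (0.0 to 1.0).
    """
    lines = [b for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]
    # Read top to bottom, then left to right.
    lines.sort(key=lambda b: (b["Geometry"]["BoundingBox"]["Top"],
                              b["Geometry"]["BoundingBox"]["Left"]))
    result = []
    for block in lines:
        indent = int(block["Geometry"]["BoundingBox"]["Left"] * page_width_chars)
        result.append(" " * indent + block["Text"])
    return result
```

This only restores leading indentation per line; text that shares a row in several columns would still need to be merged by comparing `Top` values.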

#OCR #Image Recognition #OCR API 37 social mentions

2. Tesseract

Tesseract is an optical character recognition engine for various operating systems.

Many OCR services are built on the free, open-source Tesseract OCR engine but don’t expose all of its options. If you’re handy with shell scripts or Python, you can probably get better performance by hand-tuning the options for your particular images. For example, if I recall correctly, there are page segmentation options that tell Tesseract to expect multi-column text. That alone might get you better performance than the automatic mode. https://github.com/tesseract-ocr/tesseract/
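The hand-tuning described above might look like the following minimal sketch. It assumes the `tesseract` command-line tool is installed and that the PDF pages have already been rasterized to images, since Tesseract reads images rather than PDFs. The helper names and the file name `page-1.png` are hypothetical. The `--psm` values follow the list printed by `tesseract --help-psm`, and `preserve_interword_spaces` is a Tesseract configuration variable that is relevant to keeping whitespace.

```python
import shutil
import subprocess

# Page segmentation modes, as listed by `tesseract --help-psm`.
PSM_AUTO = 3           # fully automatic page segmentation (the default)
PSM_SINGLE_COLUMN = 4  # a single column of text of variable sizes
PSM_SINGLE_BLOCK = 6   # a single uniform block of text


def build_tesseract_command(image_path, psm=PSM_SINGLE_BLOCK, lang="eng", keep_spaces=True):
    """Return the argument list for one Tesseract run that prints text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]
    if keep_spaces:
        # Ask Tesseract to keep runs of spaces between words.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def ocr_image(image_path, **options):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; prints the command without running it.
    print(build_tesseract_command("page-1.png", psm=PSM_SINGLE_COLUMN))
```

Which mode works best depends on the layout, so it is worth trying a few values on a sample page and comparing the output.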

#OCR #Image Recognition #PDF Editor 79 social mentions
