Apache Tika has worked well for me in the past; I ended up running it on AWS Lambda (https://tika.apache.org/). - Source: Hacker News / 9 months ago
If you accept running Java, Apache Tika is extremely good at parsing content (https://tika.apache.org/). - Source: Hacker News / 10 months ago
Apache Tika can spit out text from lots of formats. I've used it with grep (or rg) for small-scale searching of local folders. Tika does a really good job at OCR for finding if text is in a file. Source: about 1 year ago
https://tika.apache.org: metadata from things. Source: about 1 year ago
At my previous job we had the same problem which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding. Source: almost 2 years ago
For document content I've heard good things about Apache Tika. Spyglass could leverage it via the REST API. Source: about 2 years ago
How about you batch convert it to text with Tika and then run Python (or even grep or awk) on it? Source: about 2 years ago
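The "run Python on it" half of that suggestion might look like the sketch below. It assumes the documents have already been converted to `.txt` files under one folder (by Tika or anything else); the function name `search_text_files` and the folder layout are illustrative, not something from the comment.

```python
import re
from pathlib import Path


def search_text_files(root, pattern):
    """Return (path, line_number, line) for every matching line in *.txt files under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going if some extracted text is badly encoded
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits
```

Calling `search_text_files("converted", "invoice")` would list every line mentioning "invoice" (case-insensitively) in the hypothetical `converted` folder, roughly what `grep -rin` would give.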
There is also Apache Tika (https://tika.apache.org/) - file format detection & content extraction library. - Source: Hacker News / over 2 years ago
I installed FileRun recently and that might get you close. It's fast and the search is pretty good, as it can integrate Apache Tika. I like the OnlyOffice integration as well. It's closed-source, which isn't great for me, but you get 3 accounts without having to pay. Source: over 2 years ago
Any native or FFI-callable thing like the Java tools such as Apache Tika? A quick DuckDuckGo search didn't turn up anything for me. Tika has served me well in the past, but I have no idea what I'd use with CL. Source: over 2 years ago
This is just a simple bash script and Apache Tika (https://tika.apache.org/). You could script this together in minutes. Try https://github.com/flameshot-org/flameshot and feed the results through Tika to OCR the results. - Source: Hacker News / almost 3 years ago
For the first step you can check out Tika: https://tika.apache.org/. Source: almost 3 years ago
'FindTextInDocuments' uses 'Apache's Tika', and conversion of PDF documents to text takes longer than any other. Source: about 3 years ago
I use Apache software every day: Mostly Commons, but also POI, PDFBox, and Tika. They were pioneers for enterprise-friendly open-source libraries at a time when the GPL struck fear into the hearts of development managers everywhere. - Source: dev.to / about 3 years ago
https://tika.apache.org/ - Apache Tika can be integrated as a custom processor or called via REST and run as a separate server/service. - Source: dev.to / about 3 years ago
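A rough sketch of the "called via REST" option, using only the Python standard library. It assumes a Tika server is already running locally on its default port (9998) and that a `PUT` to `/tika` with `Accept: text/plain` returns the extracted text; the helper names `build_tika_request` and `extract_text` are made up for this example.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request that asks the Tika server for plain-text output."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, url=TIKA_URL):
    """Send one file to the server and return the text it extracts."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the server up, `extract_text("report.pdf")` would return the document's text as a string; `report.pdf` is a placeholder file name. Splitting request construction from the network call keeps the first half easy to check without a running server.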
This is an informative page about Apache Tika. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated as they help everyone in the community to make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.