Hi, if I have tabular text like this
I get OCR text like this:
DATUM Rechng.-Nr.: KUNDEN-NR.: PREIS €
16.02.2021
4898 915
BETRAG €
So, the first column row by row in one row, and the second column in a mix of one row per row and something else.
Is there any way to tell ABBYY OCR to return OCR data left to right (in an LTR environment, that is) and not try to be cute about tables?
There’s no such option currently, at least not in DEVONthink.
Thanks. It’s not even in FineReader. Interestingly, PDFPen does The Right Thing™, as does (presumably) the Vision framework.
Coming from Paperless-ngx, I find the way PDFs are OCRed in DT somewhat inconvenient, as retrieving the date (usually found in the top right) no longer works consistently. DEVONthink is superior in all other functions, though.
Does DEVONthink retrieve no date at all or the wrong date? You could also convert the PDF documents to plain text to check whether the text layer actually contains the expected date.
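Once the text layer has been exported as plain text, a quick check along these lines would show which dates it contains and in what order. This is only a rough sketch: the `DD.MM.YYYY` pattern and the helper name `dates_in_order` are assumptions for illustration, not anything DEVONthink itself uses, and the sample string is the OCR output quoted earlier in this thread.

```python
import re
from datetime import datetime

# Assumed date format: German-style DD.MM.YYYY, as in the OCR sample above.
DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

def dates_in_order(text):
    """Return all valid DD.MM.YYYY dates in the order they appear in the text."""
    found = []
    for match in DATE_RE.finditer(text):
        try:
            found.append(datetime.strptime(match.group(0), "%d.%m.%Y").date())
        except ValueError:
            # Looks like a date but is not one (e.g. 99.99.2021); skip it.
            continue
    return found

sample = "DATUM Rechng.-Nr.: KUNDEN-NR.: PREIS €\n16.02.2021\n4898 915\nBETRAG €"
for d in dates_in_order(sample):
    print(d.isoformat())
```

If the expected date is missing from this list, the problem lies in the OCR text layer; if it is present but not first, the problem is the reading order.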
DEVONthink does retrieve a date, and functions like newest/oldest do work, but what I’d need is for the algorithm to retrieve the first date while reading the file the way we would: row by row, without any tabular structure.
The primary challenge is that documents like invoices have several dates in them (date of invoice, date of service, date payment is due, …) and are structured in those invisible columns (similar to the example above).
Just picking the “newest” or “oldest” document date won’t always work, as texts may contain a due date for payment (the newest date, but not the creation date) or similar.
Please see my attached examples, where the retrieved date is 10.07.2023 as it processes the upper right block as a separate column while ignoring that it is basically something like row 3 of the document. The expected result would be 03.07.2023.
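To illustrate what I mean by "row by row", here is a minimal sketch. It assumes the OCR step can hand over each word together with its position on the page, which I don't know to be the case in DEVONthink. The `Word` type, the `row_tolerance` value, the labels and the coordinates are all made up for the example; only the two dates come from my documents.

```python
import re
from typing import List, NamedTuple, Optional

class Word(NamedTuple):
    text: str
    x: float  # left edge of the word (hypothetical units)
    y: float  # top edge of the word, growing downwards

DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

def reading_order_lines(words: List[Word], row_tolerance: float = 5.0) -> List[str]:
    """Group words into visual rows by their y position, then read each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current row if it sits at roughly the same height.
        if rows and abs(word.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [" ".join(w.text for w in sorted(row, key=lambda w: w.x)) for row in rows]

def first_date(words: List[Word]) -> Optional[str]:
    """Return the first date encountered when reading top to bottom, left to right."""
    for line in reading_order_lines(words):
        match = DATE_RE.search(line)
        if match:
            return match.group(0)
    return None

# Invented layout: a left-hand column with a later date further down the page,
# and an upper right block holding the invoice date.
words = [
    Word("Invoice", 10, 10),
    Word("Due", 10, 80), Word("10.07.2023", 60, 80),
    Word("Date", 400, 30), Word("03.07.2023", 450, 30),
]
print(first_date(words))
```

Read column by column, the left column comes first and the date found is 10.07.2023; read by rows as above, the upper right block is reached earlier and the first date is 03.07.2023.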
Thank you for the feedback! So far the algorithm uses only the text of documents, not their layout, no matter whether it is a PDF or another file format.