I just found that DEVONthink doesn’t recognize the text (hence a zero word count) in some PDF files that are fully OCRed and very clean.
I can highlight and copy the text in Acrobat Reader and even within DEVONthink’s own viewer. I get clean text on my clipboard when I copy it.
The funny part is: once I opened the files and highlighted some text, DEVONthink immediately recognized the text and populated the concordance.
But DEVONthink is still showing “no concordance”. What is going on here?
Can you guys check if the issue is just within my system by downloading and dragging this file into your database?
I found that the issue was with the indexing. I rebuilt the database, and the file above is recognized now.
But I still have a couple of documents that show gibberish text within DEVONthink (both when copied and in the concordance). Their text is fine when copied in Adobe.
It could be a DRM variant. When the certificate that locks the PDF is broken, the text turns into gibberish. In other words, the gibberish is the actual “text” stored in the PDF, and it only becomes readable text after passing through the encryption layer.
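If you have many PDFs to check, a rough heuristic can flag the ones whose text layer comes out as gibberish. The sketch below is only an illustration, not anything DEVONthink does internally: it assumes the text is English, that you have already obtained it as a plain string (for example by copying it out of DEVONthink), and the names `readable_score`, `looks_like_gibberish` and the 0.6 threshold are made up for this example.

```python
import string

# Characters we expect in ordinary English text (assumption: mostly ASCII).
COMMON = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")
VOWELS = set("aeiouyAEIOUY")


def readable_score(text: str) -> float:
    """Return a score in [0, 1]; higher means the text looks more like real words."""
    if not text.strip():
        return 0.0  # an empty text layer is treated as unusable
    # Share of characters that are ordinary letters, digits, punctuation or spaces.
    common_share = sum(ch in COMMON for ch in text) / len(text)
    # Keep only tokens that contain at least one letter.
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    if not words:
        return 0.0
    # Share of those tokens that contain a vowel (real English words almost always do).
    vowel_share = sum(any(c in VOWELS for c in w) for w in words) / len(words)
    return common_share * vowel_share


def looks_like_gibberish(text: str, threshold: float = 0.6) -> bool:
    """Hypothetical helper: True when the score falls below the chosen threshold."""
    return readable_score(text) < threshold
```

Paste the text copied from DEVONthink into `looks_like_gibberish(...)`. Documents that come back `True` are candidates for re-OCR. The threshold is a guess and will need tuning, and the vowel check will misfire on non-English text or on pages that are mostly numbers.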
The only solution is to re-OCR the text, either with DT’s built-in OCR or with external tools. For that, I have the latest ABBYY PDF version on Windows 10 (the macOS one is crap); if the file is a text PDF, it OCRs it as a text PDF with the right text inside instead of gibberish.
Did it. It seems that this PDF has something non-standard in it, because ABBYY on Windows 10 generates a scanned-like PDF instead of a pure-text one like the original, with MRC compression to reduce the size.
I did a conversion from DT as well, but the result took 135 MB (I don’t think any good OCR system exists on macOS). However, once I annotated the ABBYY one, it grew to the same 135 MB size… As I said, native Apple PDF support is crap.
I’m attaching both; in both, the annotations show up correctly in DT.