Why am I getting "no concordance"?

Dellu · July 31, 2021, 8:16am

I just found that DEVONthink doesn’t recognize the text (hence a zero word count) of some PDF files that are fully OCRed and very clean.
I can copy and highlight the text in Acrobat Reader and even within DEVONthink’s reader. I get clean text on my clipboard when I copy it.

The funny part is: once I open a file and highlight some text, DEVONthink immediately recognizes the text and populates the concordance.

But DEVONthink is still showing “no concordance”. What is going on here?

Can you guys check if the issue is just within my system by downloading and dragging this file into your database?

rfog · July 31, 2021, 8:42am

It is working here. I dropped the file into my Global Inbox and a few seconds later it was recognised as PDF+Text and showed up fine in “See Also & Classify”.

Dellu · July 31, 2021, 8:44am

Thank you for the fast reply.

Well, then, the problem must be on my system. I remember DEVONthink crashed a couple of times when I first indexed the folder. That might be the culprit. Shall I rebuild the database?

rfog · July 31, 2021, 9:09am

No. Just restart DT, select the offending folder that contains the file, and then choose File → Update Indexed Files.

If that does not resolve it, delete the file, empty the trash, repeat “Update Indexed Files” and then add the file one more time.

Dellu · July 31, 2021, 9:57am

I should have tried this first. I just clicked on rebuild; it’s going to take hours to finish.

rfog · July 31, 2021, 11:22am

Yes. Normally all indexing issues get solved via “Update Indexed Files” and sometimes “Synchronize” in the same File menu.

Dellu · July 31, 2021, 11:24am

Updating the indexed files did not solve the issue. But I missed your second point: deleting the files and re-adding them would have solved it easily.

BLUEFROG · July 31, 2021, 4:33pm

I saw no issues with this file.
Do you have it working now?

Dellu · July 31, 2021, 9:32pm

I found the issue was with the indexing. I rebuilt the database and the above file is recognized now.
But I still have a couple of documents that show gibberish text within DEVONthink (when copied, as well as in the concordance). The text is fine when copied in Adobe.

Horn2010 The expression of negation.pdf (3.2 MB)

Can you look at this file, for example?

BLUEFROG · July 31, 2021, 11:45pm

This should point you to a PDFKit issue, not a DEVONthink one.
Acrobat doesn’t use PDFKit, for obvious reasons.
DEVONthink does.

Do you have the file without the highlight?
If not, download a new copy of the file.

Check the Concordance before and after highlighting.
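
If you want something more objective than eyeballing the copied text, a rough score can help when comparing what different readers give you. This is only a sketch: `common_word_ratio` and `looks_like_gibberish` are hypothetical helpers (not part of DEVONthink or PDFKit), the word list and the 0.1 threshold are arbitrary assumptions, and it only makes sense for English text.

```python
import re

# Assumption: a tiny hand-picked list of very common English words.
# A real check would want a larger list; this is enough to tell prose from noise.
COMMON_WORDS = {
    "the", "of", "and", "to", "in", "a", "is", "that",
    "for", "it", "as", "with", "not", "be", "on",
}


def common_word_ratio(text: str) -> float:
    """Return the fraction of alphabetic tokens that are common English words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        # No alphabetic tokens at all (empty or symbol-only text).
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_WORDS)
    return hits / len(tokens)


def looks_like_gibberish(text: str, threshold: float = 0.1) -> bool:
    """Flag text whose share of common words falls below the (assumed) threshold."""
    return common_word_ratio(text) < threshold
```

Paste a paragraph copied from DEVONthink or Preview into one string and the same paragraph copied from Acrobat into another, then compare the two scores. Clean English prose normally scores well above the threshold, while text with a broken character mapping tends to score near zero.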

Dellu · August 1, 2021, 7:17am

You are right. I get it: the issue is not limited to DEVONthink. Preview also gives gibberish.

Re-downloading the files is not fixing it.

The highlighting is not helping either.

Optimizing the PDF in Acrobat, embedding fonts, converting the format to Acrobat 10, etc. is not working either.

It is strange.

rfog · August 1, 2021, 8:39am

It could be a DRM variant. When the certificate that locks the PDF is broken, the text comes out as gibberish. In other words, the gibberish is the “text” actually stored in the PDF, and it only becomes readable text after passing through the encryption layer.

The only solution is to re-OCR the text, either with DT’s built-in OCR or with external tools. For that I use the latest ABBYY PDF version on Windows 10 (the macOS one is crap); if the input is a text PDF, it produces a text PDF with the right text inside instead of gibberish.

Dellu · August 1, 2021, 9:13am

I ran OCR twice on one of the files, even after converting it to an image PDF by printing it to PDF. No luck at all.

rfog · August 1, 2021, 9:27am

Wow, that’s very strange (and nasty). If you post one of those, I could OCR it on my Windows machine to see if that resolves the issue.

Dellu · August 1, 2021, 9:28am

Can you check the file I attached above?
(https://discourse.devontechnologies.com/uploads/short-url/8kIWCZQShxr4aBDYyYl98MKbz5q.pdf)

rfog · August 1, 2021, 10:22am

Did it. It seems the PDF has something non-standard in it, because ABBYY on Windows 10 generates a scanned-like PDF instead of a pure text one like the original, with MRC compression to reduce the size.

I did a conversion from DT as well, but the result took 135 MB (I don’t think there is any good OCR system on macOS). However, once I annotated the ABBYY one, it grew to the same 135 MB… As I said, native Apple PDF support is crap.

I’m attaching both, and in both the annotations show up correctly in DT.

You can get both from here: Dropbox - Horn2010 The expression of negation.zip

Dellu · August 1, 2021, 11:05am

ABBYY did wonderful work.

I am surprised that Adobe DC failed to solve it. Even converting it to PNG and building it back failed miserably.

I need to turn to my old Windows machine then.
Thank you very much, man.