Faulty Spotlight importer

I know this problem is not specific to Devonthink. I am asking here in case somebody came up with a solution to it.
PDF files that are correctly formatted, with clean and crisp text, produce gibberish text under Spotlight (Devonthink).

When I copy text using Acrobat or PDF Expert, I get clean text. But if I copy the same lines using Preview, Devonthink, or any other app that relies on Apple’s Spotlight importer, the result is junk.

Here is a sample pdf (692.3 KB)

This problem has been bugging me for so long.

It would be really great if somebody has a way around this terrible problem (re-running the OCR engine is not solving it for me).

That’s an issue of the PDFKit framework of macOS, not of Spotlight. If OCR doesn’t solve it, then the only workaround is probably to use an app which does not rely on this framework (like Acrobat or PDF Expert).


BTW:
In the case of the sample, the plain/rich text conversions are actually much better after OCR.

Reading is fine: I can do it with PDF Expert. The problem is that the search in DT is not picking up the words correctly. So, basically, those files are invisible to the search (AI) in DT.

See above: OCR seems to work as expected, at least for the sample.


I am not getting any improvement.

The last paragraph of the first page (after running OCR):

The studyof referencehas a long traditionin the philosophicalliterature, and has been investigatedfromvariousperspectiveswithinlinguisticsand psy- chology(see, forexample,Karttunen1976,Nunberg1978,Hawkins 1978,1984,
1991,Clark & Marshall 1981,Grosz 1981,Heim 1982,Maclaran 1982,Giv6n 1983,Ariel 1988,Kronfeld1990,and numerousworkscited therein).Although manyimportantinsightsand observationshave come out of this work,basic
factsconcerningthedistributionand understandingofdifferenftormsofre-
ferringexpressionin naturallanguagediscourse stillremainunexplained.In thispaperwe outlinea theorywhose mainpremiseis thatdifferendteterminers
and pronominalformsconventionallysignal differenctognitivestatuses (in- formationabout location in memoryand attentionstate), therebyenablingthe

The same paragraph using PDF Expert:

The study of referencehas a long tradition in the philosophical literature, and has been investigated fromvarious perspectives within linguistics and psychology (see, for example, Karttunen 1976, Nunberg 1978, Hawkins 1978, 1984, 1991, Clark & Marshall 1981, Grosz 1981, Heim 1982, Maclaran 1982, Giv6n 1983, Ariel 1988, Kronfeld 1990, and numerous workscited therein). Although many importantinsights and observationshave come out of this work, basic facts concerning the distributionand understanding of different formsof referring expression in natural language discourse still remain unexplained. In this paper we outlinea theory whose main premise is that different determiners and pronominal forms conventionallysignal different cognitive statuses (informationabout location in memory and attention state), thereby enabling the
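For anyone who wants to compare two extractions more objectively than by eye, here is a minimal sketch. It is not anything DEVONthink or PDFKit does internally; it is only a rough heuristic that counts how many whitespace-separated tokens are suspiciously long, which is what run-together words look like. The function name `fused_token_ratio` and the 15-character threshold are assumptions made for illustration.

```python
def fused_token_ratio(text: str, max_len: int = 15) -> float:
    """Return the fraction of whitespace-separated tokens longer than max_len.

    Hypothetical heuristic: a text layer with many very long tokens has
    probably lost its word spacing. The threshold is an arbitrary choice.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    long_tokens = [t for t in tokens if len(t) > max_len]
    return len(long_tokens) / len(tokens)


# Short excerpts in the style of the two extractions quoted above.
garbled = (
    "The studyof referencehas a long traditionin the "
    "philosophicalliterature, and has been "
    "investigatedfromvariousperspectiveswithinlinguisticsand"
)
clean = "The study of reference has a long tradition in the philosophical literature"

# A higher ratio suggests more fused words in the extracted text.
print(fused_token_ratio(garbled), fused_token_ratio(clean))
```

Paste the text copied from each app into the two strings; the extraction with the higher ratio is the one more likely to be missed by search.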

What application did the initial OCR?

And don’t be deceived by the appearance of text on a document. It has no bearing on whether the OCR’d text is good or even exists.

This is page 1 from DT post-OCR:
COGNITIVE STATUS AND THE FORM OF REFERRING EXPRESSIONS IN DISCOURSE.txt (3.5 KB)

You are getting much better text with the OCR. Did you convert the PDF to some other format before you ran the OCR?

No.

  • What are your OCR settings?
  • Are you on an Intel or Apple Silicon Mac?

Oh, there was an issue with a setting. I am getting the same result as you now.

Thank you.

You’re welcome.
What was the problematic setting?

"Move the original to trash" was set to off. I always used to keep it on. So, basically, I was copying from the original rather than from the OCR'd copy.

Gotcha. Logical resolution. :smiley:
