OCR occasionally produces a replicant of the searchable PDF in Global Inbox

Hello,

When I manually apply OCR to individual PDFs, a replicant of the resulting searchable PDF is occasionally created in the Global Inbox. This happens sporadically, without any discernible pattern.

I searched in the forum and found two potentially related posts, neither of which seemed to reach a definitive answer:

Here are some of my settings that may be relevant, judging from the above threads:

  • DEVONthink Preferences > OCR > Original Document: Move to Trash is unchecked.
  • This phenomenon affects only a small portion of the OCR’d files; I’d estimate <10%.
  • I only have one Smart Rule, Filter duplicates, installed at the moment.
    Screen shot of smart rule

If there’s only one instance of a record (i.e. it isn’t replicated) and the record’s location is inside the Tags group then DEVONthink creates a replicant in the database’s inbox.

I guess you’re applying OCR via a Smart Group and didn’t notice that some records only exist in the Tags group. And it’s created in the global inbox because that’s the record’s database.
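The rule described above can be sketched as a toy model. This is only an illustration of the stated condition, not DEVONthink's actual implementation: the `Record` class, the `locations` field, and the `/Tags` path convention are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical stand-in for a database record (not a DEVONthink API)."""
    name: str
    locations: List[str]  # group paths holding an instance of this record


def in_tags_group(path: str) -> bool:
    # Assumption: the Tags group is modelled as the path prefix "/Tags".
    return path == "/Tags" or path.startswith("/Tags/")


def needs_inbox_replicant(record: Record) -> bool:
    """True when the record has exactly one instance and it sits inside Tags."""
    return len(record.locations) == 1 and in_tags_group(record.locations[0])


if __name__ == "__main__":
    tagged_only = Record("scan.pdf", ["/Tags/receipts"])
    filed = Record("invoice.pdf", ["/Invoices"])
    for rec in (tagged_only, filed):
        print(rec.name, needs_inbox_replicant(rec))
```

Under this model, a record that is filed in an ordinary group, or that already has a second instance elsewhere, would not trigger an inbox replicant.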

Hi, thanks for sharing your insight!

That might be the case. I’m using a Smart Group to filter all the documents that need OCR, but I do use ⌘R to Reveal the record before applying OCR. Taking your theory into consideration, maybe for some records I forgot to Reveal them first, and some of them happen to exist only in the Tags group.

Some weeks ago, I did notice that a small portion of my records only belonged to the Tags group. I tried to fix it but didn’t find a way, so I decided to forget about it. Now that you brought it up again, is there any way (e.g. Smart Groups) to filter all the records that only exist in the Tags group? Thanks :smiley:

This script replicates records to the root.

Thank you for sharing the script!

I selected all the records (by searching for kind:any in All Databases) and ran the script in Apple’s Script Editor but did not notice any new replicants created (by searching for item:replicated in All Databases). Now I’m confused again: does this result mean I do not have any records that only exist in the Tags group? Then I guess all such records have already been discovered and dealt with during my OCR process?

Yes.

The issue happened again, for the first time since I made this post. This time I am confident the original scanned PDF did exist in a group before I applied OCR, so there is indeed something else going on here.

Hi there, I am occasionally experiencing the exact same issue. To me it seems to be a bug (using DEVONthink Pro 3.8.2 on macOS 12.2.1).

Is there any news regarding this topic?

Are you able to reproduce this?

Yes, I could reproduce it several times in a row on a specific PDF document. I revealed it beforehand.

Each time, two files were created, one in the original folder and one in the global inbox, both marked as replicants. When I delete the one in the inbox, the file in the original folder is no longer marked as a replicant.

Note: the option to delete the original file during the OCR process is set to OFF in my settings (I like to check manually if everything was converted correctly).

How exactly did you reproduce this? Does it depend on the PDF or are certain steps sufficient?

It only happens to SOME PDF files, not to each and every one.

Today, I went through some groups in my databases and manually OCR’d PDF files that needed it (low or 0 word count). I stumbled upon a random PDF file, manually OCR’d it, and the behaviour described above appeared. I deleted both replicants and manually OCR’d the original file again, and again two replicants were produced (one in the same folder, one in the global inbox).

At this point, I can’t see a specific pattern in which PDFs are prone to this behaviour. This one had a 0 word count and was fairly old (created 2016).

Does this also happen after exporting the document, importing it again and OCRing the imported one?

Okay, first I have to find another file that produces this behaviour… :sweat_smile: since I deleted the last one.

Thank you for your help so far.

Did you have a smart rule that may have produced the replication behavior, but then deleted the rule?

No smart rules, just smart groups. But I definitely OCR’d this PDF from its very own group.

Yes. I am confident I found the cause now: it happens every time you OCR a protected PDF file (read-only mode). Somehow, when the OCR has finished, two new identical files are created: one in the same folder as the original file (source), and one in the inbox. Both are red and italic (so marked as replicants). If I delete the one in the inbox, then the other new file in the source folder is no longer marked as a replicant. The OCR process seems to have worked fine, and the new OCR’d PDF file is also no longer protected (read-only mode).

Is this the intended behaviour?
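To narrow down which PDFs might be affected before running OCR, a rough check along these lines may help. It is a heuristic sketch, not something DEVONthink provides: it assumes "protected (read-only mode)" means PDF permission restrictions, which are stored via an `/Encrypt` entry in the file's trailer, and it simply looks for that marker in the raw bytes. The function names and the example path are hypothetical, and a file could in principle contain the byte sequence for other reasons.

```python
from pathlib import Path


def has_encrypt_marker(data: bytes) -> bool:
    """Heuristic: encrypted or permission-restricted PDFs carry an /Encrypt entry."""
    return b"/Encrypt" in data


def looks_protected(pdf_path: str) -> bool:
    """Read the exported PDF from disk and apply the heuristic above."""
    return has_encrypt_marker(Path(pdf_path).read_bytes())


if __name__ == "__main__":
    # Hypothetical path to a PDF exported from the database.
    candidate = Path("exported/scan-2016.pdf")
    if candidate.exists():
        print(candidate, "looks protected:", looks_protected(str(candidate)))
```

Running this over exported copies of the files that produced inbox replicants, and over some that did not, would show whether the protection theory holds for more than one document.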

Is this an indexed file you’re dealing with?

No, it’s imported to the database and managed by DEVONthink.

What’s the original location actually?