OCR occasionally produces a replicant of the searchable PDF in Global Inbox

xurc · December 1, 2021, 5:07am

Hello,

When I manually apply OCR to individual PDFs, a replicant of the resulting searchable PDF is occasionally created in the Global Inbox. This happens sporadically, without any discernible pattern.

I searched the forum and found two potentially related threads, neither of which seemed to reach a definitive answer.

Here are some of my settings that may be relevant, judging from the above threads:

DEVONthink Preferences > OCR > Original Document: Move to Trash is unchecked.
This only happens to a small portion of the OCR’d files; I’d estimate fewer than 10%.
I only have one Smart Rule, Filter duplicates, installed at the moment.

[Screenshot of the smart rule]

pete31 · December 1, 2021, 6:11am

If there’s only one instance of a record (i.e. it isn’t replicated) and the record’s location is inside the Tags group then DEVONthink creates a replicant in the database’s inbox.

I guess you’re applying OCR via a Smart Group and didn’t notice that some records only exist in the Tags group. And it’s created in the global inbox because that’s the record’s database.

xurc · December 1, 2021, 7:10am

Hi, thanks for sharing your insight!

That might be the case. I’m using a Smart Group to filter all the documents that need OCR, but I do use ⌘R to Reveal the record before applying OCR. Taking your theory into consideration, maybe for some records I forgot to Reveal them first, and some of them happen to exist only in the Tags group.

Some weeks ago, I did notice that a small portion of my records only belonged to the Tags group. I tried to fix it but didn’t find a way, so I decided to forget about it. Now that you brought it up again, is there any way (e.g. Smart Groups) to filter all the records that only exist in the Tags group? Thanks

pete31 · December 1, 2021, 7:19am

This script replicates records to the root.

xurc · December 1, 2021, 8:15am

Thank you for sharing the script!

I selected all the records (by searching for kind:any in All Databases) and ran the script in Apple’s Script Editor but did not notice any new replicants created (by searching for item:replicated in All Databases). Now I’m confused again: does this result mean I do not have any records that only exist in the Tags group? Then I guess all such records have already been discovered and dealt with during my OCR process?

pete31 · December 1, 2021, 8:22am

Yes.

xurc · December 2, 2021, 6:27am

The issue happened again, for the first time since I made this post. This time I am confident the original scanned PDF did exist in a group before I applied OCR, so there is indeed something else going on here.

nauii · February 21, 2022, 1:23pm

Hi there, I am occasionally experiencing the exact same issue. To me it seems to be a bug (using DEVONthink Pro 3.8.2 on macOS 12.2.1).

Is there any news regarding this topic?

cgrunenberg · February 21, 2022, 1:39pm

Are you able to reproduce this?

nauii · February 21, 2022, 1:51pm

Yes, I could reproduce it several times in a row with a specific PDF document. I revealed it before applying OCR.

Each time two files were created, one in the original folder and one in the global inbox, both indicated as replicants. When I delete the one in the inbox, the file in the original folder is no longer indicated as replicant.

Note: the option to delete the original file during the OCR process is set to OFF in my settings (I like to check manually if everything was converted correctly).

cgrunenberg · February 21, 2022, 1:57pm

How exactly did you reproduce this? Does it depend on the PDF or are certain steps sufficient?

nauii · February 21, 2022, 2:12pm

It only happens to SOME PDF files, not to each and every one.

Today, I went through some groups in my databases and manually OCR’d PDF files that needed it (low or 0 word count). I stumbled upon a random PDF file, manually OCR’d it, and the behaviour described above appeared. I deleted both replicants and manually OCR’d the original file again, and again two replicants were produced (one in the same folder, one in the global inbox).

At this point, I can’t see a specific pattern in which PDFs are vulnerable to this behaviour. This one had a 0 word count and was rather old (created 2016).

cgrunenberg · February 21, 2022, 2:17pm

Does this also happen after exporting the document, importing it again and OCRing the imported one?

nauii · February 21, 2022, 2:22pm

Okay, I first have to find another file that produces this behaviour, as I deleted the last one.

Thank you for your help so far.

BLUEFROG · February 21, 2022, 3:28pm

Did you previously have a smart rule that may have produced the replication behavior, which you have since deleted?

nauii · February 21, 2022, 7:18pm

No smart rules, just smart groups. But I definitely OCR’d this PDF from its very own group.

nauii · May 30, 2022, 8:30pm

Yes. I am confident I found the cause now: it happens every time you OCR a protected PDF file (read-only mode). Somehow, when the OCR has finished, two new identical files are created: one in the same folder as the original file (source), and one in the inbox. Both are red and italic (so marked as replicants). If I delete the one in the inbox, then the other new file in the source folder is no longer marked as a replicant. The OCR process seems to have worked fine, and the new OCR’d PDF file is also no longer protected (read-only mode).

Is this the intended behaviour?

BLUEFROG · May 30, 2022, 8:34pm

Is this an indexed file you’re dealing with?

nauii · May 30, 2022, 8:39pm

No, it’s imported to the database and managed by DEVONthink.

cgrunenberg · May 31, 2022, 7:12am

What’s the original location actually?