OCR to searchable PDF creates two replicants

Sans-Culottes · April 15, 2021, 3:57pm

I have imported several years of Evernote items into DT (15,000+). DT created a group for each imported item; each group contains the original PDF or JPEG and also an HTML file of the title with a link to the PDF or JPEG. I have recently discovered that some of those items had not been through OCR and so could not be found by word search.

I ran the “PDFs not searchable Smart Group” mentioned elsewhere, then selected the resulting items from a single database and right-clicked to “OCR to searchable PDF”. This processed most items (I still need to work through the log of items that appear to have the wrong permissions) and created a “PDF + Text” item in the group that was created on import from EN. This was the outcome I was looking for.

Also created were replicants in the database Inbox, and I do not understand why this has happened.

Any thoughts please?

Cheers Chris

chrillek · April 15, 2021, 4:01pm

There’s a setting in the OCR tab of the preferences that defines whether the original PDF is removed after OCR. What is it set to on your machine?

Sans-Culottes · April 15, 2021, 4:06pm

Hi

The “original document move to trash” checkbox is unchecked. However this is not the issue, as not only is the original document in the group, along with the expected PDF+Text item created by OCR, but there is another replicant of the PDF+Text file in the database inbox.

Blanc · April 15, 2021, 4:16pm

Have you got any smart rules active which may be doing this (e.g. searching all databases and performing a replicate on ocr action)?

In what way is there another replicant? If I understand you correctly, you have the original (PDF with no text layer), the OCR’d file and - in the inbox - one replicant of the OCR’d file.

Sans-Culottes · April 15, 2021, 4:50pm

You have understood me correctly; my poor use of words. The OCR’d file in the group with the original and the replicant in the database inbox are both coloured red as replicants. Delete either and the colouring goes, and there are no replicants.

Sans-Culottes · April 15, 2021, 4:54pm

No smart rules linked to actions, only reporting

Blanc · April 15, 2021, 5:10pm

I’ve never seen the behaviour you describe; is every file you OCR as you described in your first post being replicated? Or just some? Would you mind posting a screen cap of the smart rule you are using? I’ll see if I can replicate what is going on. Do please also run a verify & repair on the database concerned.

Can you replicate the same behaviour if you don’t select a large number of PDFs for OCR, but just one, for example?

Sans-Culottes · April 16, 2021, 6:16am

Thanks for your help. I will attempt to do this this afternoon, workload permitting.

Sans-Culottes · April 16, 2021, 5:13pm

Test of OCR function

Database verified and repaired – no issues.

This issue has arisen over several databases, but for test purposes I have closed all but one to improve visibility in the screenshots

22 items in imported database identified as not having been OCRed prior to import (see items needing OCR) by using Smart Group (see screenshot “Smart Group to select items needing OCR”)

Item one in the list (the top item) is OCRed via right click /OCR/to searchable PDF (see screenshot “Change in item count having OCRed item 1”)

No apparent issues with above, no replicants created, and the PDF+ Text file appears in the group with the original PDF and the associated HTML file created on import from Evernote. (see screenshot “Item 1 group”)

Selected items 4 to 5 inclusive, these are then OCRed via right click /OCR/to searchable PDF as before. (See screenshot “Change in item count having OCRed items 2 to 5”). Note that 2 items are now showing in “Inbox Evernote Import 2016” and in “Replicants (these are….) Smart Group”. (see screenshots “Inbox Evernote Import 2016” and “Items replicated during OCR of items 2 to 5”)

See screenshot “Example group with replicant”, showing Group, PDF document and HTML created on import from Evernote, and the replicant PDF +TEXT file created by OCR.

I have left the option in the OCR preference pane to move originals to trash after OCR unchecked for now, so that I can unwind any actions that occur whilst OCRing. If I do check it, it leaves a .html file in every group that was created on import, and those files then point to the deleted file. This is another issue, which I either live with by retaining both the PDF and the PDF + Text file, or by keeping the PDF + Text file and a redundant .html file. I need to think about how best to deal with that point.

Meantime if I proceed with the OCR process, I need to carefully note the change in item count after each batch of OCRing, and if happy with that can delete the unwanted replicants from the Inbox should they occur.

Perhaps if I am doing something wrong you could advise.

Many thanks

BLUEFROG · April 16, 2021, 5:26pm

OCR doesn’t produce replicants on its own.
Your screen captures don’t display the smart rules installed. Please provide a screen capture of them.

Sans-Culottes · April 16, 2021, 5:45pm

I have no smart rules set up other than Reminders, Bates Numbering and Incoming Scans, all of which I believe were there on installation of DT.

My earlier post referred to smart rules for reporting, I should have said smart groups, apologies.

Blanc · April 18, 2021, 12:51pm

Which smart rule is that? I don’t have one of that description; perhaps you could post a screenshot. I haven’t as yet been able to replicate the behaviour you are describing, which, together with Jim’s post, at least suggests this problem may be specific to your setup.

Sans-Culottes · April 18, 2021, 2:24pm

See attached Incoming scans

Thanks for your assistance

Blanc · April 18, 2021, 4:02pm

Ok, thanks; that rule is not responsible, because it does nothing (it has no conditions).

I’ve just been through the images you posted above again (@BLUEFROG please will you take another look at this, I can’t explain what is going on): the images show that

when you OCR a document with OCR/to searchable PDF the original document remains listed in the smart group which displays files which are PDF with word count 0
the original document remains marked PDF Document
edit: as per your feedback a PDF + Text item is created in same group as the original document, as expected.
at the same time a replicant which is marked as PDF + Text is produced in the inbox.

Jim, could this be something to do with write permissions to the file? I cannot figure any setting or rule which would cause this to happen; I also can’t reproduce the behaviour with a simple test.

@Sans-Culottes - could you please do the following:

select a file (which is similar to ones with which you have had these problems) which has not yet been OCRd, and then select Show in Finder from the context menu.
in Finder select Get Info from the context menu.

Is the file shown as locked? (The check box is in the “General” section of the Info window in Finder; note that I’m not asking whether the file is locked in DEVONthink 3.) Are you shown as having Read & Write permission in the Sharing & Permissions section in the same window?

Please have a look at a number of files to be sure the result is not a fluke. Please also look at a file you have already OCRd and which was replicated and remained in the smart group looking for PDF with 0 word count.
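If checking many files by hand in Finder gets tedious, the same two checks could be scripted. The following is only a minimal Python sketch, not anything built into DEVONthink: the function name `check_file_access` is made up for illustration, it assumes you pass it the path revealed by Show in Finder, and it relies on Finder's "Locked" check box corresponding to the macOS user-immutable flag (`uchg`).

```python
import os
import stat


def check_file_access(path):
    """Report whether a file is locked and whether the current user can read and write it.

    Hypothetical helper for illustration; it mirrors the Locked check box and the
    Read & Write entry shown in Finder's Get Info window.
    """
    info = os.stat(path)
    # Assumption: Finder's "Locked" check box maps to the user-immutable flag (uchg).
    # st_flags only exists on some platforms (macOS, BSD), so fall back to 0 elsewhere.
    flags = getattr(info, "st_flags", 0)
    return {
        "locked": bool(flags & stat.UF_IMMUTABLE),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }
```

Calling `check_file_access("/path/to/file.pdf")` on each suspect file returns a small dictionary; any file reporting `locked` as true or `writable` as false would be worth a closer look.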

Sans-Culottes · April 19, 2021, 6:24am

Good morning

To add to your note to Bluefrog.

A PDF + Text item is created in same group as the original document, as expected.

Files inspected through Finder as detailed by you below.

No files locked, all files show me with read write permission.

Cheers

Sans-Culottes · April 20, 2021, 6:41pm

Hi

Any further thoughts please?

Blanc · April 20, 2021, 7:01pm

Not really; does this happen with files which haven’t been imported from Evernote too?
I can’t come up with any mechanism for this happening other than a smart rule - and you have stated you have no such smart rules active. I had kind of hoped Jim might come back with some more insight; I think your best bet might be to open a support ticket (and please do post back here if together with DEVONtech you figure out why this is happening - it’s driving me a little bit mad…)

Sans-Culottes · April 22, 2021, 6:51am

This is not happening with items created in DT, only with those imported from EN. I have decided to OCR all PDF files that need it, then delete the replicants, and live with the results.

Should this issue recur with new items that are “created” rather than imported from EN (which I no longer use, so no new items are being generated this way), I will open a support ticket.

Thanks for your help.

Blanc · April 22, 2021, 7:25am

That’s a pragmatic solution. I’m sorry I haven’t been able to be of more assistance.

Sans-Culottes · January 6, 2023, 8:14am

Further occurrence

This is not the first time, but it is the first time that there were no DT interactions of any sort between the OCR operation and me noticing the non searchable items.

Actions:

I dragged two JPEGs from Apple Mail to a database inbox, tagged them, then moved them to a group.

Later I noted that my smart group that alerts me to non searchable files showed two files. From within the smart group I selected both files together and ran OCR to create searchable PDFs.

Result:

Searchable PDFs in the location of the originals as expected (A)
The originals moved to trash (as expected per preference settings) (B)
A replicant of the searchable PDFs in the inbox (C)

Further action:

Deleted files (C) and (B)

Result:

This left (A) despite it being a replicant of a file that was deleted (C)

Any thoughts as to what is going wrong / on please?

Many thanks

Chris