Problem with OCR - creating searchable PDF

HerbertHoerner · October 18, 2022, 2:14pm

I use exactly the same setup, and have for the last 2 years: a Brother ADS 2800 scans to a shared NAS folder (Synology). It was definitely working until mid-August (the last accounting session I remember)… I only noticed the problem now, just before I submitted the topic.

HerbertHoerner · October 18, 2022, 2:25pm

I opened one scanned PDF in Apple Preview, saved it to disk, and imported it (drag and drop) into DT.
Then I started an OCR to searchable PDF… the problem remains unchanged. OCR runs recognition, but no PDF+Text file is stored.

BLUEFROG · October 18, 2022, 2:50pm

Have you quit and relaunched DEVONthink, then checked DEVONthink 3 > Install Add-Ons for an update to the OCR engine?

HerbertHoerner · October 18, 2022, 3:17pm

Yes, I did. I also restarted DT3 and the Mac several times, and I have now updated the scanner’s firmware. Still no change.
The only thing that helps is a workaround with one additional step: convert the scanned PDF to a paginated PDF first, then run OCR to searchable PDF as a second step.
But this is only a workaround and not practical in our day-to-day therapy business, where we use DT3.

HerbertHoerner · October 18, 2022, 3:22pm

Was there any change to file security in the latest Apple updates? I am not in a position to analyze this further, but it would explain why OCR runs and then dies at “saving PDF”.

cgrunenberg · October 18, 2022, 3:24pm

A smart rule could perform this.

It’s an issue with the latest ABBYY engine on M1/M2 computers when PDF documents have invalid dates (like the ones created by the Brother software). The next release will include a fix for this.
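
For readers who want to inspect their own scans: the post does not say what exactly is wrong with the dates, so the following is only a minimal sketch in Python (not DEVONthink’s or ABBYY’s code) of what a well-formed PDF date string looks like. It assumes the usual D:YYYYMMDDHHmmSSOHH'mm' layout from the PDF specification, read loosely (everything after the year is optional, and the trailing apostrophe is tolerated). The name is_valid_pdf_date is hypothetical.

```python
import re

# Assumed layout: D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional
# time-zone part (Z, + or -, then HH'mm'). This is a loose reading of the
# PDF date format, not a full validator.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:[Z+-](?:\d{2}'?(?:\d{2}'?)?)?)?$"
)

# Allowed ranges for month, day, hour, minute, second (in that order).
_LIMITS = ((1, 12), (1, 31), (0, 23), (0, 59), (0, 59))


def is_valid_pdf_date(value):
    """Return True if value looks like a well-formed PDF date string."""
    match = _PDF_DATE.match(value)
    if match is None:
        return False
    # Skip the year (group 1); check each optional field that is present.
    for text, (low, high) in zip(match.groups()[1:], _LIMITS):
        if text is not None and not low <= int(text) <= high:
            return False
    return True
```

A value such as the CreationDate entry of a PDF’s Info dictionary could be passed to this function. How the dates written by the Brother software actually deviate is not stated in this thread.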

HerbertHoerner · October 18, 2022, 3:38pm

Great, thank you for your fast support in solving this issue. I have been using the smart rule workaround since the problem first occurred, and now I know it will be fixed with the next ABBYY update.

clearsky · October 23, 2022, 6:22pm

Yes, no update available. I also reinstalled the ABBYY plugin, with no change. I even installed a firmware upgrade on the Brother ADS-2800W like HerbertHoerner did. No help. It’s frustrating. The process dies after the OCR, when the status says “saving PDF-document”.

Edit: sorry, I’ve only just read the lines below. Waiting for a quick fix now…

clearsky · October 23, 2022, 9:34pm

Hey Christian,

thanks a lot for your help. I’m using your script to import and OCR the files into DT. Can you add the lines that will force DT to convert to a paginated PDF first and then do the OCR? You would do me a great favor, since my AppleScript skills are a little weak…

-- DEVONthink - Import, OCR & Delete.applescript
-- Created by Christian Grunenberg on Fri Jun 18 2010.

on adding folder items to this_folder after receiving added_items
    try
        if (count of added_items) is greater than 0 then
            tell application id "DNtp" to launch
            repeat with theItem in added_items
                set thePath to theItem as text
                -- Skip unfinished browser downloads.
                if thePath does not end with ".download:" and thePath does not end with ".crdownload:" then
                    -- Wait until the file size stops changing, i.e. the file has been written completely.
                    set lastFileSize to 0
                    set currentFileSize to 1
                    repeat while lastFileSize ≠ currentFileSize
                        delay 0.5
                        set lastFileSize to currentFileSize
                        set currentFileSize to size of (info for theItem)
                    end repeat
                    try
                        tell application id "DNtp"
                            -- OCR into the incoming group; delete the source file only if a record was created.
                            set theRecord to ocr file thePath to incoming group
                            if exists theRecord then tell application "Finder" to delete theItem
                        end tell
                    end try
                end if
            end repeat
        end if
    end try
end adding folder items to
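
The size-polling loop in the script above (wait until the file stops growing before handing it to OCR) is a general technique for watched folders. As a language-neutral illustration, here is a minimal sketch in Python rather than AppleScript. It adds a timeout, which the original script does not have; the function name wait_until_stable and its parameters are hypothetical.

```python
import os
import time


def wait_until_stable(path, interval=0.5, timeout=60.0):
    """Poll the file size until two consecutive polls agree.

    Returns the final size in bytes, or raises TimeoutError if the file
    is still changing when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    last_size = -1  # impossible size, so the first poll never matches
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        if size == last_size:
            return size  # unchanged since the previous poll
        last_size = size
        time.sleep(interval)
    raise TimeoutError(f"{path} was still changing after {timeout} seconds")
```

Two equal sizes in a row are only a heuristic: a scanner that pauses for longer than the polling interval while writing would still be picked up too early.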

cgrunenberg · October 24, 2022, 7:22am

Folder actions have actually been discouraged since version 3; smart rules are recommended instead, as they’re much easier to build & maintain and don’t require scripting skills. In this case just index the folder, then create a smart group that is limited to the indexed group and performs its actions On Importing. Finally add the actions
…

Move into database
Convert to Paginated PDF
OCR > Apply
And optionally Move to move it to the desired group/database

clearsky · October 25, 2022, 8:48pm

Thank you for your explanations, Christian! I got it to work now, but the original file is not deleted. So afterwards I have 2 files in DT: one without OCR, indexed and still on the hard drive, and one with OCR and renamed. How can I tell DT to delete the original file (the import & delete function of the folder script)?

Another thing I did not understand was how I can perform actions on a smart group and what the group is for. I have now done it with a smart rule on the indexed folder.

chrillek · October 25, 2022, 8:51pm

There’s a setting in the OCR tab of the preferences for that.

clearsky · October 25, 2022, 9:08pm

Thank you for the hint, chrillek. Just checked it and the checkbox is already activated.

I just don’t get it. Why is this so hard? Doesn’t anybody want to import directly from the scanner? And if that does not work, then scan to a folder and automatically import, OCR, and delete the original file? Did I miss a part in the handbook? I’m just wondering, because sometimes you just don’t do it the right way when it is too hard to get it done.

chrillek · October 25, 2022, 9:13pm

I don’t know, since I tell my scanner to send scans via e-mail. Regardless: you don’t say what you did to make it (kind of) work, so it’s not possible to tell you what you might change to make it work the way you want.

Which smart group? And what kind of “action” do you want to perform on it?

We don’t know what “it” is. @cgrunenberg suggested that you move the file into DT, convert to paginated PDF, and OCR that. Is that the same as “it”?

BLUEFROG · October 26, 2022, 3:24am

You’d obviously need to target a new indexed group in the Search in dropdown and the Move action.

cgrunenberg · October 26, 2022, 8:02am

In this case it’s just a temporary workaround for an issue caused by Brother scans not conforming to the PDF specification; the next update of the OCR engine will work around this.

HerbertHoerner · October 26, 2022, 1:21pm

I use the same workaround. But the checkbox in the OCR preferences to delete the file doesn’t work here, because OCR is not what creates the duplicate. It was already produced in the earlier step by ‘convert & continue to paginated PDF’. So I am also struggling with deleting the initial PDF. Any good hint on how to delete the initial file produced by the Brother scanner (but within the same smart rule)?

BLUEFROG · October 26, 2022, 2:39pm

I would have suggested using a Move to Trash action as the second action, but @cgrunenberg recently informed me he didn’t think that’s a good idea.

cgrunenberg · October 26, 2022, 2:47pm

In this case this shouldn’t be an issue.

clearsky · October 26, 2022, 3:28pm

@HerbertHoerner, don’t be confused, as I was, that the file is still visible in the Finder. It is only deleted when the DT Trash is emptied, not right after the smart rule is applied.