Automatically OCR PDFs and save them to the same folder as originalfilename_ocr.pdf

Sonicx29 · May 2, 2021, 8:34pm

Dear sir or madam,

I need help resolving a problem.

I have a lot of folders containing thousands and thousands of PDF files. Many of them are searchable, and many are not.

I know how to show only the non-searchable PDF files in DEVONthink via Data > New from Template > Smart Groups > PDFs (not searchable).

I need to automatically OCR every non-searchable PDF and save the searchable result in the same folder as the original, named originalfilename_ocr.pdf, while keeping the original (non-searchable) file.

Could you help me do this, please?

BLUEFROG · May 2, 2021, 11:51pm

Welcome @Sonicx29

Are you importing or indexing files into the database?
Have you used the smart group you mentioned?
Do you see matching files when you select it in the database?
If you disable Preferences > OCR > Original document: Move to Trash, the un-OCR’d file will be preserved.

Note: Do not set DEVONthink to try to OCR “thousands and thousands of files” in one action. Working in smaller batches is a much better idea.

Sonicx29 · May 3, 2021, 8:14am

Yes, I created the smart group and I see matching files in the database.

I am sorry, maybe my description of the problem wasn’t clear.

My idea is to create the new OCR’d file not only in the DEVONthink database but also as a new physical file in the folder (file system).

Is that possible, please?

Thanks

cgrunenberg · May 3, 2021, 4:05pm

You could index the folder containing the PDF files and a smart rule could then perform the necessary actions:

It might also be a good idea to use a flag, label, or tag to mark already processed items and exclude them from the smart rule.

Sonicx29 · May 5, 2021, 10:36am

Dear friends, I am here with my problem again.

Thank you for your help but I need something different.

My approach to working with folders and files is different. I prefer not to work with a database the way DEVONthink is designed. Still, maybe I will change my mind.

I need to resolve my problem: I have a lot of folders with a lot of files and I want to keep this structure. Is it possible to use DEVONthink for this? I want DEVONthink to check all the PDF files, find every non-OCR’d (non-searchable) PDF, OCR it, and create the new OCR’d PDF in the same folder as the original non-searchable PDF.

Can I achieve this with DEVONthink without importing all the files into the database and working with the database?

By the way, your app is really great. I have tried probably all the popular OCR software (ABBYY, Adobe Acrobat, …) and none of these apps can find only the non-OCR’d (non-searchable) PDFs.

Thank you.

cgrunenberg · May 5, 2021, 11:44am

This is indeed possible, see my first reply.

Sonicx29 · May 5, 2021, 1:09pm

Thank you!

I set it up and it works!

But there are these problems:

I set everything up as you described, but the new OCR’d file is named originalname-1.pdf instead of originalname_ocr.pdf. How can I fix it?
More importantly, is it possible to set the language for OCR? I need Czech, because right now OCR can’t recognize, for example, Čeněk, Škoda, …
Is it possible to add this condition to the smart rule: if there is a non-searchable PDF and also a file with the same filename plus _ocr (filename_ocr.pdf), skip it and OCR the next one? This is very important for future use after adding new PDF files, because I don’t want to OCR all the files again.

Thanks

cgrunenberg · May 5, 2021, 1:33pm

E.g. copy (see Change Alias action) the name to the alias before OCR, then use the alias in the Change Name action after OCR.

The OCR language can only be changed via Preferences > OCR.

There’s no such smart rule condition. The only workaround would be to replace the actions with a script which checks this condition first before handling the actions on its own.
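The check such a script would have to make can be sketched at the file-system level. The Python below is only an illustration of the logic, not DEVONthink’s own scripting interface: the names `OCR_SUFFIX`, `ocr_target`, and `pending_pdfs` are hypothetical, and it assumes the OCR’d copy sits next to the original as originalname_ocr.pdf. It does not test whether a PDF is actually searchable; it only looks at file names.

```python
from pathlib import Path

# Assumed naming convention from this thread: originalname_ocr.pdf
OCR_SUFFIX = "_ocr"


def ocr_target(pdf: Path) -> Path:
    """Return the sibling path where the OCR'd copy of `pdf` is expected."""
    return pdf.with_name(f"{pdf.stem}{OCR_SUFFIX}{pdf.suffix}")


def pending_pdfs(root: Path) -> list[Path]:
    """List PDFs under `root` that do not yet have an _ocr sibling."""
    pending = []
    # Note: the "*.pdf" pattern may miss ".PDF" on case-sensitive file systems.
    for pdf in sorted(root.rglob("*.pdf")):
        if pdf.stem.endswith(OCR_SUFFIX):
            continue  # this file is itself an OCR output
        if ocr_target(pdf).exists():
            continue  # already processed on an earlier run
        pending.append(pdf)
    return pending
```

Calling `pending_pdfs(Path("/path/to/folder"))` (placeholder path) would return only the originals that still lack an `_ocr` copy, so repeated runs after adding new PDFs skip everything already handled.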

Sonicx29 · May 5, 2021, 1:56pm

Great! Thank you. I changed the OCR preference and the language is OK.

Could you please help me with the last one (the right name)? I set it up like this, but it doesn’t work the way I need.

[Screenshot 2021-05-05 at 15.56.17]

cgrunenberg · May 5, 2021, 2:01pm

The actions should look like this:

Sonicx29 · May 5, 2021, 2:44pm

I tried it, but it doesn’t work correctly.

If I set it up this way, then after OCR the original file disappears and there are two files: one (originalname-1) that is searchable (OCR’d) and a second (originalname_ocr) that is non-searchable.

I would prefer to keep the original file without renaming or any other change, and only create a new searchable file after OCR. Is that possible, please?

[Screenshot 2021-05-05 at 16.39.25]

BLUEFROG · May 5, 2021, 3:17pm

If you’re just OCR’ing to PDF, use OCR > Apply. This doesn’t generate a new file. It will OCR in-place for your purposes.

Sonicx29 · May 5, 2021, 5:19pm

Sorry, but I do want to generate a new file. Maybe my description of the problem wasn’t clear; the pictures describe what I need.

[Screenshot 2021-05-05 at 19.17.01]

BLUEFROG · May 5, 2021, 6:28pm

Ahh… no worries! I misread your post.

Sonicx29 · May 6, 2021, 7:27am

“There’s no such smart rule condition. The only workaround would be to replace the actions with a script which checks this condition first before handling the actions on its own.”

It could be possible to resolve this if you added this functionality to the next version: a condition such as “date created” is “newer than” “01.04.2021” (for example).

cgrunenberg · May 6, 2021, 7:50am

The action’s name is Execute Script.

That’s already possible:

Sonicx29 · May 6, 2021, 4:48pm

May I ask you to help me with this, please? I mean my last message with the 2 pictures. Thank you very much.