Automatic OCR and export - rule doesn't start when importing from AppleScript

cgrunenberg · January 5, 2021, 8:53am

Smart rules are blocked while either another smart rule (e.g. due to OCR) or a script is still running. Maybe Hazel tries to import the next file while this is still the case? To work around this and to simplify the setup, there are at least two options:

Use either Hazel or smart rules (e.g. a script in Hazel could perform all necessary actions)
Index the folder ~/Downloads/DMS WorkDir/Nach OCR/. In this case Hazel would simply move the original files to this folder, and a smart rule in DEVONthink would monitor and process files added to this indexed group.

chrillek · January 5, 2021, 10:02am

I second this. Hazel watches a folder and acts if something arrives there. You can do the exact same thing in DT by indexing this folder and having a smart rule act on it: If it needs OCR, OCR it and move it to “Nach OCR”. If it doesn’t need OCR, move it to the same group at once. You could do that with a single smart rule that looks for the condition “type is PDF/PS” and then runs a script which

checks whether the plain text attribute of the record contains anything
if not, runs OCR on the record
and then moves the record (which either was already OCR’d or is now) to the relevant group
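
A minimal sketch of that decision logic, written in Python purely for illustration. It does not use DEVONthink's scripting interface: `Record`, `needs_ocr`, `process` and the `ocr` callback are hypothetical stand-ins. The idea is that OCR runs only when the record has no text yet, and the record is moved in either case.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Record:
    """Hypothetical stand-in for a DEVONthink record (not the real API)."""
    name: str
    plain_text: str
    group: str


def needs_ocr(record: Record) -> bool:
    # A PDF without a text layer has empty (or whitespace-only) plain text.
    return not record.plain_text.strip()


def process(record: Record, ocr: Callable[[Record], str], target_group: str) -> Record:
    # Run OCR only when there is no text yet ...
    if needs_ocr(record):
        record.plain_text = ocr(record)
    # ... and move the record to the target group in either case.
    record.group = target_group
    return record
```

With this shape a single rule covers both kinds of files, because the script itself decides whether OCR is necessary.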

cgrunenberg · January 5, 2021, 10:07am

Adding the condition Word Count is 0 would avoid this and is more efficient.

chrillek · January 5, 2021, 10:34am

I was thinking about: What happens with the files that are OCR’d already? If you check for OCR in the script, one smart rule suffices and the script does all the work (i.e. decide if OCR is necessary). If you use a condition in the smart rule (i.e. Word Count is 0), one needs at least two rules: one for files to be OCR’d and one for the rest.

soc007 · January 5, 2021, 12:09pm

The idea is good so far.
I have the following rule on the directory “Vor OCR”:

With Hazel I set the tag NeedOCR when a PDF is pushed into the directory:

If I now scan two documents, the first document is processed by the rule, but the second PDF is not:

I would like to run the rule when a document is created in the directory and not depending on the tag.
Which event should I choose? I’ve already tried a few, but none of them fires as soon as a document is created in the directory.

chrillek · January 5, 2021, 12:57pm

But what you describe is still a two-step procedure. Why do you even involve Hazel? It only sets a tag which you then remove in DT anyway (at least when you do OCR. I wonder what happens to this tag when you don’t OCR the file because the word count is greater than 0). Why not have DT watch the relevant folder (i.e. index it) and have it do what you need it to do?

soc007 · January 5, 2021, 6:42pm

The tag was introduced because files that are not intended for OCR could end up in the folder.
What do I actually want to achieve?

I have a directory on the NAS in which scan files are stored.
The devices are a professional document scanner, a scanner from an MFC printer and from an app on the iPhone.

Other files are also stored in this upload directory. e.g. Word or Excel documents, which of course should not be subjected to an OCR.

Hazel monitors this upload directory on the NAS and only moves the files that are used for OCR into the “Vor OCR” directory.

For (almost) all files that are saved in the “Vor OCR” directory, OCR with DT3 should be carried out automatically.

After the OCR, the files are stored in a new “Nach OCR” directory.
Here again Hazel monitors the directory and looks for the document date in the document. It uses this date to rename the file.

Whether I use a tag for this is of course optional. My only concern is that everything from the scan to the renamed document is automated.

I am currently converting my office to paperless and have to scan around 10,000 pages. Every manual intervention is a no-go.

chrillek · January 6, 2021, 5:51pm

After you described your process, I tried to set up something similar. Now I know why you’re kind of frustrated.
Here’s what I decided to do:

Scan to a NAS folder
Have Hazel watch it
Have Hazel copy all files arriving there to DT’s global inbox (~/Library/Application Support/DEVONthink 3/Inbox)
Use a smart rule in DT which monitors the Inbox and subjects all arriving PDF documents with word count < 1 to OCR
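
For illustration only, here is roughly what the Hazel copy step amounts to, sketched in Python. This is not how Hazel works internally, and `copy_new_pdfs` and its parameters are made-up names. It copies PDFs from a scan folder into a destination folder, ignores other file types, and skips any file whose name is already present in the destination.

```python
import shutil
from pathlib import Path
from typing import List


def copy_new_pdfs(scan_dir: Path, inbox_dir: Path) -> List[Path]:
    """Copy PDFs from scan_dir to inbox_dir; return the copies that were made."""
    inbox_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(scan_dir.iterdir()):
        # Only regular files with a .pdf extension (case-insensitive) qualify.
        if not source.is_file() or source.suffix.lower() != ".pdf":
            continue
        target = inbox_dir / source.name
        # Skip files whose name already exists in the destination folder.
        if target.exists():
            continue
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

In my setup the destination would be the inbox folder mentioned above, and the scan folder would be wherever the NAS share is mounted (that path is specific to your machine).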

This is of course different from what you do in that

I do not mix all kinds of documents in the same folder watched by Hazel
I do not move the OCR’d file to a special OCR folder

The first is a matter of taste (and necessity, since I have nothing but PDFs coming from my scanner; I’m not touching Microsoft documents with a ten-foot pole). The second … well, if you only want to use the document date for naming purposes, DT should be able to do that itself. I don’t really care for automation there, because I have only a few documents per week arriving, so I can do that manually.

If you really manage to convert 10000 pages to OCR’d PDFs without manual intervention: congrats. Depending on the document quality and the date format (I just saw something like “1 Dezember 2020” and also “Im November 2020”), you might want to adjust your date-finding function in Hazel.
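
To show what coping with those two formats could look like, here is a small Python sketch. It is not Hazel's date matching, and `find_german_date` is a hypothetical helper. It accepts an optional day (with or without a trailing dot) before a German month name and a four-digit year. Falling back to the first of the month when no day is given is my own assumption.

```python
import re
from datetime import date
from typing import Optional

GERMAN_MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
    "juli": 7, "august": 8, "september": 9, "oktober": 10, "november": 11,
    "dezember": 12,
}

# Optional day ("1" or "1."), then a month name, then a four-digit year.
_DATE_RE = re.compile(
    rf"(?:\b(\d{{1,2}})\.?\s+)?\b({'|'.join(GERMAN_MONTHS)})\s+(\d{{4}})\b",
    re.IGNORECASE,
)


def find_german_date(text: str) -> Optional[date]:
    """Return the first date found in text, or None if there is none."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    day_text, month_name, year_text = match.groups()
    day = int(day_text) if day_text else 1  # assumption: no day means the 1st
    try:
        return date(int(year_text), GERMAN_MONTHS[month_name.lower()], day)
    except ValueError:
        # Impossible dates such as "31 November 2020" are rejected.
        return None
```

Numeric formats like 01.12.2020 and abbreviated month names are deliberately left out here. OCR errors in the month name would also defeat this pattern.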

BLUEFROG · January 6, 2021, 9:40pm

It is still unclear why you’re using Hazel in this situation.

What is the purpose of this “upload directory”?

soc007 · January 7, 2021, 8:41am

Thanks for the tip on how to do it.
I will implement your suggestion like this.

Maybe important as information.
I save my documents on the NAS and not in the DT3 database. I basically just index the appropriate directory with DT and use DT for things like quick searches etc.

By storing the documents on my NAS, I can also access them from other devices (e.g. via VPN from outside my office).

It is already clear to me that I am not able to create 100% automation. But if I can get 70-80%, then that’s completely sufficient.

soc007 · January 7, 2021, 8:47am

I use Hazel because I think the functionality is good and, due to my filing system (not within DT but on my NAS), the tool is easier for me to use.

The upload directory on my NAS is for the scanned documents. I can scan documents without having to turn on my Mac. The NAS is always available.

soc007 · January 7, 2021, 9:22am

@chrillek
Super.
Now it works as expected.
All scanned documents are properly processed by the rule.

Thanks for the decisive tip with copying the files directly into the DT inbox directory.

chrillek · January 7, 2021, 11:27am

Good. Please note that in my setup I didn’t move the files back into the filesystem.
On a related note

You could also access them from other devices (provided they’re from Apple) if you stored (i.e. imported) your documents in DT itself. That’s what I do, using my NAS only for syncing and (now) as an intermediate scan store. It’s of course a matter of taste, I just find it more convenient to have a similar view on the documents from all devices and not having to start a VPN each time I want to see something.

soc007 · January 7, 2021, 12:01pm

@chrillek
That is of course also a possibility.
Since I only use Apple devices and often have to access the documents via iPhone while on the move, DT on the iPhone is out of the question because (if I have interpreted the documentation correctly) I cannot, and do not want to, sync the mass of documents via iCloud.

chrillek · January 7, 2021, 12:16pm

Which is not necessary. If you use your mobile devices only for reading documents (i.e. don’t enter new ones or change existing ones on your iPhone/iPad), you could use Bonjour syncing while not on the move. It’s described in the documentation, I think.
Bonjour is reportedly fast and reliable. Turn it on on your desktop and set the mobile devices to use Bonjour (do not turn Bonjour on on them, though!).
Again: a matter of taste. I’m just mentioning possibilities.

soc007 · January 7, 2021, 12:42pm

@chrillek
Thanks for the tip.

My primary concern is external access.
And here I currently don’t see an option with the DT mobile app.
I avoid backing up documents on a cloud outside the EU (with regard to data protection) and the DT Mobile app currently offers no other option.
Correct?

chrillek · January 7, 2021, 1:31pm

What is the connection between “external access” and “backup”? I described how to keep DT on iPhone and desktop in sync without the need for a cloud provider or even the internet: local sync via Bonjour.

If you don’t want your iPhone data backed up in iCloud, turn it off. That’s completely unrelated to syncing.

soc007 · January 7, 2021, 1:55pm

@chrillek
Sorry, then I got it wrong.

So a “backup” of my documents (I hope I understand that correctly) on my iPhone (approx. 1.5 GB) would be the solution?
Quite apart from the fact that I don’t consider redundant data storage to be optimal, the amount of data is a bit much.
This will take hours, especially with the first “backup”.
Correct?

chrillek · January 7, 2021, 3:24pm

Why do you read “backup” when I say “sync”? I suggest you read up on that topic (or better: both, namely sync and backup) in the documentation or here in the forum. There’s no requirement to download all documents to the iPhone.

BTW: Given the current frenzy with selfies, movies etc., I don’t think that 1.5 GB is a relevant amount of data on a current iPhone.

soc007 · January 7, 2021, 4:17pm

@chrillek
OK.
Thanks anyway for the excellent help.
I will look at the topic in the documentation again and reconsider my reasons for storing files on the NAS.
If I need further help with this, I will open a new thread (we are a bit off-topic here … ;-))