OCR and save back as PDF+text?


I am evaluating DTPO for our local needs. Can someone please assist me with the following question/scenario?

I scan files to PDF in a network folder. Is it possible for DTPO to find these documents, OCR them, and save them back as PDF+text, all automatically?

It seems that this should be possible with scripting/automation but I do not see an example. I have been able to add them to a database, but the resultant files are only accessible via DTPO. I would like to keep all of these files as PDF+text, whether or not I incorporate them into any DTPO database.

Will I ultimately need a separate OCR application to do this?

Thank you,

If you want PDF+Text whether or not you import to DT, you’ll need to do this outside of DT, or use the DT-included OCR engine, then remove the files from DT that you don’t want there.

For a clean workflow, you might consider scanning/OCRing your files into a folder that you index in DT. This would require a standalone OCR engine, but would give you the flexibility you indicate you want.

Thank you Korm for the quick reply!

My question is, to save the cost of buying a separate OCR application (in addition to several DTPO licenses), can I script one of the DTPO instances to do this?

I have tried this via AppleScript and a watched folder, and can get the new files into DTPO, but not back out again to the file system.

The main reason for doing this is that some files will be more valuable than others, but I would like all saved as PDF+text in the filesystem.


Now that the ABBYY OCR engine is exceeding the Acrobat OCR engine, I’m trying to replace all of my Adobe apps. I have developed a very specific workflow that I am trying to replicate with DTPO 2.

  1. Scan Document with ScanSnap to an “ocr_inbox” folder
  2. Hazel uses Acrobat to OCR the PDF, then tags, applies an “OCR” label, renames file and moves to my inbox.
  3. I then move confidential files to a mounted encrypted disk image

My new workflow will be to index within a secured DTPO database.

I can handle the file renaming, tagging, and labels using Hazel, but I can’t figure out how to script DTPO to OCR the file in place. I’m assuming I could do an import with OCR, but I then want the file to be exported back to the original location and deleted from the DB.

Any ideas? Script examples would help.

Come on Eric or Bill, I know you have some secret sauce somewhere. I’ll wait until the celebration hangover wears off. :wink:

This workflow sounds almost perfect for Automator. Have you looked at it and seen the blocks?

Every step you mention appears to be something replaceable by Automator steps, even all of the hazel things.

To me, it seems like the perfect solution for this kind of workflow because you can consolidate all of the different services under one hood.

AppleScript can do it too, just grab the files as a set and deal with each file individually in a repeat with loop. Make a handler for each step so you can trap where in the process it failed, if at all.

The tricky part, it sounds like, is that you need to delete the records from the DTPO database after export. You can do that in Automator or AppleScript, too.
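A rough sketch of the handler-per-step idea, in case it helps. The folder paths and handler names are made up for illustration, and the `ocr file` command is an assumption about the Pro Office dictionary (check that your version lists it in Script Editor). `export record` and `delete record` are the same calls used elsewhere in this thread.

```applescript
-- Hypothetical folders; adjust to your own setup.
property inboxFolder : "/Users/me/ocr_inbox"
property outputFolder : "/Users/me/ocr_done/"

-- Step 1: OCR a PDF into DTPO and return the new record.
-- Assumes the Pro Office dictionary provides "ocr file"; verify in Script Editor.
on ocrStep(pdfPath)
	tell application "DEVONthink Pro"
		return ocr file pdfPath
	end tell
end ocrStep

-- Step 2: write the record back out to the file system.
on exportStep(theRecord)
	tell application "DEVONthink Pro"
		export record theRecord to outputFolder
	end tell
end exportStep

-- Step 3: remove the record from the database.
on deleteStep(theRecord)
	tell application "DEVONthink Pro"
		delete record theRecord
	end tell
end deleteStep

-- Run the steps in order and report which one failed, if any.
on processFile(pdfPath)
	set stepName to "ocr"
	try
		set theRecord to ocrStep(pdfPath)
		set stepName to "export"
		exportStep(theRecord)
		set stepName to "delete"
		deleteStep(theRecord)
		return "ok: " & pdfPath
	on error errMsg
		return "failed at " & stepName & " for " & pdfPath & ": " & errMsg
	end try
end processFile

-- Collect the POSIX paths of the PDFs waiting in the inbox folder.
set pdfPaths to paragraphs of (do shell script "find " & quoted form of inboxFolder & " -maxdepth 1 -name '*.pdf'")
repeat with pdfPath in pdfPaths
	log processFile(contents of pdfPath)
end repeat
```

Because each step sets `stepName` before it runs, the logged message tells you whether a file died in OCR, export, or delete, and a failed export never reaches the delete step.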

Ok, maybe a simpler solution will be to throw the file into the DTPO inbox then move to the file system manually. I’m part way there. Now how can I script index creation for all indexed folders?

Yes, it’s not that hard. For example if one selects some items, this exports them to the desktop and deletes them from the database:

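tell application "DEVONthink Pro"
	-- Work on whatever is currently selected in the frontmost window
	set sel to selection
	repeat with curSel in sel
		-- Write the item (plus DT's metadata files) to the desktop
		export record curSel to "~/Desktop/"
		-- Then remove it from the database
		delete record curSel
	end repeat
end tell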

(and one can also get rid of the metadata files DT exports).

Thanks. I could go ahead and script the walk through all of the records to automate this, but I’ve decided to make use of the inbox in DTPO.

I’m working out a script to automate re-indexing of folders after manually exporting files from the inbox.

Any ideas there? Doesn’t look like indexing is an available AS action. Only checking the index status of a record.

ummm… in automator, there is the “add items to current group” block that has an option to index instead of import.

in applescript, the DTPO Suite dictionary has import or indicate. indicate is a verb :laughing: can’t see why that choice wasn’t immediately apparent :mrgreen:

indicate v : Indicate (‘index’) a file or folder (including its subfolders). If no type is specified or the type is ‘all’, then links to unknown file types are created too.
indicate text : The POSIX path of the file or folder.
[to record] : The destination group. Uses incoming group or group selector if not specified.
[type all/chat/image/location/markup/pdf and postscript/quicktime/rich/script/sheet/simple] : File type to index.
→ record
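Going by that dictionary entry, a minimal call might look like the following. The folder path is hypothetical, and the optional `to` and `type` parameters are left out, so per the entry above the result should land in the incoming group.

```applescript
-- Hypothetical folder; replace with a folder that exists on your disk.
set scanFolder to "/Users/me/Documents/Scans"

tell application "DEVONthink Pro"
	-- Index (rather than import) the folder and its subfolders.
	-- No "to" parameter is given, so the destination defaults to the incoming group.
	set indexedRecord to indicate scanFolder
end tell
```

Add `to someGroup` after the path if you want the indexed folder to land in a specific group instead.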

I think you want “synchronize.”

Nah. He’s set up a watch folder into which scanned documents go. Has some other scripting bridge that watches the folder and imports them to DTPO for OCR.

He then wants to delete them from the database, change the file names, and place them in regular Finder folders with varying levels of security (he said confidential files).

THEN he wants the script to index (rather than import) the now freshly renamed and OCRd files.

Kind of a long way around the bush, but it sounds like he wants DTPO OCR and files located outside the database, but indexed in DTPO. Synchronization probably needs to happen several times during his workflow, but what he was asking was where the Index command was in AppleScript. It’s the Indicate verb.

Actually you’re both right. Synchronize would work if I could script it to work on an indexed group. While I can manually trigger a synchronize for an indexed folder, there is no obvious handle to do the same in AppleScript.
That said, the manual synchronize will work since I can just select all groups and synchronize.

In the end, the AppleScript library still does not have enough depth for handling groups and indexed records to avoid changing my long standing workflow. So, I’ll try branching out and change my workflow for a few weeks and see how DTPO2 holds up.
Now if only the DTPO2 inbox could perform the OCR work on all PDFs added to it automatically.

Amazing. Just shows how bad the library browser is in ScriptEditor since searching for “index” does not find “indicate” even though it’s in the command description.

Thanks for the heads up. I’m sure this will come in handy.

Maybe I am misunderstanding you, but you can set it up so that, every time you “activate” (ie, access) a given group in DT, which group has been created by indexing a folder on disk, it gets re-synchronized (updated) automatically.

Is that what you want?

See. I must be misunderstanding, too, because how can you make DTPO perform OCR on files that are merely indexed?

Synchronize would work if the files were in an already indexed folder, but he wants them OCR’d and file name changes and sounds like he wants them to move around then…


That would work perfectly (for part of the workflow). Can you describe the attached script that would perform that on an entire group? While I can get a reference to the current group in AS, when I try to use the “indicate” function on it, I get an AS error. It seems you can only call the indicate function on a single record, although the menu item “Synchronize” works on an entire group. I guess I could use “System Events” to call the menu item, but that seems really kludgy and interrupts keyboard use while the script is running.

Thanks acl.

You can not. That’s why I’m going to have to change the structure of my workflow. I’ll use the DTPO2 inbox and OCR from there. During my weekly review process I’ll export the records to my secure files system and then reindex.
I know that seems cumbersome, but I do not want to use the DTPO2 internal file structure for long-term storage. I prefer my own logical structure for the long term. The Unix file system will remain part of Mac OS for a long time. I just cannot count on a single product to outlive the retention time of my documents.

OK. I didn’t think it was cumbersome, it is the only way you can have your cake and eat it.

You really have two steps that don’t mix: (1) obtain DTPO OCR on an externally located file; and (2) index newly renamed and placed external files.

Sounds like you can do everything we discussed above in step 1 with the variety of script alternatives, so that’s a done deal.

Step 2 has these lingering choices. You can synchronize the folders for step 2, but that seems to be an alternative to indicating individual files. I wonder whether there is a difference if you indicate each file and then synchronize. At the very least, since you know the file names, it seems that using the indicate command on each file would be best, since you can at the same time specify the record name and any other attributes you may wish to set. Synchronizing would merely adopt whatever attributes it finds. Otherwise, they are pretty similar solutions.

There is a script to help with this included with DT. It is not a folder action but a script that is to be attached to the group inside DT. That is: You index a folder, and it appears in DT. You then select it and open the information panel (cmd+shift+I) and, where it says “Script”, you select that script.

The script is, as I said, included with DT. In total, its contents are:

-- Synchronize
-- Created by Christian Grunenberg on Oct Fri 22 2004.
-- Copyright (c) 2004-2008. All rights reserved.

on triggered(theRecord)
	try
		tell application id "com.devon-technologies.thinkpro2"
			synchronize record theRecord
		end tell
	end try
end triggered

After this, every time you click on that group in DT, it is synchronized to the disk contents.

But this

tell application "DEVONthink Pro"
	set theRec to get record at "/Notational Data"
	set syncedP to synchronize record theRec
end tell

does work. Here “Notational Data” is a group at the top level of the current database, which is an indexed folder. syncedP is set to true if anything was synchronized, false if not (ie there were no changes).
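If you have several indexed folders, the same two calls can be wrapped in a loop. This is only a sketch: the group locations in the list are placeholders ("/Scans" is made up), and each is assumed to be an indexed folder at the top level of the current database.

```applescript
-- Hypothetical group locations; each should be an indexed folder in the current database.
set groupLocations to {"/Notational Data", "/Scans"}

tell application "DEVONthink Pro"
	repeat with groupLocation in groupLocations
		set thePath to contents of groupLocation
		set theRec to get record at thePath
		if theRec is missing value then
			-- No group at that location; skip it rather than error out.
			log thePath & ": not found"
		else
			-- true if anything changed on disk, false otherwise
			set syncedP to synchronize record theRec
			log thePath & ": " & syncedP
		end if
	end repeat
end tell
```

This avoids driving the Synchronize menu item through System Events, so the keyboard stays usable while it runs.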

Doing it individually for each file is also probably not much harder, but sounds like it’s more work.