OCR All Files in Place.

jamesr219 · December 8, 2008, 4:27am

Hello,

New to DT. Just purchased Office Pro. Think I’m going to like it. However, I have a large collection of PDF files (~6k) which were scanned with ScanSnap but not OCR’d. I’d like to OCR all these documents (I realize it will take some time)…

What is the best way to OCR lots of documents? The other requirement I have is that I want to maintain the “modification time” of each document, since the document is really the same and just now includes text. It seems that when you do a conversion inside DT it creates a duplicate PDF file. This isn’t really what I want, since I would need to go back and delete all the original, un-OCR’d files afterwards.

Can someone suggest a good solution?

Thanks!

-jr

annard · December 8, 2008, 11:34am

There is a script in the DEVONacademy that can help you with this. Check the section “Example Scripts”.

jamesr219 · December 8, 2008, 1:57pm

Thanks. This has gotten me started in the right direction. I am not an AppleScript pro… This may be a silly question, but can someone tell me where I can find information on the different objects available in DT?

ex: The selection contains items, and those items contain a modification date. Where can I find the other fields those items contain?

Thanks,

-jr

annard · December 8, 2008, 2:45pm

If you use Script Editor, you can open the dictionary of DEVONthink Pro and that shows you what is available. Also check our online help for references to AppleScript information that you may find useful.
Another approach that may be a bit friendlier is to use Automator. On the distribution disk you can find a workflow to convert PDF files. There is also an OCR Items action that we provide.
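To give an idea of what the dictionary lists, here is a minimal, untested sketch that reads a few properties of the first selected record. It assumes the property names "name", "creation date", "modification date" and "plain text" as they appear in the DEVONthink Pro dictionary; the variable names are only for illustration.

```applescript
-- Sketch only: show a few properties of the first selected record.
-- Property names are taken from the DEVONthink Pro dictionary; verify
-- them in Script Editor before relying on this.
tell application "DEVONthink Pro"
	set this_selection to the selection
	if this_selection is {} then error "Please select some contents."
	set this_record to item 1 of this_selection
	set record_name to name of this_record
	set record_created to creation date of this_record
	set record_modified to modification date of this_record
	set record_text to plain text of this_record
	-- An empty plain text usually means the PDF has no text layer yet.
	display dialog record_name & return & "Created: " & (record_created as string) & return & "Modified: " & (record_modified as string) & return & "Characters of text: " & (length of record_text)
end tell
```

Any other property shown in the dictionary can be read the same way.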

jamesr219 · December 8, 2008, 3:02pm

Thanks. The dictionary was exactly what I needed…

I’ve got it mostly worked out. Here is the script in case anyone is interested.


-- OCR items without text
-- Save the original modification date.
-- Move the original non-text files into sub folder.
-- Created by Eric Böhnisch-Volkmann, Jan 22, 2008
-- Modified by James A. Russo, Dec 8, 2008
-- Copyright (c) 2008. All rights reserved.

using terms from application "DEVONthink Pro"
	tell application "DEVONthink Pro"
		activate
		set this_selection to the selection
		if this_selection is {} then error "Please select some contents."
		set first_item to item 1 of this_selection
		set collection_group to create record with {name:"Collected items without text", type:group} in parent 1 of first_item
		repeat with this_record in this_selection
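			-- Only records without a text layer need OCR; groups are skipped.
			set this_text to plain text of this_record
			if this_text is "" and type of this_record is not group then
				try
					-- Remember the original date before anything is changed.
					set old_date to modification date of this_record
					-- OCR creates a new record and leaves the original as it is.
					set converted_record to convert image record this_record
					-- Park the original in the collection group for later review.
					move record this_record to collection_group
					-- Give the OCR'd copy the original's modification date.
					set the modification date of converted_record to old_date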
				on error error_message number error_number
					display dialog (error_message)
				end try
			end if
		end repeat
	end tell
end using terms from

devamag · January 13, 2014, 6:32am

I’ve searched for ways to efficiently OCR PDFs already in DT and the script here, though dated, seems best. However, the one problem is that, if “parent 1” is a tag, then the script will move the item from the original group to the tag group and so the original location gets lost.

What I’m looking for is everything this script provides minus the tag issue:

OCR PDFs.
Keep the original PDFs close by but isolated.
Easily delete the original PDFs after comparing and confirming they have been OCR’d.

I’ve seen the “Import, OCR, and Delete” script, but I want to OCR items that are already imported. I’m open to deleting the original PDFs immediately, but I don’t want to accidentally delete records that will contain no text even after OCR, such as handwritten notes or pictures.

If I could just get this script to ignore parent groups that are tags, I think I would be most of the way there.

Does anyone know of a different script with the desired functionality or have an idea on modifying this script? Thanks!

korm · January 13, 2014, 10:38am

Looks like you might have imported documents into a tag group and replicated them to the group where the documents you want to OCR are stored. I have confirmed by testing that this setup produces the outcome you describe. The fix is not to import anything into a tag; import into a normal group instead.

Otherwise, this situation should not happen if you select documents contained in a non-tag group.
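If some records do end up with a tag as their first parent, one possible workaround is to have the script look for the first parent that is not a tag instead of always taking “parent 1”. What follows is a minimal, untested sketch of that idea. It assumes the record class has a “tag type” property whose value is “no tag” for ordinary groups (please check the dictionary of your version), and “candidate” and “target_parent” are names made up for illustration.

```applescript
-- Sketch only: create the collection group in the first non-tag parent
-- of the first selected record, rather than in "parent 1".
-- Assumption: records expose a "tag type" property (no tag / ordinary tag /
-- group tag); confirm this in the Script Editor dictionary.
tell application "DEVONthink Pro"
	set this_selection to the selection
	if this_selection is {} then error "Please select some contents."
	set first_item to item 1 of this_selection
	set parent_list to parents of first_item
	set target_parent to missing value
	repeat with i from 1 to (count of parent_list)
		set candidate to item i of parent_list
		if tag type of candidate is no tag then
			set target_parent to candidate
			exit repeat
		end if
	end repeat
	if target_parent is missing value then error "No non-tag parent group found."
	set collection_group to create record with {name:"Collected items without text", type:group} in target_parent
end tell
```

The rest of the earlier script (the repeat loop that converts and moves each record) could then use “collection_group” unchanged.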

I’m not clear about what “keep original PDFs close by” means – “close by” is where?