OCR & Indexed Files

I am new to DEVONthink and am trying to fine-tune my OCR workflow.

I have chosen to index some of my case work folders on my hard drive rather than import them (primarily because I am constantly working on the active files and I like the file structure I have set up). I have imported my older work documents that I do not actively use.

Within DEVONthink, I can see which of my PDFs have not been OCR’d. I would like to OCR them in DEVONthink and have DEVONthink replace the indexed (non-searchable) PDF with the new searchable one, still indexed. When I gave it a go, DEVONthink OCR’d the PDF but then filed the result within the database. It “trashed” my indexed entry, not the original file, so I am guessing that when I synchronize with my indexed files again I will have two copies of the same document: the OCR’d one in DEVONthink and the non-OCR’d one that has continued to reside outside of DEVONthink.

How can I utilize the OCR on documents that are indexed rather than imported?

Thanks in advance for the advice!

Jeff

I just saw this answer that appears to provide the workflow … but does anyone have any suggestions re: how to automate it?

viewtopic.php?f=2&t=9943

Thanks.

Jeff

Unfortunately, I have tried the suggestion in the above link and have run into two problems (one expected and one unexpected).

Here’s the suggestion:

“If Preferences > OCR has the option to move the original PDF to the Trash CHECKED, the original will be deleted. You may find it useful to select a PDF and press ‘Command-R’ (the Reveal command) to see it in the group where it is filed. From that location, select the PDF and choose ‘Data > Convert > to searchable PDF’.
However, as the searchable PDF is stored within the database, and isn’t currently indexed, there’s an option to move it to the external folder that had been Indexed. Select the PDF, Control-click and choose the contextual menu option to move it to the external folder.
Finally, select the group corresponding to that external folder and choose ‘File > Synchronize’. Now the searchable PDF is Indexed, and is among the items listed within the Indexed group.”

  1. Here’s the unexpected problem: using the above method, I am left with two files in my indexed location.

Methodology & Results:
I have “move to trash” selected in the OCR preferences, but when I convert an indexed file to a searchable PDF, my original non-OCR’d file stays in its location outside the database rather than going into the Trash.
My OCR’d file is in the database, so I then choose the contextual menu option and it goes back to the indexed location.
The weird thing is that DEVONthink saves it right next to the original non-OCR’d file and just appends “-1” to the file name. This is odd because it is not labeled that way in DEVONthink, and DEVONthink only “sees” the OCR’d version. To confirm that DEVONthink was actually seeing the file with the “-1” suffix, I asked DEVONthink to show me the original in the Finder, and it directed me to the new “-1” version.
BTW, I tried syncing in the hope that DEVONthink would “see” the two files and resolve it. No luck (DEVONthink no longer sees the original file even after a sync).
I then have to manually trash the original non-OCR’d version, or my working folders get all mucked up with multiple copies: half OCR’d and half not.

  2. Here’s the expected problem: it is WAY too time-intensive for me to OCR documents manually in this way. How can I automate it?

I have 200-300 non-OCR’d PDFs that are indexed with my active files outside of DEVONthink, and I want to OCR them. In my work, I’m constantly getting non-OCR’d PDFs sent to me, and I want to keep them in my external file structure outside of DEVONthink. In other words, it’s a big need now and it will be a recurring task …

SUMMARY
As I see it, the first step is to find the most efficient way to do this by hand; the second step is to figure out how to automate it with AppleScript, Automator, or some other tool.

Anyone have any suggestions?

Jeff

Here’s a script to OCR selected and indexed images/PDF documents located inside indexed groups. This script uses the new “deconsolidate” command of DEVONthink Pro Office 2.0.9.

WARNING: The error handling of this script is very limited.


-- OCR indexed pictures/PDF documents
-- Note: Supports only selected images/PDF documents located inside indexed groups

tell application id "com.devon-technologies.thinkpro2"
	set theSelection to the selection
	repeat with theRecord in theSelection
		if (indexed of theRecord) and (indexed of parent 1 of theRecord) then -- Required for deconsolidating
			try
				set theType to type of theRecord
				if ((theType is picture) or (theType is PDF document)) then -- Only images/PDF documents are supported
					set thePath to path of theRecord
					if thePath is not "" then
						set theConvertedRecord to convert image record theRecord
						if exists theConvertedRecord then
							tell application "Finder" to delete (POSIX file thePath) as alias -- Move the original file to the trash
							delete record theRecord
							deconsolidate record theConvertedRecord
						end if
					end if
				end if
			end try
		end if
	end repeat
end tell

Has this worked for anybody? All the script does for me is OCR a file; it doesn’t do anything to the indexed files in the Finder.

I’m interested in a solution here as well. I am slowly grouping and indexing files on my HD. I sure would like to know how to perform OCR on indexed files. I don’t want to OCR them and have them imported.

The answer to my question is in this post: use the script mentioned above.

So is this correct:

I have a folder in my file system (“attachments”) with PDFs, and this folder is indexed in DEVONthink. Some PDFs get moved to the folder before they are OCR’d in DEVONthink. I want DEVONthink to OCR these PDFs, and replace the non-OCR’d version of the PDF in “attachments” with the OCR’d version. Something like the following will work:

  1. Create a smart folder that searches the “attachments” group in DEVONthink for non-OCR’d PDFs (i.e. kind: PDF, word count: <1).
  2. Attach a script to the smart folder that:
    i. OCRs the PDF
    ii. deletes the original PDF
    iii. places the newly OCR’d PDF in its place

Step 2 will involve an AppleScript including the following code:

To OCR (with ‘delete original on import’ checked in OCR preferences):

on triggered(theRecord)
    try
        tell application id "com.devon-technologies.thinkpro2"
            convert image record theRecord to group id "url-of-indexed-attachments-group"
        end tell
    end try
end triggered

To move the OCR’d PDF from the group in DEVONthink to the external file system’s “attachments” folder:

on triggered(theRecord)
   try
      tell application id "com.devon-technologies.thinkpro2"
         deconsolidate record theRecord
         synchronize record theRecord
      end tell
   end try
end triggered
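
For comparison, here is a rough, untested sketch that folds both steps into one pass over the current selection instead of a triggered handler. It uses only commands that already appear in this thread (convert image record, delete record, deconsolidate record) plus the record’s word count property to skip PDFs that already have a text layer. Treating a word count of 0 as “not OCR’d” is my assumption, as is running it on the selection rather than from the smart folder. Try it on copies first.

```applescript
-- Untested sketch: OCR selected, indexed PDFs that have no text layer yet
-- Assumption: a word count of 0 means the PDF has not been OCR'd
tell application id "com.devon-technologies.thinkpro2"
	set theSelection to the selection
	repeat with theRecord in theSelection
		-- Both the record and its parent group must be indexed for deconsolidating
		if (indexed of theRecord) and (indexed of parent 1 of theRecord) then
			try
				if (type of theRecord is PDF document) and (word count of theRecord is 0) then
					set thePath to path of theRecord
					if thePath is not "" then
						set theConvertedRecord to convert image record theRecord
						if exists theConvertedRecord then
							-- Move the original external file to the Trash
							tell application "Finder" to delete (POSIX file thePath) as alias
							-- Remove the old index entry from the database
							delete record theRecord
							-- Move the searchable PDF out of the database into the indexed folder
							deconsolidate record theConvertedRecord
						end if
					end if
				end if
			end try
		end if
	end repeat
end tell
```

Like the earlier script, this swallows errors inside the try block, so a failed conversion leaves the original file untouched.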

Is that right?