I am new to DEVONthink and am attempting to fine-tune my OCR workflow.
I have elected to index some of my case work folders on my hard drive rather than to import them (primarily because I am constantly working on the active files and I like my file structure that I have set up). I have imported my older work documents that I do not actively use.
I can see, within DEVONthink, my PDFs that have not been OCR’d. I would like to OCR them in DEVONthink and have DEVONthink replace the indexed (non-searchable) PDF with the new searchable one. When I gave it a go, DEVONthink OCR’d the PDF but then filed it within the database. It “trashed” my indexed item (not the original file), so I am guessing that when I synchronize with my indexed files again I will have two copies of the same document: the OCR’d one in DEVONthink and the non-OCR’d one that has continued to reside outside of DEVONthink.
How can I utilize the OCR on documents that are indexed rather than imported?
Unfortunately, I have examined the suggestion in the above link and I have encountered two problems (1 expected and 1 not expected).
Here’s the suggestion:
“If Preferences > OCR has the option to move the original PDF to the Trash CHECKED, the original will be deleted. You may find it useful to select a PDF and press ‘Command-R’ (the Reveal command) to see it in the group where it is filed. From that location, select the PDF and choose ‘Data > Convert > to searchable PDF’.
However, as the searchable PDF is stored within the database, and isn’t currently indexed, there’s an option to move it to the external folder that had been Indexed. Select the PDF, Control-click and choose the contextual menu option to move it to the external folder.
Finally, select the group corresponding to that external folder and choose ‘File > Synchronize’. Now the searchable PDF is Indexed, and is among the items listed within the Indexed group.”
Here’s the unexpected problem: Using the above method - I am left with two files in my indexed location.
Methodology & Results:
I have “move to trash” selected in the OCR preferences, but when I convert an indexed file to a searchable PDF, my original non-OCR’d file stays in its location outside the database rather than going into the trash.
My OCR’d file is in the database, so I then choose the contextual menu option and it goes back to the indexed location.
The weird thing is that DEVONthink saves it right next to the original non-OCR’d file and just appends a “-1” to the end of the file name. This is odd because it is not labeled that way in DEVONthink, and DEVONthink only “sees” the OCR’d version. To confirm that DEVONthink was actually seeing the one with the “-1” on it, I asked DEVONthink to show me the original in the Finder, and it directed me to the new “-1” version.
BTW, I tried syncing in the hope that DEVONthink would “see” the two files and resolve it. No luck (DEVONthink no longer sees that original file, even after a sync).
I then have to manually trash the original non-OCR’d version, or my working folders get all mucked up with multiple copies, half OCR’d and half not OCR’d.
Here’s the expected problem: it is WAY too time-intensive for me to manually OCR documents in this way. How can I automate it?
I have 200-300 non-OCR’d PDFs that are indexed with my active files outside of Devonthink and I want to OCR them. In my work, I’m constantly getting non OCR’d PDFs sent to me & I want to maintain them in my external file structure outside of Devonthink. In other words, it’s a big need now and it will be a recurring task …
SUMMARY
As I see it, the first step is to find the most efficient way to do this by hand; the second step is to figure out how to automate it with AppleScript, Automator, or some other tool.
Here’s a script to OCR selected and indexed images/PDF documents located inside indexed groups. This script uses the new “deconsolidate” command of DEVONthink Pro Office 2.0.9.
WARNING: The error handling of this script is very limited.
-- OCR indexed pictures/PDF documents
-- Note: Supports only selected images/PDF documents located inside indexed groups
tell application id "com.devon-technologies.thinkpro2"
    set theSelection to the selection
    repeat with theRecord in theSelection
        if (indexed of theRecord) and (indexed of parent 1 of theRecord) then -- Required for deconsolidating
            try
                set theType to type of theRecord
                if ((theType is picture) or (theType is PDF document)) then -- Only images/PDF documents are supported
                    set thePath to path of theRecord
                    if thePath is not "" then
                        set theConvertedRecord to convert image record theRecord
                        if exists theConvertedRecord then
                            tell application "Finder" to delete ((POSIX file thePath) as alias) -- Move the original file to the trash
                            delete record theRecord
                            deconsolidate record theConvertedRecord
                        end if
                    end if
                end if
            end try
        end if
    end repeat
end tell
I’m interested in a solution here as well. I am slowly grouping and indexing files on my HD. I sure would like to know how to perform OCR on indexed files. I don’t want to OCR them and have them imported.
I have a folder in my file system (“attachments”) with PDFs, and this folder is indexed in DEVONthink. Some PDFs get moved to the folder before they are OCR’d in DEVONthink. I want DEVONthink to OCR these PDFs, and replace the non-OCR’d version of the PDF in “attachments” with the OCR’d version. Something like the following will work:
1. Create a smart folder that searches the “attachments” group in DEVONthink for non-OCR’d PDFs (i.e. kind: PDF, word count: <1).
2. Attach a script to the smart folder that:
   i. OCRs the PDF
   ii. deletes the original PDF
   iii. places the newly OCR’d PDF in its place
Step 2 will involve an AppleScript including the following code.
To OCR (with ‘delete original on import’ checked in the OCR preferences):
on triggered(theRecord)
    try
        tell application id "com.devon-technologies.thinkpro2"
            -- "url-of-indexed-attachments-group" is a placeholder for the indexed "attachments" group
            convert image record theRecord to group id "url-of-indexed-attachments-group"
        end tell
    end try
end triggered
To move the OCR’d PDF from the group in DEVONthink to the external file system’s “attachments” folder:
on triggered(theRecord)
    try
        tell application id "com.devon-technologies.thinkpro2"
            deconsolidate record theRecord
            synchronize record theRecord
        end tell
    end try
end triggered
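For what it's worth, the three sub-steps (OCR, delete the original, put the OCR'd PDF in its place) might also fit into a single handler modelled on the selection-based script earlier in this thread. This is an untested sketch, not a confirmed solution. It assumes the smart folder calls `triggered` with one record at a time, and it reuses only commands already shown in this thread (`convert image`, `delete`, `deconsolidate`). The result may depend on the "move to trash" setting in the OCR preferences.

```applescript
-- Untested sketch: OCR an indexed PDF and replace the external file with the OCR'd version.
-- Assumption: the record and its parent group are both indexed (required for deconsolidating).
on triggered(theRecord)
    try
        tell application id "com.devon-technologies.thinkpro2"
            if (indexed of theRecord) and (indexed of parent 1 of theRecord) then
                set thePath to path of theRecord
                if thePath is not "" then
                    -- i. OCR the PDF; the converted record initially lives inside the database
                    set theConvertedRecord to convert image record theRecord
                    if exists theConvertedRecord then
                        -- ii. Move the original external file to the trash and drop its record
                        tell application "Finder" to delete ((POSIX file thePath) as alias)
                        delete record theRecord
                        -- iii. Move the OCR'd PDF out to the indexed folder
                        deconsolidate record theConvertedRecord
                    end if
                end if
            end if
        end tell
    end try
end triggered
```

Trashing the original file before deconsolidating is meant to avoid the "-1" duplicate described above, since the original file name is free again by the time the OCR'd PDF is written out. As with the earlier script, the `try` block swallows every error, so it is worth testing on a copy of a folder first.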