Problem with "duplicate" pdf files after OCR

elwood151 · July 13, 2010, 4:11pm

I’m a little puzzled and need some help, or an explanation of how to deal with this situation:

I’m using DevonThink Pro Office 2.0.3.
I’ve indexed a big folder containing hundreds of pdf files.
Some of them were not searchable (image only), so I imported them with OCR.

What happened:
In my DevonThink database, the original pdf file was automatically moved to the trash, and a new pdf file with the same name as the original was physically created somewhere inside the .dtbase2 package.

BUT: the original pdf file remains in the original (indexed) folder and that’s my problem.
I now have duplicate files, with the older version sitting in my “official” directory in the Finder.

If I now open one of these newly imported files (from within the DT database) and make annotations with Skim, the .skim file is stored next to the corresponding pdf file, e.g. in mydatabase.dtBase2/Files.noindex/pdf/9/.

And therefore it is not automatically added to my Database and would NOT be found in any database search.

…

From my point of view, I would have to do the following:
Export all pdf files from the database and merge them (how?!) with the files in the “original” pdf directory.
Manually search for all .skim files under the Files.noindex/pdf path and move them to the “original” pdf directory as well.

Then: remove the indexed directory from my database and add it again (again only indexed).
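
The manual search for .skim files described above could be sketched roughly as follows. This is only an illustration in plain Python working on the file system, not anything DEVONthink provides: the function name `collect_skim_files` and both paths at the bottom are hypothetical, and it assumes Skim notes are ordinary files ending in `.skim`. It copies rather than moves, never overwrites, and by default only reports what it would do.

```python
import shutil
from pathlib import Path


def collect_skim_files(files_dir, target_dir, dry_run=True):
    """Find *.skim files below files_dir and copy them into target_dir.

    Returns a list of (source, destination) pairs. With dry_run=True
    nothing is copied. Existing destinations are never overwritten.
    """
    files_dir = Path(files_dir).expanduser()
    target_dir = Path(target_dir).expanduser()
    planned = []
    claimed = set()  # destinations already assigned in this run
    for source in sorted(files_dir.rglob("*.skim")):
        destination = target_dir / source.name
        if destination.exists() or destination in claimed:
            # Skip name clashes instead of overwriting existing notes.
            continue
        claimed.add(destination)
        planned.append((source, destination))
        if not dry_run:
            shutil.copy2(source, destination)
    return planned


if __name__ == "__main__":
    # Hypothetical paths; adjust both before running.
    pairs = collect_skim_files(
        "~/Databases/mydatabase.dtBase2/Files.noindex/pdf",
        "~/Documents/PDFs",
    )
    for source, destination in pairs:
        print(source, "->", destination)
```

Running it as is only prints the planned copies; passing `dry_run=False` performs them. The files inside the package are left in place, so nothing in the database package is modified.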

Is that the right way?
Do I risk losing any information when I remove the indexed folder from my database and index it again?

P.S. If I just manually moved the “new” pdf files and .skim files from the Files.noindex/pdf directory to the other one, I assume the database would look for them and run into problems?!

Sorry for the long text, I hope I could make my problem clear…

Kind regards

Martin

cgrunenberg · July 20, 2010, 8:01am

The original, indexed files should be removed from the folder after emptying the trash and clicking on the option to delete “Files” or “Files & Folders” too.

Keeping .skim files next to the PDFs inside the database package is not recommended, as the path/filename of the PDF might change (e.g. after renaming or modifying).

Saving the annotated files as a .pdfd should be more reliable. But don’t save them directly inside the database package, save them e.g. in the global inbox (should be available in the Finder’s sidebar and therefore in the “Save” panel).

elwood151 · July 20, 2010, 8:47am

Thanks, Christian!
Your answer helped me to correctly formulate my “real” question:

Is there a way to convert indexed documents into “PDF+Text” and replace the original document (and not create an imported copy)?

If not, when will you implement it?

Kind regards

Martin

cgrunenberg · July 20, 2010, 9:33am

That’s not (yet) possible, but a script should be able to perform the conversion and replace the original file with the converted one. Here’s a simple example (using the desktop for the conversion) with little error handling, so be careful:


-- OCR indexed pictures/PDF documents

tell application id "com.devon-technologies.thinkpro2"
	set theSelection to the selection
	repeat with theRecord in theSelection
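		-- Only process indexed pictures and PDF documents
		if (indexed of theRecord) and ((type of theRecord is picture) or (type of theRecord is PDF document)) then
			try
				set thePath to path of theRecord
				if thePath is not "" then
					-- OCR the file; the converted copy is imported into the incoming group
					set theConvertedRecord to ocr file thePath to incoming group
					if exists theConvertedRecord then
						-- Export the converted copy to the desktop, then remove it from the database
						set theNewPath to export record theConvertedRecord to "~/Desktop"
						delete record theConvertedRecord
						if exists theNewPath then
							-- Index the exported file in the original record's first parent group
							set theIndexedRecord to indicate theNewPath to parent 1 of theRecord
							if exists theIndexedRecord then
								set name of theIndexedRecord to name of theRecord
								-- Drop the old record and delete the original file on disk
								delete record theRecord
								do shell script "rm " & quoted form of thePath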
							end if
						end if
					end if
				end if
			end try
		end if
	end repeat
end tell

elwood151 · July 21, 2010, 12:54am

Thanks, Christian - I’ll try it out.

cgrunenberg · April 22, 2011, 10:07am

An improved version of the script for DEVONthink Pro Office 2.0.9 is available here: viewtopic.php?f=2&t=13039&p=61304#p61304