DTPO "convert to searchable PDF" problem

Ryuji · March 18, 2008, 5:15pm

When I use this feature to convert an “image only” PDF to PDF+Text format, with the new “trash original” switch in preferences turned on, I get a bunch of files named ???tmp???.tiff or something like that, in PDF+Text format, and the original files are not deleted. This is very annoying and it should be fixed, if this hasn’t been reported elsewhere yet.

Thanks,

Ryuji

Ryuji · March 18, 2008, 5:29pm

This problem occurs regardless of the status of the “move originals to trash” switch.

annard · March 19, 2008, 9:37am

It would help if you send one of those files to support@devon-technologies.com so we can investigate.

Ryuji · March 20, 2008, 3:09am

It turned out that the behavior was changed in version 1.5.1. Instead of saving the OCR’ed file under the same filename as the original PDF file, this version of DTPO saves it under a filename that was used for one of the image files from which the original PDF was made. As a result, different PDF files end up with different, unrelated names after being OCR’ed in DTPO 1.5.1. This behavior is annoying. Please change it back to the old behavior ASAP and issue a patch. (I can’t use this version for my work.)

annard · March 20, 2008, 10:59am

The behaviour was changed at users’ request, which unfortunately proves that you can’t please everybody. The current behaviour helps a lot of people greatly, for instance when getting PDF scans from scientific magazines where the attributes are set correctly.

In your case, your workflow is to get a scan and import it into DT as is; then you change the name and sometime afterwards you convert it with OCR? If I’m all wrong, please explain your workflow so I can understand what you’re trying to achieve. Don’t forget that it is possible to get any behaviour you want by scripting the conversion process (and I could help to create a workaround). But before I make any more changes I want to know the best way to optimise it for everybody (if at all possible).

Stefan · March 20, 2008, 11:34am

Hello,

From the Desktop I import a PDF with the name “Street1999_”. In DTPO I convert this file with OCR and get a second file with the name “untitled”.
Sometimes the name is OK, but most of the time the name is “untitled”.

If I use import image with OCR I get the same result.

In my preferences I have checked “Move to Trash”, but this does not happen.
The selected original files are still there and I must delete them manually.

Stefan

annard · March 20, 2008, 2:29pm

How was this file created? I’d like to understand your workflow as well from the source (paper/colleague/internet) to the destination.

That is to be expected, but if you enable “Set Attributes” you can properly set the attributes of the document, including the name. Note that I don’t expect you to change the way you work with your files, but I’d like to know why you go about it this way.

When you convert files, they will never be sent to the trash. That happens only when running OCR on import from a file or scanner application.

Ryuji · March 20, 2008, 6:08pm

I disagree with your general statement that PDF scans from scientific magazines have correct attributes. I also argue that it is bad software design to base the program’s behavior on attributes of the input file that are not very visible to the end user, without even offering a switch to turn this off.

First of all, as a matter of principle, if I have to follow your workflow to generate PDF, Devonthink should advertise that it only works as expected if the PDF files are created within Devonthink.

Also, I’m only asking you to fix the unpredictable and problematic behavior of DTPO, and I’m not asking for your opinion on how people should generate PDF files or how people should change their workflow.

In reality, most of the PDF files I use are given to me by others, downloaded from scientific journal publishers’ websites, etc., and I have no control over how those files were generated. DTPO used to behave just fine and completely predictably. I do not understand why you tell me to change my workflow when I’m inconvenienced by the sudden change in the program’s behavior.

annard · March 21, 2008, 10:41am

We’re trying to find out the best way to correct what may have turned out to be a mistake. We apologise for the inconvenience it has caused you. We wouldn’t dream of asking you to change your workflow; we only wanted to find out where you get your documents from, and you have told us. Thank you for that.

Given your feedback and that of others, we will revert to the old behaviour in the next maintenance release, since we’d need another AI to find the optimum behaviour here. And to answer your question about when it will be released: we also have some other upcoming and necessary fixes from IRIS that we’d like to incorporate.

In the meantime, here is an AppleScript that you could use to get the old behaviour back. You can save it in “your home folder > Application Support > DEVONthink Pro > Images” where it will show up in our application’s script menu under “Images”. We hope this helps a bit.


-- The Right Way To Do Convert to Searchable PDF
-- Created by Annard Brouwer on Good Friday 2008.

tell application "DEVONthink Pro"
	activate
	try
		set thisSelection to the selection
		if thisSelection is {} then error "Please select some scans."
		repeat with aRecord in thisSelection
			with timeout of 3600 seconds
				try
					set newRecord to convert image record aRecord
					set name of newRecord to name of aRecord
				on error errMsg number errNo
					if errNo is equal to 9104 then
						log message errMsg info (name of aRecord) & ": Check “~/Library/Caches/DEVONthink Pro/OCR” for files that weren't imported"
					else
						log message errMsg info (name of aRecord)
					end if
				end try
			end timeout
		end repeat
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

Stefan · March 21, 2008, 10:47am

Hello,

I get my PDFs from the Internet. All of them are scientific papers and I have no chance to change the attributes. In my workflow I load the paper in Safari and then print it to PDF with author and year as the filename. I know that I can send it to DTPO with a script, but sometimes this does not work.
In my database the name field contains the author+year and the comment field contains the full title. When I convert such an entry, I would expect DTPO to convert the selected file and keep the same name, comment, label and replicant position, with a new creation date. In other words, DTPO should clone the file and then convert it with OCR.
Or, if this is too much work to implement: the behavior of DTPO 1.5 was much better on this point than the behavior of 1.5.1.

Stefan

By the way there are often papers which you can read but if you copy some text there are always unreadable characters.

annard · March 21, 2008, 3:11pm

Unfortunately OCR is never perfect and so there can always be characters that are not properly recognised. This is why we chose to represent the result using a PDF with the original scan image and the recognised text invisibly in front of it.

Ryuji · March 21, 2008, 3:25pm

Annard,

Your script causes an error:
DEVONthink Pro got an error: AppleEvent timed out.

and it still saves in the filename determined from one of the attributes.

I’m looking forward to seeing 1.5.2 coming up very soon.

annard · March 21, 2008, 3:58pm

Try this instead:


-- My Convert to Searchable PDF
-- Created by Annard Brouwer on Good Friday 2008.

tell application "DEVONthink Pro"
	activate
	try
		set thisSelection to the selection
		if thisSelection is {} then error "Please select some scans."
		repeat with aRecord in thisSelection
			-- Inner try: a failure on one record is logged and the loop continues
			try
				-- OCR can be slow, so allow up to an hour per record
				with timeout of 3600 seconds
					set newRecord to convert image record aRecord
					-- Give the converted record the name of the original
					set name of newRecord to name of aRecord
				end timeout
			on error errMsg number errNo
				if errNo is equal to 9104 then
					log message errMsg info (name of aRecord) & ": Check “~/Library/Caches/DEVONthink Pro/OCR” for files that weren't imported"
				else
					log message errMsg info (name of aRecord)
				end if
			end try
		end repeat
	on error error_message number error_number
		-- -128 means the user cancelled, so no alert is needed
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

If you export to a file (File > Export > Files and Folders…) it will have the proper name as displayed in the user interface. The real bug here is that the current incarnation of DT doesn’t use that name when dragging the file out to the Finder.

johnrover · March 30, 2008, 5:53pm

Really? This is incredibly misleading. So the “Set attributes” does apply to conversions of material that has already been imported, but the “move original to trash” does not? That would be very poor UI Design.

DevonTHINK folks– I scan and import a stack of paper about once every 2 weeks. I replicate, I duplicate, I organize everything into the proper locations… then, I want to let it run a batch OCR on these hundreds of new documents overnight, automatically erasing the originals. I’m scared of AppleScript. Can you help me out?

cla · March 30, 2008, 6:33pm

I second that. I was happy to hear that there is finally a switch for “move original to trash” built in. But after I tried it out I was very disappointed.

The import function has the same poor UI design. Sometimes my newly imported documents go into the Home folder of the database, sometimes they go into the “Incoming” folder that I set up in the options. I always have to check 2 different places to be sure about my documents.

My only hope is that DTP 2.0 will come soon and will blow all our minds away.
If not, I really have to look for another solution. The only reason why I still use DTP is the built-in OCR.

johnrover · July 19, 2008, 6:18pm

Hello Devonfolks-

Any word on this?

Bill_DeVille · July 20, 2008, 2:37pm

This thread may be confusing: some behaviors have changed during updates since the thread started, and three issues are discussed: 1) behavior of scanning/OCR to DTPO; 2) conversion of image-only PDFs already stored in the database; and 3) differences between exporting a document to the Finder and dragging a document to the Finder.

I’m running a beta of the next maintenance release of DTPO, which includes some revisions to the IRIS OCR engine, an improved ExactScan Capture scan mode (especially for Epson scanners) and also includes an Image Capture scan mode. But the discussion below applies generally to the current posted release of DTPO.

Behavior of scanning/OCR to DTPO

I use two scanners: a Fujitsu ScanSnap and a CanoScan LIDE 500F. The ScanSnap runs under ScanSnap Manager. I activate scans/OCR from the Canon scanner using File > Import > Document (from ExactScan).

My DTPO OCR preference settings: I have the option to set attributes turned off, as I frequently scan a series of documents on the ScanSnap and don’t want to have the OCR queue stop waiting for me to enter document attributes. I usually change the name of the new content after storage in the database; sometimes I don’t bother, knowing that I can find by content. I check the option to delete the original PDF.

I’m building a gazebo. I want to scan the contract and my check for materials purchase to the “Cabin Improvements” group in a database. So I open the Cabin Improvements group as the frontmost view and click in it. Whether I use the ScanSnap or the CanoScan (with ExactScan Capture), the new searchable PDF will go to my Cabin Improvements group. But if I haven’t selected a view/group, the new content will go to the top level of my database.

Conversion of image-only PDF stored in the database

Although most PDFs downloaded from the Web are already searchable and don’t need OCR conversion, a few sources still distribute image-only PDFs. In your database the Info panel of such an image-only PDF shows its Kind as PDF. A searchable PDF has PDF+Text as its Kind.

Select an image-only PDF and choose Data > Convert > to Searchable PDF. The resulting searchable PDF will replace the image-only PDF, with a new modification date. The image-only PDF will be sent to the Trash.

Export versus Drag PDF to the Finder

If you Export (File > Export > Files & Folders) a document from your database to the Finder, the resulting Finder file will have the Name you assigned in the database and the content of the Comment field will be saved to Spotlight Comments. Still other metadata may be stored in the file’s accompanying DEVONtech_storage file.

If you Drag & drop a document to the Finder from your database the resulting Finder file will have the filename displayed in the Path of the database document, which is NOT changed when the Name is modified. The metadata in the document’s Comment field is NOT transferred to the Spotlight Comments of the resulting Finder file.

So pay attention to the mode used to send files to the Finder, as the results differ by mode.

Is that bad UI? Not necessarily, if one thinks about the consequences of document renaming in the database. In DEVONthink 1.x, the Name of a document is metadata and doesn’t affect the filename of a PDF that’s stored in the Finder inside the database package file or (if Index-captured) resides in the Finder, outside the database.

What would be the consequence were the filename to change each time I change the document Name? Simply, any links to that PDF from other applications would be broken (as would be the Path links from Duplicates of that PDF in my DT database). Suppose, for example, you have a database managed by a bibliographic citation application consisting of a large PDF reference collection. You have also Indexed those PDF files into your DT Pro database. You are free to change document Names in your DEVONthink database without breaking the links to those PDFs in the bibliographic citation application’s database. I often find that freedom very useful.

Note that in DEVONthink 2, there will be substantial revisions to the database structure, with all documents actually stored in the Finder. Changing the Name will also change the filename. And links within the database can be to external files. So things will change.

johnrover · August 29, 2008, 3:15pm

Thanks Bill…

btw- I would love it if DevonTHINK had a “find image only PDFs and OCR them” function. I have literally thousands of non-OCRed PDFs in my database.

I wish I could just tell DT to “find them and OCR them” and let it run overnight.

Anytime I try to select them all and let it run overnight, it takes forever just to get going, then crashes before getting through the first hundred. I seem to need to keep it to about 100 at a time.

annard · August 29, 2008, 3:46pm

It’s better to script something like that. Please check the section “Various Scripts” in our DEVONacademy scripts.
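
For readers who want a starting point, the following is a minimal sketch of such a batch script, not the DEVONacademy script itself. It reuses only the commands from the conversion script posted earlier in this thread (convert image, log message) and stops after a fixed number of conversions, since large selections were reported to crash. Two things are assumptions: that records expose a "kind" property whose value is "PDF" for image-only PDFs (as the Info panel described above suggests; it may be localized), and that a batch size of 100 is a sensible limit.

```applescript
-- Sketch: OCR at most batchSize image-only PDFs from the current selection.
-- ASSUMPTION: "kind of aRecord" exists and reads "PDF" for image-only PDFs
-- and "PDF+Text" for searchable ones, matching the Info panel.
-- ASSUMPTION: 100 per run is a workable limit; adjust to taste.
set batchSize to 100

tell application "DEVONthink Pro"
	set thisSelection to the selection
	if thisSelection is {} then
		display alert "DEVONthink Pro" message "Please select some PDFs." as warning
		return
	end if
	set convertedCount to 0
	repeat with aRecord in thisSelection
		-- Stop once the batch limit is reached; run the script again for the rest
		if convertedCount is greater than or equal to batchSize then exit repeat
		try
			if (kind of aRecord) is "PDF" then
				-- Remember the name first, in case the original is trashed on conversion
				set oldName to name of aRecord
				with timeout of 3600 seconds
					set newRecord to convert image record aRecord
				end timeout
				set name of newRecord to oldName
				set convertedCount to convertedCount + 1
			end if
		on error errMsg
			-- Log the failure and carry on with the next record
			log message errMsg info (name of aRecord)
		end try
	end repeat
end tell
```

Select the documents (or everything in a group) and run the script; records that are already PDF+Text are skipped, so running it repeatedly works through a large collection one batch at a time.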