Question related to PDF importing

I don’t understand the difference between importing a PDF doc through the print/send to DTP script and importing it via Import > Images (with OCR).

Can somebody help?

Thank you.

In DT Pro Office one can import an image-only PDF and run OCR on it so that it becomes searchable for text. That’s also what happens when information is gathered by scanning a sheet of paper, producing an image of the paper, running OCR on the image and saving the searchable PDF to a database.

The OS X “print as PDF” feature allows one to capture as PDF any printable document from any application. If the document being printed contains searchable text, the resulting PDF will contain searchable text.

If one prints an image-only file as PDF, the new PDF will still not contain searchable text. OCR would be required to recognize and convert the image of text.

Thank you Bill. Is ‘Print as PDF’ equal to ‘Save to DevonThink Pro.scpt’?

Thanks.

Yes. In the Print panel there are several options displayed when one clicks the PDF button.

The Save to DEVONthink Pro option saves a PDF to the database group designated in Preferences > Import. This is an Import capture, and the actual PDF file is stored inside the Files folder in the database package file.

OK, that means that to be 100% sure that all my PDFs are searchable I have to make them go through the OCR process, right?

Is there a script for this?

Sorry to be a pain…

Thank you again.

No, it doesn’t mean that.

Select a PDF and then open its Info panel. Look at the Kind information. If it says “PDF” the item is only an image and doesn’t contain searchable text. If it says “PDF+Text” it has searchable text.

Want to find all of the PDF image-only files in your database so that you can OCR them using the conversion routine in DT Pro Office? That turns out to be easy.

Select Tools > History. If your History view doesn’t have a column for Kind (file type) you can add it by clicking on View > Columns > Kind.

Click on the header in the Kind column to sort the list by Kind. Now scroll to see the list of PDF files in your database. If you see one that you wish to OCR, select it and press Command-R to reveal the file in its group location. Control-click it, then choose the option to OCR it to a searchable PDF. A new copy will be produced that is the PDF+Text Kind. Now you can delete the image-only PDF from your database.

You will likely find that you don’t have many image-only PDFs, as most sources provide searchable PDFs. There may be no point in running some image-only PDFs through OCR, e.g. a scan of handwritten notes. (The OCR module isn’t smart enough to be able to read your handwriting.)
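For anyone who prefers to think of this triage as a rule, here is a minimal Python sketch of the Kind check described above. It does not talk to DEVONthink at all; the item list and its (name, kind) layout are invented for illustration, and the only thing taken from the thread is that a Kind of “PDF” means image-only while “PDF+Text” means searchable.

```python
# Hypothetical listing of database items. The names and the
# (name, kind) layout are invented for illustration; this is
# not a DEVONthink export format or API.
items = [
    ("Bank statement March", "PDF"),
    ("Journal article", "PDF+Text"),
    ("Handwritten notes", "PDF"),
    ("Meeting agenda", "RTF"),
]


def needs_ocr(kind):
    """Return True when the Kind is exactly "PDF" (image-only)."""
    return kind == "PDF"


def ocr_candidates(items):
    """Return the names of items whose Kind marks them as image-only PDFs."""
    return [name for name, kind in items if needs_ocr(kind)]


candidates = ocr_candidates(items)
print(candidates)
```

Note that the rule only flags candidates. Whether a flagged item is worth converting (a handwritten page usually is not) is still a judgment call.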

I’m so glad that I browse these forums from time to time. This was an invaluable tip for me. As you suggest is common, there weren’t too many PDF type files in my database. And many of the ones that were there weren’t really good candidates for OCR. But there were 7 or 8 really important documents of this type that thanks to this feature are now searchable in my database.

One thing I would add is that OCR is pretty processor-intensive. Users shouldn’t worry if nothing seems to be happening immediately.

Thanks for this invaluable insight, Bill. I followed your instructions and found that I have a large number of “PDF” files, non-searchable.

I thought I’d set up a pretty neat system to cope with my prolonged absences from home. Most of my mail goes to my London address, so I set up the scanner to direct resulting files to my iDisk account. I get someone to scan all the correspondence and the files then turn up on my iDisk when I’m in Greece and I can act on them as necessary and then file them away in the database. This is particularly valuable for bank statements and other financial material which needs to be handled promptly.

All files appear to be PDF images, which I suppose is pretty obvious, and therefore not searchable. Is there any way I can set up my system to create PDF+text files? I’m using a Fujitsu ScanSnap. Sorry if this is a dumb question, but I am just learning.

DT Pro Office includes an OCR module that can (1) interface with a scanner so that a searchable PDF is saved to the database; and/or (2) run OCR on image-only PDF files, whether by converting and importing external files or by converting image-only PDFs within the database.

Converting image-only PDF to searchable PDF:

CASE 1: Convert PDFs already in the database:

In DT Pro Office select one or more PDF documents to be converted. Choose Data > Convert > to Searchable PDF.

This will produce a copy of the original PDF that is searchable for text content. It’s up to the user to choose whether to delete the original image-only file or keep both copies in the database (perhaps one might keep the original for high-resolution printing).

CASE 2: Import and convert PDFs from the Finder:

In DT Pro Office choose Import > Images (with OCR), then select one or more PDF files to be imported and converted.

There’s useful information about OCR in the online Help files.

hi,
i tried doing the ocr conversion and it creates a pdf file, but i get a message that it cannot be imported. when i inspect it in the finder, the file cannot be opened. suggestions?

also, the pdf’s generated when i scan into DevonThink using import from scanner are VERY large. is there any way to size them down and/or control the resolution of the import?

thanks

John, it’s possible that the PDF that failed conversion was encrypted in such a way that it could not be accessed or opened without a password.

Or it’s possible that this is a type of PDF that can’t be recognized by WebKit.

Is this a unique situation, or did you encounter that problem with a variety of PDFs from various sources?

File size: When OCR is run the image is re-rasterized. It’s very likely to result in a larger file than the original. DT Pro Office Preferences > OCR provides some options to help balance the file size and print quality. You can control the resolution in dots per inch, and also the quality of images. The higher the resolution and the higher the image quality, the larger the file size.
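To put rough numbers on the resolution point, here is a small Python sketch of the arithmetic. It computes the uncompressed size of a page image only, assuming a US Letter page (8.5 x 11 inches) in 8-bit grayscale. Those figures are assumptions chosen for illustration, and a real PDF will be much smaller after image compression, but the scaling is the same: doubling the dpi quadruples the pixel count.

```python
def raster_bytes(width_in, height_in, dpi, bytes_per_pixel=1):
    """Uncompressed size in bytes of a page image at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed page: US Letter, 8-bit grayscale (1 byte per pixel).
for dpi in (150, 300, 600):
    size = raster_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB uncompressed")
```

For color, pass bytes_per_pixel=3, which triples each figure before compression.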

thanks for the fast reply. the pdf is not protected, i imported it through DT and it is 4MB (a scan of a magazine article). if i make the scan with scangear, the file is only about 200k. it seems there is no way to set the resolution when scanning through DT?

i tried several different pdf’s and got the same error.

i set the dpi and image compression in the ocr prefs, but of course i don’t know how well it worked since (1) DT won’t import the file it created and (2) i can’t open it.

You can control the resolution at which the scanner operates. Scans should be made at 300 dpi (perhaps at 600 dpi for material containing color). DT Pro Office Preferences > OCR allows you to control resolution and image quality following OCR.

The PDFs can’t be imported or opened?

By any chance do you have ShapeShifter on your computer? If so, immediately remove it and restart. ShapeShifter is causing errors in your operating system.

no ShapeShifter. the resulting PDF after OCR cannot be opened or imported. the OCR seems to go smoothly as there are no errors up until then.

Please attach a PDF before and after OCR to a message to Support. Include both in a folder and compress it, then attach the compressed folder to the email message.
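The Finder can compress the folder for you, but if you would rather script it, here is a minimal Python sketch using only the standard library. The folder name and the two placeholder files are made up for the demo; in real use you would point compress_folder at the folder holding your before and after PDFs.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def compress_folder(folder):
    """Create <folder>.zip next to the folder and return the archive path."""
    folder = Path(folder)
    archive = shutil.make_archive(
        base_name=str(folder),        # archive path without the .zip suffix
        format="zip",
        root_dir=str(folder.parent),  # paths in the archive are relative to this
        base_dir=folder.name,         # keep the folder itself inside the archive
    )
    return Path(archive)


# Demo with placeholder files standing in for the real PDFs.
work = Path(tempfile.mkdtemp())
sample = work / "ocr-problem"  # hypothetical folder name
sample.mkdir()
(sample / "before.pdf").write_bytes(b"placeholder")
(sample / "after.pdf").write_bytes(b"placeholder")

archive_path = compress_folder(sample)

# List the PDF entries to confirm both files made it into the archive.
with zipfile.ZipFile(archive_path) as zf:
    names = sorted(n for n in zf.namelist() if n.endswith(".pdf"))
print(archive_path.name, names)
```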