Adding (text) content to pdf images

MJH · September 6, 2003, 1:42pm

I am using DT to catalog journal articles - almost all of the articles are pdf files and many of those have been created in/by Acrobat. I particularly like being able to search those articles (and Auto-Classify, etc.) while retaining the original publication format in case I need to print the article for distribution.

Through our university, I can also access and download other articles in both pdf and HTML format.

Unfortunately, when importing these pdf files into DT, they are imported as images, since these articles were scanned in. However, the text of the article can be extracted from the HTML version of the article that is also available.
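For anyone wanting to script that extraction step, here is a minimal sketch in Python using only the standard library. The names `ArticleTextExtractor` and `html_to_text` are made up for illustration, and none of this is part of DT itself; it only shows one way to get plain text out of a downloaded HTML article.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # non-empty text runs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text run per line."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

The resulting text could then be pasted into a comment field or saved as a plain text file next to the PDF. Real journal pages will probably also carry navigation menus and footers that need trimming by hand.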

While I can extract the article text (from the HTML file) and store it in the comment field, this data does not appear to be used for “Classify” or “See Also”. In addition, although I can search comment fields, it doesn’t appear possible to search both content and comments at the same time.

Is it possible to "attach/append" text to a pdf image as content? Or is there another possible solution to this?

Mats · September 7, 2003, 5:13am

MJH,
if I understand you right, you want to be able to treat PDF files as any other file (pdf or txt) in DT, i.e. classify, search and read them in DT, while retaining the original PDF file with its images, footnotes etc. for reference and printing.

One method that I have found very useful is to create a folder on your desktop, or elsewhere, to which you apply the “save in Devonthink” script (DT preferences set to “Images&PDF/Copy files/Don’t copy”, “/Use pdftotext” and “/Convert to Plain Text”). To this folder you can simply save all the RTFs you might find on the net, receive attached to incoming mail, or create. You’ll be able to launch the original PDF file from the DT browser whenever you need it. The same goes for Word files (once you have installed AntiWordService).

(In DT 1.6 these files, although converted to plain text files, were represented by RTF and Word icons in the DT browser. Not so in 1.7. I’m told that this was a bug in 1.6. If so, it was a very useful bug.)

I hope this tip goes some way to ease your problems.

MJH · September 7, 2003, 10:14am

Thanks for the thoughts, Mats.

Unfortunately, the pdf files of the scanned articles are imported into DT as images only - DT is unable to extract much, if any, text from the scanned image.

Consequently, the "Convert to…" option results in a blank file.

cgrunenberg · September 7, 2003, 9:25pm

Some Canon scanners include OCR software that creates PDF documents containing an invisible text layer. pdftotext is able to recognize this layer, so you’re able to search, classify, etc.
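To check a given PDF for such a layer outside of DT, something like the following rough sketch might help. It assumes the `pdftotext` command-line tool is installed and on your PATH. The function names and the 50-character threshold are illustrative choices only, not anything DT does internally.

```python
import subprocess


def has_usable_text(extracted, min_chars=50):
    """True if the text holds at least min_chars non-whitespace characters.

    Image-only PDFs typically yield nothing but page breaks and blank
    lines, so whitespace (including form feeds) is not counted.
    The threshold of 50 is an arbitrary assumption.
    """
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def pdf_has_text_layer(pdf_path, min_chars=50):
    """Run pdftotext on pdf_path and report whether it found real text."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",              # tolerate unexpected byte sequences
        check=True,                    # raise if pdftotext itself fails
    )
    return has_usable_text(result.stdout, min_chars)
```

If `pdf_has_text_layer("article.pdf")` comes back `False` (the file name is just a placeholder), the PDF is most likely a bare scan and would need OCR or a companion HTML/text version to become searchable.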

MJH · September 8, 2003, 1:08am

Thanks for the heads-up, Christian.

It doesn’t look like any of the scanned journal articles have the invisible text layer you mention.

Do you have any other suggestions?

Should I just dump text extracted from the HTML versions into the comments and search twice (i.e., in content and in comments)?

cgrunenberg · September 10, 2003, 3:50am

That’s one possibility. Another one is to import only the HTML versions. Or import both HTML and PDF and store them in the same group.

MJH · September 10, 2003, 9:13am

Ok. Thanks, Christian.

Can I put in a request, then, to be able to search more than one "area" (e.g., content, comments, name, etc.) at a time?

cgrunenberg · September 10, 2003, 8:35pm

This is already on our long to-do list ;)

Bill_DeVille · September 10, 2003, 8:56pm

If the image-only PDF scans were at a high enough resolution (200 dpi or more, preferably more), it may be possible to:

[1] If you have the full version of Acrobat, open the PDF and save it as a multi-page TIFF file. It is also possible to export the file as TIFF from Preview, if you don’t have Acrobat.

[2] Perform OCR on the TIFF images and resave as a new PDF document that contains text. (OmniPage Pro X runs under OS X, FineReader 5 Pro runs under Classic. I usually get better results from FineReader 5 Pro). There are still more alternatives, but the two OCR programs mentioned above probably give the best results. Some versions of Acrobat can process images to text, but I’ve found the results to be slow and not very good.

[3] Import the new PDF file into DEVONthink with appropriate preferences, including pdftotext, for example.

This is a bit of work, but doesn’t take long and gives good results if the original document scan was at a relatively high resolution. Is the effort worth it? Depends on how important having the text in DT is to you.
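As a quick sanity check before going through the steps above, the scan resolution can be estimated from the pixel dimensions of a page image and the physical page size. A small sketch in Python follows; the function names are made up for illustration, and the 200 dpi default simply reflects the rule of thumb mentioned above.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Estimate scan resolution as the lower of the horizontal and
    vertical pixels-per-inch values."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def good_enough_for_ocr(pixel_width, pixel_height,
                        page_width_in, page_height_in, min_dpi=200):
    """True if the estimated resolution meets the min_dpi rule of thumb."""
    dpi = effective_dpi(pixel_width, pixel_height,
                        page_width_in, page_height_in)
    return dpi >= min_dpi
```

For example, a US Letter page (8.5 x 11 inches) scanned to 1700 x 2200 pixels works out to 200 dpi, right at the threshold, while 1275 x 1650 pixels is only 150 dpi and would probably give poor OCR results.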

I’ve even gotten good results from digital camera photos of documents taken in a library, using a Ricoh Caplio RR1 digital camera – but everything has to be just right, or considerable post-processing may be required to produce images that will work with OCR.