non-searchable pdfs

I’m not a current user, but I’ve been thinking about using DevonThink to organize my pdfs, which are mostly papers from academic journals. But many or most of them do not have searchable text, and cannot be converted to rtf as far as I’m aware. Because of that, it seems that DevonThink would not be all that useful for me.

Since I’m sure other people are using it for academic research, I assume you’re using it for other purposes?

Welcome to the forum! I found DT very useful and I’m sure you will too. Actually, my major use is academic research. I’ve just written a review paper for an academic journal that had approx 270 references; all in a DT database, and none in hardcopy form. I’m constantly importing pdfs from journals and haven’t had a problem working with any of them.
DevonThink allows 150 hours of ‘free trial’ in demo mode; perhaps you should try importing some papers from the journals you’re interested in and running a search. I think you’ll be impressed.

I’ve tried to import some pdfs. It seems that most (or at least many) papers from JSTOR are not searchable in DT, which is what I expected as they are not searchable in Preview either. They are imported into DT as images.

I don’t use Jstor (in fact this is the first time I’ve heard of it), perhaps other users have more expertise. I can only offer some general advice. Most scientific journals offer pdfs in searchable format. I’ve found some that were obviously scanned from the print version (and searchable).
I would complain to JSTOR and your librarian about JSTOR’s archiving method.
In the meantime, use JSTOR for searching, and try to find out whether the original journal article is available on the publisher’s website. Barring that, perhaps an OCR program will generate searchable text from these ‘image pdfs’. I think Devon has indicated that OCR will be incorporated in some versions of DT.

I have the same problem with a few older PDFs (all academic). They must have been generated using a scanner without OCR software.

There is no way to make the text of these documents searchable without running them through an OCR program. I do not know if there are any free OCR programs for OS X, but I was able to get pretty far with Linux/OSS projects such as Gocr (jocr.sourceforge.net/). Perhaps one of these is available via fink or ports.

Hint: We’re currently working on a solution to this very problem. Remember our engagement with Fujitsu. Stay tuned!

Eric.

I too am an academic using many PDFs from JSTOR, and am unable to make proper use of them in DT, so I’m very pleased to hear you are working on this! Any tips in the meantime??

My wife (a librarian) said that she doesn’t think that JSTOR can afford to rescan the 20 million+ pages of text that they have, seeing as how they’re a non-profit organization, but she’ll cheerfully address and mail your contributions to reimburse them for this activity :wink:

AFAIK, JSTOR is primarily for journals in the humanities and social sciences, though I think they offer a few scientific journals as well.

After my last post, I checked the JSTOR website. In fact, they have done OCR readable scans of everything, but for reasons that are unclear to me, they only make them available as image files. However, it is possible, according to the website, to download an article as a TIFF file, and then read through one or another OCR program. Is there somebody out there with expertise in this area who could give some guidance about this?

Thanks!

Why would JSTOR rescan? They have the original scans; they would have to run these through an OCR program. With the proper program I think they could do this in batches, during their computers’ downtime (i.e. evenings and weekends), with little or no extra cost incurred.

I don’t know about JSTOR specifically, but some journals were scanned to PDF at a resolution that’s too low to allow accurate optical character recognition (OCR).

These days, the PDFs available from many journals are often not the result of scanning. Journals are printed from digital files and the PDFs are often simply versions of the files sent to the printer. These files are computer readable directly.

This is exactly the point: one can do full-text searches in JSTOR but the downloaded PDF is not text-readable. So it should be possible for JSTOR to make the articles available as readable PDFs without much trouble, I guess. Time to start emailing the folks at JSTOR perhaps…

Gerben

I have run into the same problem with JSTOR files. I did an experiment, following these steps to bring the documents to a level suitable for OCR.

Open page in Photoshop -> apply SI2 plugin (inexpensive image processing see below for reference) -> bring resolution up to something acceptable like 300 dpi -> save -> Run OCR.
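For anyone who wants to script the steps above, here is a rough sketch in Python of the two pieces that are easy to automate: working out how large the page has to be in pixels to reach 300 dpi, and putting together an OCR command. Both function names are made up for illustration. The OCR command assumes the open-source Tesseract program is installed, which nobody in this thread has mentioned using, so treat that part as one possible choice and not a tested recipe. The resampling itself would still have to be done in Photoshop or some other image tool.

```python
import math


def upscaled_size(width_px, height_px, source_dpi, target_dpi=300):
    """Return the pixel size a page needs so that it reaches target_dpi.

    Hypothetical helper: the physical page size stays the same, so the
    pixel dimensions grow by target_dpi / source_dpi.
    """
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    scale = target_dpi / source_dpi
    # A scan already at or above the target resolution is left alone.
    if scale <= 1:
        return width_px, height_px
    return math.ceil(width_px * scale), math.ceil(height_px * scale)


def ocr_command(tiff_path, output_base, language="eng"):
    """Build (but do not run) a Tesseract command line.

    Assumption: Tesseract is the OCR program. With the trailing "pdf"
    argument it writes a searchable PDF named <output_base>.pdf.
    """
    return ["tesseract", tiff_path, output_base, "-l", language, "pdf"]


if __name__ == "__main__":
    # A letter-size page scanned at 100 dpi is 850 x 1100 pixels.
    print(upscaled_size(850, 1100, source_dpi=100))
    print(" ".join(ocr_command("page1.tif", "page1")))
```

The command is returned as a list so that it can be handed to `subprocess.run` without any shell quoting problems once the resampled TIFF exists.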

However, this is a clunky process. Maybe if I were more adept at programming I could write a script for it, but I have had more pressing research issues which have prevented me from diving into it. Though if I could pull it off that would be a slick shareware idea! :slight_smile:

Hopefully the team at DevonTechnologies is working on a more refined process.

R. Joe

It’s time consuming, but I’ve been downloading my JSTOR articles, and then sending them through Acrobat 7 for processing into a searchable pdf. Works fine, though it does take time. Other than funky diacritics for transliteration purposes (which, alas, my articles tend to be full of), the OCR’ing works fine, and DT is then able to search them just fine and dandy.

If anybody gets a chance, maybe they can post how much of a speed increase there is in using these functions under Acrobat 8 (now that it is universal).

Also, I’d imagine the upcoming version of DT Pro Office would be useful in these regards? :wink:

OCR in the upcoming version of DT Pro is faster and more accurate than OCR in Acrobat, and transfers the OCR’d PDFs directly to the database.

That’s great–thanks!!!

I have recently started experimenting with processing JSTOR .tiff files using OCR software (WorkingPapers X).

It is very easy (at least with documents with only 1 column), very fast and well worth it.

All you need to do is save the document as .tiff
Import it into WorkingPapers
Read it to rtf format

Then it is just a matter of importing the file into DTPro.

Pascal Venier
pascalvenier.com

I’m currently importing dozens of pdfs into DT using the import…images (with ocr) function. I love the results, but the conversion is anything but fast.

Mike

Well, the conversion is done using the ReadIRIS engine, which is the best (and only) engine available for both PowerPC and Intel processor based Macs.