How well does Text Lightning work?

I’m trying to import PDF files into DT using Text Lightning but it doesn’t seem to work so well. If the PDF is larger than a few hundred KB, the imported file is blank. I’ve tried dumping PDFs on TextEdit as well, and the same thing happens – the Text Lightning window pops up to show it’s doing the conversion from PDF to RTF, but the file that results is blank.

Another problem: If the PDF is a few MB in size, Text Lightning just seems to go into an interminable spin – it either never finishes converting the file, or I’m just not patient enough to let it.

Are there any secrets to making Text Lightning work better? Is anyone getting better performance than this?

Thanks!

mowabb:

I got TextLightning at the same time I got DEVONthink PE last year, and have used it to do RTF text imports from many PDF files. Some PDFs were very large – the RTF from one of them is over 2.2 MB. I’ve not experienced the difficulties you note in making these imports.

That said, I’m not using TextLightning now. For the past 2 or 3 months I’ve been using DT’s “built-in” text converter and importing plain text, which I then convert to RTF. These imports are really quick on my 500 MHz G4 PowerBook.

I want to import the text of PDFs to take full advantage of DT’s searching capabilities. I want to see the relevant text in the search window, which can’t be done yet with PDF documents themselves (that may change with OS X 10.3 and a new version of Preview). But I gave up long ago on the aesthetics of text conversions by TextLightning, or even RTF files created by Adobe Acrobat – they are not at all well formatted – paragraphs get ‘broken’, no graphics, and tables can’t be rendered as tables. So I keep the PDF files externally, linked to the DT text content window; if I find an interesting item in a search, I launch it to read or print the item in its original PDF format.

Just a thought, but have the PDFs you are trying to import been custom encoded or password protected?

David

Probably custom encoded, but not password protected. They’re scanned files (readings and syllabi) from my law school. The info window in the PDF viewer PostView shows that they were encoded w/Adobe Acrobat 4. I own Acrobat and when I open the PDFs via Acrobat I can’t select text in them, either, so I guess they’re locked up somehow. Is there any way to “unlock” the text?

What/where is this "built-in" text converter? Does it convert PDF text to plain text, or just text from other types of files?

Thanks!

For what it’s worth, I get the same result — sometimes.

Importing PDF files sometimes works fine and sometimes it just results in a blank document.

Not sure why it happens but will be more observant from now on to see if I can see a pattern.

mowabb:

You noted that the troublesome PDF files were scanned. Two possibilities come to mind:

[1] The PDF files are merely images, and contain no text. You noted that you cannot select text from within Acrobat. To further confirm, try doing a Find in Acrobat for a term that’s visible on the screen. If Find cannot “see” the term, that would be additional confirmation that the PDF is only an image file.

[2] The PDF has read-only security (no printing or saving by the user), and/or it is encrypted so that TextLightning cannot get to the text.
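A quick way to check for case [1] outside Acrobat is to run the file through the pdftotext command-line tool and see whether any text comes back. The following is only a rough sketch: it assumes pdftotext is installed and on the PATH, and the helper names and the 20-character threshold are invented for illustration.

```python
import subprocess


def looks_image_only(extracted_text, min_chars=20):
    """Guess that a PDF has no text layer if extraction yields almost nothing.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    # Count only visible characters; whitespace and form feeds
    # (which pdftotext emits between pages) do not count.
    visible = [ch for ch in extracted_text if not ch.isspace()]
    return len(visible) < min_chars


def extract_text(pdf_path):
    """Run pdftotext and return whatever it writes to stdout.

    Assumes the pdftotext binary is on the PATH; the "-" argument
    tells it to write to stdout instead of a .txt file.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout
```

If `looks_image_only(extract_text("reading.pdf"))` comes back True, the file is probably a scan with no text layer (the filename here is just a placeholder). If pdftotext exits with an error instead, the cause may be case [2] rather than case [1].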

For files that are sufficiently important to me, I’ve managed to capture their text in both cases.

For case [1], it may be possible to save the PDF as TIFF files, run them through an OCR application, and save as ‘normal’ PDF (image and text). This works pretty well if the images were scanned at 300 dpi or better. But if the PDF is low resolution, this won’t work well, if at all. For the source of files you described, case [1] seems probable to me.

Case [2] can be trickier, since the provider of the PDF has gone to some trouble to keep users from getting at the material. Earlier versions of OS X (I think through 10.1) allowed one to open and save a PDF file under the Preview application – effectively eliminating many types of file security and encryption. Later versions of Preview have killed that loophole.

If it’s possible to print an encrypted PDF document, just try to print the document into a new PDF. This new PDF file isn’t encrypted anymore and in most cases, pdftotext and TextLightning are able to handle the result.
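Before going the print-to-PDF route, it may be worth confirming that the file is actually encrypted. An encrypted PDF carries an /Encrypt entry in its trailer, so a crude check is to look for that name in the raw bytes. This is a heuristic sketch with made-up function names, not a real parser: it could give a false positive if the same bytes happen to appear elsewhere in the file.

```python
def has_encrypt_marker(pdf_bytes):
    """Return True if the raw PDF bytes contain an /Encrypt name.

    Heuristic only: this does not parse the trailer, it just
    searches the whole file for the marker.
    """
    return b"/Encrypt" in pdf_bytes


def is_probably_encrypted(pdf_path):
    """Read a PDF from disk and apply the /Encrypt heuristic."""
    with open(pdf_path, "rb") as handle:
        return has_encrypt_marker(handle.read())
```

A scanned, image-only PDF would normally fail this check while still yielding no text, which helps tell the two cases apart.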

Go to Preferences >> Images and PDF and check the “Use pdftotext” option. Until today, I always kept the “Convert to plain text” option checked, but a more careful reading of the help files just now taught me that you can get the best of both worlds by leaving that unchecked: you see the PDF as it is meant to be seen, but the “Classify” and “See also” buttons work, and if you open the file the “Words” button also works. This is a good way of avoiding the formatting problems mentioned above. To me, it’s a most impressive feature.

About TextLightning, mine hasn’t worked at all since version 1.6 of DEVONthink. Other people were able to solve this problem, but I wasn’t. I’ve been very disappointed with TextLightning, as the website seems to be unavailable most of the time. I haven’t been able to work out in what ways it’s better than the included pdftotext.

Rick