TextLightning PDF-to-RTF-conversion doesn't work

When I choose the TextLightning PDF import and check the “convert to RTF” option, the PDFs I import are converted to TXT instead (I’m using DT v1.9.1). I’ve downloaded TextLightning (versiontracker.com/dyn/moreinfo/macosx/13405) and tried it again with the same PDF file, and it works.

Does anybody have an idea why TextLightning/convert-to-RTF doesn’t work inside DT?

Thanks, Stefan

Stefan:

I just tried it, and TextLightning PDF to rich text works for me. If you’ve just installed TextLightning and DT was already running, try closing your open applications and logging out/logging in (or restarting).

Note: Another option for importing PDFs is File > Index. This results in a smaller file size, and displays the PDF as text + image. Comments by the DT developers indicate that they are beginning to recommend index import for large, static files.

Currently, there are a couple of downsides to index import of PDF files. First, the search terms are not highlighted when viewing a search hit. Second, the only really satisfactory way to view the imported file is by opening it under Preview (which also allows for viewing search terms quickly, but requires repeating the request).

I’m hopeful that viewing PDF documents (especially in the Search Tool window) in DT will be greatly improved when the Tiger OS is released.

Hi Bill,

Thankx for your comment!

I already had the problem before I installed TextLightning, so it seems to be a different kind of problem.

Indeed, the indexing function serves my interests well.

I’ve got one question: In the manual it says that…

Does this mean that DT stores the PDF file itself, so I can open the PDF even after I have deleted the original PDF file? I tried it out, and I can open it, even after deleting. But doesn’t the term indexing suggest that only a reference (and the “textual content”) is held in DT and not the original file? Well, if it saves the original file then I would be very happy… :wink:

Cheers, Stefan

Stephan wrote:

I’m not certain, as my practice to date has been to link to original PDF & PS files, rather than importing them into the database or the db Files folder.

So I will defer to Eric or Christian to give you the answer on that question.

I tried different files, removed TextLightning from my system and restarted, but I still can’t make the TextLightning conversion to RTF work inside DT. Does anybody have a clue?

Thanx a lot!

Stefan

Does TextLightning work for you in TextEdit? To be honest, we don’t really encourage the use of TextLightning as it’s old, buggy and virtually unsupported.

If you want, please call us. You’ll find our office number on the Contacts page on our website.

Best,

Eric.

Thankx for your comment Eric!

I haven’t tried it with TextEdit, but I will check it out.

Maybe it’s more informative if I briefly tell you what my goals with DT are:

I work a lot with scientific articles that I retrieve as PDFs. If I index the whole document (on average ca. 7’000 words), a lot of information gets indexed that is not very relevant (e.g. details in the method section), and that lowers the signal-to-noise ratio. Steven Johnson pointed out that text chunks between 50 and 500 words worked fine for him and gave him a high signal-to-noise ratio:

http://www.stevenberlinjohnson.com/movabletype/archives/000230.html#more
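The 50 to 500 word range mentioned above can be illustrated with a small script. This is only a rough sketch, not something DT or Steven Johnson provides: the function name `chunk_by_words` and its parameters are made up for illustration, and it assumes the text has already been exported as plain text with blank lines between paragraphs.

```python
def chunk_by_words(text, min_words=50, max_words=500):
    """Group paragraphs into chunks of roughly min_words..max_words words.

    Hypothetical helper for illustration only. Paragraphs are never split,
    so a single paragraph longer than max_words becomes its own oversized
    chunk, and merging a short tail can push the last chunk over the cap.
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue  # skip empty paragraphs
        # Start a new chunk if adding this paragraph would exceed the cap.
        if current and count + len(words) > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para.strip())
        count += len(words)
    if current:
        # Merge a too-short tail into the previous chunk when one exists.
        if chunks and count < min_words:
            chunks[-1] += "\n\n" + "\n\n".join(current)
        else:
            chunks.append("\n\n".join(current))
    return chunks
```

Each returned chunk could then be saved as its own text file and imported as a separate document. Splitting on paragraph boundaries rather than at a fixed word count keeps sentences intact, at the cost of uneven chunk sizes.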

So I would like to import the PDF into DT and delete the less relevant parts, so that I end up with the theoretically interesting parts of the articles. Then I would link the original PDF to the entry so I could retrieve it in the future.

Now why RTF? For me there are two reasons:

  • A lot of the articles I’m interested in are laid out in double columns, and a plain text import will mess up the order of the paragraphs.
  • I would like to make (wiki-)links between the articles, which is not possible with plain text as I understand it.

Can I somehow achieve these objectives without RTF?

Thanx!

You could open the PDF in Preview, select the parts you’re interested in and copy them into a fresh RTF document. You can even select them in Preview and directly choose “Preview > Services > DEVONthink > Make new plain note”. Then, in DT, select the text document and choose “Format > Make Rich Text”.

Best,

Eric.

Stefan:

[1] I was glad to see Steven Johnson’s praise of DT’s contextual recognition “See Also” function in his NY Times essay. His practice of saving text fragments works well, and there is some logical support for his recommendation of his approach.

Personally, I don’t find it necessary to break up my existing documents into fragments in order to benefit from “See Also.” In fact, I’ve run into cases where fragmenting a large document would have reduced the serendipitous value of “See Also.”

[2] Capturing text from multicolumn PDFs is a bit tricky, as Preview won’t let you select a single column. Text becomes selected across the page, picking up sections of all the page columns. You can only select one page at a time. You may have to do substantial editing; at the least, you may need to delete extraneous material.

Eric suggested two methods for capturing text from Preview. Of the two, I like the Services route, which can capture plain text to DEVONthink. This method has the advantage that it will place the PDF’s path into the Info panel, so that a link from DT to the original PDF file exists. Another major advantage is that the Services > DEVONthink > Append plain text operation is available, and allows selection of additional text to be appended after the first capture – for example, you could first capture the title page, then append selections from page 14 and page 53 of the PDF file into a single DEVONthink text document. If you wish to add links, first change the document formatting to rich text, then set your links.

Disadvantages: Usually, paragraph formatting will be lost. Each column may become a single paragraph, for example. Styles will be lost. Sometimes words will ‘run together’ and may require editing for DT to perform searches correctly.

[3] As Eric noted, TextLightning is sometimes problematic, perhaps depending on the version of the PDF document to be converted. RTF conversion by TextLightning is much slower than plain text conversion using the pdftotext option. But the paragraph formatting of the original document is preserved, as are character styles. The converted text may look strange, but is easier to read. Therefore, editing text to remove undesired material is much easier than editing plain text captures using option [2] above.

I have TextLightning 3.1, running under OS X 10.3.7. I have no problems importing RTF text from, for example, PDF versions of articles from Science Magazine. So I can’t account for your difficulties in getting TextLightning to work for you. Note: I generally use TextLightning conversions only for such multicolumn PDF files. More often, I use File > Index import to capture the text of complex PDF files, as the result is easy to read (but not to edit).

I wonder if Eric means that the problem is that DT can’t always successfully call TL to do the job. In my experience, TL is a lot less than perfect, but it always does some sort of translation … except in DT.

Sometimes it works fine in DT and sometimes not. So far, I’ve found that I have to run TL on its own and import the RTF into DT about 60 percent of the time.

But it’s worth doing that because PDFTOTEXT just creates hunks of text that are unreadable, and copying from Preview works only for a single page. (Have you found a way to copy text from more than one page of Preview at a time? I’d like to know about it.)

There’s another PDF to text converter with more options called Trapeze. It worked pretty well for me when I tried it, though you’ll have to convert before bringing the result into DT.

Thanx to all of you for comments and suggestions!

These documents you are talking about, what is their structure?

I’m interested in behavioral economics and experimental psychology. Articles in these fields often have the following general structure (see e.g.,
people.virginia.edu/~tdw/dun … b.2003.pdf):

  • introduction
  • theory
  • method
  • results
  • discussion
  • general discussion

I was thinking about not indexing the whole PDFs because the three sections method, results and discussion are not important for the semantic meaning of the article. They contain detailed information about the procedure, subjects, materials, statistical analysis, discussion of statistical results and so forth. But maybe this is not a problem because the AI of DT ignores text elements which appear everywhere and are thus not informative.
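The idea that words appearing in every document carry little information can be made concrete with inverse document frequency (IDF), a standard weighting from information retrieval. How DT actually weights terms is not stated anywhere in this thread, so the sketch below is an assumption-laden illustration of the general principle, not a description of DT’s algorithm; `idf_weights` is a hypothetical name.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return an inverse-document-frequency weight for every term.

    Illustration only: a term found in all documents gets weight 0,
    while a term found in few documents gets a higher weight.
    Tokenization is a naive lowercase whitespace split.
    """
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n / df) for term, df in doc_freq.items()}
```

Under a weighting like this, boilerplate vocabulary shared by every method and results section would contribute little, while words specific to one article would dominate. Whether DT behaves this way is exactly the open question here.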

Of course I would be happy with just indexing the PDFs. But before I import all my PDFs and find that the signal-to-noise ratio is too low, I’d rather ask you about your experiences.

What do you think?

Stefan:

There’s no single answer.

The most compact capture of the example document you cited would be copying the title/author/italics (abstract) information at the beginning. That would give you the basic ‘semantic content’ of the article.

But what if you get interested in looking at methodologies and statistical procedures in the field? Then the approach above wouldn’t do the job for you.

I usually capture short to medium length (perhaps up to a hundred pages) PDFs in their entirety. But I’ve got some PDFs that run 500 pages or more. I may ‘cheat’ by replacing the captured text with a good review article summarizing the material (but still linked to the full PDF). Or I may hit DT’s Summarize button to automatically reduce the content size while probably retaining the semantic content. I’ve even summarized summaries.

Sometimes a particular item becomes a “sink hole” that confuses DT’s classification feature. For example, I captured a 45-page article about science policy in India that became DT’s first choice for classifying many other items – often with little or no apparent connection to the topic of science policy in India! Running Summarize on that item greatly reduced DT’s classification confusion.

Literature research, especially using computer tools, will probably remain more of an art than a science for some time to come. It’s about thinking and exploring. At this point, I generally choose to err on the side of capturing too much information rather than trying to limit captures stringently.

I see your point and I agree: The solution has to be satisfactory, not perfect, since there is no perfect solution, even for a single user.

Thanx for your suggestions!

Here’s a cool tip I discovered by mistake a few days ago:

You can select single columns in Preview. Hold down Cmd-Option, and then you can drag to select any rectangle on the page. Copy, and paste. It worked with a few PDFs I tried, but your mileage may vary depending on the PDF encoding. (Select within a column, or partially across 2 columns, and you can get… gibberish!)

Joe