Script for annotating by paragraph

I am looking for a way to script note-taking at the paragraph level. I am new to DEVONthink. I have no programming or scripting skills. I am a PhD student in social science trying to finish my thesis.

Scenario: I have ca 4000 docs/pdfs that I’ve collected over the past three years. These vary in terms of quality and depth. The ideas expressed in them don’t always cohere or flow, but there might be some good bits at the paragraph level. I could of course search/tag everything, but I would still need to hunt through each one separately to find the good bits. Also, I assume that DT’s AI capabilities are aided by dividing the docs into smaller pieces.

Solution?: I wonder if anyone has a script (or could easily make one) that creates a note from every paragraph in the doc/pdf, uses the first 50 (or so) characters of the first sentence as the note title, and compiles/merges these into one annotation template so they can be viewed in context in the original doc/pdf (one annotation template per doc/pdf).
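To make the idea concrete, here is a rough sketch of the logic in Python, not a DEVONthink script (a real solution would presumably be AppleScript inside DT, and the file and folder names below are made up). It assumes one doc/pdf has already been exported as plain text with blank lines between paragraphs:

```python
# Rough sketch only: split one exported plain-text document into paragraph
# "notes" and collect them, in order, into a single annotation file.
# File and folder names are hypothetical.
from pathlib import Path

SOURCE = Path("my_draft.txt")       # plain-text export of one doc/pdf
OUT_DIR = Path("paragraph_notes")   # folder for the per-paragraph notes
OUT_DIR.mkdir(exist_ok=True)

text = SOURCE.read_text(encoding="utf-8")

# Treat blank lines as paragraph breaks.
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

summary = []
for i, para in enumerate(paragraphs, start=1):
    # First ~50 characters of the paragraph as the note title, with
    # whitespace collapsed and slashes removed so it is a valid file name.
    title = " ".join(para.split())[:50].replace("/", "-")
    (OUT_DIR / f"{i:03d} {title}.txt").write_text(para, encoding="utf-8")
    summary.append(f"{title}\n{para}\n")

# One combined annotation file per source document, so the paragraph notes
# can still be read in their original order and context.
(OUT_DIR / f"{SOURCE.stem} - annotations.txt").write_text(
    "\n".join(summary), encoding="utf-8")
```

The point is just that the per-paragraph step is mechanical: split on blank lines, truncate for the title, and write one combined file per source document.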

My idea is based partly on this script for importing Skim pdf annotations into DT: complexpoint.macmate.me/Site … think.html . While it works okay with pdfs that are already annotated, I am looking for something to process docs and pdfs that have yet to be annotated. Also, this script creates a separate note/annotation template for each highlight/comment. It would be better if these were all collected/collated onto the same annotation template, one per doc/pdf.

Of course I probably would not want to perform this on all the docs/pdfs but at least 500 or so.

Does this make any sense? I’m grateful for any and all input from those more skilled than I!

You want to take every paragraph in a PDF document, strip all but the first 50ish characters, and combine the truncated paragraphs back into a new document?

I assume the PDFs are OCRd. An OCRd PDF can be converted to Rich Text; you can then select all that text and use Edit > Summarize to condense the paragraphs into short précis. It’s a possible (albeit somewhat manual) approach to get you started.

I know it does sound odd, and it probably is, because I’m not clear yet how DT works. Let me refine my question a bit.

So I have two types of docs/pdfs: a) drafts and papers which I’ve authored; and b) downloaded/scanned journal articles, book chapters, etc (mostly in pdf).

In instance “a” I would like to break these up into paragraphs because often, though not always, these docs are really a drafted collection of disconnected ideas. Having them contextualized in relation to the original doc would let me know how connected or disconnected they are. They consist mostly of txt, rtf, doc, and docx.

Instance “b” feels a bit trickier because I’m still unclear about the best way to set up DT to handle pdfs. Should they be searchable or not? How should they be stored? What’s the best way to link them with bib ref software (Sente, Bookends, etc.) and read them from the Cloud?

I realize my suggestion of creating notes for ALL paragraphs seems extreme, but I’m simply looking for a way to use the power of DT to give me an initial survey of all my documents at the paragraph level rather than document level.

Let me give an example: Say I have a PDF with 300 paragraphs. 30 of these are mildly interesting, 20 very interesting, and 10 are outstanding. The rest I don’t think are relevant. Normally one would have to read/browse through the entire document, annotating it to pick out the best bits. However, if I had each of these paragraphs as notes I could then tag them or create smart groups that would collect them together based on the topic I am exploring. This would give me handles into the text at the paragraph level rather than the document level. I could then import these into my outlining software as quotes in context, rather than importing the entire pdf.

Oh, the note title would include Author, Date, and Page # along with the first 50ish characters. It would also ignore sentences shorter than ca 100 characters to exclude names, headings, figures, etc.
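In sketch form, again just illustrating the logic with made-up names (the author/date/page would have to come from the reference manager or the file name rather than from the text itself):

```python
# Rough sketch of the two refinements above: build titles from citation
# metadata plus the first 50ish characters, and skip anything too short
# (names, headings, figure captions). Author/year/page are assumed to be
# supplied from elsewhere, e.g. the reference manager or the file name.
def note_title(author, year, page, paragraph, width=50):
    snippet = " ".join(paragraph.split())[:width]
    return f"{author} {year} p.{page} {snippet}"

def keep(paragraph, min_chars=100):
    return len(paragraph.strip()) >= min_chars

# Example: note_title("Smith", 2010, 12, "The central claim of this chapter is ...")
```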

Does this still sound silly? Perhaps there is a more straightforward way to do this, but I don’t understand DT well enough yet to know what that would be…

This doesn’t sound silly at all - in fact, it’s one of the basic problems in research method. I’ve faced this problem regularly in my own research – and I’m not confronted with a defense deadline :open_mouth:

Take a look at this thread and its scripts - as well as the threads internally referenced there. The scripts in that thread operate on an RTF file – you can convert OCRd PDFs to RTF in order to use the scripts. I’ve noticed you’ve cross-posted your inquiry to the Tinderbox forum. I wrote the scripts in the above thread because I then take the exploded notes and dump them into Tinderbox, where it is easy to create note attributes, agents, and maps that can help your research and evaluation of the written texts. Lots of advice available in that forum too.

Look also in the scripting forum at the variety of annotation scripts. Try also the annotation script that’s packaged with DEVONthink.

I advise experimenting on a small subset of your library; get a technique, then stick with it.

If you have some time, listen to the 100th show of David Sparks’ Mac Power Users podcast (posted today), which has several interesting segments on research – including coverage of DEVONthink and Tinderbox.

Thanks! This all sounds promising. I’ll post back with whatever I come up with.

Over at the Tinderbox forum I was advised to do a paragraph tear down of my own text rather than the pdfs, which I realize now makes a lot more sense.

So the next quest: automated/scripted pdf keyword search/annotation. I wonder if anyone might suggest how to automatically create yellow highlighted annotations/notes in each pdf, have these assembled onto a DT annotation template, and then collect them into smart groups by search phrase/string? For example, the search (cat OR dog OR bird OR hamster AND pet) would automatically highlight all the paragraphs in my pdf library that match this search, give each paragraph a note indicating the specific search string, and then collect all the highlighted paragraphs onto a DT annotation template labeled with the article’s name.
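Roughly what I have in mind, sketched in Python over plain-text exports rather than the pdfs themselves (the folder names and the exact search logic are made up; a real version would have to respect DT’s search syntax and write actual highlights/annotations):

```python
# Hypothetical sketch of the search-and-collect step: scan plain-text
# exports of many documents for a set of terms, and write one annotation
# file per document listing every matching paragraph together with the
# term(s) that matched it.
import re
from pathlib import Path

TERMS = ["cat", "dog", "bird", "hamster"]   # the OR part of the search
REQUIRED = "pet"                            # the AND part of the search
SOURCE_DIR = Path("exports")                # hypothetical folder of .txt exports
OUT_DIR = Path("search_annotations")
OUT_DIR.mkdir(exist_ok=True)

def matches(paragraph):
    """Return the terms found in a paragraph that also contains the required word."""
    if not re.search(rf"\b{re.escape(REQUIRED)}\b", paragraph, re.IGNORECASE):
        return []
    return [t for t in TERMS
            if re.search(rf"\b{re.escape(t)}\b", paragraph, re.IGNORECASE)]

for doc in sorted(SOURCE_DIR.glob("*.txt")):
    paragraphs = [p.strip()
                  for p in doc.read_text(encoding="utf-8").split("\n\n")
                  if p.strip()]
    hits = []
    for para in paragraphs:
        found = matches(para)
        if found:
            hits.append(f"[matched: {', '.join(found)}]\n{para}\n")
    if hits:
        (OUT_DIR / f"{doc.stem} - matches.txt").write_text(
            "\n".join(hits), encoding="utf-8")
```

Each output file would play the role of the per-article annotation template, and the “[matched: …]” lines are what a tag or smart group could key on.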

It’s taken me a while to articulate this. I’ll have a look through the other threads and post again if I find anything. In the meantime I’m grateful if people post any tips here!

  1. “automatically create highlight[s]” - Not possible to script DEVONthink to do this. Your best bet is to script this with Skim, IMO. That said, the task of capturing the paragraph that surrounds a search target (e.g., the paragraph within which “bird” appears) is tricky, from an automated perspective, but not from a manual perspective.
  2. “assemble highlights into a DT annotation template” - depends on the template, but you should be able to get Skim to do a workable export to DEVONthink - and it can be RTF
  3. “collected into smart groups” – sure, but you cannot create a smart group with a script. Though it’s simple to make a sample smart group, clone it, and change a few predicates.

There’s another approach. Convert your PDF to RTF. Do your search and highlighting there. Use the script at Scripts > Data > Merge Highlights (RTF) to yank the highlights into a new RTF document. It’s also possible, then, to script the search and highlight process in RTF, where it’s not possible to script it in PDFs. That the PDF-to-RTF conversion is frequently ugly doesn’t matter, because the converted document is only an interim product and can be disregarded.