Better Move To/See Also for webpages by removing irrelevant

I often save web pages to DevonThink as PDF files.

It seems that for web pages, DevonThink’s hallmark functions Move To/See Also do not work well. The suggestions are seldom useful. I presume a large part of the reason is that a web page contains much more text than just its main text, and a lot of that extra text is not closely related to the main text.

I would like to take a stab at improving the precision of DevonThink’s guesses by removing the extra text from the index, using the algorithm from Readability (lab.arc90.com/experiments/readability/) to identify the main text (this is more or less the same algorithm that Safari uses for its Reader function).

Since I already have some experience using Ruby for scripting DevonThink, I will use the Ruby port, github.com/iterationlabs/ruby-readability
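
To make the idea concrete, here is a minimal sketch of the kind of filtering I have in mind. It is not the Readability algorithm and not the ruby-readability API, only a simplified stand-in that uses the Ruby standard library: it scores each paragraph by length and comma count, penalises text that sits inside links, and keeps what scores above a threshold. The helper names (strip_tags, score_block, main_text) and the threshold value are my own assumptions for illustration.

```ruby
# Simplified stand-in for a Readability-style extractor (NOT the real algorithm).
# All helper names and the threshold are hypothetical; stdlib only.

MIN_SCORE = 0.5 # assumed cut-off, chosen for illustration

# Replace tags with spaces and collapse whitespace.
def strip_tags(html)
  html.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
end

# Longer text and more commas raise the score;
# a high share of link text (navigation, menus) lowers it.
def score_block(inner_html)
  text = strip_tags(inner_html)
  return 0.0 if text.empty?

  link_text = inner_html.scan(%r{<a\b[^>]*>(.*?)</a>}mi)
                        .flatten.map { |t| strip_tags(t) }.join(" ")
  link_density = [link_text.length.to_f / text.length, 1.0].min
  (text.length / 100.0 + text.count(",")) * (1.0 - link_density)
end

# Return the text of the paragraphs that look like main content.
def main_text(html)
  # Drop script and style elements before looking for paragraphs.
  cleaned = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}mi, "")
  paragraphs = cleaned.scan(%r{<p\b[^>]*>(.*?)</p>}mi).flatten
  kept = paragraphs.select { |para| score_block(para) >= MIN_SCORE }
  kept.map { |para| strip_tags(para) }.join("\n\n")
end
```

Calling main_text(html) on a saved page should return the article paragraphs with navigation links and short footer lines dropped. With the actual gem, the equivalent call would, as far as I understand its README, be Readability::Document.new(html).content.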

But I have some questions:

  • When the PDF document gets its text analyzed (and becomes PDF+Text), I assume it is the “Text” that is used as the basis for the Move To/See Also function.

But which text is this? A document seems to have two texts associated with it, “plainText” and “richText”. Do I need to replace both?

After replacing the original text with a clean version, will it be reindexed automatically, or do I need to trigger it somehow?

  • When a new PDF is added to the database, DevonThink will automatically analyze its text. Is there any hook that can be used to replace DevonThink’s method for extracting text from the PDF with my own? Alternatively, is there any way to start a program of my own when DevonThink is done with a new document?

It’s not possible to replace the (indexed) text of PDF documents via AppleScript. The only workaround is to edit the document using a third-party solution and to reindex it afterwards (using the synchronize command).

No.

I’ve written a Ruby script that Readability-fies the web pages I’ve collected in DevonThink. It significantly improves the accuracy of Move To/See Also.

The script can be found here:
github.com/tommysundstrom/DevonThink-helper

I’m using Instapaper for this. “Normal” articles are run through Instapaper; IMDB pages are not. This results in a very useful See Also functionality. For normal articles it shows similar articles. For movies it shows movies that have no direct relation to the title or content but are somewhat similar to the selected movie, so it works more like recommendations. I guess it gets this information from the full IMDB page, which contains recommendations. So when I click on a movie in DT, it shows me matching movies that I have also stored in DT (which are the movies I’ve either seen or bought). Perfect!