I often save web pages to DevonThink as PDF files.
It seems that for web pages, the DevonThink hallmark functions Move To/See Also do not work well. The suggestions are seldom useful. I presume a large part of the reason is that a web page contains much more text than just its main text, and a lot of that extra text (navigation, sidebars, footers and so on) is not closely related to the main text.
I would like to take a stab at improving the precision of DevonThink's guesses by removing the extra text from the index, using an algorithm from Readability (lab.arc90.com/experiments/readability/) to identify the main text (this is more or less the same algorithm that Safari uses for its Reader function).
Since I already have some experience using Ruby for scripting DevonThink, I will use the Ruby port, github.com/iterationlabs/ruby-readability.
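To make the idea concrete, here is a rough, stdlib-only sketch of the kind of filtering I have in mind. It is not the Readability algorithm, only a crude stand-in: it keeps `<p>` blocks that are reasonably long and contain little link text. The names `main_text`, `MIN_CHARS` and `MAX_LINK_DENSITY` and both threshold values are my own assumptions for illustration. In the real script the Ruby port would take its place (if I read its README correctly, via `Readability::Document.new(html).content`).

```ruby
# Crude stand-in for Readability-style extraction (NOT the arc90 algorithm).
# It keeps <p> blocks that are long enough and mostly free of link text.

MIN_CHARS = 80          # assumed threshold: shorter paragraphs are dropped
MAX_LINK_DENSITY = 0.3  # assumed threshold: share of text that may be links

# Remove tags and collapse whitespace, leaving only the visible text.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

# Fraction of a fragment's visible text that sits inside <a> elements.
def link_density(fragment)
  text = strip_tags(fragment)
  return 1.0 if text.empty?
  link_text = fragment.scan(/<a\b[^>]*>(.*?)<\/a>/mi)
                      .map { |m| strip_tags(m[0]) }
                      .join(' ')
  link_text.length.to_f / text.length
end

# Return the paragraphs that look like main content, as plain text.
def main_text(html)
  # Drop script and style elements first so their contents are never scanned.
  body = html.gsub(/<(script|style)\b.*?<\/\1>/mi, ' ')
  paragraphs = body.scan(/<p\b[^>]*>(.*?)<\/p>/mi).map(&:first)
  kept = paragraphs.select do |para|
    strip_tags(para).length >= MIN_CHARS &&
      link_density(para) <= MAX_LINK_DENSITY
  end
  kept.map { |para| strip_tags(para) }.join("\n\n")
end
```

Calling `main_text(File.read("page.html"))` on a saved page (the file name is only an example) would return the candidate main text as a plain string, which is what I would then try to store in the record.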
But I have some questions:
- When the PDF document gets its text analyzed (and becomes PDF+Text), I assume it is the “Text” that is used as the basis for the Move To/See Also function.
But which text is this? A document seems to have two texts associated with it, “plainText” and “richText”. Do I need to replace both?
After replacing the original text with a clean version, will it be reindexed automatically, or do I need to trigger it somehow?
- When a new PDF is added to the database, DevonThink will automatically analyze its text. Is there any hook that can be used to replace DevonThink's method for extracting text from the PDF with my own? Alternatively, is there any way to run a program of my own once DevonThink is done with a new document?