Just index the main text of a web page

tommysundstrom · December 20, 2010, 8:05am

I often save web pages to DevonThink as PDF files.

It seems that for web pages, the DevonThink hallmark Move To/See Also does not work as well as it does for ordinary documents. The suggestions are seldom useful. I assume a large part of the reason is that a web page contains much more than the main text (navigation, sidebars, and other extra information), and much of that extra material is not closely related to the main text.

I think the results would be better if an algorithm like the one used by Readability were applied to find the main text of the page, and only that text were indexed. This would give a text similar to the one presented by Safari's Reader function.

Readability: lab.arc90.com/experiments/readability/
Ruby port of the algorithm: github.com/iterationlabs/ruby-readability
There is also a JavaScript version.
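
To make the idea concrete, here is a minimal Python sketch of this kind of main-text extraction. It is only loosely modelled on the Readability approach and is not the actual Readability algorithm or the Ruby port: the scoring formula, the class name MainTextExtractor, and the function extract_main_text are all assumptions made for illustration. It scores each paragraph by its length, penalises paragraphs that consist mostly of link text (typical for navigation), credits the score to the enclosing element, and returns the paragraphs of the best-scoring element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _Node:
    """One open element, with the counters needed for scoring."""

    def __init__(self, tag):
        self.tag = tag
        self.pieces = []       # raw text chunks (only filled for <p>)
        self.chars = 0         # total characters of text (only for <p>)
        self.link_chars = 0    # characters that sit inside <a> (only for <p>)
        self.score = 0.0       # sum of the scores of direct <p> children
        self.paragraphs = []   # cleaned text of direct <p> children


class MainTextExtractor(HTMLParser):
    """Hypothetical helper: credits each <p> to its parent element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open elements, outermost first
        self.nodes = []   # every element seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = _Node(tag)
        self.stack.append(node)
        self.nodes.append(node)

    def handle_data(self, data):
        # Find the innermost open <p>, noting whether an <a> sits inside it.
        in_link = False
        for node in reversed(self.stack):
            if node.tag == "a":
                in_link = True
            elif node.tag == "p":
                node.pieces.append(data)
                node.chars += len(data)
                if in_link:
                    node.link_chars += len(data)
                return

    def handle_endtag(self, tag):
        # Locate the matching open element; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                break
        else:
            return
        del self.stack[i + 1:]          # discard unclosed children
        node = self.stack.pop()
        if tag != "p" or not self.stack or node.chars == 0:
            return
        text = " ".join("".join(node.pieces).split())
        if not text:
            return
        link_density = node.link_chars / node.chars
        # Assumed scoring: long, comma-rich, link-poor paragraphs win.
        score = len(text) * (1.0 - link_density) + 10 * text.count(",")
        parent = self.stack[-1]
        parent.score += score
        parent.paragraphs.append(text)


def extract_main_text(html):
    """Return the paragraphs of the best-scoring element, or '' if none."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    best = max(parser.nodes, key=lambda n: n.score, default=None)
    if best is None or best.score <= 0:
        return ""
    return "\n\n".join(best.paragraphs)
```

Calling extract_main_text(page_html) on a page whose navigation is a block of links and whose article is a block of ordinary paragraphs should return only the article paragraphs. The sketch assumes reasonably well-formed markup (text in an unclosed p element is dropped), and real pages will need a more careful heuristic.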

cgrunenberg · December 20, 2010, 10:59am

Thanks for the suggestion! The only workaround right now is to take a plain or rich note of the main text using services. This also reduces the size of the database and improves the concordance.

tommysundstrom · December 28, 2010, 11:56pm

I wrote a quick and dirty script that converted my PDFs to RTF documents, using the algorithm mentioned above. The accuracy of the suggestions is vastly improved. Not to the level where I would trust Auto Classify (at least not for my database), but now I get several useful suggestions most of the time.

Extracting the main content of the page and presenting it in a size and style suitable for reading is also an improvement.

The major drawback is that the algorithm is not 100% accurate. I have to convert some pages back to PDF to see the relevant content.
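
The script itself is not shown in the thread, so the following is not the author's code. As a hedged illustration of the last step only (turning already extracted paragraphs into an RTF document set in a readable size), here is a minimal Python sketch. The function names rtf_escape and paragraphs_to_rtf, the default font, and the default point size are all assumptions.

```python
def rtf_escape(text):
    """Escape text for use inside an RTF document body."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and need a backslash prefix.
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit one \uN? per UTF-16 code unit, N as signed 16-bit.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


def paragraphs_to_rtf(paragraphs, font="Helvetica", point_size=14):
    """Build a minimal RTF document; font and size are assumed defaults."""
    body = "\\par\\par\n".join(rtf_escape(p) for p in paragraphs)
    # \fs takes the font size in half-points, hence point_size * 2.
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 %s;}}\\f0\\fs%d %s}" % (
        font, point_size * 2, body)
```

Writing the returned string to a file with an .rtf extension (for example with open(path, "w", encoding="ascii")) gives a document that can be imported like any other rich text note. The output is pure ASCII because every non-ASCII character is escaped.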