Making the web clipper as good as Evernote's

Dellu · July 19, 2014, 1:37pm

I want to rely entirely on DEVONthink, and have tried to (leaving Evernote out of my life completely). But the clumsiness of the web clipper in DEVONthink makes me go back to Evernote again and again.

It would be great if the web clipper could be enhanced to work like the one in Evernote.

In particular, the lack of a feature to filter out ads and the rest of the junk from the actual article is a major drawback of the DT clipper. (Yes, I have tried the other tricks, like “print friendly”, but they are all unsatisfactory.)

Is it doable?

Bill_DeVille · July 19, 2014, 3:30pm

I almost never capture a full Web page, as I want to exclude irrelevant content. There are two major advantages. By excluding unwanted content, the file size of the capture will be smaller. More importantly, by excluding irrelevant text the efficiency of searches and of the AI assistants such as Classify and See Also will be improved.

Most of my captures are as rich text of a selected area of an HTML page. Tip: Select from the bottom up for fast selection of the desired content.

In Safari, DEVONagent Pro or DEVONthink’s browser, a Service shortcut, Command-) will capture the selection as rich text. Rich text preserves text formatting, links, images, tables and lists. In a few cases where preservation of the layout of images or tables may be important, I’ll use the Service keyboard shortcut Command-% to capture the selected area as WebArchive.

On some pages with lots of irrelevant images, capture as rich text of the desired content can reduce the size of the capture by two or even three orders of magnitude, compared to a full-page capture as PDF or WebArchive.

Note that these two Services to capture selected content do not work in Firefox or Google Chrome.

Dellu · July 19, 2014, 4:37pm

Well, I don’t know what to say about this. For various reasons, my browser of choice is Chrome (and sometimes Firefox). I am not going to drop it just for this.

Thank you for the reply, anyway.
I still hope that the DT team can do better than this on the clipper.
(Checking out various blogs on why people consider Evernote superior to DT gives you a good glimpse of what users think: the efficiency of the clipping tool is one of the most important reasons why Evernote captivates users’ minds. Sorry, I am not comparing DT with Evernote; I still believe DT is a much more efficient application. But improving the clipper could make it even more usable for the web. I actually think that the web clipper is the most critical area of improvement for DT.)

Cassady · July 19, 2014, 8:36pm

Could you maybe be a bit more specific? Is it only the ability to strip ads, or what other features/efficiency would you be hoping for? The partial Skitch integration?

gg378 · July 19, 2014, 9:09pm

It seems we all have our preferences. I no longer capture anything as a webarchive (not croppable and, as far as I know, not a universal format beyond OS X), but I also don’t use RTFD. As nice as the idea behind it is (a package with the original figures etc.), it is not a universal format (I think there is not a single viewer that would work on Windows), and it falls shockingly short on some fronts, such as the ability to scale images. The final dealbreaker is that RTFD doesn’t really work well on iOS.

My method of choice is to clip the webpage as a “single page” pdf. It gives me an archival snapshot of the site in the most portable, standardized format there is short of ASCII. It can be nicely annotated with your tools of choice on OS X and iOS.

Just like Dellu, I am not keen on ads and anything that is not relevant to the content I try to capture. Here is what I do: After clipping the webpage into DT as a single page pdf, I open it with Preview and use the crop feature to cut down on irrelevant stuff. This works amazingly well, at least for the typical sites I capture, because all these pages (in particular blogs) seem to have most of the “crap” on the left, the right, the bottom, or even the top, and very little interspersed in the actual body of the page; therefore a single rectangular crop will do the trick. If there is some bothersome stuff within the body, one could quickly cover it up with a rectangular patch using the Preview annotation tools.
The captured pdfs are small in size, so this is not an issue at all for me (and I sync all my databases to DTTG, which is the ultimate test for the unwieldiness of a DB).

Dellu · July 19, 2014, 11:34pm

I really don’t care about Skitch. As you can read from gg378’s post, he processes the pages after he has clipped them. That is the stage I hate, and it is what makes me turn to Evernote.
So it is not just about removing the ads; it is about extracting the article and leaving the rest of the page out.

Most blog posts have nice content in the middle, and everything surrounding the post on all sides is irrelevant junk. Have you tried the Evernote clipper? It has a way of picking just the substance and filtering out the irrelevant. If you want to expand your selection, there is a little + sign that lets you enlarge the clip area. It is like a snapshot; you can expand and contract it. In some blogs, the users’ comments contain information as valuable as the actual article. In these cases, I expand the clipping area to include the comments and clip it. Perfect!

Personally, I find the single page PDF a perfect format, as long as the junk is filtered out so that I am saved from further processing.

There are other ways of sending pages as PDF, like “print what you like”, Readability + Send PDF to DEVONthink, etc., as I mentioned above. They usually succeed in removing the junk parts. But since they usually work by reformatting the pages, they destroy the beauty of the pages (the original formatting).

Bill_DeVille · July 20, 2014, 12:20am

I don’t like Evernote’s “draw a box” mode of selecting desired content on an HTML page. The layouts of some of the sites that I routinely capture from result in that “box” either leaving out relevant material, or pulling in a lot of irrelevant content in the resulting clipping. Example: Most Wall Street Journal articles.

Rich text captures will typically be smaller in file size than a PDF capture. I agree that RTFD files are not “universal”. If I want to share a file with colleagues who don’t use Macs, the solution is to open the RTFD in Bean, add page headers and footers and “print” it as PDF. Then send the PDF.

I do my draft writing within DEVONthink, in rich text. It’s easier to grab excerpts, images and tables from rich text clippings than from PDFs, when I’m working with captured documents. I do my final editing in Pages by copying draft notes and pasting them into a Pages document. Images come over directly to Pages that way.

gg378 · July 20, 2014, 12:50am

We clearly have quite different needs. Bill is absolutely right that if you want to use the captured materials, e.g. for copy and paste operations, RTF is much better than pdf; in the latter format, copying from multi-column layouts can be a nightmare. I am overwhelmingly capturing subatomic-physics related materials, and in general not for further processing, other than reading and understanding it, and available for future searches. In other fields, the needs are different, i.e. someone else might have to heavily quote from the collected materials; then pdf is not the best.

Originally, I was also very much against any post-processing; things had to “just work” or what’s the point? I strived for nearly automatic collection of materials. After a while I noted that this just led me to capture stuff by the truck load, without looking at it ever again. I now enter items sparingly. At that level, the post-processing is trivial, and it also helps to review the actual material. My take is that if I don’t have the time to redact the materials at a minimum level, it’s probably not important enough for me to store.

What is important to me is that the actual capture step is quick and painless; because often something comes up in the middle of another task. The capture has to quickly and reliably go into the inbox. Then, ideally, every day, in a review session, these captured items are sorted into groups (keywords) and further refined (cropping of web clippings etc). This last step is critical: Intense further processing on the spot is no good, as it interrupts the workflow. But if the inbox is not cleaned up daily, I get a huge backlog, and it is a nightmare to catch up on 200 or more inbox items.

I am curious how successful Evernote is with “figuring out what is useful to you in a website and what isn’t”. A lot of script-driven sites are hard to handle, and I figure that you still have to work quite a bit to ensure that you captured what you want. That’s why I like the “capture to pdf” method. Generally (not always), everything is captured correctly. Removing junk by cropping in a review session is infinitely better than realizing (or not) that something is missing. The pdf cropping is only about cosmetics, not information. It can always be done later or omitted, if time is tight.

And ultimately, the captured pdf is only my safety net, in case the website vanishes, and to include that page into my local DT search. Many webpages get updated incrementally, and in the end, I often don’t look at the captured version, but the up-to-date live webpage.

Dellu · July 20, 2014, 1:14am

Well, in Evernote, you are offered five choices when you click the icon:

Article
Simplified article
Full page
Bookmark
Screenshot
dropbox.com/s/0p5du6851um20wr/ppic368.png

Article is my favorite; and it is the one that I am suggesting to be implemented in DT clipper. It basically crops the article only for you (and of course, expandable if you need to add some part of the page which is not part of the article). In case the page is too complex or you are afraid you will miss some content, you have the choice of clipping the whole page.

The Article mode saves you the post-processing stage, at least on simpler pages. For me, Article does the job almost always. I hardly ever clip the whole page when I use Evernote.
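
For readers wondering how an “Article”-style mode could pick out the main content of a page, here is a minimal sketch of one common heuristic: prefer the container that holds a lot of text and little link text. This is an assumption for illustration only; it is not Evernote’s algorithm and not anything in the DT clipper, and the names `BlockScorer` and `guess_article_block` are made up for this example.

```python
from html.parser import HTMLParser

# Container tags treated as candidate "article" blocks (an assumption of this sketch).
BLOCK_TAGS = {"article", "div", "section", "main", "td"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Text inside these elements is not visible content and is ignored.
SKIP_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Records, per candidate block, its amount of text and of link text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as (tag, block record or None)
        self.blocks = []     # every candidate block seen so far
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        record = None
        if tag in BLOCK_TAGS:
            record = {"tag": tag, "attrs": dict(attrs), "text": 0, "link_text": 0}
            self.blocks.append(record)
        if tag == "a":
            self.link_depth += 1
        self.stack.append((tag, record))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for closed_tag, _ in self.stack[i:]:
                    if closed_tag == "a":
                        self.link_depth -= 1
                del self.stack[i:]
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or any(tag in SKIP_TAGS for tag, _ in self.stack):
            return
        # Credit the text to the innermost enclosing candidate block only.
        for _, record in reversed(self.stack):
            if record is not None:
                record["text"] += n
                if self.link_depth > 0:
                    record["link_text"] += n
                break


def guess_article_block(html):
    """Return the record of the block with the most non-link text, or None."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()

    def score(block):
        if block["text"] == 0:
            return 0.0
        link_density = block["link_text"] / block["text"]
        return block["text"] * (1.0 - link_density)

    return max(parser.blocks, key=score, default=None)
```

On a typical blog page, a navigation bar or an ad box consists almost entirely of links and scores near zero, while the post body scores high. Script-driven or deeply nested layouts would need a good deal more than this, which may be part of why such clippers also offer a way to expand the selection by hand.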

gg378 · July 20, 2014, 2:46am

Have you looked into the “Instagram reformatting” option in the clipping plug-in? I tried it a few times, and it strips all the unnecessary stuff from a blog page, for example. I still prefer to capture the full page and then crop.
But this might come reasonably close to the “Article” method in Evernote.

Cassady · July 20, 2014, 9:19am

That’s what I was wondering as well.
I use the Clipper with PDF (One Page) as my “default” setting - works a charm, and as mentioned above - can be taken/sent/read anywhere, on most anything.

If I’m clipping off ‘formal’/academic sites, then that’s all I use.
If I’m clipping off an advert-full newspaper site, I select “Reformat with Instapaper”, which, together with PDF (One Page), strips out 99% of the bloat, with the exception of the actual website’s links to other sections (which are usually at the bottom in any event).

I haven’t compared it to Evernote’s function - but would presume it gets close?

Dellu · July 20, 2014, 4:17pm

Yes, I have tried these methods. It is “Instapaper”, by the way, not “Instagram”.

Instapaper works the same way as Readability (and Printer Friendly). Yes, I use them sometimes. They are not close to Evernote’s Article mode, but rather to its Simplified Article mode. I get good results on some pages. As I have mentioned above, the reformatting destroys the beauty of the pages, and they fail when the article has several sections.

If you have ever tried to clip conversations on Quora.com or Reddit, you will know how terribly the Instapaper (Readability) technique fails, while the Evernote Article mode handles it all elegantly.

korm · July 20, 2014, 6:50pm

Opinions about preferences, or suggestions about the usage of one clipper or the other, are interesting, but the original point of this thread:

is a reasonable question – one that merits a factual answer from DEVONtechnologies at some point.

Here is a feature comparison that might be helpful. Send me a PM if something needs refinement and I will update this. I’ll admit, the labeling of some features as “Similar - Post Clip Adjustment Needed” is more informational than useful.

Dellu · July 20, 2014, 8:15pm

Thank you, Korm. That is beautiful. I can now state my request easily: a “save as Article” feature.

gg378 · July 20, 2014, 9:57pm

Hmm, OK, if we confine ourselves to that, then the answer is outright trivial: According to Dellu, such code exists for Evernote, and unless this code is so elaborate that it needs to somehow reside on massive Evernote servers, which I doubt (but is the case e.g. for their handwriting recognition), the answer is obviously YES. But that doesn’t get us far.

The real question is then “what is the real cost to DT development of focusing on this feature in a satisfying manner?”. This is the trouble with many feature requests in this forum. Are there enough users that desperately want this? I think that’s what the feedback from me, Cassady, and Bill was trying to provide: Those who are interested in this thread so far have indicated that they are quite satisfied with the options provided.

The recent addition of the “Instapaper” option (and duh - I don’t know how I slipped to Instagram in my previous post) shows that the devs sense some sort of need for such a feature. But it also shows that they felt compelled to use an easy-to-tap-into external service to accomplish this, rather than spending time implementing something on their own.

If it’s reasonably easy AND does not detract from core development (for me, it must not take any time away from DTTG 2 development), sure, I would support such a method on Dellu’s word that it works really well.

P.S.: Very nice comparison chart, Korm!

ibuys · July 31, 2014, 2:08pm

In my opinion, the usefulness of the web clipper is less about a feature-for-feature comparison than about how the two extensions function. Using the Evernote clipper, everything happens on the same page, in the browser. Using DEVONthink’s, the clipper launches in a separate window that pulls you out of Safari. This is very distracting when running Safari full screen, and only somewhat less distracting when Safari runs as a window on the main desktop.

My suggestion for the clipper would be to streamline the workflow as much as possible, and allow me to save content without interrupting my research.

I have to admit that the “⌘ )” shortcut is pretty great, but it would still be useful to streamline clicking the button.

tbmueller · August 10, 2014, 2:57pm

I agree completely with Dellu: a more effective web clipper would for me be a top development priority, and I’d most like to see the equivalent of Evernote’s “Save as Article” or “Save Simplified Article,” which I find both effective and useful.

tbmueller · August 13, 2014, 2:00pm

Those in search of a way to import nicely-scraped web pages into DT might look here:
chainsawonatireswing.com/201 … evonthink/

Pretty cumbersome, but perhaps worth the effort…

jzents · August 29, 2014, 2:19pm

Bill_DeVille:

I almost never capture a full Web page, as I want to exclude irrelevant content. There are two major advantages. By excluding unwanted content, the file size of the capture will be smaller. More importantly, by excluding irrelevant text the efficiency of searches and of the AI assistants such as Classify and See Also will be improved.

Most of my captures are as rich text of a selected area of an HTML page. Tip: Select from the bottom up for fast selection of the desired content.

In Safari, DEVONagent Pro or DEVONthink’s browser, a Service shortcut, Command-) will capture the selection as rich text. Rich text preserves text formatting, links, images, tables and lists. In a few cases where preservation of the layout of images or tables may be important, I’ll use the Service keyboard shortcut Command-% to capture the selected area as WebArchive.

On some pages with lots of irrelevant images, capture as rich text of the desired content can reduce the size of the capture by two or even three orders of magnitude, compared to a full-page capture as PDF or WebArchive.

Note that these two Services to capture selected content do not work in Firefox or Google Chrome.

I found this information helpful, as I had not begun to master the abilities in DT for this purpose, and it is one of the purposes I have for it. One question: once I pull in a Web page as RTFD, I find I can highlight, but I cannot find a way to annotate. Am I missing something?

Bill_DeVille · September 2, 2014, 4:56pm

I assume you mean annotation in the sense of text notes that can be added to PDFs.

As an old curmudgeon who started working with document information systems long before Adobe introduced the Portable Document Format (PDF), I was delighted by the introduction of PDFs as a means of sharing documents across multiple computer platforms and operating systems. But from the beginning I considered Adobe’s text annotation notes primitive and unsatisfactory because they were limited to plain text and were not searchable. And of course they are not available for other filetypes.

When DEVONthink appeared in 2002 I used rich text notes to annotate documents of any filetype. Simply create a new rich text note, link it to the referenced document and start entering notes. Rich text can include formatted text, links, images, tables and lists, and these notes are searchable.

A few years ago the Annotation template was introduced for the scriptable editions of DEVONthink (Pro and Pro Office). Select the document to be annotated, press a keyboard shortcut for the menu command Data > New from Template > Annotation, and an Annotation rich text note is created for the referenced document. Links are provided both to and from the referenced document. This is a powerful and flexible means of annotating documents of all filetypes, and doesn’t vandalize the referenced document with ugly add-on graphics.

If my Annotation note for a PDF includes several excerpts of text from different pages of the PDF, I can copy/paste the Page Link for each excerpt to create a clickable link to the appropriate page. If I’m annotating a long text, HTML or WebArchive document and wish to have a means of returning to the location of an excerpt or comment in that document, I use a cue string that will allow a Lookup search to return to that location. (It’s easy to pick a cue string that, by means of an exact string search in DEVONthink, will be unique even in a database with tens of thousands of documents.)
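
The idea of picking a cue string that is unique under an exact string search can be sketched in a few lines. This is only an illustration under stated assumptions: the database is stood in for by a plain dictionary of document texts, whitespace is assumed to be single spaces, and the helper names `count_exact` and `pick_cue` are hypothetical, not DEVONthink functions.

```python
def count_exact(phrase, documents):
    """Count exact, case-sensitive occurrences of phrase across all documents."""
    return sum(text.count(phrase) for text in documents.values())


def pick_cue(excerpt, documents, min_words=2):
    """Return the shortest leading phrase of excerpt that occurs exactly once.

    The phrase grows one word at a time, starting at min_words, until an
    exact search over all documents finds it in a single place. Returns
    None if even the full excerpt is not unique (or does not occur at all).
    """
    words = excerpt.split()
    for n in range(min_words, len(words) + 1):
        candidate = " ".join(words[:n])
        if count_exact(candidate, documents) == 1:
            return candidate
    return None
```

For example, with one document containing “the quick brown fox jumps over the lazy dog” and another containing “the quick brown bear sleeps”, the excerpt “the quick brown fox jumps” has to grow to “the quick brown fox” before an exact search lands in one place only.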

This approach to note-taking allows very powerful networking of notes in a way that’s not possible with Adobe notes. From my note I can link to other notes and documents of any filetype. For example, if I’m annotating a document while exploring how a concept is handled in it, my note can link to other notes about variations on that concept.

Tip: When working with annotation I’ll open either a note or its referenced document as a new tab. If I’ve got a network of notes I’ll open other notes and documents by Control-clicking on their links to open them as new tabs as well. Tabbing allows me to associate documents and to jump around among them without losing the scrolling position in each of them. (I also use tabbing when investigating See Also suggestions.)