Need tips on how to properly import non-pdf documents

bosie · December 26, 2011, 12:12pm

Hi,

my setup: devonthink pro, devonagent pro, chrome, chrome DT extension.
7 databases, 3 of which are used the most. i bought and read the DT ebook.

my problem: import process takes too long.

i browse mostly using chrome. when i come across something i want to import, i normally want to import specific content of the page and keep the layout and multimedia content.
this is absurdly difficult IMO.

content

i can clip the entire page as a webarchive, which seems to have massive disadvantages, among others:

throws off classification
most often huge filesizes (i clipped this page yesterday [1] which turned out to be 9MB)
throws off search
re-downloading the entire page/webarchive

if i then go in manually and remove all the unnecessary parts, it is time consuming and does not always work. besides, it normally does not even shrink the archive much.

importing a pdf is a nightmare too: i open the pdf in chrome and then what? do i seriously have to save it temporarily to the disk, drag it into dt and then delete the pdf?

opening the page in devonagent, selecting the things i want and importing them via “Add Selection to DT” is viable but it kills the layout. Attached as a screenshot. left side is the imported version, right is how the page looks in devonagent and it also lightly shows what i have selected

classification

i can’t seem to streamline my import process. clipping has a different import process than adding via devonagent. dragging the item from path finder onto the DT icon is also showing a different window. suboptimal.

another thing: regardless of the window, i can’t quickly select the group i want to put the item in. with almost 500 groups and 5 hierarchical levels, manually selecting simply does not work.

auto-classification: has never worked for me. 95 out of 100 documents will not get classified at all. what am i doing wrong here? 3 out of the 5 remaining documents will get wrongly classified

why do i classify? because search is not good enough. i have a group “backpacks” that contains around 25 or 30 backpacks. with search i get a list of 100 documents or so, and those 30 backpacks seem to be distributed evenly throughout it

any tips are highly appreciated, it is just too time consuming overall.

best,
bosie

[1] tripleaughtdesign.com/Equipm … T-Pack-EDC

Greg_Jones · December 26, 2011, 1:10pm

A couple of quick thoughts, mostly because I don’t have time now to post lengthy thoughts.

First, while I really like a lot of the features of Chrome, I find that it makes a sub-optimal browser to capture data to DEVONthink. The lack of rich text capture in Chrome is perhaps the only thing that keeps me using Safari. However, if you add DEVONthink’s Global Inbox (from DEVONthink’s Install Add-Ons menu) to the Finder Sidebar, you can save PDFs directly from Chrome to the DEVONthink Global Inbox.

On the auto-classification issue, do you mix both documents and sub-groups in your groups? Classification will work much better if groups contain either groups or documents, but not both. Groups that contain sub-groups should have classification turned off in the info panel, but you’ll need to check that newly added sub-groups meant for documents have classification turned on. New sub-groups inherit the properties of the containing group, so one can unknowingly create groups that are not enabled for classification.
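
To make that rule concrete, here is a minimal Python sketch of the check. The `Group` class and its `classify_enabled` flag are made-up stand-ins for a group tree, not DEVONthink's scripting interface; the sketch only illustrates the logic of flagging groups that mix documents and sub-groups, groups with sub-groups that still have classification on, and document groups that have it off.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    """Hypothetical stand-in for a group; not a DEVONthink API."""
    name: str
    documents: List[str] = field(default_factory=list)
    subgroups: List["Group"] = field(default_factory=list)
    classify_enabled: bool = True


def audit(group: Group, path: str = "") -> List[str]:
    """Return warnings for this group and everything below it."""
    full = f"{path}/{group.name}"
    warnings = []
    if group.subgroups and group.documents:
        warnings.append(f"{full}: mixes documents and sub-groups")
    if group.subgroups and group.classify_enabled:
        warnings.append(f"{full}: has sub-groups, classification should be off")
    if not group.subgroups and not group.classify_enabled:
        warnings.append(f"{full}: document group with classification off")
    # Recurse so that every level of the hierarchy is checked.
    for sub in group.subgroups:
        warnings.extend(audit(sub, full))
    return warnings
```

Running `audit` on the top-level group of a tree built this way returns one line per problem, so a group that inherited "classification off" from its parent shows up immediately.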

bosie · December 26, 2011, 1:16pm

first, thanks for the answer, although it isn’t lengthy

"... to the Finder Sidebar, you can save PDFs directly from Chrome to the DEVONthink Global Inbox."

I am not using Global Inbox as I find it takes even longer to correctly classify with it. I would still have to go to DT and correctly classify, no?

"lack of rich text capture in Chrome"

The screenshot is using Devonagent (and i doubt any other import mechanism is better than DA?) and you can see for yourself. Devonagent does a horrible job.

groups

No. Only leafs contain documents.

korm · December 26, 2011, 1:40pm

If I read correctly, @bosie wants to end up with searchable clips of portions of web pages and documents. A quick way to do that is using command-shift-4 to clip a portion of the browser page to a file. OCR the file in DTPO. Yes, you didn’t say you have DTPO - but it can be had - or use PDFpenPro or anything else that will OCR images - Acrobat doesn’t OCR images well. And then you have searchable PDF clips of pages. All of which can be sped up by batching steps of the process, and/or by using a folder action with Automator actions to OCR whatever is dropped on the folder, index the folder, and so on.

OS X default is to put screenshots on the desktop - utilities like savescreenie (free) will let you configure the default saving location as well as the format. High-quality .jpgs work better than medium quality .pngs. If you use Skitch set your capture format to high quality .jpg.
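
The batching idea could be sketched roughly like this in Python: sweep screenshots from wherever they land into one folder that an OCR tool or folder action watches. The function name and any folder paths are placeholders made up for illustration, and the OCR step itself is left to whatever tool you use.

```python
import shutil
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sweep_screenshots(source: Path, destination: Path) -> List[Path]:
    """Move image files from source into destination; return the new paths."""
    destination.mkdir(parents=True, exist_ok=True)
    moved = []
    for item in sorted(source.iterdir()):
        if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES:
            target = destination / item.name
            # Avoid overwriting an earlier clip with the same file name.
            counter = 1
            while target.exists():
                target = destination / f"{item.stem}-{counter}{item.suffix}"
                counter += 1
            shutil.move(str(item), str(target))
            moved.append(target)
    return moved
```

For example, `sweep_screenshots(Path.home() / "Desktop", Path.home() / "Documents" / "to-ocr")` would queue everything on the desktop in one go; both paths are assumptions, so adjust them to your own setup.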

bosie · December 26, 2011, 2:39pm

@korm
interesting approach. I don’t have DTPO but ABBYY. There are a few problems i see with this approach though, right off the bat.
a) printing goes out the window
b) getting DT to recognize the URL?
c) most of my content is not visible on a single screen
d) content won’t scale anymore
e) at least with ABBYY and Skitch on the maximum quality, the resulting scan is still quite ugly

korm · December 26, 2011, 2:44pm

Not sure what that means - you can print the clippings, no?

True that. The screen print approach is quick and dirty (emphasis on dirty).

bosie · December 26, 2011, 2:49pm

printing

I can print the clipping but not when it is scanned. quality IMO degrades to a level where printing is impossible, and the result no longer being regular text hinders printing too.

bosie · December 27, 2011, 2:56pm

as for the chrome PDF import problem.
printing to DT works, i can whip up a nice solution via Keyboard Maestro. The other points stand, especially importing websites

Bill_DeVille · December 27, 2011, 9:20pm

I agree with Greg Jones that Chrome is suboptimal for capturing data from the Web to DEVONthink.

My favorite capture modes are the Services that capture a selected area of a Web page as rich text (keyboard shortcut is ‘Command-)’) or as WebArchive (keyboard shortcut is ‘Command-%’). Capture of just the portion of a page that contains the desired article improves the efficiency of the AI assistants such as Classify and See Also by eliminating extraneous content, and usually also significantly reduces the storage space by comparison to capture of the full page.

As bosie indicated a preference for maintaining the formatting/layout of captured page elements, I would suggest capture of a selected area of the page using the keyboard shortcut ‘Command-%’ to capture as WebArchive.

These Services work properly in Safari and DEVONagent, but not in Chrome or Firefox.

I usually have DEVONthink’s Preferences > Import - Destination checked for ‘Select group’, which allows one to send output to any location in any open database.

bosie · December 27, 2011, 9:32pm

Bill, thanks for your reply. As I said, i do have devonagent but are you sure DA is actually working properly?
i tried both your approaches from devonagent (rich text and WebArchive) and neither actually worked because the layout is not kept. as far as i can tell, at least the background and padding/margins are not properly working.

as for the import. i have the same setting activated. the problem i have with it is the speed with which i can select a group. it probably takes 30 seconds or more to select the right group (5 hierarchical layers, 500 groups) plus i need to switch to the mouse.

Bill_DeVille · December 27, 2011, 11:52pm

It’s probable that the code for things like page background was not included in the area of the page that you selected for capture. I usually capture as rich text, because I’m only interested in the text, links, tables and images included in an article that I want in my database, so I don’t care about layout niceties. In some cases, layout becomes a bit more important and that’s when WebArchive is better. For example, if I capture a thread in the user forum that includes scrollable boxes that contain script code or images, WebArchive handles that well. But I don’t expect a section of a page to include background, etc. that was set outside my selection, nor would that be important to me in almost all circumstances, if I’ve got my article content.

I’ve also got hundreds of groups in some of my databases, though in most cases they don’t go more than 2 or 3 levels deep. I just did some tests of time taken to file items using the HUD presented by the ‘Set groups’ option and most were filed in under 5 seconds - but I’m not averse to mousing. Sometimes, though, it takes considerable time to decide on an appropriate location (thinking can’t always be speeded up, even by a pot of coffee).

If you’ve got items to split among several open databases, you might find it quicker to set the destination choice either to the Global Inbox or to the Inbox of the frontmost database. In the latter case, if the database has a good deal of content already the Classify AI assistant might be helpful in filing - although that requires mousing, too.

Another approach that requires mousing but can be fast if you are selecting multiple items for filing to the same location is use of the Groups & Tags panel. That’s what I usually use when I’m tackling a large batch of unfiled documents. It’s quick if I can select multiple items to go to the same database/group by the title of the documents.

bosie · December 28, 2011, 11:56am

bill,

thanks for your reply.

@import webarchive selection
i think you are right about the selection. i played a bit with it. anything set outside the selection is dismissed. the layout otherwise is actually kept.
not a big problem i found but on a few sites, the text is not black. sometimes even white (with a black background). if the background is not imported, i can’t see anything
on regular articles i use instagram but if all i want is a selection (or on a site, on which instagram does not work) it is a pity.

@import with “set groups”
ok, 30 seconds was exaggerating. i timed my imports yesterday. around 15 seconds. maybe it’s the complete lack of coffee…
but i always have all my databases expanded. it would probably take 1 second if the mask had a search function
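
A rough sketch of what such a search function could do, in Python: type a few letters and the 500 groups narrow down to a handful. The group paths and the `filter_groups` name are invented for illustration; nothing here reflects how DEVONthink is implemented.

```python
from typing import List


def filter_groups(paths: List[str], query: str) -> List[str]:
    """Return group paths that contain every word of the query.

    Matching is case-insensitive; shorter paths are listed first.
    """
    words = query.lower().split()
    hits = [p for p in paths if all(w in p.lower() for w in words)]
    return sorted(hits, key=lambda p: (len(p), p))


if __name__ == "__main__":
    groups = [
        "Gear/Bags/Backpacks",
        "Gear/Bags/Messenger",
        "Gear/Clothing/Jackets",
        "Travel/Packing lists",
    ]
    print(filter_groups(groups, "back"))      # ['Gear/Bags/Backpacks']
    print(filter_groups(groups, "gear bag"))  # the two Gear/Bags groups
```

Matching on the full path rather than only the group name means a query like "gear bag" can use the parent levels to disambiguate, which matters with 5 hierarchical levels.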

@global inbox
but classify AI does not work for me, it never assigns a document to a group (if you mean ‘Auto Classify’ by it)

@groups and tags:
you put everything in the global inbox, then once a day or so you go to DT and toggle the Groups & Tags panel? since i don’t know if i am going to import a lot in the next few minutes, i guess i would have to establish a daily grouping routine.

@grouping
on a different note: sometimes i wonder why i bother with groups at all. DT does not seem to take my groups into account when retrieving my data. i could simply use tags instead and hope DT improves the tags views (which i guess won’t happen any time soon or never at all)

thanks

bosie · December 28, 2011, 10:07pm

bill,

are you using the “Group Items” feature? if so, could you elaborate on how/when you are using it and why you chose it over other grouping features in those situations?

pretty please?

thanks in advance,
bosie