The case of web pages: which format is best for AI?

Hello,
I am researching a great deal on the internet these days and I'm importing lots of links via the Chrome/Firefox extensions. I usually import a basic URL; that's the quickest, lightest procedure. But I see that this approach produces stacks of URLs which are NOT analysed by the DEVONthink AI (I guess the algorithm applies at best to the URL string).

Should I systematically import a paginated PDF instead, in order to exploit the DEVONthink AI when sorting documents in a second stage of the process?
As you can see from the picture, clicking the URL does not trigger any result in the list/cloud/graph column on the right.
Thanks for your help and comments
A.

And if I had to use the paginated PDF, or any other option which stores more of the document's actual words, what would be the command to batch-process all the stored URLs? You can't see it from the pic, but I've stored loads. So I need an automatic solution to create a PDF database from those URLs.
Thanks again for your help

You can indeed use the See Also & Classify inspector for bookmarks if the preview pane is visible or after opening the bookmark in its own window.

Any format containing the text (PDF, formatted notes, plain/rich text, HTML, web archives, Markdown) is sufficient to make the text searchable and to fully support the concordance, classifying, etc. If the layout is important, then a single-page PDF is the best and most future-proof option, while paginated PDFs are better for printing. And the other formats are of course more suitable if you want not only to annotate but also to edit the contents.

There are several commands in the Data > Convert menu, e.g. to PDF (paginated) or to PDF (one page). More options are available in the menu Scripts > Download; these scripts support any item having a URL, not only bookmarks.
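
For a fully automatic solution, a small AppleScript is one way to batch-convert bookmarks you have already stored. The following is only a minimal sketch, assuming DEVONthink's `create PDF document from` command and its `pagination` parameter; select the bookmarks first, then run it from Script Editor or the Scripts menu:

```applescript
-- Sketch: batch-convert the selected bookmarks to paginated PDFs.
-- Assumes DEVONthink's "create PDF document from" command; adapt as needed.
tell application id "DNtp"
	repeat with theRecord in (selected records)
		if type of theRecord is bookmark then
			-- pagination true = paginated PDF; false = one long page
			create PDF document from (URL of theRecord) ¬
				name (name of theRecord) pagination true
		end if
	end repeat
end tell
```

Without an explicit `in` destination DEVONthink decides where the new PDFs go (typically the current group or inbox), so you may want to add an `in` parameter pointing at a specific group.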

Dear cgrunenberg, I am always most grateful for your efforts to support the user community. This post of yours has allowed me to solve the issue and to run further tests. I'll post the results here to provide some additional feedback for other users.
The test was carried out on a URL concerning a very prominent US photographer, Stephen Shore. The URL itself amounts to only a few bytes, and as I said in my initial post, that “container” holds too little information to be used statistically by the AI of DTP. So the classification is inaccurate (it proposes the “training” group).

The PDF and web archive options instead offer many more words for the AI to play with. Both yield the suggestion to classify the file in the “Photography (Art) Market” group. This suggestion is far more pertinent and far more coherent with the way I have classified documents so far. It's not perfect (as I have not established a real rule to organise all the files in this database), but it's a very good guess. So the statistics identify that there is already a logic (an identifiable cluster) in my database.

The difference between the two kinds of conversion (thanks again for solving that relatively easy problem) is that the web archive is 12 MB and the PDF 9 MB, so the latter would be better in terms of saving space. However, all the formatting is lost with the PDF in this case AND the video in the page disappears as well. So the web archive is, in the case of a complex/visual web resource, the best solution (at least in the contrived environment of this test) to exploit the DTP AI.

Many thanks again!
