Sorter - Import original PDF from Web

Hi,

Is it possible for the Sorter to detect that a PDF document is being displayed when capturing web content from Safari?
I would then like to import it as the original document. So far, DEVONthink seems to convert the displayed PDF.

regards

It does a conversion in what way?

A URL to test would be useful.

Whenever you have questions I assume that my request is probably not a normal behaviour :wink:
So I did some further testing, and it seems that the import setting “One Page” is the reason.
At first sight, the document does indeed seem to be the original one if I choose “Paginated”.
Good to know, but not obvious.
So am I right that this should then be a direct download of the PDF?

Whenever you have questions I assume that my request is probably not a normal behaviour

This is indeed sometimes the case. However, it’s usually a need for clarification, since there are so many possibilities. For example, when people say they can’t sync, that’s not enough information, since we support more than one sync method.

Choosing PDF (One Page), PDF (Paginated), or even Web Archive on some URLs, should yield the PDF intact.

What URL are you testing?

Jim, I know … it just feels bad if I can answer my own question myself after this.
But I always have to weigh whether I will end up spending a lot of time researching without finding a solution, or whether a little research of my own is enough.

However, I think this kind of feedback might give an idea of how some of the great features could be designed more clearly for non-power users.
Just telling you that because I hope you don’t think I ask stupid questions without thinking :sweat_smile:

Back to topic:
You can try this one https://www.wi.msm.uni-due.de/lehre/lehrveranstaltungen/sommersemester-18/biso-3747/download/BISO_SS_2018_1_auf_1.pdf/
It’s the same document I am using for my case “Remembering reading position”.

Just telling you that because I hope you don’t think I ask stupid questions without thinking :sweat_smile:

Nope. I just know it’s a very natural tendency to assume people know what we’re talking about, even when we haven’t provided enough detail to be understood.

Many years ago, I was texting @eboehnisch about something I was testing on my machine. After a barrage of texts, he simply said, “Jim, I am not looking at your machine and don’t know what you’re referring to.” And as I looked back at my texts, I quickly saw I made a ton of assumptions about what he would understand, even things that were only visible on my machine. From then on, I always strive to be more clear in what I write and encourage others to do the same.

I often use the analogy of support being similar to being a doctor. If a patient comes in and says, “I don’t feel well.”, that’s okay but it also doesn’t provide enough information to treat the person. If the person said, “My left pinky toe hurts when I walk. I did trip and stub it on the coffee table three days ago.”, then we’re really in a good position to help. :slight_smile:

PS: I clipped that PDF with the Web Archive option and it clipped paginated, just as the original.

I quite believe you. It’s great that you don’t lose your patience and always offer good support :smiley:

I have tested a little bit more.
There is, for example, the meta-information “geographical location”. It differs from that of the downloaded document.

So it seems either that DT really downloads the PDF or that DT creates a 1:1 clipping and takes over the meta-information… but adds some more.
On the other hand, DT marks the downloaded (Safari) and the clipped documents as duplicates. Also, the file name is different from the original.

It was just important to me to understand what happens under the hood of DT and how to proceed if I really need to make sure to have the original document.
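
One way to check outside DEVONthink whether a clipped file really is the original is to compare checksums of the clipped file and a direct download. The following is only a minimal sketch in Python, assuming both files have been exported to disk; the function names are hypothetical. It compares file bytes only, so any metadata that is stored outside the file itself would not show up in this comparison.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 64 KiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(a: Path, b: Path) -> bool:
    """True if both files have exactly the same bytes (hypothetical helper)."""
    return sha256_of(a) == sha256_of(b)
```

If the two digests match, the clipped file is byte-for-byte the same as the download; if they differ, something in the file was changed or regenerated during clipping.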

I’ve been saving them to the Inbox. Seems to work, though I don’t use DEVONthink’s OCR. Should I not do this?

I even have the Inbox as a shortcut in the Save As file chooser.

I quite believe you. It’s great that you don’t lose your patience and always offer good support :smiley:

Thank you for the nice compliment :slight_smile:

So it seems either that DT really downloads the PDF or that DT creates a 1:1 clipping and takes over the meta-information… but adds some more.

Yes, DEVONthink can add a geolocation for the current location when files are added to a database.

On the other hand, DT marks the downloaded (Safari) and the clipped documents as duplicates. Also, the file name is different from the original.

That is because the content is the same. The name doesn’t matter. If you want file size and file type to be considered in duplicate detection, enable Preferences > General > Stricter recognition of duplicates. However, the name is still not considered.

Here are four files with the same text content.

The two in the red section are duplicates, even though the names are different.
The green one is the same format but a different file size, so it’s not a match.
The blue one is a different format and size, so it also doesn’t match.
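
The rule described above can be illustrated with a small Python sketch. This is not DEVONthink's actual algorithm, only a toy model of the behaviour as explained in this thread: text content must match, names are ignored, and the stricter setting additionally compares file type and size. The `Doc` class, its fields, and `is_duplicate` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str   # ignored by the comparison, as described above
    kind: str   # file type, e.g. "pdf" or "rtf"
    size: int   # file size in bytes
    text: str   # extracted text content


def is_duplicate(a: Doc, b: Doc, strict: bool = False) -> bool:
    """Toy model: same text is a duplicate; strict mode also needs same type and size."""
    if a.text != b.text:
        return False
    if strict:
        return a.kind == b.kind and a.size == b.size
    return True


# Four illustrative documents with the same text content (values are made up).
red_1 = Doc("Report.pdf", "pdf", 1000, "same text")
red_2 = Doc("Another name.pdf", "pdf", 1000, "same text")
green = Doc("Report.pdf", "pdf", 800, "same text")
blue = Doc("Report.rtf", "rtf", 500, "same text")
```

In this model, `red_1` and `red_2` match in strict mode despite their different names, while `green` and `blue` only match `red_1` when `strict` is left off.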

If you are just scanning with no OCR operation from an external application, you can scan to DEVONthink. If there is any secondary operation, like OCR, you should not. In general, it’s just a good policy to not scan to the Global Inbox in the Finder.

But does that advice hold for PDFs on the web?

Ahh… apologies. I misunderstood the context of your question. Clipping into the Global Inbox is perfectly acceptable.

Thanx Jim, please help me clarify one last thing.

If I share the document in the future, I want to be aware that I am sharing personal information that was not included in the original document.
Geolocation was the one I found… but might there be additional things I am not thinking about right now?
BTW: If this clipping is effectively a PDF download, it would be nice to keep the original filename. The URL information is redundant and already included in the “url”.

You’re welcome.

That’s the only added bit of metadata as far as I am aware.

BTW: If this clipping is effectively a PDF download, it would be nice to keep the original filename. The URL information is redundant and already included in the “url”.

Can you clarify this?

If you clip, e.g., https://www.wi.msm.uni-due.de/lehre/lehrveranstaltungen/sommersemester-18/biso-3747/download/BISO_SS_2018_1_auf_1.pdf/.pdf this URL will also be the filename.
The actual filename is “BISO_SS_2018_1_auf_1.pdf”. At the very least, it could be the name of the website.
I know I can change it, but that’s an unnecessary step in my opinion. No big deal.
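
As an illustration of the naming being asked for here, the last path segment of the URL can be extracted with Python's standard library. This is only a sketch of the idea, not how DEVONthink names clipped items; `filename_from_url` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> Optional[str]:
    """Return the last non-empty path segment of a URL, or None if there is none."""
    # Take only the path part and decode percent-escapes such as %20.
    path = unquote(urlsplit(url).path)
    # Strip trailing slashes so ".../file.pdf/" still yields "file.pdf".
    name = PurePosixPath(path.rstrip("/")).name
    return name or None


url = (
    "https://www.wi.msm.uni-due.de/lehre/lehrveranstaltungen/"
    "sommersemester-18/biso-3747/download/BISO_SS_2018_1_auf_1.pdf/"
)
print(filename_from_url(url))  # prints "BISO_SS_2018_1_auf_1.pdf"
```

A URL without a path, such as a bare site address, gives `None`, in which case a page title would be the more sensible fallback.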

I just entered this URL manually in the Clip to DEVONthink tab of the Sorter, used the PDF (paginated) format and disabled the clutter-free option. But the clipped item was actually named “Business Intelligence: Strategie und Organisation Sommersemester 2018”. Which version of DEVONthink and which browser do you use?

I got the URL as name in macOS 10.15.3 and Safari…

Saving from Firefox (since it can’t be clipped) yields…

Did you try to enter the URL on your own in Clip to DEVONthink without using a browser? What’s the name then?

Entering the URL in the Sorter yields the title like you have.

May I ask why?