Thanks. Blasted through it, but been coding a while.
Good idea. Made me realise I could check the generated name too.
Thanks also for the info about embedded URLs, I’ll keep an eye out for that.
I’ve noticed some pretty crazy OCR issues that seem to stem from fonts not being embedded properly, but they can be fixed easily enough in DT with OCR > to searchable PDF.
The problem with OCR’ing, though, is that you also lose all the links. So far it seems to be an either/or approach, but I’d be interested to know if you’ve found a way to get the best of both worlds (I haven’t so far and am just accepting the text layer problems when they happen).
I played around with clipping in DT, thinking I could automate that, but then faced the usual banner/overlay issues.
If other issues come up, I’ll see what markdown has to offer. The downside being it could require some manual editing, eg, setting code block languages etc.
TBH I have found very few of those problems (having clipped >1500 PDFs so far). I think it’s mostly Medium that still shows banners, but for 90-95% of my clipped pages it works quite well. Have you enabled ‘block advertisements’? (Also: I’m using Pi-hole to block many of the advertisements at the network level, which might make a difference too.)
Another thought about capturing and fonts: if you’re going to capture through Safari anyway and are willing to accept a little less context, you can switch to Reader mode, which will probably prevent many of the font issues. Not every page has a Reader view available, though. An interesting aside is that capturing through Reader mode gives nicely paginated PDFs that are often easier to read on other devices.
Markdown for me offers a totally different experience and would mainly work for content that is text heavy, but doesn’t work for fully designed articles. But obviously use cases vary.
Having a quick play with clipping inside DT. The first thing I notice is that I can’t get consistent widths. Sounded like ‘PDF Clipping Width’ might help but doesn’t.
That can be solved (I have that issue too) - check out my approach here: Automatically capture and annotate items (to use with Obsidian) (basically using the same approach you’re using with setting the bounds on Safari windows, but then on the DT window)
Nice resource, I’ll have a dig through when I have a few more mins. Assume it’s something like opening in a new window, scripting the resize, and then clipping?
I’ve just opened this page though in DT to try and the rendering’s incorrect: notably dark background that should be light.
Something else I’m just thinking about. As it’s often problematic that PDFs have the wrong text layer through non-standard fonts, a crude approach to fixing that might be to inject JS/CSS like this: document.head.appendChild(document.createElement("style")).innerHTML="body { font-family: sans-serif !important; }" after loading the page. A small experiment showed that indeed most font rendering issues I know go away (but pages are not exactly as you might see them in Safari).
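For anyone who wants to tinker with that one-liner, here is a minimal sketch of the same idea split into two steps: build the CSS rule, then append it as a style element. The function names `buildFontOverrideCss` and `injectFontOverride` are made up for illustration, and it sets `textContent` rather than `innerHTML`, which is a substitution on my part. It only touches the page when a `document` object is present.

```javascript
// Build the CSS rule that forces one font family on the page body.
// The default mirrors the "sans-serif" override from the post above.
function buildFontOverrideCss(fontFamily = "sans-serif") {
  return `body { font-family: ${fontFamily} !important; }`;
}

// Append a <style> element carrying the override to the given document.
// Hypothetical helper: `doc` is any object with createElement and head.
function injectFontOverride(doc, fontFamily = "sans-serif") {
  const style = doc.createElement("style");
  style.textContent = buildFontOverrideCss(fontFamily);
  doc.head.appendChild(style);
  return style;
}

// Only inject when a DOM is available (a browser or web view).
if (typeof document !== "undefined") {
  injectFontOverride(document);
}
```

Passing a different family, e.g. `injectFontOverride(document, "Helvetica")`, should work the same way, though I'd expect results to vary per site depending on how specific its own font rules are.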
Latest version that should fix latest record and URL issue:
property currentUrl : null
property originalBounds : {}
property originalRecordCount : 0

on run
    tell application "Safari" to activate
    setUrl()
    setOriginalBounds()
    changeWidth()
    setOriginalRecordCount()
    exportAsPdf()
    restoreWidth()
    updateUrl()
end run

-- Poll until the UI element exists, giving up after 10 tries (about 5 seconds).
on waitFor(element)
    set i to 10
    repeat until exists element
        set i to i - 1
        if i = 0 then exit repeat
        delay 0.5 -- pause between checks instead of spinning through all tries at once
    end repeat
end waitFor

-- Remember the URL of the page being clipped.
on setUrl()
    tell application "Safari"
        set currentUrl to URL of current tab of window 1
    end tell
end setUrl

-- Remember the window size so it can be restored after the export.
on setOriginalBounds()
    try
        tell application "Safari"
            set originalBounds to bounds of the first window
        end tell
    on error number -1719
        display notification "Clip to DEVONthink: No open Safari windows!"
    end try
end setOriginalBounds

-- Bounds are {left, top, right, bottom}: make the window 1024 wide.
on changeWidth()
    tell application "Safari"
        copy originalBounds to newBounds
        set item 3 of newBounds to the (first item of originalBounds) + 1024
        set bounds of the first window to newBounds
    end tell
end changeWidth

-- Count today's records before exporting, to detect the new one later.
on setOriginalRecordCount()
    tell application id "DNtp"
        set originalRecordCount to count of (search "additionDate:Today")
    end tell
end setOriginalRecordCount

-- Drive Safari's "Export as PDF…" sheet and save into the DEVONthink Inbox.
on exportAsPdf()
    tell application "System Events"
        click menu item "Export as PDF…" of menu "File" of menu bar item "File" of menu bar 1 of application process "Safari"
        tell process "Safari"
            my waitFor(a reference to sheet 1 of window 1)
            tell sheet 1 of window 1
                if value of pop up button 1 is not "Inbox" then
                    -- Look for the Inbox in the sidebar first.
                    set inboxRow to null
                    repeat with aRow in row of outline 1 of scroll area 1 of splitter group 1
                        if name of first UI element of aRow starts with "Inbox" then
                            set inboxRow to aRow
                            select inboxRow
                            exit repeat
                        end if
                    end repeat
                    -- Fall back to "Go to Folder" when the sidebar has no Inbox entry.
                    if inboxRow = null then
                        keystroke "g" using {shift down, command down}
                        my waitFor(a reference to sheet 1)
                        keystroke "~/Library/Application Support/DEVONthink 3/Inbox"
                        keystroke return
                    end if
                end if
                click button "Save"
            end tell
        end tell
    end tell
end exportAsPdf

on restoreWidth()
    tell application "Safari"
        set bounds of the first window to originalBounds
    end tell
end restoreWidth

-- Wait for the new record to show up, then set its URL to the clipped page.
on updateUrl()
    tell application id "DNtp"
        repeat
            set todaysRecords to search "additionDate:Today"
            if (count of todaysRecords) > originalRecordCount then exit repeat
            delay 0.5 -- avoid hammering DEVONthink with searches while waiting
        end repeat
        -- Placeholder dated one minute ago, so any record added just now compares as newer.
        script nullRecord
            property addition date : (current date) - 60
        end script
        set latestRecord to nullRecord
        repeat with aRecord in todaysRecords
            if (get addition date of aRecord) > (get addition date of latestRecord) then
                set latestRecord to aRecord
            end if
        end repeat
        set URL of latestRecord to currentUrl
    end tell
end updateUrl
As mentioned, might be some other issues, but at least this gets around the overlay issues I’ve been having.
I’m testing whether this might be a good workaround for some clipping issues (e.g. when sites use hyphenation, which breaks words and prevents them from being found):
You execute the JavaScript by inserting it into a window (works for a DT window, as well as Safari windows). In DT:
set captureWindow to open window for record theRecord
do JavaScript "document.head.appendChild(document.createElement('style')).innerHTML='body, p, h1, h2, h3, h4, h5, h6 { font-family: sans-serif !important; hyphens: manual !important; -webkit-hyphens: manual !important; }'" in captureWindow
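If you want to experiment with other selectors or properties, here is a rough sketch of how that injected string could be assembled rather than typed by hand. `buildCss` and `buildInjectionSnippet` are hypothetical helper names, not part of DT or Safari, and the sketch assumes the CSS itself contains no quotes, so that the result can sit inside an AppleScript double-quoted string.

```javascript
// Turn a list of selectors and a property map into one CSS rule,
// marking every declaration as !important.
function buildCss(selectors, declarations) {
  const body = Object.entries(declarations)
    .map(([prop, value]) => `${prop}: ${value} !important;`)
    .join(" ");
  return `${selectors.join(", ")} { ${body} }`;
}

// Wrap the rule in the one-line style injection used in the post above.
// Only single quotes are used, so the result can be embedded in an
// AppleScript "..." string without escaping.
function buildInjectionSnippet(selectors, declarations) {
  const css = buildCss(selectors, declarations);
  if (css.includes("'") || css.includes('"')) {
    throw new Error("quotes in the CSS are not supported in this sketch");
  }
  return `document.head.appendChild(document.createElement('style')).innerHTML='${css}'`;
}

const snippet = buildInjectionSnippet(
  ["body", "p", "h1", "h2", "h3", "h4", "h5", "h6"],
  { "font-family": "sans-serif", "hyphens": "manual", "-webkit-hyphens": "manual" }
);
console.log(snippet);
```

The printed string is what would go between the double quotes after `do JavaScript`.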
Why not use an @media print rule? And if it’s about printing to PDF, Helvetica can be used instead of sans-serif. That’s guaranteed to be available on all PDF devices.
Bummer. Indeed, it doesn’t. Well, I think it should. That’s what @media print is all about, isn’t it? Also, if a website does already provide a print style sheet, that will not take effect either. Weird.
I got that. I just don’t think that it’s reasonable to do that. Given that the screen layout probably contains navigation and a lot of other things that just do not make sense in a PDF.
But then there are always those users who insist that the PDF “must look exactly like the web document” …
I’m probably one of those. Not religiously, but it is why I’m using PDF to capture resource material. All the links, comments etc. provide context when I revisit something later (e.g. when the website has disappeared). So for me that’s precisely the reason why I don’t capture clutter-free or “Reader”-type PDFs. It’s basically a screen capture with the added benefit of searchable text.