Clip to DEVONthink

Thanks. Blasted through it, but been coding a while.

Good idea. Made me realise I could check the generated name too.

Thanks also for the info about embedded URLs, I’ll keep an eye out for that.

I’ve noticed some pretty crazy OCR issues that seem to stem from fonts not being embedded properly, but they can be fixed easily enough in DT with OCR > to searchable PDF.

Indeed - see also this discussion: Wrong text (layer) when capture PDF from viewer window - #15 by mdbraber.

The problem with OCR’ing, though, is that you also lose all the links. So far it seems to be an either/or approach, but I’d be interested to know if you’ve found a way to get the best of both worlds (I haven’t so far and am just accepting the text layer problems when they happen).

I played around with clipping in DT, thinking I could automate that, but then faced the usual banner/overlay issues.

If other issues come up, I’ll see what markdown has to offer. The downside being it could require some manual editing, eg, setting code block languages etc.

TBH I have found very few of those problems (having clipped >1500 PDFs so far). I think it’s mostly Medium that still shows banners, but for 90-95% of my clipped pages it works quite well. Have you enabled ‘block advertisements’? (Also: I’m using PiHole to block many of the advertisements at the network level, which might make a difference too.)

Another thought about capturing and fonts: if you’re going to capture through Safari anyway and are willing to accept a little less context, you can switch to Reader mode, which will probably prevent many of the font issues. Not every page has a reader view available, though. An interesting aside is that capturing through Reader mode gives nicely paginated PDFs that are often easier to read on other devices.

Markdown for me offers a totally different experience and would mainly work for content that is text heavy, but doesn’t work for fully designed articles. But obviously use cases vary.

Thanks, I’ll take a look at that too.

Having a quick play with clipping inside DT. The first thing I notice is that I can’t get consistent widths. Sounded like ‘PDF Clipping Width’ might help but doesn’t.

That can be solved (I have that issue too) - check out my approach here: Automatically capture and annotate items (to use with Obsidian) (basically using the same approach you’re using with setting the bounds on Safari windows, but then on the DT window)

Nice resource, I’ll have a dig through when I have a few more mins. Assume it’s something like opening in a new window, scripting the resize, and then clipping?

I’ve just opened this page though in DT to try and the rendering’s incorrect: notably dark background that should be light.

Hmm - not with me. This is my automatically captured PDF

How to test React Components.pdf (194.3 KB)

Something else I’m just thinking about. As it’s often problematic that PDFs have the wrong text layer through non-standard fonts, a crude approach to fixing that might be to inject JS/CSS like this: document.head.appendChild(document.createElement("style")).innerHTML="body { font-family: sans-serif !important; }" after loading the page. A small experiment showed that indeed most font rendering issues I know go away (but pages are not exactly as you might see them in Safari).
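For anyone who wants to try this, here is a minimal sketch of that injection wrapped in a function. The name forceFontFamily and its parameters are my own, purely illustrative, and it uses textContent where the one-liner above uses innerHTML (either should work for a plain CSS rule).

```javascript
// Hypothetical helper: the name and parameters are illustrative, not from any library.
// Appends a <style> element that overrides the body font, as in the one-liner above.
function forceFontFamily(doc, family = "sans-serif") {
  const style = doc.createElement("style");
  // textContent stores the rule as plain text rather than parsing it as HTML.
  style.textContent = `body { font-family: ${family} !important; }`;
  doc.head.appendChild(style);
  return style;
}

// Only run the injection when a real page is available (e.g. in the browser console).
if (typeof document !== "undefined") {
  forceFontFamily(document);
}
```

Run it after the page has finished loading, otherwise there may be no head element to append to yet.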


Latest version that should fix latest record and URL issue:

property currentUrl : null
property originalBounds : {}
property originalRecordCount : 0

on run
	tell application "Safari" to activate
	
	setUrl()
	setOriginalBounds()
	changeWidth()
	setOriginalRecordCount()
	exportAsPdf()
	restoreWidth()
	updateUrl()
end run

on waitFor(element)
	-- Poll for the element, pausing between checks; give up after about 5 seconds
	set i to 10
	
	repeat until exists element
		set i to i - 1
		if i = 0 then exit repeat
		delay 0.5
	end repeat
end waitFor

on setUrl()
	tell application "Safari"
		set currentUrl to URL of current tab of window 1
	end tell
end setUrl

on setOriginalBounds()
	try
		tell application "Safari"
			set originalBounds to bounds of the first window
		end tell
	on error number -1719
		display notification "Clip to DEVONthink: No open Safari windows!"
	end try
end setOriginalBounds

on changeWidth()
	tell application "Safari"
		copy originalBounds to newBounds
		set item 3 of newBounds to the (first item of originalBounds) + 1024
		set bounds of the first window to newBounds
	end tell
end changeWidth

on setOriginalRecordCount()
	tell application id "DNtp"
		set originalRecordCount to count of (search "additionDate:Today")
	end tell
end setOriginalRecordCount

on exportAsPdf()
	tell application "System Events"
		click menu item "Export as PDF…" of menu "File" of menu bar item "File" of menu bar 1 of application process "Safari"
		
		tell process "Safari"
			my waitFor(a reference to sheet 1 of window 1)
			
			tell sheet 1 of window 1
				if value of pop up button 1 is not "Inbox" then
					set inboxRow to null
					
					repeat with aRow in row of outline 1 of scroll area 1 of splitter group 1
						if name of first UI element of aRow starts with "Inbox" then
							set inboxRow to aRow
							select inboxRow
							
							exit repeat
						end if
					end repeat
					
					if inboxRow = null then
						keystroke "g" using {shift down, command down}
						my waitFor(a reference to sheet 1)
						keystroke "~/Library/Application Support/DEVONthink 3/Inbox"
						keystroke return
					end if
				end if
				
				click button "Save"
			end tell
		end tell
	end tell
end exportAsPdf

on restoreWidth()
	tell application "Safari"
		set bounds of the first window to originalBounds
	end tell
end restoreWidth

on updateUrl()
	tell application id "DNtp"
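		-- Poll until the exported PDF shows up; give up after about 30 seconds
		set todaysRecords to {}
		repeat 60 times
			set todaysRecords to search "additionDate:Today"
			if (count of todaysRecords) > originalRecordCount then exit repeat
			delay 0.5
		end repeat
		if (count of todaysRecords) ≤ originalRecordCount then return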
		
		script nullRecord
			property addition date : (current date) - 60
		end script
		
		set latestRecord to nullRecord
		
		repeat with aRecord in todaysRecords
			if (get addition date of aRecord) > (get addition date of latestRecord) then
				set latestRecord to aRecord
			end if
		end repeat
		
		set URL of latestRecord to currentUrl
	end tell
end updateUrl

As mentioned, might be some other issues, but at least this gets around the overlay issues I’ve been having.

I’m testing whether this might be a good workaround for some clipping issues (e.g. when sites use hyphenation, which breaks words and prevents them from being found):

do JavaScript "document.head.appendChild(document.createElement('style')).innerHTML='body, p, h1, h2, h3, h4, h5, h6 { font-family: sans-serif !important; hyphens: manual !important; -webkit-hyphens: manual !important; }'" in captureWindow

but been coding a while.

That explains your unusual approach of using all the handlers in your script. :smiley:

How do you execute that while browsing? As a JavaScript bookmark of some sort?

You execute the JavaScript by inserting it into a window (works for a DT window, as well as Safari windows). In DT:

set captureWindow to open window for record theRecord
do JavaScript "document.head.appendChild(document.createElement('style')).innerHTML='body, p, h1, h2, h3, h4, h5, h6 { font-family: sans-serif !important; hyphens: manual !important; -webkit-hyphens: manual !important; }'" in captureWindow

I actually like that. Makes the main code a lot easier to read and understand.

Why not use an @media print rule? And if it’s about printing to PDF, Helvetica can be used instead of sans-serif. That’s guaranteed to be available on all PDF devices.

Agreed - it’s not common in AppleScript, but it does make the script more readable.

@media print does not apply to capturing PDF - but using Helvetica sounds like a good tip - thanks for that!
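A rough sketch of how the Helvetica tip might be folded into the override from earlier in the thread. The function name buildOverrideCss and the fallback to sans-serif are my own assumptions; I haven’t checked how this behaves on every page.

```javascript
// Hypothetical helper: builds the override rule used earlier in the thread,
// with Helvetica first and sans-serif as a fallback (the fallback is an assumption).
function buildOverrideCss(fontStack = ["Helvetica", "sans-serif"]) {
  const selectors = ["body", "p", "h1", "h2", "h3", "h4", "h5", "h6"];
  const declarations = [
    `font-family: ${fontStack.join(", ")} !important`,
    "hyphens: manual !important",
    "-webkit-hyphens: manual !important",
  ];
  return `${selectors.join(", ")} { ${declarations.join("; ")}; }`;
}

// Inject the rule only when a real page is available.
if (typeof document !== "undefined") {
  document.head.appendChild(document.createElement("style")).innerHTML = buildOverrideCss();
}
```

The generated rule contains no quote characters, so the resulting string can also be pasted into the single-quoted innerHTML assignment of the existing do JavaScript line.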

Bummer. Indeed, it doesn’t. Well, I think it should. That’s what @media print is all about, isn’t it? Also, if a website does already provide a print style sheet, that will not take effect either. Weird.

It would if you would use the option “Print to PDF” - but capturing basically just uses the screen layout and saves that to PDF.

I got that. I just don’t think that it’s reasonable to do that. Given that the screen layout probably contains navigation and a lot of other things that just do not make sense in a PDF.

But then there are always those users who insist that the PDF “must look exactly like the web document” …


I’m probably one of those :slight_smile: Not religiously, but it is why I’m using PDF to capture resource material. All the links, comments etc. provide context when I revisit something later (e.g. when the website has disappeared). That’s exactly why I don’t capture clutter-free or “Reader”-type PDFs. It’s basically a screen capture with the added benefit of searchable text :slight_smile: