ePub and Copy with Source Link

suavito · February 15, 2024, 4:55pm

My scenario: I had been reading newspapers on a Kindle since its introduction to the German market in 2011. Last year Amazon ditched newspapers altogether, and since then I have purchased newspapers as ePub directly from the publishers. Since ePubs can be read and copied from in DEVONthink, I am thinking about converting all Kindle newspapers to ePub too and adding them to a dedicated database.

PDF would be an alternative, but the PDF file size is bigger than ePub, which does matter when handling over a decade of newspapers. I am still undecided and playing around with a small number of test files. Any suggestions are welcome.

Now to my specific question: When I open a newspaper article in Best Alternative view mode via the ePub’s table of contents, select some text, and use Copy with Source Link, I get something like this:

(file:///Users/suavito/databases/DEVONthink/Test%20Database%201.dtBase2/TemporaryItems/Su%CC%88ddeutsche%20Zeitung%2029.9.2023%20(Ausgabe%20fu%CC%88r%20Kindle)/index_split_020.html)

When I just Copy Item Link I get

[Süddeutsche Zeitung 29.9.2023 \(Ausgabe für Kindle\)](x-devonthink-item://E7381809-6AB7-4788-8D71-F6DBDCA5C280)

The latter link works but only points at the ePub as a whole, while the former link, due to its index_split_020.html part, should be more precise but in fact does not work at all. This is most certainly because the Best Alternative ePub is only a temporary file (as indicated by its TemporaryItems folder), existing only while it is displayed.
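To see what the non-working link actually points at, it helps to decode it. The following is a minimal sketch in plain JavaScript (runnable in Node, not a DEVONthink script). The helper name `describeSourceLink` is made up for illustration, and the URL is the one quoted above.

```javascript
// Decode a file:// URL and report whether it lives in a TemporaryItems folder.
// describeSourceLink is a hypothetical helper, not part of DEVONthink.
function describeSourceLink(fileUrl) {
  const url = new URL(fileUrl);
  // Percent-decoding yields the raw path. The name is stored in decomposed
  // form here (u + combining diaeresis), so normalize to NFC for display.
  const path = decodeURIComponent(url.pathname).normalize("NFC");
  const segments = path.split("/").filter(Boolean);
  return {
    path,
    isTemporary: segments.includes("TemporaryItems"),
    articleFile: segments[segments.length - 1],
  };
}

const info = describeSourceLink(
  "file:///Users/suavito/databases/DEVONthink/Test%20Database%201.dtBase2/TemporaryItems/Su%CC%88ddeutsche%20Zeitung%2029.9.2023%20(Ausgabe%20fu%CC%88r%20Kindle)/index_split_020.html"
);
console.log(info.isTemporary, info.articleFile);
```

The decoded path confirms that the link targets an extracted HTML file inside TemporaryItems rather than the ePub record itself.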

A search action shows results as Text Alternative ePubs, in which Copy with Source Link works like Copy Item Link (plus the selected text, of course), pointing to the ePub as a whole and not to a single text in it. Text Alternative view, though, works comfortably only with searches because it has no table of contents.

So my question is: Does Copy with Source Link need a fix for ePubs or can there be an additional copy action called Copy with Item Link?

rmschne · February 15, 2024, 5:01pm

I didn’t go through your questions in great detail, as it’s not an area I have paid much attention to. However, what occurs to me is to suggest that you take the “big” PDFs you might create or have, and then shrink them. I use the third-party tool “PDF Squeezer” and have had impressive results using its “strong” and “medium” compression profiles without unacceptable loss of quality.

suavito · February 15, 2024, 5:02pm

Bonus question: Christian, could you make the format for the source customizable?

For example, I prefer to have the source in a (Markdown) footnote like [^ref] … [^ref]:.
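As a rough illustration of the requested output format (not an existing DEVONthink option), a footnote-style source could be assembled like this. The function name `formatAsFootnote` and the default label `ref` are assumptions made for the example.

```javascript
// Format a quotation with its source as a Markdown footnote.
// formatAsFootnote and the default label "ref" are illustrative assumptions.
function formatAsFootnote(quote, title, link, label = "ref") {
  const reference = `[^${label}]`;
  // Escape square brackets so the title cannot break the link syntax
  const safeTitle = title.replace(/[\[\]]/g, "\\$&");
  return `${quote}${reference}\n\n${reference}: [${safeTitle}](${link})`;
}

console.log(
  formatAsFootnote(
    "Some selected text.",
    "Süddeutsche Zeitung 29.9.2023",
    "x-devonthink-item://E7381809-6AB7-4788-8D71-F6DBDCA5C280"
  )
);
```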

suavito · February 15, 2024, 5:15pm

Thanks for the suggestion. Does PDF Squeezer batch compress? That would be of extreme importance of course.

I have PDF Expert, which allows compressing too. In the case of the newspaper PDF, the original size is 3,4 MB, and compressed in PDF Expert to lowest quality it is 3,2 MB, which isn’t worth the effort at all. That is not surprising, because the Kindle newspapers consisted mostly of text and very few images (you could get them without extra fee via mobile net).

By comparison, the ePub file of the same newspaper has 1,3 MB. And the PDF, generated by Calibre, does not even have a ToC.

chrillek · February 15, 2024, 5:20pm

As they have a CLI: yes, that should be possible.

From my (very superficial) point of view, the main differences between ePub and PDF are

PDF has standardized “bookmark” functionality (aka “annotation”), ePub doesn’t
ePub does reflow naturally, PDF doesn’t (at least not really).

Perhaps because the ePub doesn’t lend itself easily to ToC generation? Did you read this explanation on Calibre and ToCs?

rmschne · February 15, 2024, 5:26pm

Yes.

What I do is put the PDF into DEVONthink from wherever it originates. Normally, the “Today” view shows all my work for the day. At some point I batch convert by selecting them all, dragging them into the PDF Squeezer window, and letting it rip. Then I “save” them back into place in DEVONthink. I’ve also gone back and compressed some older PDFs from before I had this tool, and I’ve done a few hundred at a time.

For the typical web article with a lot of text and one oversized photo at the top … with “strong” I’ll get 90% or greater compression, mostly on the image, which I don’t care about if it comes out fuzzy. “Medium” still squeezes significantly but leaves the images not fuzzy. I’m not very picky.

They also provide what they call “automation” tools, but I’ve not gone there due to an odd error message that occurs when I press the “grant access” button and so far the developer hasn’t gotten back to me in reply to my query. That apparent lack of support is my only reservation about this tool. There are surely other tools like this available.

Their downloaded version might have a test period. Dunno. I got my copy from the Apple app store.

cgrunenberg · February 15, 2024, 6:01pm

EPUB is currently not supported, the next release should fix this.

suavito · February 15, 2024, 6:13pm

That’s great news.

rfog · February 16, 2024, 9:12am

In the meantime, you can follow my instructions to get very readable and beautiful PDFs from non-DRM (aka non-BUG) ePubs with Calibre (you must translate the page if you cannot read Spanish):

And there, my latest footer/header for Calibre:

suavito · February 18, 2024, 11:19am

Thank you very much, @rfog! There are experts in these forums for nearly everything, sometimes I tend to forget that.

At the moment I have hit a brick wall, because Calibre uses the local time zone for displaying the publishing date in its main window (which is the correct date) but GMT time in export. Since I am not in Amazon’s time zone this means instead of having the publishing dates in the names of the exported files I get the publishing dates minus one day. Which is fatal for daily newspapers, of course.
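The off-by-one-day effect can be reproduced outside Calibre. This is a minimal sketch, assuming the issue date is stored as midnight local time (how Calibre actually stores it is not confirmed here); `dateInZone` is a hypothetical helper.

```javascript
// Show how the same instant falls on different calendar days
// depending on the time zone used for formatting.
function dateInZone(instant, timeZone) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).formatToParts(instant);
  const get = (type) => parts.find((p) => p.type === type).value;
  return `${get("year")}-${get("month")}-${get("day")}`;
}

// Assumption: the issue date is midnight in Central European Summer Time.
const published = new Date("2023-09-29T00:00:00+02:00");
console.log(dateInZone(published, "Europe/Berlin")); // local calendar day
console.log(dateInZone(published, "UTC")); // calendar day in GMT/UTC
```

Under that assumption the local day is 29 September while the GMT day is 28 September, which matches the "minus one day" seen in the exported file names.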

I’m glad I had discovered this before converting and exporting 3,5k of newspapers.

suavito · February 18, 2024, 12:11pm

Apart from that, I am almost certain I will keep the newspapers as ePubs and not PDF, and here is why:

While the old .pobi/mobi files have to get converted somehow, the present day newspapers are ePub already and will not require any conversions.

PDF, unlike ePub, allows annotations, true. But it does not make any sense to annotate directly in a newspaper:

The newspaper database will only be open when I want to do a dedicated search in it. If it was always open the newspapers with their wide array of subjects would obfuscate my search results. If it is closed the annotations in it would not be available.

Unless I copied them to my working database, of course. But that does not make any sense, because I would not quote something like “As The Guardian said in its New Year’s Eve 2013 issue…”. What I will quote is “Author, Article Title, in: The Guardian (ePub/Kindle issue), 31.12.2013”.

The PDF has pages, but these are pages generated by me and therefore it would be absolutely useless to add them to the source when I quote from it.

What does make a lot of sense, on the other hand, is to convert only the articles I find useful to PDFs and move them into my working database. This wouldn’t be any different from what I had been doing as long as Amazon sold newspapers for the Kindle: the .pobi format allowed saving single articles on the Kindle (in plain text). At that point in the workflow I will get all the PDF benefits like annotations, a platform- and software-independent format, etc.

And then there is another thing that pushes me to ePub: In Best Alternative, i.e. rendered, view, DT displays only one article at a time, while the PDF is one big document. With an ePub, a simple cmd-a (or a script) would highlight the whole article I want, while in a PDF this has to be done manually.

I started playing with different conversion options in DEVONthink. At first I thought I would go for Copy → Markdown → PDF, because the intermediary Markdown would allow me to strip defunct links to “Previous article” and “Category overview” from the top of the article and the corresponding ones at the bottom. I could also add the source, the newspaper issue, somewhere.

But although all newspapers have these kinds of links, they all handle them differently and it would be difficult to have them removed by script. Plus, the Markdown files are not always good-looking. One newspaper even has subheadings that say “Subheading starts here” and “Subheading ends here” with the actual text of the subheadings in between them!
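For a single newspaper whose link labels are known, the stripping step could still be sketched roughly as follows. The label list is an assumption ("Next article" in particular is a guess at the bottom links), and every newspaper would need its own list and probably its own separators.

```javascript
// Remove leading and trailing navigation lines from an article's Markdown.
// NAV_LABELS is an assumed list; every newspaper would need its own.
const NAV_LABELS = ["Previous article", "Next article", "Category overview"];

function isNavLine(line, labels = NAV_LABELS) {
  const text = line.trim();
  if (text === "") return true; // blank lines around the navigation block
  // Reduce Markdown links [label](target) to their labels
  const withoutLinks = text.replace(/\[([^\]]*)\]\([^)]*\)/g, "$1");
  // Split on common separators; every piece must be a known label
  const pieces = withoutLinks.split(/[|·]/).map((p) => p.trim()).filter(Boolean);
  return pieces.length > 0 && pieces.every((p) => labels.includes(p));
}

function stripNavigation(markdown, labels = NAV_LABELS) {
  const lines = markdown.split("\n");
  let start = 0;
  let end = lines.length;
  // Trim navigation and blank lines from the top, then from the bottom
  while (start < end && isNavLine(lines[start], labels)) start++;
  while (end > start && isNavLine(lines[end - 1], labels)) end--;
  return lines.slice(start, end).join("\n");
}
```

Only the top and bottom of the article are touched, so a body paragraph that happens to mention one of the labels is left alone.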

Printing from the rendered ePub articles mostly looks very fine. The old Kindle files don’t look that good yet. This is due to the conversion in Calibre, which I will have to optimize.

And then there is a feature of DEVONthink I had not used before—the imprinter. I will use it to put the source directly on the article PDF so I have a simple way of quoting properly in other apps like Scrivener.
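The text to imprint could follow the citation pattern mentioned earlier in this post. This is only a sketch of the string assembly, not of the imprinter itself; `formatSource` and its field names are hypothetical.

```javascript
// Build a source line in the pattern
// "Author, Article Title, in: Newspaper (edition), D.M.YYYY".
// formatSource and its field names are hypothetical.
function formatSource({ author, title, newspaper, edition, isoDate }) {
  // Split the ISO date by hand to avoid any time zone conversion
  const [year, month, day] = isoDate.split("-").map(Number);
  return `${author}, ${title}, in: ${newspaper} (${edition}), ${day}.${month}.${year}`;
}

console.log(
  formatSource({
    author: "Author",
    title: "Article Title",
    newspaper: "The Guardian",
    edition: "ePub/Kindle issue",
    isoDate: "2013-12-31",
  })
);
```

Parsing the date as plain text rather than through a `Date` object sidesteps the kind of time zone shift described in the earlier post about Calibre exports.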

I have not yet decided about the format of my source metadata. Just text or item link? And I want a script that does all in one: Print the item from the active editor window to PDF, imprint its source on it, move the PDF into the Global Inbox.

Sounds like a plan.

suavito · February 18, 2024, 12:29pm

All of this leads to follow-up questions for @cgrunenberg:

Would it be possible that a search shows ePubs in Best Alternative view and not in Text View? Maybe as a hidden preference?
When copying from an ePub in Best Alternative view into a Markdown file all links are put into {++ … ++}. Is that intended or a bug?
Could you give us “Only on the last page” as another option for the imprint?
Like I mentioned above, I want to introduce a custom metadata called Source or Reference or something like that. I have been planning to do so for a while now, independent from the newspaper database. The idea was to have one place to cite from. It would contain Bookends citations, or URLs (copied by a Smart Rule from the URL field if not empty), and DT item names. Or item links, and that’s the question: If I used Item Link as the field type would it be a problem if I added non item link references? Or should I better use Text as the field type—that would work with all of them, but the links would not be clickable. Or is the whole idea of one source field for all types just bad?

meowky · February 18, 2024, 3:50pm

You can try using item link with a search parameter for the selected text. e.g. x-devonthink-item://D633F2C7-1F40-4749-8BC0-000000000000?search=this%20is%20an%20example points to the first occurrence of the string this is an example in the document. You can write a script to generate this link with a keyboard shortcut.

Obviously there would be issues when the selected text appears more than once. This method should work most of the time, however, if you always select full sentences.
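The ambiguity could be checked before building the link. The following is a minimal sketch in plain JavaScript, assuming the document's plain text is available as a string. The function names are made up, and the comparison is case-sensitive, which may not match how DEVONthink itself searches.

```javascript
// Count non-overlapping occurrences of a selection in a document's text.
function countOccurrences(documentText, selection) {
  if (selection === "") return 0;
  let count = 0;
  let index = documentText.indexOf(selection);
  while (index !== -1) {
    count++;
    index = documentText.indexOf(selection, index + selection.length);
  }
  return count;
}

// Build the link only when the selection is unambiguous; otherwise return null.
function buildSearchLink(uuid, documentText, selection) {
  if (countOccurrences(documentText, selection) !== 1) return null;
  return `x-devonthink-item://${uuid}?search=${encodeURIComponent(selection)}`;
}

const text = "First sentence. this is an example sentence. Last sentence.";
console.log(buildSearchLink("D633F2C7-1F40-4749-8BC0-000000000000", text, "this is an example"));
console.log(buildSearchLink("D633F2C7-1F40-4749-8BC0-000000000000", text, "sentence"));
```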

cgrunenberg · February 19, 2024, 6:24am

This is currently intentional as the Search inspector can only list (and jump to) all occurrences in this view.

Works fine over here both in dark and light mode, might depend on the EPUB’s styling. E.g. does the copied text have a background color?

I’ll forward this.

Item links should be indeed only item links, otherwise just use a generic URL field.
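Following that advice, a mixed set of references could be sorted into field types before it is stored. This is a rough sketch only: `classifyReference` is a hypothetical helper, and the prefix check is a simplification that does not validate the rest of the link.

```javascript
// Decide which kind of custom metadata field a reference would fit.
// classifyReference is a hypothetical helper, not a DEVONthink API.
function classifyReference(reference) {
  const value = reference.trim();
  // Item links share this scheme prefix
  if (value.toLowerCase().startsWith("x-devonthink-item://")) return "item-link";
  // Anything else of the form scheme://... counts as a generic URL
  if (/^[a-z][a-z0-9+.-]*:\/\/\S+$/i.test(value)) return "url";
  // Citations and item names remain plain text
  return "text";
}

console.log(classifyReference("x-devonthink-item://E7381809-6AB7-4788-8D71-F6DBDCA5C280"));
console.log(classifyReference("https://example.com/article"));
console.log(classifyReference("Author, Article Title, in: The Guardian (ePub/Kindle issue), 31.12.2013"));
```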

meowky · February 20, 2024, 11:02am

Here is the script to generate such a link. Hats off to @chrillek for guidance.

use framework "Foundation"
use scripting additions

tell application id "DNtp"
	-- Get item link of the document you are currently viewing
	set theRecord to the content record of front think window
	set theLink to "x-devonthink-item://" & (uuid of theRecord)
	
	-- URL encode the text selected by you
	set theSelectedText to the selected text of front think window
	set theParameter to ("?search=" & my encodeText(theSelectedText))
	
	-- Generate the "source link"
	set the clipboard to (theLink & theParameter)
end tell

-- Handler for encoding text. From https://developer.apple.com/library/archive/documentation/LanguagesUtilities/Conceptual/MacAutomationScriptingGuide/EncodeandDecodeText.html

on encodeText(theText)
	set theString to stringWithString_(theText) of NSString of current application
	set theEncoding to NSUTF8StringEncoding of current application
	set theAdjustedString to stringByAddingPercentEscapesUsingEncoding_(theEncoding) of theString
	return (theAdjustedString as string)
end encodeText

-- This code has been revised to address a problem pointed out by @chrillek.

chrillek · February 20, 2024, 11:34am

Please see my post in the other thread re “URL encoding characters”. The handler you’re using will not work for non-Latin1 encodings. Given that the Net is now mostly Unicode, it’ll fail with a lot of input (notably the diverse Asian scripts).

If you can live with JavaScript, I suggest this script, closely modelled after yours. Just a bit shorter and working correctly with Unicode selections.

(() => {
	const app = Application("DEVONthink 3");
	app.includeStandardAdditions = true;
	const frontWindow = app.thinkWindows[0];
	const record = frontWindow.contentRecord;
	const itemLink = `x-devonthink-item://${record.uuid()}`;
	const selectedText = frontWindow.selectedText();
	const searchParameter = `?search=${encodeURIComponent(selectedText)}`;
	app.setTheClipboardTo(itemLink + searchParameter);
})()

Alternatively, one could use NSString methods and the AppleScript-ObjC bridge to do the conversion correctly. Which also will result in a shorter script.
Found that on StackExchange:

use framework "Foundation"
use scripting additions

on urlEncode(input)
    tell current application's NSString to set rawUrl to stringWithString_(input)
    set theEncodedURL to rawUrl's stringByAddingPercentEscapesUsingEncoding:4 -- 4 is NSUTF8StringEncoding
    return theEncodedURL as Unicode text
end urlEncode
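As a rough illustration of why an encoder bound to single-byte Latin-1 falls short for Unicode input, compare it with UTF-8 percent-encoding. The function `latin1PercentEncode` is written for this comparison only and is not a reproduction of the handler from Apple's guide.

```javascript
// Percent-encode text one code point at a time, as a Latin-1-only encoder might.
// Returns null when a character does not fit into a single byte.
function latin1PercentEncode(text) {
  let out = "";
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code > 0xff) return null; // not representable in Latin-1
    out += /[A-Za-z0-9\-_.~]/.test(ch)
      ? ch
      : "%" + code.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

const german = "f\u00fcr"; // "für" with a precomposed ü
const chinese = "\u4e2d\u6587"; // "中文"

console.log(latin1PercentEncode(german)); // one Latin-1 byte for ü
console.log(encodeURIComponent(german)); // two UTF-8 bytes for ü
console.log(latin1PercentEncode(chinese)); // no Latin-1 representation
console.log(encodeURIComponent(chinese)); // three UTF-8 bytes per character
```

Even where the Latin-1 variant produces output, it differs from the UTF-8 form, so a receiver expecting UTF-8 would not decode it to the original text.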

meowky · February 20, 2024, 1:48pm

Thanks for pointing out my mistake. TIL

This indeed works for Chinese and the other Unicode characters I have just tested. The code in my earlier post has been revised to avoid confusion.

chrillek · February 20, 2024, 2:41pm

You’re welcome. And I think it was Apple’s mistake rather than yours: their text is from 2016, and that was already well into the advent of Unicode. But it would probably have been far too demanding to write that stuff correctly in pure AppleScript.