PDFs created in Calibre have "no text" in DT3

timj · June 1, 2019, 6:37pm

I expect this is related to this one (buggy-handling-of-pdf-text-due-to-pdfkit-engine/24411), but it is several months old.

I have some ebooks that I converted to PDF using Calibre 3.44 but, on import, DT3 says the PDFs have “no text”. They are not searchable, and though text can be highlighted and the TOC displays correctly, selected text does not copy (or paste) correctly. The link above seems to suggest that this is due to the way Calibre generates the PDF, combined with the way MacOS handles (or fails to handle) the PDF. Note: Preview also does not see any text when searching within the PDF (though it can select it easily), suggesting this may be a MacOS issue that is visible in DT3, as it uses the same engine.

Has anyone found a good solution for converting ebooks to PDFs that work in DT3? I need to convert several hundred ebooks and need PDFs that have the TOC (Calibre does this beautifully) and also work as expected in DT3, with full-text indexing, highlighting, copying/pasting, etc.

rfog · June 1, 2019, 7:19pm

Still there, and still in DT3 because they still use the buggy Apple stuff.

I resolve it this way:
- No embedded fonts.
- Load with Preview.
- Duplicate.
- Save with the “Create Generic PDFX-3 Document” or “Optimize Size” Quartz filter (either of these tends to double or triple the PDF size).

Result is a PDF that has TOC and text.

Perhaps you can script that.

timj · June 1, 2019, 10:52pm

I found a solution that accomplishes everything I need:

- Text in PDF is searchable/copyable in Preview/DT3
- TOC generated by Calibre remains intact
- File size is optimized for digital devices
- All links in the PDF remain functional
- Scriptable

I used Ghostscript via Homebrew (brew install gs) with this script:

#!/bin/bash

# Compresses all PDFs in the current directory for eBook usage,
# preserving links and TOC, and fixing problems with PDFs generated
# by Calibre not displaying on MacOS. Assumes the "original" and
# "finished" subdirectories already exist.

# Loop through all PDFs in the current directory
for pdf in *.pdf; do
    [ -e "$pdf" ] || continue  # skip if the glob matched no file

    echo "--> Processing: $pdf"

    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE \
        -dBATCH -dPrinted=false -dQUIET -sOutputFile="finished/$pdf" "$pdf"

    # Trailing slash makes mv fail instead of renaming if "original" is missing
    mv "$pdf" "original/"
done

The resulting file size is equal to or smaller than that of the original PDF.

MFDoom · September 3, 2019, 11:28pm

This method works for me with the exception of keeping the TOC links. Still experimenting on what else I can try to keep the links working.

timj · September 4, 2019, 10:15am

What version of Ghostscript are you using? I am using 9.27 and the above script still works flawlessly for me. Note: I had to try several different configurations before finding that the one above does what is needed, including the preservation of the TOC links (the flag -dPrinted=false is what does it, IIRC).
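For anyone who wants to run the same settings on a single file, here is a minimal sketch of that Ghostscript call wrapped in a helper. The name `fix_pdf` and its two-argument form are made up for illustration; the flags are exactly those from the script above. The note on -dPrinted=false reflects my reading of the Ghostscript documentation and has not been tested beyond what is reported in this thread.

```shell
#!/bin/bash
# Sketch only: fix_pdf is a hypothetical helper, not part of Ghostscript.
# Flags are the ones used in the thread's script:
#   -dPDFSETTINGS=/ebook  preset aimed at on-screen reading
#   -dPrinted=false       use the file's "screen" options for annotations
#                         instead of its "print" options; per the thread,
#                         this is what keeps the TOC links working
fix_pdf() {
    local src="$1" dst="$2"
    # Refuse to overwrite the input while Ghostscript is still reading it
    if [ "$src" = "$dst" ]; then
        echo "fix_pdf: input and output must differ" >&2
        return 1
    fi
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dBATCH -dQUIET -dPrinted=false \
       -sOutputFile="$dst" "$src"
}
```

Usage would look like `fix_pdf book.pdf finished/book.pdf`.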

MFDoom · September 4, 2019, 8:26pm

I am using 9.27 but I must have had a typo. Your script is undefeated. Thanks again for sharing.

rcvd · October 13, 2019, 1:48pm

This method worked for me for a long time. Now with the latest updates (gs 9.27, Calibre 4.1, macOS Catalina) all PDFs converted with gs are PDF only (and not PDF+Text).

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE \
-dBATCH -dPrinted=false -dQUIET

I can cut and paste content from the pdf but the search is very slow in DevonThink. Anyone else also having this problem?
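One way to narrow this down is to check whether the file contains any extractable text at all, independently of PDFKit. The sketch below uses Ghostscript's txtwrite device for that; `pdf_has_text` is a hypothetical helper name. Note the assumption: Ghostscript's text extraction is not the engine DT3 uses, so a file can pass this check and still show up as Kind “PDF” in DT3. It only separates "the file has no text" from "the file has text that PDFKit is not reading".

```shell
#!/bin/bash
# Sketch only: pdf_has_text is a hypothetical helper.
# Returns 0 if Ghostscript can extract any non-whitespace text from the PDF.
# This says nothing about whether PDFKit (and thus DT3) can read that text.
pdf_has_text() {
    local pdf="$1" chars
    # txtwrite dumps the document text; "-" sends it to stdout
    chars="$(gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- "$pdf" 2>/dev/null \
        | tr -d '[:space:]' | wc -c | tr -d '[:space:]')"
    [ "${chars:-0}" -gt 0 ]
}
```

For example: `pdf_has_text finished/book.pdf && echo "text found" || echo "no text found"`.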

timj · December 10, 2019, 11:21pm

Have you found a solution for this yet?

BLUEFROG · December 11, 2019, 1:04pm

No. Overwriting as PDFX-3 is still the most useful workaround IMHO.

timj · May 6, 2020, 3:28pm

Can you confirm this is working on Catalina 10.15.4? I have tried to convert multiple Calibre-produced PDFs to PDFX-3 format as mentioned above, but they suffer from the same OCR corruption as mentioned in this thread.

BLUEFROG · May 6, 2020, 3:30pm

No, I can’t confirm it as there are still PDFKit issues Apple hasn’t resolved.

halloleo · May 25, 2020, 2:56am

@timj Why do you want to convert the ebooks to PDFs in the first place? I think DT3 can search through ePUBs as well.

rfog · May 25, 2020, 6:43pm

I can tell you some:

- PDF has a standard annotation format that most PDF viewers can read, so you can annotate a PDF without worrying that the annotations won’t show up in another application, even on Windows. ePub doesn’t.
- DTTG does not have an ePub viewer.
- Fixed layout is useful for references, e.g. “see page 36”.
- PDF annotations are stored inside the PDF; ePub annotations are not (each viewer uses its own format).

timj · May 25, 2020, 9:11pm

Good question, and @rfog has provided the main reasons I prefer PDF over EPUB. Essentially, it comes down to versatility of use and that highlighting is embedded in the PDF and accessible in any standards-compliant PDF viewer.

For example, I listen to PDFs using Voice Dream and insert bookmarks in real time for important sections. Then I review the PDF in PDF Expert (on iPad), adding annotations and converting bookmarks to highlights. Those PDFs (with embedded highlights and annotations) are then archived in DT3 for review, searching, and integrated research. Highlights can be extracted, displayed and reviewed in the Annotations tab, and automatically summarized.

timj · May 25, 2020, 9:31pm

Update: Workaround discovered

For anyone who may be interested, it seems that the root issue of all this headache has to do with how Chromium creates PDFs on MacOS. According to the developer, Calibre uses Chromium’s print-to-pdf routine to convert ebooks to PDF format, and something about the PDFs created by Chromium on MacOS (apparently having to do with CID fonts, etc.) is not handled correctly by PDFKit.

I confirmed this by printing a webpage to PDF from within Chromium (or Chrome) on MacOS and printing the same page to PDF from Firefox or Safari. Importing both PDFs into DT3 and highlighting them results in the PDF from Chromium changing from Kind “PDF+Text” to “PDF” (though it may only happen after restarting DT3). This happens on both Catalina and Mojave.

In the previous workaround above, the PDFs created by Calibre worked in DT3 after post-processing them with ghostscript—but only in Mojave. When highlighted by DT3 in Catalina, the OCR layer became corrupt.

The best workaround I have found so far is complicated to set up, but so far works flawlessly: use Calibre on Linux.

I am running Ubuntu 18.04 in VirtualBox and the PDFs created by Calibre are unbreakable in DT3 (they require no post-processing, either), on either Catalina or Mojave. I have tested several PDF viewers, adding highlights, restarting DT3, etc. and so far every file has remained solid, and the OCR layer has not been corrupted.

My conclusion is that something about Chromium’s implementation of printing to PDF on MacOS is problematic, apparently having to do with PDFKit. The PDFs printed from Chromium on Ubuntu have no problems in DT3, and neither do the PDFs created by Calibre (using Chromium’s print-to-PDF functionality).
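For the batch conversion on the Linux side, a minimal sketch using Calibre's `ebook-convert` command-line tool could look like the following. It assumes `ebook-convert` is on the PATH (it is installed with Calibre) and that Calibre's default PDF output settings are acceptable; `convert_epubs` and the directory arguments are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Sketch only: convert_epubs is a hypothetical helper.
# Converts every .epub in a source directory to PDF via Calibre's
# ebook-convert, writing the results to an output directory.
convert_epubs() {
    local src_dir="${1:-.}" out_dir="${2:-pdf}"
    local epub base
    mkdir -p "$out_dir" || return 1
    for epub in "$src_dir"/*.epub; do
        [ -e "$epub" ] || continue   # glob matched nothing
        base="$(basename "$epub" .epub)"
        echo "--> Converting: $epub"
        # Report a failed conversion and carry on with the next book
        ebook-convert "$epub" "$out_dir/$base.pdf" || echo "    failed: $epub" >&2
    done
}
```

Called as `convert_epubs ~/ebooks ~/ebooks/pdf`, this leaves the original EPUBs untouched.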

BLUEFROG · May 26, 2020, 12:08am

“My conclusion is that something about Chromium’s implementation of printing to PDF on MacOS is problematic, apparently having to do with PDFKit.”

Yes, this is true and it does have to do with CID fonts.

rfog · May 27, 2020, 5:57pm

Interesting… I have a Calibre Docker container on my Synology NAS… Doubly interesting.

PS: rfog, giving ideas.

Haim · May 10, 2024, 1:27pm

Has there been any progress on this issue? (other than using Linux)

BLUEFROG · May 10, 2024, 4:02pm

You’d have to check your own results in Calibre. We don’t develop it and can’t account for its behaviors.