For some time now I’ve noticed that a few of my PDFs have had their text layer corrupted. I can select a range of text, but what ends up in the clipboard is gibberish.
Although I have not yet pinpointed exact steps to reproduce it, I noticed that it was more common in PDFs I had split off from a larger PDF, either by printing a portion of the original to PDF or by dragging and dropping the thumbnails (I don’t even know if it has happened with other PDFs).
Yesterday I tried Tools > Split PDF > Into Chapter and noticed that a lot of the resulting PDFs had this problem. The corresponding pages of the original PDF had no problems in the text layer. I tried the same thing with DEVONthink 4 and the problems were the same.
After using the Split tool, I tried to print the same problematic portions of the PDF as PDFs and also extracted the pages via the thumbnails. Same problems.
I’m using Sonoma, 14.7.1 (23H222).
DEVONthink 3.99
DEVONthink 4.0beta1
I downloaded PDFSam (https://pdfsam.org) and split the same PDF with no problems, which leads me to think it’s some problem with PDFKit.
I’m uploading the original PDF, and samples of the garbled output of both DEVONthink 3 and DEVONthink 4 and the correct output of PDFSam.
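For anyone who wants to compare such samples numerically rather than by eye, here is a minimal Python sketch. It assumes you have already copied the text of the same page range from the original and from the split PDF into two strings; the function names `normalize` and `text_layer_similarity` are hypothetical and not part of DEVONthink or PDFKit.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so line breaks do not affect the comparison."""
    return " ".join(text.split())


def text_layer_similarity(original_text: str, split_text: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two copied text samples.

    Assumption: both strings were copied from the same page range,
    one from the original PDF and one from the split PDF.
    """
    matcher = SequenceMatcher(
        None, normalize(original_text), normalize(split_text), autojunk=False
    )
    return matcher.ratio()
```

A score close to 1.0 suggests the split kept the text layer intact; a score near 0.0 suggests it was garbled. Where to draw the line in between is a judgment call.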
Unfortunately it is a PDFKit issue. Splitting is especially tricky, as workarounds that are useful for e.g. saving can’t be applied in this case. Printing is actually one of the workarounds we suggest to our users, and it is controlled entirely by the system. But evidently even this fails.
The only remaining workaround is therefore to split the PDF first into chapters and then to OCR the created PDFs again.
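To decide which of the split PDFs actually need another OCR pass, a rough triage could look like the Python sketch below. It assumes the text of each split PDF has already been extracted into a string by some other means, and it only catches one kind of damage: control, private-use, unassigned, or replacement characters. Gibberish made of ordinary but wrong letters will slip through. The names `suspicious_ratio` and `needs_ocr` and the 0.2 threshold are hypothetical choices, not anything DEVONthink uses.

```python
import unicodedata

# Unicode categories that rarely appear in a healthy text layer:
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
REPLACEMENT_CHAR = "\ufffd"


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like text-layer debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == REPLACEMENT_CHAR or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)


def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Flag a document for re-OCR when too many characters look suspicious.

    The threshold is an arbitrary starting point and should be tuned on real samples.
    """
    return suspicious_ratio(text) > threshold
```

Empty text returns a ratio of 0.0 and is therefore not flagged, so a PDF with no text layer at all needs a separate check.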
Thanks for the prompt reply. Was this a known bug? Should I report it to Apple, or did you guys already do so?
Are there other known bugs of this type (you mentioned there are workarounds for saving, which suggests other forms of this bug)? What workarounds are those? I’m just trying to get some dos and don’ts to avoid inadvertently corrupting my PDFs. Does it happen with other operations (highlighting, annotating, etc.)?
We reported several issues causing corrupted text layers (including example projects); none of them was ever fixed.
Only internal, coded workarounds of DEVONthink. But fonts that are required by a PDF document and are neither embedded nor available on your Mac can cause trouble, just like duplicate or damaged fonts or non-Latin languages. Even the application that created the PDF document might matter. Ideally, PDF documents are created on the Mac, of course.
Now that you mention this, these PDFs are generated by third parties, probably on a Linux server, and if I check their properties they usually have warnings about missing fonts.
Does installing these fonts locally ameliorate the issue?