PDF highlighting and file size

I am new to DT. I plan to use it for organizing assorted PDFs for research and writing. I imported a small PDF (6.5 MB), a free download from Google Books. It was already editable and searchable before I brought it into DT, and I simply dragged and dropped it in. Everything seemed to be working great. However, two things occurred:

  1. When I highlighted text in the document, it would initially show up in the annotation list with the actual text listed that I highlighted. However, after several minutes I could no longer highlight the text and the text I did highlight turned blank in the content column of annotations.

  2. After several minutes the 6.5 MB file (labeled as a “PDF+text” document) would convert to a “PDF document” and grow to a hefty 40-50 MB. Besides the size problem, this also seemed related to issue 1 above: I could no longer highlight text throughout the document, as I could when I first imported it or in a PDF viewer outside the program.

I am not sure if there is a setting issue or if I don’t understand how DT works. I would appreciate any help as this snag has become an initial frustration in using the program.

Any guidance or help for a newbie would be great.

Welcome @mattlynskey

A URL for the PDF so we can download and test it would be helpful.

Thanks for the help. This phenomenon occurs on several pdfs.

Here are a couple of links where a similar thing occurs:

  1. https://books.googleusercontent.com/books/content?req=AKW5QadDtKAOnVXMHKSa1fSjCBqL2W5gFvlm5TFlufGRP6X0XnD6__S9N6EqjIPD3MT86TwZ1cS-p3N0wkwZpzPS2Cnzkxg4Q5RoHBFhyQglELTDvytEr_zuslbzUM2kFI-ReGyZ_OyhuVL9KZX613PW7KVWa0uBuIj-8kKaWf7ifk0pzqO6CP6kpKG8DW3FF4X9LjwB2OHCWrtXfRIxwxzQ9Y-xp-LEaijZK5hY7dOtFjJj6cJzScwUK94rGcuJYqkRYwPkV6BwcTShNVF6Tu5JKEqmMczhcg

  2. https://books.googleusercontent.com/books/content?req=AKW5Qae6VT2UJjyFimzk3n6kcBFuFyJ9MfEOeOw_2rfvLHdWtxlbwD-0k4rsWgm5wTFlBMsePhU_iHPyX7qTw_wyd1_6le19qIKg73IZyPOywdWu1dlCO_TtjNGN2VJ3w1tLlFVs-toAxrKwLZxHCO4nK6TZJoE-MLi0_wdrdjoKOgOG5vo_tegHuUYdqlV_cZqIfvT-EzjLRsS1WczhBdpkDo-EsyURIL9-Bd-1UmOrNj3HcBohA-nmir3QI39gjHcpFlPC-treI7vQHTrPoVv18K9lBLHfAg

They are around 8-11 MB. I import them into DT. I can initially highlight text. The highlighted text shows up under the “Content” column in the Annotations tab of the document. Then it looks like DT does some sort of conversion. Then the file size jumps to 40-50 MB or 90 MB. After this, pages where text would be highlightable are not. I try to grab the text and it looks like the whole page is an image that is not editable. This is the basic summary of the problem.

Thank you for any help that you can provide.

These links aren’t accessible (permission denied).

Sorry for the broken links.

Attached are such pdfs from Google Books:
The_Divine_Enterprise_of_Missions.pdf (6.1 MB)
The_New_Acts_of_the_Apostles_Or_The_Marv.pdf (11.2 MB)

Which version of macOS do you use? I couldn’t reproduce this on 10.15.3.

I saw this on 10.15.3, though the Inspector didn’t immediately show the size change.

Post highlight in the Finder…

and the other post-highlight in DEVONthink…

The original is PDF 1.4 with WinAnsiEncoding with 6 million objects.
After highlighting, it’s PDF 1.3 with Flate encoding with 49 million objects.

In Preview, the encoding and PDF spec was preserved, with only a small increase in the number of objects (6.3 million to 6.4 million).
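For anyone who wants to gather similar before/after numbers themselves, here is a rough Python sketch. It is not the tool used for the figures above (that tool isn't named in this thread), and the function name `pdf_stats` is made up for illustration. It only scans the raw bytes, so the object count is approximate: objects packed inside compressed object streams are not seen, and binary stream data can occasionally produce a false match.

```python
import re
from pathlib import Path


def pdf_stats(data: bytes) -> dict:
    """Return rough statistics for the raw bytes of a PDF file."""
    # A PDF normally starts with a header such as b"%PDF-1.4".
    header = re.match(rb"%PDF-(\d\.\d)", data)
    return {
        "size_bytes": len(data),
        "version": header.group(1).decode("ascii") if header else None,
        # Count "N G obj" markers (approximate, see the caveats above).
        "indirect_objects": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
        # Streams compressed with Flate (zlib-style) compression.
        "flate_filters": data.count(b"/FlateDecode"),
        # Fonts declaring WinAnsiEncoding; zero hits may hint that the
        # text layer is gone, but this is only a heuristic.
        "winansi_refs": data.count(b"/WinAnsiEncoding"),
    }


def compare(original_path: str, modified_path: str) -> None:
    """Print the statistics of two PDF files side by side."""
    before = pdf_stats(Path(original_path).read_bytes())
    after = pdf_stats(Path(modified_path).read_bytes())
    for key in before:
        print(f"{key:>18}: {before[key]!s:>12} -> {after[key]!s:>12}")
```

Calling `compare("original.pdf", "after_highlight.pdf")` (both file names are placeholders) prints the two sets of numbers next to each other, which makes a jump in size or a change in the header version easy to spot.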

That’s completely handled by the PDFKit framework of macOS and not customizable.

Yeah. Just noting the statistics between the original and much larger results.

The phenomenon occurs in 10.15.3.

Yes. This is the file size issue/bulge that results after I import the pdf. Also, the document does not become editable as it was prior to import.

Just to provide some more info. Attached are two images. One is the pdf viewed via Preview. It is 11.7 MB and you can see the blue highlights where all of the text is selectable. The other image is after importing it into DT. It is 98.1 MB and the same page only gets selected as an image. Plus, you can see that text which was previously highlighted does not have the actual content of the highlighted annotation/text. When first highlighted, it shows up as the actual text under annotation content. However, after the change/conversion (a conversion which automatically occurs) of the document, the text of the quote goes away.


Can I provide any more information that would be helpful to troubleshoot these issues?

Just checking in again. Any more clarity on this situation?

Nothing to report at this time.

I can confirm it does the same thing for me (I only tried the first one). It looks like it drops the OCR text; the Log says No Text. I don’t think it’s a DT issue, as PDFpen also dropped the text after first showing it in the preview layer. Now all I see is this…

Trying flatten in PDFpen and re-OCRing via DT.
[I cropped page size beforehand]

I had problems with DT OCR; I think it must be the print quality. PDFpen did it, so the problem is probably with scan quality.

55.7 MB after convert.
Would upload, but too big.

Thanks for the help troubleshooting. It sounds like you are running into the same problem. I appreciate the collaboration.

That’s odd that the scan quality would create an OCR problem when the text is editable and searchable straight from the Google Books pdf and is only 6-7 MB. I can highlight, search, and copy/paste text straight from Preview. So the pdf is already editable prior to importing it into DT.

Any idea how to stop DT from running (what seems to be an automatic conversion)? Or any idea why it increases the size almost tenfold when it is already searchable? It seems a shame to have so much space taken up when the file is so small and editable prior to the DT import.

Thanks so much for any insights you have.

I found another thread related to this issue. It seems that a small PDF (6-7 MB), when imported into DT, will show up under “Kind” as a “PDF+Text” document. As long as the document stays like this, it seems to be editable, searchable, and highlightable. (Note: such a PDF has already been OCRed prior to DT import.)

When the text is highlighted, it shows up in the annotation column. However, shortly thereafter (maybe because of editing the document or DT’s automated process), the document appears to convert automatically, and its entry in the “Kind” column changes to “PDF document.”

It is after this “conversion” that the problems result: size increases, text becomes invisible, and annotations become blank. The thread addressing this is nearly two years old and said the problem was addressed in a newer version of DT. However, this does not seem to be the case, unless I am missing something.

Could anyone provide some guidance?

Only editing the document should modify it; the output is completely controlled by macOS’ PDFKit framework. Therefore the only workaround in this case would be to use a third-party PDF editor that doesn’t use PDFKit (Preview and Skim use it too).
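To check whether the file is rewritten only when it is edited, or also at some other moment, one option is to fingerprint the file on disk before and after each step. This is just a sketch with made-up helper names (`fingerprint`, `was_rewritten`); it assumes you know the path of the file you want to watch.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for the file at path."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


def was_rewritten(before, path):
    """True if the file no longer matches an earlier fingerprint."""
    return fingerprint(path) != before
```

Take `before = fingerprint(path)` right after import, wait a few minutes without touching the document, and call `was_rewritten(before, path)`; then add a highlight and call it again. If the first check is false and the second is true, the rewrite is tied to the edit rather than to a background conversion.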