PDF highlighting and file size

Thanks for the help troubleshooting. It sounds like you are running into the same problem. I appreciate the collaboration.

It’s odd that the scan quality would create an OCR problem when the text is editable and searchable straight from the Google Books PDF, which is only 6-7 MB. I can highlight, search, and copy/paste text straight from Preview. So the PDF is already editable prior to importing it into DT.

Any idea how to stop DT from running what seems to be an automatic conversion? Or any idea why it increases the size almost tenfold when the file is already searchable? It seems a shame to have so much space taken up when the file is so small and editable prior to the DT import.

Thanks so much for any insights you have.

I searched another thread related to this issue. It seems that a small PDF (6-7 MB), when it is imported into DT, will show up under “Kind” as a “PDF+Text” document. As long as the document stays like this, it seems to be editable, searchable, and highlightable. (Note: Such a PDF has already been OCRed prior to DT import.)

When the text is highlighted, it shows up in the annotation column. However, shortly thereafter (maybe because of editing the document or DT’s automated process), the document appears to convert automatically, and its entry in the “Kind” column changes to “PDF document.”

It is after this “conversion” that the problems appear: size increases, text becomes invisible, and annotations become blank. The thread addressing this is nearly two years old and said the problem was fixed in a newer version of DT. However, this does not seem to be the case, unless I am missing something.

Could anyone provide some guidance?

Only editing the document should modify it; the output is completely controlled by macOS’ PDFKit framework. Therefore the only workaround in this case would be to use a third-party PDF editor that doesn’t use PDFKit (Preview and Skim, for example, use it too).

PDFpen must use it too, as it does the same thing that mattlynskey has described, maybe automatically. When I first opened the document, the text looked like this.

Then I did my usual: delete blank pages, crop excess white border, no editing/highlighting of text yet at all, save. Now when I look, this is what I see.

Thank you for your patience. Could I get some clarification? Sorry, I don’t understand all the nuances of how PDFs work.

  1. Are you saying that this is just the way it will work with DT? That is, are you saying that the problem is unresolvable because of how macOS works with pdfs and that there is no way for DT to preserve the original document from Google Books which is already editable?

  2. Also, what would I use a third-party PDF editor to do? The documents (i.e., Google Books attached) seem to be already converted: they are editable and searchable as is. It is only when put into DT that a separate “automatic conversion” occurs. Or, are you saying that I would need to use another program other than DT for such documents?

Thanks for the help.

But you do not only import it. Highlighting modifies the document, and saving corrupts the text layer of the document due to bugs or limitations of the PDFKit framework of macOS. I filed a bug report with Apple, for example, and provided several sample documents and sample projects independent of DEVONthink. But so far Apple doesn’t care.

Yes. I agree. It seems that the modifications do initiate this process. That makes sense. I also understand now that this is a macOS issue. However, perhaps you could provide some guidance?

I invested (a lot) in DT, excited and hopeful that it could be a mainstay for my research. I work with a lot of older works that are public domain and open access, so these kinds of documents will be a regular encounter for me. Could you clarify:

  1. Is there a workaround (I know you hinted at this before; could you re-explain?) that could make such documents usable in DT AND keep them at a manageable file size? What would be a process for this?

  2. Or, since the issue is connected with macOS and DT is a Mac-only program, is it not feasible for DT to be used for such documents until macOS adjusts its PDFKit? If so, that would be disappointing, as I was hopeful DT would be able to manage such research.

Again, thank you for your patience and clarification.

You could try to either OCR the document (slow) or print the PDF document into a new document using Preview. Sometimes this creates a PDF that is more compatible with PDFKit.

It’s not a generic issue; it depends heavily on the PDF documents involved and the software/platform used to create them.

@mattlynskey I have faced the exact same issues you describe, as my research also involves a lot of older public domain docs (from archive.org, hathitrust, google books, etc.)

Unfortunately, the only solution I have found is to use DT for organizing files only, and to never use it for annotating (highlighting, adding notes, etc).

Instead, I annotate using the (excellent) third-party app PDF Expert.

I know basically nothing about how pdfs work, but I know that this method has kept me from pulling my hair out so far.

Edit: I want to add that I have not been able to fix the documents that have been “broken” by DT/PDFKit annotations. I have had to suffer through manually transferring my annotations to a “fresh” version of the document. I hope to avoid that pain in the future by avoiding annotating in DT, at least until (if?) these issues are addressed.
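
For anyone who wants a safety net against the "broken document" situation described above, here is a minimal Python sketch that sets aside an untouched copy of a PDF before it is annotated. It is only an illustration: the function name `keep_pristine_copy` and the idea of a separate backup folder are assumptions of this sketch, not a feature of DT or PDF Expert.

```python
import shutil
from pathlib import Path


def keep_pristine_copy(pdf_path: str, backup_dir: str) -> Path:
    """Copy pdf_path into backup_dir unless a copy with that name already exists.

    Hypothetical helper: the existing backup is never overwritten, so the
    first (unannotated) version of the file is the one that survives.
    """
    source = Path(pdf_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    if not target.exists():
        # copy2 also preserves timestamps where the platform allows it
        shutil.copy2(source, target)
    return target
```

Run it once on a freshly downloaded file, before any highlighting. If a later save damages the text layer, the copy in the backup folder is still the original download.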

Thank you @grosson for your feedback.

Could I ask about your workflow? Are you saying that you store two copies of each public domain book: one in DT and the other annotated copy (via PDF Expert) elsewhere?

Also, as long as you only store it in DT and don’t modify it at all, I am assuming that it remains searchable. Is this correct?

I might be missing something, but DT seems like a really powerful tool. It is such a shame that the only thing that can be done with these valuable public domain works is to store them in DT without the ability to edit/annotate them.

No no, I only have one copy of each file. I simply use DT as a file browser/organizer—which is where its real strengths are anyway. I can read my PDFs in DT, but if I want to annotate them, I can double click the file name to open in an external editor (you can identify an external editor in settings). Then annotate, close, and voila. If I don’t modify it in DT it remains searchable, at least so far.

Thank you @grosson! It is unfortunate that DT would be only a reader at this point. However, it is helpful to understand your workaround. I tried to find where to designate an external editor in settings/preferences but could not locate it.

Could you identify how I would set this up?

This would be an extremely helpful option to know that I could still edit/annotate documents in DT albeit through another program.

Thanks in advance for the guidance.

You don’t define external editors in DEVONthink. It uses whatever the default has been set to in the operating system.

You can also use Data > Open With, also available when you Control-click a file.

Yes, my apologies! I’d forgotten the order of operations. Set PDF Expert as the default application for opening PDFs in macOS, then double-clicking a PDF in DT will open the file in a new PDF Expert window.

No worries!

then double-clicking a PDF in DT will open the file in a new PDF Expert window.

IF you have enabled Preferences > General > Double-click opens documents externally. If you don’t have it enabled, you can always use Shift-Command-O.

Thank you @BLUEFROG and @grosson for your help with this inquiry. I have played around with these suggestions and just wanted to summarize (perhaps for the help of future users) what I found:

  1. The problem with the increase in PDF size and the inability to edit annotated versions has to do with macOS and not DT. This was tricky to discover at first from my end, but as others explained in the thread, this is the case. Even if I make some annotations with other PDF programs outside of DT, this phenomenon occurs. I would hope that this problem could get resolved, but I understand it is on the macOS side of things.
  2. As @grosson explained, his workaround is helpful for the time being. I have worked through a large PDF of 450+ pages. I would ONLY annotate/highlight externally. That is, I set my Preferences to “Double-Click Opens Documents Externally.” I chose to use Preview for this and made highlights. When I saved and closed the program, the file size stayed the same, the annotations were readable, and the document was searchable. However, anytime I made an edit/annotation in DT, the PDF behaved as described earlier in this post (i.e., increase in size, loss of editable text, etc.).
  3. It seems (although I have not tested this as much) that when I edit/annotate the same PDF from DTTG on my iPad or iPhone, the edits are preserved on the desktop version when they sync, and the same problem does NOT occur. That is, for some reason I can edit a PDF using DTTG without a huge problem, but when I use DT to highlight/annotate, the problem occurs.
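
For anyone who wants to check whether a particular save triggered the size jump described above, here is a minimal Python sketch that compares a PDF's size before and after annotating. The function names and the 2x threshold are assumptions made for illustration; nothing here comes from DT or macOS.

```python
import os


def file_size(path: str) -> int:
    """Size of the file on disk, in bytes."""
    return os.path.getsize(path)


def size_growth(before_bytes: int, after_bytes: int) -> float:
    """Return the ratio of the new size to the old size."""
    if before_bytes <= 0:
        raise ValueError("before_bytes must be a positive number")
    return after_bytes / before_bytes


def looks_rewritten(before_bytes: int, after_bytes: int, threshold: float = 2.0) -> bool:
    """Guess whether a save rewrote the PDF, based only on file growth.

    The 2x threshold is an arbitrary assumption; a handful of highlights
    normally adds very little to a file.
    """
    return size_growth(before_bytes, after_bytes) >= threshold
```

Record `file_size(path)` before annotating, then call `looks_rewritten(before, file_size(path))` after saving. Size alone cannot tell you whether the text layer survived, so it is still worth running a search in the document afterwards.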

Anyway, it would be nice for the systemic problem to be resolved. For the moment, though, this at least gives me clarity on the issue at hand. I thank you all for your help in troubleshooting this!

You’re welcome.

Note that DEVONthink To Go uses a third-party PDF framework, not Apple’s PDFKit, hence the difference in results between our apps.

hey there, @grosson, quick question: how do you use those public domain docs within DT? Are you talking about the public-domain downloadable PDFs only? If you also integrate non-downloadable documents (on archive.org, Google Books, etc), how do you manage this in DT?

I’m only talking about PDFs that I have downloaded from archive.org. I don’t have a good way of incorporating non-downloadable content into DT (like material on Google books).

Good to know - I have seen murmured requests for plugins for Google Books, archive.org, etc. Would be great, no?