I am new to DT. I plan to use it for organizing assorted PDFs for research and writing. I imported a small PDF (6.5 MB), a free PDF from Google Books. It was already editable and searchable before I dragged and dropped it into DT. Everything seemed to be working great. However, two things occurred:
1. When I highlighted text in the document, it would initially show up in the annotation list with the actual text I had highlighted. However, after several minutes I could no longer highlight text, and the text I had already highlighted turned blank in the content column of the annotations.
2. After several minutes the 6.5 MB file (labeled as a “PDF+text” document) would convert to a “PDF document” and grow to a hefty 40-50 MB. Not only did this create a size issue, it also seemed to be related to issue 1 above: I could no longer highlight text throughout the document, as I could when I first imported it or outside the program with a PDF viewer.
I am not sure if there is a setting issue or if I don’t understand how DT works. I would appreciate any help as this snag has become an initial frustration in using the program.
They are around 8-11 MB. I import them into DT. I can initially highlight text. The highlighted text shows up under the “Content” column in the Annotations tab of the document. Then it looks like DT does some sort of conversion, and the file size jumps to 40-50 MB or even 90 MB. After this, pages whose text was highlightable no longer are. When I try to grab the text, the whole page appears to be an image that is not editable. This is the basic summary of the problem.
Just to provide some more info, attached are two images. One is the PDF viewed in Preview. It is 11.7 MB, and you can see from the blue highlights that all of the text is selectable. The other image is after importing it into DT. It is 98.1 MB, and the same page only gets selected as an image. Plus, you can see that the previously highlighted text no longer shows its content in the annotation. When first highlighted, it shows up as the actual text under annotation content. However, after the conversion of the document (which occurs automatically), the text of the quote goes away.
I can confirm it does the same thing for me (I only tried the first one). It looks like it drops the OCR text; the Log says “No Text”. I don’t think it’s a DT issue, as PDFpen also dropped the text after first showing it in the preview layer. Now all I see is this…
Trying flatten in PDFpen and re-OCRing via DT.
[I cropped page size beforehand]
Thanks for the help troubleshooting. It sounds like you are running into the same problem. I appreciate the collaboration.
It’s odd that the scan quality would create an OCR problem when the text is editable and searchable straight from the Google Books PDF, which is only 6-7 MB. I can highlight, search, and copy/paste text straight from Preview. So the PDF is already editable prior to importing it into DT.
Any idea how to stop DT from running what seems to be an automatic conversion? Or any idea why it increases the size almost tenfold when the file is already searchable? It seems a shame to have so much space taken up when the file is so small and editable prior to the DT import.
I found another thread related to this issue. It seems that a small PDF (6-7 MB), when imported into DT, will show up under “Kind” as a “PDF+Text” document. As long as the document stays like this, it seems to be editable, searchable, and highlightable. (Note: such a PDF has already been OCRed prior to the DT import.)
When the text is highlighted, it shows up in the annotation column. However, shortly thereafter (maybe because of editing the document or an automated DT process), the document appears to convert automatically, and its entry in the “Kind” column changes to “PDF document.”
It is after this “conversion” that the problems appear: the size increases, the text becomes unselectable, and the annotations become blank. The thread addressing this is nearly two years old and said the problem was fixed in a newer version of DT. However, this does not seem to be the case, unless I am missing something.
Only editing the document should modify it, and the output is completely controlled by macOS’ PDFKit framework. Therefore the only workaround in this case would be to use a third-party PDF editor that doesn’t use PDFKit (note that Preview and Skim use it too).
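If it helps to narrow this down, one way to check whether the file on disk is actually being rewritten (rather than only its display changing) is to record its size and checksum right after import and again after annotating. Below is a minimal sketch in Python using only the standard library. The helper names `snapshot` and `describe_change` are hypothetical, and the path is whichever copy of the PDF you want to watch.

```python
import hashlib
import os


def snapshot(path):
    """Return (size_in_bytes, sha256_hex) for the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large PDFs do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def describe_change(before, after):
    """Compare two snapshots and summarize whether the file changed."""
    if before == after:
        return "unchanged"
    # Guard against dividing by zero if the first snapshot was an empty file.
    ratio = after[0] / before[0] if before[0] else float("inf")
    return f"rewritten: {before[0]} -> {after[0]} bytes ({ratio:.1f}x)"
```

Call `snapshot(path)` once before highlighting and once after the Kind changes, then pass both results to `describe_change`. This only shows that the file was rewritten and by how much, not why.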