I have an OCRed PDF downloaded from the Internet Archive. The source file is 25 MB.
2025-12-01: I began reading it in DTTG 4.0.1, made 9 annotations, and synced back to the computer that night.
2025-12-02: I updated to 4.0.2 in the morning. After the update, I found that the file could no longer be annotated (text could not be selected, highlighted, etc.). (In the end, the problem turned out to occur in DT4.)
Kind is still PDF+Text.
At this point I noticed that its size had grown to 476 MB. After checking the file size on the computer, I could confirm that it had already become that large on 2025-12-01.
The size increase seems to affect only the OCRed PDF downloaded from the Internet Archive: I also read another PDF, which I had OCRed myself, in DTTG 4.0.1, and its size did not grow anywhere near as much.
I re-imported the source file (25 MB) under 4.0.2 and made some annotations in DTTG4 (at this point it could be annotated normally). After syncing back to the computer, I found that the file size had increased again (460 MB), but strangely it was still 25 MB in DTTG 4.0.2, so the problem lies with DT4. When I synced back to DTTG4 again, I could no longer make annotations.
I imported the source file into DT4 again. After annotating it directly there, I found that the file size had also increased.
It seems that for OCRed PDFs downloaded from the Internet Archive, I should try other PDF readers.
Have you actually determined you need to do OCR on the PDF document?
If so, how? I just downloaded a PDF of The Federalist Papers and it already has an existing text layer and needs no OCR.
Doing OCR in DEVONthink To Go is not the same as doing it in DEVONthink. They are two different applications using different frameworks.
PDFs can be created with mechanisms that use very aggressive compression. Changing a document, including annotating, with another mechanism can use different compression techniques, often less aggressive ones.
Your OCR settings in DEVONthink matter.
Original document: 19.3MB (568 pages with a text layer).
After adding a highlight in DEVONthink 4 and saving: 68.7MB
The document contains very aggressive JBIG compression.
After the annotate and save, the JBIG compression is re-encoded with JPEG2000 (JPX encoding) compression, a much less aggressive form.
PS: If you see a document that’s e.g., 500+ pages, scanned from a book so it has page images, it most likely is highly compressed.
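To check which compression a particular PDF uses before and after annotating, a rough sketch like the following can help. It is only a heuristic, not anything DEVONthink provides: it counts the standard PDF filter names (`/JBIG2Decode`, `/JPXDecode`, `/DCTDecode`, and so on) in the raw bytes of the file. The function name `count_filters` is made up for this illustration, and `/FlateDecode` is also used for non-image streams, so its count says little about page images.

```python
import re
import sys
from collections import Counter

# Stream filter names defined by the PDF specification.
FILTERS = (b"JBIG2Decode", b"JPXDecode", b"DCTDecode",
           b"CCITTFaxDecode", b"FlateDecode")


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count how often each filter name appears in the raw PDF bytes.

    This is a heuristic scan of the file, not a real PDF parse.
    """
    counts = Counter()
    for name in FILTERS:
        pattern = rb"/" + name + rb"\b"
        counts[name.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_filters.py original.pdf annotated.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, dict(count_filters(fh.read())))
```

Running it on the original and on the annotated copy should show the page images moving from one filter to another if the file was re-encoded on save.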
Compression rate! That’s clear. Thank you very much for your reply.
As for point 1, I don’t quite understand what you mean. I didn’t do any additional OCR on this PDF downloaded from the Internet Archive. For other PDFs, though, I first use PDFPen to remove their original OCR layer, then run OCR in PDFPen, and then import them into DT, so that I don’t need to change DT’s ForceEditablePDFs parameter in order to annotate. (Chinese PDFs produced by DT’s built-in OCR can’t be annotated in DT. I reported this problem last year in ticket #963164; I don’t know if you still remember it.)
After setting ForceEditablePDFs to TRUE, I now generally use DT4’s built-in OCR, but I found another problem (should I create a new topic to discuss it?).
After running OCR in DT4, I imported the file into DTTG 3 and 4 and found that the Chinese text could not be selected for annotation, while the English text could. In DT4, both Chinese and English can be selected and annotated normally after OCR.
Then I tried running OCR in DTTG 4 on files containing both Chinese and English, and that also had a problem.
Situation 1: OCR settings with Chinese + English turned on
English is recognized normally, but Chinese is recognized as English letters.
I think in the future I’d better remove the OCR layer from PDFs downloaded from the Internet Archive and then re-OCR them.
Why?
Also, setting PDF Resolution to As Source is inadvisable unless you scanned the document yourself or actually know its resolution. It is not uncommon for people to scan at too high a resolution, and that setting will preserve that resolution, increasing the file size even more. I’d recommend 200dpi as a balance between quality and file size.
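As a rough illustration of why resolution matters so much: the number of pixels on a page grows with the square of the dpi. The sketch below assumes a US Letter page (8.5 x 11 inches) stored as uncompressed 8-bit grayscale. Both are assumptions chosen for the arithmetic, and real files will be much smaller after compression, but the ratio between resolutions is the point.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int = 8) -> int:
    """Uncompressed size in bytes of one page image at the given dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (200, 300, 600):
    size_mb = raw_page_bytes(8.5, 11, dpi) / 1e6
    print(f"{dpi} dpi: {size_mb:.1f} MB per page before compression")
```

Under these assumptions, a 600dpi page holds nine times the pixel data of a 200dpi page.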
Because, as I reported at the beginning of this topic, DT4 increases the file size (possibly by 10 times) after I annotate these files (so far I have only seen this with OCRed PDFs downloaded from the Internet Archive). When I used PDFPen to remove the OCR layer and then re-ran OCR in PDFPen, the processed files could be annotated normally after being imported into DT4 and DTTG4.
But the PDFPen I bought many years ago is version 12, which will no longer be updated and will not support new macOS versions in the future (it can no longer be installed on new computers with M4 or later / macOS 15 or later). The new Nitro PDF is too expensive, so I have set ForceEditablePDFs to TRUE, and I don’t want to buy any other products (I have already purchased some, but DT can’t annotate the files OCRed by their upgraded versions either).
Thank you for your suggestion about PDF Resolution. In the past I tested more than a dozen documents with both 200dpi and As Source, and the latter produced the smaller file more often. Maybe the files I have are better suited to that setting. Thank you.