DEVONthink makes already OCRed pdfs several times larger

Per · December 26, 2020, 9:17pm

I am completely new to DT and already see huge benefits using it in my academic workflow (with Bookends and Tinderbox). It seems really great! However, I have a problem with DT multiplying the size of pdf files several times. I work very hard in Adobe Acrobat to get small pdfs with OCR, and then DT makes them bigger without me running the OCR function in DT on them. What is happening? How can I make it stop? I have almost 3000 pdfs so far, so it is a matter of GB getting lost.

Thanks in advance!

rmschne · December 27, 2020, 7:09am

It would be best if you could explain your evidence of this problem in more detail, with some screenshots. How are the files getting from Adobe into DEVONthink? Can you tell what happens by looking in the DEVONthink log? What does the log say as you enter the file into DEVONthink? What does DEVONthink report as “type” for these incoming files before DEVONthink does whatever you think it is doing? Are there any Rules set up to handle incoming files? Just saying “DT makes them bigger without me running OCR” is not much to go on for anyone who might be able to help (except of course if coincidentally they have seen exactly that).

Per · December 27, 2020, 10:30am

Thanks for getting on this so quickly! Here’s the setup:
– I have the pdfs in Bookends’ iCloud folder.
– I have indexed the folder in DEVONthink.
– Most files were already made small and OCRed in Adobe Acrobat before being indexed in DEVONthink, but I have also opened files from DEVONthink’s “open with”, selected Adobe Acrobat (default), and then “improved” them in Adobe.
– The pdf files stay small and then suddenly become several times larger. This can also be forced if I open the pdf in DEVONthink by double-clicking on its database entry in the list pane and save when closing it again. I have uploaded two screenshots of before and after opening a file, highlighting a sentence, and closing it again. Note the file size going from 0.206 to 1.1MB (similar ratios for large files; one grew to 150MB). The log button stays grey, but I do not know if there is a log to share from somewhere else (beginner).

Is this enough for you to figure out what is going on?

rmschne · December 27, 2020, 10:49am

The info in the log is essential. See the Help and/or Manual for how to open it with the menu command.

Per · December 27, 2020, 11:22am

If you mean the log that can be opened like a window, I cleared it before opening the file and it remains empty afterwards too. Is there another “deeper” log somewhere? Thank you for being patient with my ignorance.

rmschne · December 27, 2020, 11:36am

Yes, I guess I mean that log. And the activity window. Both will help you. Only other thing I can suggest you look at is your settings for OCR in Preferences. Maybe DEVONthink believes these files are not yet OCR-ed, even though they seem to be (did you check that by doing a search?).

Others, as the earth turns and they wake up, will probably chip in with ideas and suggestions in due course.

Meantime, as you have hopefully cracked open the Help and/or Manual, have a read. Maybe something in there will give you a clue. It’s an outstanding Manual.

Per · December 27, 2020, 11:42am

Both log and activity windows are empty, so whatever DEVONthink is doing to make the pdf-files 7-8 times bigger it doesn’t show up there. I have turned off the “convert incoming scans” just to be sure, even if I am not importing the files per se. Thanks for helping out! I will have a close read through the manual, but I do hope the collective intelligence of this excellent user forum can help me out.

Amontillado · December 27, 2020, 1:42pm

When I add PDF files by dragging them into DT, they don’t change size. If I drag a PDF out to a different folder on my disk, “diff” shows the trip through DT didn’t change it.
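The same round-trip check can be sketched in Python for anyone who prefers a script to `diff`. This is only an illustration, not anything DEVONthink provides: the names `file_digest` and `same_bytes` are made up here, and the check simply compares file sizes and SHA-256 digests.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(original, exported):
    """True if the two files have the same size and the same content hash."""
    original, exported = Path(original), Path(exported)
    if original.stat().st_size != exported.stat().st_size:
        return False
    return file_digest(original) == file_digest(exported)
```

Calling `same_bytes("paper.pdf", "exported/paper.pdf")` (hypothetical paths) returns `True` only if the copy dragged out of DT is byte-for-byte the same as the original.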

However, when I OCR’d a 1.5 MB PDF, it grew to 16.3 MB, a factor of ~11. The original file was about 140 pages of almost nothing but text. Maybe the OCR process rewrites images at maximum quality.

It might help to check a setting in DT’s preferences. On the top Apple menu bar when DEVONthink is active, go to DEVONthink 3->Preferences.

From the preferences window, go into the OCR tab (you may have to click the >> button on the top row to see OCR).

Is “Convert incoming scans” set to “to searchable PDF”?

If so, change it to “No action.”

There is an option there to compress PDFs, but I’m not sure exactly what that does or how to do that in DT outside of an import. There is an explanation in the manual, but on a brief read I wasn’t sure if images were changed in any way.

pete31 · December 27, 2020, 1:48pm

Does it also happen if you open and save a copy of such a PDF in Preview.app?

Per · December 27, 2020, 3:22pm

Hi @pete31 !
I can open, highlight a sentence, and close the pdf in Preview or Adobe Acrobat without the file size increasing (more than a few bytes anyway). It only happens within DEVONthink itself.

pete31 · December 27, 2020, 3:33pm

I’m afraid in this case you’ll have to wait until the DEVONtech developers are back as it’s probably something they have to look at, see this thread

BLUEFROG · December 27, 2020, 3:49pm

Welcome @Per

Note: Acrobat can use very aggressive compression in making PDFs so it’s not a reliable comparison. Not all OCR applications produce PDFs as small as it can.

DEVONthink uses Apple’s PDFKit which uses different compression when saving a file. The file must be decompressed while it’s in use and will save with PDFKit after changes are made.

Also, Preview does use PDFKit, but there are things it can do that aren’t available to third-party developers.
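One rough way to see whether the compression really differs between the Acrobat version and the re-saved version of a file is to count the stream filter names in each. The sketch below is an assumption-laden illustration: the filter names are the standard ones from the PDF specification, but the helper names `count_filters` and `count_filters_in_file` are invented here, and a plain byte search will miss filters declared inside compressed object streams.

```python
# Standard PDF stream filter names (from the PDF specification).
FILTERS = (
    "FlateDecode",     # zlib/deflate, general purpose
    "DCTDecode",       # JPEG
    "JPXDecode",       # JPEG 2000
    "JBIG2Decode",     # JBIG2, bitonal scans
    "CCITTFaxDecode",  # fax-style bitonal compression
    "LZWDecode",       # LZW
)


def count_filters(pdf_bytes):
    """Count raw occurrences of each filter name in the PDF's bytes.

    This is a heuristic, not a parser: it only sees names that appear
    uncompressed in the file.
    """
    return {name: pdf_bytes.count(b"/" + name.encode("ascii")) for name in FILTERS}


def count_filters_in_file(path):
    """Read a PDF from disk and count its filter names."""
    with open(path, "rb") as handle:
        return count_filters(handle.read())
```

Running `count_filters_in_file` on a copy of the file made before and after saving in DEVONthink (hypothetical file names) and comparing the two dictionaries would show whether the mix of filters changed.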

Per · December 27, 2020, 4:35pm

Hi @BLUEFROG !
I have more than 7GB of pdfs even when compressed, so I worry what that would mean if there is nothing to do to stop DEVONthink multiplying the size of each file by 7 or 8. The AI is very useful for me (particularly “see also”) and it seems to work on the compressed files, but it really cripples how I can use DEVONthink if there is no way to solve this issue. What do you think? Is there any solution?
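As a back-of-the-envelope check on that worry, the worst case can be worked out in a couple of lines. This assumes every file in the 7GB collection is eventually edited and re-saved and grows by the full factor of 7 or 8; files that are never changed and saved should stay as they are. The function name `projected_size_gb` is just for illustration.

```python
def projected_size_gb(current_gb, growth_factor):
    """Worst-case collection size if every file grows by growth_factor."""
    return current_gb * growth_factor


for factor in (7, 8):
    print(f"{factor}x growth: {projected_size_gb(7, factor)} GB")
# prints:
# 7x growth: 49 GB
# 8x growth: 56 GB
```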

BLUEFROG · December 27, 2020, 4:36pm

Development would have to weigh in on this. Bear in mind, we have people on year-end holiday right now. Thanks for your patience and understanding.

rmschne · December 27, 2020, 5:05pm

Maybe for now you could try a few experiments: import PDFs that have not had any treatment by Adobe for high compression and OCR-ing, and do all the OCR inside of DEVONthink. Or even try importing compressed PDFs from Adobe without OCR, then run OCR on them. See if the result is different. And confirm and test the OCR settings. Maybe the final size will be more acceptable from one or both of these experiments.
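To make experiments like these easier to compare, the sizes could be recorded before and after each one. The following is a minimal sketch under stated assumptions: the function names, the `min_factor` threshold of 1.5, and the folder path in the usage note are all invented for illustration, and it only looks at files whose names end in `.pdf`.

```python
from pathlib import Path


def snapshot_sizes(folder):
    """Map each PDF's path (relative to folder) to its size in bytes."""
    folder = Path(folder)
    return {
        str(path.relative_to(folder)): path.stat().st_size
        for path in sorted(folder.rglob("*.pdf"))
    }


def grown_files(before, after, min_factor=1.5):
    """Return (name, old_size, new_size, factor) for files that grew.

    Only files present in both snapshots are compared; the result is
    sorted with the largest growth factor first.
    """
    report = []
    for name, old_size in before.items():
        new_size = after.get(name)
        if new_size is None or old_size == 0:
            continue
        factor = new_size / old_size
        if factor >= min_factor:
            report.append((name, old_size, new_size, round(factor, 2)))
    return sorted(report, key=lambda row: row[3], reverse=True)
```

Usage would be along the lines of `before = snapshot_sizes("/path/to/indexed/folder")`, then running one experiment in DEVONthink, then `after = snapshot_sizes(...)` on the same folder and printing `grown_files(before, after)`.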

Per · December 27, 2020, 5:26pm

Yes, of course. There is no rush at all. I am simply excited by the mere glimpses I have had of DEVONthink’s capabilities and would very much like to integrate it into my workflow if possible. Fingers crossed that your team can figure out a solution to my issue when back. I will revisit this post in January. Thanks for your quick replies!

BLUEFROG · December 27, 2020, 5:30pm

You’re welcome

Per · December 27, 2020, 5:43pm

Good idea @rmschne ! I will do that and see what happens.

Per · December 27, 2020, 5:56pm

Thanks @Amontillado ! I have tried that to no avail. I hope I can find a way around this…

Per · December 29, 2020, 9:04am

@BLUEFROG,
I think your initial assessment of my issue is correct. When I annotate the compressed pdfs in Bookends, they expand much as they do in DEVONthink. It would obviously be great if third-party developers got the same tools for compression as Adobe, but it seems like I will be stuck with a huge database until then, or with the nuisance of either having to annotate the files in Adobe Acrobat or regularly re-compress them there after annotating them in DEVONthink or Bookends (which is my preferred workflow).