DEVONthink makes already-OCRed PDFs several times larger

Yes, I guess I mean that log. And the activity window. Both will help you. Only other thing I can suggest you look at is your settings for OCR in Preferences. Maybe DEVONthink believes these files are not yet OCR-ed, even though they seem to be (did you check that by doing a search?).

Others, as the earth turns and they wake up, will probably chip in with ideas and suggestions in due course.

Meantime, as you have hopefully cracked open the Help and/or Manual, have a read. Maybe something in there will give you a clue. It’s an outstanding Manual.

Both log and activity windows are empty, so whatever DEVONthink is doing to make the pdf-files 7-8 times bigger it doesn’t show up there. I have turned off the “convert incoming scans” just to be sure, even if I am not importing the files per se. Thanks for helping out! I will have a close read through the manual, but I do hope the collective intelligence of this excellent user forum can help me out.

When I add PDF files by dragging them into DT, they don’t change size. If I drag a PDF out to a different folder on my disk, “diff” shows the trip through DT didn’t change it.
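
For anyone without `diff` at hand, the same byte-for-byte check can be sketched in Python. This is only an illustration of the idea; the function names `sha256_of` and `same_bytes` are made up for the example, and the two paths are whatever copies you want to compare (the original and the one dragged back out of DT).

```python
import filecmp
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(original, round_tripped):
    """True if both files have identical content, like `diff` finding no difference."""
    # shallow=False forces a real content comparison instead of trusting size and mtime.
    return filecmp.cmp(original, round_tripped, shallow=False)
```

If `same_bytes` returns True (or the two digests match), the round trip really left the file untouched.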

However, when I OCR’d a 1.5 MB PDF, it grew to 16.3 MB, a factor of ~11. The original file was about 140 pages of almost nothing but text. Maybe the OCR process rewrites images at maximum quality.
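
To put numbers on the growth for more than one file, something like the following could work. It is a rough sketch that assumes the PDFs sit in an ordinary folder (for example exported copies or an indexed folder); `snapshot_sizes`, `growth_report` and the default threshold of 2.0 are names and values invented for this example.

```python
from pathlib import Path


def snapshot_sizes(folder):
    """Map each PDF's path (relative to `folder`) to its size in bytes."""
    root = Path(folder)
    return {str(p.relative_to(root)): p.stat().st_size for p in root.rglob("*.pdf")}


def growth_report(before, after, threshold=2.0):
    """Return (name, factor) pairs for files that grew by at least `threshold`, largest first."""
    grown = []
    for name, old_size in before.items():
        new_size = after.get(name)
        if new_size is None or old_size == 0:
            continue  # file is gone or was empty, so there is no meaningful ratio
        factor = new_size / old_size
        if factor >= threshold:
            grown.append((name, factor))
    return sorted(grown, key=lambda item: item[1], reverse=True)
```

Take one snapshot before OCR or annotation and one after, then pass both to `growth_report`. For the file above the factor is 16.3 / 1.5, or roughly 10.9.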

It might help to check a setting in DT’s preferences. On the top Apple menu bar when DEVONthink is active, go to DEVONthink 3 > Preferences.

From the preferences window, go into the OCR tab (you may have to click the >> button on the top row to see OCR).

Is “Convert incoming scans” set to “to searchable PDF”?

If so, change it to “No action.”

There is an option there to compress PDFs, but I’m not sure exactly what that does or how to do that in DT outside of an import. There is an explanation in the manual, but on a brief read I wasn’t sure if images were changed in any way.

Does it also happen if you open and save a copy of such a PDF in Preview.app?

Hi @pete31 !
I can open, highlight a sentence, and close the pdf in Preview or Adobe Acrobat without the file size increasing (more than a few bytes anyway). It only happens within DEVONthink itself.

I’m afraid in this case you’ll have to wait until the DEVONtech developers are back, as it’s probably something they have to look at; see this thread.

Welcome @Per

Note: Acrobat can use very aggressive compression in making PDFs so it’s not a reliable comparison. Not all OCR applications produce PDFs as small as it can.

DEVONthink uses Apple’s PDFKit, which applies different compression when saving a file. The file must be decompressed while it’s in use, and it is re-saved by PDFKit after changes are made.

Also, Preview does use PDFKit, but there are things it can do that aren’t available to third-party developers.
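
A rough way to see which compression two versions of the same PDF use is to count the stream filter names in the raw file. The sketch below is a heuristic, not a real PDF parser: `count_filters` is a name made up for this example, and it only sees filter entries that are stored uncompressed, so objects tucked away inside compressed object streams are missed.

```python
import re
from collections import Counter

# Standard stream filter names from the PDF specification.
FILTER_NAMES = (
    b"FlateDecode", b"DCTDecode", b"JPXDecode", b"JBIG2Decode",
    b"CCITTFaxDecode", b"LZWDecode", b"RunLengthDecode",
)
FILTER_PATTERN = re.compile(rb"/(" + b"|".join(FILTER_NAMES) + rb")\b")


def count_filters(pdf_bytes):
    """Count occurrences of each known filter name in the raw bytes of a PDF."""
    counts = Counter()
    for match in FILTER_PATTERN.finditer(pdf_bytes):
        counts[match.group(1).decode("ascii")] += 1
    return counts


def count_filters_in_file(path):
    """Convenience wrapper: read the whole file and count its filter names."""
    with open(path, "rb") as handle:
        return count_filters(handle.read())
```

Running `count_filters_in_file` on the small original and on the bloated copy may show the difference. For example, a scan stored with `JBIG2Decode` or `CCITTFaxDecode` in the small file that turns up as `DCTDecode` (JPEG) or `FlateDecode` in the large one would fit the explanation above.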

Hi @BLUEFROG !
I have more than 7 GB of PDFs even when compressed, so I worry what it would mean if nothing can be done to stop DEVONthink multiplying the size of each file by 7 or 8. The AI is very useful for me (particularly “see also”) and it seems to work on the compressed files, but it really cripples how I can use DEVONthink if there is no way to solve this issue. What do you think? Is there any solution?

Development would have to weigh in on this. Bear in mind, we have people on year-end holiday right now. Thanks for your patience and understanding.

Maybe for now you could try a few experiments: import PDFs that have not had any treatment by Adobe for high compression and OCR, and do all the OCR inside DEVONthink. Or try importing a compressed PDF from Adobe without OCR, then run OCR on that. See if the result is different. And confirm and test the OCR settings. Maybe the final size will be more acceptable from one or both of these experiments.

Yes, of course. There is no rush at all. I am simply excited by the mere glimpses I have got of DEVONthink’s capabilities and would very much like to integrate it into my workflow if possible. Fingers crossed that your team can figure out a solution to my issue when back. I will revisit this post in January. Thanks for your quick replies!

You’re welcome 🙂

Good idea @rmschne ! I will do that and see what happens.

Thanks @Amontillado ! I have tried that to no avail. I hope I can find a way around this…

@BLUEFROG,
I think your initial assessment of my issue is correct. When I annotate the compressed PDFs in Bookends, they expand much as they do in DEVONthink. It would obviously be great if third-party developers got the same compression tools as Adobe, but it seems I will be stuck with a huge database until then, or with the nuisance of either annotating the files in Adobe Acrobat or regularly re-compressing them there after annotating them in DEVONthink or Bookends (which is my preferred workflow).

Maybe that can be scripted?

That would be great @chrillek !
I am new to DEVONthink and have not worked with scripts, but that would solve my issue.

I don’t have Adobe Acrobat. There’s online documentation available at Adobe’s site and a quite old example elsewhere. I’m sure Google will provide more information on that.
Update: There’s also a thread here.

This is background information (or background noise). It won’t solve the problem, but it’s interesting.

You can create a low quality profile in the ColorSync utility.

Open the PDF in Preview and then choose File > Export (not “Export as PDF”, just “Export”).

Set the export format to PDF and choose your low quality ColorSync filter in the “quartz filter” field.

That will reduce the size of a bloated PDF some, but won’t get all the worms back in the small economy size can.

I keep wanting to look closer with Python’s PDF library, but between training, a panicked rush to find a job, and hipster sloth, I haven’t gotten to it yet.

It’d be good if you could provide an example of such a PDF.