I have a flight manual PDF I use for work that is just under 3,000 pages and 800 MB. It contains highlights and a few notes. It loads correctly on DT and DTTG, but when I make edits on my MacBook Pro they do not seem to sync with DTTG, and they also leave the “beachball” spinning for several minutes (worst case scenario). No other syncing issues with any files that I am aware of. Is there anything I can do to improve file performance in this case? I was considering using PDF compression…?
Give it a try. The best compression will be on the images. I have my PDF Pen compression set to reduce to 150 dpi if greater than that. Going down any more seems to affect the readability.
Edit: Also, you don’t mention how you are syncing. Bonjour is probably the quickest and most reliable.
PDF Compression may actually make the problem worse. Compression does not permanently reduce the size of a file. It does exactly what it says: it compresses it. What follows a compression? A decompression. In order to access compressed resources, they have to be decompressed. Decompression can also be slow.
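To make that concrete, here is a minimal Python sketch (the sample data and the use of zlib are only an illustration, not how any PDF engine actually handles its streams): compression shrinks what sits on disk, but the original data has to be rebuilt in full every time it is accessed.

```python
import zlib

data = b"the same page content repeated " * 10_000   # stand-in for a PDF content stream

packed = zlib.compress(data, level=9)    # smaller on disk...
unpacked = zlib.decompress(packed)       # ...but rebuilt in full on every access

print(len(data), len(packed))            # large vs. small
assert unpacked == data                  # nothing was removed, only repacked
```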
When the application is stalled, do a Spotlight search for Activity Monitor. Select the application in the list of processes - it should show “(Not Responding)” and the name in red - and press Command-Option-S to run a sample on it. When the sample window opens, press the Save button and save it to your Desktop. Please attach this text file to a support ticket so we can inspect it. Thanks!
I would appreciate it if you could elaborate. When I use PDF Pen to do what they call “optimise” and reduce all images to, say, 75 dpi, more often than not the images in the new file are unreadable. The files are often drastically smaller, which is my goal, but I have settled on 150 dpi to get good file-size reduction with acceptable image quality for my file archive.
Even when opening the new file in Preview or DEVONthink’s PDF viewer, the images are unreadable if 75 dpi is used. I always concluded that was a permanent compression that is not undone by any subsequent process to “decompress”. If there is a decompression as you say, wouldn’t the images be like they were before the optimisation (compression)?
Is nomenclature the issue? Is compression different from PDF Pen’s optimisation?
Yes, file reduction such as downsampling images is vastly different from compression. Downsampling is removing data. Compression is analyzing the data and storing it in a manner that takes less space. It’s like packing something more tightly: when you open it, the contents have to be unpacked. There are several compression algorithms for PDFs, some more aggressive / efficient than others.
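As a rough illustration of the difference, here is a minimal Pillow sketch (the 600 dpi source, the 150 dpi target, and the filenames are all placeholder assumptions): downsampling throws pixels away, and no later step can bring them back.

```python
# Downsampling removes pixels outright; nothing "decompresses" them back later.
# Sketch only: assumes Pillow is installed and "scan.png" is a hypothetical 600 dpi page image.
from PIL import Image

page = Image.open("scan.png")

scale = 150 / 600                        # target dpi / original dpi
smaller = page.resize(
    (int(page.width * scale), int(page.height * scale)),
    Image.LANCZOS,
)
smaller.save("scan_150dpi.png")

# Scaling back up only interpolates; the discarded detail is gone for good.
blurry = smaller.resize(page.size, Image.LANCZOS)
blurry.save("scan_upscaled_again.png")
```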
So yes, I downsample, and since these are documents for reading and not for publication, it works fine for me to keep huge PDFs smaller.
Another solution to dramatically reduce a PDF’s size is to use Abbyy MRC compression, which is more “vectorization” than “compression”. What it does is take the parts of the images recognized as text and vectorize them, making the text crisper (and slightly jagged at the corners, but not much). Personally, I like it more than normal graphic compression or downsampling because the text remains crisper and easier to read.
And yes, in this case the decompression takes a lot of time and power, and it is noticeable on older iThings and sometimes on the Mac. The advantage here is that the process is only used to display the PDF, and annotations are appended as normal at the end of the file. On my one-generation-old iThings, the delay when opening one of those MRC-compressed documents is not noticeable unless you scroll fast, in which case you must wait one or two seconds for the text of the current page to appear.
However, there is a second drawback: this option is only available in the paid Abbyy versions, and the macOS version performs a lot worse than the Windows one (worse in both speed and compression factor). Normally, on Windows, a 200 MB scanned text PDF is converted into 20 MB, or sometimes less if you blank the background. As an example, the scanned images of my latest book (“For us, the living”, Heinlein, from the Virginia Edition) are 250 MB, but the final PDF, generated without whitening the background, is 16 MB with the covers at full resolution.
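For what it’s worth, the layering idea behind MRC can be sketched roughly like this (a conceptual illustration only, not Abbyy’s actual pipeline; the filenames, threshold, and scale factor are made up):

```python
# Rough idea of MRC-style layering: a crisp 1-bit text mask plus a low-resolution background.
# Assumes Pillow is installed; "page.png" is a hypothetical scanned page.
from PIL import Image

page = Image.open("page.png").convert("L")            # greyscale scan

# Foreground: binarise so text edges stay sharp; 1-bit layers compress very well.
text_mask = page.point(lambda px: 0 if px < 128 else 255, mode="1")
text_mask.save("text_layer.png", optimize=True)

# Background: downsample aggressively, since it carries no fine detail.
background = page.resize((page.width // 4, page.height // 4), Image.LANCZOS)
background.save("background_layer.jpg", quality=40)

# A viewer then upscales the background and composites the text mask on top,
# which is why MRC pages take extra work to render but stay small on disk.
```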
I have an issue that may be related to MRC.
I am working with a lot of digitized historical publications. I prefer PDFs because no OCR is perfect, and I prefer to get them from the Internet Archive, which uses MRC and other compression algorithms to make its PDFs itty bitty. I generally go through and delete irrelevant articles and advertisements. Sometimes I will remove 50% of the pages and yet the PDF size will double. If the OCR gets corrupted and I reprocess it with DTP’s OCR, the file may double or triple again. I can use a compression tool like PDF Squeezer to get the size down, but every resave with these lossy tools degrades the images.
Can anyone help me understand the resave behavior of these files?
Is this related to MRC or am I looking in the wrong place?
Is there a better way to delete pages?
PDF engines vary in the types and amount of compression they apply (if any), so unless you were handling and saving the files with the same engine and the same settings, you won’t necessarily see smaller sizes. The closest thing to a user-definable setting is the resolution and compression settings if you redo OCR, but there is no built-in PDF compressor for general use.
Thank you for the reply. Your explanation makes perfect sense for the OCR step, but as best I can tell, when I delete pages nothing changes in the embedded (full-page) images elsewhere in the document.
Am I correct in inferring that compression is being applied to the entire PDF rather than to individual pages beyond the discrete images?
Do it with the iOS version of DTTG, or PDF Expert (both iOS and macOS), or PDF Viewer from PSPDFKit (the iOS version; they have a macOS version but I stopped using it because I couldn’t get used to the user interface).
Development could chime in here, but the compression is applied to the document.
Also, when you open a PDF, you are using the PDF framework of that application (Apple’s PDFKit in DEVONthink’s case). Compressed PDFs are decompressed automatically, and saving is done via the framework. So whatever compression is used by the framework is applied when saving the document. That includes any annotating, reordering or deleting of pages, etc. Look at the Info > Properties inspector and you’ll likely see information about the PDF engine that last saved the document.
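To illustrate how much the engine that performs the save matters, here is a hedged sketch using PyMuPDF as an example of a non-PDFKit engine (the filename and page range are placeholders): the same page deletion can land at very different sizes depending on what the save step is told to do.

```python
# Sketch: deleting pages with one specific engine (PyMuPDF) and controlling its save behaviour.
# Assumes PyMuPDF is installed (pip install pymupdf); "archive.pdf" is a hypothetical input.
import fitz  # PyMuPDF

doc = fitz.open("archive.pdf")
doc.delete_pages(from_page=3, to_page=5)    # drop some unwanted pages (0-based, inclusive)

# garbage=4 drops objects that are no longer referenced (e.g. images of deleted pages),
# and deflate recompresses the remaining streams; without options like these,
# a re-save can easily end up larger than the original.
doc.save("archive_trimmed.pdf", garbage=4, deflate=True)
doc.close()
```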
Can you share what the advantages of these approaches are? I do use DTTG in the archives but it’s hard to read these old files on my small phone, so this would be adding a step unless I get a tablet.
Thank you for confirming. The Internet Archive does use their own in-house PDF engine, and I do see it identified in the properties of the raw download. Perhaps Apple will be tinkering with PDFKit now that Live Text is fully rolled out…
Apple’s PDFKit has been languishing for many years. We remain guardedly hopeful they’ll address the issues it has.
I meant to remove unwanted pages, or add blank ones. Apple’s PDFKit in macOS is crap. None of those tools will increase the PDF size too much.
I had PDFs scanned and OCRed with Abbyy on Windows that, from the mere act of opening them in macOS Preview, grow from, say, 100 KB to 10 MB or more.
Can you break it into smaller pieces? Surely a 3,000-page document has chapters and other internal divisions? The few times I’ve used them, DT’s built-in Split PDF commands have worked pretty well.
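If the built-in commands don’t fit the workflow, the same split can also be sketched outside DEVONthink, for example with PyMuPDF (the filename and chapter boundaries below are invented placeholders):

```python
# Sketch: carve a large manual into chapter-sized PDFs with PyMuPDF.
# Assumes PyMuPDF is installed; "manual.pdf" and the page ranges are hypothetical.
import fitz  # PyMuPDF

chapters = {                                   # 0-based, inclusive page ranges
    "ch01_limitations": (0, 249),
    "ch02_normal_procedures": (250, 799),
    "ch03_emergency_procedures": (800, 1199),
}

src = fitz.open("manual.pdf")
for name, (first, last) in chapters.items():
    part = fitz.open()                         # new, empty PDF
    part.insert_pdf(src, from_page=first, to_page=last)
    part.save(f"{name}.pdf", garbage=4, deflate=True)
    part.close()
src.close()
```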