Sub-dividing a mega PDF

I frequently get “mega” PDFs (up to thousands of pages long) that are a conglomeration of dozens if not hundreds of individual documents. It is as if the owner took every piece of paper in a huge file folder and scanned them all into a single PDF.

For my business, I need to skim all, say, 4000 pages and pick out the half dozen documents that are important. I then make a DT folder with the original mega doc and the newly extracted half dozen sub-documents.

I’m looking for a way to semi-automate the process such that I can highlight, say, pages 21 through 56, extract them as a new document, give it a name and give it a red label. Then keep reading until I find the next important document and repeat the process.

Any ideas or directions in which you can point me???

Thanks in advance.
Larry

I have the same thing - massive PDFs are a pain.

I bookmark each individual document as I skim through in Acrobat DC. If you highlight a word (assuming the document is OCR’d) and press Cmd + B, Acrobat creates a bookmark with the highlighted text as the bookmark name, which can be useful; otherwise it defaults to “Untitled”.

You can then either split by bookmark in Acrobat (with each bookmark name as the file name) or split by “Chapter” in DTP (which helpfully sequentially numbers each split file).

It’s pretty quick once you get going. I tend to delete the split files I don’t need and rename what remains.

The reason for using Acrobat is that (AFAIK) DTP can’t create bookmarks/“Chapters”. PDF Expert can also create bookmarks (from recollection they call it creating a table of contents or something; what they call bookmarks are something different).

PS: I appreciate I haven’t answered your question about automating the process. I haven’t found a way to do that that works, so far.

I must confess, I didn’t know DTP had that feature. Given that it does, automating for it @LDunville should be fairly straightforward. I will have a play tomorrow.

I’m also certain you could automate this with Keyboard Maestro from outside DTP in any event.

Thanks v much, @MDMaynard.

Any input/ thoughts/ suggestions would be very welcome.

:slightly_smiling_face:

Why don’t you just drag and drop the selected page thumbnail(s) from the Content > Thumbnails inspector to the item list? This generates a new PDF from the dropped pages.

That’s exactly where I’m at currently. Unfortunately, that leaves me with having to drag and drop to DT, apply a new name, and affix a label.

If I could do this process while already in DT, the “mother” file is already located in the right location, therefore I could just apply a name and automatically apply a DT label.

This would greatly speed up a process that involves dozens if not hundreds of cycles.

What’s “the right location” ?

Thank you for the suggestion, Jim - I was sceptical at first but it works at least as well as the bookmark + split process I have been using for years in Acrobat.

This was one of the few remaining functions I have been regularly using Acrobat for (at $20 or so per month). DT clearly doesn’t charge enough.

I note DTP even gives the extracted file a name based on the first line of OCR’d text - nice! :grinning:

@LDunville - I don’t at the moment see how you could automate the page selection process (because it involves a human judgment) but if you open a second window for the destination group you want to the right of the “document review” window, once you’ve selected the pages to extract in the thumbnail view in the inspector it’s not too much bother to Option + drag them into the “destination group” window you want (if that makes sense).

It should also be possible to write a script (maybe in combination with Keyboard Maestro, as @MDMaynard said) to copy the pages, paste them in the group you want and use the display name editor to input the text you want for the name and possibly the label, which should speed up the process a bit. I have something similar which I use to rename and date documents once split, but I work a bit differently from you in that I do the splitting/extracting in one go and then the renaming/redating/labelling as a second process.

You’re very welcome! :slight_smile:
After years and years on Macs, drag and drop is instinctive to me, so I found the feature quite easily. It’s actually been around since the 2.x days at least!

You could use Ghostscript to extract page ranges. See here:
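
For anyone who wants to try this, here is a minimal sketch of the kind of Ghostscript call meant, wrapped in Python. It assumes Ghostscript is installed and available as `gs` on the PATH; the file names `mega.pdf` and `contract.pdf` and the helper names are made up for illustration. It only cuts out the pages; naming and labelling the result in DT would still be a separate step.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def gs_extract_command(source, first_page, last_page, output, gs_binary="gs"):
    """Build the Ghostscript arguments that copy pages first..last (1-based, inclusive)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("page range must satisfy 1 <= first_page <= last_page")
    return [
        gs_binary,
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-q",  # run non-interactively and quietly
        "-sDEVICE=pdfwrite",                      # write the selected pages to a new PDF
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={output}",
        str(source),
    ]


def extract_pages(source, first_page, last_page, output):
    """Run Ghostscript and return the path of the extracted PDF."""
    cmd = gs_extract_command(source, first_page, last_page, output)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("Ghostscript ('gs') was not found on PATH")
    subprocess.run(cmd, check=True)
    return Path(output)


# Show the command for pages 21 through 56 without running it.
# The file names are hypothetical.
print(shlex.join(gs_extract_command("mega.pdf", 21, 56, "contract.pdf")))
```

Calling `extract_pages("mega.pdf", 21, 56, "contract.pdf")` would then actually run the command and write the new file next to the script.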

For the OCR part, I’m using ABBYY, whose CLI version I wrapped with a little Java script:

Dockerized version is here:

It needs an ABBYY license and a little technical knowledge to use, but it allows me to run any PDF through OCR (and, using PDF metadata, to make sure it is only done once).

There is a better option in DTP though; you can just set up a smart rule as shown here:

Screenshot 2022-01-07 at 14.15.09

The funny “Word Count is less than 1” condition is there to exclude documents that have already been run through OCR.
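
The idea behind that condition can be sketched outside DTP as well. This is only an illustration of the logic, not how DTP computes its word count: the function names are hypothetical, and it assumes you already have the PDF's text layer as a string (empty or missing for a pure scan).

```python
import re


def word_count(text_layer):
    """Count word-like tokens in a PDF's text layer (None or empty means no text)."""
    return len(re.findall(r"\w+", text_layer or ""))


def needs_ocr(text_layer):
    """Mirror the 'Word Count is less than 1' test: no words means not yet OCR'd."""
    return word_count(text_layer) < 1
```

A scanned page with no text layer passes the test and gets OCR'd; once it has recognised text, its word count is 1 or more and the rule skips it on later runs.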
