Split PDFs - Custom Break

mmoren10 · June 17, 2022, 3:44pm

Hey! The Split PDF into Chapters command is extremely useful when the PDF has properly defined chapters.

Is there a way to script something that takes a PDF and splits it into several sections at a specified text?

Use case: I receive a single PDF file made up of a bunch of merged documents, with no chapters. Each of the merged documents has the same predictable structure: it starts and ends with a specific line of text. I want to automatically separate the single PDF into several files, using that line of text as the break point.

Thanks!

Blanc · June 17, 2022, 4:06pm

I’m pretty sure @pete31 wrote a script to do exactly that. Have you given search a try?

mmoren10 · June 17, 2022, 4:33pm

Thanks for the heads up! I did try searching, but couldn’t find anything immediately applicable to PDFs (not an expert on scripting).

I found some scripts by @pete31 that split RTFs. I guess I will have to combine a few scripts and try to adjust them to my needs. I would appreciate it if anyone could point me in the right direction. Thanks!

Blanc · June 17, 2022, 5:19pm

Sorry, in that case my memory may have tricked me. Please excuse.

BLUEFROG · June 17, 2022, 5:38pm

No, this is not possible at this time and considering PDFs are not text-based, this would not be a simple thing to do.

chrillek · June 17, 2022, 5:47pm

I doubt that that’s feasible at all. Several reasons:

1. Splitting “at a specified text” requires the text to be identifiable in the PDF. But even if you have a PDF with a text layer, the underlying PDF consists only of graphics commands like goto x,y; (some); goto (x + lengthOf('some')), y; (text) (in pseudo code). How could any piece of software find the text you’re looking for in this situation?
2. Even if it were possible to locate the text, the software would then have to
- start a new page
- gather all the PDF commands following the text and
- put them on this new page, until
- the page is full.

Wait: how exactly would the software know that? Simple answer: it can’t. In PostScript, every page is terminated with a showpage command (that is real code), which causes the page to be actually printed and the state of the interpreter to be reset; a PDF page is likewise a self-contained unit with its own content stream(s). Now, in your scenario, the program or script would have to continue past that page boundary and keep executing commands until the page is full. It can only do that if it follows what the commands do to the current position, which means that the program would have to understand the commands, which in turn would make it into a PDF interpreter.

These interpreters actually do exist in software (Ghostscript being a free example). But for the first of the two reasons given before, I doubt that they’ll be able to split a document reliably at arbitrary text.
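To illustrate the first reason with made-up data: the sketch below assumes a page's text arrives as separate (x, y, fragment) drawing commands in arbitrary order, which is a simplification of what a real PDF contains. The function name `reconstruct_lines` and the sample coordinates are hypothetical, and joining fragments with a single space is a naive assumption.

```python
from itertools import groupby


def reconstruct_lines(commands):
    """Rebuild lines of text from (x, y, fragment) drawing commands."""
    # Top of the page first (larger y), then left to right within a line.
    ordered = sorted(commands, key=lambda c: (-c[1], c[0]))
    # Fragments sharing the same y are assumed to sit on the same line.
    return [" ".join(frag for _, _, frag in group)
            for _, group in groupby(ordered, key=lambda c: c[1])]


# Hypothetical commands; "some" and "text" are drawn separately.
commands = [(130, 700, "text"), (100, 700, "some"), (100, 680, "next line")]
needle = "some text"

# No single drawing command contains the phrase ...
print(any(needle in frag for _, _, frag in commands))
# ... but it appears once the fragments are put back in reading order.
print(any(needle in line for line in reconstruct_lines(commands)))
```

A search has to work on the reconstructed text, not on the individual commands, which is why a match does not map neatly onto the PDF's internal structure.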

I may of course be wrong. So just check out Apple’s PDFKit framework. Perhaps it gives you the tools to do what you want.

https://developer.apple.com/documentation/pdfkit?language=objc

But from what I understand, you might be able to find the text, but the match is not related to the underlying PDF structure.

And then there’s this

https://developer.apple.com/documentation/coregraphics/cgpdfscanner/

BLUEFROG · June 17, 2022, 5:55pm

Like I said… “not [be] a simple thing to do”

Blanc · June 17, 2022, 6:10pm

Now I feel really silly for having posted before thinking more. Even @pete31 only sometimes works real magic

pete31 · June 17, 2022, 6:28pm

It’s possible to write such a script, but I don’t have time now. Next week, I think.

mmoren10 · June 17, 2022, 7:27pm

Thanks everyone! This might be a stupid question given the emphasized complexity of the matter, but as a workaround, would it be easier to first bookmark (or whatever the PDF chapters are called) those pages where the expected line of text is found; and then use the native Split PDF into Chapters?

mmoren10 · June 17, 2022, 8:49pm

Just to be clear, no break/delimiter would be in the middle of a page. The first/last page of the resulting PDFs would just be the one where the specified strings are found. And the PDFs do have a text layer.

Again, as no expert, this might be a dumb suggestion (just like the “bookmark” one above), but I reckon something like this might be feasible: 1. Identify the pages where the specified string is found (ex. Page 1, 16, 38…); and 2. Print/export the resulting page intervals (ex. File 1: From page 1 to 15; File 2: From page 16 to 37; File 3: From page 38 to…).
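The second step is plain page arithmetic once the matching pages are known. A minimal sketch in Python, assuming the 1-based page numbers from step 1 were found by some other means and that each match marks the first page of a file, as in the example above. The function name `ranges_from_start_pages` and the total of 50 pages are made up for illustration.

```python
def ranges_from_start_pages(start_pages, page_count):
    """Turn 1-based start pages into inclusive (first, last) page ranges."""
    starts = sorted(set(start_pages))  # one entry per page, ascending
    ranges = []
    for i, first in enumerate(starts):
        # Each range ends right before the next start page, or at the end.
        last = starts[i + 1] - 1 if i + 1 < len(starts) else page_count
        ranges.append((first, last))
    return ranges


# Start pages from the example above; 50 is an assumed page count.
print(ranges_from_start_pages([1, 16, 38], 50))
```

Duplicate hits on the same page are collapsed first, so a string that appears twice on one page does not produce an empty file.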

Thx again!

pete31 · June 17, 2022, 9:02pm

Yes. Could you upload an example PDF?

mmoren10 · June 17, 2022, 9:20pm

Here you go. The last page of each resulting PDF would have this string:

https://procesojudicial.ramajudicial.gov.co/FirmaElectronica

The first instance is on page 17.

Thanks again!

PROVIDENCIAS E-73 ABRIL 29 DE 2022.pdf (9.1 MB)

pete31 · June 21, 2022, 11:14pm

This script splits PDFs at a delimiter (see property theDelimiter).

Resulting PDFs are created in the original record’s location.

-- Split PDF at delimiter

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz" -- PDFKit (PDFDocument) is part of Quartz
use scripting additions

property theDelimiter : "https://procesojudicial.ramajudicial.gov.co/FirmaElectronica"

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some PDF records."
		show progress indicator "Split PDF at delimiter... " steps (count theRecords) with cancel button
		
		repeat with thisRecord in theRecords
			set thisRecord_Type to (type of thisRecord) as string
			if thisRecord_Type is in {"PDF document", "«constant ****pdf »"} then
				set thisRecord_NameWithoutExtension to name without extension of thisRecord
				step progress indicator "... " & thisRecord_NameWithoutExtension
				set thisRecord_Path to path of thisRecord
				set thisRecord_LocationGroup to location group of thisRecord
				set thisRecord_URL to URL of thisRecord
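				set theTempDirectoryURL to my splitPDFatDelimiter(thisRecord_Path, thisRecord_NameWithoutExtension, thisRecord_LocationGroup, thisRecord_URL)
				-- Clean up per record: every split creates its own temp directory, and nothing is defined if no PDF was selected
				my deleteTempDirectory(theTempDirectoryURL)
			end if
		end repeat
		
		hide progress indicator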
		
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on splitPDFatDelimiter(theRecord_Path, theRecord_NameWithoutExtension, theRecord_LocationGroup, theRecord_URL)
	try
		set thePDF to current application's PDFDocument's alloc()'s initWithURL:(current application's |NSURL|'s fileURLWithPath:theRecord_Path)
		
		set theResultSelections to (thePDF's findString:theDelimiter withOptions:0)
		set theResultSelections_Count to theResultSelections's |count|()
		if theResultSelections_Count = 0 then error "This PDF doesn't contain delimiter \"" & theDelimiter & "\""
		
		set theNextFirstPage_Index to missing value
		set theTempDirectoryURL to my createTempDirectory()
		set thisPrefix to 0
		
		repeat with i from 0 to (theResultSelections_Count - 1)
			set thisResultSelection to (theResultSelections's objectAtIndex:i)
			set thisResultSelection_Page_Index to (thePDF's indexForPage:(thisResultSelection's pages()'s firstObject())) as integer -- "indexForPage:" is zero based like "pageAtIndex:"; page labels are not always numeric
			
			if i = 0 then
				set thisPDF_FirstPage_Index to 0
			else
				set thisPDF_FirstPage_Index to theNextFirstPage_Index
			end if
			set thisPDF_LastPage_Index to thisResultSelection_Page_Index
			
			set thisPDF_FirstPage to (thePDF's pageAtIndex:thisPDF_FirstPage_Index)
			set thisPDF_FirstPage_Data to thisPDF_FirstPage's dataRepresentation()
			set thisPDF to (current application's PDFDocument's alloc()'s initWithData:thisPDF_FirstPage_Data)
			
			set thisPDF_CurrentLastPage_Index to thisPDF_FirstPage_Index as integer
			
			repeat with j from 1 to ((thisPDF_LastPage_Index as integer) - (thisPDF_FirstPage_Index as integer)) -- "j" avoids reusing the outer loop variable "i"
				set thisPDF_CurrentLastPage_Index to thisPDF_CurrentLastPage_Index + 1
				(thisPDF's insertPage:(thePDF's pageAtIndex:thisPDF_CurrentLastPage_Index) atIndex:(thisPDF's |pageCount|()))
			end repeat
			
			set thisTempURL to ((theTempDirectoryURL's URLByAppendingPathComponent:(current application's NSProcessInfo's processInfo()'s globallyUniqueString()))'s URLByAppendingPathExtension:"pdf")
			(thisPDF's writeToURL:thisTempURL)
			set thisTempPath to (thisTempURL's |path|()) as string
			
			tell application id "DNtp"
				set thisPrefix to thisPrefix + 1
				set thisImportedRecord_Name to theRecord_NameWithoutExtension & space & "-" & space & (thisPrefix as string)
				set thisImportedRecord to import thisTempPath name thisImportedRecord_Name to theRecord_LocationGroup
				set URL of thisImportedRecord to theRecord_URL
			end tell
			
			set theNextFirstPage_Index to thisResultSelection_Page_Index + 1
		end repeat
		
		return theTempDirectoryURL
		
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"splitPDFatDelimiter\"" message error_message as warning
		try
			my deleteTempDirectory(theTempDirectoryURL)
		end try
		error number -128
	end try
end splitPDFatDelimiter

on createTempDirectory()
	try
		set theTempDirectoryURL to current application's |NSURL|'s fileURLWithPath:((current application's NSTemporaryDirectory())'s stringByAppendingPathComponent:("_Script - Split PDF at delimiter" & space & (current application's NSProcessInfo's processInfo()'s globallyUniqueString())))
		set {successCreateDir, theError} to current application's NSFileManager's defaultManager()'s createDirectoryAtURL:theTempDirectoryURL withIntermediateDirectories:false attributes:(missing value) |error|:(reference)
		if theError ≠ missing value then error (theError's localizedDescription() as string)
		return theTempDirectoryURL
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"createTempDirectory\"" message error_message as warning
		error number -128
	end try
end createTempDirectory

on deleteTempDirectory(theTempDirectoryURL)
	try
		set {successDeleteDir, theError} to (current application's NSFileManager's defaultManager()'s removeItemAtURL:(theTempDirectoryURL) |error|:(reference))
		if theError ≠ missing value then error (theError's localizedDescription() as string)
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"deleteTempDirectory\"" message error_message as warning
		error number -128
	end try
end deleteTempDirectory

mmoren10 · July 14, 2022, 5:25pm

This is awesome! Works perfectly. Thanks @pete31 !!!

mmoren10 · October 12, 2022, 12:43am

This has been extremely useful! Could @pete31 or anyone else point me in the right direction on how to change the script so that the page with theDelimiter is not the last one but rather the first one of each resulting PDF? I tried a few changes with no luck. Thanks again for everyone’s help!

pete31 · October 13, 2022, 8:40pm

Well …

Writing and testing a script is fun, but trying to re-create another user’s setup is not.

mmoren10 · October 13, 2022, 9:10pm

Sorry if I came across that way. It was not my intention. You have been extremely generous with your time and knowledge, and your previous help has been invaluable.

I thought it might be a change as simple as changing the theDelimiter string, but I might have been fooled. Thought there might be a similar variable that I could just modify to make it work, but couldn’t make out which one controlled the split. Apparently it is much more complex, so no worries!

Thanks again!

pete31 · October 13, 2022, 9:35pm

No! No worries!

Really, no idea (yet). The point is: although some users seem to think other users just know how to do something, most of the time it really is just trying, trying, trying.

So, what I meant was: please upload a new example PDF, as it’s really “hard” to imagine what the input looks like. Or, to put it another way: without explicit input, the output may not be what you’re looking for, which in turn would waste lifetime … There was definitely no pun intended.

mmoren10 · October 13, 2022, 11:11pm

Got it! Your generosity is definitely unmatched!

Although I can’t upload the specific documents to the internet, this PDF recreates the problem:

The document has several chapters (3 in the example). I want to split the document into chapters to improve DT search results and make it easier to manage.
The first page of each chapter has the same text as a footnote: in the example, “Proceso 31-2891” (theDelimiter). The other pages do not have this footnote.
So essentially, the first page of each of the resulting PDFs would be the page where theDelimiter was found, whereas the last page would be the page right before the next matching page (or the end of the document).

In the example PDF, the result would be 3 PDFs with the following pages:
First. p. 1 to 4.
Second. p. 5 to 8.
Third. p. 9 to 12.
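One way to look at the change: the existing script cuts after a given last page, so first-page matches can be translated into last pages before splitting. A minimal sketch in Python of that translation only, assuming the first match sits on page 1 and the document has 12 pages; the function name `last_pages_from_first_pages` is hypothetical and this is not the script's actual code.

```python
def last_pages_from_first_pages(first_pages, page_count):
    """Map 1-based 'first page' matches to the last page of each chunk."""
    firsts = sorted(set(first_pages))
    # A chunk ends on the page before the next match;
    # the final chunk ends with the document.
    return [nxt - 1 for nxt in firsts[1:]] + [page_count]


firsts = [1, 5, 9]  # pages carrying the footnote in the example PDF
lasts = last_pages_from_first_pages(firsts, 12)
print(list(zip(firsts, lasts)))
```

If the first match were not on page 1, the pages before it would belong to no chunk here and would need a decision of their own.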

Proceso 31-2891.pdf (112.7 KB)

Thanks again!