Split PDFs - Custom Break

Hey! The Split PDF into Chapters command is extremely useful when the PDF has properly defined chapters.

Is there a way to script something that takes a PDF and splits it into several sections at a specified text?

Use case: I receive a single PDF file with a bunch of documents merged; no chapters. Each of the merged documents has the same predictable structure: it starts and ends with a specific line of text. I want to automatically separate the single PDF file into several files, using that line of text as the breaking point.

Thanks!

I’m pretty sure @pete31 wrote a script to do exactly that. Have you given search a try?

Thanks for the heads up! I did try searching, but couldn’t find anything immediately applicable to PDFs (not an expert on scripting).

Found some to split RTFs by @pete31. I guess I will have to combine a few scripts and try to adjust them to my needs. Would appreciate if anyone could point me in the right direction. Thanks!

Sorry, in that case my memory may have tricked me. Please excuse.

No, this is not possible at this time, and considering that PDFs are not text-based, it would not be a simple thing to do.

I doubt that that’s feasible at all. Several reasons:

  • Splitting “at a specified text” requires the text to be identifiable in the PDF. But even if you have a PDF with a text layer, the underlying PDF consists only of graphics commands like
    goto x,y; (some); goto (x + lengthOf('some')), y; (text)
    (in pseudocode). How could any piece of software find the text you’re looking for in this situation?
  • Even if it were possible to locate the text, the software would then have to
    • start a new page
    • gather all the PDF commands following the text and
    • put them on this new page, until
    • the page is full.

Wait – how exactly would the software know that? Simple answer: It can’t. Every page is a closed unit: in PostScript it is terminated with a showpage command (that is real code), which causes the page to be actually printed and the state of the interpreter to be reset, and a PDF page is likewise a separate object with its own content stream. Now, in your scenario, the program or script would have to ignore that page boundary and continue to execute commands until the page is full. It can only do that if it tracks what the commands do to the current position, which means that the program would have to understand the commands, which in turn would make it a PDF interpreter.

These interpreters actually do exist in software (Ghostscript being a free example). But for the first of the two reasons given before, I doubt that they’ll be able to split a document reliably at arbitrary text.

I may of course be wrong. So just check out Apple’s PDFKit framework. Perhaps it gives you the tools to do what you want.

https://developer.apple.com/documentation/pdfkit?language=objc

But from what I understand, you might be able to find the text, but it’s not related to the underlying PDF structure.

And then there’s this

https://developer.apple.com/documentation/coregraphics/cgpdfscanner/

Like I said… “not [be] a simple thing to do” :wink:

Now I feel really silly for having posted before thinking more. Even @pete31 only sometimes works real magic :see_no_evil:

It’s possible to write such a script, but I don’t have time now. Next week, I think.

Thanks everyone! This might be a stupid question given the emphasized complexity of the matter, but as a workaround, would it be easier to first bookmark (or whatever the PDF chapters are called) those pages where the expected line of text is found; and then use the native Split PDF into Chapters?

Just to be clear, no break/delimiter would be in the middle of a page. The first/last page of the resulting PDFs would just be the one where the specified strings are found. And the PDFs do have a text layer.

Again, as a non-expert, this might be a dumb suggestion (just like the “bookmark” one above), but I reckon something like this might be feasible: 1. Identify the pages where the specified string is found (e.g. pages 1, 16, 38…); and 2. Print/export the resulting page intervals (e.g. File 1: from page 1 to 15; File 2: from page 16 to 37; File 3: from page 38 to…).
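
The two steps described above (find the pages, then cut at the intervals) boil down to simple page arithmetic. Here is a minimal sketch in plain AppleScript, assuming the string marks the first page of each document as in the example. The handler name pageRangesFromStartPages and the total of 50 pages are made up for illustration.

```applescript
-- Turn the pages on which a document starts into {first, last} page ranges.
-- startPages: 1-based page numbers in ascending order, e.g. {1, 16, 38}
-- pageCount: total number of pages in the merged PDF (hypothetical value below)
on pageRangesFromStartPages(startPages, pageCount)
	set theRanges to {}
	repeat with i from 1 to (count startPages)
		set firstPage to item i of startPages
		if i < (count startPages) then
			-- A document ends on the page before the next one starts
			set lastPage to (item (i + 1) of startPages) - 1
		else
			-- The last document runs to the end of the file
			set lastPage to pageCount
		end if
		set end of theRanges to {firstPage, lastPage}
	end repeat
	return theRanges
end pageRangesFromStartPages

pageRangesFromStartPages({1, 16, 38}, 50)
```

With the example pages and an assumed 50-page file, this returns {{1, 15}, {16, 37}, {38, 50}}. Finding the pages themselves is the part that needs a PDF library.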

Thx again!

Yes. Could you upload an example PDF?

Here you go. The last page of each resulting PDF would have this string:

https://procesojudicial.ramajudicial.gov.co/FirmaElectronica

The first instance is on page 17.

Thanks again!

PROVIDENCIAS E-73 ABRIL 29 DE 2022.pdf (9.1 MB)

This script splits PDFs at a delimiter (see property theDelimiter).

Resulting PDFs are created in the original record’s location.

-- Split PDF at delimiter

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz" -- PDFKit (PDFDocument) is part of Quartz
use scripting additions

property theDelimiter : "https://procesojudicial.ramajudicial.gov.co/FirmaElectronica"

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some PDF records."
		show progress indicator "Split PDF at delimiter... " steps (count theRecords) as string with cancel button
		
		repeat with thisRecord in theRecords
			set thisRecord_Type to (type of thisRecord) as string
			if thisRecord_Type is in {"PDF document", "«constant ****pdf »"} then
				set thisRecord_NameWithoutExtension to name without extension of thisRecord
				step progress indicator "... " & thisRecord_NameWithoutExtension
				set thisRecord_Path to path of thisRecord
				set thisRecord_LocationGroup to location group of thisRecord
				set thisRecord_URL to URL of thisRecord
				set theTempDirectoryURL to my splitPDFatDelimiter(thisRecord_Path, thisRecord_NameWithoutExtension, thisRecord_LocationGroup, thisRecord_URL)
				my deleteTempDirectory(theTempDirectoryURL) -- every call creates its own temp directory, so remove it right away
			end if
		end repeat
		
		hide progress indicator
		
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on splitPDFatDelimiter(theRecord_Path, theRecord_NameWithoutExtension, theRecord_LocationGroup, theRecord_URL)
	try
		set thePDF to current application's PDFDocument's alloc()'s initWithURL:(current application's |NSURL|'s fileURLWithPath:theRecord_Path)
		
		set theResultSelections to (thePDF's findString:theDelimiter withOptions:0)
		set theResultSelections_Count to theResultSelections's |count|()
		if theResultSelections_Count = 0 then error "This PDF doesn't contain delimiter \"" & theDelimiter & "\""
		
		set theNextFirstPage_Index to missing value
		set theTempDirectoryURL to my createTempDirectory()
		set thisPrefix to 0
		
		repeat with i from 0 to (theResultSelections_Count - 1)
			set thisResultSelection to (theResultSelections's objectAtIndex:i)
			set thisResultSelection_Page_Index to ((thePDF's indexForPage:(thisResultSelection's pages()'s firstObject())) as integer) -- zero-based, like "pageAtIndex"; page labels are not guaranteed to be numeric
			
			if i = 0 then
				set thisPDF_FirstPage_Index to 0
			else
				set thisPDF_FirstPage_Index to theNextFirstPage_Index
			end if
			set thisPDF_LastPage_Index to thisResultSelection_Page_Index
			
			set thisPDF_FirstPage to (thePDF's pageAtIndex:thisPDF_FirstPage_Index)
			set thisPDF_FirstPage_Data to thisPDF_FirstPage's dataRepresentation()
			set thisPDF to (current application's PDFDocument's alloc()'s initWithData:thisPDF_FirstPage_Data)
			
			set thisPDF_CurrentLastPage_Index to thisPDF_FirstPage_Index as integer
			
			repeat with j from 1 to ((thisPDF_LastPage_Index as integer) - (thisPDF_FirstPage_Index as integer)) -- "j" avoids reusing the outer loop's "i"
				set thisPDF_CurrentLastPage_Index to thisPDF_CurrentLastPage_Index + 1
				(thisPDF's insertPage:(thePDF's pageAtIndex:thisPDF_CurrentLastPage_Index) atIndex:(thisPDF's |pageCount|()))
			end repeat
			
			set thisTempURL to ((theTempDirectoryURL's URLByAppendingPathComponent:(current application's NSProcessInfo's processInfo()'s globallyUniqueString()))'s URLByAppendingPathExtension:"pdf")
			(thisPDF's writeToURL:thisTempURL)
			set thisTempPath to (thisTempURL's |path|()) as string
			
			tell application id "DNtp"
				set thisPrefix to thisPrefix + 1
				set thisImportedRecord_Name to theRecord_NameWithoutExtension & space & "-" & space & (thisPrefix as string)
				set thisImportedRecord to import thisTempPath name thisImportedRecord_Name to theRecord_LocationGroup
				set URL of thisImportedRecord to theRecord_URL
			end tell
			
			set theNextFirstPage_Index to thisResultSelection_Page_Index + 1
		end repeat
		
		return theTempDirectoryURL
		
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"splitPDFatDelimiter\"" message error_message as warning
		try
			my deleteTempDirectory(theTempDirectoryURL)
		end try
		error number -128
	end try
end splitPDFatDelimiter

on createTempDirectory()
	try
		set theTempDirectoryURL to current application's |NSURL|'s fileURLWithPath:((current application's NSTemporaryDirectory())'s stringByAppendingPathComponent:("_Script - Split PDF at delimiter" & space & (current application's NSProcessInfo's processInfo()'s globallyUniqueString())))
		set {successCreateDir, theError} to current application's NSFileManager's defaultManager()'s createDirectoryAtURL:theTempDirectoryURL withIntermediateDirectories:false attributes:(missing value) |error|:(reference)
		if theError ≠ missing value then error (theError's localizedDescription() as string)
		return theTempDirectoryURL
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"createTempDirectory\"" message error_message as warning
		error number -128
	end try
end createTempDirectory

on deleteTempDirectory(theTempDirectoryURL)
	try
		set {successDeleteDir, theError} to (current application's NSFileManager's defaultManager()'s removeItemAtURL:(theTempDirectoryURL) |error|:(reference))
		if theError ≠ missing value then error (theError's localizedDescription() as string)
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"deleteTempDirectory\"" message error_message as warning
		error number -128
	end try
end deleteTempDirectory
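
The page arithmetic in splitPDFatDelimiter can also be looked at in isolation. Here is a rough sketch in plain AppleScript, assuming the delimiter sits on the last page of each document. The handler name pageRangesFromEndPages and the sample values (other than page 17) are hypothetical. It skips repeated hits on the same page and keeps any pages after the last delimiter as a final range, neither of which the script above appears to do; both are assumptions about what one might want.

```applescript
-- Turn the pages on which a document ends into {first, last} page ranges.
-- hitPages: 1-based page numbers where the delimiter was found, ascending;
--           may contain repeats if the delimiter occurs twice on one page
-- pageCount: total number of pages in the merged PDF (hypothetical value below)
on pageRangesFromEndPages(hitPages, pageCount)
	set theRanges to {}
	set firstPage to 1
	repeat with i from 1 to (count hitPages)
		set lastPage to item i of hitPages
		-- Ignore a hit on a page that already closed the previous range
		if lastPage ≥ firstPage then
			set end of theRanges to {firstPage, lastPage}
			set firstPage to lastPage + 1
		end if
	end repeat
	-- Pages after the final delimiter form one last range
	if firstPage ≤ pageCount then set end of theRanges to {firstPage, pageCount}
	return theRanges
end pageRangesFromEndPages

pageRangesFromEndPages({17, 17, 30}, 32)
```

For these sample values the result is {{1, 17}, {18, 30}, {31, 32}}. Subtract 1 from each number to get the zero-based indexes that pageAtIndex expects.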
