Split PDFs - Custom Break

Sorry, in that case my memory may have tricked me. Please excuse.

No, this is not possible at this time, and considering that PDFs are not text-based, this would not be a simple thing to do.

I doubt that that’s feasible at all. Several reasons:

  • Splitting “at a specified text” requires the text to be identifiable in the PDF. But even if you have a PDF with a text layer, the underlying PDF consists only of graphics commands like (in pseudo code)
    goto x,y; (some); goto (x + lengthOf('some')), y; (text)
    How could any piece of software find the text you’re looking for in this situation?
  • Even if it were possible to locate the text, the software would then have to
    • start a new page
    • gather all the PDF commands following the text and
    • put them on this new page, until
    • the page is full.

Wait, how exactly would the software know that? Simple answer: it can’t. In PostScript, every page is terminated with a showpage operator (that is real code), which causes the page to be printed and the state of the interpreter to be reset. PDF has no showpage, but each page is likewise a self-contained object with its own content stream. Now, in your scenario, the program or script would have to carry on past that page boundary and continue to execute commands until the page is full. It can only do that if it tracks what the commands do to the current position. That means the program would have to understand the commands, which in turn would make it a PDF interpreter.

These interpreters actually do exist in software (Ghostscript being a free example). But for the first of the two reasons given before, I doubt that they’ll be able to split a document reliably at arbitrary text.

I may of course be wrong. So just check out Apple’s PDFKit framework. Perhaps it gives you the tools to do what you want.

https://developer.apple.com/documentation/pdfkit?language=objc

But from what I understand, you might be able to find the text, but it’s not related to the underlying PDF structure.

And then there’s this

https://developer.apple.com/documentation/coregraphics/cgpdfscanner/

Like I said… “not [be] a simple thing to do” :wink:

Now I feel really silly for having posted before thinking more. Even @pete31 only sometimes works real magic :see_no_evil:

It’s possible to write such a script, but I don’t have time now. Next week, I think.

Thanks everyone! This might be a stupid question given the emphasized complexity of the matter, but as a workaround, would it be easier to first bookmark (or whatever the PDF chapters are called) those pages where the expected line of text is found; and then use the native Split PDF into Chapters?

Just to be clear, no break/delimiter would be in the middle of a page. The first/last page of the resulting PDFs would just be the one where the specified strings are found. And the PDFs do have a text layer.

Again, as no expert, this might be a dumb suggestion (just like the “bookmark” one above), but I reckon something like this might be feasible: 1. Identify the pages where the specified string is found (e.g. Page 1, 16, 38…); and 2. Print/export the resulting page intervals (e.g. File 1: From page 1 to 15; File 2: From page 16 to 37; File 3: From page 38 to…).
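The second step of this idea is plain arithmetic on page numbers. As a rough sketch only: the handler name pageRangesStartingAt and its parameters are made up for illustration, the hit pages are assumed to be sorted and unique, and the total of 50 pages is an assumption, since the post leaves the end of the last file open.

```applescript
-- Sketch: turn the pages where the string was found into page intervals,
-- with each hit page starting a new file. Page numbers are 1-based.
-- "pageRangesStartingAt" and its parameters are hypothetical names.
on pageRangesStartingAt(theHitPages, thePageCount)
	set theRanges to {}
	set theHitCount to count theHitPages
	repeat with i from 1 to theHitCount
		set thisFirstPage to item i of theHitPages
		if i < theHitCount then
			-- This file ends on the page before the next hit
			set thisLastPage to (item (i + 1) of theHitPages) - 1
		else
			-- The last file runs to the end of the document
			set thisLastPage to thePageCount
		end if
		set end of theRanges to {thisFirstPage, thisLastPage}
	end repeat
	return theRanges
end pageRangesStartingAt

-- Hits on pages 1, 16 and 38 of an assumed 50-page PDF
pageRangesStartingAt({1, 16, 38}, 50)
-- → {{1, 15}, {16, 37}, {38, 50}}
```

Pages before the first hit would not end up in any file with this sketch, so it only fits documents where the string appears on page 1.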

Thx again!

Yes. Could you upload an example PDF?

Here you go. The last page of each resulting PDF would have this string:

https://procesojudicial.ramajudicial.gov.co/FirmaElectronica

The first instance is on page 17.

Thanks again!

PROVIDENCIAS E-73 ABRIL 29 DE 2022.pdf (9.1 MB)
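With the string on the last page of each resulting PDF, the page intervals could be derived roughly as below. This is only a sketch: pageRangesEndingAt is a hypothetical handler name, page 17 is taken from the post, and the further hit pages 30 and 42 are invented for illustration.

```applescript
-- Sketch: each hit page closes a part; the next part starts on the following page.
-- Page numbers are 1-based. "pageRangesEndingAt" is a hypothetical name.
on pageRangesEndingAt(theHitPages)
	set theRanges to {}
	set thisFirstPage to 1
	repeat with thisHitPage in theHitPages
		-- "contents of" dereferences the loop variable to the plain integer
		set thisLastPage to contents of thisHitPage
		set end of theRanges to {thisFirstPage, thisLastPage}
		set thisFirstPage to thisLastPage + 1
	end repeat
	return theRanges
end pageRangesEndingAt

-- First hit on page 17 as in the post; 30 and 42 are made-up further hits
pageRangesEndingAt({17, 30, 42})
-- → {{1, 17}, {18, 30}, {31, 42}}
```

Any pages after the last hit are left out here, so the sketch assumes the document ends with a page that contains the string.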

This script splits PDFs at a delimiter (see property theDelimiter).

Resulting PDFs are created in the original record’s location.

-- Split PDF at delimiter

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz" -- PDFKit (PDFDocument) is part of Quartz
use scripting additions

property theDelimiter : "https://procesojudicial.ramajudicial.gov.co/FirmaElectronica"

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some PDF records."
		show progress indicator "Split PDF at delimiter... " steps (count theRecords) as string with cancel button
		
		repeat with thisRecord in theRecords
			set thisRecord_Type to (type of thisRecord) as string
			if thisRecord_Type is in {"PDF document", "«constant ****pdf »"} then
				set thisRecord_NameWithoutExtension to name without extension of thisRecord
				step progress indicator "... " & thisRecord_NameWithoutExtension
				set thisRecord_Path to path of thisRecord
				set thisRecord_LocationGroup to location group of thisRecord
				set thisRecord_URL to URL of thisRecord
				set theTempDirectoryURL to my splitPDFatDelimiter(thisRecord_Path, thisRecord_NameWithoutExtension, thisRecord_LocationGroup, thisRecord_URL)
				my deleteTempDirectory(theTempDirectoryURL) -- clean up after each record; every call creates its own temp directory
			end if
		end repeat
		
		hide progress indicator
		
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on splitPDFatDelimiter(theRecord_Path, theRecord_NameWithoutExtension, theRecord_LocationGroup, theRecord_URL)
	try
		set thePDF to current application's PDFDocument's alloc()'s initWithURL:(current application's |NSURL|'s fileURLWithPath:theRecord_Path)
		
		set theResultSelections to (thePDF's findString:theDelimiter withOptions:0)
		set theResultSelections_Count to theResultSelections's |count|()
		if theResultSelections_Count = 0 then error "This PDF doesn't contain delimiter \"" & theDelimiter & "\""
		
		set theNextFirstPage_Index to missing value
		set theTempDirectoryURL to my createTempDirectory()
		set thisPrefix to 0
		
		repeat with i from 0 to (theResultSelections_Count - 1)
			set thisResultSelection to (theResultSelections's objectAtIndex:i)
			set thisResultSelection_Page_Index to (thePDF's indexForPage:(thisResultSelection's pages()'s firstObject())) as integer -- zero based, like "pageAtIndex"; page labels need not be numeric
			
			if i = 0 then
				set thisPDF_FirstPage_Index to 0
			else
				set thisPDF_FirstPage_Index to theNextFirstPage_Index
			end if
			set thisPDF_LastPage_Index to thisResultSelection_Page_Index
			
			set thisPDF_FirstPage to (thePDF's pageAtIndex:thisPDF_FirstPage_Index)
			set thisPDF_FirstPage_Data to thisPDF_FirstPage's dataRepresentation()
			set thisPDF to (current application's PDFDocument's alloc()'s initWithData:thisPDF_FirstPage_Data)
			
			set thisPDF_CurrentLastPage_Index to thisPDF_FirstPage_Index as integer
			
			repeat with i from 1 to ((thisPDF_LastPage_Index as integer) - (thisPDF_FirstPage_Index as integer))
				set thisPDF_CurrentLastPage_Index to thisPDF_CurrentLastPage_Index + 1
				(thisPDF's insertPage:(thePDF's pageAtIndex:thisPDF_CurrentLastPage_Index) atIndex:(thisPDF's |pageCount|()))
			end repeat
			
			set thisTempURL to ((theTempDirectoryURL's URLByAppendingPathComponent:(current application's NSProcessInfo's processInfo()'s globallyUniqueString()))'s URLByAppendingPathExtension:"pdf")
			(thisPDF's writeToURL:thisTempURL)
			set thisTempPath to (thisTempURL's |path|()) as string
			
			tell application id "DNtp"
				set thisPrefix to thisPrefix + 1
				set thisImportedRecord_Name to theRecord_NameWithoutExtension & space & "-" & space & (thisPrefix as string)
				set thisImportedRecord to import thisTempPath name thisImportedRecord_Name to theRecord_LocationGroup
				set URL of thisImportedRecord to theRecord_URL
			end tell
			
			set theNextFirstPage_Index to thisResultSelection_Page_Index + 1
		end repeat
		
		return theTempDirectoryURL
		
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"splitPDFatDelimiter\"" message error_message as warning
		try
			my deleteTempDirectory(theTempDirectoryURL)
		end try
		error number -128
	end try
end splitPDFatDelimiter

on createTempDirectory()
	try
		set theTempDirectoryURL to current application's |NSURL|'s fileURLWithPath:((current application's NSTemporaryDirectory())'s stringByAppendingPathComponent:("_Script - Split PDF at delimiter" & space & (current application's NSProcessInfo's processInfo()'s globallyUniqueString())))
		set {successCreateDir, theError} to current application's NSFileManager's defaultManager's createDirectoryAtURL:theTempDirectoryURL withIntermediateDirectories:false attributes:(missing value) |error|:(reference)
		if theError ≠ missing value then error (theError's localizedDescription() as string)
		return theTempDirectoryURL
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"createTempDirectory\"" message error_message as warning
		error number -128
	end try
end createTempDirectory

on deleteTempDirectory(theTempDirectoryURL)
	try
		set {successDeleteDir, theError} to (current application's NSFileManager's defaultManager()'s removeItemAtURL:(theTempDirectoryURL) |error|:(reference))
		if theError ≠ missing value then error (theError's localizedDescription() as string)
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"deleteTempDirectory\"" message error_message as warning
		error number -128
	end try
end deleteTempDirectory

This is awesome! Works perfectly. Thanks @pete31 !!!

This has been extremely useful! Could @pete31 or anyone else point me in the right direction on how to change the script so that the page with theDelimiter is not the last one but rather the first one of each resulting PDF? I tried a few changes with no luck. Thanks again for everyone’s help!

Well …

Writing and testing a script is fun, but trying to re-create other users’ setups is not :slight_smile:

Sorry if I came across that way. It was not my intention. You have been extremely generous with your time and knowledge, and your previous help has been invaluable.

I thought it might be a change as simple as changing the theDelimiter string, but I might have been fooled. I thought there might be a similar variable that I could just modify to make it work, but I couldn’t make out which one controlled the split. Apparently it is much more complex, so no worries!

Thanks again!

No! No worries!

Really, no idea (yet). The point is: although some users seem to think other users simply know how to do something, most of the time it really is just trying, trying, trying.

So, what I meant was: please upload a new example PDF (as it’s really “hard” to imagine what the input looks like. Or, to put it another way: without explicit input, the output may not be what you’re looking for, which in turn would waste lifetime …). There was definitely no pun intended :slight_smile:

Got it! Your generosity is definitely unmatched!

Although I can’t upload the specific documents to the internet, this PDF recreates the problem:

  • The document has several chapters (3 in the example). I want to split the document into chapters to improve DT search results and make it easier to manage.
  • The first page of each chapter has the same text as a footnote. In the example, “Proceso 31-2891” (theDelimiter). The rest of the pages do not have this footnote.
  • So essentially, the first page of each of the resulting PDFs would be the page where theDelimiter was found, whereas the last page would be the page right before the next matching page (or the end of the document).

In the example PDF, the result would be 3 PDFs with the following pages:
First. p. 1 to 4.
Second. p. 5 to 8.
Third. p. 9 to 12.

Proceso 31-2891.pdf (112.7 KB)

Thanks again!

Thanks @pete31! I am getting single page PDFs, which contain theDelimiter. The pages between delimiters are being discarded.

I spent a couple of hours trying to understand the logic behind your genius script. I think the difficulty might be in defining thisPDF_LastPage_Index. In the previous script, defining the last page was “straightforward”, as it was the page that contained theDelimiter; so was defining the next starting page: the one right after the previous theDelimiter.

Here, the first page should be the one containing the first instance of theDelimiter, and the last page the one before the next instance of theDelimiter. With my (very) limited understanding of loops, that seems tough to get. Nice try though! Thanks!

For the sake of clarity, here are the expected results from the example PDF.
Proceso 31-2891 - 2.pdf (70.5 KB)
Proceso 31-2891 - 3.pdf (70.2 KB)
Proceso 31-2891 - 1.pdf (68.2 KB)

Ok, no idea what I did there …

This one should work

-- Split PDF at delimiter (using the page that contains the delimiter as first page)

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz" -- PDFKit (PDFDocument) is part of Quartz
use scripting additions

property theDelimiter : "Proceso 31-2891"

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some PDF records."
		show progress indicator "Split PDF at delimiter... " steps (count theRecords) as string with cancel button
		
		repeat with thisRecord in theRecords
			set thisRecord_Type to (type of thisRecord) as string
			if thisRecord_Type is in {"PDF document", "«constant ****pdf »"} then
				set thisRecord_NameWithoutExtension to name without extension of thisRecord
				step progress indicator "... " & thisRecord_NameWithoutExtension
				set thisRecord_Path to path of thisRecord
				set thisRecord_LocationGroup to location group of thisRecord
				set thisRecord_URL to URL of thisRecord
				set theTempDirectoryURL to my splitPDFatDelimiter(thisRecord_Path, thisRecord_NameWithoutExtension, thisRecord_LocationGroup, thisRecord_URL)
				my deleteTempDirectory(theTempDirectoryURL) -- clean up after each record; every call creates its own temp directory
			end if
		end repeat
		
		hide progress indicator
		
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on splitPDFatDelimiter(theRecord_Path, theRecord_NameWithoutExtension, theRecord_LocationGroup, theRecord_URL)
	try
		set thePDF to current application's PDFDocument's alloc()'s initWithURL:(current application's |NSURL|'s fileURLWithPath:theRecord_Path)
		
		set theResultSelections to (thePDF's findString:theDelimiter withOptions:0)
		set theResultSelections_Count to theResultSelections's |count|()
		if theResultSelections_Count = 0 then error "This PDF doesn't contain delimiter \"" & theDelimiter & "\""
		
		set theTempDirectoryURL to my createTempDirectory()
		set thisPrefix to 0
		
		repeat with i from 0 to (theResultSelections_Count - 1)
			set thisResultSelection to (theResultSelections's objectAtIndex:i)
			set thisResultSelection_Page_Index to (thePDF's indexForPage:(thisResultSelection's pages()'s firstObject())) as integer -- zero based, like "pageAtIndex"; page labels need not be numeric
			
			if i = 0 then
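				set thisPDF_FirstPage_Index to 0 -- the first part always starts at the document's first page, even if the first delimiter appears later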
			else
				set thisPDF_FirstPage_Index to thisResultSelection_Page_Index
			end if
			
			if (i < theResultSelections_Count - 1) then
				set theNextResultSelection to (theResultSelections's objectAtIndex:(i + 1))
				set theNextPDF_FirstPage_Index to (thePDF's indexForPage:(theNextResultSelection's pages()'s firstObject())) as integer -- zero based, like "pageAtIndex"; page labels need not be numeric
				set thisPDF_LastPage_Index to theNextPDF_FirstPage_Index - 1
			else
				set thisPDF_LastPage_Index to ((thePDF's |pageCount|()) - 1)
			end if
			
			set thisPDF_FirstPage to (thePDF's pageAtIndex:thisPDF_FirstPage_Index)
			set thisPDF_FirstPage_Data to thisPDF_FirstPage's dataRepresentation()
			set thisPDF to (current application's PDFDocument's alloc()'s initWithData:thisPDF_FirstPage_Data)
			
			set thisPDF_CurrentLastPage_Index to thisPDF_FirstPage_Index as integer
			
			repeat with i from 1 to ((thisPDF_LastPage_Index as integer) - (thisPDF_FirstPage_Index as integer))
				set thisPDF_CurrentLastPage_Index to thisPDF_CurrentLastPage_Index + 1
				(thisPDF's insertPage:(thePDF's pageAtIndex:thisPDF_CurrentLastPage_Index) atIndex:(thisPDF's |pageCount|()))
			end repeat
			
			set thisTempURL to ((theTempDirectoryURL's URLByAppendingPathComponent:(current application's NSProcessInfo's processInfo()'s globallyUniqueString()))'s URLByAppendingPathExtension:"pdf")
			(thisPDF's writeToURL:thisTempURL)
			set thisTempPath to (thisTempURL's |path|()) as string
			
			tell application id "DNtp"
				set thisPrefix to thisPrefix + 1
				set thisImportedRecord_Name to theRecord_NameWithoutExtension & space & "-" & space & (thisPrefix as string)
				set thisImportedRecord to import thisTempPath name thisImportedRecord_Name to theRecord_LocationGroup
				set URL of thisImportedRecord to theRecord_URL
			end tell
			
			set theNextFirstPage_Index to thisResultSelection_Page_Index + 1
		end repeat
		
		return theTempDirectoryURL
		
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"splitPDFatDelimiter\"" message error_message as warning
		try
			my deleteTempDirectory(theTempDirectoryURL)
		end try
		error number -128
	end try
end splitPDFatDelimiter

on createTempDirectory()
	try
		set theTempDirectoryURL to current application's |NSURL|'s fileURLWithPath:((current application's NSTemporaryDirectory())'s stringByAppendingPathComponent:("_Script - Split PDF at delimiter" & space & (current application's NSProcessInfo's processInfo()'s globallyUniqueString())))
		set {successCreateDir, theError} to current application's NSFileManager's defaultManager's createDirectoryAtURL:theTempDirectoryURL withIntermediateDirectories:false attributes:(missing value) |error|:(reference)
		if theError ≠ missing value then error (theError's localizedDescription() as string)
		return theTempDirectoryURL
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"createTempDirectory\"" message error_message as warning
		error number -128
	end try
end createTempDirectory

on deleteTempDirectory(theTempDirectoryURL)
	try
		set {successDeleteDir, theError} to (current application's NSFileManager's defaultManager()'s removeItemAtURL:(theTempDirectoryURL) |error|:(reference))
		if theError ≠ missing value then error (theError's localizedDescription() as string)
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"deleteTempDirectory\"" message error_message as warning
		error number -128
	end try
end deleteTempDirectory

Worked perfectly!!! Thanks again!
