Script: Split PDF at outline (including the next outline item's first page)

This script splits a PDF at its outline, including the next outline item’s first page.

Problem

A PDF outline item knows on which page it starts, but it doesn’t know on which page it ends.

That’s often no problem, but it is a problem if

  • one outline item ends and the next item starts on the same page and
  • the PDF’s creator failed to set proper destination points

Background: A destination point is the point on a page that the view scrolls to after clicking an outline item.

Examples of PDFs with proper destination points are the DEVONtech documentation PDFs. Clicking an item in such a well-built outline scrolls the page to a point defined by the PDF creator (in most cases a heading).

If clicking an outline item doesn’t scroll to a specific point, the item likely either has unspecified destination points or its points are set somewhere at the beginning of the page.

So what’s the problem? If a PDF outline doesn’t know where it ends and the PDF’s creator didn’t set exact destination points, then there’s no way to reliably tell where on a page the next outline item starts. Thus there’s also no way to tell whether a page only contains text that belongs to the next outline item or not. Splitting such a PDF can yield results that are missing large parts.

Even with proper destination points it’s probably not possible to reliably decide whether the first page of the next outline item should be included, e.g. in the case of multi-column PDFs.
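
To make the problem concrete, here is a minimal sketch in plain AppleScript that only does the page arithmetic. The handler pageRangesFor and its parameters are made up for illustration and are not part of the script below. It assumes 0-based start pages in outline order and returns a {first page, last page} pair per outline item.

```applescript
-- Hypothetical helper for illustration only, not part of the script below.
-- startPages: 0-based start page of each outline item, in outline order.
-- pageCount: total number of pages in the PDF.
-- includeNextFirstPage: whether a range also takes the next item's first page.
on pageRangesFor(startPages, pageCount, includeNextFirstPage)
	set theRanges to {}
	set itemCount to count of startPages
	repeat with i from 1 to itemCount
		set firstPage to item i of startPages
		if i < itemCount then
			set lastPage to item (i + 1) of startPages
			if not includeNextFirstPage then set lastPage to lastPage - 1
		else
			-- The last outline item runs to the end of the PDF.
			set lastPage to pageCount - 1
		end if
		set end of theRanges to {firstPage, lastPage}
	end repeat
	return theRanges
end pageRangesFor

-- Chapters start on pages 0, 4 and 9 of a 12-page PDF.
pageRangesFor({0, 4, 9}, 12, false) -- {{0, 3}, {4, 8}, {9, 11}}
pageRangesFor({0, 4, 9}, 12, true) -- {{0, 4}, {4, 9}, {9, 11}}
```

With false, anything of chapter 1 that is printed on page 4 is lost. With true, page 4 ends up in both chapter 1 and chapter 2, which is why some last pages may have to be deleted afterwards.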

Approach

The only way to work around this is to include the next outline item’s first page.

This means that last pages that are not needed have to be removed manually. To make this easier the script creates bookmarks to the last pages.

If a PDF has proper destination points it seems to be possible to guess whether the next outline item’s first page is needed by checking them. Not sure how well this works, though. If you don’t want to experiment, set alwaysIncludeNextPage to true.
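
Roughly, the guess boils down to the following. This is a minimal sketch as a standalone handler. The name lastPageIndexFor and its parameters are made up for illustration; the script does the same thing inline. The two booleans stand for the checks on the next item's destination point (whether it sits on the crop box edge, and whether it lies in the upper third, left half of the page).

```applescript
-- Hypothetical helper for illustration only; page indexes are 0-based.
on lastPageIndexFor(nextPageIndex, nextPointIsOnCropBoxEdge, nextPointIsInUpperLeft, alwaysInclude)
	-- No guessing wanted: always keep the next item's first page.
	if alwaysInclude then return nextPageIndex
	if (not nextPointIsOnCropBoxEdge) and nextPointIsInUpperLeft then
		-- The next item seems to start at the top of its page, so stop one page earlier.
		return nextPageIndex - 1
	end if
	-- Unspecified or ambiguous destination point: keep the page to be safe.
	return nextPageIndex
end lastPageIndexFor

lastPageIndexFor(7, false, true, false) -- 6
lastPageIndexFor(7, true, false, false) -- 7
lastPageIndexFor(7, false, true, true) -- 7
```

Only in the second and third case would a last page bookmark be created, because only there the last page might be unnecessary.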

Properties

  • alwaysIncludeNextPage: always include the next outline item’s first page (i.e. don’t try to guess whether the next page is needed).

  • removeLinebreaksInLabels: remove linebreaks in outline labels.

  • includeFirstLevel: include the outline item’s container as first item in the new outline. Makes navigating to the PDF’s first page easier. Hard to explain, test it.

  • usePrefix: use counter prefix in name.

  • useRecordName: use original record’s name in name.

  • createLastPageBookmarks: create bookmarks to the last page. Makes deleting unnecessary last pages easier: select all bookmarks, open them, delete what’s not needed.
    Bookmarks are only created if necessary.
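
How usePrefix and useRecordName combine is easiest to see with an example. A minimal sketch, assuming a standalone handler (splitNameFor is a made-up name; the script builds the name inline):

```applescript
-- Hypothetical helper for illustration only.
-- theLabel: the outline item's label, theCounter: running number of the split,
-- theRecordName: the original record's name without extension.
on splitNameFor(theLabel, theCounter, theRecordName, usePrefix, useRecordName)
	set theName to theLabel
	if usePrefix then set theName to (theCounter as string) & "." & space & theName
	if useRecordName then set theName to theRecordName & space & "-" & space & theName
	return theName
end splitNameFor

splitNameFor("Introduction", 1, "Manual", true, true) -- "Manual - 1. Introduction"
splitNameFor("Introduction", 1, "Manual", true, false) -- "1. Introduction"
splitNameFor("Introduction", 1, "Manual", false, false) -- "Introduction"
```

A last page bookmark gets the same name with "[Last Page]" appended.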


-- Split PDF at outline (including the next outline item's first page)

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

property alwaysIncludeNextPage : false -- always include the next outline item's first page (i.e. don't try to guess whether the next page is needed).
property removeLinebreaksInLabels : true -- remove linebreaks in outline labels.
property includeFirstLevel : false -- include the outline item's container as first item in the new outline. Makes navigating to the PDF's first page easier.
property usePrefix : true -- use counter prefix in name.
property useRecordName : true -- use original record's name in name.
property createLastPageBookmarks : true -- create bookmarks to the last page. Makes deleting unnecessary last pages easier: select all bookmarks, open them, delete what's not needed. Bookmarks are only created if necessary.

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some records"
		
		repeat with thisRecord in theRecords
			set thisRecord_Type to (type of thisRecord) as string
			if thisRecord_Type is in {"PDF document", "«constant ****pdf »"} then
				set thisRecord_Path to path of thisRecord
				my splitPDFAtOutline(thisRecord_Path, thisRecord)
			end if
		end repeat
		
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on splitPDFAtOutline(thePath, theRecord)
	try
		set thePDF to current application's PDFDocument's alloc()'s initWithURL:(current application's |NSURL|'s fileURLWithPath:thePath)
		set thePDF_PageCount to thePDF's pageCount()
		set thePDF_OutlineRoot to thePDF's outlineRoot()
		if thePDF_OutlineRoot = missing value then return
		set thePDF_OutlineRoot_numberOfChildren to thePDF_OutlineRoot's numberOfChildren()
		if thePDF_OutlineRoot_numberOfChildren > 1 then
			set thePDF_Outline_StartLevel to 0
		else
			set thePDF_Outline_StartLevel to 1
		end if
		set thePDF_Outline to my getOutlineItem(thePDF_OutlineRoot, thePDF_Outline_StartLevel, missing value)
		set thePDF_OutlineArray to my getOutlineArray(thePDF_Outline, current application's NSMutableArray's new(), thePDF)
		set thePDF_OutlineArray_filteredByDestination to (thePDF_OutlineArray's filteredArrayUsingPredicate:(current application's NSPredicate's predicateWithFormat:("!self.OutlineItem_Destination = " & (current application's NSNull's |null|()))))
		set thePDF_ValidOulineIndexes to thePDF_OutlineArray_filteredByDestination's valueForKeyPath:"OutlineItem_Index"
		set thePDF_ValidOulineIndexes_Count to thePDF_ValidOulineIndexes's |count|()
		if thePDF_ValidOulineIndexes_Count < 2 then
			tell application id "DNtp" to log message info "Script \"Split PDF at Outline\": No valid outline" record theRecord
			return
		end if
		
		tell application id "DNtp"
			set theRecord_NameWithoutExtension to name without extension of theRecord
			set theParents to parents of theRecord whose location does not start with "/Tags/"
			if theParents ≠ {} then
				set theGroup to create record with {name:theRecord_NameWithoutExtension, type:group} in (item 1 of theParents)
			else
				log message info "Script \"Split PDF at Outline\": All parents are tags" record theRecord
				return
			end if
			set theProgressSteps to thePDF_ValidOulineIndexes_Count
			show progress indicator "Splitting... " & theRecord_NameWithoutExtension steps theProgressSteps with cancel button
		end tell
		
		set theMetadataDictionary to current application's NSMutableDictionary's new()
		set theTempDirectoryURL to my createTempDirectory()
		set thisPrefix to 0
		
		repeat with i from 0 to (thePDF_ValidOulineIndexes_Count - 1)
			tell application id "DNtp" to step progress indicator
			set thisPDF to (current application's PDFDocument's alloc()'s initWithURL:(current application's |NSURL|'s fileURLWithPath:thePath))
			(thisPDF's setDocumentAttributes:theMetadataDictionary)
			set thisPDF_OutlineRoot to thisPDF's outlineRoot()
			
			set thisPDF_OutlineItem_Index to (thePDF_ValidOulineIndexes's objectAtIndex:i)
			set thisPDF_OutlineItem to my getOutlineItem(thisPDF_OutlineRoot, thePDF_Outline_StartLevel, thisPDF_OutlineItem_Index)
			set thisPDF_OutlineItem_Properties to (thePDF_OutlineArray's objectAtIndex:thisPDF_OutlineItem_Index)
			set thisPDF_OutlineItem_PageIndex to (thisPDF_OutlineItem_Properties's valueForKey:"OutlineItem_PageIndex")
			set thisPDF_OutlineItem_Label to (thisPDF_OutlineItem_Properties's valueForKey:"OutlineItem_Label") as string
			
			if i < (thePDF_ValidOulineIndexes_Count - 1) then
				set thisPDF_NextOutlineItem_Index to (thePDF_ValidOulineIndexes's objectAtIndex:(i + 1))
				set thisPDF_NextOutlineItem_Properties to (thePDF_OutlineArray's objectAtIndex:thisPDF_NextOutlineItem_Index)
				set thisPDF_NextOutlineItem_PageIndex to (thisPDF_NextOutlineItem_Properties's valueForKey:"OutlineItem_PageIndex")
				if alwaysIncludeNextPage then
					set thisPDF_OutlineItem_PageIndex_LastPage to (thisPDF_NextOutlineItem_PageIndex as integer)
					set thisPDF_OutlineItem_createLastPageBookmark to true
				else
					set thisPDF_NextOutlineItem_Point_isEqualToCropBox to (thisPDF_NextOutlineItem_Properties's valueForKey:"OutlineItem_Point_isEqualToCropBox") as boolean
					set thisPDF_NextOutlineItem_Point_IsInUpperThirdLeftHalf to (thisPDF_NextOutlineItem_Properties's valueForKey:"OutlineItem_Point_IsInUpperThirdLeftHalf") as boolean
					if not thisPDF_NextOutlineItem_Point_isEqualToCropBox and thisPDF_NextOutlineItem_Point_IsInUpperThirdLeftHalf then
						set thisPDF_OutlineItem_PageIndex_LastPage to (thisPDF_NextOutlineItem_PageIndex as integer) - 1
						set thisPDF_OutlineItem_createLastPageBookmark to false
					else
						set thisPDF_OutlineItem_PageIndex_LastPage to (thisPDF_NextOutlineItem_PageIndex as integer)
						set thisPDF_OutlineItem_createLastPageBookmark to true
					end if
				end if
			else
				-- last outline item: runs to the last page of the PDF (page indexes are 0-based)
				set thisPDF_OutlineItem_PageIndex_LastPage to thePDF_PageCount - 1
				set thisPDF_OutlineItem_createLastPageBookmark to false
			end if
			
			set thisPDF_RemovePagesFromStart to (thisPDF_OutlineItem_PageIndex as integer)
			if i = 0 then set thisPDF_RemovePagesFromStart to 0
			set thisPDF_RemovePagesFromEnd to (thePDF_PageCount - (thisPDF_OutlineItem_PageIndex_LastPage as integer)) - 1
			
			if (thePDF_PageCount > (thisPDF_RemovePagesFromStart + thisPDF_RemovePagesFromEnd)) then
				
				repeat thisPDF_RemovePagesFromStart times
					(thisPDF's removePageAtIndex:0)
				end repeat
				
				repeat thisPDF_RemovePagesFromEnd times
					(thisPDF's removePageAtIndex:((thisPDF's pageCount()) - 1))
				end repeat
				
				if removeLinebreaksInLabels then my removeLinebreaksInOutlineLabels(thisPDF_OutlineItem)
				
				if not includeFirstLevel then
					(thisPDF's setOutlineRoot:thisPDF_OutlineItem)
				else
					set thisPDF_OutlineItem_numberOfChildren to (thisPDF_OutlineItem's numberOfChildren())
					if thisPDF_OutlineItem_numberOfChildren > 0 or (i = 0 and (thisPDF_OutlineItem_PageIndex as integer) ≠ 0) then
						set thisPDF_Outline to my getOutlineItem(thisPDF_OutlineRoot, thePDF_Outline_StartLevel, missing value)
						set thisOutlineRoot to current application's PDFOutline's new()
						set thisOutlineItem to current application's PDFOutline's new()
						(thisOutlineItem's setLabel:(thisPDF_OutlineItem's label()))
						(thisOutlineItem's setDestination:(thisPDF_OutlineItem's destination()))
						(thisOutlineRoot's insertChild:(thisOutlineItem) atIndex:0)
						repeat with j from 0 to (thisPDF_OutlineItem_numberOfChildren - 1)
							(thisOutlineRoot's insertChild:(thisPDF_OutlineItem's childAtIndex:j) atIndex:(thisOutlineRoot's numberOfChildren()))
						end repeat
						(thisPDF's setOutlineRoot:thisOutlineRoot)
					else
						(thisPDF's setOutlineRoot:(current application's PDFOutline's new()))
					end if
				end if
				
				set thisTempURL to ((theTempDirectoryURL's URLByAppendingPathComponent:(current application's NSProcessInfo's processInfo()'s globallyUniqueString()))'s URLByAppendingPathExtension:"pdf")
				(thisPDF's writeToURL:thisTempURL)
				set thisTempPath to (thisTempURL's |path|()) as string
				set thisPDF_PageCount to thisPDF's pageCount()
				
				tell application id "DNtp"
					set thisImportedRecord_Name to thisPDF_OutlineItem_Label
					if usePrefix then
						set thisPrefix to thisPrefix + 1
						set thisImportedRecord_Name to (thisPrefix as string) & "." & space & thisImportedRecord_Name
					end if
					if useRecordName then set thisImportedRecord_Name to theRecord_NameWithoutExtension & space & "-" & space & thisImportedRecord_Name
					set thisImportedRecord to import thisTempPath name thisImportedRecord_Name to theGroup
					set URL of thisImportedRecord to URL of theRecord
					if createLastPageBookmarks and thisPDF_OutlineItem_createLastPageBookmark then
						set thisLastPageBookmark_Name to thisImportedRecord_Name & space & "[Last Page]"
						set thisLastPageBookmark_URL to ((reference URL of thisImportedRecord) & "?page=" & (thisPDF_PageCount - 1)) as string
						set thisLastPageBookmark to create record with {name:thisLastPageBookmark_Name, type:bookmark, URL:thisLastPageBookmark_URL} in theGroup
					end if
				end tell
				
			else
				tell application id "DNtp" to log message info "Script \"Split PDF at Outline\": No valid outline" record theRecord
			end if
		end repeat
		
		set {successDeleteDir, theError} to (current application's NSFileManager's defaultManager()'s removeItemAtURL:(theTempDirectoryURL) |error|:(reference))
		if theError ≠ missing value then error (theError's localizedDescription() as string)
		tell application id "DNtp" to hide progress indicator
		
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"splitPDFAtOutline\"" message error_message as warning
		try
			current application's NSFileManager's defaultManager()'s removeItemAtURL:(theTempDirectoryURL) |error|:(missing value)
		end try
		error number -128
	end try
end splitPDFAtOutline

on getOutlineItem(thisPDF_OutlineRoot, thePDF_Outline_StartLevel, theIndex)
	try
		if theIndex ≠ missing value then
			if thePDF_Outline_StartLevel = 0 then
				set thisPDF_OutlineItem to (thisPDF_OutlineRoot's childAtIndex:theIndex)
			else
				set thisPDF_OutlineItem to ((thisPDF_OutlineRoot's childAtIndex:0)'s childAtIndex:theIndex)
			end if
		else
			if thePDF_Outline_StartLevel = 0 then
				set thisPDF_OutlineItem to thisPDF_OutlineRoot
			else
				set thisPDF_OutlineItem to (thisPDF_OutlineRoot's childAtIndex:0)
			end if
		end if
		return thisPDF_OutlineItem
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"getOutlineItem\"" message error_message as warning
		error number -128
	end try
end getOutlineItem

on getOutlineArray(theOutlineItem, theOutlineArray, thePDF)
	try
		repeat with i from 0 to ((theOutlineItem's numberOfChildren()) - 1)
			set thisOutlineItem to (theOutlineItem's childAtIndex:i)
			set thisOutlineItem_Label to thisOutlineItem's label()
			set thisOutlineItem_Destination to thisOutlineItem's destination()
			if thisOutlineItem_Destination ≠ missing value then
				set thisOutlineItem_Point to thisOutlineItem_Destination's |point|()
				set thisOutlineItem_Page to thisOutlineItem_Destination's page()
				set thisOutlineItem_PageIndex to (thePDF's indexForPage:thisOutlineItem_Page)
				set thisOutlineItem_Point_isInUpperThirdLeftHalf to my isOutlineDestinationPointInUpperThirdLeftHalf(thisOutlineItem_Point, thisOutlineItem_Page)
				set thisOutlineItem_Point_isEqualToCropBox to my isOutlineDestinationPointEqualToCropBox(thisOutlineItem_Point, thisOutlineItem_Page)
			else
				set {thisOutlineItem_Point, thisOutlineItem_Page, thisOutlineItem_PageIndex, thisOutlineItem_Point_isInUpperThirdLeftHalf, thisOutlineItem_Point_isEqualToCropBox} to {missing value, missing value, missing value, missing value, missing value}
			end if
			(theOutlineArray's addObject:{OutlineItem_Label:thisOutlineItem_Label, OutlineItem_PageIndex:thisOutlineItem_PageIndex, OutlineItem_Point:thisOutlineItem_Point, OutlineItem_Point_IsInUpperThirdLeftHalf:thisOutlineItem_Point_isInUpperThirdLeftHalf, OutlineItem_Point_isEqualToCropBox:thisOutlineItem_Point_isEqualToCropBox, OutlineItem_Index:i, OutlineItem_Destination:thisOutlineItem_Destination})
		end repeat
		return theOutlineArray
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"getOutlineArray\"" message error_message as warning
		error number -128
	end try
end getOutlineArray

on isOutlineDestinationPointInUpperThirdLeftHalf(theOutlineItem_Point, theOutlineItem_Page)
	try
		set theOutlineItem_Point_isInUpperThirdLeftHalf to false
		if theOutlineItem_Point ≠ missing value then
			if (theOutlineItem_Point's x) ≠ ((current application's kPDFDestinationUnspecifiedValue)) then
				set theCropBox to theOutlineItem_Page's boundsForBox:(current application's kPDFDisplayBoxCropBox)
				set theCropBox_MinX to current application's NSRect's NSMinX(theCropBox)
				set theCropBox_MidX to current application's NSRect's NSMidX(theCropBox)
				set theCropBox_MaxY to current application's NSRect's NSMaxY(theCropBox)
				set theCropBox_Height to current application's NSRect's NSHeight(theCropBox)
				set theCropBox_UpperThird_MinY to theCropBox_MaxY - (theCropBox_Height / 3)
				set theCropBox_UpperThird_Height to theCropBox_MaxY - theCropBox_UpperThird_MinY
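				-- width is half the crop box width (MidX - MinX), which also works if the crop box doesn't start at x = 0
				set theCropBox_UpperThird_LeftHalf to current application's NSRect's NSMakeRect(theCropBox_MinX, theCropBox_UpperThird_MinY, theCropBox_MidX - theCropBox_MinX, theCropBox_UpperThird_Height)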
				set theOutlineItem_Point_isInUpperThirdLeftHalf to current application's NSRect's NSPointInRect(theOutlineItem_Point, theCropBox_UpperThird_LeftHalf)
			end if
		end if
		return theOutlineItem_Point_isInUpperThirdLeftHalf
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"isOutlineDestinationPointInUpperThirdLeftHalf\"" message error_message as warning
		error number -128
	end try
end isOutlineDestinationPointInUpperThirdLeftHalf

on isOutlineDestinationPointEqualToCropBox(theOutlineItem_Point, theOutlineItem_Page)
	try
		set theCropBox to theOutlineItem_Page's boundsForBox:(current application's kPDFDisplayBoxCropBox)
		set theCropBox_MinX to current application's NSRect's NSMinX(theCropBox)
		set theCropBox_MaxY to current application's NSRect's NSMaxY(theCropBox)
		set theOutlineItem_Point_isEqualToCropBox to missing value
		if (theOutlineItem_Point's x = theCropBox_MinX) or (theOutlineItem_Point's y = theCropBox_MaxY) then
			set theOutlineItem_Point_isEqualToCropBox to true
		else
			set theOutlineItem_Point_isEqualToCropBox to false
		end if
		return theOutlineItem_Point_isEqualToCropBox
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"isOutlineDestinationPointEqualToCropBox\"" message error_message as warning
		error number -128
	end try
end isOutlineDestinationPointEqualToCropBox

on createTempDirectory()
	try
		set theTempDirectoryURL to current application's |NSURL|'s fileURLWithPath:((current application's NSTemporaryDirectory())'s stringByAppendingPathComponent:("Script Split PDF at Outline" & space & (current application's NSProcessInfo's processInfo()'s globallyUniqueString())))
		set {successCreateDir, theError} to current application's NSFileManager's defaultManager()'s createDirectoryAtURL:theTempDirectoryURL withIntermediateDirectories:false attributes:(missing value) |error|:(reference)
		if theError ≠ missing value then error (theError's localizedDescription() as string)
		return theTempDirectoryURL
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"createTempDirectory\"" message error_message as warning
		error number -128
	end try
end createTempDirectory

on removeLinebreaksInOutlineLabels(theOutlineItem)
	try
		set theOutlineItem_Label to theOutlineItem's label()
		set theOutlineItem_Label_clean to theOutlineItem_Label's stringByReplacingOccurrencesOfString:("( +)?\\R") withString:(space) options:(current application's NSRegularExpressionSearch) range:{location:0, |length|:theOutlineItem_Label's |length|()}
		theOutlineItem's setLabel:theOutlineItem_Label_clean
		repeat with i from 0 to ((theOutlineItem's numberOfChildren()) - 1)
			set thisOutlineItemChild to (theOutlineItem's childAtIndex:i)
			my removeLinebreaksInOutlineLabels(thisOutlineItemChild)
		end repeat
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"removeLinebreaksInOutlineLabels\"" message error_message as warning
		error number -128
	end try
end removeLinebreaksInOutlineLabels


Definitely an interesting script! Do you have an example file so that I could compare the results to Tools > Split PDF into Chapters? Thanks!

Sure, here

Thank you for the file! Currently Tools > Split PDF into Chapters does indeed try to guess whether the next page is needed or not. In the case of your document this works for several chapters but not all.