Splitting pages of PDFs?

halloleo · March 5, 2025, 6:37am

I have scanned PDFs with two real A4 pages on one A3 PDF page, like this:

Can DEVONthink cut pages like this into two?

Or is there another tool out there that does this easily?

Many thanks for any pointers!

cgrunenberg · March 5, 2025, 7:21am

DEVONthink doesn’t support this but you could use Preview.app and Tools > Rectangular Selection. Select & copy the first page, then use File > New from Clipboard and save the page as a PDF, afterwards process the second page the same way.
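For anyone who would rather script the cut than do it by hand, the geometry is simple: a landscape A3 page is two portrait A4 pages side by side, so each half is a crop box covering half the width. A minimal Python sketch of just that arithmetic, assuming the two A4 pages sit left and right with no gutter; `split_box` is a hypothetical helper, and applying the boxes to a real PDF would still need a PDF library or tool of your choice:

```python
def split_box(box):
    """Split a page box (left, bottom, right, top), given in PDF points,
    into a left half and a right half along the vertical centre line."""
    left, bottom, right, top = box
    middle = (left + right) / 2
    return (left, bottom, middle, top), (middle, bottom, right, top)


# Landscape A3 is 420 x 297 mm; one PDF point is 1/72 inch, i.e. 25.4/72 mm.
MM_TO_PT = 72 / 25.4
a3_landscape = (0.0, 0.0, 420 * MM_TO_PT, 297 * MM_TO_PT)

left_page, right_page = split_box(a3_landscape)
print(left_page)   # crop box for the first A4 page
print(right_page)  # crop box for the second A4 page
```

Each half comes out 210 mm wide and 297 mm high, which is portrait A4.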

NickLowe · March 5, 2025, 10:19am

I use PDF Scissor, a cheap one-trick app that does exactly and only this.

halloleo · April 15, 2025, 7:43am

Works well indeed. Thanks for the tip.

lerone · April 15, 2025, 12:42pm

FWIW – I was just working on a very similar problem, though with a slightly different use case. I thought this might also be interesting / useful for people
a) with a similar requirement (auto-splitting of PDFs)
b) with a workflow in ‘the image department’, though I think it could well be used outside of the image-centered domain.

this script splits a PDF into single pages and uses the first line of text found on each page (sanitized) as the title of the resulting one-page PDF.

this was done with AI. so, first it might just be a base to work with. then, I am sure the coding priests here might have some ways to make this more elegant.

the script:

-- PDF Split & Rename Script with Advanced Format Removal
tell application "Finder"
	set pdfFile to (choose file with prompt "Select PDF to process" of type {"PDF"}) as alias
	set outputFolder to (choose folder with prompt "Select output directory")
end tell
set pdfPath to POSIX path of pdfFile
set outDir to POSIX path of outputFolder
if character -1 of outDir is not "/" then set outDir to outDir & "/"
-- Define expanded PATH
set shellPath to "PATH=/usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/sbin:/usr/sbin; "
-- Create detailed log file
set logFile to outDir & "debug_log.txt"
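-- Double quotes let the shell expand $(date); quoted form protects paths containing apostrophes
do shell script "echo \"PDF PROCESSING LOG - STARTED AT $(date)\" > " & quoted form of logFile
do shell script "echo " & quoted form of ("PDF File: " & pdfPath) & " >> " & quoted form of logFile
do shell script "echo " & quoted form of ("Output Dir: " & outDir) & " >> " & quoted form of logFile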
do shell script "echo '-----------------------------------' >> " & quoted form of logFile
-- Split PDF into individual pages
try
	do shell script "echo 'Splitting PDF...' >> " & quoted form of logFile
	do shell script shellPath & "pdfseparate " & quoted form of pdfPath & " " & quoted form of (outDir & "page_%03d.pdf") & " 2>> " & quoted form of logFile
	do shell script "echo 'Split operation completed.' >> " & quoted form of logFile
on error errMsg
	do shell script "echo " & quoted form of ("ERROR DURING SPLIT: " & errMsg) & " >> " & quoted form of logFile
	display dialog "Error splitting PDF. Check log for details." buttons {"OK"} default button 1
	return
end try
-- List all the split files
set fileList to do shell script "ls -1 " & quoted form of outDir & "page_*.pdf"
set splitFileList to paragraphs of fileList
set processingErrors to 0
set processedCount to 0
repeat with filePath in splitFileList
	try
		-- Dereference the loop variable so fullPath is plain text, not a list reference
		set fullPath to contents of filePath
		if fullPath does not start with "/" then
			set fullPath to outDir & fullPath
		end if
		
		set fileName to do shell script "basename " & quoted form of fullPath
		
		-- Document processing
		do shell script "echo '--------------------------------------' >> " & quoted form of logFile
		do shell script "echo " & quoted form of ("Processing: " & fileName) & " >> " & quoted form of logFile
		
		-- Extract text
		do shell script "echo 'Extracting text...' >> " & quoted form of logFile
		set extractCmd to shellPath & "pdftotext " & quoted form of fullPath & " - | grep -v '^[[:space:]]*$' | head -n 1"
		set extractedText to do shell script extractCmd
		
		do shell script "echo " & quoted form of ("Extracted raw text: " & extractedText) & " >> " & quoted form of logFile
		
		-- Step-by-step format cleanup with detailed logging
		-- Step 1: Strip format suffixes that are directly attached to words
		set step1 to do shell script "echo " & quoted form of extractedText & " | sed -E 's/([A-Za-z0-9]+)(jpg|jpeg|tif|tiff|png|gif|bmp|pdf|psd|raw|heic|webp)([^A-Za-z0-9]|$)/\\1\\3/gi'"
		do shell script "echo " & quoted form of ("After format suffix removal: " & step1) & " >> " & quoted form of logFile
		
		-- Step 2: Remove file extensions with dots
		set step2 to do shell script "echo " & quoted form of step1 & " | sed -E 's/\\.(jpg|jpeg|tif|tiff|png|gif|bmp|pdf|psd|raw|heic|webp)([^A-Za-z0-9]|$)/\\2/gi'"
		do shell script "echo " & quoted form of ("After dot-extension removal: " & step2) & " >> " & quoted form of logFile
		
		-- Step 3: Remove standalone format words
		set step3 to do shell script "echo " & quoted form of step2 & " | sed -E 's/\\b(jpg|jpeg|tif|tiff|png|gif|bmp|pdf|psd|raw|heic|webp)\\b//gi'"
		do shell script "echo " & quoted form of ("After standalone format removal: " & step3) & " >> " & quoted form of logFile
		
		-- Step 4: Remove image dimensions
		set step4 to do shell script "echo " & quoted form of step3 & " | sed -E 's/[0-9]+ ?[xX×] ?[0-9]+ ?(pixels?|px)?//gi'"
		do shell script "echo " & quoted form of ("After dimension removal: " & step4) & " >> " & quoted form of logFile
		
		-- Step 5: Clean up extra spaces and trim
		set processedText to do shell script "echo " & quoted form of step4 & " | sed 's/  */ /g' | sed 's/^ *//' | sed 's/ *$//'"
		do shell script "echo " & quoted form of ("After space cleanup: " & processedText) & " >> " & quoted form of logFile
		
		-- Step 6: Prepare for filename
		set processedText to do shell script "echo " & quoted form of processedText & " | cut -c 1-200 | tr ' ' '-'"
		
		-- Sanitize text for filename
		set cleanName to do shell script "echo " & quoted form of processedText & " | " & ¬
			"tr -d '/:\\\"\\\\|?*<>.' | " & ¬
			"tr -d '\\r\\n' | " & ¬
			"sed 's/^-//g' | sed 's/-$//g' | sed 's/-\\{2,\\}/-/g'"
		
		do shell script "echo " & quoted form of ("Sanitized text: " & cleanName) & " >> " & quoted form of logFile
		
		-- Generate new filename
		if cleanName is "" then
			set newName to fileName -- Keep original if no text
			do shell script "echo 'No text found, keeping original name' >> " & quoted form of logFile
		else
			set newName to cleanName & ".pdf"
			
			-- Avoid filename collisions
			set counter to 1
			set baseName to text 1 thru -5 of newName
			
			set checkCmd to "[ -e " & quoted form of (outDir & newName) & " ] && echo 'exists' || echo 'new'"
			set fileExists to do shell script checkCmd
			
			repeat while fileExists is "exists"
				set newName to baseName & "_" & counter & ".pdf"
				set checkCmd to "[ -e " & quoted form of (outDir & newName) & " ] && echo 'exists' || echo 'new'"
				set fileExists to do shell script checkCmd
				set counter to counter + 1
			end repeat
			
			-- Rename the file
			do shell script "echo " & quoted form of ("Renaming to: " & newName) & " >> " & quoted form of logFile
			do shell script "mv " & quoted form of fullPath & " " & quoted form of (outDir & newName)
		end if
		
		set processedCount to processedCount + 1
	on error errMsg
		do shell script "echo " & quoted form of ("ERROR PROCESSING FILE: " & errMsg) & " >> " & quoted form of logFile
		set processingErrors to processingErrors + 1
	end try
end repeat
-- Final status
do shell script "echo '-----------------------------------' >> " & quoted form of logFile
do shell script "echo \"Processing completed at $(date)\" >> " & quoted form of logFile
do shell script "echo 'Files processed: " & processedCount & "' >> " & quoted form of logFile
do shell script "echo 'Errors encountered: " & processingErrors & "' >> " & quoted form of logFile
if processingErrors > 0 then
	display dialog "Completed with " & processingErrors & " errors. Check the log file for details." buttons {"OK"} default button 1
else
	display dialog "Successfully processed " & processedCount & " pages!" buttons {"OK"} default button 1
end if

an advantage vis-a-vis the application based processing: the script can be used in contexts like building macros (e.g. via Keyboard Maestro), or even inside DT and its smart rules etc.

the price one has to pay: you need to have Poppler installed (via Homebrew). Poppler is a PDF rendering library based on the xpdf-3.0 code base, and it provides the pdfseparate and pdftotext commands the script calls. see here

the way the AI/LLM describes its achievement/code artifact:

The Problem Solved
The script now correctly handles cases where image format names like “jpg” are directly attached to words (such as “Dry0719jpg”) in the extracted text.
Key Elements of the Solution

Step-by-Step Format Cleaning:
・ Format suffixes directly attached to words are now detected and removed
・ The pattern ([A-Za-z0-9]+)(jpg|jpeg|…)([^A-Za-z0-9]|$) specifically matches and removes these embedded formats

Enhanced Debugging:
・ Each cleaning step is logged separately
・ Makes it easier to diagnose any future issues

Comprehensive Format Handling:
・ Removes file extensions with dots
・ Removes standalone format words
・ Removes image dimensions
・ Cleans up extra spaces and special characters
What You Can Do With This Script
You can now:
・ Split any PDF into individual pages
・ Have each page automatically named based on its content
・ Process documents with image names and technical metadata in the text
・ Get clean, readable filenames without format suffixes
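
For readers who want to try the cleanup steps listed above outside of AppleScript and sed, here is a minimal Python sketch. It is an approximation of the same steps using Python's `re` module, not the script itself; `clean_title` and `FORMATS` are hypothetical names, and the order of the last two hyphen operations is my own choice.

```python
import re

# Format names to strip, as listed in the script above.
FORMATS = r"(?:jpg|jpeg|tif|tiff|png|gif|bmp|pdf|psd|raw|heic|webp)"
NOT_ALNUM_NEXT = r"(?![A-Za-z0-9])"


def clean_title(text, max_len=200):
    """Turn one line of extracted text into a filename stem."""
    # Step 1: format name glued to the end of a word ("Dry0719jpg").
    text = re.sub(rf"([A-Za-z0-9]+){FORMATS}{NOT_ALNUM_NEXT}", r"\1",
                  text, flags=re.IGNORECASE)
    # Step 2: file extensions with a dot (".jpg").
    text = re.sub(rf"\.{FORMATS}{NOT_ALNUM_NEXT}", "", text, flags=re.IGNORECASE)
    # Step 3: standalone format words.
    text = re.sub(rf"\b{FORMATS}\b", "", text, flags=re.IGNORECASE)
    # Step 4: image dimensions such as "1200 x 800 px".
    text = re.sub(r"[0-9]+ ?[x×] ?[0-9]+ ?(?:pixels?|px)?", "", text,
                  flags=re.IGNORECASE)
    # Step 5: collapse runs of spaces and trim.
    text = re.sub(r" +", " ", text).strip()
    # Step 6: limit the length and turn spaces into hyphens.
    text = text[:max_len].replace(" ", "-")
    # Sanitize: drop characters that are awkward in filenames, then tidy hyphens.
    text = re.sub(r'[/:"\\|?*<>.\r\n]', "", text)
    return re.sub(r"-{2,}", "-", text).strip("-")


print(clean_title("Dry0719jpg 1200 x 800 px"))
print(clean_title("Sunset.JPG (final)"))
```

One thing the sketch makes easy to see: step 1 also truncates ordinary words that merely end in a format name, so "draw" becomes "d". The pattern quoted above has the same property, which is worth knowing before using it on general text.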

There are two “extravagancies” in particular built into this that one might want to get rid of:

・ there is an error log production included; I needed that to get here… and decided to keep it
・ some amount of the code is tasked with removing file-format syntax from the resulting file names. I needed that because everything is based on another process based on image metadata. so, you might or might not want to keep this in your use contexts.

otherwise, looking forward to any improved community version, variation or fork…

SebMacV · April 15, 2025, 12:49pm

In case you need to do this in bulk, the stand-alone version of ABBYY Finereader will do this easily – you can load many pages of double spread PDF, and it will process the entire file into single pages and OCR at the same time. It’s saved me a lot of time on the occasions that I use my phone to scan a book chapter by openings.

chrillek · April 15, 2025, 1:22pm

As usual, the code is just terrible. Four do shell script calls to write four log entries because the AI has never heard about newline. Reliance on an external program (pdfseparate). Using ls to get a list of files that the script itself generated in the first place. Goodness, what do we have computers for?
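
To make the first and third of those points concrete, here is a minimal sketch in Python (not AppleScript, and not a rewrite of the script above). Log lines are collected and written in one call, and the page file names are derived from the page count instead of being listed back from disk. It assumes the page_%03d.pdf pattern from the script, with numbering starting at page 1; `page_names`, `write_log` and the path /tmp/in.pdf are hypothetical.

```python
from pathlib import Path
import tempfile


def page_names(page_count, pattern="page_%03d.pdf"):
    """Names of the per-page files, computed rather than read back with ls."""
    return [pattern % n for n in range(1, page_count + 1)]


def write_log(log_path, lines):
    """Write all log lines with a single call, joined by newlines."""
    Path(log_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "debug_log.txt"
    write_log(log_path, [
        "PDF PROCESSING LOG",
        "PDF File: /tmp/in.pdf",  # hypothetical input path
        "Output Dir: " + tmp,
        "-" * 35,
    ])
    logged = log_path.read_text(encoding="utf-8")
    print(logged, end="")

print(page_names(3))
```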

Sorry, but what is the point in posting terrible code like that – here or anywhere on the net? Everyone can tell an AI to produce something they think is “code”. It might even run. But why publish crap like that? No one can even learn anything useful from it (except how not to code).

If I were a “coding priest”, I would not spend a second on making this “more elegant”. Just write something good from scratch.

Unfortunately, it is not even clear what this code is supposed to do. Instead of having a bunch of log entries, some comments in the code would have been helpful (without marketing blurb like “advanced format removal”)

What does “application-based processing” mean? Is pdfseparate not an application? Or xpdf?

Whatever this stuff is supposed to do can be done with fewer lines of code and without relying on any external program, just using AppleScript/JXA and PDFKit.

lerone · April 15, 2025, 2:34pm

thanks for your comment, which kind of speaks for itself, really.

the code, as you say, works.
and it covers a relevant case.

happy to discuss, revise it w/ anyone interested in healthy community discussion (including about the use of AI for ‘patching’ problems).

also, everything about it was transparent – which I kind of cannot say for the motivation bringing such kinds of comments.

actually, it’s a kind of energy I would label as at least as questionable, as openly posting non-elegant code.

chrillek · April 15, 2025, 2:43pm

Others and I have already posted several examples of how to split PDFs, e.g. “Splitting a PDF at a recurring string/expression”, “Split PDFs - Custom Break - #14 by pete31” and “Splitting PDFs | JavaScript for Automation (JXA)”. They are fairly short, clear and instructive. We can talk about them, e.g. how to modify them to do what you want.

But for me (!), there’s no point in talking about crap code like the one produced by whatever AI. That’s just not worth my time and energy.

lerone · April 15, 2025, 2:56pm

first, you actually do not have to talk about it.
you obviously chose to do so. freely.

then, I am happy to now see this thread, re. the topic of the OP, enriched like that.
as you know, I am all for enriching threads.

… and if the impulse for something like that needs to be ‘crap code’ (and potential discussions about cultures of coding in times of AI, prosumers etc), I am totally fine with that.

what remains open is the kind of ‘addressing’ and ‘being spoken to’ by someone at least resembling an ‘angry priest’. that part, I actually do not accept.
I can live w/ people displaying such a kind of ‘culture’ and ‘karma’. personally, though, I do not accept it at all – nor do I think it’s tolerable under another set of ‘cultural ethics’ besides the ‘ethics of code’, namely the culture of forums.

others can make up their own minds on that part.

Stephen_C · April 15, 2025, 3:10pm

I do not wish to stir up any argument and appreciate everyone has their own view. The only comment I’d make is that in my time of using DEVONthink I’ve come across some amazingly well-written and helpful scripts on the forum. I found many of those when I was in my infant stage of AppleScript programming (I’m a little past that now!). I think in a case like that it’s helpful that anyone picking up a script here also picks up, with it, some good teaching/learning. I would not have known, at that stage, what was “good” and what simply might have “done the trick”.

So maybe there is some point in comments about “good” and “bad” scripts, so some of us know what we’re doing when we copy or adapt them.

This is not a personal criticism of your views but merely a comment to show that some of us still need constructive help!

Stephen

lerone · April 15, 2025, 4:11pm

appreciated.
I had the same experience as to be provided w/ good code, and even more good ‘teaching’ alongside, esp thinking of @pete31 – but also of others.. and always appreciated that. explicitly. from heart as much as a ‘cultural way of interacting’.

then, I also sympathize with what you are saying, as long as we are speaking about approaches as to ‘how to learn to code’ and cultures / stances around that. of course, one can learn from others if he/she wants to (learn to) code.

but this is not the sole context of this.

first, and I feel very strange having to say something like that in an open forum of adults:
neither did my comment offend anyone. nor was it an objective offense against anything.
so, I do not know why I have to be talked to like that, especially not if this really is about ‘learning’ – and a culture of ‘teaching’ (and I was teaching myself). so, no teacher would (should) speak like that. (leaving forum netiquette aside.)

then, you are right when you say “in a case like that [when one is on a trajectory of learning to code] it’s helpful that anyone picking up a script here also picks up, with it, some good teaching/learning”. I would be the last to contradict.

but here, there are more and other contexts – even if @chrillek’s personal perspective is that this is purely about ‘good coding’ and ‘learning it’. but nowhere did I subscribe to be a ‘learner of pure coding’, or even any type of personal/professional coding.

for one, and I made abundantly clear this is AI code(!), we are living in times where AI is a reality. it has just been implemented into DT. so besides it being used for ‘summarizing’ (and we could have a philosophical or a writers discussion about whether that ever should be done by a machine), AI is and will be used for (personal and professional) coding.
so, if there is something new to learn, it is how to (collectively) deal with that reality from now on, or how to give people the right tools and mindset to handle the new dualities and affordances, including pitfalls. given that a lot of people don’t want to be coders, or don’t have the capacities for it, people will in large swathes use such ‘makeshift’ code to solve their very practical, hands-on problems as they encounter them in the use of a potent app like DT.
– so, we are not (only) talking about ‘learning to code’. at least that is not my perspective.

then, this was as much about the practical solution (what I labeled a ‘digital artifact’), as it was about bringing in the conceptual issue of using PDF content for renaming split PDFs on the basis of their content – which is a very practical, challenging, and in many contexts relevant requirement.
– so, this could have been a start of a very fruitful discussion in that direction, as long as it is not forcefully framed as ‘coding ethos’ and bringing people into ‘teacher’ and ‘learner’ positions.

also, the first thing a ‘teacher’ has to learn, is to not shout at their pupils (that is, if they hand themselves in for that).
while I am happy to discuss all the issues involved, and also to ‘learn’ (where I want/can), I am very unwilling to make this a case where the onus is on me ‘justifying’ posting such posts in this way.
and very generally, I refuse to accept this tonality being brought to any forum’s ‘Discourse’.

PS: it should also be clear that my initial post was not ‘a personal criticism’ of @chrillek – nevertheless he chose to treat it like that, or to deduce the right to launch a quite offensive and transgressive tone w/o any need whatsoever… so, I’d wish for a more balanced approach in any – welcome – attempt at mediation.

uimike · April 15, 2025, 4:16pm

lerone,

This forum has contributors from many countries. Some people are, because of their background, much more direct than others. @chrillek is very direct, very no-nonsense - and if you can avoid taking his comments too much to heart, you will benefit from them, and from his (chrillek) super useful contributions all over this forum. He’s a programmer, and I’ve worked with programmers for decades. If there’s one thing they are known for, it is parsimony, code economy - and always shorter, more elegant, better ways of coding stuff. You can use AI to create code, but it is a ways from creating short, elegant, less error-prone code. I do write AppleScript - and I believe not too shabbily - but when I need something, the first thing I do is SEARCH. @Bluefrog, @cgrunenberg, @troejgaard, someone else most of the time has created some stuff that I can study and build on.

Back in design school - yes, in the US of A - we’d stick our stuff on the wall, and the criticism was NASTY (profs calling your stuff shit, tearing it to pieces physically, not kidding). The idea was you learned that no one was criticizing YOU - they were criticizing the stuff that was on the wall.

chrillek · April 15, 2025, 4:18pm

We’ve been over this before. AI code is notoriously bad. And I haven’t yet found anyone here willing to fix that stuff. Nor “discuss” it. There’s simply no point. It’s like telling a two-year-old that they can’t paint like Picasso, although on the surface, some of their drawings might resemble deconstructivist paintings.

Talking about real code might teach people something. Treating gibberish as if it were something else will not teach anyone anything. And I don’t care if AI is a reality if it produces gibberish.

Not by your fault, of course.

lerone · April 15, 2025, 4:29pm

folks,
I made my points about this being about more than one context (above post).
I can’t make you open up to that kind of exchange, if you decide this is ‘purely about code’ and ‘discussing it (code)’.

I am still quite … astonished … that – even if all this were correct (which it isn’t by any standards of discourse) – you are so willing to appease all this ‘communicative overreach’, while always arguing solely with those on the receiving end of this kind of communicative transgression.
if that is your stance, in such times, then so be it. for me this culture of ‘patrimony’ and ‘buddying’ is outdated.

I know design school people, as I know a lot of other professional and cultural contexts. nowhere would I go along with this old narrative that ‘people ripping others up in communication’ are just out there for the good. again, everyone’s choice.

I made my arguments, and would quite look forward to seeing them being discussed likewise (i.e. having a collective, non-forced understanding about how to deal w/ ‘makeshift’ and ‘AI’ code, as much as about the problem of slicing PDFs in ways that are fit for context, including meaningful automatic renaming).

as to communication culture (and I wouldn’t see myself fit to speak for ‘coding culture’, even though I have worked – productively – w/ many coders myself), you make your choice about style and ‘codes’. and I make mine.

but appreciate the demonstrated effort to infuse some sensible approach to the ‘learning to code’ discussion, @Stephen_C ! appreciated.
PS: fun fact: I have a hunch that @chrillek and I might even be from the same country – – I never thought this argument about ‘this is his/her culture’ – in this way – is any good; especially as I have worked in many transcultural contexts myself…

uimike · April 15, 2025, 5:03pm

Sorry you don’t seem to understand, but rather feel hurt. Yes it is a cultural thing, and I can guarantee I worked for much longer and with many more cultures, and professions, than you did. Not to brag. I have even taught that. But that is beside the point - pride is the issue. And when you are ready to swallow your pride, you can get ready to learn - especially from the tough ones, the ones that will rip your work. I’d rather take a grumpy Nobel Prize winner than a nice tinkerer, anytime.

lerone · April 15, 2025, 5:16pm

no, I am not ‘personally’ hurt.
you don’t seem to understand that.
I am making arguments about the cultures and codes of collective communication, and about multidimensional exchange/communication/rationalities, especially in forums.

otherwise I don’t want to question your personal confidence, even though I don’t know wherefrom you take your knowledge about me. also, things like ‘Tinkerers’ vs. ‘Nobel Prize Laureates’ I think is a very unhelpful and misplaced framing here (again, rather talking about a very loaded culture of clear-cu(l)t ‘personalities’, all with ‘their place’ to speak).

and these ‘personalizations’ are really of no interest to me and I see them – in the face of the multiplicity of aspects raised – as reductive framings. especially, as they are all leading into the same river: sanctioning offensive and topically narrowed down and ‘closing down’ (instead of ‘opening’ / welcoming) communication. That is: ‘slanted’, non-reciprocal communication.

I am not hurt. but I surely learn about standards of communication here. and the approach to collective, shared and mutual learning.

BLUEFROG · April 15, 2025, 5:24pm

I think we’ve talked this one out. Closing shop…