What are people’s thoughts and/or tips regarding splitting a multi-page PDF into a group of single-page PDFs in order to improve the relevance of DEVONthink’s Classify and See Also results?
Merging PDFs is straightforward in DEVONthink but I haven’t found an easy way to automatically break up a large PDF into a bunch of smaller files.
It would be great if DEVONthink returned a single PDF page when you hit the “See Also” button for a related text chunk, as opposed to the whole PDF document.
I tried to create an Automator script using the “Extract Odd & Even Pages” action to do this. But I couldn’t get it working.
What I finally found was that Adobe Acrobat 8’s “Document > Extract Pages …” feature does the trick. (Apparently, Acrobat 9 has an improved version of this feature called Splitting.)
But I’d like to find something that doesn’t involve paid software like Acrobat.
Any thoughts or suggestions?
Perhaps this answer to my similar question from Bill DeVille a few months ago in this thread posting.php?mode=quote&f=7&p=38206 will help you:
I’ve found this tip to be such a huge help in my own document organization. And yes, you have an excellent point about improving relevance of Classify & See Also. I hadn’t thought of that angle myself.
@kunmingstreet: Side comment - I’ve been to Kunming the city but where is a street named Kunming? Near Stone Forest Blvd?
I’m not sure why that would matter. The relevant data are the data, whether they are found on one page or twenty. Let’s say the document you are classifying is about cold fusion, and you have a PDF that mentions “cold fusion” twenty times - it’s always going to be 20 times no matter how many pages you split the document into.
I have begun splitting documents for exactly this reason. Here are some use cases:
- I get some email newsletters with a collection of unrelated tips. Splitting into one file per tip helps me categorize each tip, and also makes it easier for See Also and Auto Classify to understand what a unit of content is.
- I have some magazines in PDF. Splitting them up by article makes more sense than keeping the whole magazine.
The way I split things into one-page PDFs is to open the file in Acrobat Pro, and do Extract Pages and check the box to split each page into a separate file. I use DTPO to split them when it’s more than one page per output PDF, and I do those manually.
@ Wally, er, Korm:
Perhaps the most relevant bits about cold fusion are on the last few pages where the first pages introduce history leading up to cold fusion and other “front” or explanatory matter. I’m stretching the example a bit and I probably wouldn’t split an article such as that but I have run across PDFs that contain more than one discrete topic (unfortunately).
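To make the two sides of this concrete, here is a toy illustration in Python. It is only a sketch of the arithmetic, not a description of how DEVONthink actually ranks documents, and the three sample "pages" are invented for the example. It shows that the total number of mentions is the same whether or not you split, while the share of each chunk that is about the topic changes a lot.

```python
def count_term(text: str, term: str) -> int:
    """Count case-insensitive occurrences of a phrase in a text."""
    return text.lower().count(term.lower())


def share_of_words(text: str, term: str) -> float:
    """Fraction of the words in `text` that belong to occurrences of `term`."""
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    return count_term(text, term) * len(term.split()) / total_words


# Invented sample pages: front matter first, the actual topic on the last page.
pages = [
    "history of electrochemistry and early experiments",
    "background on palladium electrodes and heavy water",
    "cold fusion results and cold fusion replication attempts",
]
whole = " ".join(pages)
term = "cold fusion"

# Splitting never changes the total count (korm's point) ...
total_split = sum(count_term(page, term) for page in pages)
total_whole = count_term(whole, term)
print("mentions in whole document:", total_whole)
print("mentions summed over pages:", total_split)

# ... but it does change how concentrated the topic is in each unit.
print("share of whole document:", round(share_of_words(whole, term), 2))
for number, page in enumerate(pages, start=1):
    print(f"share of page {number}:", round(share_of_words(page, term), 2))
```

Whether that concentration matters for See Also depends on how DEVONthink weighs document length, which this sketch does not try to model.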
BTW: The easiest option to split a PDF into two parts is to select the page after the split point in the PDF’s sidebar, then right-click it and select “Split Document.”
Steven Berlin Johnson has written about his practice of maintaining small documents in the range of 50 to 500 words, as he feels that’s the ‘sweet spot’ to maximize the powers of See Also to help him find related ideas in his DEVONthink databases.
That approach seems to work well for him. He’s a prolific writer. However, he has also confessed that he has assistants to do the labor of carving out all those little snippets from books and articles.
I don’t do that, in part because I’m lazy and in part because I don’t like to ‘vandalize’ my references.
However, my practice of making rich text notes about important references when I’m working on a project tends to produce similar snippets in roughly that size range, so over time I end up with both book-length references and scatterings of snippets that isolate particular ideas contained in them, and contain hyperlinks to one or more related references. I’m quite satisfied with my approach.
@Everyone - Thanks for the speedy, thoughtful responses! I apologize for not replying sooner myself. I don’t think I have the “notify of replies” function set properly.
@twicks, @eboehnisch - Thanks for the tips! I’ll definitely incorporate that into my workflow, especially for short PDFs.
But what I had in mind actually was something that would automate the process for larger PDF files, i.e. > 20 pages.
So splitting the PDF would be a matter of simply running a script or an Automator workflow.
I tried to do this with Automator but couldn’t get it to work properly. But like I wrote in my original post, Acrobat does the job just fine. I was just wondering if there were any built-in Mac solutions.
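For anyone who wants to script this, here is a minimal sketch in Python. It is not a built-in Mac solution: it assumes the free third-party pypdf package is installed (`pip install pypdf`). The function names `page_filenames` and `split_pdf` and the `_p001` naming scheme are my own choices for illustration. The original PDF is only read, never modified, so an intact copy is kept.

```python
from pathlib import Path


def page_filenames(stem: str, page_count: int) -> list[str]:
    """Return zero-padded names such as 'report_p001.pdf' so files sort in page order."""
    width = max(3, len(str(page_count)))
    return [f"{stem}_p{n:0{width}d}.pdf" for n in range(1, page_count + 1)]


def split_pdf(source: Path, out_dir: Path) -> list[Path]:
    """Write each page of `source` to its own single-page PDF inside `out_dir`."""
    # Assumption: the third-party pypdf package is available.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    out_dir.mkdir(parents=True, exist_ok=True)

    names = page_filenames(source.stem, len(reader.pages))
    written = []
    for page, name in zip(reader.pages, names):
        writer = PdfWriter()      # a fresh writer per page gives one file per page
        writer.add_page(page)
        target = out_dir / name
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling `split_pdf(Path("Magazine.pdf"), Path("Magazine_pages"))` (both names hypothetical) should leave a folder of single-page PDFs that can be dragged into a DEVONthink group alongside the original.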
@twicks - There’s a Kunming Street in virtually every city in China!
@korm - Good point. I’m thinking of it more for improving the See Also results. (You’re absolutely right about it not improving the Classify results.)
With See Also, it’s my understanding that DEVONthink’s AI returns the entire document in its search results and not the individual instances in that file of a keyword (or semantic) match.
If it returned the individual matches instead, I wouldn’t have to break up the file in the first place.
@alanshutko - Thanks! Acrobat to DTPO is exactly what I came up with. And being able to tag individual pages is super useful.
@Bill_DeVille - Johnson’s example is exactly what I was thinking of! But not having the luxury of an assistant, I’ve had to come up with a more computerized solution. As for vandalizing the references, I make sure to keep a fully intact version of the PDF in addition to the single-page references, and then group all the files together in a DEVONthink Group.
Ha! That’s the common assumption but they are really signposts pointing out the route to Kunming! Sort of like telling a traveler to any German city to head over to “Einbahnstrasse” if they want a good time (or whatever).