Splitting a PDF

rickl · August 20, 2005, 9:57am

I just followed Bill DeVille’s suggestion for combatting writer’s block in the Usage Scenarios forum: copy a section of text into a temporary note, then click “See Also”. The most relevant document turned out to be a 227-page PDF (a complete journal volume), and a quick glance through failed to reveal where the relevant passage might be. That got me thinking that I really ought to split all such PDFs into separate articles, so as to leverage DT Pro’s capabilities. Does anyone know if there’s any reasonably cheap software that can do this?

rollo · August 20, 2005, 10:29am

What I do is bring my PDFs into Acrobat Pro, save the file as Rich Text, then import the resulting RTF file into DTPro, where I then split it down into a number of sub-files, all housed in a single group or folder that relates to the PDF. It does what I want, but doesn’t work with protected PDFs.

Rollo

howarth · August 21, 2005, 2:15am

Have you tried working with Preview in OS 10.4? It reads PDFs faster than Acrobat Reader, allows annotation, and lets you copy text and create new files. Use the Copy command, followed by File: New from Clipboard, then Save As… with a new file name. Then drag the new files into DTP. All the texts are searchable.

rickl · August 21, 2005, 5:45am

Thanks. Now that you say it, it sounds like an obvious way to go (though it hadn’t occurred to me), although it might be a little more labour-intensive than something that could actually split the PDFs. An added bonus would be that I could get rid of the PDFs, thus hopefully lightening the load on the database a bit.

Not much to do with DT, but I discovered after reading an O’Reilly article on Preview that the Bookmarks menu contains all the bookmarks you make, not just those in open documents, which makes Preview a simple PDF cataloging application: you don’t need to know where a document is in the Finder, as long as you’ve made at least one bookmark in it.

rollo · August 21, 2005, 6:09am

I had never thought of this, but have now tried it. It only seems to work with one page at a time, which is more than a pain if you have, say, a 300-page document. Have I missed something?

Based on what I’ve tried so far it doesn’t begin to compare for flexibility of output with saving as RTF from Acrobat Pro, or indeed converting to text/rtf/word within Omnipage Pro X. Now if it were possible within Preview to save as rtf … THAT would do the job!!!

Rollo

rickl · August 21, 2005, 7:10am

The one-page-at-a-time limitation is a major problem. I don’t have OmniPage Pro, but do have ReadIris Pro (is OmniPage better?). I’ve just tried reading in a moderately lengthy PDF, and it certainly takes a long time. It looks like I might have to bite the bullet and buy Acrobat. But I’ll see if there are any smaller, more specialized utilities first.

rollo · August 21, 2005, 10:15am

I have played around with ReadIris Pro, but for me it doesn’t compare with OmniPage. With OmniPage I can scan 500 or more pages at a time, approximately one every 10 seconds (I have a relatively fast scanner specifically for this) automatically set up specific recognition zones of different types for text content, tables, graphics etc. and then just leave it to do its OCR thing over the next hour or whatever length of time it takes. It’s excellent.

ReadIris Pro, in contrast, didn’t seem to allow me to get all the scanning done in one hit so I could get the manual intervention out of the way. Instead it seemed to scan each page then recognise it, without allowing for the zones I wanted to recognise, before I could scan the next page. That makes it much more cumbersome for me. Again, I might have missed something here; if anyone knows different, perhaps you could correct me.

Rollo

Ariew · August 23, 2005, 1:07am

On another thread I extol the virtues of Circus Ponies’ Notebook (NB) for splitting up PDFs in order to simulate what Steven Johnson describes as hitting the “sweet spot” for usable text that is searchable from Devonthink (see “Revisting Steven Berlin Johnson” in “Usage Scenarios” in this forum or see the article directly at: stevenberlinjohnson.com/mova … 00230.html).

What I do is convert the PDF into RTF and then copy and “paste text as an outline”. This results in NB splitting a PDF file into separate paragraphs, one paragraph to one cell (those familiar with NB will know what I mean). Then, I utilize NB’s “indexing” function to look up important keywords. So, for example, in an article on the recent history of population genetics, I was particularly interested in what the author had to say about R.A. Fisher (one of the fathers of Darwinism as we know it today). So, I looked up the index for all paragraphs that mention “Fisher”, then saved each back to the Devonthink database (using the shortcut key, command-"(" for plain text or command-")" for rich text). The result is a sequence of notes (I sorted them sequentially by date created) that hit the high points of the article. Next time I do a global search across my entire database on “Fisher” or “population genetics”, I’ll get back small snippets of text (paragraphs, to be precise) from that article. Now, I don’t know whether that will be useful to me (my writing project is in its infancy). But, based on Johnson’s experience, I’m betting that it will be invaluable. Compare this method to my old way of writing: rely on memory!
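For anyone who wants to try the paragraph-and-index idea without Notebook, here is a rough Python sketch of the same workflow: split plain text into paragraphs, index them by word, and pull out the paragraphs that mention a keyword. It is not how Notebook works internally. The function names and the sample text are made up for illustration, and it assumes paragraphs are separated by blank lines.

```python
import re
from collections import defaultdict


def split_paragraphs(text):
    """Split plain text into paragraphs wherever there is a blank line."""
    parts = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks so each paragraph is one line of text.
    return [" ".join(p.split()) for p in parts if p.strip()]


def build_index(paragraphs):
    """Map each lowercased word to the positions of the paragraphs containing it."""
    index = defaultdict(list)
    for pos, para in enumerate(paragraphs):
        for word in set(re.findall(r"[A-Za-z']+", para.lower())):
            index[word].append(pos)
    return index


def paragraphs_mentioning(keyword, paragraphs, index):
    """Return the paragraphs that contain the keyword, in document order."""
    return [paragraphs[i] for i in index.get(keyword.lower(), [])]


# Invented sample text standing in for an exported article.
sample = (
    "Population genetics took shape in the 1920s.\n\n"
    "Fisher argued that selection acts on\nsmall heritable differences.\n\n"
    "Wright stressed drift in small populations.\n\n"
    "Later authors returned to Fisher and his model."
)

paragraphs = split_paragraphs(sample)
index = build_index(paragraphs)
for note in paragraphs_mentioning("Fisher", paragraphs, index):
    print(note)
```

Each printed paragraph corresponds to one note that would be saved back to the database. The approach only works if the exported text really does keep paragraphs together.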

I hope this helps.

Andre Ariew

rollo · August 23, 2005, 6:13am

Ariew, you didn’t mention what method you used to convert the PDF into RTF. Can you elaborate? I have used Acrobat Pro, Omnipage Pro X and Textlightning … did you use another method?

Rollo

rickl · August 23, 2005, 8:35am

Thanks, Andre, for taking the time to repeat and expand upon your method. The details are very important here, so let me report back on my successes and otherwise.

I tried your method with 2 different, quite long (around 100 pages), PDFs. Exporting to RTF from DT Pro resulted in a 4 KB file with no text both times. Exporting to text, however, seemed to work fine. However, each cell in CP Notebook consisted of only 5 or 6 words. Looking back in the exported text document, it seems that pretty much every line in the PDF had become a new paragraph. I tried fiddling about with line breaks and soft wrapping in TextWrangler, but all I could accomplish was to make one mega-paragraph. The way searching from the text index works in CP Notebook (or at least the way I use it) doesn’t seem very helpful. I found multiple instances of one word I was looking for. Clicking on one of those instances in the index takes me to that instance, but there doesn’t seem to be an easy way to go back to where I was in the index, so I find searching this way very time-consuming. If I buy Acrobat, though, I imagine the paragraphs will come out OK, and I’d like to give the CP Notebook way another try at that time.
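The line-break problem described above (every PDF line becoming its own paragraph, or everything collapsing into one mega-paragraph) can sometimes be worked around with a heuristic. The Python sketch below is one guess at such a heuristic, not a feature of any tool mentioned in this thread: it treats a line as the end of a paragraph when it is followed by a blank line, or when it ends a sentence and is clearly shorter than the widest line. The function name, the 0.75 threshold and the sample text are assumptions.

```python
def reflow(text, short_ratio=0.75):
    """Rejoin hard-wrapped lines into paragraphs using a simple heuristic."""
    lines = [ln.strip() for ln in text.splitlines()]
    nonblank = [ln for ln in lines if ln]
    if not nonblank:
        return []
    # Assume the widest line approximates the column width of the PDF.
    width = max(len(ln) for ln in nonblank)

    paragraphs, current = [], []
    for ln in lines:
        if not ln:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(ln)
        ends_sentence = ln.endswith((".", "?", "!"))
        is_short = len(ln) < short_ratio * width
        # A short line that ends a sentence is probably a paragraph's last line.
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Invented sample: two paragraphs exported with a line break after every line.
exported = (
    "The journal volume was exported as plain text and\n"
    "every line of the PDF came out as its own paragraph,\n"
    "which is no use.\n"
    "A second paragraph starts here and also runs across\n"
    "more than one line.\n"
)

for para in reflow(exported):
    print(para)
    print()
```

The heuristic will misfire on paragraphs whose last line happens to fill the column, and it does not undo hyphenation, so the output still needs checking by eye.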

On VersionTracker, I found a utility called PDFLab. This can merge and split PDFs. I tried it out by putting 3 short PDFs together and adding an image file, and arranging the pages how I wanted. It worked fine. But when I fed it my large journal issue PDF and stipulated which pages I wanted it to export, I got an error message every time. So I went back to VersionTracker, this time downloading PDFPen. Because there were 3 articles in the PDF that I wanted to keep, I made 3 duplicates of the original PDF, and with each one used PDFPen to delete everything but one of the articles I wanted to keep. I saved the resulting files to a folder with a Save to DT script attached, with the result that I now have 3 separate PDFs, one for each article, and searches now work a lot better. But I’ll have to think about whether I can make the process a little more streamlined.
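For those comfortable with scripting, the duplicate-and-delete routine above could in principle be replaced by a script that copies only the wanted page ranges into new files. The Python sketch below is an assumption-laden illustration rather than anything tried in this thread: it relies on the third-party pypdf library (imported only inside split_pdf), and the article names, page ranges and page count are hypothetical.

```python
def page_indices(first, last, total_pages):
    """Convert a 1-based inclusive page range into 0-based page indices."""
    if not (1 <= first <= last <= total_pages):
        raise ValueError(f"bad range {first}-{last} for a {total_pages}-page PDF")
    return list(range(first - 1, last))


def plan_split(articles, total_pages):
    """Turn {name: (first_page, last_page)} into {name: [0-based indices]}."""
    return {
        name: page_indices(first, last, total_pages)
        for name, (first, last) in articles.items()
    }


def split_pdf(source_path, articles):
    """Write one PDF per article. Assumes pypdf is installed (pip install pypdf)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    plan = plan_split(articles, len(reader.pages))
    for name, indices in plan.items():
        writer = PdfWriter()
        for i in indices:
            writer.add_page(reader.pages[i])
        with open(f"{name}.pdf", "wb") as out:
            writer.write(out)


# Hypothetical page ranges for three articles in a 227-page volume.
articles = {
    "article-1": (3, 24),
    "article-2": (61, 88),
    "article-3": (190, 227),
}
plan = plan_split(articles, total_pages=227)
for name, pages in plan.items():
    print(name, "pages", pages[0] + 1, "to", pages[-1] + 1, f"({len(pages)} pages)")
```

Only the planning step runs here. Calling split_pdf("volume.pdf", articles) with a real file would write article-1.pdf and so on to the current folder, and the resulting files could then be dropped into a folder with a Save to DT script attached.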

I also found another utility called joinPDF (free, versus the $50 of PDFPen) that can join or split PDFs. I haven’t yet had time to compare the 2. I imagine either of these would be a fair bit cheaper than Acrobat, though it’s possible that going the Acrobat way (exporting to RTF and then splitting) would be more efficient than splitting the PDFs directly.

Since I’m into PDF utilities at the moment, I’ll also mention one that I bought some time ago: PDFShrink. As the name suggests, it reduces the size of PDFs considerably. So far, I’ve only used it when emailing or uploading PDFs, and haven’t tried preprocessing PDFs destined for DT, for fear that DT may be unable to recover the text from files tampered with in this way.

Ariew · August 23, 2005, 5:18pm

Rollo–I’m sorry that I forgot to mention it. I use Acrobat to convert PDFs to RTFs. Acrobat is expensive, but it also serves as an OCR which is helpful to convert image PDF files to searchable ones. So, I find it quite a valuable tool.

Thanks too to “Rickl” for writing up your trials with this method. You are right that Acrobat solves some of the problems for you. I tried various workarounds, such as highlighting the PDF, making a rich text note from Devonthink, then copying and pasting the rich text note to NB as an outline. That has the undesirable effect that you got: it makes each line a separate cell. Instead of PDF splitting to get a “sweet spot” you get something more like hair splitting!

If you want to see how it works without having to purchase Acrobat, try the NB method on an RTF text (or a Word text that you save as RTF). I look forward to hearing your results.

Andre

hardcat · August 28, 2005, 11:36pm

Hello,

Have you tried PDFLab? It will join or split PDFs as you require. It is fairly basic but does what it’s designed for very well. It is free, although if you use it you might want to donate to help support continued development.

macupdate.com/info.php/id/15818

hardcat

rickl · August 29, 2005, 10:01am

Thanks for the suggestion, hardcat. As I mentioned a couple of messages above, I had success with PDFLab when joining PDFs and images, but it choked when trying to split a very large PDF. However, PDFPen worked well.

If anyone knows any other PDF utilities that may be worth a look, please keep us updated.

kevmck · August 29, 2005, 4:07pm

Hi Rick/all,

There’s an excellent little freeware app that allows you to stitch together PDF files, remove pages, and could just as easily be used to split PDFs into smaller page groupings. It’s called CombinePDF and can be found at: http://www.monkeybreadsoftware.de/Freeware/CombinePDFs.shtml

Hope this helps!

Kevin

rickl · August 30, 2005, 12:52am

It sure does! Thanks. What an excellent app. I’m finding it a lot faster than PDFPen, because as soon as a PDF is dragged in it gets split into as many files as there are pages, and the pages are all listed along with their page numbers. So I can just delete all the pages that aren’t included in the article I want very quickly without checking thumbnails, etc. A nice little touch is that it automatically opens each PDF I generate in Preview so that I can check it’s OK.

Going slightly off-topic, is it acceptable to create sub-folders as desired in the OS X Applications folder? I’d like, for example, to put all my PDF-related apps together.

howarth · August 30, 2005, 1:31am

The Applications folder does not like subfolders. Your apps will work, but their Help files often get confused. To get around this limitation of the OS, I place only Apple apps in Applications and all my third-party apps in • My Applications, grouped by function: Backup, Compress, Finance, Images, Maps, Text, etc.

I had to make some exceptions to this plan, for the few apps that insist on living in Applications. But all the others work fine elsewhere, including Acrobat, Filemaker, Quicken, Photoshop, DreamWeaver, and Microsoft Office.

hardcat · August 30, 2005, 4:02am

Why not leave them all in Applications, where there will be no issues, and then group them in an application launcher? I do this and it works very well. I use F10 Launch Studio, although there are several others.

chronosnet.com/Products/f10_product.html

hardcat

rickl · August 30, 2005, 8:12am

Mmm, that seems like the safest course. I’ll look into F10 Launch Studio.

rickl · September 4, 2005, 8:28am

Hoping that people haven’t become bored with this topic, here’s an update on my experiments with Andre’s method.

I remembered that I actually had an old copy of Acrobat: Acrobat Professional 5.0 (Japanese version). So I tried installing it on my Tiger computer, without any problems. Then I opened a journal issue PDF and exported to RTF, and opened the RTF in Word. Then I tried the Copy – Paste text as an outline routine in CP Notebook, but was rather frustrated to find again that every line in the RTF document was a new cell. Turning on invisibles in Word, sure enough I see a Return mark after every line.

In another thread, I’ve outlined my inability to get the PDF2RTF Service working, in addition to difficulties changing the PS/PDF import preference in DT to RTF (it insists on continuing to import PDF + Text). So it seems that the only option for me at the moment remains using PDFLab or, better, CombinePDF, which is pretty time-consuming and doesn’t have the elegance and functionality of Andre’s method. Does anyone have any additional ideas for overcoming these problems?

Ariew · September 4, 2005, 12:07pm

Rickl writes, "I remembered that I actually had an old copy of Acrobat: Acrobat Professional 5.0 (Japanese version). So I tried installing it on my Tiger computer, without any problems. Then I opened a journal issue PDF and exported to RTF, and opened the RTF in Word. Then I tried the Copy – Paste text as an outline routine in CP Notebook, but was rather frustrated to find again that every line in the RTF document was a new cell. Turning on invisibles in Word, sure enough I see a Return mark after every line."

I have tried this method numerous times with rarely a problem (I do have trouble with PDFs with multiple columns; converting those to RTF gives you a messy document). The result I get (I have Acrobat 6; would this make a difference?) is that every cell in Notebook contains a whole paragraph.

The other part of my original post suggested that by using Notebook’s index function we can extract the most important paragraphs, ones that are associated with the keywords we deem most important. This still works, but I rarely use it anymore. When I do it is for an article that I’ve read before and don’t feel like rereading for the sake of extracting information. For the most part, nothing beats a careful reading of a text, cutting and pasting into DevonThink as I go.

Nevertheless, as others in this thread report, I am encouraged that there are other free programs that split pdfs.

Andre