separate PDF files

saltedgreens · July 18, 2006, 3:34pm

Does DEVONthink search better across 100 separate PDF files or a single 100-page PDF file?

If separate files work better, how does one go about splitting the PDF file into single pages?

Bill_DeVille · July 18, 2006, 6:49pm

There’s been a lot of discussion about “sweet spot” document sizes, following Johnson’s essay a year or so ago.

Generally I don’t bother to split documents, although there could be advantages for searches or “See Also” operations in some documents.

In another thread in this topic area, sjk recommended a utility that can either split or combine PDF documents.

howarth · July 19, 2006, 12:23am

Bill,

I have a large webarchive (11.7 MB) that is too big for Safari or DTP to browse easily. I would like to split it into 10 or more files, each with a resource fork as well as a data fork. Do you know of any utility that will handle that task?

Ultimately I will place the html files in DTP for fast searching. But right now it’s hard to scroll around in the big one.

Will

Bill_DeVille · July 19, 2006, 2:52am

Will, I don’t know of any utility off the top of my head.

But here’s one of my workarounds: select a portion of the page, save it to DT Pro as a rich text note, move down to another portion and do the same, etc. Name the segment files so they sort in the right order within the group that holds them.
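For anyone who would rather script the chunking and naming step once the text is out of the browser, here is a minimal Python sketch. It is not a DT Pro feature; the function names, the "segment" base name, and the 50,000-character limit are all assumptions made for illustration. It cuts the text at paragraph breaks and gives each piece a zero-padded name so the pieces sort correctly in a group.

```python
def split_into_segments(text, max_chars=50_000):
    """Split text at blank lines into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-paragraph.
    """
    segments, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow the chunk, so start a new one.
            segments.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def named_segments(text, base="segment", max_chars=50_000):
    """Pair each chunk with a zero-padded name (hypothetical naming scheme)."""
    segments = split_into_segments(text, max_chars)
    width = len(str(len(segments)))  # pad so that 2 sorts before 10
    return [(f"{base} {i:0{width}d}", chunk)
            for i, chunk in enumerate(segments, start=1)]
```

Zero-padding matters because a plain alphabetical sort would otherwise put "segment 10" ahead of "segment 2".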

I would have suggested Data > Convert - to plain or rich text for searching, but there was a bug in Apple’s WebKit code that bit fairly often on WebArchive conversions. You might try it, though.

If you’ve captured the content of the big WebArchive file as rich text, you don’t even need to include it in your database. In the first segment of rich text you can Command-Option-Drag the WebArchive from the Finder to the insertion point in the text window and create a clickable external link to the WebArchive file.

sjk · July 19, 2006, 3:15am

howarth gets credit for suggesting PDFLab first, then I mentioned Combine PDFs (because of Java-related problems with PDFLab on my system, in case others encountered them).

Bill_DeVille · July 19, 2006, 3:51am

sjk: Yes, I had run into Java problems with PDFLab. Combine PDFs works great.

I’ve got Acrobat, but that’s expensive if someone just needs to split or merge PDFs.

I’ve got several programs that shrink PDF file sizes. I’ve settled on PDFShrink ($35) because I do this a lot and it works better than Acrobat’s Reduce File Size option. Preview can reduce file sizes, but the results are sometimes blurry.

sjk · July 19, 2006, 4:52am

That’s helpful to know, thanks.

howarth · August 6, 2006, 1:40pm

Bill, I’ve been trying to work with the webarchive file and find that it’s very slow going in Safari…does not scroll well at all, and copy/paste operations are tedious.

My best way to proceed is to copy the text and paste it into a new plain-text window in DTPro. That works best if the original is in html instead of webarchive (and faster if I use Camino instead of Safari to browse the html text).

Alas, I saved the originals mostly in webarchive instead of html and the originals are now erased. Do you know of any utility that will convert a Safari webarchive to plain html? Camino cannot open the webarchive files.

Thanks for your advice.

Will

howarth · August 6, 2006, 4:10pm

Bill,

I found the solution to my problem. A shareware utility called File Juicer will extract text, rich text, html, and images from many kinds of files, including Safari webarchives.

echoone.com/filejuicer/

To process my 11 MB webarchive, FJ took about 10 minutes and produced a 7.2 MB text file. I’m reading that now in Safari and copy-pasting segments into my DTPro database as plain text files.

I may use it only this once, but I’m happy to pay the shareware fee of $12.

Will
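For readers who want to script this rather than use a utility: as far as I understand the format, a Safari webarchive is a property list whose main page is stored under the key `WebMainResource`, with the raw bytes in `WebResourceData` and the encoding name in `WebResourceTextEncodingName`. Under that assumption, a minimal Python sketch for pulling out the main HTML could look like the following. This is not how File Juicer works internally, the function name is made up for illustration, and it ignores images and other subresources.

```python
import plistlib


def main_html_from_webarchive(data: bytes) -> str:
    """Return the main page's HTML from the raw bytes of a .webarchive file.

    Assumes the plist layout described above; raises KeyError if the
    expected keys are missing.
    """
    archive = plistlib.loads(data)  # handles both binary and XML plists
    main = archive["WebMainResource"]
    # Fall back to UTF-8 when the archive does not record an encoding.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return main["WebResourceData"].decode(encoding, errors="replace")
```

Typical use would be to read the file in binary mode, for example `open(path, "rb").read()`, pass the bytes to the function, and write the returned string to an `.html` file.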

Bill_DeVille · August 6, 2006, 8:59pm

Hi, Will:

DT Pro has a command: Data > Convert - plain text or rich text that will convert WebArchives to text.

In the past – and probably still in the current release of DT Pro – there has been an Apple WebKit bug that sometimes bites in such a conversion.

I’m using a beta and I think Christian has bypassed that WebKit bug, as I’ve just successfully converted several large WebArchives to rich text without any problem.

Data > Convert also works for other file types such as HTML and PDF.

So keep that in mind for a future release of DT Pro.

br11 · August 20, 2006, 11:52am

Hi!

I used PDFLab http://iconus.ch/fabien/pdflab/ for splitting.

Files with up to 400 pages were split OK; a bigger one (751 pages) always caused a crash, even when selecting only a small number of pages.

As a workaround, I used Automator to split it into odd and even pages, then split these with PDFLab into single pages and renamed them with the AppleScript below (requires the MacPackToolbox, osaxen.com).

BTW, how about expanding this forum with a “Useful software” section (tables or a wiki) and a “Best Practices” collection?

Greetings!
Br@


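(* You must replace all text in "<>" with your own parameters. *)
(* "MP Replace" comes from the MacPackToolbox scripting addition. *)

on run
	tell application "Finder"
		set mylist to files of folder "<Path to the folder with the odd and even pages files>"
		repeat with myfile in mylist
			-- Strip the extension, then the prefix, leaving only the sequence number
			set myname to name of myfile
			set myname to (MP Replace ".pdf" using "" in string myname)
			set mynumber to missing value
			if myname starts with "<Even>" then
				-- The n-th even file is page 2n of the original
				set myname to (MP Replace "<Even Page>" using "" in string myname)
				set mynumber to ((myname as number) * 2)
			else if myname starts with "<Odd>" then
				-- The n-th odd file is page 2n - 1 of the original
				set myname to (MP Replace "<Odd Page>" using "" in string myname)
				set mynumber to ((myname as number) * 2 - 1)
			end if
			-- Skip files that match neither prefix instead of failing on an undefined number
			if mynumber is not missing value then
				set the name of myfile to "<Name for all files - page >" & mynumber & ".pdf"
			end if
		end repeat
	end tell
end run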
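
The renumbering arithmetic in the script can be checked on its own. Below is a minimal Python sketch of the same mapping, assuming the split files are named something like "Odd 3.pdf" and "Even Page 12.pdf"; the prefixes, the "page" base name, and the three-digit padding are placeholders of mine, not what Automator or PDFLab necessarily produce. The n-th odd file maps to page 2n - 1 and the n-th even file to page 2n.

```python
import re


def original_page(filename, odd_prefix="Odd", even_prefix="Even"):
    """Map a split file name back to its page number in the original PDF."""
    match = re.fullmatch(r"(.+?)\s*(\d+)\.pdf", filename)
    if match is None:
        raise ValueError(f"no sequence number found in {filename!r}")
    prefix, n = match.group(1), int(match.group(2))
    if prefix.startswith(even_prefix):
        return 2 * n        # 2nd, 4th, 6th, ... page
    if prefix.startswith(odd_prefix):
        return 2 * n - 1    # 1st, 3rd, 5th, ... page
    raise ValueError(f"{filename!r} is neither an odd nor an even file")


def new_name(filename, base="page", width=3):
    """Build a zero-padded target name so the pages sort correctly."""
    return f"{base} {original_page(filename):0{width}d}.pdf"
```

For the 751-page file this means the last odd file (number 376) becomes page 751 and the last even file (number 375) becomes page 750.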