Splitting multiple page PDFs for better searching?

Seeing this thread on splitting large text files made me think about doing the same for large PDFs.

As Steven Berlin Johnson pointed out: “If I had whole eBooks in there, instead of little clips of text, the tool would be useless.”

I’ve found PDFPen which will split PDFs into individual pages, but before I launch into that on my research collection, I thought I’d tap the collected wisdom here.

(1) Splitting into pages is not necessarily the same as splitting into “sense units”. A relevant thought might easily split across pages. So is this a good idea? Splitting into two or three page chunks would reduce the risk of splitting across sense units, without making the record too large for useful searching, so might be a sensible compromise.

(2) Any thoughts on splitting PDFs into paragraphs? Possible? Or advisable? It would lead to thousands of records…

(3) Is there anything on the horizon with DTP 2 that should give me pause before I do decide to split PDFs? E.g., at the moment I don’t find searching within a document as easy as searching across documents.

I’m still fairly new to DTP so may not be taking advantage of all of its features.

Many thanks for any thoughts


Hi, Kevin. While I understand Johnson’s argument for limiting the size of documents in order to improve the functioning of “See Also”, I haven’t segmented the hundreds of large PDF documents in my databases, many of which are hundreds of pages in length.

Yet I’m a heavy user of “See Also” and find it very useful in my collections of references.

Yes, a very large and ‘diffuse’ PDF can show up on suggested lists with a higher ranking than it should have. I’m amused that my automobile owner’s manual can show up as ‘similar’ to many topics, and I simply ignore it.

If I ever get the time, I may experiment with several of my large PDFs by segmenting them at the chapter level, which should be quite sufficient for my needs. I wouldn’t segment them down to the 500-word size; I think that would drive me nuts, as it would disrupt continuities of thought and “concept design” in those longer documents.

If I’m looking at a 500-page PDF and want to look for related references, I usually don’t use “See Also”. Instead, I’ll select a portion of the PDF’s text – perhaps a section or a chapter – and use the contextual menu option, “See Selected Text”. That performs the same way as “See Also” but on a portion of the PDF.

I subscribe to one general rule for topics like this: Design and use your own databases so that they most efficiently meet your own needs. Johnson’s philosophy and practices worked well for him with his database of short quotations and excerpts. My philosophy and practices work well for me with a very different content of references of interest to me. So DT Pro isn’t a “one trick pony”; it has a considerable range of flexibility.

I tried to split a large PDF into single files with the PDFpen Pro demo. There is a neat AppleScript included in the app, but I found the result quite uncomfortable. If you import the pages into a folder and double-click a page, you can click through to the next page while reading, but this is really annoying because:
- the zoom level is different when you switch documents
- page breaks interrupt the text
- loading a new PDF page takes time

I would prefer cutting out bigger chunks of pages, and only if the PDF exceeds a certain size. It doesn’t make sense, for example, to import a whole e-book if you only want three pages out of 400.

Because I rarely have to import big PDFs, I am looking for free and easy solutions / workarounds:
- Print/preview the PDF with a defined page range, then print/save to DEVONthink (if you HAVE a page range in mind)
- Another option: the freeware app “Combine PDFs” from Monkeybread Software:

monkeybreadsoftware.de/Freew … PDFs.shtml

Despite its name, it can also rotate or split PDF files. Just drag and drop the PDF file into the app window, double-click individual pages to see what they are, select pages (as in the Finder), and delete pages (Backspace).

(By the way, as featured on Lifehacker, some PDFs are password-protected. If you open them with ColorSync and save them again, the protection is gone, so you should not do this :unamused: )


PDFLab is a full-featured and free program that lets you manipulate PDFs more easily, IMO.

thank you for the tip: the program is really useful!

AND it has a better UI, which in fact can be a reason to switch (if the functions are the same).


I think one of the nice things about splitting it into small parts is that the “See Also” would essentially return small, relevant quotations, instead of the full document. Sure, it’s great to do “See Selected Text” and find full documents that are similar, but when a document is 500 pages, you don’t necessarily know why it’s similar.

Jay P.

I can confirm that for my RTF documents. I often work in the library with books I cannot borrow, so I write short abstracts with some quotes, which I add to the database. At first, this resulted in big RTF files, which were quite uncomfortable to read and scroll through when they turned up as “See Also” results.
Since I started putting each quote (with my comments) into an individual file (in a group named after the book), the “See Also” results have become a lot more accurate - and it saves scrolling through the document. Consistent file naming can save some time later if the files are to appear in their original order.

With individual files, “See Also” also becomes more interesting, because often I have only one relevant quote in a book, and that quote ranks higher in the See Also list than the complete file would.


Since I upgraded to DTPO and got a ScanSnap, my total number of PDFs keeps increasing. And with more and longer PDF files in the database, the results for See Also tend to become blurry, since “See Also” does not tell me where in the document I should look. So I split files (PDFLab works great…), and then the reading becomes unpleasant - and if I re-combine the files for sub-sections, I lose too much time (so I don’t).

A new idea today: keep both versions in the database (in sub-folders) - the one large PDF file plus all the smaller files. This improves my results, but the large-PDF problem is still there.

So I wonder if it is possible to exclude the large PDF file from the index completely. I know that I can check “Exclude from classification” in the Info drawer, but the large PDFs still appear in See Also results. And they are still searchable.

So what can I do?
The help file is a bit unclear about it - how does “Exclude from classification” work? Is there any other way to exclude a PDF from the index?

Thank you,

This is a very interesting conversation. I have wondered about Maak’s question as well for excluding certain documents from the search.

I had another idea as I was reading through this thread: wouldn’t it be great if you could mark “sections” within a PDF (well, any text file in DT)? The section markers would tell DT to treat each section as a document unto itself. The user could then set markers as desired while reading the PDF.

Mind you, I have no idea what this would entail. But I agree there is a trade-off between the tediousness of having to break up your PDFs on the one hand, and the discontinuity of a document when you decide to do that on the other.

Bill has indicated elsewhere (if I remember correctly) that the DT top dogs are hoping to improve the handling of PDFs in upcoming releases. If the ability to ‘add’ info to PDFs becomes available, perhaps this type of “sectioning” may become possible.

Meanwhile, my workaround so far is to keep the long picture-only PDF (the one that ScanSnap exports by default) for reading, and to split the converted PDF into multiple files, deleting the long converted PDF afterwards - one document with pictures, multiple files with text.
Downside: this involves a lot of switching between apps and, even worse, time. A button to exclude something from the index would be better. (Other ideas? Scripting magic?)


In my experience, the problem isn’t having a large PDF appear in the results; it’s not having a smaller, more focused one appear. I experimented once with automated systems for splitting large PDFs and didn’t get much success. At the moment my partial solution is to pull out obvious sections that are of use to me into new PDFs, and of course import those into DT, and leave the larger PDF intact, on the assumption that there may be something else of use in there, too. With most search results, I simply ignore the large PDF if there’s a more useful, smaller PDF already extracted from it.
What has saved me a lot of grief is my decision to give up the idea of splitting up all my PDFs in one mega-session: I just split up the ones that will obviously have some lasting benefit as time and inclination allow.