PDF eBooks - intended usage?

I have imported a couple of books, 2-10MB, 200-600 pages. When I select one of them in the document list, I get the rainbow spinner for anything from 5 to 30 seconds, making it quite unusable for these.

Is DT intended to handle book-size PDFs? If it is, are there any suggestions for a workaround for now, or any word on what to expect in subsequent releases?

Thanks.

EDIT: Did some more tests. This delay shows up when selecting a big document from search results, not from just a browse list. This makes me wonder if DT’s approach to search results highlighting might be the underlying cause. Is DT computing highlights for the entire document, rather than just the 1 (or few) pages that are visible?
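To make the question concrete, here is a minimal Python sketch of the two approaches. It is not DT's code, which I have not seen; the function names and the list-of-page-strings model are assumptions made purely for illustration. The eager version scans every page before anything can be shown, while the lazy version scans only the pages currently on screen.

```python
import re

def find_matches(page_text, query):
    """Return (start, end) character spans of query on one page, ignoring case."""
    pattern = re.escape(query)
    return [m.span() for m in re.finditer(pattern, page_text, re.IGNORECASE)]

def highlight_all(pages, query):
    """Eager (hypothetical): compute highlights for every page up front."""
    return {i: find_matches(text, query) for i, text in enumerate(pages)}

def highlight_visible(pages, query, visible):
    """Lazy (hypothetical): compute highlights only for the visible page indexes."""
    return {i: find_matches(pages[i], query) for i in visible}

# Toy three-page "book"; a real book would have hundreds of pages.
pages = [
    "intro text",
    "nothing here",
    "the Headland appears here, headland again",
]

eager = highlight_all(pages, "headland")           # touches all pages
lazy = highlight_visible(pages, "headland", [2])   # touches one page
```

With a 600-page book, the eager version does roughly 600 times the work of the lazy one before the first page can be drawn, which would be consistent with a delay that grows with document size.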

I’ve got more than 50 PDFs in the 200 to 600+ page range in my main DT Pro Office 2.0 database. So long as I’ve got free RAM available, they display right away, no spinning ball.

Some PDFs contain compressed images, which must be decompressed before they can be displayed. Try opening such a PDF in Preview, then saving it again. That would increase the stored file size, but should make the re-saved PDF open faster.

EDIT: Yes, I did a search for an exact text string that appeared on page 391 of a 477 page PDF. No spinning ball, but it took about half a second to scroll down to and display page 391, with the query string highlighted. That was the only occurrence of that string in the document.

My machine has 4 GB RAM, 2 GB free, tons of disk, and nothing else consuming CPU cycles. The delay shows up from the search list; can you check whether your large books show in the preview window instantly from a Search?

Outside DT, my books preview instantly in cover-flow, and open instantly in Preview and PDF. I just did Preview open/save-As, and it made no difference.

Thanks

Hmm, are you starting with the target book already selected? I wonder if we may be doing slightly different tests. Here is what I do:

  • Select main database document folder
    • see a list of many documents, including the target book
    • make sure target book is not selected
  • Enter search string from within target book
  • Does the target book preview instantly (or, if it was not at the top of the result list, when you select it from the search results)? I get a long delay here, repeatably.

Thanks.

No, I followed your procedure. I’ve got several large PDFs about Louisiana coastal restoration studies. I used this search query: ‘Caminada Headland and Shell Island Reaches’.

If I enclose that text string within quotation marks, I get three hits in my database. I selected each hit, and the PDFs were opened, scrolled to the first occurrence, in no more than two seconds (no spinning ball). Two of the hits contained that exact string only once, one had three occurrences. I used the keyboard shortcut Control-Command-Left Arrow to advance to the next occurrence.

All three search results contained more than 200 pages. The largest was 505 pages in length.

I repeated the search with the quotation marks removed, so that the query looked for documents that simply contained all of the words in the string. I got two additional documents in the results list, neither of which contained the exact query string. Of course, many more words were highlighted. But again, each document opened and displayed the first term found within 2 seconds or less. No spinning ball.
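The difference between the two queries can be sketched in a few lines of Python. This is only an illustration of quoted (exact phrase) versus unquoted (all words) matching, not a description of how DT implements its search; the helper names and the two sample sentences are made up for the example.

```python
import re

def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_phrase(doc, query):
    """Quoted query: the query words must occur consecutively and in order."""
    d, q = words(doc), words(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def matches_all_words(doc, query):
    """Unquoted query: every query word must occur somewhere in the document."""
    return set(words(query)) <= set(words(doc))

query = "Caminada Headland and Shell Island Reaches"

# Hypothetical document snippets, invented for this example.
doc_a = "Restoration of the Caminada Headland and Shell Island Reaches"
doc_b = "Shell Island and the Caminada Headland have separate reaches"

phrase_hits = [matches_phrase(d, query) for d in (doc_a, doc_b)]
word_hits = [matches_all_words(d, query) for d in (doc_a, doc_b)]
```

Here `doc_b` contains every word of the query but not the exact phrase, so it is found only by the unquoted search. That is the same pattern as the two additional documents in my results list.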

I’m running on a ModBook (a custom Mac tablet based on a MacBook) with 2.2 GHz CPU and 4 GB RAM. After the searches, I had 2,035.6 MB free physical RAM (and 375.9 MB inactive RAM). In addition to DTPO2 and the Finder, I had DEVONagent, Mail, iCal and ScanSnap Manager open. I had two databases open, one with 2.4 GB, the second with 120.22 MB size. The found PDFs were all in the larger database.

I’m running a pre-release beta of DTPO2.0, pb4 build 4. But I’m not aware of any tweaking Christian may have made for display of PDFs from a Search results list.

Kay provided me the PDF that takes a very long time to display when selected from a Search list. I confirmed the problem. Seems to take forever, with a spinning ball.

Interesting point: When selected from, e.g., Three Panes view, the command Data > Convert > to Rich Text results in a Log message, ‘conversion failed’.

I was able to capture text when this PDF was opened in Preview. Portions of the text were run-together, without spaces between words.

This is a version 1.6 PDF, produced by a Windows version of Acrobat Distiller. Seems there are compatibility problems with OS X PDFKit.

Users in another thread noticed improvements in certain cases after saving the file again from Preview and reimporting it. Have you tried that?

I tried that; it made no difference. Bill tried it as well, with the same result.

I’ve played with that PDF for hours. The only way it can be properly set up for searching (without lots of run-together text) is to re-OCR it.

When ABBYY rasterized the image layer during OCR, my suspicions were confirmed. This is a PDF specially coded for a print shop. Note the page size/layout markers that show up in the image below, after OCR.

In other words, this isn’t a “standard” PDF. Sure, one can read the image layer. But I wouldn’t have released it to the general public, as anyone who wishes to do text layer searches will run into problems.

[Attached image: Page size-layout markers.jpg]

That PDF is an extreme case among the ones I have, but I’m not sure that DT’s difficulty with it means all searches will have problems. For example, I made a new copy of the PDF, opened it in Preview and did a search. Preview put up an “Indexing” progress bar, which took 30+ seconds; after that, searches in Preview were instant. A Spotlight search (I tried a simple unique string from the end of the book) is also instant. It finds the copy of the PDF that I have outside DT, but it does not find the copy I have inside DT.

I think the SOA file is an extreme case, but several of my other PDF books (without publishing crop marks) have delays of 2-6 seconds in DT; if I follow the same sequence as above, Spotlight finds them instantly. So Spotlight is apparently able to index and search them without any trouble. If 2-6 seconds is normal for DT, I can probably live with it. If it isn’t, I’ll be glad to provide the files for testing.

I’m not sure how to interpret this, particularly the differences between Spotlight and DT. Maybe someone from DT can make use of this info.

The highlighting is causing the delay (not searching), therefore please send some examples to cgrunenberg - at - devon-technologies.com and I’ll check this.

Just emailed you a sample file, let me know when you get it.

Thanks.