File Size?

Another question as I get accustomed to DTPO:

Before importing an image-only PDF into DTPO, the size of the PDF might be something like 300KB. After importing the PDF into DTPO, the file size (showing inside DTPO when the file is selected) is a much smaller size – something like 30KB. Has the file actually been reduced that much (which would be good), or is it perhaps just indicating the size of the thumbnail?

I’d like the much smaller size, but it doesn’t make sense that a color document of several pages would be reduced that much. It would be helpful to know what’s going on in this process.

Thank you.

The document’s Info panel displays two sizes for a PDF. The leftmost one is the memory the document takes up in the database. The rightmost one (File) is the storage space the file occupies on your hard drive.

In many cases you will see the File size increase after OCR, because the PDF has been re-rasterized and the IRIS engine doesn’t necessarily use the most efficient procedures for rasterization of the image layer of the PDF.

The file info I’m looking at is left-aligned and positioned just below DTPO’s main menu bar, and it only shows the following info:

FileName.pdf ( PDF+Text, Size 40KB, Modified: 1/26/08 )

Where is the “info panel”?

Yes, I was expecting the files to get larger (not significantly smaller). I want to establish practical procedures from the outset so as not to unnecessarily bloat my database.

As far as overall database size, is there a recommended maximum size? Or is that based on the computer’s memory and hard drive capacities?

Currently, the OCR Image Resolution is set at 300dpi, and the Image Quality is set at 50%. However, maybe 300dpi is a higher resolution than I need to use. I suppose, if I’m usually only going to view content onscreen, I should consider downsizing most things using the “Compress PDF to 96dpi workflow”? Or is there a better route to consider?

So I’m still a little fuzzy on how to determine the best resolution, and likewise how to gauge whether my files are larger than they need to be.

The Info panel is available from the Toolbar and by use of the command Shift-Command-I.

Scan at a resolution of 300 dpi or higher if you wish to use OCR, as accuracy declines below 300 dpi.

When DT Pro Office runs OCR on a PDF, the PDF is rasterized again. You can select the resolution of the PDF stored in the database after OCR in Preferences > OCR. The default is 150 dpi, a compromise between image sharpness and file size.
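
To get a feel for why the stored resolution matters so much for file size, here is a rough back-of-the-envelope sketch in Python. It is not anything DT Pro Office does internally. It assumes a US Letter page, 24-bit color, and no compression, and the function names are made up for illustration. Real PDFs are compressed, so actual files will be far smaller, but the pixel count still scales with the square of the dpi.

```python
# Rough estimate of the uncompressed raster size of one page at several
# resolutions. Assumptions (not from DT Pro Office): US Letter page,
# 24-bit color, no compression.

PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0
BYTES_PER_PIXEL = 3  # 24-bit RGB


def raw_page_bytes(dpi: int) -> int:
    """Return the uncompressed size in bytes of one page rasterized at `dpi`."""
    width_px = round(PAGE_WIDTH_IN * dpi)
    height_px = round(PAGE_HEIGHT_IN * dpi)
    return width_px * height_px * BYTES_PER_PIXEL


def size_table(resolutions=(300, 150, 96)):
    """Return (dpi, raw bytes, fraction of the first resolution) for each dpi."""
    base = raw_page_bytes(resolutions[0])
    return [(dpi, raw_page_bytes(dpi), raw_page_bytes(dpi) / base)
            for dpi in resolutions]


if __name__ == "__main__":
    for dpi, nbytes, fraction in size_table():
        print(f"{dpi:>3} dpi: {nbytes / 1_000_000:6.1f} MB raw "
              f"({fraction:.0%} of the 300 dpi size)")
```

Halving the resolution from 300 to 150 dpi cuts the pixel data to a quarter, which is why a lower stored resolution shrinks the file so noticeably even before compression is taken into account.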

The file size of a PDF is not especially relevant to database size. More relevant is the memory required by the PDF in the database, which is usually much smaller than file size.

The amount of physical RAM is the primary limiting factor for database size (the Finder size of the database isn’t especially relevant). Although much larger databases can be run, once all of the physical RAM has been used the Mac begins to use Virtual Memory, which is much slower than RAM because it involves swapping data back and forth between RAM and the hard drive. On my MacBook Pro with 2 GB RAM my databases remain responsive up to a total word content of about 22 to 24 million words. Above that, the database begins to use Virtual Memory more frequently. OWC is promising that I’ll have a new ModBook with 4 GB RAM next week. On that computer I could run considerably larger databases while maintaining quick responsiveness, as does my Power Mac G5 with 5 GB RAM.

Alright, found the Info Panel! Thanks for your explanation regarding which number represents memory and which represents physical space on the hard drive. That explains that the very small number that was showing below the main menu bar refers to the memory usage. Thanks also for explaining how important memory is versus file size.

Now, armed with your additional info regarding resolution, I’ll refine my procedures and continue exploring DTPO!

Thanks, again!