Newbie - trying to understand basic PDF db for [EarthClass M

kaliban · September 11, 2008, 8:11pm

Hello:

I’m taking a shot at my first DTP DB Setup. It’s ultra simple. I use Earthclass mail for most of my mail, and have most of it scanned into PDFs.

Since the filenames are just a sequential numerical index such as xxxxxxxxxxxxxx.pdf, it’s not quick or easy to organize the files in folders in any really useful way. It seems like DT would be a great solution…

I’m thinking I can either index or import the latest PDF files each day, and then automatically find things such as all invoices from Chase, all bank statements from BofA, all mail addressed to me vs. my wife, and who knows what else.

Problem: I don’t understand how to make the PDFs searchable.

I’ve created two sample databases. For one, I indexed the folder containing my 99 PDFs; for the other, I imported the same folder. My preferences are set to the defaults.

When I open either database and type a simple query such as the name of my credit union or my first name, DTP returns ‘no items found’. Not very useful.

When I open the See Also drawer, select an invoice from Chase, and scroll through the See Also list, like magic all of the Chase invoices are there in a row. Now that is awesome!

What I don’t understand is this: how can DT recognize that a group of files are all Chase invoices, yet when I search for the word ‘Chase’ or my first name, which is typed neatly on each invoice, I find nothing at all?

Would someone kindly explain this to me, or point me at the information?

Thanks!

Kenny

Bill_DeVille · September 11, 2008, 8:52pm

Kenny, your scanned PDFs are only images of your paper documents. As such, they do not contain any searchable text.

OCR (Optical Character Recognition) “looks at” the images of text and produces a searchable text layer in the PDF file. That text layer can then be indexed and searched by DEVONthink.

DEVONthink Pro Office has a built-in OCR engine, so that it can either receive and OCR PDF images as they are created by a scanner, or OCR already existing PDFs in the database and convert them to searchable PDFs.
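A quick way to see the difference Bill describes is to try extracting text from a PDF: an image-only scan yields little or nothing, while a PDF with an OCR text layer yields the words on the page. The following is only a rough sketch, not how DEVONthink works internally. The function names and the 20-character threshold are made up for illustration, and the file-reading helper assumes the third-party pypdf package is installed.

```python
from typing import Iterable, List


def pages_missing_text(page_texts: Iterable[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is (nearly) empty.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Ignore whitespace so a page of blank lines counts as empty.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            missing.append(number)
    return missing


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text of each page using pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # imported lazily; only needed for real files

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def needs_ocr(pdf_path: str) -> bool:
    """Guess that a PDF is image-only if every page lacks extractable text."""
    texts = extract_page_texts(pdf_path)
    return len(pages_missing_text(texts)) == len(texts)
```

Calling `needs_ocr` on one of the scanned files (the path is whatever your file is named) gives a hint as to whether a search for "Chase" could ever match it. If it returns True, the file most likely has no text layer and needs OCR first.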

kaliban · September 11, 2008, 9:41pm

Oh, thank you so much, Bill. So do I understand you correctly that if I use DT Pro Office, the search functionality I’m expecting will happen automatically for any PDFs I import or index?

Thanks,

Kenny

Bill_DeVille · September 12, 2008, 12:48am

Not exactly. DT Pro Office will let you import image-only PDFs without OCR. But it also has an option to import image-only PDFs with OCR (File > Import > Images (with OCR)). With many scanners, the output of a scan of a paper copy can be sent directly to DT Pro Office for OCR and storage as a searchable PDF.

There’s also an option to convert selected image-only PDFs that already exist in a database to searchable PDFs (Data > Convert > to Searchable PDF).

I don’t know the resolution of your scanned images, nor the resolution at which the EarthClass images are created. We recommend a resolution of 300 dpi for good OCR accuracy.
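To check a scan against the 300 dpi recommendation, the arithmetic is simple: divide the image's pixel width by the page width in inches. The sketch below assumes you already know the pixel width of the scanned image and the page width in PDF points (72 points per inch). The function names are illustrative only and are not part of DEVONthink.

```python
POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points


def effective_dpi(pixel_width: int, page_width_points: float) -> float:
    """Resolution of a scanned page image, in dots per inch."""
    if page_width_points <= 0:
        raise ValueError("page width must be positive")
    page_width_inches = page_width_points / POINTS_PER_INCH
    return pixel_width / page_width_inches


def pixels_needed(page_inches: float, dpi: int = 300) -> int:
    """Pixels required along one side of the page to reach the given dpi."""
    return round(page_inches * dpi)


# A US Letter page is 8.5 x 11 inches, i.e. 612 x 792 points.
print(effective_dpi(2550, 612))              # 300.0
print(pixels_needed(8.5), pixels_needed(11))  # 2550 3300
```

So a letter-size scan needs to be about 2550 x 3300 pixels to reach 300 dpi. A scan of the same page that is only 1275 pixels wide works out to 150 dpi, which is below the recommended resolution.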