Importing PDF's

Smitty · May 11, 2015, 9:58pm

Good Afternoon! I am new to DT and have a couple of questions.

I imported about 90 PDFs by simply dragging them in bulk onto the DT icon. I was unsure how to have the PDFs run through OCR on import so that I could search them. So once they were imported, I selected them all and then went to Data > Convert > to Searchable PDF.

What I ended up with was two copies of each PDF in DT. They both say “PDF + Text.” However, one has yesterday's date, and the other has the original date the PDF was placed on my computer (I assume this is the original, at any rate). The file sizes are different. For example, I have one PDF + Text that is 253.7 KB (which I assume is the original, based on the date stamp), and another PDF + Text that is 4.4 MB with yesterday's date. The two PDFs also have different word counts, and the difference is not consistent. For example, for the PDFs referenced above, the original has 8639 words and the one with yesterday's date has 8650. However, with another PDF it's reversed: the original has 9082 words and the copy with yesterday's date has 8743 words.

I’m confused as to why the word counts differ at all, and why the difference is inconsistent between the original PDFs and the copies dated yesterday. I also don’t need both copies, and I’m unclear why this occurred and which one I should keep. Could you advise what I did wrong? Ironically, when I ask DT to check whether these are duplicates, it doesn’t treat the doubled PDFs that way, though clearly they are, except that one is much larger in size and has a different word count.

If the import only says PDF, is it fair to assume it has not been OCR’d? I tried selecting Convert > to Searchable PDF, but the “Kind” still showed only PDF. Can you advise?

What is the best way to import existing PDFs from my HD into DT and have them automatically made searchable (using OCR) on import?

What are the pros and cons of having one massive DT database with groups for different things (e.g., journal articles, GTD reference, personal finances such as tax returns and receipts) versus multiple databases (e.g., one for research/journal articles, a separate one for GTD reference material, a separate one for personal finances)?

Finally, if I import an entire folder of documents, does the folder become a group by default or do I need to select the folder and tell DT it is a group?

In advance, many thanks for your help!

Smitty

Cassady · May 12, 2015, 9:19am

Hi Smitty,

And welcome!

I’m sure someone far more knowledgeable might be able to give you advice about the possible issues/options with the OCR process. I imported the bulk of my PDFs after PDFPen had already done the conversion process. Any new PDFs are mostly already OCR’ed (coming from academic databases), and those few that aren’t, I do manually. If you want to do bulk processing, that will require a different approach, I would assume, and hopefully someone who has done that will wade in.

The differing word count would be what was picked up in the OCR process. Were two PDFs originally imported, or are you suggesting DTPO “created” two copies? What settings have you selected under the OCR tab in the Preferences menu?

Re DTPO not seeing them as duplicates: all I can think is that to the ‘naked eye’ they appear to be duplicates, but since DTPO examines the content of the PDF, which would include the OCR text, if their word counts are different, then the files are different and would not be picked up as ‘true’ duplicates, if that makes sense.
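
To illustrate the idea of content-based comparison, here is a minimal Python sketch. It is not DEVONthink's actual duplicate detection (that is not described in this thread); the function names and sample strings are made up for illustration, and the only assumption is that two documents count as duplicates when their normalized text is identical.

```python
import hashlib
import re


def word_count(text: str) -> int:
    """Count runs of word characters, a rough stand-in for a word count."""
    return len(re.findall(r"\w+", text))


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_like_duplicate(text_a: str, text_b: str) -> bool:
    """Hypothetical rule: duplicates only if the normalized text matches."""
    return content_fingerprint(text_a) == content_fingerprint(text_b)


# Made-up sample text: the second string imitates typical OCR slips.
original_text = "The quick brown fox jumps over the lazy dog"
reocr_text = "The quick brown fox jumps over the 1azy d og"

if __name__ == "__main__":
    print(word_count(original_text), word_count(reocr_text))
    print(looks_like_duplicate(original_text, reocr_text))
```

Under a rule like this, even a couple of OCR slips change both the word count and the fingerprint, so two PDFs that look identical on screen would not register as duplicates.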

Re your other queries – most of them you will find discussed at length within the forums. Have a look around – plenty of information is available if you search.

See here, for instance, for a very recent addition to a well-established topic on the very point of the pros and cons of splitting databases: “Database - why use more than one?”

Good luck!

Bill_DeVille · May 13, 2015, 3:24am

Don’t assume that all PDFs imported to a database require OCR to make them searchable. On the contrary, very nearly 100% of the PDFs that I acquire from the Web or as attachments in email are searchable already.

Subjecting an already searchable PDF to OCR isn’t merely a waste of the computer’s resources and the user’s time. It may introduce new errors in the text content of the PDF and reduce the view/print quality of the page images.

One way to see whether or not a PDF is searchable is to add the Kind sort column to a DEVONthink view window. In the menubar choose View > Columns > Kind.

For image-only, non-searchable PDFs, Kind is PDF.

For searchable PDFs, Kind is PDF+Text.

JMichaelTX · May 13, 2015, 3:40am

It is also easy enough to determine whether a PDF is searchable by opening it PRIOR to import into DTP and doing a Find on text that is clearly visible in the PDF.

Smitty · May 28, 2015, 5:14pm

Thank you to everyone for your replies!! I really appreciate it.

Smitty · May 29, 2015, 12:04am

Hi Cassady. I’m not sure what I did other than selecting a folder, left-clicking, and scrolling down to Services > Add to DEVONthink Pro Office.

Since I was learning/experimenting, I deleted it and then imported again. The problem appears to have resolved itself. Thanks again.

Smitty · May 29, 2015, 12:07am

Hello Bill…Ahhh…I understand. Thank you for your input!