Confusion over searchable PDFs

Hi,

I have a number of PDF documents that are the result of scans that I’d like to OCR to make them searchable.

To get a feel for how many documents there are, I created a smart group using the Data>New From Template>Smart Groups>PDFs (not searchable) menu item.

The smart group lists a good number of documents, some images, some scans and some documents that have a kind value of “PDF Document” but are definitely searchable and have been indexed.

I’m curious as to how a document that doesn’t have a kind of “PDF+Text” is both searchable and has been indexed. I know very little about the details of PDF documents however, so if someone could explain how this happens, it’d be appreciated.

An indexed document does not necessarily contain searchable text. Are you able to select text in such a document and copy it to other apps/documents?

Yes, I can select text in the PDF and paste it into a text editor.

Welcome @Sazhen86

There are many PDFs in many different flavours in the world. (It’s quite an old format even though it has evolved over the years.) Some good; some not so good. Some broadly accessible to PDF apps; some with proprietary code that closes out other apps from using the data.

  • Where did the PDF originate?
  • How large is it and how many pages?

Hi @BLUEFROG

The PDF came from here https://cliutils.gitlab.io/modern-cmake/modern-cmake.pdf and has 80 pages and is around 940KiB.

It seems to have been created with calibre. Some of the other documents were created with MS Word 2007.

I have a lot to learn about PDF it seems.

It seems to have been created with calibre. Some of the other documents were created with MS Word 2007.

It’s not unheard of for PDFs coming from Calibre and Chrome to be problematic with Apple’s PDFKit, the framework we use in DEVONthink.

Try dragging the file out of DEVONthink and dragging it back in. Does the file still report as a PDF Document instead of PDF+Text?
Here it comes through as PDF+Text.

I have a lot to learn about PDF it seems.

Not really. I’m just old friends (and sometimes enemies :stuck_out_tongue: ) with the format, having come from many years in graphic arts and printing (which they were originally made for).

Many PDFs you’ll encounter will be fine. You just have to be aware there could be issues occasionally.

Hi @BLUEFROG

I did as you suggested and dragged the file out of DEVONthink and back in again, and now it shows as PDF+Text. I’ll try it with some of the other documents, thanks.

Any idea why that would happen? Isn’t the file I dragged out the same one I imported before? Maybe newer versions of DT handle it better?

Version 3.8 is current. You running this? And as @bluefrog pointed out, there are many variations of PDFs out there.

Yes, I’m using version 3.8. I took the bold move of rebuilding the database (after taking a backup) as I figured it would re-import and re-index everything. It took a while but my patience was rewarded. The documents in question are now identified as PDF+Text and everything seems to be back where it was before. My PDFs (not searchable) smart group also has around 700 fewer entries, so I’m really happy.

Do you remember how you added this PDF initially?

It was a while ago, but I suspect I just dragged it from Finder into a group in DEVONthink. That’s how I import pretty much all of the documents.

Thank you! And did you already use DEVONthink 3.x? Or an older version?

Now that is a good question. The modified date of that file is 8 June 2019 and I bought the upgrade to version 3.x on 22 November 2019, so it would be version 2.x when I imported that file.

That’s good to know. Please let us know if the same issue should occur using version 3.x - thank you!

I’ll be sure to let you know if it happens again, but everything seems happy now. Thank you for your support, it’s appreciated. And thanks for producing DEVONthink, it’s a one of a kind product.

Happy databases are what we like to see :heart: :slight_smile: