I have a number of scanned PDF documents that I’d like to OCR to make them searchable.
To get a feel for how many documents there are, I created a smart group using the Data>New From Template>Smart Groups>PDFs (not searchable) menu item.
The smart group lists a good number of documents, some images, some scans and some documents that have a kind value of “PDF Document” but are definitely searchable and have been indexed.
I’m curious as to how a document that doesn’t have a kind of “PDF+Text” is both searchable and has been indexed. I know very little about the details of PDF documents, however, so if someone could explain how this happens, it’d be appreciated.
There are many PDFs in many different flavours in the world. (It’s quite an old format even though it has evolved over the years.) Some good; some not so good. Some broadly accessible to PDF apps; some with proprietary code that closes out other apps from using the data.
It seems to have been created with calibre. Some of the other documents were created with MS Word 2007.
It’s not unheard of for PDFs coming from Calibre and Chrome to be problematic with Apple’s PDFKit, the framework we use in DEVONthink.
Try dragging the file out of DEVONthink and dragging it back in. Does the file still report as a PDF Document instead of PDF+Text?
Here it comes through as PDF+Text.
I have a lot to learn about PDF it seems.
Not really. I’m just old friends (and sometimes enemies) with the format, having come from many years in graphic arts and printing (which it was originally made for).
Many PDFs you’ll encounter will be fine. You just have to be aware there could be issues occasionally.
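If you want a rough spot check outside DEVONthink of whether a file carries any text at all, here is a minimal Python sketch. It is only a heuristic and not how DEVONthink or PDFKit decide the kind: it assumes that a PDF with a text layer references at least one `/Font` resource, while an image-only scan usually does not. The function name `may_have_text_layer` is made up for this example, and only Flate-compressed streams are looked into.

```python
import re
import zlib

# Matches the raw body of each PDF stream object (non-greedy, across lines).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def may_have_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if a /Font reference is found anywhere in the file.

    Assumption: text in a PDF needs a font resource, so a file with no
    /Font reference is probably an image-only scan. A True result does
    not prove the text is usable, only that fonts are referenced.
    """
    # Uncompressed dictionaries can be searched directly.
    if b"/Font" in pdf_bytes:
        return True

    # Newer PDFs may keep dictionaries inside Flate-compressed streams,
    # so try to inflate each stream and search the result as well.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # Not Flate data (e.g. a JPEG image); skip it.
        if b"/Font" in data:
            return True
    return False
```

To try it on a real file, read the file in binary mode and pass the bytes, for example `may_have_text_layer(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name. A file that returns False is a likely OCR candidate; a file that returns True but still shows up as “PDF Document” is more likely one of the occasional problem cases.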
I did as you suggested and dragged the file out of DEVONthink and back in again, and now it shows as PDF+Text. I’ll try it with some of the other documents, thanks.
Any idea why that would happen? Isn’t the file I dragged out the same one I imported before? Maybe newer versions of DT handle it better?
Yes, I’m using version 3.8. I took the bold move of rebuilding the database (after taking a backup) as I figured it would re-import and re-index everything. It took a while but my patience was rewarded. The documents in question are now identified as PDF+Text and everything seems to be back where it was before. My PDFs (not searchable) smart group also has around 700 fewer entries, so I’m really happy.
Now that is a good question. The modified date of that file is 8 June 2019 and I bought the upgrade to version 3.x on 22 November 2019, so it would have been version 2.x when I imported that file.
I’ll be sure to let you know if it happens again, but everything seems happy now. Thank you for your support, it’s appreciated. And thanks for producing DEVONthink, it’s a one-of-a-kind product.