Do I need to OCR PDFs and images to make them searchable?

CoachKidd · September 8, 2023, 3:27pm

Hi folks …

I’ve only just discovered DT after a recommendation by a friend and I’m easing my way into the app. So please forgive my ignorance. It will pass …

I’ve migrated my data (documents, images and PDFs) from Google Drive to Dropbox, and have a database that has indexed that content. When I used GDrive to search my content with keywords, I would get results which included images and PDFs, so I’m guessing Google performed some OCR on them.

I’m not getting the same results from DT and so I’m wondering if I have to manually OCR my content myself.

I have the Standard edition. Do I need to upgrade to Pro? If so, I’m happy to do this; I’d not done so only because I understood the Pro features were focused on people wanting to scan documents (which I don’t).

Thanks in advance for any guidance here.

rmschne · September 8, 2023, 4:00pm

OCR is a feature of DEVONthink Pro. See DEVONtechnologies | DEVONthink Editions

If the incoming files already have OCR text layers in them, they should be searchable (but I cannot test this as I have the Pro edition).

Any reason you are indexing rather than importing? Indexing is a bit more complicated, and unless you have a reason to do it that way I suggest you import files rather than index them. There is a difference. The DEVONthink Manual covers the issues related to indexing. Please read it.

Also, as a new user you should, if you have not already, have a read of “Take Control of DEVONthink”, available free from DEVONtechnologies.

See their web site for both documents.

MsLogica · September 8, 2023, 4:56pm

I’m just going to go back a step here. A lot of search functions are looking for a text layer and then scanning that for your search term. Images don’t have a text layer as standard (they are… an image!), and search can’t “read” them like a human can.

If we go forward a step, GDrive (and a couple of other apps, e.g. Apple Notes) create their own “text index” (not the correct term I think, but it is distinct from an embedded text layer) to run their image searches, but they don’t alter the original image file (as far as I’m aware). It is essentially a sneaky bit of their own processing and not a reflection of the “content” of the file itself.

So to circle back to your question, without knowing anything about how you created your images or where they came from, it seems unlikely they have a searchable text layer, and you will probably want to process them to add one if you need to search in them. It’s easy enough to gather them up: set up a smart group that finds files with fewer than X characters. It’s handy to keep this smart group, because occasionally you may come across a file that’s not been OCR’d properly, and this smart group would find it.

I’ve not discussed PDFs yet because it depends on their source. Most modern PDFs that are created from text-editing software do have a text layer embedded already, so you shouldn’t need to do anything. But I don’t know the source of your files - perhaps you have old PDFs, or PDFs that are essentially images of old texts. They might not have a text layer, in which case DT Pro can create one for you. The smart group mentioned above will pick up all files that are missing a text layer (unless you exclude specific file types from the criteria).
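The smart-group idea above (find anything with fewer than X characters of text, optionally excluding some file types) can be sketched outside the app. This is only an illustration of the filtering logic, not DEVONthink's API: the `Record` class, the `needs_ocr` and `ocr_candidates` functions, the choice to ignore whitespace when counting, and the default threshold of 200 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a database item; not a DEVONthink API."""
    name: str
    kind: str   # e.g. "pdf", "image", "markdown"
    text: str   # whatever searchable text the item currently exposes


def needs_ocr(record, min_chars=200, excluded_kinds=()):
    """Return True if the record has too little text to be searchable."""
    if record.kind in excluded_kinds:
        return False
    # Count only non-whitespace characters (an assumption of this sketch),
    # so a page full of spaces still counts as blank.
    return len("".join(record.text.split())) < min_chars


def ocr_candidates(records, min_chars=200, excluded_kinds=()):
    """Mimic a smart group: names of all records under the threshold."""
    return [r.name for r in records if needs_ocr(r, min_chars, excluded_kinds)]


# Made-up sample items for illustration.
records = [
    Record("scan-1998.pdf", "pdf", ""),
    Record("report.pdf", "pdf", "word " * 100),
    Record("whiteboard.jpg", "image", ""),
    Record("todo.md", "markdown", "buy milk"),
]

print(ocr_candidates(records))
print(ocr_candidates(records, excluded_kinds=("markdown",)))
```

Note that a short text note such as `todo.md` is also caught unless its type is excluded, which is why excluding specific file types from the criteria can be useful.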

Any OCR you do in DT Pro creates a new file that has the text layer. This means if you move the file in future to another app, the text layer will still exist.

CoachKidd · September 8, 2023, 5:11pm

Thanks, that’s a really helpful start for me. Clearly I need to RTFMs, but this does create a significant amount of friction to engaging with DT and being (somewhat) productive.

You have given me a few pointers to consider as I wade my way through the documentation. Thanks.

MsLogica · September 8, 2023, 5:15pm

May I ask the source of your PDFs and images? We have for example historians in the forum who handle a lot of old texts and have lots of experience processing documents that don’t have OCR. They will be able to advise you in how to streamline this if it’s likely to be an ongoing piece of work.

MsLogica · September 8, 2023, 5:16pm

(Also I don’t know what you’re doing with your PDFs, but just be aware that if there is no text layer you cannot highlight and annotate them until this is created, as there is no “text” for the annotation tools to interact with.)

CoachKidd · September 8, 2023, 5:33pm

@MsLogica - thanks again. As I consider your questions I’m realising how little I know.

I’m just stepping into research from many years as an OD practitioner, over which time I’ve collected a great many PDFs from a variety of sources, and never stopped to consider that there might be variations in format. With a background in engineering - marine and software development - I believed that ‘PDF’ was a standard, and never stopped to consider the storage of annotations as I’ve never done it.

I’m happy to convert my PDFs into a newer format with text layer, as I’m not going to lose anything.

And if I understand your guidance correctly, I can use smart groups as queries (there’s my SQL Server background) to process records (PDFs) so that the annotation tools can work effectively.

Do I need to upgrade to Pro to do this?

rmschne · September 8, 2023, 6:07pm

Open the files with Preview. If you can search for text in them, they are already OCR’d.

While there is a standard (I think ISO or other, cannot remember), the implementations of software to make and view PDFs are variable.

BLUEFROG · September 8, 2023, 6:24pm

Welcome @CoachKidd

People take many things for granted – not including things outside work and computing – and this often leads to unrealistic expectations when faced with new environments.

Reiterating some of the astute previous responses:
Yes, you will need OCR done on PDFs that have no text layer.
See this blog post:

Yes, you will need the Pro or Server edition of DEVONthink to do OCR in-application.

“Clearly I need to RTFMs, but this does create a significant amount of friction to engaging with DT and being (somewhat) productive.”

You need to define productive and your expectations here. You are dealing with an entirely new setup, so I wouldn’t be too concerned about productivity yet.
That being said, DEVONthink’s learning curve is more bark than bite and it can be used in very simple ways. As is often said, if you can use the Finder, you can use DEVONthink. DEVONthink just offers much more power to deal with your documents and information.

Start with the appropriately named Getting Started chapter in the Help or manual.

MsLogica · September 9, 2023, 6:37am

Indeed. Don’t be overwhelmed! If you can afford to, upgrade to Pro so you have OCR, set up your smart group to watch for blank files (mine is set to less than 200 characters in a file), give some thought to the organisation structure you might want in your database(s), and then just start putting things in there and using it.

I probably don’t even use 10% of DT’s capabilities currently. I took the attitude to learn as I go (I’ve not read the manuals completely either, though do read the first few chapters as there are basics you need to know) and as a new use case pops up, I figure it out. I bought DT to address a specific problem, but my use has grown far beyond what I intended as I realised how many problems DT could solve (or at least improve).

I also find the forum extremely helpful, which is why I try to be helpful when I can. I read much of what is posted so I learn about the things I’ve not yet tried and can consider how they might apply to my work.

As an aside, I have also found that learning about DT has taught me many useful skills applicable to other things, and it’s made me think far more carefully about workflows across several spheres of my life, which I feel has been to my betterment (though my colleagues may disagree, especially when I am querying why they do something the way they do).

BLUEFROG · September 9, 2023, 1:21pm

We always enjoy hearing of more far-reaching effects of our work, so thanks for sharing this!

NickLowe · September 9, 2023, 1:29pm

No. There are very cheap third-party tools such as Harry Shamansky’s Elucidate that will do this for far less than the price of the upgrade to Pro. If you do it a lot, sooner or later you’ll migrate to Pro for the superior quality and performance of the ABBYY engine and the convenience of having everything in-app, plus additional Pro features such as e-mail archiving. But the Standard edition is all you need for now if third-party PDF OCR tools get the job done.

chrillek · September 9, 2023, 3:01pm

I suspect that’s just an interface to Apple’s Vision framework, given the number of languages supported. And it requires, as the author says, high quality b/w scans to work reliably.

NickLowe · September 9, 2023, 3:18pm

Absolutely. But I used it for years with mostly acceptable results and can cheerfully vouch for it as a good-enough baseline solution. When I originally upgraded to DT Pro it was for convenience and integration rather than any dissatisfaction with the quality of output. Because the upgrade to Pro is just the price difference from Standard, you don’t lose by deferring it until you’re sure you need it.

More importantly for the OP, though, do not purchase any version of DT till you’ve maxed out the trial period on the trial version, which has all the Pro and Server features unlocked for the duration of the trial, and will be your only chance to check those features out without paying for the upgrade. Server in particular is well worth exploring while you have the chance; it’s a very expensive upgrade that most users wouldn’t casually contemplate, but is phenomenally useful while you have it for free.

CoachKidd · September 11, 2023, 10:07pm

Thanks again for the advice, folks. I’ve upgraded to Pro and I’m working my way through the manuals.