PDFs, indexing, importing and OCR - newbie question

Hello everyone,

I’m trying out DT Pro Office after reading great reviews of it. Here’s what I want to do - I hope someone can advise me on how to solve the problem I have.

At present I use EagleFiler to organise my PDFs and it does a solid job. DT is attractive because I’m an academic, and using the AI to find similar PDFs is potentially very interesting. I have about 2,500 PDFs on my Mac, mostly articles but a few books as well. About 90% of these are ‘proper’ PDFs in that they have searchable text within them, but the remaining 10% are effectively scanned images.

I’ve indexed the whole lot at present within DT. This is great - with the 90% that have searchable text I can find similar files, set up groups, and generally organise myself. However, there’s still the 10% of files I’ve indexed that are scanned images, and understandably DT doesn’t do a good job of indexing these.

In an ideal world I’d now get DT to identify these, and reindex them by using OCR. I guess this is asking a lot though. If it can’t be done, can I manually re-index the files one at a time using the OCR facilities, or do I have to import these files in order to get OCR to work?

Thanks to everyone for any help, and apologies if these kinds of dumb questions have been asked before - I’ve had a search through the forums and couldn’t find them.

Best wishes,

Ian

This has been discussed before on the forum. You can make a Smart Group that lists PDFs of type “PDF” but not “PDF+Text”, then select them and use “Data > Convert > to Searchable PDF”. Make sure not to set attributes on them (see the OCR preference pane) so that they can be batch processed.

There is also a script on the DEVONacademy that does this.

Ian, you can identify the image-only PDFs that need OCR by sorting by Kind. Image-only PDF Kind = PDF. Searchable PDF Kind = PDF+Text.

You can sort by Kind in the History view (Tools > History) or in the All PDF smart group. If the Kind column doesn’t already exist, add it to the view using View > Columns and check the Kind option.

If you select an Index-captured PDF and use Data > Convert > to Searchable PDF on it (and have set Preferences > OCR to delete the original), a new searchable PDF will be created in the database and the original Index-captured PDF will be deleted. If you wish to retain the PDF as Index-captured, you would need to export the searchable PDF, replace the original PDF in the Finder and reindex it into the database (and delete the searchable copy that already exists). If you have a number of image-only PDFs to convert, you might experiment with various workflows until you find a satisfactory one.
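If you go the second route, the "replace the original PDF in the Finder" step can also be scripted. Here is a minimal Python sketch, assuming you have already exported the searchable PDF from the database to some folder. The function name `replace_with_searchable` and the backup folder are hypothetical, and the script runs entirely outside DEVONthink: it does not touch the database, so reindexing and deleting the searchable copy still have to be done in the application afterwards.

```python
import shutil
from pathlib import Path


def replace_with_searchable(original: Path, searchable: Path, backup_dir: Path) -> Path:
    """Back up the image-only original, then put the searchable copy in its place.

    Hypothetical helper: returns the path of the backup it created.
    """
    if not original.is_file():
        raise FileNotFoundError(original)
    if not searchable.is_file():
        raise FileNotFoundError(searchable)

    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / original.name

    # Keep the scan in case the OCR result turns out to be poor.
    shutil.copy2(original, backup)
    # Overwrite the original in place, keeping its path and file name.
    shutil.copy2(searchable, original)
    return backup
```

Keeping a backup is a deliberate choice here: once the original is overwritten, there is no other way to get the untouched scan back.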

Thanks very much for getting back to me so quickly. This has worked incredibly well - even for my hand-scanned pdfs which were the wrong way around. I’m pretty impressed.

I’ve still got one small problem. Some PDFs from particular journals come with a few arbitrary words stamped on them (‘all rights reserved 2000’, for example, at the bottom of each page), with the rest of the PDF made up of images only. DTP recognises these files as PDF+Text, which they kind of are, but as a result all the ‘real’ text is missed. I haven’t found a way of getting DTP to OCR these files from scratch so that the full text is incorporated properly. Is there a way to do this?

Thanks again for your earlier help.

Ian

You can do Data > Convert > to Searchable PDF even if the file is already of Kind “PDF+Text”.

Nick

Thanks (again). I see now - DT was doing the OCR in the background and converting the file without telling me. When I open the ‘OCR activity’ I can see it doing the work.

Best,

Ian

You can actually make another Smart Group where you use the word count of files of type “PDF+Text” as a way to find possible candidates.
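To illustrate the word-count idea outside DEVONthink, here is a rough Python sketch. It assumes you already have a page count and a word count for each file (from a column export, say). The `PdfInfo` record, the per-page normalisation, the threshold of 50 words per page and the sample figures are all assumptions made up for this example, not anything DEVONthink provides.

```python
from dataclasses import dataclass


@dataclass
class PdfInfo:
    name: str        # file name, used only for reporting
    page_count: int  # number of pages in the PDF
    word_count: int  # words found in the existing text layer


def needs_ocr(info: PdfInfo, min_words_per_page: float = 50.0) -> bool:
    """Return True when the text layer looks too thin to be the real content."""
    if info.page_count <= 0:
        return False  # nothing to judge, so leave the file alone
    return info.word_count / info.page_count < min_words_per_page


def ocr_candidates(infos, min_words_per_page: float = 50.0):
    """Names of PDFs whose words-per-page ratio falls below the threshold."""
    return [i.name for i in infos if needs_ocr(i, min_words_per_page)]


# Made-up sample data for illustration only.
library = [
    PdfInfo("proper_article.pdf", page_count=12, word_count=6500),
    PdfInfo("stamped_scan.pdf", page_count=20, word_count=80),  # copyright stamp only
    PdfInfo("pure_scan.pdf", page_count=8, word_count=0),
]

print(ocr_candidates(library))
```

A journal scan with only a copyright line stamped on each page ends up with a handful of words per page, so it falls well under the threshold, while a properly typeset article does not. The threshold would need tuning against your own files.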