Displaying searched phrase in PDF

Hello all,

My first post… :wink:

I’ve done a search - but several of the “hits” are quite dated, and so figured I’d pop this up, in the hopes of receiving the most contemporary information.

I’m a recent convert to DEVONthink Pro (DtP). I’ve imported most of my PDFs - doing my PhD in law - and most of these have been OCR’ed using PDFpen. I gather that the “PDF+Text” kind shown in DtP is confirmation that it also knows the PDFs are OCR’ed.

I’m struggling to get right something that I presumed would be very simple.

Irrespective of whether I use the “quick search” option, the full search option, or even the Edit/Find option, I cannot seem to get DtP to highlight a “searched phrase”…

By this I mean - assume the citation in my text has the following phrase from an article that I have as a PDF, already OCR’ed and imported into DtP: “Union liability was considered”. If I enter that text, including the quotation marks, in any of the search options mentioned, the correct PDF is identified. But for some reason, unlike in something like Preview on Mac OS X, the actual phrase in the text of the PDF is not highlighted in any manner - which leaves me having to find it manually?

What am I missing? Surely that would be one of the most basic features that DtP should be able to offer?

In addition hereto - and I guess a fully-related query - if I then open that PDF, using the built-in viewer within DtP (i.e. not using Preview), I cannot seem to locate a simple “search” option in the PDF viewer, to allow me to again search for a particular phrase (i.e. “Union liability”) within ONLY that particular PDF?

Any pointers as to where I am going wrong? I am again presuming that this would be one of the most basic features available, and yet I cannot seem to locate it?

Any help would be greatly appreciated!

Aw shucks… :blush:

Have been playing around with the various search options - and I think I’ve found the problem… And it fills me with dread - it appears that some of my PDFs have been OCR’ed in a not-altogether-comprehensive manner… :confused:

Opened the file in question in Preview - seems some pages have been converted, others have not.

In trying the Edit/Find/Find search option back in DtP again, it highlighted the searched phrase as I was expecting, but only on some of the earlier pages - and not, as Murphy would have it, on the page I was looking at when I did my initial search…

Going to try a few more, and will report back if the “problem” proves to be only OCR-related, and nothing else…

Sounds like you have the answer to the first part of your question. On the second part, searching within a document: Command-F will bring up the Find window, which lets you find a string in a document and also replace text in a text document.

Thanks Greg!

Yes - I found that after having realised the issue lay with the OCR’ing. As soon as I entered a different phrase, from an earlier page, it highlighted it as expected.

So on the one hand I’m relieved that DtP does have the basic search features that I was expecting, but on the other hand - I’m now facing the oh-so-unpleasant realisation of a not-necessarily-100%-OCR’ed library :cry:

Urg! Good luck getting this sorted out. You may consider trying to reconvert by right-clicking a PDF+Text in DEVONthink and selecting “Convert > To Searchable PDF”.

I gathered that the OP is not using Pro Office, so that will not be an option in this case.

There is a simple, non-destructive method to check the quality or completeness of PDF OCRs. Select a “PDF+Text” document in DEVONthink, control-click it, and choose Convert > to Plain Text from the contextual menu. This creates a new plain text document that contains only the text layer that the OCR process created. It can be helpful to look at this to see how accurate the OCR was. It is the text layer that any program that searches PDFs (including DEVONthink) will look at.
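The same check can be done page by page, which helps when only some pages were converted. Here is a minimal Python sketch, assuming the text of each page has already been extracted into a list of strings by whatever tool is at hand. The function name `pages_missing_text` and the 20-character threshold are hypothetical choices for illustration, not part of DEVONthink:

```python
def pages_missing_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose text layer looks empty.

    page_texts: one string per page, in page order (extraction is assumed
    to have been done elsewhere).
    min_chars: hypothetical cut-off; pages with fewer letters/digits than
    this are reported as probably not OCR'ed.
    """
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Count only letters and digits so whitespace and stray
        # punctuation do not make an empty page look populated.
        readable = sum(1 for ch in text if ch.isalnum())
        if readable < min_chars:
            missing.append(number)
    return missing
```

A page reported here may still contain garbage strings rather than nothing at all, so reading the converted plain text remains the more reliable check.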

If @Cassady decides to re-do the OCR, I suggest experimenting with the quality settings on whatever OCR software is used and checking the results with the convert-to-text method. Because text appears to be missing from the text layer in several PDFs, I assume that the originals were images. If the image quality is poor, or the images could not be completely de-skewed, etc., then no amount of redoing the OCR will fix the problem of missing text. Most OCR software will skip pages that cannot be recognized or will output garbage strings when the software encounters images that cannot be resolved. Be careful not to re-OCR files that are already OCR’ed. That will never fix anything – it will just make the results worse.

Missed this. Thanks, Greg.

Thank you for your comprehensive answer and suggestions. The OCR results not being as complete as expected has surprised me somewhat. The particular PDF in question was downloaded from one of the main Library databases, and the quality is (presumably) good. Having said that - it is an “old” article, so the typesetting/font is a bit old-school, which might explain PDFpen having a few problems - although that fails to explain why it performed its task as expected on some pages, and not on others.

I have a second OCR application called Prizmo, which generates a completely separate text file of the PDF. Its interface is not as slick as that of PDFpen, which by all accounts is a popular and well-written application, so I am loath to use it in lieu of PDFpen - but it might be worth considering. As it stands, and given your valid point about not subjecting a PDF to repeated OCR events, I might just shrug “Bygones”, and leave it be…

Hmmm, did you download the file from that library and then OCR it with PDFpen? If the downloaded document was already “PDF+Text” when it was downloaded, then the PDFpen OCR step might have introduced errors.

Nope - to be honest, I’m not even sure this particular PDF was OCR’ed by PDFpen on my Mac - but since I would presume the OCR process that is applied by the various Databases would be faultless [ :unamused: ], kind of made that assumption…

I have Hazel installed, and having realised that many of my PDFs were already OCR’ed, I wanted to avoid duplicating work, since I have many thousands of PDFs in my library. I had Hazel invoke a script that would run a TextEdit search of a PDF, to count how often the word “encoding” occurs… My ‘investigations’ (and I use that term in the loosest possible manner :wink: ) had shown that where the term appeared more than twice, there was a high probability that the PDF had not been OCR’ed…

Hazel identified PDF’s on the above terms, and then invoked PDFpen where needed… Took a few hours, but I managed to process hundreds of PDF’s in this manner, so was worth it…
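For reference, the rule of thumb described above could be sketched in Python roughly as follows. This is not the original Hazel script: the function names, the case-insensitive match on the raw file bytes, and the folder scan are all assumptions made for illustration, and the “more than twice” threshold is only the rule of thumb from this thread, not a reliable test for a missing text layer:

```python
from pathlib import Path


def count_marker(raw: bytes, marker: bytes = b"encoding") -> int:
    """Count case-insensitive occurrences of the marker in raw PDF bytes."""
    return raw.lower().count(marker.lower())


def probably_needs_ocr(raw: bytes, threshold: int = 2) -> bool:
    """Apply the thread's heuristic: more than `threshold` hits means
    the PDF was probably NOT OCR'ed (an unverified rule of thumb)."""
    return count_marker(raw) > threshold


def scan_folder(folder: str) -> list:
    """Return the PDFs in `folder` that the heuristic flags for OCR."""
    return [
        path
        for path in sorted(Path(folder).glob("*.pdf"))
        if probably_needs_ocr(path.read_bytes())
    ]
```

Because the count runs over raw bytes, compressed PDF streams can hide or distort matches, which is one more reason to treat the result as a hint only.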

Had I known about the possible dangers of re-OCR’ing an already OCR’ed file, I would have possibly been a bit more circumspect in my ‘investigation’, but alas - what is done, is done… :unamused:

If batch handling is needed in the future, I’d suggest scripting PDFpen to do the “needs/doesn’t need OCR” test. PDFpen’s AppleScript dictionary includes a “needs ocr” property in its document object.
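As a rough illustration of that suggestion, the sketch below builds a small AppleScript and runs it with `osascript` from Python. Only the “needs ocr” property comes from PDFpen’s dictionary as mentioned above; the `open` and `close … saving no` lines, the use of `document 1`, and both Python function names are assumptions that should be checked against the dictionary before relying on them:

```python
import subprocess


def build_needs_ocr_script(pdf_path: str) -> str:
    """Build AppleScript text that asks PDFpen whether a PDF needs OCR.

    Assumption: PDFpen accepts the standard `open` and `close` commands
    and the opened file becomes `document 1`.
    """
    # Escape backslashes and double quotes for an AppleScript string.
    escaped = pdf_path.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        'tell application "PDFpen"',
        f'    open POSIX file "{escaped}"',
        "    set flag to needs ocr of document 1",
        "    close document 1 saving no",
        "    return flag",
        "end tell",
    ])


def needs_ocr(pdf_path: str) -> bool:
    """Run the script via osascript (macOS only, PDFpen installed)."""
    result = subprocess.run(
        ["osascript", "-e", build_needs_ocr_script(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    # osascript prints AppleScript booleans as "true" or "false".
    return result.stdout.strip() == "true"
```

Keeping the script builder separate from the `osascript` call makes it easy to inspect the generated AppleScript in Script Editor first.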

Well now. That would have saved me from weeks of work, had I known! :laughing:

Still - it was a useful exercise in learning some Apple/scripting code…

And since I only have a vague idea of the uses of the dictionary, I clearly still have a way to go! :slight_smile:

Regardless, thanks for the tip. I have managed to acquire several scripting/AppleScript books, but have not yet had the chance to get to grips with anything in them. When I do, I will keep the above in mind…