How to select, highlight and underline in a pdf?

Timotheus · July 5, 2010, 4:27pm

The question will be stupid, and the answer probably obvious, yet I don’t understand a very basic thing.

I have made some scans (pdf) of a document, and imported them into DT.

And now I would like to underline or highlight some passages, but for some reason the system refuses to select them. What am I doing wrong? Or is this simply not possible with PDFs that are made this way?

Bill_DeVille · July 5, 2010, 4:48pm

You can select text in a searchable PDF, then, if you wish, highlight that text.

But unless OCR has been applied to a PDF resulting from a scan, the PDF doesn’t contain searchable or selectable text – it’s just a picture, and highlighting isn’t possible.

DEVONthink Pro Office has built-in OCR capability.

Timotheus · July 5, 2010, 6:46pm

Thanks, Bill, for your quick answer, which confirms what I already suspected.

I do own DT Office, and I have seen that turning an image pdf into a readable pdf is actually quite simple: just right-click the document title, select “convert to searchable pdf”, and you’re done.

But I have some more questions / problems:

1. I have seen that the OCR engine doesn’t recognize parts of hyphenated words as parts of one and the same word: it doesn’t recognize “Ce-cilia” as an occurrence of “Cecilia”. Is there anything that can be done about this?
2. How can I check whether or not the OCR engine has read the text correctly? And how can I correct misreadings?
3. Is it possible to teach the OCR engine not to take into consideration certain parts of the text (headers, footers, page numbers etc.)?
4. I have many pdf documents in which every pdf page consists of two opposing book pages. When I try to select a part of one of these pages, the software stubbornly also selects the corresponding part of the opposite page. In other words: the software doesn’t recognize the left page and the right page as two distinct entities. Is there something that can be done about this, or can this only be avoided by separating the left book page and the right book page into two autonomous pdf pages?

Thanks!

Bill_DeVille · July 5, 2010, 9:18pm

The parts of a hyphenated word aren’t recognized as one word because that’s the way the characters are shown in the PDF image. As a practical matter, there’s nothing that can be done about this within the PDF.

You can review the text resulting from OCR by selecting the PDF, then choosing Data > Convert > to rich (or plain) text.

But as a practical matter, there’s no way to correct errors in the PDF itself, although you can of course correct errors in the converted text document.
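
For what it’s worth, once the OCR text has been converted to plain text, split words like “Ce-cilia” can be rejoined outside DEVONthink. The following is only a rough sketch, not a DEVONthink feature: the function name `join_hyphenated` is made up for illustration, and it assumes the split words appear in the exported text as a hyphen at the end of one line with the rest of the word at the start of the next.

```python
import re

# Hypothetical helper (not part of DEVONthink): matches a word character,
# a hyphen, a line break with optional spaces or tabs around it, and the
# next word character.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*(\w)")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break."""
    # Keep the two surrounding characters, drop the hyphen and line break.
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "the feast of Ce-\ncilia"
    print(join_hyphenated(sample))  # prints: the feast of Cecilia
```

Note that this also joins genuine compounds that happen to break at a line end (“well-” / “known” becomes “wellknown”), so the result still needs a read-through. Hyphens inside a line are left alone.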

No, the OCR engine can’t be taught to skip headers, footers or page numbers.