"find and replace" - fixing OCR orthography errors from early modern docs

My notetaking often involves copying text from 17th- and 18th-century documents. Luckily, the PDFs I have are either already OCR’d or I can get DEVONthink to OCR them. Less luckily, between scan quality and (especially) different orthographic conventions, the text comes out a little garbled. Like, you can figure it out, but it slows you down and (especially) it may interfere with search efficacy.

Here’s an example. First, the original PDF:

[screenshot of the original PDF]

Now, the pasted text in an RTF notecard:

[screenshot of the pasted OCR text]

To be clear, I’m amazed the OCR does as well as it does! But both for efficiency of future reading and especially for future search/"see also"ing, I’m trying to think of a way to fix it at scale–at least for the orthography issues.

Take the simplest example of orthographic problems: many S’es actually appear as F’s. Fo it lookf like thif, pretty filly, huh? The best idea I have for fixing this is search-and-replace by word, because obviously I don’t want to replace ALL the F’s with S’es; some of them are actually F’s!
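In regex terms, what I’m imagining is whole-word replacement, something like this (just an illustration, with a made-up sample sentence):

// Whole-word replacement: only the known misread word gets touched;
// legitimate f’s elsewhere (e.g. in “facsimile”) are left alone.
const sample = 'The Conftitution was printed in facsimile';
sample.replace(/\bConftitution\b/g, 'Constitution');
// -> 'The Constitution was printed in facsimile'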

Going through a bunch of old posts on the topic, it seems that my best solution is:

  1. make a smart folder with criteria that capture all my notecards that have transcribed text
  2. select all files in the smart folder
  3. go to Scripts > Edit > Replace Text in Documents…
    3a. for “Enter text to find:” enter e.g. “Conftitution”
    3b. for “Enter replacement text:” enter e.g. “Constitution”
  4. repeat this for about 20 common important words that get screwed up in a consistent way across all OCRs

I’ve tested this (using nonsense words) on a subset of documents, and it seems to have worked simultaneously as a batch operation across multiple files–at least so long as all of the highlighted files are next to each other in the list. So here’s my question: Am I missing something about why this might be a bad idea? Or am I right that this should be a pretty straightforward “find and replace” operation?

Sorry for the long post, just wanted to be really certain I’m getting these steps right – some old posts suggest that the DT team was (understandably) reluctant to add this functionality because of risks to people’s databases. If this batch operation introduces hidden problems, I might not know about them for months to come–at which point it would be a nightmare to try to untangle everything.

Any other clever ideas for how to handle this are of course also welcome!

Thanks as always
MichiganUser

Interesting challenge. As you’ve found out, PDF files are ‘just’ plain text files under the hood (try opening one in a text editor and you’ll see the structure). Replacing single characters should be rather innocuous, but there might be consequences I don’t know about. I would at least suggest doing the search-and-replace before importing your document, so DT always indexes the right (text) content. I’d use some standard text manipulation tools such as grep, sed, awk or others.
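For instance, a rough sketch of that pre-import cleanup in JavaScript (Node.js), just to show the idea; a sed or awk one-liner would do the same job. The file name and the word pairs below are only placeholders:

// Sketch: clean an exported plain-text transcription before importing it into DT.
// 'transcript.txt' and the word pairs are placeholders, not real data.
const fs = require('fs');

const fixes = {
  Conftitution: 'Constitution',
  manifeft: 'manifest',
};

let text = fs.readFileSync('transcript.txt', 'utf8');
for (const [wrong, right] of Object.entries(fixes)) {
  // \b restricts the replacement to whole words
  text = text.replace(new RegExp(`\\b${wrong}\\b`, 'g'), right);
}
fs.writeFileSync('transcript-cleaned.txt', text);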


That doesn’t sound right to me. Or rather: there are certainly lots of PDF documents out there that are anything but plaintext. Just think of a photo saved as PDF.

And modifying a PDF in a text editor is probably a bad idea.

Christian – you’re right. It’s definitely not 100% true; I was mostly referring to the structure of PDF, which is a follow-up to PostScript and other text-based formats. But funnily enough, changing the text contents is probably doing exactly that: changing the internal text, and it seems to work in this case. But you probably want to check your approach so that you’re not corrupting your PDFs in the process.


:+1:
I suppose that it might be possible to do something along these lines (JavaScript):

const txt = record.plainText();
record.plainText = txt.replace(/onftitution/g, 'onstitution').replace(/manifeft/g, 'manifest'); // …one replace() per misread word

In fact, I’d define a dictionary up front, like so:

const dict = { conftitution: 'constitution',
               manifeft: 'manifest',
               /* …more word pairs… */ };
const REString = Object.keys(dict).join('|');
/* Now RE string is 'conftitution|manifeft|...' */
const RE = new RegExp(`(${REString})`, "g");
/* RE is /(conftitution|manifeft|...)/g, i.e. it uses a capturing group for all the keys in the dictionary */
const txt = record.plainText().replaceAll(RE,(match) => dict[match]);
/* That's the fun part: replaceAll uses a _function_ to replace the dictionary key with its value */
record.plainText = txt;

So we’ve solved the problem in basically four lines of code (plus the dictionary, of course). And the last line could even be folded into the one before it.
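For a whole batch, record has to come from somewhere. One possible wrapper, assuming DEVONthink 3’s JXA interface and that selectedRecords() returns whatever is currently selected in the item list:

const app = Application('DEVONthink 3');
const dict = { conftitution: 'constitution', manifeft: 'manifest' /* …more pairs… */ };
const RE = new RegExp(`(${Object.keys(dict).join('|')})`, 'g');
// apply the dictionary to every record currently selected in DEVONthink
app.selectedRecords().forEach(record => {
  record.plainText = record.plainText().replaceAll(RE, (match) => dict[match]);
});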

That doesn’t touch the PDF itself (i.e. the graphics commands to draw dots on paper/screen) at all; it just modifies the text layer, which is what DT needs for indexing and searching, so it should be just what the doctor ordered.

Indeed, PDF is derived from PS. But PS is already a programming environment: you can actually define your own functions to draw text or anything else, and these functions can work on base64-encoded strings (or whatever). So while PS/PDF can be human-readable, it can also be utter gibberish :wink:


Not sure, but does this change the PDF in the end? Or just the ‘indexed’ text from the record? I don’t think this changes e.g. the text when you select something inside the PDF? If this is fine with the OP then we’re there, of course, but if we want to ‘sync’ the changes back into the text, maybe something more is needed?

I just checked. It doesn’t. All it does is give me a warm, fuzzy feeling about having been so clever :frowning:
It does not even change anything about DT’s index. Well, would’ve been nice.


Well, I still appreciate the heck out of your taking a cut at figuring it out! Thank you : )