My notetaking often involves copying text from 17th and 18th century documents. Luckily, the PDFs I have are either OCR’d or I can get DevonThink to OCR them. Less luckily, between scan quality and (especially) different orthographic conventions, it comes out a little garbled. Like, you can figure it out, but it slows you down and (especially) it may interfere with search efficacy.
Here’s an example. First, the original PDF:
Now, the pasted text in an RTF notecard:
To be clear, I’m amazed the OCR does as well as it does! But both for efficiency of future reading and especially for future search/"see also"ing, I’m trying to think of a way to fix it at scale–at least for the orthography issues.
Take the simplest example of orthographic problems: many S’es actually appear as F’s. Fo it lookf like thif, pretty filly, huh? The best idea I have for fixing this is search-and-replace by word, because obviously I don’t want to replace ALL the F’s with S’es, some of them are actually F’s!
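To make the word-by-word idea concrete, here is a minimal Python sketch of what I mean, outside of DevonThink entirely. The word list and the `fix_words` name are made up for illustration; the point is only that whole-word matching leaves genuine F's alone.

```python
import re

# Hypothetical correction table: garbled OCR form -> intended word.
CORRECTIONS = {
    "Conftitution": "Constitution",
    "Congrefs": "Congress",
    "thofe": "those",
}

# \b anchors each match to a whole word, so an F inside any word
# that is not in the table is never touched.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b"
)

def fix_words(text):
    """Replace only the listed whole words; matching is case-sensitive."""
    return _PATTERN.sub(lambda m: CORRECTIONS[m.group(0)], text)

print(fix_words("The Conftitution left thofe free states in Congrefs."))
```

Note that "free" survives untouched, and a longer word like "Conftitutional" would not be changed either unless it gets its own entry in the table.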
Going through a bunch of old posts on the topic, it seems that my best solution is:
1. Make a smart folder with criteria that capture all my notecards that have transcribed text
2. Select all files in the smart folder
3. Go to Scripts > Edit > Replace Text in Documents…
4a. For “Enter text to find:” enter e.g. “Conftitution”
4b. For “Enter replacement text:” enter e.g. “Constitution”
5. Repeat this for about 20 common important words that get screwed up in a consistent way across all OCRs
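For comparison, the same batch steps could be sketched in Python against a folder of exported notes. This is not how the DevonThink script works internally (I don't know that); it assumes the notes have been exported as plain-text `.txt` files, and the folder names, word list, and `batch_fix` function are all hypothetical. It writes corrected copies to a separate folder and never touches the originals.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical word list, playing the role of the find/replace pairs.
CORRECTIONS = {"Conftitution": "Constitution", "Congrefs": "Congress"}
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b"
)

def batch_fix(src_dir, dst_dir):
    """Write corrected copies of every .txt file in src_dir into dst_dir.

    Returns {filename: number of replacements made}.
    The files in src_dir are only read, never modified.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    counts = {}
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        fixed, n = PATTERN.subn(lambda m: CORRECTIONS[m.group(0)], text)
        (dst / path.name).write_text(fixed, encoding="utf-8")
        counts[path.name] = n
    return counts

# Small demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    exported = Path(tmp) / "exported"
    exported.mkdir()
    (exported / "note1.txt").write_text(
        "The Conftitution and Congrefs.", encoding="utf-8"
    )
    print(batch_fix(exported, Path(tmp) / "fixed"))
```

The per-file counts make it easy to spot a file where far more (or far fewer) replacements happened than expected.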
I’ve tested this (using nonsense words) on a subset of documents, and it seems to have worked simultaneously as a batch operation across multiple files–at least so long as all of the highlighted files are next to each other in the list. So here’s my question: Am I missing something about why this might be a bad idea? Or am I right that this should be a pretty straightforward “find and replace” operation?
Sorry for the long post, just wanted to be really certain I’m getting these steps right – some old posts suggest that the DT team was (understandably) reluctant to add this functionality because of risks to people’s databases. If this batch operation threatens hidden problems, I might not know about it for months to come–at which point it would be a nightmare to try to untangle everything.
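One way to reduce that risk, sketched here in Python as an assumption rather than anything DevonThink provides, would be a dry run: count what would be replaced in each note before changing anything. The `dry_run` function and the sample notes are hypothetical, and the note text is passed in as plain strings.

```python
import re
from collections import Counter

def dry_run(notes, corrections):
    """Report which garbled words appear in which notes, changing nothing.

    notes:       {note name: note text}
    corrections: {garbled word: intended word}
    Returns {note name: Counter of garbled words found}, skipping clean notes.
    """
    if not corrections:
        return {}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in corrections) + r")\b"
    )
    report = {}
    for name, text in notes.items():
        hits = Counter(pattern.findall(text))
        if hits:
            report[name] = hits
    return report

sample_notes = {
    "note A": "The Conftitution, and again the Conftitution.",
    "note B": "Nothing garbled in here.",
}
for name, hits in dry_run(sample_notes, {"Conftitution": "Constitution"}).items():
    print(name, dict(hits))
```

If the report shows hits in notes that should not contain transcribed text at all, that is a sign the selection criteria are too broad, and it shows up before any file has been altered.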
Any other clever ideas for how to handle this are of course also welcome!
Thanks as always