If I edit OCR text, then attach it to a PDF+Text file as metadata, will this screw up DT's scoring of search results?

Ryan_N · June 19, 2020, 7:04pm

Simply put:

I want to correct OCR text in hundreds of short (2-3 sentence) PDF files I have from 17th- and 18th-century newsprint (this is genealogy related).
Acrobat’s Correct Recognized Text is not an option. I get a “there were no errors” popup when I attempt this.
As a workaround: I can convert to a plain text file in DT, which produces a copy of the PDF as a .txt file, then edit that to build a verbatim transcription of the original. I could then paste those three sentences into the original, i.e. into a metadata field created for this purpose, but this is where my concern emerges. I’m worried about certain words/phrases now being counted twice: once from the text that was correctly OCR’ed, and once from my manual transcription of the same text.

For instance, if the OCR in the original PDF actually did recognize Mr. Manfrengensen’s surname in 1690 newsprint, but that same word is now also pasted into metadata, does DT end up scoring this entire file higher in future search results because that word is now getting two hits within the same file? (My goal is to avoid this!)
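To make the worry concrete, here is a minimal sketch, assuming a naive scorer that simply counts how often a term appears across a file's text layer and its metadata. This is not DT's actual ranking, which I don't know and which may well treat metadata differently. The function name `term_frequency` and the sample sentence are hypothetical, made up for illustration.

```python
import re
from collections import Counter


def term_frequency(text: str, term: str) -> int:
    """Count case-insensitive whole-word occurrences of `term` in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)[term.lower()]


# Hypothetical sample: the OCR layer already reads correctly, and the same
# sentence is pasted again into a custom metadata field.
ocr_layer = "Mr. Manfrengensen of this town departed on Tuesday last."
metadata_field = "Mr. Manfrengensen of this town departed on Tuesday last."

# ASSUMPTION: a naive scorer indexes the text layer and metadata together.
body_only = term_frequency(ocr_layer, "Manfrengensen")
body_plus_metadata = term_frequency(ocr_layer + " " + metadata_field, "Manfrengensen")

print(body_only, body_plus_metadata)
```

Under that assumption the surname is counted once for the text layer alone and twice once the transcription is pasted into metadata, which is exactly the double hit I want to avoid.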

Thanks fellas! I’m open to any and all ways of pulling this off, so if a better workflow exists, I’d love to hear of it. Thanks in advance!

BLUEFROG · June 19, 2020, 8:19pm

What application did the OCR initially?

Ryan_N · June 19, 2020, 9:22pm

ABBYY FineReader Engine 12 apparently, though for some of these files I instead see “macOS Version 10.15.2 (Build 19F96) Quartz PDFContext” (under properties>creator in the right pane of DT). This explains why I can’t use Adobe to correct the text, but from a more general standpoint I am still curious about the search results (scoring) question.

I’m only just beginning a transition to DT. I’m doing this very slowly, in order to audit the OCR and metadata of ALL my existing files, and also to carefully think out my directory structure (groups), which is already a labyrinth on Google Drive.

This is all hobby-related so I can basically take as long as I want to do it in a way that satisfies me. Cheers!