With a lot of help and patience from Bluefrog, I’ve finally identified the issue I’m having with OCR. I’m a historian and have downloaded thousands of files from newspapers.com. These are not scanned images - they are downloaded files that no longer have an OCR layer, which apparently exists only on the servers of Newspapers.com. In the downloads, they add a text header and footer, which causes ABBYY all kinds of trouble, since it is an image-processing program and does not quite know what to do with documents that contain both text and an image. Sometimes the text layer is pretty good and other times not so good - it is a hit-or-miss proposition. I conclusively proved what Bluefrog had been trying to get me to understand for a long time by cropping the downloaded file to eliminate the text header and footer. The cropped file was processed with 100% accuracy. With the text included, not so good - anywhere from 10% to 80%.
If you have any experience with this issue, I’d love to hear how you solved it. Manually cropping the thousands of clippings I have is out of the question. I’ve looked for a Mac program that will remove text from photos but have not found anything. These files are pdfs with embedded images - they are not scanned documents. Newspapers.com has pulled quite a trick on unsuspecting customers by adding those headers, which include the title of the newspaper, date of publication, page number and copyright information. Valuable information, to be sure, but a pdf with embedded images is not a good candidate for an OCR engine.
I’m probably missing context but I don’t have this issue with newspapers.com downloads. The PDFs do put the newspaper article image between a header and footer, but the same text gets OCRed as when I choose Save as JPG and OCR that image to a searchable PDF. If I were willing to crop the headers and footers of the PDFs, I’d just download the jpgs and save that step.
(No question it’d be nice if newspapers.com would include the ocr layer in the PDFs.)
I’ve been saving the clippings as .pdf files from the get-go, not being educated about OCR at the time. I just thought it was logical to select pdf because that is what OCR operates on, right? Well, yes, but only if the pdf is scanned, not downloaded from newspapers.com. I did a little experiment after being educated about OCR by Bluefrog: I downloaded the pdf file, OCRed it in DT, and then downloaded the same file as a jpg and OCRed it in DT. Try that and compare what you see when you convert both outputs to plain text so you can view the text layer. For me, working with pdf files produced much poorer results than working with jpg files. This issue is crucially important because a poor text layer is going to be worthless when doing a keyword search in DT. Please let me know what your experience is. Maybe I’m doing something wrong.
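One way to make the PDF-versus-JPG comparison less subjective is to diff the two extracted text layers programmatically. Here is a minimal sketch in Python (standard library only); the two sample strings are stand-ins for text you would export from DT via convert to plain text, and the function name is my own:

```python
import difflib

def ocr_similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 similarity ratio between two OCR text layers.

    Whitespace is normalized first so differences in line wrapping
    do not count against the score.
    """
    norm_a = " ".join(text_a.split())
    norm_b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, norm_a, norm_b).ratio()

# Stand-in snippets modeled on the examples quoted in this thread.
pdf_layer = "M. V . S p e n c e r a n d f a m i l y h a v e"
jpg_layer = "M. V. Spencer and family have"

print(f"similarity: {ocr_similarity(pdf_layer, jpg_layer):.2f}")
```

A score near 1.0 means the two layers are essentially the same; a low score flags files worth re-examining. This is only a rough gauge, not a substitute for eyeballing the text.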
One issue with downloading the clippings as jpg files: the headers and footers are not included, so when you have a multi-part clipping and do a merge-and-delete, you need to remember what the file name is because there is no reference to guide you.
I get the same search results with an OCRed PDF and an OCRed jpg, both downloaded from newspapers.com. Very similar word clouds, I can search the same phrases and see them highlighted in the right places, etc. And I’d expect good OCR results—these aren’t particularly challenging OCR exercises for ABBYY.
I’m not sure what to suggest doing differently, unfortunately. I have the DPI of the generated PDFs set high, but that shouldn’t matter. I’m sure @BLUEFROG is providing good help in your ticket.
I did a rigorous test of my theory and I have come to the conclusion that newspapers.com clippings need to be subjected to a two-stage process. ABBYY first processes the text added to the pdf file - the headers and footers with the newspaper title, date, page number and copyright verbiage. That pass yields a plain text file containing just the header and footer. You then have to re-process the file. The second time around, ABBYY processes the image. DT gives you a warning dialog box about processing the file again at this stage. The text layer, after two passes, is the same as for the jpg download. Processing a newspapers.com file, in my experience, requires two passes. If you are getting good results with one pass, I don’t know what to say, other than that perhaps ABBYY sometimes processes everything in one shot. ABBYY was not designed to deal with this scenario - it is designed to work with scanned documents, not downloaded pdf files that contain embedded images.
As the saying goes, though, Your Mileage May Vary.
The above OCRs fine the first time. ABBYY has no issue OCRing embedded images in PDF files. Every scanned document PDF contains an embedded image of the document!
If what you are doing works for you, that’s wonderful! I don’t have a particular article “that’s giving me trouble” as I experience this issue with every article I download from newspapers.com.

If you are curious to see what I’m referring to, download an article, move it from your Downloads folder to the Inbox in Finder and then go to DT and execute, from the context menu, Convert - to plain text, on the file in the Global Inbox. What you then see is what newspapers.com has added to the file. The image is embedded in that pdf file and the reason you get the Kind type “PDF+Text” for the file in the Global Inbox is because of that text. “PDF+Text” does not mean that the image has been processed - only the header and footer.

Now, instead of running OCR on the file in the Global Inbox, move it to its final destination and repeat the convert to plain text command. You will see the text layer. Moving the file to the final destination apparently triggers ABBYY again and it processes the embedded image. I don’t know - perhaps one of the developers can comment on that.

Do the same routine with a jpg file and then do a stare-and-compare between the two displayed text layers. In some cases, for me, the jpg text layer is much more accurate than the pdf text layer. I attribute that to the fact that ABBYY is not designed to process pdf files with embedded images. It is designed to work with scanned images.
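For anyone facing a backlog of thousands of clippings, a simple heuristic can flag in bulk which files still have only the header/footer as their text layer. Here is a sketch (standard library only); the marker strings are my guesses based on the sample header quoted in this thread, and `layer_text` stands in for whatever convert to plain text gives you:

```python
# Assumed patterns taken from the newspapers.com header/footer:
# a "Downloaded on" line, a copyright notice, and the dotted
# title/date/page line ("The Miami News ... · ... · Page 6").
HEADER_MARKERS = ("Downloaded on", "Copyright", "·")

def looks_header_only(layer_text: str, max_lines: int = 4) -> bool:
    """Heuristic: the text layer is probably just the newspapers.com
    header/footer if it is only a few lines long and every line
    matches one of the known header patterns."""
    lines = [ln.strip() for ln in layer_text.splitlines() if ln.strip()]
    if not lines or len(lines) > max_lines:
        return False
    return all(any(m in ln for m in HEADER_MARKERS) for ln in lines)
```

You could run this over a folder of exported text layers to build a list of PDFs that still need the second OCR pass, instead of checking them one at a time.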
You are correct that selecting the pdf download option on newspapers.com generates a pdf file with an embedded image. There is no text layer in the embedded image, thus the need to process it with DT and ABBYY. As I wrote before, though, Your Mileage May Vary. Perhaps your text layers are close to 100% accurate. Sometimes, mine are and sometimes they are not. Why? I don’t know.
Moving a file in DT does not trigger OCR in general, unless you have a smart rule that handles this situation.
Also, moving files about (from Download to Inbox or whatever) does not change what is in them.
Of course it is. I scan PDFs every day and have them processed with Abbyy. No problem at all. And what would a scanned PDF be other than an image?
What would be the difference between a “scanned” and an “embedded” image? IMO, introducing arbitrary terminology just obscures the issue. A PDF can contain an image, and it doesn’t matter a bit if that image was created by a scanner or in some other way. In fact, whatever these newspaper guys have on their site must have been scanned.
Good question re: difference between a scanned and embedded image. Newspapers.com offers two ways of downloading a clipping: jpg and pdf. With a jpg download, that’s all you get: an image. With a pdf download, which is what I’ve been doing for years, you get a document that has added text containing the name of the newspaper, the date of publication, the page number and also copyright information from newspapers.com, because they created this document. It is their intellectual property. The jpg file is not - it is public domain because it is (in my case, anyway) over 75 years old. When I use the word “embedded” I mean the pdf document that newspapers.com created.
In most cases, the newspapers have probably been processed from microfilm images, not paper. So there is a sort of “scanning” involved, but it is way more sophisticated than your ordinary 3-in-1 office copy/print/scan machine.
This is “arbitrary terminology” to those who are not familiar with the technology used in digitizing newspapers. I’m not an expert but I’ve dealt with a top-notch digitization firm and have learned a lot.
I’ll post the screenshots of an example of what I’m writing about soon. Perhaps that will help you understand what I’m referring to.
Here is a snippet from a pdf download of an article from newspapers.com:
The Miami News (Miami, Florida) · Thu, Apr 30, 1914 · Page 6
D o w n l o a d e d o n F e b 2 6 , 2 0 2 3
M. V . S p e n c e r a n d f a m i l y h a v e
been v i s i t o r s t o t h i s s e t t l e m e n t : i x
t w o o c c a s i o n s r e c e n t l y. M r . S p e n -
cer h a s a p r o m i s i n g g r o v e s e t o u t
and w i l l b u i l d a h o m e o n i t t h i s
s u m m e r
And here is that same snippet from a jpg download of the same article:
M. V. Spencer and family have
been visitors to this settlement rn
| two occasions recently. Mr. Spen- 1
| cer has a promising grove set out
I and will build a home on it this
| summer
I am willing to bet that a search for the word “settlement” will come up empty in the pdf snippet.
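The reason that search fails is visible in the snippet: the pdf text layer has a space between every letter, so the literal string “settlement” never occurs. A crude workaround for yes/no keyword checks (not a fix for the text layer itself) is to collapse the whitespace before searching. A sketch, standard library only, with the snippet hard-coded as a stand-in:

```python
def spaced_out_contains(layer_text: str, keyword: str) -> bool:
    """Check for a keyword in OCR text whose letters may be separated
    by spaces. Collapsing all whitespace also destroys word boundaries,
    so this is only good for presence/absence checks, not for
    highlighting matches in context."""
    collapsed = "".join(layer_text.split())
    return keyword in collapsed

pdf_snippet = "been v i s i t o r s t o t h i s s e t t l e m e n t : i x"
print("settlement" in pdf_snippet)                    # prints False
print(spaced_out_contains(pdf_snippet, "settlement")) # prints True
```

That confirms the word really is in the layer, just shredded into single letters - which is exactly why DT’s ordinary keyword search cannot find it.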
Bottom line: Like me, you may have thought that OCR was working just fine and that there was no difference between ABBYY processing a pdf file and a jpg file. I think that if you start looking, you will learn a lot. I sure have!!
If manually moving a file from the Global Inbox to its final destination “does not trigger OCR in general,” please explain this:
I convert a file in the Global Inbox to plain text and just get the header information from the newspapers.com article. Then, I manually move the file (no Smart Rule involved) to its final destination, convert it once again, and I get the header information and the text layer for the entire article.
I think we have waaay too many crossing wires here.
I have provided a workflow and components to @Athirne to proceed with his project, both with current documents and future ones.