Clipping Newspaper Articles

Athirne · September 10, 2025, 2:20pm

No doubt this has been shared before but I couldn’t find a mention of it.

I clip a lot of articles from newspapers that are not carried by newspapers.com. The clippings are PNG files, and if there is a way DT can convert PNG files to PDF files, I have yet to discover it. My workaround is to clip the article, open it with Preview, and export it to the Inbox. I then move it to the group I want to save it in and process it with OCR by clicking the icon in my customized toolbar. Works like a charm!

chrillek · September 10, 2025, 2:37pm

“Convert/To PDF (paginated)”
“Convert/To PDF (one page)”
are offered by the context menu here. And I just tried it with a PNG screenshot. No apparent problem.

BLUEFROG · September 10, 2025, 3:12pm

This is far more effort than you need to exert. Just select the PNG in DEVONthink and choose Data > OCR > To Searchable PDF. There is absolutely no need to convert them to PDF first.

OCR was not made for processing PDFs. It was made for doing optical character recognition on images. As PDFs evolved to support images, OCR was extended to process them also.

Athirne · September 10, 2025, 3:12pm

Wow!! That’s really impressive. This will save me a lot of work going forward. Would DT’s webpage clipping feature work for clipping a newspaper article?

BLUEFROG · September 10, 2025, 3:15pm

I’m not sure who you are talking to at this point.

And “a newspaper article” is subjective. Newspapers.com are hosting images with a bespoke OCR layer. Other sites may be hosting text representations.

Athirne · September 10, 2025, 3:16pm

Seems like there are several ways to skin the proverbial cat. Chrillek’s suggestion works well also. I’ll try your suggestion next time. Either way, this is a whole lot better than what I’ve been doing for years!

BLUEFROG · September 10, 2025, 3:25pm

True, but with 32 years in graphic arts and printing, I’ve built intra- and inter-departmental (actual) workflows including what you’re talking about. So while an explicit conversion to PDF first isn’t detrimental, it’s also completely unnecessary in this case. If you were to build a smart rule to accomplish the OCR’ing, you would literally only use the OCR > Apply command to process the image files.

PS: No slight on @chrillek, who also has a lot of knowledge and experience in his own right.

Athirne · September 10, 2025, 4:42pm

Bluefrog:

You explained your method before, but at the time I was struggling with thousands of files that I had processed with an OCR engine that was not ABBYY. DT’s ABBYY apparently did not like what it saw and essentially told me, “do it over again.” That’s done and out of the way. The more I use DT, the more impressed I am with it. I wish I had known about it 2 years ago, when the size of my file collection had ballooned beyond my capability to figure out where anything was.

I do have a smart rule to process image files, but I’d still have to move the file to its final destination. What I’ve done, when I have multiple clippings for one article, is to save them to the Inbox, move them to their final destination, merge and delete, and then execute Data > OCR > To Searchable PDF.

I suppose I could just toss the files into one giant folder but I still like to have at least some semblance of structure in my databases.

BLUEFROG · September 10, 2025, 4:59pm

No worries.

We hear that often, from document collection to email archiving.

Athirne · September 10, 2025, 5:01pm

Bluefrog:

I’m not sure if clippings downloaded from newspapers.com have an OCR layer or not. I thought they didn’t, so I ran them through an OCR program. Perhaps I didn’t need to do that. All I can tell you is that about 50% of the files I thought had been OCRed were not, in fact, OCRed. DT did not like about half of my files, so I had to reprocess them. The smart rule I finally figured out how to create, with your help, took care of that problem.

The files I’m concerned with here do not come from newspapers.com. They are PNG files, and your method works like a charm, but a smart rule would not work for me in this instance. I only have a few files at a time and their final destination is a specific folder. I suppose I could just toss them all into a giant folder and execute a search with DT to find something, but I do have a certain fondness for some order in my databases.

BLUEFROG · September 10, 2025, 5:05pm

As I had mentioned on another thread, it’s likely they keep the text layer on their servers, similar to what Evernote did/does.

Athirne wrote: “I only have a few files at a time and their final destination is a specific folder.”

Filing items is also something that can be done via a smart rule. In fact, if it’s just a specific group, it’s even easier. You can use the Move action and explicitly choose that group. If it’s more than one group, the File action may be useful, depending on what criteria are used to funnel a document into a specific group.
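
For anyone who finds it easier to read logic as code, here is a rough Python sketch of what such a rule amounts to: match image files, apply OCR, then move the item to one fixed group. This is only a model of the idea, not DEVONthink's scripting interface. The `Item` class, the `apply_rule` function, the list of image extensions, and the group name "Newspaper Clippings" are all made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical set of extensions the rule treats as images.
IMAGE_KINDS = {"png", "jpg", "jpeg", "tiff"}


@dataclass
class Item:
    """Hypothetical stand-in for a document in a database."""
    name: str
    kind: str                 # lowercase file extension
    group: str = "Inbox"      # where the item currently lives
    actions: list = field(default_factory=list)  # actions applied so far


def apply_rule(item: Item, destination: str) -> Item:
    """Model of a smart rule: image files get OCR, then a Move."""
    if item.kind not in IMAGE_KINDS:
        return item  # condition not met, so the item is left untouched
    item.actions.append("OCR > Apply")  # placeholder for the OCR step
    item.group = destination            # placeholder for the Move action
    item.actions.append("Move")
    return item


clipping = apply_rule(Item("front-page", "png"), "Newspaper Clippings")
print(clipping.group, clipping.actions)
```

The point of the sketch is that the destination is just one more action after the OCR step, so a single fixed group needs no extra criteria. Routing to several groups would need some rule for choosing between them, which is where the File action comes in.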

chirurgean · September 17, 2025, 7:39am

I too take screenshots of magazine articles. Then I use a shortcut to convert them to PDF and file them in my database. It’s simple to do using tags.

I have found that the OCR engine in DTTG produces larger files than those from DT3. As I do all the production work on my iPad Pro, I rely on Bonjour sync back to DT3, which has a rule to OCR new PDFs without a text layer. Auto sync back to my iPad gives me my final desired outcome.

As you say, there are lots of ways to skin a cat.