How to deduplicate articles from RSS feeds?

I’ve started using RSS feeds in DEVONthink more, and am running into the following problem: because of where the feeds come from, multiple feeds will sometimes contain the same article. Is there a straightforward way to deduplicate the feeds?

Each article has a URL (in the corresponding metadata field in DEVONthink), so it is possible to tell that two items are identical without having to do more complicated content diff’ing. Does anyone already have a method to deduplicate RSS items in DEVONthink based on their URLs? Or if not specifically for RSS items, perhaps a solution for other kinds of items in DEVONthink could be adapted.

This question was asked in 2019 but did not see a useful answer then. Jim Neumann has a blog posting from 2020, but it involves a utility that searches folders outside of DEVONthink, not existing items created in DEVONthink. Another more recent blog posting of Jim’s about how to use RSS in DEVONthink did not address the issue of duplication.

Just an idea, not sure if that works:

  • loop over all recent RSS items, for each item
    • search in the same group (?) for the identical URL.
    • If you get more than one hit, remove all but one of them
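To make the idea above a bit more concrete, here is a minimal sketch of the keep-one-per-URL logic in plain Python. It does not talk to DEVONthink at all: the items are ordinary dictionaries, and the `name` and `url` keys as well as the function name `find_duplicates_by_url` are my own assumptions for illustration. The same logic would have to be ported to whatever scripting interface you use to reach the actual records.

```python
def find_duplicates_by_url(items):
    """Return the items that repeat a URL already seen earlier in the list.

    The first item with a given URL is kept; later ones are reported
    as duplicates. Items without a URL are never reported.
    """
    seen_urls = set()
    duplicates = []
    for item in items:
        url = (item.get("url") or "").strip()
        if not url:
            continue  # no URL to compare, so leave the item alone
        if url in seen_urls:
            duplicates.append(item)
        else:
            seen_urls.add(url)
    return duplicates


# Hypothetical sample data standing in for the RSS items of one group.
items = [
    {"name": "Post A (feed 1)", "url": "https://example.org/a"},
    {"name": "Post B (feed 1)", "url": "https://example.org/b"},
    {"name": "Post A (feed 2)", "url": "https://example.org/a"},
    {"name": "Untitled", "url": ""},
]

for dup in find_duplicates_by_url(items):
    print("would remove:", dup["name"])
```

This only compares URLs as exact strings, so it will not catch the Feedly-style case of the same content under different URLs.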

I was always bugged by that in Feedly, though there it was the same content, different URLs. Since I’m not using RSS that much anymore, it’s less of an issue for me.

due to the origins of the feeds, it sometimes happens that multiple feeds will have the same article.

Two feed URLs that produce duplicate articles would be helpful.

I assume you mean “would not be helpful”?

The duplicates arise because the feeds are the outputs of searches. Obviously I would try to avoid duplicates if I could …

No. I meant that it would be helpful if you posted two feed URLs that are sending the same RSS article. It’s unclear to me what environment and conditions got you to this place.

The duplicates arise because the feeds are the outputs of searches.

Please clarify what you’re actually doing and where.

The feeds are hashtag searches in Mastodon. Sometimes the outputs have the same articles because posters put multiple hashtags in their postings, and consequently, sometimes searches for different hashtags will pick up the same article. Two different feeds will thus occasionally have duplicates.
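To illustrate how that overlap comes about, here is a small sketch that takes the article URLs seen in each feed and reports the ones that show up in more than one feed. The feed names, the URLs, and the function name `urls_in_multiple_feeds` are all made up for the example; it works on plain Python data, not on real feeds or DEVONthink records.

```python
from collections import defaultdict


def urls_in_multiple_feeds(feeds):
    """Map each URL found in more than one feed to the sorted feed names."""
    feeds_by_url = defaultdict(set)
    for feed_name, urls in feeds.items():
        for url in urls:
            feeds_by_url[url].add(feed_name)
    return {
        url: sorted(names)
        for url, names in feeds_by_url.items()
        if len(names) > 1
    }


# Hypothetical hashtag feeds: one post carries both hashtags.
feeds = {
    "#hashtag-one": ["https://example.social/@alice/1", "https://example.social/@bob/2"],
    "#hashtag-two": ["https://example.social/@bob/2", "https://example.social/@carol/3"],
}

for url, names in urls_in_multiple_feeds(feeds).items():
    print(url, "appears in", ", ".join(names))
```

A report like this is mostly useful for checking how much duplication there really is before deciding whether removing items is worth the effort.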

I assumed this was just something I had to tolerate in RSS feeds from newspaper sites.

e.g. the Guardian: the same article may show up in more than one feed (world news, UK news)

When large news sites allow you to pick your feeds from a wide selection of general to specific topics you’re going to get some duplication.
Really fast, responsive RSS readers like NetNewsWire make this a non-issue for me, but I can see how it would be a pain if you’re doing research.
