Deduplication of RSS feeds

BlueKnight · July 10, 2019, 12:30pm

I was wondering whether the RSS feed and Twitter feed importers use the duplicate identification.

As an example, let's say I bring in news from Google using alerts with a particular search string, say “Widgets”. I am also monitoring the Widgets company via RSS. Now suppose the company puts out an announcement, and Google captures it and publishes it as well.

What I would like to know is whether the duplicate identification that exists for documents also works for RSS feeds, or whether it could be added.

cgrunenberg · July 10, 2019, 12:40pm

The recognition uses only the indexed contents (or thumbnails in case of images) of the items. If the text is identical in this case (separators, white spaces and case don’t matter), then it’s marked as a duplicate.
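To illustrate the rule described above, here is a minimal sketch of duplicate detection on normalized text. It is not DEVONthink's actual implementation; the function names (`normalize`, `fingerprint`, `find_duplicates`) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    # Ignore case, whitespace and separators: keep only letters and digits.
    return re.sub(r"[\W_]+", "", text.casefold())


def fingerprint(text: str) -> str:
    # Hash the normalized text so identical content yields an identical key.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_duplicates(items: dict) -> list:
    # Group item names by fingerprint; any group with 2+ names is a duplicate set.
    groups = {}
    for name, text in items.items():
        groups.setdefault(fingerprint(text), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


items = {
    "rss": "Widgets Inc. launches NEW widget!",
    "alert": "widgets inc launches   new widget",
    "other": "Something else entirely",
}
duplicates = find_duplicates(items)
```

Under this rule, an RSS article and a Google Alert copy count as duplicates only if their text is the same after normalization. An extra byline or a truncated summary in one of them would make the two differ.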

BlueKnight · July 10, 2019, 12:55pm

What about looking at similar content in documents? That way I could click on an article and see whether there are any similar articles in the RSS feeds.

Ultimately, it would make research a lot easier if the RSS feeds could look at similar content in your documents.

cgrunenberg · July 10, 2019, 1:52pm

Thanks for the suggestion! Looking up similar contents is unfortunately too slow to perform automatically in the background. On demand, of course, you should be able to view similar contents via the See Also & Classify inspector.
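As a rough illustration of why a similarity lookup costs more than exact duplicate matching, here is a small sketch that scores a query against every document using Jaccard similarity over word sets. This is a stand-in technique chosen for the example, not the algorithm behind See Also & Classify; `words`, `jaccard` and `most_similar` are hypothetical names.

```python
import re


def words(text: str) -> set:
    # Case-insensitive set of word tokens.
    return set(re.findall(r"\w+", text.casefold()))


def jaccard(a: str, b: str) -> float:
    # Shared words divided by all distinct words in both texts.
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def most_similar(query: str, documents: dict, top: int = 3) -> list:
    # Every document must be compared with the query, unlike a hash lookup.
    scored = [(jaccard(query, text), name) for name, text in documents.items()]
    scored.sort(reverse=True)
    return scored[:top]


docs = {"a": "widget launched today", "b": "unrelated text"}
ranked = most_similar("new widget launched", docs)
```

Each lookup here compares the query with every stored document, so the work grows with the size of the database, whereas exact duplicates can be found with a single key lookup.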

BLUEFROG · July 10, 2019, 1:53pm

Since Criss has already described the mechanism by which duplicates are detected, you may want to explore the See Also & Classify inspector. This will show you documents that appear to have related content.

You can find it under Tools > Inspectors or press Control-S to open it.