Comparing PDF Documents

vatolin · May 31, 2025, 11:04am

Does DT4 already have the capability to compare two PDF files, particularly by using AI, or, if not, are there plans to make such a feature available in the foreseeable future?

chrillek · May 31, 2025, 11:38am

When would two PDFs be considered different? Simple example in pseudo-language:

set color (0,0,0)
moveto 10 10 
lineto 20 20

should draw a black line from 10/10 to 20/20. Now, you change the color to (0.01,0.01,0.01), which is probably not visually discernible from black, and draw the line with the same coordinates. Same PDF? Different PDF?
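To make the question concrete, here is a minimal Python sketch, not anything DT4 or a PDF library does. The function name `colors_match` and the `tolerance` parameter are made up for illustration; the point is only that the answer flips depending on the tolerance you choose.

```python
def colors_match(color_a, color_b, tolerance=0.0):
    """Return True if every RGB component differs by at most `tolerance`.

    `tolerance` is a hypothetical knob: 0.0 means "exactly equal",
    anything larger means "close enough for my purposes".
    """
    return all(abs(a - b) <= tolerance for a, b in zip(color_a, color_b))


black = (0.0, 0.0, 0.0)
almost_black = (0.01, 0.01, 0.01)

strict = colors_match(black, almost_black)                   # exact comparison
lenient = colors_match(black, almost_black, tolerance=0.02)  # "visually the same"
```

With the strict comparison the two lines are different; with the lenient one they are the same.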

That’s a simplistic case. You’re probably after text. So, imagine a PDF that has black text and another one with the same text, but also some white text underneath the black one, so it is not visible. Same? Different?

What you consider different depends on your requirements.

If you’re only interested in visible differences, you could convert the PDFs to PNG (or another pixel format) and hash the PNGs. If the hashes are identical, the PDFs should be visually identical.

If you are interested in textual differences, extract the text and compare it.

If you want to know if the PDFs are bytewise identical, run a hash on both and compare that.
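A minimal sketch of that byte-wise check in Python, using only the standard library. The function names are mine, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bytewise_identical(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Note that two PDFs can look exactly the same and still fail this test, e.g. if only the metadata differs.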

If you want to make sure that a PDF has not been tampered with, you should use signed ones.

rkaplan · May 31, 2025, 11:41am

Yes, with AI. You just need to experiment with the model and the prompt.

As with anything else with AI, it appears to work well but can have errors so confirm everything.

In this example using Claude 4 it correctly found two changes I made in the text of a document. Interestingly it reported other big-picture differences even though the two documents I provided were identical in every other way.

I am not certain but I suspect this may be because DT4 does not always provide the entire document to the LLM. I am not sure how it makes those determinations.

GGA · May 31, 2025, 12:08pm

Ich habe in DT4 vor Kurzem mit der KI Claude Haiku, Bezahl-KI, zwei PDF-Dokumente mit jeweils 45 Seiten vergleichen lassen. Die KI hat die zwei Unterschiede, es handelt sich um zwei Rechnungen mit Anlagen, die eigentlich identisch sein sollten, einmal einen Zahlendreher, einmal einen Multiplikator mit drei statt zwei Stellen, sehr schnell gefunden und ansonsten darauf hingewiesen, dass die Dokumente identisch sind. Ich habe danach die beiden Dokumente Wort für Wort und Zahl für Zahl selbst abgeglichen. Das Ergebnis der KI stimmt.

chrillek · May 31, 2025, 12:11pm

I took the liberty of letting DeepL translate your post into English, since you’re posting in the English part of the forum:

I recently had the AI Claude Haiku, payment AI, compare two PDF documents with 45 pages each in DT4. The AI very quickly found the two differences - two invoices with attachments that should actually be identical, one with a transposed number and one with a multiplier with three instead of two digits - and otherwise pointed out that the documents were identical. I then compared the two documents word for word and number for number myself. The result of the AI is correct.

Translated with DeepL.com (free version)

I’d just like to stress that all you’re comparing here is the part of the PDF you are seeing. If, e.g., it contains an XML attachment, as is now required for digital invoices in Germany, there’s nothing you can see or compare visually. Just saying: it all depends on what you consider “different”.

GGA · May 31, 2025, 3:46pm

Your answer is somewhat cryptic. Human intelligence, which is exactly what I apply to everything I do, can only compare what is visible. My files, which I created myself, contain only visible elements, apart from the usual background information of a PDF document. I did not want the AI to compare that, though, only the visible part. That an AI should always be used with caution and with appropriate human intelligence is beyond question for me. I still use AIs rarely and only in specific cases where I expect them to help. I check the results as far as I am able to.

vatolin · May 31, 2025, 7:46pm

If it had »intelligence« it would assume, oh, this is a letter from »Deutsche Rentenversicherung« and its headline is »Rentenbescheid«. So this file will for sure contain some information about pension payments and will probably differ from the previous letter. But as far as I can see, no AI has the capability to think in such a way.

BLUEFROG · June 1, 2025, 12:33am

What AI provider and model are you using? They are not all the same.

vatolin · June 1, 2025, 8:31am

Le Chat 3.1 from Mistral.

chrillek · June 1, 2025, 9:02am

There are probably very few “Rentenbescheide” out in the open to train “AI” on. OTOH, no one needs any AI at all to compare the text of one PDF with the text of another one. diff can do that on the command line, if one feeds it the text layers.

But that’s of course too old-fashioned, I guess.
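For those who prefer a script over the command line, the same idea can be sketched with Python's standard `difflib`. This assumes the text layers have already been extracted into strings (for instance with a tool like `pdftotext`); the function name `diff_text` and the sample invoice lines are invented for illustration.

```python
import difflib


def diff_text(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Return a unified diff of two extracted text layers as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",  # input lines carry no newlines, so add none to headers
        )
    )


# Made-up sample text standing in for two extracted text layers.
old = "Invoice 42\nTotal: 1243.00"
new = "Invoice 42\nTotal: 1234.00"

for line in diff_text(old, new):
    print(line)
```

An empty result means the text layers are identical line by line. It says nothing about invisible text, attachments, or layout.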

vatolin · June 1, 2025, 6:07pm

You are absolutely right. But it’s not about comparing the text. It’s about getting an intelligent answer as to why the text differs. If it is an intelligence, it should be able to explain the reason for the difference. I thought a »Rentenbescheid« was such a common issue that there would be thousands if not millions of people who have asked the internet™ »WTF why did they change it?«.

BLUEFROG · June 1, 2025, 6:54pm

It isn’t an intelligence though. It’s a predictive text engine.
Also, I don’t think asking about intent would be useful any more than you asking @chrillek why I am responding. If there were documents outlining the “why”, it may be able to generate a reasonable response, but if there is a much larger volume of conjecture about the “why”, I would trust it far less.

rkaplan · June 2, 2025, 12:26am

That’s a fairly basic approximation of early LLMs. Current models can match or predict tone and structure, and apply principles to new situations. If not, then “one shot” prompting would not work.

I have queried Claude quite a bit with prompts like “Tell me the strengths/weaknesses of this document. Is it persuasive? How can it be improved?” I have done this with documents making pretty high-level medical, legal and engineering arguments. Often it is off-base, but usually it contributes a few meaningful suggestions.

We surely have not achieved sentience. But current advanced LLMs are more than simply “next word predictors.”