Integrating OCR by ChatGPT into DT3

Indeed it’s possible, especially going that far back.

You can always copy and paste the text into the annotation. Then, when you run a search on it, it’ll show up. Or you could even use metadata fields.
I know your frustration of working with a lot of files… simple things are annoying little buggers… they add up…
But since you don’t have a clean copy, and it can’t be OCRed to your needs, you’re going to have to figure out a way to “attach” the text to the PDF.
But I’m not sure how you’re retrieving or locating the data, so that might need to factor into how you work on your files.

Can you run it through an online OCR converter?

You can also try docling

ABBYY is better at recognizing text, but docling is better at “formatting” text to look like what appears on the PDF.
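
For anyone who wants to try docling from a script rather than a UI, here is a minimal sketch. It assumes docling is installed (`pip install docling`) and uses its `DocumentConverter` class; `markdown_path_for` and `pdf_to_markdown` are hypothetical helper names made up for this example, and how well a scanned page comes out depends on the docling version and its pipeline options.

```python
import sys
from pathlib import Path


def markdown_path_for(pdf_path):
    """Hypothetical helper: place the Markdown next to the PDF, same base name."""
    return Path(pdf_path).with_suffix(".md")


def pdf_to_markdown(pdf_path):
    """Convert one PDF to Markdown text with docling's default pipeline."""
    # Imported here so this file still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(str(pdf_path))
    return result.document.export_to_markdown()


if __name__ == "__main__" and len(sys.argv) > 1:
    src = Path(sys.argv[1])
    markdown_path_for(src).write_text(pdf_to_markdown(src), encoding="utf-8")
```

Run it as `python convert.py scan.pdf` (the script name is arbitrary) and compare the resulting `.md` file with the text layer ABBYY produced.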

I had a problem similar to what the screenshot shows, and what I did was just delete the OCR layer and re-OCR the file.

Only tangentially relevant to this thread, but I couldn’t resist. Definitely an “Expert Mode” OCR task. (And almost certainly a purpose-built AI, not a general service like ChatGPT.)

I LOVE that story – I’ve been following it for a few years, and figure that in a few more, we may have a whole new pile of ancient texts to contemplate. (Now if only I could get DT to recognize all the words in the ones I have that are 100 years old!)

I did try a bunch of online converters, and results seemed pretty similar to ABBYY via DT3. (AI’s ability to “figure out” what a text must be saying gives it a huge edge – it’s like having a living assistant transcribe a document, based on what they know it says, rather than having ABBYY figure out what each character must be.)

You didn’t answer whether the PDF you shared was already OCR’ed in DEVONthink? If so, something is clearly malfunctioning.

You might have a good reason, that’s why I asked. It’s not that I recommend rushing to upgrade to the newest major macOS release. I’m still on Ventura myself. Recent major versions of macOS have been full of bugs on launch. Many think Apple’s yearly release schedule is too much. (I expect my own next upgrade to be Sonoma, not Sequoia, and have been meaning to look into whether it makes sense for me now.)

It’s best to make sure upgrading doesn’t break your workflow[1] – by checking if the software you depend on is compatible with a new OS release. Not all software is compatible on release day. But the software you install depends on the operating system to function… If you keep upgrading software, but not the OS, at some point you will run into issues too. Most software known to be incompatible should give you a warning, but still.

Besides all that, there’s a difference between major and minor OS releases. Minor releases should rarely feature regressions, but rather improve stability and fix bugs. I at least see no reason to not install the latest version of Big Sur. BLUEFROG often recommends keeping up with OS releases – especially minor releases. As always, though, best practice is keeping backups.

On the topic of updates – what version of the ABBYY add-on is running on your machine? That also gets updates from time to time. Mine has the version number 1.1.26, which should be the most recent. You can check by finding DTOCRHelper.app in ~/Library/Application Support/DEVONthink 3/Abbyy and using “Get Info” (⌘I or right-click and choose Get Info).
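
If you would rather check from a script than from Finder, something like the sketch below should work. It assumes the helper is a standard macOS app bundle that records its version under the usual `Info.plist` keys (I have not confirmed which key this particular helper uses), and `bundle_version` is just an illustrative name.

```python
import plistlib
from pathlib import Path

# Location as described in the post above; adjust if your setup differs.
HELPER_APP = (
    Path.home() / "Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app"
)


def bundle_version(app_path):
    """Return the version string from an app bundle's Info.plist, or None."""
    plist_path = Path(app_path) / "Contents" / "Info.plist"
    if not plist_path.is_file():
        return None
    with plist_path.open("rb") as fh:
        info = plistlib.load(fh)
    # Prefer the user-facing version; fall back to the build number.
    return info.get("CFBundleShortVersionString") or info.get("CFBundleVersion")


if __name__ == "__main__":
    print(bundle_version(HELPER_APP))
```

A result of `None` only means no `Info.plist` was found at that path, for example because the add-on is not installed.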


I was just revealing the text layer of the original PDF uploaded by @Blake. If you look further upthread, you’ll see I get a very acceptable result from running it through DEVONthink’s OCR.


  1. PS: If you’re not aware, Apple deprecated plugins for Mail.app in macOS 14 Sonoma. If you like DEVONthink’s Mail plugin, you might not want to upgrade further than Ventura for now.

I just ran your pdf through: https://www.ocr.best/

it looks good enough… you might have to edit the ocr layer manually if you want better results… There is only so much recognition an OCR engine can do.

The Windows version of ABBYY lets you “recognize” characters through training, but the one in DT doesn’t.

I took the same PDF that I had OCRed with DT’s ABBYY, ran it through docling, and got:

an easy to use docling UI

I don’t think you need to rush to update straight away. As @troejgaard has mentioned, you’ve not actually confirmed if you ran DT’s OCR on the file. I’m thinking you haven’t, as four people now have (I’ve just done it too for fun) and DT/ABBYY’s OCR is fine. If your version of DT actually is outputting gibberish, that’s a separate issue that needs troubleshooting.

Re: historical PDFs, a lot of them can end up with a gibberish text layer (this is the “invisible” text layer that stores the digital version of the text in the image and is what the computer reads, or what you select when you copy and paste). To be honest, if I’ve got a really old PDF I don’t even bother checking it nowadays, I just ask DT to run OCR. I can’t be bothered to check it’s ok and I know DT won’t do any worse than (and will usually do a near perfect version of) whatever’s already there.
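
If you do want a quick sanity check before re-running OCR, a rough heuristic like the one below can flag an obviously broken text layer. This is only a sketch: it assumes you have already copied the text out of the PDF into a string, the thresholds are arbitrary, it is tuned for English, and `gibberish_ratio` is a made-up name rather than anything DT provides.

```python
import re


def gibberish_ratio(text):
    """Fraction of letter-only tokens that look implausible as English words."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    if not tokens:
        return 1.0  # no words at all: treat as an unusable text layer

    def implausible(token):
        t = token.lower()
        # Longer tokens with no vowel at all are rarely real words.
        no_vowel = len(t) > 3 and re.search(r"[aeiouy]", t) is None
        # Six or more non-vowel letters in a row is also a strong hint.
        long_consonant_run = re.search(r"[^aeiouy]{6,}", t) is not None
        return no_vowel or long_consonant_run

    return sum(implausible(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    sample = "Art is the perfect recognized through the senses"
    print(gibberish_ratio(sample))
```

A ratio near zero suggests the layer is probably readable; a high ratio suggests it is worth re-running OCR. It will not catch plausible-looking wrong words, so it is no substitute for spot-checking a page.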

Which moves me to my next point: assuming you hadn’t run DT’s OCR on the file, and you agree that DT’s OCR is fine, you can set up automations to OCR documents automatically when they enter or move in DT, which will save you the hours upon hours of work you’re anticipating as basically DT will do it without your input.

Thanks. I did run the PDF through DT/ABBYY, several times, and got a fair amount of gibberish each time – but just now, weirdly, I got rather less.
I can’t figure out how that could be happening. But my Mac HAS been working strangely hard lately – the fan comes on much too often – so I wonder if that could somehow have affected DT’s function.
I guess I’ll give up on the issue, for now – so as not to waste any more time for all you kind commenters! – and see if maybe some upgrading, of some kind, might help.
Blake

I turned that off, because I need to preserve some of my documents at precisely their original, higher resolution, and DT3/ABBYY doesn’t allow higher than 300 dpi. (And I seem to notice other changes to OCR’d documents – a kind of general softening of the image – beyond just a lower resolution.)

It sounds as if you may have Compress PDF checked in Preferences => OCR. It’s checked by default, I think, though I can’t now remember what the default DPI setting is (I think 300; changing it to 0 will counterintuitively preserve the original resolution). It’s worth experimenting with unchecking this and re-OCRing an original if you still have it around.

BTW:
I just tried GPT-4o mini, Claude 3 Haiku, Gemini Flash 8b and Mistral Pixtral to OCR this document. Gemini returned a text that came closest to the original (without any reformatting), whereas only Pixtral returned some nice Markdown output:

### "Art is the perfect recognized through the senses"

(1) Plato implies that the **senses** are the **means** for recognizing the perfect.
(2) but the perfect is **conceived** first by the **intellect**,
(3) since however every man has the **means** that is, the **senses**, this definition is partly democratic.
In his own Greek society the vast majority of people were slaves. They looked therefore the opportunity to develop their intellect. They could therefore not enjoy ART as a conception of the perfect.

(c) We ought to add, therefore, to this definition of Plato this:

"Art is the recognition of the perfect through the senses, under social conditions which make it possible for every member of society to enter to realize this aim."

(d) But one of the purposes of this thesis to show that modern art has been in its most a t-ie or strongly individualistic, that is, not becoming to realize a more democratic ideal of ART. For this purpose, a historical review of works of art in the past will be helpful.

### CHAPTER II - HISTORICAL REVIEW OF WORKS OF ART.

#### DIVISION "A" -

Such a brief historical review of works of art would include such works made during certain epochs:

(a) The hunting and agricultural stage, decoration of personal belongings for purely personal pleasure,

(b) The military stage, likewise characterized by works of art, weapons, arms, personal possessions, artistically fashioned and decorated for personal use,

(c) The ecclesiastical stage, immense structures for ecclesiastical institutions, rich robes and utensils for use of church dignitaries, all this while the majority of the people lived in abject poverty,

(d) The feudal non-industrial non-mercial stage, the building of palaces and castles, emptiness decoration fitted and vacant, all primarily for the glorification of the monarchical ideal - but now more itself - to the total exclusion of the masses. Then that of the court painter and poet, as well as the court priest and court jester.

(e) The industrial and commercial stage. ART now being patronized by new men, the merchant prince, the leader of finance and the captain of industry - but again ART in the service of strong single men with strong individual tendencies - yet we find here in increasing number of men able to command in the field of ART.

#### DIVISION "B" - Tendencies towards Democratization of ART.

(a) These various stages show a gradual broadening of the tendencies and influences of ART. Art becomes more and more democratic.

(1) In the hunting and agricultural stage men created works of art for his own personal enjoyment, perhaps out of pride.

Thanks. I don’t have “compress” selected, but I do have the resolution set at 300 dpi. I’ll try setting to zero, to see if that helps retain the original res.

I honestly don’t know why AI/LLM isn’t automatically built into all OCR routines. By “understanding” the kinds of things English speakers say in their texts, it would eliminate all the totally impossible words and phrases that we now get in our text layers. It might occasionally get a word wrong (though it almost never does, in my experience), but that’s better by far than the absurdities that happen when a standard OCR makes character-by-character guesses, and so often gets them wrong, at least in reading imperfect original documents.

How about the fact that (1) the document is sent to third-party servers, not handled locally, and (2) it isn’t free to use the LLMs mentioned?

PS: Some OCR developers have or are working toward it, but I’ve no specific data on it.

And it returns just the text, not the layout, so adding a real, selectable text layer that way is impossible with most models.

Aside from technological constraints, there is also the philosophical problem of authenticity. Think of the following scenario of an AI takeover:

  1. I used to correct my data manually → I now use AI to give me correct data automagically.
  2. I used to analyze the data myself → My AI assistant now analyzes the data and generates great insight.
  3. I used to do consulting work based on my analysis of the data → My AI assistant, armed with great knowledge of the data and impeccable communication skills, now communicates with my clients directly.
  4. I used to play video games for fun after work → My AI assistant has played the game for me, so I can get all the in-game achievements (even those too difficult for the amateur of me) without ever touching my device. All I need to do is to watch my AI play.

Who am I then?

Should we have the technology to accomplish all these, I will have to either (1) accept all four or (2) accept none. There is no obvious, consistent reason to accept some of these AI options while rejecting others.

Well, the resolution setting only operates when Compress is checked, so I wouldn’t expect that to make a difference. If Compress PDF isn’t checked, OCR just adds a text layer to the existing page images. (Depending on the PDF, this can greatly increase the document size, which is why the compression option is a useful one to have.)

False dichotomy.
Choices aren’t a binary tree… they’re more like a rhizome… Leisure activities are completely different, and outside the realm of “work”.