Apple "Live Text" / Vision Framework OCR

Apple’s “Live Text” (the Vision framework’s text and object recognition) is impressively good, and in my view actually better than ABBYY FineReader even for basic OCR. There’s no competition when it comes to object or handwriting recognition, which ABBYY simply doesn’t support at all.

Would greatly appreciate it if this could be enabled in the built-in image/PDF viewer, even just at the level of allowing quick text selection and copy-and-paste, as you can presently do from the Preview app (on macOS) or from the built-in document viewer on iOS.

Looking forward, this could even be leveraged as the default OCR source for text recognition in the DEVONthink database.

What did you test Apple’s OCR with? And does it work without an internet connection?

It’s built into both macOS and iOS as “Live Text.” Apparently I’m not allowed to post links, but if you search for “apple live text” you can find multiple videos demonstrating the feature.

I’m aware of the marketing material, and I’ve tried the technique, too. Since you’re asking for support of it in DT, I assumed that you have experience with the framework that would allow you to compare its results with those of DT’s OCR.

That’s why I asked for your impressions.

I use it with a Keyboard Maestro macro to extract text from images to the clipboard. It is actually quite good: for the few languages it supports, it is certainly the best option available. I have used it on texts where ABBYY, Adobe and Tesseract had done a poor job, and it was flawless nearly all the time. I am not sure it would work for adding a text layer to a PDF file, though.

Right. I’m aware of the abilities of the Vision framework for short pieces of text, and that’s what I use it for, too.
But for a real document, such as a normal two-column text from a newspaper, I have no idea, and I haven’t heard anything about it in this context.
Also: does it work locally, or does it require an internet connection?
Edit: Answering this last question myself: Apple claims that everything happens on the local device. Apple also keeps suspiciously mum about anything remotely resembling OCR of documents. They only talk about images, and only about apparently very short sequences of text (business cards and the like).

And just for the heck of it, some sample JavaScript (JXA) code that uses the Vision framework for text recognition. I tried it with two JPEGs, i.e. real photos, not with any PDFs yet. The script can be run with osascript -l JavaScript <filename>, or copied into Script Editor and run there.

I tried it out with a JPEG that was converted from a PDF in Preview, and the results were actually quite good. Only a table derailed it a bit, but that is to be expected. So the script could be used for OCR, but it would require amendments for PDFs: they usually consist of more than one page, and the script would have to loop over all pages, converting each one to an NSImage object and then running text recognition on it.


(() => {
  ObjC.import("Vision"); // required so $.VNRecognizeTextRequest etc. are defined

  const error = $();
  const directory = "/Path/to/folder/with/images";
  const images = ["Image1.png", "Image2.png"];
  images.forEach(i => {
    const path = `${directory}/${i}`; // note the separator between folder and file name
    const fileURL = $.NSURL.fileURLWithPathIsDirectory(path, false);
    const request = $.VNRecognizeTextRequest.alloc.init;
    const reqArray = $.NSArray.arrayWithObject(request);
    const imageRequestHandler = $.VNImageRequestHandler.alloc.initWithURLOptions(fileURL, {});
    const success = imageRequestHandler.performRequestsError(reqArray, error);
    if (!success) {
      console.log(`Text recognition failed for ${path}`);
    } else {
      // Each result is a VNRecognizedTextObservation; print its best candidate
      const observations = request.results.js;
      observations.forEach(segment => {
        console.log(segment.topCandidates(1).js[0].string.js);
      });
    }
  });
})();
Got it; I was not intending to be rude, I simply didn’t quite get what you were asking. I’ve used it with Preview on macOS to extract recipe text, for example from the Moosewood Cookbook, which is published in (immaculate) handwriting. The recipes come out 99.9% correct, with proper formatting.

Yes, it would be difficult to turn this into a tool for creating a formal OCR layer on a PDF (à la hocr2pdf), but:

  • it should actually be fairly easy to use the recognition to create searchable metadata inside DEVONthink
  • and, I’m wondering if it would be possible for the default viewer in DEVONthink (and DEVONthink To Go) to enable the same copy-and-paste functionality that e.g. Preview has, for images that lack an OCR layer.

I’ve been using the live text function a bit, for lifting text from images (I have an annoying habit of screengrabbing slides from presentations, then I have lots of images with useful information on them that need moving to a document).

Whilst the new function is definitely much better than me having to re-write the text manually, I have found two (very minor) bugbears which could swiftly become irritating if applied to large numbers of documents:

  1. copying text retains line breaks. Whilst I understand the principle behind this, if you’re on a device with a small screen, there can be a lot of line breaks in a screenshot, and removing them is a nuisance.

  2. Apple often reads the letter “y” as a “v”. Again, I understand why this may be happening, and I’m used to it now and scan any text for typos accordingly. BUT, this would be super-annoying if applied to a lot of text!
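
For bugbear 1, a quick post-processing pass can rejoin hard-wrapped lines while keeping genuine paragraph breaks. A plain JavaScript sketch; the heuristic (a single newline is a soft wrap, a blank line is a real paragraph break) is my assumption and will not suit every screenshot:

```javascript
// Collapse soft line breaks from copied Live Text output, but keep
// blank-line paragraph breaks intact.
function collapseLineBreaks(text) {
  return text
    .split(/\n{2,}/)                               // paragraphs = runs of blank lines
    .map(p => p.replace(/\s*\n\s*/g, " ").trim())  // join wrapped lines with spaces
    .join("\n\n");                                 // restore paragraph separation
}

console.log(collapseLineBreaks("This text was\nhard wrapped on\na small screen.\n\nNew paragraph."));
```

Run through Keyboard Maestro or a clipboard utility, this could clean up the pasted text in one step.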

As far as I’ve noticed, lifting OCR’d text from files in DT (processed by DT) doesn’t have either of these two issues.

I absolutely love the new Apple function, but I think they still have a couple of niggles to work out. Even as it is, though, it’s 100% better than not having the function at all!

Using the script I posted as a starting point, it should be fairly easy to (for example) copy the text to the comments of a document, which would make it accessible to Spotlight, too.
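
As a sketch of that idea, assuming DEVONthink 3’s scripting dictionary (my reading of it is that records expose a comment property; names may need adjusting):

// JXA sketch: write recognized text into the comment of the selected
// DEVONthink records. Assumes recognizedText came from a Vision pass
// like the script above; untested.
const dt = Application("DEVONthink 3");
const recognizedText = "…text from the Vision framework…"; // placeholder

dt.selectedRecords().forEach(record => {
  record.comment = recognizedText;
});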

What if live text could be used with images (not PDF) for searching purposes? I would love to be able to add images and then search on the text in the image.

Currently, that’s not possible, and I doubt that it will be in the near future. Text recognition is a relatively expensive procedure; it would be a lot better (performance-wise) to perform it once on an image and then store the text for searching (or any other purpose) with the image.
