Apple "Live Text" / Vision Framework OCR

Apple’s “Live Text”, also known as “Vision”, for OCR and object recognition is impressively good, and actually (I think) better than ABBYY FineReader even for basic OCR. There’s no competition when it comes to object or handwriting recognition, which ABBYY simply doesn’t support at all.

I would greatly appreciate it if this could be enabled in the built-in image/PDF viewer, even just at the level of allowing its use for quick text selection and copy-and-paste, as you can presently do from the Preview app (on macOS) or from the built-in document viewer on iOS.

Looking forward, this could even be leveraged as the default OCR source for text recognition in the DEVONthink database.

What did you test Apple’s OCR with? And does it work without an internet connection?

It’s built into both macOS and iOS as “Live Text.” Apparently I’m not allowed to post links, but if you search for “apple live text” you can find multiple videos demonstrating the feature.

I’m aware of the marketing material. And I’ve tried the technique, too. Since you’re asking for support of it in DT, I assumed that you have experience with the framework that would allow you to compare its results with those of DT’s OCR.

That’s why I asked for your impressions.

I use it with a Keyboard Maestro macro to extract the text from images to the clipboard. It is actually quite good. For the few languages it supports, it is certainly the best option available. I used it on texts where ABBYY, Adobe and Tesseract had done a poor job, and it was flawless nearly all the time. I am not sure it would work for adding a text layer to a PDF file, though.

Right. I’m aware of the abilities of the Vision framework for short pieces of text, and that’s what I use, too.
But for a real document, I have no idea. And I haven’t heard anything about it in this context. For example, a normal two-column text from a newspaper.
Also: does it work locally or does it require an Internet connection?
Edit: Answering this last question, Apple claims that everything happens on the local device. And Apple also keeps suspiciously mum about anything remotely resembling OCR in documents. They’re only talking about images, and only about apparently very short sequences of text (business cards and the like).

And just for the heck of it, here is some sample JavaScript code that uses the Vision framework for text recognition. I tried that with 2 JPEGs, i.e. real photos. Not with any PDFs yet. The script can be run with osascript -l JavaScript <filename> or copy/pasted into Script Editor and run there.

I tried it out with a JPEG that was converted from PDF in Preview, and the results were actually quite good. Only a table derailed it a bit, but that is to be expected. So the script could be used for OCR, but it would require amendments for PDFs: they usually consist of more than one page, and the script would have to loop over all pages, converting each one to an NSImage object and then running character recognition on it.

ObjC.import('Foundation');
ObjC.import('Vision');

(() => {
  // Out-parameter that receives an NSError if recognition fails
  const error = $();

  const directory = "/Path/to/folder/with/images";
  const images = ["Image1.png", "Image2.png"];
  images.forEach(i => {
    // Join folder and file name; the folder path has no trailing slash
    const path = `${directory}/${i}`;
    const fileURL = $.NSURL.fileURLWithPathIsDirectory(path, false);

    // One text recognition request per image, set to German
    const request = $.VNRecognizeTextRequest.alloc.init;
    request.setRecognitionLanguages(ObjC.wrap([$.NSString.alloc.initWithString('de-DE')]));
    const reqArray = $.NSArray.arrayWithObject(request);

    const imageRequestHandler = $.VNImageRequestHandler.alloc.initWithURLOptions(fileURL, {});

    const success = imageRequestHandler.performRequestsError(reqArray, error);
    if (!success) {
      console.log($(error.localizedDescription).js);
    } else {
      // Each result is one recognized text segment
      const successArray = request.results.js;
      successArray.forEach(segment => {
        console.log(segment.text.js);
      });
    }
  });
})()

Got it. I was not intending to be rude, I simply didn’t quite get what you were asking. I’ve used it with Preview on macOS to extract recipe text, for example from the Moosewood Cookbook, which is published in (immaculate) handwriting. The recipes come out 99.9% correct, with proper formatting.

Yes, it would be difficult to turn this into a tool for creating a formal OCR layer on a PDF (a la hocr2pdf), but:

  • it should actually be fairly easy to use the recognition to create searchable metadata inside DEVONthink
  • and, I’m wondering if it would be possible for the default viewer in DEVONthink (and DEVONthink To Go) to enable the same copy-and-paste functionality that Preview, for example, has for images that lack an OCR layer.

I’ve been using the live text function a bit, for lifting text from images (I have an annoying habit of screengrabbing slides from presentations, then I have lots of images with useful information on them that need moving to a document).

Whilst the new function is definitely much better than me having to re-write the text manually, I have found two (very minor) bugbears which could swiftly become irritating if applied to large numbers of documents:

  1. Copying text retains line breaks. Whilst I understand the principle behind this, if you’re on a device with a small screen, there can be a lot of line breaks in a screenshot. Removing them is a nuisance.

  2. Apple often reads the letter “y” as a “v”. Again, I understand why this may be happening, and I’m used to it now and scan any text for typos accordingly. BUT, this would be super-annoying if applied to a lot of text!
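
For the first point, the line breaks can be cleaned up after copying. Here is a minimal sketch in plain JavaScript. It assumes that a blank line marks a real paragraph break and that every other line break is just wrapping. The function name unwrapLines is made up for illustration, and words hyphenated across lines are not rejoined:

```javascript
// Join lines that were only wrapped because of the screenshot's width,
// while keeping blank lines as paragraph separators.
function unwrapLines(text) {
  return text
    .replace(/\r\n?/g, "\n")   // normalize Windows/old Mac line endings
    .split(/\n\s*\n/)          // one or more blank lines = paragraph break
    .map(paragraph =>
      paragraph
        .split("\n")
        .map(line => line.trim())
        .filter(line => line.length > 0)
        .join(" "))            // wrapped lines become one line
    .filter(paragraph => paragraph.length > 0)
    .join("\n\n");
}

const copied = "Removing line\nbreaks is a\nnuisance.\n\nSecond\nparagraph.";
console.log(unwrapLines(copied));
```

Something like this could run in a clipboard macro right after copying the text from the image.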

As far as I’ve noticed, lifting OCR’d text from files in DT (processed by DT) doesn’t have either of these two issues.

I absolutely love the new Apple function, but I think they still have a couple of niggles to work out. Even as it is though it’s 100% better than not having the function at all!

Using the script I posted as a starting point, it should be fairly easy to (for example) copy the text to the comments of a document. That would make it accessible to Spotlight, too.

What if live text could be used with images (not PDF) for searching purposes? I would love to be able to add images and then search on the text in the image.

Currently, that’s not possible. And I doubt that it will be in the near future. Text recognition is a relatively expensive procedure, so it would be a lot better (performance-wise) to perform it once on an image and then store the text with the image for searching (or any other purpose).
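
The “recognize once, then store” idea could look roughly like this. It is only a sketch in plain JavaScript: textCache, textFor and search are hypothetical names, and the recognize parameter stands in for whatever OCR call is actually used (it is not a DEVONthink or Vision API):

```javascript
// Hypothetical store: image path -> recognized text.
const textCache = new Map();

// `recognize` is a placeholder for the real (expensive) OCR call.
function textFor(imagePath, recognize) {
  if (!textCache.has(imagePath)) {
    textCache.set(imagePath, recognize(imagePath)); // runs once per image
  }
  return textCache.get(imagePath);
}

// Search the stored text instead of re-running recognition.
function search(query) {
  const needle = query.toLowerCase();
  return [...textCache.entries()]
    .filter(([, text]) => text.toLowerCase().includes(needle))
    .map(([path]) => path);
}

// Usage with a fake recognizer that counts how often it is called.
let calls = 0;
const fakeRecognize = path => { calls += 1; return `Text found in ${path}`; };
textFor("Image1.png", fakeRecognize);
textFor("Image1.png", fakeRecognize); // served from the store, no second OCR run
console.log(calls, search("image1"));
```

In a real setup the text would have to be persisted with the image (for instance in its comment or other metadata) rather than kept in memory.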

Any updates on this being used for OCR in DEVONthink? Just yesterday I converted a PDF image to be text searchable via ABBY OCR and the results were pretty bad. Apple’s built-in Vision framework, though? Much better.

If I were DEVONtechnologies (which I’m not), I’d shy away from that at the moment.

First, anecdotal evidence is not very reliable – that you had a good experience with one PDF does not mean that in general Vision is better than other OCR software.

Second, Abbyy (yes, two y) supports a lot more languages than Vision (see the Stack Overflow question “Which languages are available for text recognition in Vision framework?” and https://support.abbyy.com/hc/en-us/articles/360017231600-OCR-Recognition-Languages). There is more than English, and Vision does not even do Japanese or Korean, nor Russian. Not even Portuguese as used in Portugal, BTW.

Third, it’s an Apple framework. Like bug-ridden PDFKit. Or the mostly abandoned JXA. Or other frameworks that are now deprecated. Yes, it comes without extra cost, but using it means that you’d depend on Apple’s willingness, ability, whatever to support it. Unlike Abbyy, they’ll probably not offer a contract for that. A paid-for, third-party tool is, in my opinion, a safer bet.

Fourth: In my experience, Vision does not always assemble lines and paragraphs correctly. But that is only anecdotal evidence, too.
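
To illustrate what assembling lines involves, here is a rough sketch in plain JavaScript. It assumes each recognized segment has been copied into a plain object with its text and the normalized bounding box values that Vision reports (0 to 1, origin at the bottom left). The function name readingOrder and the half-height threshold are my own assumptions, and the heuristic does not handle multi-column layouts:

```javascript
// Each item: { text, x, y, height }, normalized coordinates, origin bottom-left.
function readingOrder(items) {
  // Larger y means higher on the page, so sort descending to start at the top.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const lines = [];
  for (const item of sorted) {
    const line = lines[lines.length - 1];
    // Treat the item as part of the current line if its vertical offset
    // from the line's first item is less than half its own height.
    if (line && Math.abs(line[0].y - item.y) < item.height / 2) {
      line.push(item);
    } else {
      lines.push([item]);
    }
  }
  // Within each line, read from left to right.
  return lines
    .map(line => line.sort((a, b) => a.x - b.x).map(i => i.text).join(" "))
    .join("\n");
}

const sample = [
  { text: "world", x: 0.5, y: 0.90, height: 0.04 },
  { text: "Second line", x: 0.1, y: 0.80, height: 0.04 },
  { text: "Hello", x: 0.1, y: 0.91, height: 0.04 },
];
console.log(readingOrder(sample));
```

A simple rule like this breaks down quickly with tables or columns, which matches what I have seen in practice.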

Nothing to report at this time.

I would be happy if, for upcoming updates, DT would examine all the technical possibilities that are open to it. Of course, that is only my two cents on this topic, but it really makes sense to look at this in detail.

I have been a journalist and editor for over 30 years now, so I work with words and paragraphs all day. Historically, such text-based files are the strength of DT, not images, video etc. But the more research has moved from text to all media types (multimedia) over the last decade(s), even in traditional media markets (newspapers etc.), the more I have found myself searching for solutions to get text content out of non-text media (video, images etc.).

Over the years it has become more and more important for my research to have text from non-text media, because text still offers the strongest search possibilities. And of course, the ways of extracting text that are as automatic as possible are the most helpful. And automation with text-based files is a key strength of DT.

In the last month there were also some other posts in this forum that addressed this topic/problem. Sometimes there were answers saying that multimedia, video and images aren’t the key strengths of DT, and that it may be better to optimize DT’s strengths than to go too far into other (new) fields that DT isn’t really meant for. I agree completely with such opinions, and I like the way DT has been optimized and updated over the last 10 years (I have used DT for 10 years now).

But the automatic extraction of text from images or video, automatic transcription of video, and text annotations for video timestamps are nowadays important key strengths for text-based research.

Therefore I would appreciate it very much if all new possibilities for implementing such automation were discussed here in depth and with an open mind, because they belong to the present and to the future of text-based research.

I am a journalist, not a technician, but maybe Apple’s “Live Text” has new technical features that could be helpful for text-based research in images or videos. Or maybe not. Of course, this could be answered better by app programmers…

Just my opinion from my daily work…

I second that!

Multimedia is the future, perhaps the present. Tools for storing references, annotating and researching must be able to deal with more than just text-based documents in order to remain useful and relevant.

For instance, there are decent tools for annotating YouTube videos, such as this one I’ve come across:

Unfortunately, no tool appears to be a jack of all trades at this moment. So you need a Read Later app (Pocket’s not great anymore, but Matter appears to do the job), a Zettelkasten note-taking app, different reading and note-taking apps for different formats and use cases, an academic references manager, a bookmarks manager (DEVONthink does it well, but Raindrop.io has its advantages), an aggregator of highlights, like Readwise (because why not), a document manager (DEVONthink fits in here), and maybe Hookmark to keep everything linked and in context (great tool, except it’s still macOS only and many important apps don’t support deep links properly).

So that’s the state of my workflow currently: it’s broken. I know what I want to accomplish, but the only way to get close to getting it done is by using a ton of apps and services, which generates friction and costs a bunch.

Ideally, DEVONthink would be able to index all relevant information and make it easily organizable and available through search. Third-party apps could supply the UI for dealing with Markdown and PDF files, if desired by the user, as is the case now (although built-in support for deep linking PDFs for use with Hookmark wouldn’t hurt, as there are not a lot of alternatives in that department). Lack of proper support for multimedia data is an obstacle in my opinion, though.

I’m just venting my frustrations, it isn’t meant as an attack on anyone.
