Add Automatic Audio Transcription -- Similar Approach As OCR

I work with a lot of audio. To transcribe audio files, I’ll use something like oTranscribe (otranscribe com/), autoEdit (autoedit io/), or Otter (otter ai/). These are really useful, but I still have a lot of audio files from past projects and years-old voice memos that aren’t searchable. I’d like to have rough transcriptions of these recordings in DEVONthink in a way that fits with my current database organization. OCR’ing old PDFs can be as simple as dragging them into DEVONthink. I think a similar approach should be offered as an option for audio files. For researchers who love DEVONthink and who often interview people, I think adding optional cloud or device-based audio transcription could be a wonderful way to round out their workflow.

  • autoEdit 2 and 3 are open source and available for Mac
  • DEVONthink could introduce many more users to the benefits of automatic audio transcription
  • Cloud transcription services appear to have good documentation and are much more affordable than you might think
  • This solves the problem with audio – it’s not easily searchable

autoEdit 3 Transcription providers

Mozilla DeepSpeech (github com/mozilla/DeepSpeech)
AssemblyAI (assemblyai com/)
Speechmatics (speechmatics com/)
Pocketsphinx (pypi org/project/pocketsphinx/)

Other transcription providers

Trint (trint com/)
Gentle (github com/lowerquality/gentle)
Rev (rev com/)
IBM Watson Speech To Text (ibm com/cloud/watson-speech-to-text)
Amazon Transcribe ( com/transcribe/)
Google Cloud Speech To Text ( com/speech-to-text)
Simon Says (simonsays ai/)
Happy Scribe (happyscribe co/)
Transcribe Me (transcribeme com/)


Thank you for the suggestion and the links, we’ll consider this for upcoming releases.


I came here looking for exactly such a feature. That would make DEVONthink much more complete for me too! I have lots of videos and audio files in my database; imagine being able to index the words in them… I imagine it would be a resource-intensive analysis, but for those with new M1 processors (for example), maybe this would work well. Anyway, great suggestion!

I would also be really interested in this, as I also use a lot of audio/video. From a privacy standpoint, an on-device solution would be preferable.

If there are tools available for macOS that can be run from the command line, this should be trivial to script. Even in the case of online services, a script might be the way to go.
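To illustrate that idea, here is a minimal Python sketch. It assumes a command-line transcriber exists that prints the transcript to standard output. The command template, the `{input}` placeholder, and the function name are all hypothetical; no particular tool is implied.

```python
import subprocess


def transcribe_with_cli(audio_path, command_template):
    """Run a (hypothetical) command-line transcriber and return its output.

    command_template is a list of arguments; every occurrence of the
    literal text "{input}" is replaced by the path of the audio file.
    """
    args = [part.replace("{input}", str(audio_path)) for part in command_template]
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

An online service would need a different function body (an HTTP upload instead of a subprocess), but the calling script could stay the same.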

So apparently Drafts on macOS now offers built-in transcription of audio and video files using Apple’s built-in dictation functionality. It can even process multiple files simultaneously. Punctuation is missing, of course, but otherwise the recognition is pretty accurate and appears to be processed locally (a plus in my book, even if cloud services may offer even better performance).

Could this give a hint as to what an implementation of audio transcription in DT might look like? As it’s apparently now possible to hook into the built-in macOS mechanism for transcription, it would appear that the barriers to including this as a native feature in DT are becoming less significant. Thinking some steps further, it would be amazing if this could be fully automated, similar to OCRing PDF documents.

This could possibly be scripted using the Speech framework - at least to get an idea about quality and practicality (given the limit of 60 seconds per audio-segment…). There seem (!) to be some caveats, though:

  • The user has to grant permission to use speech recognition even if they only want to perform it on audio files
  • There’s no guarantee that the device is capable of performing the task itself, i.e. without internet access.

The first limitation makes this a bit awkward to script (if it’s possible at all, given the wiring with a UI). The second one is such that I would vote against such a feature: Having my audio files transcribed by whatever third entity is out of the question.

Thirdly: Apple’s built-in OCR looks nice on the surface and seems to work well in many circumstances. In my opinion, it is not quite up to par with DT’s built-in functionality (compare DT’s with Apple’s result for slightly skewed, slightly blurred PDFs). That’s not only about mis-recognized words/letters but also about not recognizing words on the same line in the correct order (i.e. left to right for English and German). And given the explanation of transcription on the Drafts web page, Apple’s speech recognition seems to be a bit more limited than their OCR technique.

It is very likely this is for short passages only, not entire documents.

I tried a similar process last year with Rogue Amoeba’s Loopback. I contacted their support and they told me it really could only process about 30 seconds.

Very interesting @chrillek!

Just a couple comments regarding what you wrote:

  • Apple has moved to on-device processing in general, and things like dictation use this as a default now. So unless we don’t take them at their word, that would count as a guarantee that the device IS capable of doing this without internet access.
  • The transcription in my testing with Drafts is surely not perfect (lack of punctuation, in particular) but we’ve got to start somewhere, don’t we? :slight_smile: It also is likely to only get better with time.
  • In Drafts, audio/video files of any length can be transcribed. Every 60 seconds there are separators, but the transcription is continuous (and, as mentioned, multiple transcriptions are possible simultaneously).

Nope, there is no such limitation. I tested this today.

A quote from the documentation site:

“The transcription process works by extracting audio content from the media, breaking it into segments suitable for processing by Apple’s speech recognition APIs, and transcribing each of those segments. Due to time limits imposed by speech recognition, content longer than one minute is broken up and a separator (=== ) inserted between segments in the transcription.”
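The segmenting described in that quote can be sketched in a few lines of Python. This is only an illustration of the idea, not how Drafts actually does it: the 60-second limit and the `===` separator come from the quote, while the function names and the choice to put the separator on its own line are assumptions.

```python
SEGMENT_SECONDS = 60  # time limit mentioned in the quoted documentation
SEPARATOR = "==="     # separator mentioned in the quoted documentation


def segment_bounds(duration, segment_seconds=SEGMENT_SECONDS):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + segment_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds


def join_segments(texts, separator=SEPARATOR):
    """Join per-segment transcripts with the separator on its own line."""
    return f"\n{separator}\n".join(text.strip() for text in texts)
```

Each (start, end) pair would be handed to the recognizer separately, and the resulting texts joined in order, which matches the continuous transcript with separators that I saw in my test.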

I had a look at their interfaces and there’s still a property to check if the device is capable of speech recognition. This might be a leftover from a previous version, of course. I know that OCR works on device now.

The other question is whether that is basically dictation (i.e. transcribing the voice of a single, recurring speaker) or really Speech-To-Text (i.e. transcribing arbitrary speakers in arbitrary contexts). In the former case, I still don’t understand why that has to be specially enabled in DT (or any other software), given that dictation has been available in macOS for some time.

In the latter case… as I said, it might be interesting to write a script trying that (which could then also be run inside of DT – but why? I’d rather store the small transcript than the big audio file).

My understanding (which may be incorrect) is that this is basically a workaround using the dictation feature. So it’s kind of like activating dictation and then playing an audio file next to the microphone, such that this is the input that gets analysed by dictation. That would explain why there was no punctuation at all in my tests (since with macOS dictation one would say “period” at the end of each sentence).

Well, speaking for myself, the use case here would be to make the contents of audio and video files searchable and available e.g. in See Also views.

So the ideal script would run via a smart rule, automatically process any new media files added to the database, and add their transcription to a custom metadata field.

That’s kind of what I was attempting - technically speaking with Loopback.

Got it now after searching “Loopback Mac”. Seems you’re referring to this application. Interesting.

Again, for me the “aha” that I wanted to share here was how this has been implemented in Drafts. It’s quite painless, i.e. open the media file, click transcribe. No system extensions or anything to install, and then it just runs in the background until done.

Yes, that’s the one.

Due to time limits imposed by speech recognition, content longer than one minute is broken up and a separator (===) inserted between segments in the transcription.

Haha! That pretty much verifies what Rogue Amoeba’s crew told me about time constraints :stuck_out_tongue:

Not according to the specification: you can pass an audio file as input to the framework. That still requires the user’s consent, for whatever reason.

That metadata field would have to be huge. I’d opt for a comment attribute on the file (which would also be visible to Spotlight).

In any case, the script would have to be written.

I’m with @chrillek on this: That sounds like a problem waiting to happen to me. Development would have to weigh in on whether a character limit would cause issues in any text fields - rich or multi-line.

Understood. Sure, why not use a comment attribute or even the Annotations file instead. Don’t think it is that decisive where the transcript text would be stored, as long as it’s tied to the media file in some way.
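Just to illustrate what "tied to the media file in some way" could mean in a script, here is a minimal Python sketch that writes the transcript to a plain-text file next to the media file. This is neither the comment attribute nor the Annotations file discussed above, only the simplest variant to show; the function name and the `.transcript.txt` suffix are assumptions.

```python
from pathlib import Path


def write_sidecar(media_path, transcript, suffix=".transcript.txt"):
    """Write the transcript to a text file next to the media file.

    The sidecar keeps the full media file name and appends the suffix,
    so "interview.m4a" becomes "interview.m4a.transcript.txt".
    """
    media = Path(media_path)
    sidecar = media.with_name(media.name + suffix)
    sidecar.write_text(transcript, encoding="utf-8")
    return sidecar
```

A plain-text file like this has no field-size limit to worry about, whatever the final storage location turns out to be.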

Haha okay, I’ll (sort of) hand that one to you :stuck_out_tongue:

Is it a true limitation though, if there is a fully functional workaround (as demonstrated in Drafts)? :wink:

I was thinking the Annotation file may be a possibility. However, those are usually used for dissecting and making notes. There’s surely a ton of commentary in the full text that may be unwanted.

Is it a true limitation though, if there is a fully functional workaround (as demonstrated in Drafts)? :wink:

A workaround is most often a private method, not something accessible to everyone. So whatever Greg at Agile Tortoise is doing, it is almost certainly something he came up with.

Development would have to assess the request.
