I work with a lot of audio. To transcribe audio files, I’ll use something like oTranscribe (otranscribe.com), autoEdit (autoedit.io), or Otter (otter.ai). These are really useful, but I still have a lot of audio files from past projects and years-old voice memos that aren’t searchable. I’d like to have rough transcriptions of these recordings in DEVONthink in a way that fits my current database organization. OCR’ing old PDFs can be as simple as dragging them into DEVONthink; I think a similar approach should be offered as an option for audio files. For researchers who love DEVONthink and who often interview people, adding optional cloud- or device-based audio transcription could be a wonderful way to round out their workflow.
autoEdit 2 and 3 are open source and available for Mac
DEVONthink could introduce many more users to the benefits of automatic audio transcription
Cloud transcription services appear to have good documentation and are much more affordable than you might think
This solves the problem with audio – it’s not easily searchable
Trint (trint.com)
Gentle (github.com/lowerquality/gentle)
Rev (rev.com)
IBM Watson Speech To Text (ibm.com/cloud/watson-speech-to-text)
Amazon Transcribe (aws.amazon.com/transcribe)
Google Cloud Speech To Text (cloud.google.com/speech-to-text)
Simon Says (simonsays.ai)
Happy Scribe (happyscribe.co)
TranscribeMe (transcribeme.com)
I came here looking for exactly this feature. It would make DEVONthink much more complete for me too! I have lots of video and audio files in my database; imagine being able to index the words in them… I imagine the analysis would be resource-intensive, but for those with new M1 processors (for example), it might work well. Anyway, great suggestion!
Could this give a hint as to what an implementation of audio transcription in DT might look like? As it’s apparently now possible to hook into the built-in macOS mechanism for transcription, the barriers to including this as a native feature in DT seem to be getting lower. Thinking a few steps further, it would be amazing if this could be fully automated, similar to OCRing PDF documents.
This could possibly be scripted using the Speech framework – at least to get an idea of quality and practicality (given the limit of 60 seconds per audio segment…). There seem (!) to be some caveats, though:
The user has to grant permission to use speech recognition even if they only want to perform it on audio files
There’s no guarantee that the device is capable of performing the task itself, i.e. without internet access.
The first limitation makes this a bit awkward to script (if it’s possible at all, given the wiring with a UI). The second one is why I would vote against such a feature: having my audio files transcribed by some third party is out of the question.
Thirdly: Apple’s built-in OCR looks nice on the surface and seems to work well in many circumstances. In my opinion, though, it is not quite up to par with DT’s built-in functionality (compare DT’s result with Apple’s for slightly skewed, slightly blurred PDFs). That’s not only about mis-recognized words/letters but also about words on the same line not being recognized in the correct order (i.e. left to right for English and German). And given the explanation of transcription on the Drafts web page, Apple’s speech recognition seems to be a bit more limited than their OCR technique.
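To make the 60-second limit concrete: a script would first have to slice each recording into windows no longer than a minute before handing them to the recognizer. A minimal sketch of just that windowing step (Python for illustration only; the actual Speech framework calls are omitted, and `segment_boundaries` is a name I made up):

```python
def segment_boundaries(duration_seconds: float, limit: float = 60.0):
    """Split a recording's duration into (start, end) windows no longer
    than `limit` seconds, as a script would before feeding each window
    to a speech recognizer with a per-segment time limit."""
    if duration_seconds <= 0:
        return []
    bounds = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + limit, duration_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# Example: a 150-second voice memo becomes three windows.
print(segment_boundaries(150))  # [(0.0, 60.0), (60.0, 120.0), (120.0, 150.0)]
```

Each window would then be transcribed separately and the results stitched back together.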
Apple has moved to on-device processing in general, and things like dictation use it by default now. So unless we refuse to take them at their word, that would count as a guarantee that the device IS capable of doing this without internet access.
The transcription in my testing with Drafts is certainly not perfect (lack of punctuation, in particular), but we’ve got to start somewhere, don’t we? It is also likely to only get better with time.
In Drafts, audio/video files of any length can be transcribed. Every 60 seconds there are separators, but the transcription is continuous (and, as mentioned, multiple transcriptions are possible simultaneously).
Nope, there is no such limitation. I tested this today.
A quote from the documentation site:
“The transcription process works by extracting audio content from the media, breaking it into segments suitable for processing by Apple’s speech recognition APIs, and transcribing each of those segments. Due to time limits imposed by speech recognition, content longer than one minute is broken up and a separator (=== ) inserted between segments in the transcription.”
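The behaviour that quote describes is easy to mimic: transcribe each sub-minute window separately, then join the pieces with the separator. A toy illustration (the exact spacing and newlines around the separator are my assumption, not Drafts’ verbatim output):

```python
def join_segments(transcripts):
    """Join per-segment transcripts with the '=== ' separator that the
    quoted Drafts documentation describes inserting between segments."""
    return "\n=== \n".join(t.strip() for t in transcripts)

# Two one-minute windows become one continuous transcript with a marker.
print(join_segments(["first minute of speech", "second minute of speech"]))
```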
I had a look at their interfaces and there’s still a property to check if the device is capable of speech recognition. This might be a leftover from a previous version, of course. I know that OCR works on device now.
The other question is whether this is basically dictation (i.e. transcribing the voice of a single, recurring speaker) or really speech-to-text (i.e. transcribing arbitrary speakers in arbitrary contexts). In the former case, I still don’t understand why it has to be specially enabled in DT (or any other software), given that dictation has been available in macOS for some time:
In the latter case… as I said, it might be interesting to write a script trying that (which could then also be run inside DT – but why? I’d rather store the small transcript than the big audio file).
My understanding (which may be incorrect) is that this is basically a workaround using the dictation feature. It’s kind of like activating dictation and then playing an audio file next to the microphone, so that this becomes the input dictation analyses. That would explain why in my tests there was no punctuation at all (since with macOS dictation one would say “period” at the end of each sentence).
Well, speaking for myself here the use case would be to make audio and video files’ contents searchable and available e.g. in See Also views.
So the ideal script would run via a smart rule, automatically process any new media files added to the database, and add their transcription to a custom metadata field.
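Outside of DEVONthink, the shape of such a rule can be sketched like this. Everything here is hypothetical: `transcribe()` stands in for whatever engine ends up being used, and the “custom metadata field” is modelled as a plain dict rather than any real DT API:

```python
from pathlib import Path

MEDIA_EXTENSIONS = {".m4a", ".mp3", ".wav", ".mp4", ".mov"}

def transcribe(path: Path) -> str:
    # Placeholder for a real engine (Speech framework, cloud service, ...).
    return f"(transcript of {path.name})"

def process_new_media(folder: Path, metadata: dict) -> dict:
    """Mimic a smart rule: find media files that don't yet have a stored
    transcript and attach one under a 'transcript' metadata key."""
    for path in sorted(folder.glob("*")):
        if path.suffix.lower() in MEDIA_EXTENSIONS and path.name not in metadata:
            metadata[path.name] = {"transcript": transcribe(path)}
    return metadata
```

Run periodically (or on import), this would leave non-media files untouched and never re-transcribe a file that already has a transcript attached.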
Got it now after searching “Loopback Mac”. Seems you’re referring to this application. Interesting.
Again, for me the “aha” I wanted to share here was how this has been implemented in Drafts. It’s quite painless: open a media file, click transcribe. There are no system extensions or anything to install; it just runs in the background until done.
I’m with @chrillek on this: That sounds like a problem waiting to happen to me. Development would have to weigh in on whether a character limit would cause issues in any text fields - rich or multi-line.
Understood. Sure, why not use the comment attribute or even the Annotations file instead. I don’t think it matters much where the transcript text is stored, as long as it’s tied to the media file in some way.