Video and Audio Transcription

Is there a way to let DEVONthink make transcriptions of audio and video files?

So they are searchable…

Ideally when you click on a sentence in the transcription you go to that point in the video or audio file…

It would be an awesome addition to DEVONthink itself. But I’m looking for a way to do it now.
Is there good Mac software that can do it on-device?

Maybe an automation between that software and DEVONthink?


Welcome @pjc9

Sorry but no, DEVONthink will not transcribe files for you.
Using an annotation file (See Tools > Inspectors > Annotations & Reminders), you can create your own transcriptions.

Maybe an automation between that software and DEVONthink?

This depends on whether the third-party application has provided any inter-application communication functions. Many apps nowadays don’t.

There are several online services (see e.g. the thread “Add Automatic Audio Transcription -- Similar Approach As OCR”), but DEVONthink doesn’t currently support any of them.

Automated audio/video transcriptions would be incredible. +1


I have worked with video + annotation files over the last few months. A transcription with links would be nice, but I’d also be happy if Insert Back Link produced a link that jumps to the current timestamp ;D

What kind of file did you annotate?

This is already possible using the Annotations button > Insert Back Link command.


Oh that’s nice! Thanks for the heads up.

You’re welcome 🙂

Would it be very hard to tap into Apple’s Speech framework for parsing audio files?

I know the recognition is not the greatest, but at least it would help with finding content.


Define “hard”. AFAICT, it is not possible with scripting (at least not for me). So someone would have to write a Swift (or perhaps Objective-C) program.

There might be examples of such programs available on the web. I didn’t check, though.

Welcome @hansdorsch

I can’t speak to the level of difficulty. However, “the recognition is not the greatest”, if an accurate assessment, is not a standard we’d be comfortable with. We strive to provide a much better experience than “not the greatest”, as much as we are currently able. The other issue is that a poor implementation would increase the support burden, not only for tech support but for development as well.

That all being said, we appreciate the suggestion and will take a look at it.
Cheers!


Maybe I shouldn’t have mentioned the recognition quality, because there is no 100% accurate automatic transcription, and there probably never will be.

For me, the level of accuracy the Apple framework delivers is definitely “good enough” and way better than “not at all”.

I have been using an app called JustPressRecord for a couple of years now. It lets me record audio and automatically transcribes it.

The quality depends on different factors, mostly on the audio quality.
I use it for Voice Memos up to a couple of minutes.
If the transcription fails to recognize a word, I can always listen back to the audio and correct it.

The transcription uses the Apple framework, works on-device, and is free (the app is a one-time payment).

For interview and podcast transcription, I use a web service called Sonix. It is more accurate and gives me an editor that links the audio to the text. It is charged by the minute and runs in the browser.

The use case for the transcription for me would be mostly on the iPhone App:

long tap on the App Icon > “New Media document” > “audio note”.

Then I would record whatever I want to remember and tap “done”.

Later, on iOS or Mac, I would use an option similar to “OCR” but called “transcribe audio” and get the transcription as a comment or annotation.

This would be just enough and would make the audio notes so much more useful.

No worries and thanks for the clarification.
Interesting ideas for us to consider.

Cheers!


Boosting this old thread.

DaVinci Resolve has pretty magical transcription ability, and you can search. Clicking on any particular word jumps the playhead to that spot in the video. It would be exciting to see this capability brought to DT so video and audio files can be referenced in a similar way as text documents.

At present, I have to jump out of DT and into DVR if I need the transcription feature, and once I’ve located the timestamp, I can then come back to DT and use that info to get a frame link.

But this also means that I can’t include all my video and audio files in my DT search results. Instead, I have to search each media file independently. Of course, I could copy and paste the transcript into a companion file stored with the original media, but then I don’t have the instant reference to the timestamp for any search match. I would still have to take the info over to DVR, repeat the search, get the timestamp, and go back to DT. And this entails an additional manual process for every media file.
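For anyone wanting to script the lookup step in the meantime, here is a minimal sketch of the idea, assuming the transcript has already been exported as timed segments. The `Segment` type, the function names, and the sample data are all hypothetical; nothing here is a DEVONthink or DaVinci Resolve API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the media file
    text: str


def find_timestamps(segments, query):
    """Return (start, text) for each segment containing the query, ignoring case."""
    needle = query.casefold()
    return [(s.start, s.text) for s in segments if needle in s.text.casefold()]


def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical example data, not taken from any real transcript.
segments = [
    Segment(0.0, "Welcome to the interview."),
    Segment(75.5, "We shot the documentary on VHS tape."),
    Segment(3725.0, "The tape archive is stored offsite."),
]

for start, text in find_timestamps(segments, "tape"):
    print(format_timestamp(start), text)
```

The timestamp found this way would still have to be turned into a frame link by hand, so this only shortens the search part of the round trip.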

I’m really not sure what other people are using frame links for in DT. To me it seems like DT was built for researchers who have various source materials that all need to be synthesized and consolidated into original narratives, while still retaining quick access to the source material. If you look at a lot of the video essayists and documentarians on YouTube, I think you would see a significant market for DT. These people need powerful tools to construct their scripts and retain easy access to their sources. This is where DT shines, and I don’t know of anything else that comes close to doing what DT does, as well as it does. I migrated from Evernote, FWIW. But in any case, the impression I have gotten from the people here is that this use case isn’t well understood, so it leaves me a little puzzled, wondering what use cases DT was originally intended for.

This is actually planned for a future release.


That’s great to hear!

So this would work with .srt files? That would be amazing.

The capabilities will be similar, but this does not necessarily require such files.
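For readers who already have .srt files and want searchable text with timestamps today, a rough sketch of reading one might look like the following. It assumes the common SubRip layout (index line, timing line, then text, with blank lines between cues); the `parse_srt` name and the sample are hypothetical, and this says nothing about how DEVONthink will implement the planned feature.

```python
import re

# An SRT timing line looks like "00:01:15,500 --> 00:01:18,250".
# Accepting "." as well as "," before the milliseconds is an assumption
# made here to tolerate slightly non-standard files.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(content):
    """Parse SRT text into a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", content.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
                cues.append((start, end, text))
                break
    return cues


# Hypothetical sample, for illustration only.
sample = """1
00:00:01,000 --> 00:00:04,000
Welcome to the interview.

2
00:01:15,500 --> 00:01:18,250
We shot the documentary
on VHS tape.
"""

for start, end, text in parse_srt(sample):
    print(start, end, text)
```

The resulting text could be pasted into a companion note next to the media file so that it shows up in search, with the start time kept alongside each line.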

I work in the TV/film industry. If anyone on the coding side needs a wide variety of source files to test with, I can find a source file somewhere in my archives.
It’s been a long, strange trip from dropping 3/4" and VHS tapes off at the stenographer/caption service, to uploading video as private YouTube videos to let it do a best-guess first pass, to now using dedicated AI services and Premiere Pro to generate captions and transcriptions.
Plus, at least one independent filmmaker I know of used a CG voice to “read” the described video (DV) script so their film would meet the required spec.
