Some questions about audio transcription in DEVONthink 4

  • Is it possible to get audio transcription without the time codes? I have no doubt they are useful in longer audio transcriptions, but I just want the text. I use Markdown for the transcriptions, so it would be possible to create a Smart Rule which removes all links. But it would be nicer if there were just a little box to untick.

  • I use Just Press Record for audio recordings on my mobile devices, mostly in German and sometimes in English. JPR uses Apple’s speech-to-text system, which also works without an internet connection. So I would guess the results should be identical to DEVONthink’s Local Apple Speech transcription. But JPR’s results are better. Actually better than Remote Apple Speech transcription and Remote OpenAI Whisper transcription too, which, to my surprise, show results identical to the first option. The most noticeable difference is that all three DT audio transcription options almost always ignore all dictated punctuation (no actual punctuation marks, but also not the words “period”, “comma”, etc.). Why is that?

  • Is it possible to create a script that removes the audio file and moves its annotation file to a certain destination? My usual audio files are voice memos that I only want to keep until I have checked that the transcription is right, and discard after that.
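Until such a checkbox exists, the Smart Rule idea from the first question could be prototyped as a simple text filter. This is a hypothetical sketch in Python (DEVONthink Smart Rules actually embed AppleScript or JavaScript, and the exact Markdown link format of the time codes may differ from the pattern assumed here):

```python
import re

# Assumed (hypothetical) format: each paragraph of the Markdown
# transcription starts with a time-code link such as
# "[00:10](x-devonthink-item://...) - ". Adjust the pattern if the
# actual links look different.
TIMECODE = re.compile(r"^\[\d{1,2}:\d{2}(?::\d{2})?\]\([^)]*\)\s*-?\s*",
                      re.MULTILINE)

def strip_timecodes(markdown: str) -> str:
    """Remove leading time-code links, keeping only the spoken text."""
    return TIMECODE.sub("", markdown)
```

For example, `strip_timecodes("[00:00](x-devonthink-item://abc?time=0) - Hello")` returns `"Hello"`.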

  1. No, that’s not an option currently.
  2. I wouldn’t expect identical results any more than I’d expect identical results from a commercial AI query. And we don’t control what the Speech framework transcribes, e.g., its punctuation.
  3. There is nothing currently available, but why not just do this manually since you’re manually verifying the results?
  1. Too bad.
  2. Well, speech recognition is a bit different from made-up answers by a generative AI. But even then, the surprising part was that there are no differences at all, including the missing punctuation. And assuming Just Press Record uses the same framework yet produces different results in this specific aspect, i.e. punctuation (and another one: no time codes), does that not suggest that there is some way of setting framework variables, so to speak?
  3. I will do it manually. And because of that, and the many voice memos I generate almost every day, it would be very helpful if I needed just one click to delete the audio and move its annotation.

First transcribe to searchable text and afterwards convert the audio/video file to plain text. Or switch to View > Document Display > Text Alternative.

The segmentation and time stamps are handled by DEVONthink but it doesn’t post-process or filter the actual text in any way. But if you could share an audio file and the results of DEVONthink and Just Press Record, then we’ll check this.

Okay, let’s try this. I have attached an .m4a audio file, zipped, because uploading it unzipped is not allowed.

I used a dummy text spoken by one of macOS’ German voices. Its pronunciation is not always good, so there are some mistakes in the transcript, but they are not the transcriptions’ fault.

One thing I have to add: at some point while trying to get this going (first I had an .mp3 file, but Just Press Record did not show it, and I searched for an app to convert .mp3 to .m4a before finally re-recording it as .m4a), the speech function of macOS did not work for some time. Meanwhile there was a larger nsurlsessiond data download, and then it worked again. Maybe some audio-transcription-related files were added or replaced?

Anyhow. This was the original text (“Punkt”, “Komma”, “Fragezeichen”, “neuer Absatz” = “full stop”, “comma”, “question mark”, “new paragraph”):

Ich ging los Komma bevor es noch zu regnen begann Punkt Einen Schirm trug ich nicht bei mir Punkt Das lag daran Komma dass ich schon viel Gepäck bei mir hatte Punkt Neuer Absatz Wieso Komma dachte ich Komma als mich der Regen bereits durchnässt hatte Komma habe ich nicht dieses bisschen Gepäck noch mitgenommen Fragezeichen

This became the audio file which Just Press Record transcribed to this:

Ich ging los, bevor es noch zu regnen begann. Einen Schirm trug ich nicht bei mir. Das lag daran, dass ich schon viel Gepäck bei mir hatte.

Wieso, dachte ich, als mich der Regen bereits hatte, habe ich nicht dieses bisschen Gebäck noch mitgenommen?
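Whatever Just Press Record does internally, the conversion it apparently applies to dictated punctuation words can be reproduced as a post-processing step. This is purely an illustrative sketch (the word list and spacing rules are my assumptions, not JPR’s or Apple’s actual implementation):

```python
import re

def apply_dictated_punctuation(text: str) -> str:
    """Turn dictated German punctuation words into punctuation marks."""
    rules = [("Neuer Absatz", "\n\n"), ("Punkt", "."),
             ("Komma", ","), ("Fragezeichen", "?")]
    for word, mark in rules:
        # attach the mark directly to the preceding word
        text = re.sub(r"\s*" + word + r"\b", mark, text)
    # drop stray spaces left at the start of new paragraphs
    return re.sub(r"\n +", "\n", text).strip()
```

Applied to the dictated text above, this yields the correctly punctuated sentences, paragraph break included.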

And DEVONthink 4 set to Local Apple Speech transcription to this:

00:00 - Ich ging los bevor es noch zu regnen begann Einen Schirm trug ich nicht bei mir Das lag daran dass ich schon viel Gepäck bei mir hatte
00:10 - Wieso dachte ich als mich der Regen bereits durch hatte habe ich nicht dieses bisschen Gebäck noch mitgenommen

The actual text is almost identical, including the funny “Gebäck”/“Gepäck” (“pastry”/“baggage”) error. DEVONthink 4 reduced the almost unintelligible “durchnässt” (“soaked through”) to “durch” (“through”), while Just Press Record omitted it completely. I therefore withdraw my claim that JPR generally delivers better results than DT 4. At least not since the presumed audio-related update mentioned above.

What is striking is that DT 4 obviously recognised the beginnings of the sentences, because they all start correctly with a capital letter. The punctuation marks are just not displayed. Are the time stamps set per paragraph?

And all three options of DT 4 show the same results. Does OpenAI Whisper require an API key, and if there isn’t one (I don’t have any), does the transcription fall back on one of Apple’s options? If so, there should be a note.
Test Audio.m4a.zip (778.0 KB)

Only after a short pause in the speech, meaning that DEVONthink’s segmentation creates these paragraphs.
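That pause-based segmentation can be sketched as follows. This is a hypothetical illustration, not DEVONthink’s actual code; the 0.8-second threshold is an invented value, and the word timings are assumed to come from the speech framework:

```python
def segment(words, pause=0.8):
    """words: list of (text, start_seconds, end_seconds) tuples.
    Start a new timestamped paragraph whenever the silence between
    two consecutive words exceeds the pause threshold."""
    paragraphs, current, start, prev_end = [], [], None, None
    for text, t0, t1 in words:
        if current and t0 - prev_end > pause:
            paragraphs.append((start, " ".join(current)))
            current, start = [], None
        if start is None:
            start = t0
        current.append(text)
        prev_end = t1
    if current:
        paragraphs.append((start, " ".join(current)))
    return [f"{int(s // 60):02d}:{int(s % 60):02d} - {t}"
            for s, t in paragraphs]
```

With word timings that contain one long gap, this produces exactly the two-paragraph, two-timestamp shape seen in the example transcript above.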

Yes, an API key is required.

The next beta will improve the punctuation support when using Apple Speech and also include an option to disable the timestamps.

Amazing, thank you!

Just installed DT4b2, and I’m having issues with transcribing videos. The local and server Apple speech transcriptions don’t seem to work at all. Using Whisper works, but it’s odd where it puts the timestamps. I also ran a transcription into Searchable Text, but there were no timestamps in that at all.

Setting the destination to Annotation and running it on a 10-minute video a few times, the results ranged from only the 00:00 timestamp, to a few timestamps separated by several minutes.
BTW, even though it was selected, I had to toggle the checkbox to get the timestamps.

I tried again on a 20-minute video and had somewhat better results, but had to run it several times before it inserted more timestamps. The last time I ran it, I got a timestamp at 00:00, then one at 4:48, then none until 10:19, at which point it began inserting timestamps much more often.

For example:

11:08 - Okay, close enough to 51 for my liking
11:12 - Okay, we’re going to allow the curds now to sink to the bottom and settle for five minutes So take your big pot over to the sink area, and we’re going to drain that through a cheesecloth lined basket Now the basket I’m using is the largest one I have, which is
11:29 - 165 millimeters across and
11:32 - Which is the same as 6.5 inches So just pour that through. That also does two purposes. It heats up the basket Just warms it a bit so cold curds don’t hit it It just warms it a bit so cold curds don’t hit it. I’ve also sprayed the
11:50 - cheesecloth with a very fine mist of white vinegar. Now this helps avoid the Parmesan from …

I tried asking both ChatGPT and Claude to do the transcribing from the chat, but they both said they don’t have access to the file (the file was selected). I must not be understanding how that works, though they seemed to have access to other files from within the chat.

I don’t know if this is possible, but it’d be nice to have a setting to tell the AI to add a timestamp every X minutes for when it doesn’t seem to do well on its own. Or, if I’m doing this wrong, how do I achieve better results?

No, there is no way for you to control the timestamps. And even if you could tell it, “set a timestamp every minute”, it’s likely you’d be dissatisfied with the end result of that too. Remember, this is all best guesses and heuristics.

BTW, even though it was selected, I had to toggle the checkbox to get the timestamps.

What was selected and you had to toggle what checkbox?

PS: ChatGPT and Claude are not speech-to-text AI engines. The ones provided in the settings are very specific models.

Did you enable Siri and download the necessary languages first? This is required for the local Apple Speech transcription, see Settings > AI > Transcription.

Timestamps are inserted after short breaks, not periodically.

This option is only available when transcribing to comments or annotations.

This is likely true. Although, I was thinking of the use case where my college kids transcribe a lecture and then have no timestamp for a 5- or 10-minute block of time. A “no more than every X minutes” setting for timestamps would be helpful when you want to refer back to the audio or video (assuming the AI doesn’t do a good job).

The “Add timestamps to transcription” checkbox under Settings > AI > Transcription.

Probably so, but I was attempting to figure it out as much as possible before posting my questions here.

Did you see the documentation’s AI and Your Documents > Speech-to-Text section? :slight_smile:

Ah! No, I didn’t read the fine grey print closely enough. I assumed that having Siri enabled and my default language set was all that was needed; but I see I have to download a language under the Transcription Languages button. That seems like it should be a default setting in the OS: if I set my language to English, it stands to reason that I’d want transcriptions in English, too.

Okay. I’ll have to play with this more, and pay attention to see what the audio is doing and how that relates to an inserted timestamp.

Probably. I’ve tried to read all of the documentation regarding AI in DT and am using the help file, too. I will review it again. :slightly_smiling_face:

No problem! It was just a practical example I had added and wondered if it had inspired your looking into it.

Thanks for the help above.

After setting the transcription language in the OS settings and restarting DT, transcribing using Apple’s transcription, local and remote, now works. After running some tests, I didn’t really notice a difference between doing so locally or on Apple’s servers. However, the insertion of timestamps seems almost arbitrary, for all of the transcription services. Tests on some files yielded a good number of timestamps while other files yielded very few. I realize this isn’t a DT issue; I’m just making observations.

The timestamps are actually inserted by DEVONthink.

Huh, good to know. Thanks. Thinking about that, I guess that does make sense.

I know you’re already swamped with feature requests and managing the beta, but it would be nice to have an “insert timestamp every X minutes” option for those transcriptions that don’t have many. Part of the amazingness of transcriptions is being able to click a link and jump to that part of the audio or video.

Do you have an audio or video file (or link) that you could share? Maybe it’s possible to optimize the current approach before adding even more options :slight_smile: