- Is it possible to get audio transcription without the time codes? I have no doubt they are useful in longer transcriptions, but I just want the text. I use Markdown for the transcriptions, so it would be possible to create a Smart Rule which removes all links. But it would be nicer if there were just a little box to untick.
- I use Just Press Record for audio recordings on my mobile devices, mostly in German and sometimes in English. JPR uses Apple’s speech-to-text system, which also works without an internet connection. So I guess the results should be identical to DEVONthink’s Local Apple Speech transcription. But JPR’s results are better. Actually better than Remote Apple Speech transcription and Remote OpenAI Whisper transcription too, which to my surprise show identical results to the first option. The most noticeable difference is that all three DT audio transcription options almost always ignore all dictated punctuation (no actual punctuation marks, but also not the words “period”, “comma”, etc.). Why is that?
- Is it possible to create a script that removes the audio file and moves its annotation file to a certain destination? My usual audio files are speech memos that I just want to keep until I have checked that the transcription is right, and discard after that.
- No, that’s not an option currently.
- I wouldn’t expect identical results any more than I’d expect identical results from a commercial AI query. And we don’t control what the Speech framework transcribes, e.g., its punctuation.
- There is nothing currently available, but why not just do this manually since you’re manually verifying the results?
- Too bad.
- Well, speech recognition is a bit different from made-up answers by a generative AI. But even then, the surprising part was that there are no differences, including the missing punctuation. And assuming Just Press Record uses the same framework and gets different results in this specific aspect, i.e. punctuation, and in another one, no time codes, does that not suggest that there is some way of setting framework variables, so to speak?
- I will do it manually. And because of that and the many voice memos I generate almost every day it would be very helpful if I needed just one click to delete the audio and move its annotation.
First transcribe to searchable text and afterwards convert the audio/video file to plain text. Or switch to View > Document Display > Text Alternative.
The segmentation and time stamps are handled by DEVONthink but it doesn’t post-process or filter the actual text in any way. But if you could share an audio file and the results of DEVONthink and Just Press Record, then we’ll check this.
Okay, let’s try this. I attached an .m4a audio file zipped, because unzipped is not allowed.
I used a dummy text spoken by one of macOS’ German voices. Its pronunciation is not always good so there are some mistakes in the transcript but not due to the transcriptions.
One thing I have to add: at some point while trying to get this going (first I had an .mp3 file, but Just Press Record did not show it, and I searched for an app to convert .mp3 to .m4a before finally re-recording it as .m4a), the speech function of macOS did not work for some time. Meanwhile there was a larger nsurlsessionid data download, and then it worked again. Maybe some audio transcription related files were added or replaced?
Anyhow. This was the original text (“Punkt”, “Komma”, “Fragezeichen”, “neuer Absatz” = “full stop”, “comma”, “question mark”, “new paragraph”):
Ich ging los Komma bevor es noch zu regnen begann Punkt Einen Schirm trug ich nicht bei mir Punkt Das lag daran Komma dass ich schon viel Gepäck bei mir hatte Punkt Neuer Absatz Wieso Komma dachte ich Komma als mich der Regen bereits durchnässt hatte Komma habe ich nicht dieses bisschen Gepäck noch mitgenommen Fragezeichen
This became the audio file which Just Press Record transcribed to this:
Ich ging los, bevor es noch zu regnen begann. Einen Schirm trug ich nicht bei mir. Das lag daran, dass ich schon viel Gepäck bei mir hatte.
Wieso, dachte ich, als mich der Regen bereits hatte, habe ich nicht dieses bisschen Gebäck noch mitgenommen?
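Apart from the recognition errors, JPR's output is what a plain substitution of the dictated words would produce. Purely as an illustration of that expected result, here is a minimal Python sketch. It is not how JPR or Apple's Speech framework works internally, which I don't know; the function name and the word list are my own, taken from the test text.

```python
import re

# Dictated words and the marks they stand for (from the test text only).
MARKS = {"Komma": ",", "Punkt": ".", "Fragezeichen": "?"}


def apply_dictated_punctuation(text: str) -> str:
    """Replace dictated German punctuation words with the actual marks."""
    # "Neuer Absatz" (new paragraph) becomes a line break.
    text = re.sub(r"\s*\bNeuer Absatz\b\s*", "\n", text, flags=re.IGNORECASE)
    # Each mark replaces its word and is attached to the preceding word.
    for word, mark in MARKS.items():
        text = re.sub(rf"\s*\b{word}\b", mark, text)
    return text


print(apply_dictated_punctuation("Ich ging los Komma bevor es noch zu regnen begann Punkt"))
```

A substitution like this cannot tell a dictated “Punkt” from the ordinary noun, so it only works on text that was spoken with that convention in mind.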
And DEVONthink 4, set to Local Apple Speech transcription, transcribed it to this:
00:00 - Ich ging los bevor es noch zu regnen begann Einen Schirm trug ich nicht bei mir Das lag daran dass ich schon viel Gepäck bei mir hatte
00:10 - Wieso dachte ich als mich der Regen bereits durch hatte habe ich nicht dieses bisschen Gebäck noch mitgenommen
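As long as there is no option for it, the time codes could be stripped after the fact. A minimal Python sketch, assuming every segment starts with a time code in the form shown above, followed by " - ". The Markdown link variant is only a guess at what a linked time code might look like, and the function name is my own.

```python
import re

# Leading "MM:SS - " or "HH:MM:SS - " time code. The alternative that matches
# a Markdown link such as "[00:10](some-url) - " is an assumption.
TIMECODE = re.compile(
    r"^\s*(?:\[(?:\d{1,2}:)?\d{1,2}:\d{2}\]\([^)]*\)|(?:\d{1,2}:)?\d{1,2}:\d{2})\s*-\s*"
)


def strip_timecodes(transcript: str) -> str:
    """Remove the leading time code from every line of a transcript."""
    return "\n".join(TIMECODE.sub("", line) for line in transcript.splitlines())


sample = "00:00 - Ich ging los bevor es noch zu regnen begann\n00:10 - Wieso dachte ich"
print(strip_timecodes(sample))
```

Lines without a leading time code are left unchanged, so running it over an already cleaned transcript does no harm.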
The actual text is almost identical, including the funny “Gebäck”/“Gepäck” (“pastry”/“baggage”) error. DEVONthink 4 reduced the almost unintelligible “durchnässt” (“soaked through”) to “durch” (“through”), while Just Press Record omitted it completely. I therefore withdraw my claim that JPR generally delivers better results than DT 4. At least it no longer holds since the presumed audio update mentioned above.
What is striking is that DT 4 obviously recognised the beginnings of the sentences, because they all start correctly with a capital letter. The punctuation marks are just not displayed. Are the time stamps set per paragraph?
And all three options of DT 4 show the same results. Does OpenAI Whisper require an API key, and if there isn’t one (I don’t have one), does the transcription fall back on one of Apple’s? If so, there should be a note.
Test Audio.m4a.zip (778.0 KB)
Only after a short pause in the speech, meaning that DEVONthink’s segmentation creates these paragraphs.
Yes, an API key is required.
The next beta will improve the punctuation support when using Apple Speech and also include an option to disable the timestamps.
Amazing, thank you!