Script to Transcribe Media with OpenAI Whisper

Silverstone · August 19, 2024, 2:28pm

I recently wrote a script. It works fine, but it is way too slow on my MacBook. I wonder whether the newer non-Intel processors would speed it up. So, you can check it out )

General installation instructions

First, install the openai-whisper command-line utility. The easiest way is Homebrew: type in Terminal: brew install openai-whisper. The utility is described on GitHub (https://github.com/openai/whisper). You may want to play around with the parameters and language models, so check GitHub and the built-in help: whisper -h. The script currently uses the large model (~3 GB). Whisper downloads it automatically; you only need to specify the model name.
Check whether ffmpeg is installed. If not, type in Terminal: brew install ffmpeg
Save the script and tweak it as needed:

-- Script to transcribe any media with sound and add the transcription to the Finder comments of the record
-- Language is detected automatically and added to the custom meta data of the record
-- Using openai-whisper cli
-- Created by Silverstone on 17.08.2024
-- 17.08.2024 - Added Timer

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use framework "Foundation"

-- Local variables
set theOutputFolder to "/Users/ilya/Documents/DevonthinkTempItems/" -- Temporary path you will use for transcription TXT files
set ShellPath to "PATH=$PATH:/usr/local/bin:/Users/ilya/.local/bin; " -- Where the script looks for the utilities it uses (whisper, ffmpeg, etc.; see the instructions on the forum)
set theLanguageModel to " --model large" -- Choose your model here - https://github.com/openai/whisper

tell application id "DNtp"
	set theRecords to (get selection)
	set RecordCount to (count of theRecords)
	if RecordCount > 0 then
		show progress indicator "Transcribing Media…" steps RecordCount with cancel button
		set GlobalStartTime to (current date) -- Timer start
		set theNumber to 0
		set GoodNumber to 0
		set theLanguage to ""
		
		repeat with theRecord in theRecords
			step progress indicator "(" & (theNumber + 1) & " of " & RecordCount & ") - " & ((name of theRecord) as string)
			set StartTime to (current date) -- Timer start
			
			--Constructing arguments
			set theInput to path of theRecord
			set baseName to (current application's NSString's stringWithString:(theInput))'s lastPathComponent()'s stringByDeletingPathExtension() as text
			set TXToutput to theOutputFolder & baseName & ".txt"
			
			-- Transcribing using the OpenAI Whisper model
			set theText to do shell script ShellPath & "whisper " & quoted form of theInput & theLanguageModel & " -f txt -o " & quoted form of theOutputFolder
			set theTranscription to do shell script "/usr/bin/iconv -t UTF-8 " & quoted form of TXToutput
			
			--Detecting the language of the media from theText (uses first 30 seconds of media)
			set saveTID to AppleScript's text item delimiters
			set AppleScript's text item delimiters to "Detected language: "
			set theList to text items of theText
			set theText to item 2 of theList
			set AppleScript's text item delimiters to "["
			set theList to text items of theText
			set theLanguage to item 1 of theList
			set AppleScript's text item delimiters to ""
			set theLanguage to (characters 1 thru -2 of theLanguage) as string
			set AppleScript's text item delimiters to saveTID
			
			if theLanguage is not "" then
				add custom meta data theLanguage for "languageofcontent" to theRecord
			else
				add custom meta data "unknown" for "languageofcontent" to theRecord
			end if
			
			-- Setting Timer strings (needed for log)
			set EndTime to (current date) -- Timer stop
			set theElapsed to my secondsToTimeString(EndTime - StartTime)
			set theDuration to do shell script "/usr/local/bin/ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 " & quoted form of theInput
			set theDuration to (current application's NSString's stringWithString:theDuration) as real --Converting string to a number
			set theDurationText to my secondsToTimeString((my roundThis(theDuration, 0)) div 1)
			set theRatio to my roundThis((EndTime - StartTime) / theDuration, 2)
			set LogText to "Transcribing Media '" & (name of theRecord) & "': "
			set InfoLogText to "Start: " & (StartTime as text) & return & "End: " & (EndTime as text) & return & "Elapsed: " & theElapsed & " | Duration: " & theDurationText & " | Ratio: " & theRatio
			
			if theTranscription is not "" then
				set the comment of theRecord to theTranscription
				add custom meta data "true" for "transcribed" to theRecord
				log message LogText & "Success!" info InfoLogText
				set GoodNumber to GoodNumber + 1
			else
				log message LogText & "Nothing to transcribe..."
			end if
			
			set theNumber to theNumber + 1
			if cancelled progress then exit repeat
		end repeat
		hide progress indicator
		
		-- Setting Global Timer strings
		set GlobalEndTime to (current date) -- Global Timer stop
		set GlobalElapsed to my secondsToTimeString(GlobalEndTime - GlobalStartTime)
		
		display notification ((GoodNumber) as string) & " of " & ((RecordCount) as string) & " record(s) successfully transcribed." & return & "Elapsed: " & GlobalElapsed with title "Transcribing Media"
	end if
end tell

-- Getting time string from seconds
on secondsToTimeString(t)
	-- If t is likely to be less than a day, comment out 'set d', 'set t' and the last 'set timeString'.
	set d to t div days
	set t to t mod days
	tell (1000000 + (t div hours) * 10000 + (t mod hours div minutes) * 100 + t mod minutes) as text
		set timeString to (text 2 thru 3 & ":" & text 4 thru 5 & ":" & text 6 thru 7)
	end tell
	set timeString to text 2 thru 4 of ((1000 + d) as text) & ":" & timeString
	return timeString
end secondsToTimeString

-- Rounding the number
on roundThis(n, numDecimals)
	set x to 10 ^ numDecimals
	tell n * x to return (it div 0.5 - it div 1) / x
end roundThis

What this script does:

It transcribes the media file and saves the transcription in the Finder comment field of the record. This is useful if you want full-text search across your audio or video records. The transcription even survives export (you can see it in Finder).
It also detects the language of the speech and saves it to a custom metadata field (currently languageofcontent; you can rename it at any time or use your own).
It also sets the boolean custom metadata field transcribed to mark the record as transcribed.
The script shows a progress indicator, which helps when you process a batch of records.
The script has a timer, so you can see how long the transcription took (Elapsed), the Duration of the media file, and a special figure, Ratio, which is Elapsed divided by Duration. You can see all of this during transcription and afterwards in DEVONthink’s standard log window.
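
The language detection described above depends on a single line of whisper's console output. Here is a minimal shell sketch of the same parsing, assuming the output contains a line starting with "Detected language: " (the marker the script splits on). The function name detected_language and the sample text are made up for illustration and are not taken from a real run:

```shell
#!/bin/sh
# Hypothetical helper: print the language named on the first
# "Detected language: " line of whisper's console output, or "unknown".
detected_language() {
    lang=$(printf '%s\n' "$1" | sed -n 's/^Detected language: //p' | head -n 1)
    printf '%s\n' "${lang:-unknown}"
}

# Sample text in the shape the script expects (illustrative only).
sample='Detecting language using up to the first 30 seconds.
Detected language: English
[00:00.000 --> 00:04.000]  Hello and welcome.'

detected_language "$sample"          # prints: English
detected_language "no marker here"   # prints: unknown
```

Falling back to "unknown" mirrors what the script writes to languageofcontent when it finds no language.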

Tweaking the script

Everything you need to tweak is at the beginning of the script. You will need Script Editor to do this (or Script Debugger, if you have it):

Set the path to your temporary folder. The script uses it to save the transcription TXT files.
Set the locations of the command-line utilities so that the do shell script command runs without errors. The catch is that do shell script runs commands in its own shell (/bin/sh), which does not load your Terminal profile and so does not know your PATH; a command that works in Terminal may still fail in AppleScript. In most cases it is enough to give the full path to the executable, like: do shell script "/usr/bin/iconv". But if that command calls other executables in other locations (e.g. ffmpeg), you will get an error. So ask Terminal where they are: which ffmpeg, and add that directory to the ShellPath string at the top of the script, using “:” as the separator. See the example in the script. If the shell script still fails, find out which executable it cannot find, ask Terminal for its path, and add it to this string as well.
Set the language model. You can experiment with different ones; see the names and descriptions on GitHub (https://github.com/openai/whisper).
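
If you do not want to run which for every tool by hand, something like the following sketch can assemble the string for you. It is only an illustration: build_shell_path is a made-up name, and it assumes you run it in Terminal, where whisper, ffmpeg and ffprobe are already found.

```shell
#!/bin/sh
# Hypothetical helper: print a "PATH=$PATH:dir1:dir2; " prefix covering the
# directories of the given tools, in the same shape as the script's ShellPath.
build_shell_path() {
    dirs=""
    for tool in "$@"; do
        location=$(command -v "$tool" 2>/dev/null) || {
            echo "not found: $tool" >&2
            continue
        }
        dir=$(dirname "$location")
        # Add each directory only once.
        case ":$dirs:" in
            *":$dir:"*) ;;
            *) dirs="${dirs:+$dirs:}$dir" ;;
        esac
    done
    [ -n "$dirs" ] || return 1
    printf 'PATH=$PATH:%s; \n' "$dirs"
}

build_shell_path whisper ffmpeg ffprobe || echo "install the missing tools first" >&2
```

Paste the printed line as the value of ShellPath. Note that Homebrew uses /usr/local/bin on Intel Macs and /opt/homebrew/bin on Apple silicon, so the result differs between machines.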

That’s all!
You are all set up. Happy transcribing!

PS
I am curious what Ratio you get on your machines, because mine is very high, around 20 (meaning transcription takes 20 times longer than the duration of the media)! Yes, the transcription is very nice, with all the punctuation and sentences, but 20 is a bit too much ))
Maybe it is because of my Intel MacBook Pro (16,2), or because of the large model…

I don’t know, so please share your Ratio.
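
If you want to check the Ratio outside DEVONthink, here is a rough shell sketch of the same arithmetic. transcription_ratio is a made-up helper name and the figures in the example are invented. The commented lines reuse the whisper and ffprobe calls from the script and assume both tools are on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: Ratio = elapsed seconds / media duration in seconds,
# printed with two decimals (the script also rounds Ratio to two decimals).
transcription_ratio() {
    awk -v elapsed="$1" -v duration="$2" 'BEGIN {
        if (duration <= 0) { print "n/a"; exit }
        printf "%.2f\n", elapsed / duration
    }'
}

# Invented figures: 600 s of transcribing for a 30 s clip.
transcription_ratio 600 30   # prints: 20.00

# With a real file it could look like this (not run here):
#   start=$(date +%s)
#   whisper "$file" --model large -f txt -o "$outdir"
#   end=$(date +%s)
#   duration=$(ffprobe -v error -show_entries format=duration \
#       -of default=noprint_wrappers=1:nokey=1 "$file")
#   transcription_ratio $((end - start)) "$duration"
```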