Script: Convert RTF to MultiMarkdown

pete31 · September 27, 2019, 12:17am

Hi there, here’s a script to convert RTF to MultiMarkdown.

Textutil or pandoc produces a header like this:

CocoaVersion: 1671.5  
Generator: Cocoa HTML Writer

which can be turned off by deleting “-s” in the pandoc part.

But I couldn’t find a way to get rid of it without losing first lines that end with a colon (I think that’s because they look like key: value metadata, and I’m new to textutil and pandoc…). If you’re sure none of your first lines end with a colon, remove the “-s” option. In the end I removed the header afterwards with TextSoap, but I’m pretty sure this is not the normal way to go.
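A minimal sketch of doing that clean-up in plain AppleScript instead of TextSoap. It assumes the header consists only of the two keys shown above, followed by one blank line. The handler name stripLeadingMetadata and its parameters are made up for illustration and are not part of the script in this post.

```applescript
-- Hypothetical helper: drop leading "Key: value" lines whose key is in theKeys,
-- plus the single blank line that follows them. Everything else is kept as-is.
on stripLeadingMetadata(theText, theKeys)
	set theLines to paragraphs of theText
	set lineCount to count of theLines
	set firstBodyLine to 1
	
	repeat while firstBodyLine ≤ lineCount
		set thisLine to item firstBodyLine of theLines
		set isHeaderLine to false
		repeat with thisKey in theKeys
			if thisLine starts with ((contents of thisKey) & ":") then
				set isHeaderLine to true
				exit repeat
			end if
		end repeat
		if not isHeaderLine then exit repeat
		set firstBodyLine to firstBodyLine + 1
	end repeat
	
	-- Skip the blank separator line, but only if header lines were actually removed
	if firstBodyLine > 1 and firstBodyLine ≤ lineCount then
		if item firstBodyLine of theLines is "" then set firstBodyLine to firstBodyLine + 1
	end if
	if firstBodyLine > lineCount then return ""
	
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to linefeed
	set bodyText to (items firstBodyLine thru -1 of theLines) as text
	set AppleScript's text item delimiters to oldDelimiters
	return bodyText
end stripLeadingMetadata

set sampleText to "CocoaVersion: 1671.5  " & linefeed & "Generator: Cocoa HTML Writer" & linefeed & linefeed & "Shopping list:" & linefeed & "- milk"
stripLeadingMetadata(sampleText, {"CocoaVersion", "Generator"})
```

With the sample text, only the two body lines should remain, including the first line that ends with a colon. Because the handler matches known keys only, a real first line such as “Shopping list:” is left alone.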

See the pandoc manual for more options.

Happy converting!

--  Convert RTF (→ textutil → HTML → pandoc →) to MultiMarkdown

tell application "Finder"
	try
		set theTempFolder to make new folder in desktop with properties {name:"TEMP - RTF to MultiMarkdown"}
	on error
		display notification "Folder already exists!"
		return
	end try
end tell


tell application id "DNtp"
	try
		set windowClass to class of window 1
		if {viewer window, search window} contains windowClass then
			set currentRecord_s to selection of window 1
		else if windowClass = document window then
			set currentRecord_s to content record of window 1 as list
		end if
		
		set theTempGroup to indicate (POSIX path of (path to desktop) & "TEMP - RTF to MultiMarkdown/") to incoming group
		set theOutputGroup to display group selector "Output to:"
		
		repeat with thisRecord in currentRecord_s
			if type of thisRecord = rtf then
				try
					
					tell thisRecord
						set theRTFURL to URL
						set theRTFCreationDate to creation date
						set theRTFAdditionDate to addition date
						set theRTFModificationDate to modification date
						set theRTFComment to comment
					end tell
					
					set thePath to path of thisRecord
					
					set theName to name of thisRecord
					set theNameWithoutExtension to my Basename(theName)
					if theNameWithoutExtension contains "/" then set theNameWithoutExtension to my encode_Text(theNameWithoutExtension, true, false)
					if (count of characters in theNameWithoutExtension) > 250 then set theNameWithoutExtension to (characters 1 thru 250 in theNameWithoutExtension as string)
					
					set theOutputPath to (POSIX path of (path to desktop) & "TEMP - RTF to MultiMarkdown/") & theNameWithoutExtension & ".md"
					
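					-- quoted form of keeps paths that contain an apostrophe from breaking the shell command
					set theShellScript to "textutil " & quoted form of thePath & " -strip -convert html -stdout | /usr/local/bin/pandoc -t markdown_mmd --wrap=preserve -s -o " & quoted form of theOutputPath & " -f html-native_divs-native_spans"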
					set convertToMultiMarkdown to do shell script theShellScript
					
					repeat with i from 1 to 20
						try
							set theIndexedRecord to (child 1 of theTempGroup)
							exit repeat
						on error
							delay 1.5
						end try
					end repeat
					
					set moveIntoDatabase to consolidate record theIndexedRecord
					set moveToOutputGroup to move record theIndexedRecord to theOutputGroup
					
					set theMultiMarkdownRecord to (child -1 of theOutputGroup)
					
					tell theMultiMarkdownRecord
						set URL to theRTFURL
						set creation date to theRTFCreationDate
						set addition date to theRTFAdditionDate
						set modification date to theRTFModificationDate
						set comment to theRTFComment
					end tell
					
				on error
					set label of thisRecord to 1
				end try
			end if
		end repeat
		
		set cleanUpDEVONthink to delete record theTempGroup
		
		open window for record theOutputGroup
		activate
		
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
	end try
end tell


tell application "Finder" to set cleanUpFinder to delete theTempFolder -- delete TEMP folder


on Basename(filename)
	set revName to reverse of characters of filename as string
	set revNameWithoutExtension to characters ((character offset of "." in revName) + 1) thru -1 in revName as string
	set theBasename to reverse of characters of revNameWithoutExtension as string
end Basename

on encode_Text(theText, encodeCommonSpecialCharacters, encodeExtendedSpecialCharacters)
	set theStandardCharacters to "abcdefghijklmnopqrstuvwxyz0123456789"
	set theCommonSpecialCharacterList to "$+!'/?;&@=#%><{}\"~`^\\|*"
	set theExtendedSpecialCharacterList to ".-_:"
	set theAcceptableCharacters to theStandardCharacters
	if encodeCommonSpecialCharacters is false then set theAcceptableCharacters to theAcceptableCharacters & theCommonSpecialCharacterList
	if encodeExtendedSpecialCharacters is false then set theAcceptableCharacters to theAcceptableCharacters & theExtendedSpecialCharacterList
	set theEncodedText to ""
	repeat with theCurrentCharacter in theText
		if theCurrentCharacter is in theAcceptableCharacters then
			set theEncodedText to (theEncodedText & theCurrentCharacter)
		else
			set theEncodedText to (theEncodedText & encodeCharacter(theCurrentCharacter)) as string
		end if
	end repeat
	return theEncodedText
end encode_Text

on encodeCharacter(theCharacter)
	set theASCIINumber to (the ASCII number theCharacter)
	set theHexList to {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F"}
	set theFirstItem to item ((theASCIINumber div 16) + 1) of theHexList
	set theSecondItem to item ((theASCIINumber mod 16) + 1) of theHexList
	return ("%" & theFirstItem & theSecondItem) as string
end encodeCharacter

pete31 · April 25, 2020, 6:19am

The version above is old!

This script converts RTF to MultiMarkdown.

It needs Pandoc and RegexAndStuffLib installed (put the “RegexAndStuffLib” script in /Users/Username/Library/Script Libraries/).

There’s an option to remove empty lines that pandoc produces (removing unwanted lines afterwards is not ideal, but I couldn’t find a pandoc option to avoid them …). If the resulting markdown record in unrendered view doesn’t look similar to the RTF record, try again with removeEmptyLines set to false.

Make sure to uncomment / add all properties you’d like the markdown record to take over from the rtf.

-- Convert RTF to MultiMarkdown (via textutil and pandoc)
-- This script needs Pandoc (https://pandoc.org/installing.html) and RegexAndStuffLib (https://latenightsw.com/support/freeware/) installed.
-- It does not support RTFD

use scripting additions
use script "RegexAndStuffLib" version "1.0.6"

property removeEmptyLines : true

tell application id "DNtp"
	try
		set windowClass to class of window 1
		if {viewer window, search window} contains windowClass then
			set currentRecord_s to selection of window 1
		else if windowClass = document window then
			set currentRecord_s to content record of window 1 as list
		end if
		
		set theOutputGroup to display group selector
		
		set displaySuffix to do shell script "defaults read com.devon-technologies.think3 DisplaySuffix"
		
		show progress indicator "Converting... " steps (count of currentRecord_s) with cancel button
		
		repeat with thisRecord in currentRecord_s
			if type of thisRecord = rtf then
				try
					if displaySuffix = "0" then -- do shell script returns text, so compare with "0" rather than 0
						set theName to name of thisRecord
					else
						set theName to my basename(name of thisRecord)
					end if
					
					step progress indicator theName
					
					if theName contains "/" then
						set theName to my encode_Text(theName, true, true) -- encode in case the name contains e.g. an url
						set encodedName to true
					else
						set encodedName to false
					end if
					
					set thePath to path of thisRecord
					set theOutputPath to (POSIX path of (path to temporary items folder) & theName & ".md") as string
					
					set convertToMultiMarkdown to do shell script "textutil " & quoted form of thePath & " -convert html -stdout | /usr/local/bin/pandoc -f html-native_divs-native_spans -t markdown_mmd --wrap=preserve -o " & quoted form of theOutputPath
					
					set newRecord to indicate theOutputPath to theOutputGroup
					consolidate record newRecord
					
					tell application "Finder" to delete file (POSIX file theOutputPath as alias)
					
					tell newRecord
						set URL to (URL of thisRecord)
						set comment to (comment of thisRecord)
						#set creation date to (creation date of thisRecord)
						#set addition date to (addition date of thisRecord)
						#set modification date to (modification date of thisRecord)
						
						set theText to plain text
						set firstLine to paragraph 1 in theText
						
						if firstLine contains ":" then
							set escapedFirstLine to regex change firstLine search pattern (":") replace template ("\\\\:")
							set escapedText_List to ((escapedFirstLine as list) & paragraphs 2 thru -1 in theText) as list
							set escapedText to my string_From_List(escapedText_List, linefeed)
							set plain text to escapedText
							set theText to plain text
						end if
						
						if removeEmptyLines = true then
							set cleanText_1 to regex change theText search pattern ("\\n\\n") replace template (space & space & linefeed)
							set cleanText_2 to regex change cleanText_1 search pattern ("^ +$") replace template ("")
							set plain text to cleanText_2
						end if
						
						if encodedName = true then
							set name to my decode_Text(name)
						end if
					end tell
					
				on error
					set label of thisRecord to 1
				end try
			end if
		end repeat
		
		hide progress indicator
		
		open window for record theOutputGroup
		activate
		
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on basename(filename)
	set revName to reverse of characters of filename as string
	set revNameWithoutExtension to characters ((character offset of "." in revName) + 1) thru -1 in revName as string
	set theBasename to reverse of characters of revNameWithoutExtension as string
end basename

on encode_Text(theText, encodeCommonSpecialCharacters, encodeExtendedSpecialCharacters)
	set theStandardCharacters to "abcdefghijklmnopqrstuvwxyz0123456789"
	set theCommonSpecialCharacterList to "$+!'/?;&@=#%><{}\"~`^\\|*"
	set theExtendedSpecialCharacterList to ".-_:"
	set theAcceptableCharacters to theStandardCharacters
	if encodeCommonSpecialCharacters is false then set theAcceptableCharacters to theAcceptableCharacters & theCommonSpecialCharacterList
	if encodeExtendedSpecialCharacters is false then set theAcceptableCharacters to theAcceptableCharacters & theExtendedSpecialCharacterList
	set theEncodedText to ""
	repeat with theCurrentCharacter in theText
		if theCurrentCharacter is in theAcceptableCharacters then
			set theEncodedText to (theEncodedText & theCurrentCharacter)
		else
			set theEncodedText to (theEncodedText & encodeCharacter(theCurrentCharacter)) as string
		end if
	end repeat
	return theEncodedText
end encode_Text

on encodeCharacter(theCharacter)
	set theASCIINumber to (the ASCII number theCharacter)
	set theHexList to {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F"}
	set theFirstItem to item ((theASCIINumber div 16) + 1) of theHexList
	set theSecondItem to item ((theASCIINumber mod 16) + 1) of theHexList
	return ("%" & theFirstItem & theSecondItem) as string
end encodeCharacter

on decode_Text(theText)
	local str
	try
		return (do shell script "/bin/echo " & quoted form of theText & ¬
			" | perl -MURI::Escape -lne 'print uri_unescape($_)'")
	on error eMsg number eNum
		error "Can't urlDecode: " & eMsg number eNum
	end try
end decode_Text

on string_From_List(theList, theDelimiter)
	set theString to ""
	set theCount to 0
	
	repeat with thisItem in theList
		set theCount to theCount + 1
		set thisItem to thisItem as string
		if theCount ≠ (count of theList) then
			set theString to theString & thisItem & theDelimiter
		else
			set theString to theString & thisItem
		end if
	end repeat
	
	return theString
end string_From_List

ngan · April 26, 2020, 2:51pm

I have tested the script.

THANK YOU! It works smoothly for RTF files that have no images, and that’s already perfect for my purpose.
I learnt quite a few good tricks on system-level file manipulation from reading the code!
This is the first time I’ve seen Pandoc at work, and it is powerful.

I probably understand most of your program flow but hope you won’t mind me asking two questions:

Why does the script need “DisplaySuffix”, and why does it use it as a condition for whether or not to change “theName” with basename()? Perhaps it’s more for your specific settings?

set displaySuffix to do shell script "defaults read com.devon-technologies.think3 DisplaySuffix"

I also wonder why the script needs regex and this block when there is a “:” in the “plain text”. I’m asking because the script still works as expected and retains/converts all DT-Links to MD format when I comment out the block. EDITED: Is it to keep Markdown from interpreting any first line with “:” as metadata?

				
						if firstLine contains ":" then
							set escapedFirstLine to regex change firstLine search pattern (":") replace template ("\\\\:")
							set escapedText_List to ((escapedFirstLine as list) & paragraphs 2 thru -1 in theText) as list
							set escapedText to my string_From_List(escapedText_List, linefeed)
							set plain text to escapedText
							set theText to plain text
						end if

Thank you again

pete31 · April 26, 2020, 7:27pm

Glad to hear it works for you.

Edit:

The script doesn’t need to check “DisplaySuffix”.

Now using this handler to get the name without suffix.

Yes, to avoid interpretation as metadata. For other readers:

MultiMarkdown treats a first line containing a : as metadata and hides it in rendered view (see MultiMarkdown Syntax Guide). When converting from RTF we don’t want a first line that contains a : to be hidden; escaping prevents this. In short:

If the first line in the resulting markdown record contains a : and contains formatting there’s no problem.
If the first line isn’t formatted it will be treated as metadata if we don’t escape :.
Easiest way to handle this is to always escape if there’s a colon.
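The “always escape” rule can be sketched without RegexAndStuffLib, using only text item delimiters. This is an illustration under simplifying assumptions, not the code from the script: the handler name escapeFirstLineColons is made up, every colon in the first line is escaped (including one inside a URL such as https://), and line endings are normalised to linefeed.

```applescript
-- Hypothetical helper: put a backslash before each ":" in the first line
-- so MultiMarkdown does not read that line as "key: value" metadata.
on escapeFirstLineColons(theText)
	set theParagraphs to paragraphs of theText
	if (count of theParagraphs) is 0 then return theText
	
	set firstLine to item 1 of theParagraphs
	if firstLine does not contain ":" then return theText
	
	set oldDelimiters to AppleScript's text item delimiters
	-- Split the first line at every colon ...
	set AppleScript's text item delimiters to ":"
	set theParts to text items of firstLine
	-- ... and join the pieces again with an escaped colon
	set AppleScript's text item delimiters to "\\:"
	set item 1 of theParagraphs to (theParts as text)
	-- Rebuild the whole text, one paragraph per line
	set AppleScript's text item delimiters to linefeed
	set escapedText to theParagraphs as text
	set AppleScript's text item delimiters to oldDelimiters
	return escapedText
end escapeFirstLineColons

escapeFirstLineColons("Note: read this first" & linefeed & "Body: untouched")
```

Only the first line changes (to Note\: read this first); the colon in the second line stays as it is, because MultiMarkdown only looks for metadata at the very top of the document.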

It is powerful indeed: there are so many options one can use that I haven’t read the whole User’s Guide yet. There might be formatting in your RTFs that isn’t covered by the script, so it’s a good idea to read the guide and add everything you might need.

I’ve found an option that might make it possible to convert RTFDs too:

--extract-media=DIR
       Extract  images  and other media contained in or linked from the
       source document to the path DIR, creating it if  necessary,  and
       adjust  the  images  references in the document so they point to
       the extracted files.  If the source format is a binary container
       (docx,  epub, or odt), the media is extracted from the container
       and the original filenames are used.   Otherwise  the  media  is
       read  from  the file system or downloaded, and new filenames are
       constructed based on SHA1 hashes of the contents.

I’ll try that now

ngan · April 26, 2020, 7:54pm

Thanks. It’s been a great learning experience.

pete31 · April 27, 2020, 2:25pm

The version above is old!

This script converts RTF and RTFD to MultiMarkdown.

For RTFDs, only images are preserved; other attachments are not supported.

It needs Pandoc and RegexAndStuffLib (see above).

-- Convert RTF to MultiMarkdown (via textutil and pandoc)
-- This script needs Pandoc (https://pandoc.org/installing.html) and RegexAndStuffLib (https://latenightsw.com/support/freeware/).
-- This version converts RTF and RTFD - but only images are preserved, other attachments are not supported!

use scripting additions
use script "RegexAndStuffLib" version "1.0.6"

property moveMarkdownRecord : false -- set to true if you want markdown and image records in one group
property removeEmptyLines : false

tell application id "DNtp"
	try
		set windowClass to class of window 1
		if {viewer window, search window} contains windowClass then
			set currentRecord_s to selection of window 1
		else if windowClass = document window then
			set currentRecord_s to content record of window 1 as list
		end if
		
		set theDestinationGroup to display group selector
		set tempPath to POSIX path of (path to temporary items folder)
		
		show progress indicator "Converting... " steps (count of currentRecord_s) with cancel button
		
		repeat with thisRecord in currentRecord_s
			if (type of thisRecord) is in {rtf, rtfd} then
				
				set theName to my recordName(name of thisRecord, filename of thisRecord)
				
				step progress indicator theName
				
				set tempName to do shell script "date \"+%Y%m%d%H%M%S\""
				
				if (type of thisRecord) = rtf then
					try
						set thePath to path of thisRecord
						set theOutputPath to (tempPath & tempName & ".md") as string
						
						set convertToMultiMarkdown to do shell script "textutil " & quoted form of thePath & " -convert html -stdout | /usr/local/bin/pandoc -f html-native_divs-native_spans -t markdown_mmd --wrap=preserve -o " & quoted form of theOutputPath
						
						set newRecord to indicate theOutputPath to theDestinationGroup
						tell application "Finder" to delete file (POSIX file theOutputPath as alias)
						
					on error
						set label of thisRecord to 1
					end try
					
				else
					try
						set theSource to source of thisRecord
						set theSourcePath to (tempPath & tempName & ".html") as string
						set theOutputPath to (tempPath & tempName & ".md") as string
						set theExtractionPath to (tempPath & tempName) as string
						set createExtractionFolder to do shell script "mkdir -p " & quoted form of theExtractionPath
						
						set sourceFile to open for access theSourcePath with write permission
						write theSource as «class utf8» to sourceFile
						close access sourceFile
						
						set convertToMultiMarkdown to do shell script "/usr/local/bin/pandoc -f html-native_divs-native_spans -t markdown_mmd --wrap=preserve --extract-media=" & quoted form of theExtractionPath & " -o " & quoted form of theOutputPath & " " & quoted form of theSourcePath
						
						set newRecord to indicate theOutputPath to theDestinationGroup
						set theGroup to indicate theExtractionPath to theDestinationGroup
						
						tell application "Finder"
							delete folder (POSIX file theExtractionPath as alias)
							delete file (POSIX file theSourcePath as alias)
							delete file (POSIX file theOutputPath as alias)
						end tell
						
						set name of theGroup to (theName & ".md") as string
						
						if moveMarkdownRecord = true then move record newRecord to theGroup
						
						set theText to plain text of newRecord
						set theParagraphs to paragraphs of theText
						
						set theText_List to {}
						
						repeat with thisParagraph in theParagraphs
							set thisParagraph to thisParagraph as string
							if thisParagraph contains theExtractionPath then
								set theFilename to item 1 of (regex search thisParagraph search pattern "(?<=/)[a-z|0-9]{40}\\.(.*?)(?=\\))" as string)
								repeat with thisChild in (children of theGroup)
									if (filename of thisChild) = theFilename then
										set replaceLink to regex change thisParagraph search pattern "(?<=!?\\[\\]\\()(.*?)" & theFilename & "(?=\\))" replace template (reference URL of thisChild)
										set end of theText_List to replaceLink
										exit repeat
									end if
								end repeat
							else
								set end of theText_List to thisParagraph
							end if
						end repeat
						
						set plain text of newRecord to my string_From_List(theText_List, linefeed)
						
					on error
						set label of thisRecord to 1
					end try
				end if
				
				tell newRecord
					set name to (theName & ".md") as string
					
					set URL to (URL of thisRecord)
					set creation date to (creation date of thisRecord)
					set addition date to (addition date of thisRecord)
					set modification date to (modification date of thisRecord)
					set comment to (comment of thisRecord)
					
					set theText to plain text
					set firstLine to paragraph 1 in theText
					
					if firstLine contains ":" then
						set escapedFirstLine to regex change firstLine search pattern ("(?<!\\\\):(?!//)") replace template ("\\\\:")
						set escapedText_List to ((escapedFirstLine as list) & paragraphs 2 thru -1 in theText) as list
						set escapedText to my string_From_List(escapedText_List, linefeed)
						set plain text to escapedText
						set theText to plain text
					end if
					
					if removeEmptyLines = true then
						set cleanText_1 to regex change theText search pattern ("\\n\\n") replace template (space & space & linefeed)
						set cleanText_2 to regex change cleanText_1 search pattern ("^ +$") replace template ("")
						set plain text to cleanText_2
					end if
				end tell
				
			end if
		end repeat
		
		hide progress indicator
		
		open window for record theDestinationGroup
		activate
		
	on error error_message number error_number
		hide progress indicator
		tell application "Finder"
			try
				delete folder (POSIX file theExtractionPath as alias)
				delete file (POSIX file theSourcePath as alias)
				delete file (POSIX file theOutputPath as alias)
			end try
		end tell
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on recordName(theName, theFilename)
	set revName to reverse of (characters of theName) as string
	set suffixName to reverse of (characters 1 thru ((character offset of "." in revName) - 1) in revName) as string
	set revFileName to reverse of (characters of theFilename) as string
	set suffixFileName to reverse of (characters 1 thru ((character offset of "." in revFileName) - 1) in revFileName) as string
	if suffixName = suffixFileName then set theName to reverse of (characters ((character offset of "." in revName) + 1) thru -1 in revName) as string
	return theName
end recordName

on string_From_List(theList, theDelimiter)
	set theString to ""
	set theCount to 0
	
	repeat with thisItem in theList
		set theCount to theCount + 1
		set thisItem to thisItem as string
		if theCount ≠ (count of theList) then
			set theString to theString & thisItem & theDelimiter
		else
			set theString to theString & thisItem
		end if
	end repeat
	
	return theString
end string_From_List

kbecker · May 5, 2020, 6:08pm

It doesn’t keep the Tags from the original RTFD Note, does it? Otherwise thanks - great job!

pete31 · May 5, 2020, 6:20pm

Thanks. Add this line below “set comment to (comment of thisRecord)”

set tags to (tags of thisRecord)

kbecker · May 5, 2020, 7:55pm

Thanks!

kbecker · May 5, 2020, 8:33pm

If I wanted the markdown file to be created next to the original RTF file, I guess I’d have to set theDestinationGroup to the group that contains the current record. Any chance you could show me how to modify the code accordingly?

pete31 · May 5, 2020, 8:47pm

Try this

set theDestinationGroup to parent 1 of thisRecord

This line has to be inside the repeat block, before theDestinationGroup is first used there.
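A rough sketch of where that line sits. The loop body is reduced to a log statement so the placement can be tried without converting anything; the real conversion code would go where the log line is. It uses “selected records” (DEVONthink 3) to get the selection and needs DEVONthink running with some records selected.

```applescript
tell application id "DNtp"
	set currentRecord_s to selected records
	
	repeat with thisRecord in currentRecord_s
		if (type of thisRecord) is in {rtf, rtfd} then
			-- Resolve the destination per record, so each markdown file lands next to its source
			set theDestinationGroup to parent 1 of thisRecord
			
			-- Placeholder for the conversion code of the script above
			log ("Would convert " & (name of thisRecord) & " into group " & (name of theDestinationGroup))
		end if
	end repeat
end tell
```

Note that anything after the loop that still refers to theDestinationGroup, such as “open window for record theDestinationGroup”, would then see the parent group of the last converted record only.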

pete31 · November 22, 2020, 1:04am

In case you’re trying to run the script I posted in this thread, you’ll find that it doesn’t work in DEVONthink 3.6.

That’s due to DEVONthink’s new handling of “invalid arguments”. After the release of DEVONthink 3 I decided to continue to use “search window” in scripts so that DEVONthink 2 users could use them in, well, search windows. With version 3.6 that’s not possible anymore.

If you want to use the script you’ll have to replace this voluminous block …

set windowClass to class of window 1
if {viewer window, search window} contains windowClass then
	set currentRecord_s to selection of window 1
else if windowClass = document window then
	set currentRecord_s to content record of window 1 as list
end if

… with this neat line …

set currentRecord_s to selected records

… which does what the six lines have done. Wow, that’s great!
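A small sketch of how the start of the script could look with that line, plus a guard for an empty selection. The guard and the convertibleRecords list are additions for illustration; the sketch assumes that “selected records” gives back an empty list when nothing is selected.

```applescript
tell application id "DNtp"
	set currentRecord_s to selected records
	
	-- Assumption: an empty selection comes back as an empty list
	if (count of currentRecord_s) is 0 then
		display alert "DEVONthink" message "Select at least one RTF or RTFD record first." as warning
		return
	end if
	
	-- Keep only the record types the conversion script handles
	set convertibleRecords to {}
	repeat with thisRecord in currentRecord_s
		if (type of thisRecord) is in {rtf, rtfd} then set end of convertibleRecords to (contents of thisRecord)
	end repeat
	
	log ("Records to convert: " & (count of convertibleRecords))
end tell
```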