Markdown import with metadata

anonny · July 21, 2024, 7:05am

I’m having trouble importing a Markdown file with metadata. I have DEVONthink Pro 3.9.6.
Could you help? I’m attaching a test md and the script I’m trying.
Script
markdown import.zip (6.9 KB)

I’ve tried several ways to automate importing of tags, but it’s not so straightforward.
If I import a Markdown file with # tags like “#Philosophy/Transcendentalism, #introspection, #divine immanence, #self-reliance, #rejection of external authority”, then only the word adjacent to the # is converted to a tag, even though the delimiter should be ", ".
I’ve been racking my brain, but can’t come up with anything. I keep getting tag errors. This is the latest script:

use AppleScript version "2.5"
use scripting additions
use framework "Foundation"

on run
    -- We assume DEVONthink exists and it is version 3
    set devonThink to application id "com.devon-technologies.think3"
    
    tell devonThink
        set theDatabase to current database
        if theDatabase is missing value then
            display alert "Please open a database in DEVONthink 3."
            return
        end if
        
        set theFiles to choose file with multiple selections allowed
        
        repeat with aFile in theFiles
            try
                -- Extract content from a file and then get the metadata and cleaned content
                set theContent to readFileAsUTF8(aFile)
                set {metadata, cleanContent} to extractMetadataAndContent(theContent)
                
                -- Import the cleaned content to DEVONthink
                set theRecord to import {name:(getFileName(aFile)), type:markdown, content:cleanContent} to theDatabase
                
                -- Process metadata and Rename the record
                tell theRecord
                    processMetadata(it, metadata)
                    renameRecord(it, metadata)
                end tell   
                
            on error errorMessage number errorNumber
                display alert "Error processing file " & (POSIX path of aFile) message errorMessage
            end try
        end repeat
    end tell
end run

on readFileAsUTF8(aFile)
    set fileURL to current application's NSURL's fileURLWithPath:(POSIX path of aFile)
    set {theContent, theError} to current application's NSString's stringWithContentsOfURL:fileURL encoding:(current application's NSUTF8StringEncoding) |error|:(reference)
    if theContent is missing value then
        if theError is not missing value then
            set errorMessage to theError's localizedDescription() as text
        else
            set errorMessage to "Unknown error"
        end if
        error "Failed to read file: " & (POSIX path of aFile) & " - " & errorMessage
    end if
    return theContent as text
end readFileAsUTF8

on extractMetadataAndContent(theContent)
    set prevTIDs to AppleScript's text item delimiters
    set AppleScript's text item delimiters to "---"
    set contentParts to text items of theContent
    set AppleScript's text item delimiters to prevTIDs 
    if (count of contentParts) < 3 then error "Invalid file format"
    
    set metadata to {}
    set rawMetadata to item 2 of contentParts
    set metadataLines to paragraphs of rawMetadata
    repeat with aLine in metadataLines
        if aLine contains ": " then
            set {key, value} to my splitString(aLine, ": ")
            set metadata to metadata & {{key, value}}
        end if
    end repeat
    
    return {metadata, item 3 of contentParts} -- Clean content is the 3rd item in the list
end extractMetadataAndContent

on processMetadata(theRecord, metadata)
    repeat with metaPair in metadata
        set {key, value} to metaPair
        if key is "tags" then
            set tagList to my splitString(value, ", ")
            add tagList as tags
        else
            set custom meta data key to value
        end if
    end repeat
end processMetadata

on renameRecord(theRecord, metadata)
    set newName to ""
    repeat with metaPair in metadata
        set {key, value} to metaPair
        if key is "Title" then
            set newName to value
            exit repeat
        else if key is "Author" and newName is "" then
            set newName to value
        end if
    end repeat
    if newName is not "" then set its name to newName
end renameRecord

on splitString(theString, theDelimiter)
    set prevTIDs to AppleScript's text item delimiters
    set AppleScript's text item delimiters to theDelimiter
    set theArray to every text item of theString
    set AppleScript's text item delimiters to prevTIDs
    return theArray
end splitString

on getFileName(aFile)
    return (name of (info for aFile))
end getFileName

chrillek · July 21, 2024, 1:03pm

First off: your test.md contains a weird character on the first line (before the line beginning with ---). The same character appears after the metadata. I’d get rid of that.

Second: I’d not use AppleScript to handle this kind of rather advanced string processing. It might be possible, but it’s not pretty, and quite complicated. Instead, use JavaScript.

Next, the steps you perform: They’re too complicated. Just import the files into DT and clean them up there. No need for ASObjC just for that. type is an obsolete parameter for import.

Your getFileName handler uses info for, which is deprecated. I guess the name returned by that command is just the filename, which can be obtained instead by a simple string operation (see below in the JS code).

The JavaScript code could look similar to the following. I didn’t bother much with error checks and am not sure that the decision on a new name follows your logic.

// I leave out the file selection steps, assuming their paths are in the `paths` array
const app = Application("DEVONthink 3");
const database = app.currentDatabase();
/* Regular expression to find the metadata block */
const MDregEx = /^---\n(.*?)---$/ms;

/* Loop over all files */
paths.forEach(p => {
  const record = app.import(p,{name: p.split('/').pop(), to: database.root()});
  const txt = record.plainText();

  // get the metadata and remove them from the text
  const match = txt.match(MDregEx);

  if (!match) return; // continue with next record if no metadata found
  
  // Remove metadata from MD file 
  record.plainText = txt.replace(MDregEx, ''); 
   
  // Build a string array containing one entry for each line of metadata
  const metadata = match[1].split('\n');  
  let newName = undefined;

  // Loop over the metadata
  metadata.forEach(md => {
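    // Split at the first colon only, so values may contain colons themselves
    const pos = md.indexOf(':');
    if (pos === -1) return; // Skip over lines without a colon
    const key = md.slice(0, pos).trim();
    const value = md.slice(pos + 1).trim();
    if (key === 'tags') {
      record.tags = value.split(',').map(t => t.trim());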
    } else {
      app.addCustomMetaData(value, {for: key, to: record});
      if (!newName && (key === 'Title' || key === 'Author')) {
        newName = value;
      }
    }
  })
  if (newName) {
    record.name = newName;
  }
})
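
The parsing steps in the script above can be tried outside DEVONthink. The following is a minimal sketch rather than the script itself: `parseFrontMatter` is a hypothetical helper name, and it assumes that the metadata block sits at the very start of the text, that each entry is a `key: value` line, and that tags are separated by commas.

```javascript
// Hypothetical helper (not part of DEVONthink's scripting API): split a
// Markdown string into metadata fields, tags, and the remaining body.
function parseFrontMatter(text) {
  // Assumption: the block opens on the very first line and both
  // delimiter lines consist of exactly three dashes.
  const match = text.match(/^---\n([\s\S]*?)\n---(?:\n|$)/);
  if (!match) return { fields: {}, tags: [], body: text };

  const fields = {};
  let tags = [];
  for (const line of match[1].split('\n')) {
    const pos = line.indexOf(':');
    if (pos === -1) continue; // skip lines without a colon
    const key = line.slice(0, pos).trim();
    const value = line.slice(pos + 1).trim();
    if (key.toLowerCase() === 'tags') {
      // Trim each tag so ", " and "," behave the same way
      tags = value.split(',').map(t => t.trim()).filter(t => t !== '');
    } else {
      fields[key] = value;
    }
  }
  return { fields, tags, body: text.slice(match[0].length) };
}

const sample = '---\nTitle: Self-Reliance: an excerpt\ntags: Philosophy/Transcendentalism, introspection\n---\nBody text';
console.log(parseFrontMatter(sample));
```

Splitting at the first colon keeps a title such as "Self-Reliance: an excerpt" intact, and trimming avoids tags that start with a space.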

anonny · July 21, 2024, 1:32pm

Thanks… I’ll see how I could do it in javascript.
The weird characters are actually Unicode characters. I used them because I don’t imagine I’d ever use them in a real text file, and they were an easy way to “encapsulate” what I would then be able to delete easily through regex or AppleScript…
In essence, I just wanted to import these things into the metadata…
Then I kept running into errors, and just tweaking until I gave up.

Edit
I looked up the available AppleScript commands from DEVONthink, and even referenced the stable AppleScript 2.5 documentation, but it seems that DEVONthink isn’t using standard AppleScript in certain areas.
If you have an updated reference, please share.

chrillek · July 21, 2024, 1:56pm

What do you mean by that, do you have examples? I wouldn’t know, since I avoid AS whenever I can.

BLUEFROG · July 21, 2024, 1:57pm

You need to clarify this. DEVONthink certainly uses standard AppleScript but applications have independent implementations of the various functions they want to make scriptable.

Also, according to the script in the ZIP, you are trying to write metadata to Markdown documents. That is not supported, as is evidenced by the lack of functionality in the Info > Properties inspector when a Markdown document is selected.

set devonThink to application id "com.devon-technologies.think3"

The recommended form is application id "DNtp" and there’s no need to set this to a variable. Just tell it directly.

Also, the language you use is a matter of comfort and experience. JavaScript is not for everyone, just as AppleScript or ASOC is not. That being said, this applies custom metadata and tags from a selected Markdown document’s content…

tell application id "DNtp"
	repeat with theRecord in (selected records whose (type is markdown))
		set src to plain text of theRecord
		set mdMarker to false -- becomes true once the opening "---" has been seen
		set od to AppleScript's text item delimiters
		repeat with theParagraph in (paragraphs of src)
			-- theParagraph is a reference to a list item, so dereference it before comparing
			if ((contents of theParagraph) is "---") and (not mdMarker) then
				set mdMarker to true
			else if ((contents of theParagraph) is "---") and mdMarker then
				exit repeat -- closing "---": stop reading metadata
			else
				if theParagraph contains ":" then
					-- Split "Key:Value" at the colon
					set AppleScript's text item delimiters to ":"
					set {theKey, theValue} to (text items of theParagraph)
					log {theKey, theValue}
					if (theKey is not "Tags") then
						add custom meta data theValue for theKey to theRecord
					else
						-- Split the tag list at commas and append it to the existing tags
						set AppleScript's text item delimiters to ","
						set tagList to (text items of theValue)
						set tags of theRecord to (theRecord's tags & tagList)
					end if
					set AppleScript's text item delimiters to od
				end if
			end if
		end repeat
	end repeat
end tell

It could easily be implemented as a smart script for use with batch processing and smart rules.

anonny · July 21, 2024, 2:03pm

I’m wondering if it would just be easier to set IndexRawMarkdownSource to true…
I’m ok with reindexing, or just building a new db … but I don’t want the metadata in the md file after it’s been imported…

BLUEFROG · July 21, 2024, 2:05pm

You mean removing the actual text in the document? If so, why?

anonny · July 21, 2024, 2:07pm

to improve search. I don’t want to sift through a lot of similar matches.

BLUEFROG · July 21, 2024, 2:09pm

I don’t want to sift through a lot of similar matches.

This is not a guaranteed occurrence. It depends on what you’re searching for and in what scope / context.

If you are planning to use or transmit the Markdown document outside DEVONthink, it would be wise to retain that text. The applied custom metadata in DEVONthink is not going to be used by an app like Typora, etc.

anonny · July 21, 2024, 2:13pm

I wasn’t planning on leaving DEVONthink.
Also, I can export the metadata with the text in csv… Most of my data is not long… less than 1k words so I could just add it back.
The difficult thing is how to add the metadata from the text.

I think I’m going to have to rethink this.
I was able to do the tags effectively, but the other stuff is tricky to do in one go.

BLUEFROG · July 21, 2024, 2:18pm

I’m not sure what’s difficult or needs rethinking when my script already adds the custom metadata and tags.

anonny · July 21, 2024, 2:19pm

Yes… I scrolled up and saw it after I posted

Where were you all day!!? I could have just asked a lot sooner!!!

Problem solved…
Phew!
I’m soooo thankful! thank you @BLUEFROG

BLUEFROG · July 21, 2024, 2:50pm

You’re welcome

By the way, your symbol ۞ as the first line keeps the metadata from being invisible.
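
To check a batch of files for this before importing, something like the following could be used. It is only a sketch: `frontMatterStartsAtTop` is a made-up helper name, and it assumes the metadata block is expected to open with a `---` line as the very first line of the file.

```javascript
// Hypothetical helper: report whether the metadata block opens on the
// first line. Assumption: the block is delimited by lines of three dashes.
function frontMatterStartsAtTop(text) {
  const firstLine = text.split(/\r?\n/, 1)[0];
  return firstLine.trim() === '---';
}

console.log(frontMatterStartsAtTop('---\nTitle: x\n---\nBody'));     // true
console.log(frontMatterStartsAtTop('۞\n---\nTitle: x\n---\nBody')); // false
```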

PS: Here is a small modification, including stripping the metadata from the text (though I still don’t care for the idea)…

tell application id "DNtp"
	repeat with theRecord in (selected records whose (type is markdown))
		set src to plain text of theRecord
		set {incr, hasMetadata, docModified, mdMarker} to {1, false, false, 1}
		set od to AppleScript's text item delimiters
		repeat with theParagraph in (paragraphs of src)
			if (text of theParagraph is "---") and (not hasMetadata) then
				set hasMetadata to true
				set mdMarker to incr
			else if (text of theParagraph is "---") and hasMetadata then
				exit repeat
			else
				if theParagraph contains ":" then
					set docModified to true
					set AppleScript's text item delimiters to ":"
					set {theKey, theValue} to (text items of theParagraph)
					if (theKey is not "Tags") then
						add custom meta data theValue for theKey to theRecord
					else
						set AppleScript's text item delimiters to ","
						set tagList to (text items of theValue)
						set tags of theRecord to (theRecord's tags & tagList)
					end if
					set AppleScript's text item delimiters to od
				end if
			end if
			set incr to incr + 1
		end repeat
---------- Remove this section if you don't want to remove the metadata text in the content.
		if docModified and (mdMarker is not 1) then
			set AppleScript's text item delimiters to linefeed
			set modText to ({paragraphs (mdMarker + incr) thru -1 of src} as string)
			set AppleScript's text item delimiters to od
			set plain text of theRecord to modText
		end if
----------
	end repeat
end tell
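
The stripping step can also be sketched as a pure function in JavaScript and tried on sample strings before touching real documents. `stripMetadataBlock` is a hypothetical name. The sketch assumes the block is enclosed by two lines containing only `---`, and it drops everything up to and including the closing delimiter, which includes any marker line placed before the block.

```javascript
// Hypothetical helper: return the text that follows the metadata block.
// Assumption: the block is enclosed by two lines containing only "---".
function stripMetadataBlock(text) {
  const lines = text.split('\n');
  let open = -1;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trim() !== '---') continue;
    if (open === -1) {
      open = i; // first delimiter: the block starts here
    } else {
      // second delimiter: keep everything after it
      return lines.slice(i + 1).join('\n');
    }
  }
  return text; // no complete block found, leave the text untouched
}

console.log(stripMetadataBlock('---\nTitle: x\n---\nBody line 1\nBody line 2'));
```

If only one delimiter is found, the text is returned unchanged, so a document with a stray `---` line is not truncated.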

anonny · July 21, 2024, 2:55pm

Why can I only give you one heart per reply??

need to send a ticket

BLUEFROG · July 21, 2024, 2:57pm

Haha! One is more than sufficient.
Make sure you duplicate a few documents and test the script, especially for the content stripping, before committing to using it on production files.

And more importantly to me: do you understand what’s going on in the script, the reasons for the bits and bobs in it?

anonny · July 21, 2024, 3:10pm

Oh!
I see… There’s no undo… It’s gone for good!
Yes…
Thank you for the warning.
I’m in Hanoi time, and my brain doesn’t work now.
But in case anyone is wondering what I’m doing and why I’m even rambling on about this: I do literary analysis. I have an AI script that takes a text excerpt, processes it, and spits out a template like the one in the test file. This is the prompt. The list of disciplines can be set to whatever categories someone is working with.

You are a master archivist tasked with helping to organize and retrieve excerpts from various readings, primarily in literature but potentially covering a wide range of interests. Your goal is to distill and describe the essence of each excerpt through carefully chosen tags that will facilitate easy retrieval based on concepts.

Here is the excerpt to analyze:

<excerpt>

{{EXCERPT}}

</excerpt>

Your task is to create a set of tags that accurately capture the main concepts and themes of this excerpt. These tags should follow a specific structure and meet certain requirements:

1. Tag Structure:

- Only for the first tag, use the format: tag/subtag

- Separate multiple tags with a comma and space ", "

2. Tag Requirements:

- Only the first tag must be a tag/subtag type

- It must include the most appropriate category from the provided list of Academic Disciplines

- The subtag must be a keyword, or keyword-phrase that best fits the concept in the text excerpt

- Create up to 4 additional tags that continue the pattern of identifying the concept from broad to narrow

B. Metadata Requirements:

1. Provide a title for the excerpt (if not obvious, create a brief descriptive one that could aid in memorization in a declarative sentence form)

2. Provide an author (if known, otherwise leave blank)

3. Provide a Reference (if known, otherwise leave blank)

Here is the list of Academic Disciplines to choose from:

<AcademicDisciplines>

# Philosophy

Aesthetics

Applied philosophy

Philosophy of economics

Philosophy of education

Philosophy of engineering

Philosophy of history

Philosophy of language

Philosophy of law

Philosophy of mathematics

Philosophy of music

Philosophy of psychology

Philosophy of religion

Philosophy of physical sciences

Philosophy of biology

Philosophy of chemistry

Philosophy of physics

Philosophy of social science

Philosophy of technology

Systems philosophy

Political Philosophy

Epistemology

Justification

Reasoning errors

Ethics

Applied ethics

Animal rights

Bioethics

Environmental ethics

Meta-ethics

Moral psychology, Descriptive ethics, Value theory

Normative ethics

Virtue ethics

Logic

Mathematical logic

Philosophical logic

Meta-philosophy

Metaphysics

Philosophy of Action

Determinism and Free will

Ontology

Philosophy of mind

Philosophy of pain

Philosophy of artificial intelligence

Philosophy of perception

Philosophy of space and time

Teleology

Theism and Atheism

Philosophical traditions and schools

African philosophy

Analytic philosophy

Aristotelianism

Continental philosophy

Eastern philosophy

Feminist philosophy

Islamic philosophy

Platonism

Social philosophy and political philosophy

Anarchism

Feminist philosophy

Libertarianism

Marxism

</AcademicDisciplines>

To create the tags:

1. Carefully read and analyze the excerpt

2. Identify the main concepts, themes, and ideas presented

3. Select the most appropriate Academic Discipline that aligns with the excerpt's content

4. Choose a specific subtag that best represents the core concept of the excerpt

5. Create up to 4 additional tags that further refine and narrow down the concepts, moving from broad to specific

Format your final output as follows:

۞

---

Author: [insert author if known]

Title: [brief descriptive, and memorizable title in declarative sentence form]

Reference: [insert if known]

tags: [Insert your tags here, following the specified format and requirements]

---

۞

1. Note that there must be two return characters.

1a. Formatting is markdown YAML.

Ensure that your tags accurately reflect the content of the excerpt and would be useful for retrieving this information later based on its concepts and themes.

Do not comment on the tags.
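
Since the import depends on the model actually following this format, the tags line could be checked before a file goes into DEVONthink. The sketch below is one possible check, not part of the workflow described above. `checkTagLine` is a hypothetical helper, and the rules it encodes (first tag in `tag/subtag` form, tags separated by commas, at most four additional tags) are read off the Tag Structure and Tag Requirements sections of the prompt.

```javascript
// Hypothetical helper: check a "tags:" value against the prompt's rules.
// Returns a list of problems; an empty list means the line looks fine.
function checkTagLine(value) {
  const tags = value.split(',').map(t => t.trim()).filter(t => t !== '');
  const problems = [];
  if (tags.length === 0) {
    problems.push('no tags found');
    return problems;
  }
  // Rule: the first tag uses the tag/subtag form.
  if (!/^[^\/]+\/[^\/]+$/.test(tags[0])) {
    problems.push('first tag is not in tag/subtag form');
  }
  // Rule: only the first tag is a tag/subtag type.
  if (tags.slice(1).some(t => t.includes('/'))) {
    problems.push('only the first tag may contain a slash');
  }
  // Rule: up to 4 additional tags after the first one.
  if (tags.length > 5) {
    problems.push('more than 4 additional tags');
  }
  return problems;
}

console.log(checkTagLine('Philosophy/Transcendentalism, introspection, self-reliance'));
```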

BLUEFROG · July 21, 2024, 3:17pm

anonny:

Format your final output as follows:

۞

---

Author: [insert author if known]

Title: [brief descriptive, and memorizable title in declarative sentence form]

Reference: [insert if known]

tags: [Insert your tags here, following the specified format and requirements]

---

I strongly recommend you remove the first ۞ and make sure the metadata starts at the very first line of the document. Include the ۞ after the metadata block, if desired. I would also recommend you excise the code removal in the last version of the script I posted.

PS: Get some sleep. Some problems are better to rest on and come back to afresh.

PPS: The result without the content cutting and the change I recommended…

PPPS: The Philosophy/Transcendentalism tag is created as a set of nested tags. Just so you’re aware.
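
To preview which nesting a tag string implies, the slash can be treated as a path separator. This is only an illustration of the idea, not DEVONthink's implementation, and `tagHierarchy` is a hypothetical helper.

```javascript
// Hypothetical helper: list the parent/child chain implied by a tag path.
// Assumption: "/" is the only separator and empty segments are ignored.
function tagHierarchy(tag) {
  return tag.split('/').map(part => part.trim()).filter(part => part !== '');
}

console.log(tagHierarchy('Philosophy/Transcendentalism')); // two levels: Philosophy, then Transcendentalism
console.log(tagHierarchy('introspection'));                // one level: introspection
```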

anonny · January 2, 2025, 3:20pm

In case anyone is interested, or in case I somehow delete the script: this new script works as a YAML-style block. It updates the fields whenever they change, and it respects the “---” boundary. You can also add your own #tags manually, which will autopopulate if you have the create-tags-from-# option set up. It also changes the filename to whatever is set in the title.

Title:something
tags:tag1, tag2, etc
metafield:one
metafield:two
#manualtag1 #manualtag2

tell application id "DNtp"
	repeat with theRecord in (selected records whose (type is markdown))
		-- First clear all existing custom metadata
		set custom meta data of theRecord to {}
		-- Clear existing tags
		set tags of theRecord to {}
		
		set src to plain text of theRecord
		set mdMarker to false
		set od to AppleScript's text item delimiters
		set foundTitle to "" -- Variable to hold the title for renaming
		set paraCount to count paragraphs of src
		set i to 1
		
		repeat until i > paraCount
			set theParagraph to paragraph i of src
			if (theParagraph is "---") and (not mdMarker) then
				set mdMarker to true
			else if (theParagraph is "---") and mdMarker then
				exit repeat
			else if mdMarker then
				if theParagraph contains ":" then
					set AppleScript's text item delimiters to ":"
					set {theKey, theValue} to (text items of theParagraph)
					log {theKey, theValue}
					if (theKey is "title") then
						set foundTitle to theValue
					end if
					if (theKey is not "Tags") then
						add custom meta data theValue for theKey to theRecord
					else
						set AppleScript's text item delimiters to ","
						set tagList to (text items of theValue)
						set tags of theRecord to tagList
					end if
					set AppleScript's text item delimiters to od
				end if
			end if
			set i to i + 1
		end repeat
		
		if (foundTitle is not "") then
			set name of theRecord to foundTitle
		end if
	end repeat
end tell