Best way to import some text with metadata?

grantfan · December 4, 2021, 7:14pm

Hi!

I have a bunch of plain-text news articles from LexisNexis. I have an R script that will let me read them in as data, so then I can export them - and all related metadata like author, etc - into whatever format I want.

I’m trying to figure out the best way to format these files to get them into DEVONthink, because I really want to take advantage of DEVONthink’s file tools to highlight and track whatever I think is interesting.

I’ve only had DEVONthink for a few months and I’m not sure where to get started here. I don’t know a lick of AppleScript or JavaScript, but I did see that smart rules can search for text inside a document. Could I format the text file with key: value pairs for easy metadata filling?

Any help, links to resources or whatever, would be appreciated. I have searched the forum some, and all signs really point to me learning AppleScript, but I also want to get started on this project while I learn it, so thank you all in advance.

chrillek · December 4, 2021, 7:28pm

In a Markdown file (which is essentially a plain-text file): yes. Consult the MultiMarkdown documentation for the details.

If you’ve done programming in R, JavaScript should not be too challenging, I think.
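As a rough illustration of the idea, here is a minimal JavaScript sketch that writes a metadata header of `key: value` lines in front of an article body. The function name `toMarkdownWithMetadata`, the `---` delimiters, and the sample values are assumptions made for this example; check the MultiMarkdown documentation for what is actually accepted.

```javascript
// Hypothetical helper: turn a plain object of metadata into a header of
// "key: value" lines between --- delimiters, followed by the article body.
function toMarkdownWithMetadata(meta, body) {
  const lines = Object.entries(meta).map(([key, value]) => {
    // Quote strings (JSON-style) so colons or commas inside a value stay unambiguous.
    const rendered = typeof value === "string" ? JSON.stringify(value) : String(value);
    return `${key}: ${rendered}`;
  });
  return ["---", ...lines, "---", "", body].join("\n");
}

// Sample values are made up for illustration.
const doc = toMarkdownWithMetadata(
  { title: "Example headline", source: "Example Gazette", word_count: 446 },
  "Article text goes here."
);
console.log(doc);
```

The same structure should be straightforward to produce from an R script, since it is only string concatenation.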

grantfan · December 6, 2021, 12:09am

I formatted my Markdown documents with some YAML metadata, which is basically just key/value pairs with three dashes up top, three on the bottom.

Like this.

---
title: "Withholding weddings"
source: "Coshocton Tribune"
author: "By, Jessie Balmert"
date: 2015-08-08
word_count: 446 words
---

Here is my rule.

It looks like the date works okay since there are actually two in the document.

But there’s nothing in the company field, which should come from the "source:" line of the Markdown doc.

Can anyone give it a once over? Thank you!

chrillek · December 6, 2021, 7:46am

How exactly should it come from there? You use source to set the “Author” field, but you do nothing (at least not that I can see) to set the company field. DT might provide some AI, but I doubt that it goes this far.

grantfan · December 6, 2021, 2:33pm

That’s fair, I had the wrong field selected. I fixed it, but there is still nothing in company after running the rule. I’m curious whether I’m getting the concept right, because it’s not working and I’m not sure what I’m doing wrong, other than posting confusingly.

chrillek · December 6, 2021, 2:38pm

It seems that the company is a list field. It should probably be a text field. See your definition of company in the custom meta data.

grantfan · December 6, 2021, 2:45pm

Thank you for looking, @chrillek. Unfortunately it’s single-line text. I think that’s just a string. I feel like I need to RTFM but I did. I feel like I need to R_a_different_FM. Anyone got a suggestion?

chrillek · December 6, 2021, 3:16pm

I suppose that ‘Scan Text’ does not consider the Markdown meta data. But @cgrunenberg would know about that.

That the ‘Date’ is set in your example is to be expected, because you’re using the document date, not the date in the metadata section.

cgrunenberg · December 6, 2021, 3:21pm

That’s right unless the hidden preference IndexRawMarkdownSource was used to import/index the document.

chrillek · December 6, 2021, 3:27pm

Thanks for confirming. Here’s a small sample script that could be used, too:

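(() => {
  const app = Application("DEVONthink 3");
  app.includeStandardAdditions = true;
  // Process every record currently selected in DEVONthink.
  app.selectedRecords().forEach(r => {
    const p = r.plainText();
    // console.log(p);
    // Capture the whole value of the "source:" line, without surrounding
    // quotes, so that multi-word names like "Coshocton Tribune" stay intact.
    const companyRaw = p.match(/^source:[ \t]*"?([^"\n]+)"?[ \t]*$/m);
    if (companyRaw) {
      app.addCustomMetaData(companyRaw[1].trim(), {for: "company", to: r});
    }
  });
})()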
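If more fields than the source are needed, the header could be parsed in one pass. The following is only a sketch: `parseFrontMatter` is a hypothetical helper, it assumes the header is the first thing in the text and uses simple one-line `key: value` pairs, and it is not a real YAML parser.

```javascript
// Hypothetical helper: read a leading ---/--- block of "key: value" lines
// into a plain object. Assumes one pair per line; no nested YAML.
function parseFrontMatter(text) {
  const block = text.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  const meta = {};
  if (!block) return meta;
  for (const line of block[1].split(/\r?\n/)) {
    const pair = line.match(/^([\w-]+):\s*(.*)$/);
    if (!pair) continue;
    // Strip one pair of surrounding double quotes, if present.
    meta[pair[1]] = pair[2].trim().replace(/^"(.*)"$/, "$1");
  }
  return meta;
}

const sample = [
  "---",
  'title: "Withholding weddings"',
  'source: "Coshocton Tribune"',
  "date: 2015-08-08",
  "---",
  "",
  "Article text."
].join("\n");
console.log(parseFrontMatter(sample).source);
```

Inside the script above, the text passed in would be the result of `r.plainText()`, and each key of interest could then be handed to `addCustomMetaData` in the same way as the company field.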

grantfan · December 6, 2021, 3:45pm

Do I just set that with the following?

defaults write com.devon-technologies.think3 IndexRawMarkdownSource -string yes

I did that and reindexed the files and it’s still not working for me.

grantfan · December 6, 2021, 3:45pm

Thanks for the time it took to do this. I got a syntax error on that first >.

chrillek · December 6, 2021, 3:47pm

Set the language in the script editor to JavaScript

grantfan · December 6, 2021, 4:00pm

Got it. Perfect for me to mess around with. Thank you so much.

cgrunenberg · December 6, 2021, 4:11pm

It’s a boolean, not a string (see Help > Appendix > Hidden Preferences)

defaults write com.devon-technologies.think3 IndexRawMarkdownSource -bool TRUE

In addition, DEVONthink 3 must not be running while you set the preference. Afterwards, rebuild the database or reimport the documents.

grantfan · December 6, 2021, 4:32pm

I did that and confirmed it’s “1” when I run defaults read com.devon-technologies.think3 IndexRawMarkdownSource -bool.

It’s still reading the second date in the file as the document date. Is this expected behavior? While I’m here: I don’t need to be fancy about Markdown, so is there a better format for getting a group of text files with metadata into DEVONthink?

cgrunenberg · December 7, 2021, 8:58am

The Document Date placeholder does not necessarily return the first date in the document.

Stephen_C · December 7, 2021, 10:15am

I am following this thread with interest because (as some here will know) I’ve had fun trying to extract dates from documents to feed into custom metadata.

In the current case would Scan Text > Date: * followed by use of the Document String placeholder achieve what @grantfan needs?

Edit: Ah - but then some manipulation would be required (by script?) to turn that into a date format: bother!

Stephen
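If a scanned string such as 2015-08-08 did have to be turned into a date by script, the conversion itself is short. This is a sketch only: `parseIsoDate` is a hypothetical helper, it assumes the `YYYY-MM-DD` form used in the sample header earlier in the thread, and how the result would be passed on to DEVONthink is left open.

```javascript
// Hypothetical helper: convert a "YYYY-MM-DD" string to a Date at local
// midnight, or return null if the string is not a valid calendar date.
function parseIsoDate(str) {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(str.trim());
  if (!m) return null;
  const [year, month, day] = m.slice(1).map(Number);
  const date = new Date(year, month - 1, day);
  // Reject impossible dates such as 2015-02-31, which Date would roll over.
  return date.getMonth() === month - 1 && date.getDate() === day ? date : null;
}

const parsed = parseIsoDate("2015-08-08");
console.log(parsed ? parsed.toDateString() : "not a date");
```

Building the date from its parts keeps it at local midnight, whereas passing the bare string to `new Date()` would be interpreted as UTC.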

cgrunenberg · December 7, 2021, 10:19am

The Document Date placeholder should work if the action looks like this:

Scan Text > Date > Date: *

Stephen_C · December 7, 2021, 10:38am

Ah yes, of course: thanks!

Stephen