Best way to import some text with metadata?

Hi!

I have a bunch of plain-text news articles from LexisNexis. I have an R script that will let me read them in as data, so then I can export them - and all related metadata like author, etc - into whatever format I want.

I’m trying to figure out the best way to format these files to get them into DEVONthink, because I very much want to take advantage of DEVONthink’s file tools to highlight and track whatever I think is interesting.

I’ve only had DEVONthink for a few months and I’m not sure where to start. I don’t know a lick of AppleScript or JavaScript, but I did see smart rules, and it looks like they can search for text inside a document? Could I format the text file with key: value pairs for easy metadata filling?

Any help, links to resources or whatever, would be appreciated. I have searched the forum some, and all signs point to me learning AppleScript, but I also want to get started on this project while I learn it, so thank you all in advance.

In a Markdown file (which is essentially a plain text file): yes. Consult the MultiMarkdown documentation for the details.

If you’ve done programming in R, JavaScript should not be too challenging, I think.
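
To make the "key: value pairs at the top of a Markdown file" idea concrete, here is a minimal sketch in plain JavaScript of how one exported article could be turned into such a file. The function name `toMarkdownWithMetadata` and the article fields are hypothetical, chosen only for illustration, and the layout (metadata lines first, then a blank line, then the body) is my reading of the MultiMarkdown metadata convention, so check it against the MultiMarkdown documentation.

```javascript
// Hypothetical helper: builds Markdown text with a metadata header.
// The field names (title, source, author, date, body) are assumptions
// about what an export might contain, not a DEVONthink requirement.
function toMarkdownWithMetadata(article) {
  const meta = {
    title: article.title,
    source: article.source,
    author: article.author,
    date: article.date,
  };
  const header = Object.entries(meta)
    // Skip fields that are missing or empty.
    .filter(([, value]) => value !== undefined && value !== "")
    // Collapse whitespace so every value stays on a single line.
    .map(([key, value]) => `${key}: ${String(value).replace(/\s+/g, " ").trim()}`)
    .join("\n");
  // A blank line separates the metadata block from the article body.
  return `${header}\n\n${article.body.trim()}\n`;
}

console.log(toMarkdownWithMetadata({
  title: "Withholding weddings",
  source: "Coshocton Tribune",
  author: "Jessie Balmert",
  date: "2015-08-08",
  body: "Article text goes here.",
}));
```

The same string building is easy to do in R before writing the files out; the point is only the shape of the resulting text.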

I formatted my Markdown documents with some YAML metadata, which is basically just key/value pairs with three dashes above and three dashes below.

Like this.

---
title: "Withholding weddings"
source: "Coshocton Tribune"
author: "By, Jessie Balmert"
date: 2015-08-08
word_count: 446 words
---
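
For anyone who wants to read a block like this in a script, here is a rough sketch of pulling the key/value pairs out of the text in plain JavaScript. `parseFrontMatter` is a hypothetical helper, not part of DEVONthink, and it only handles flat `key: value` lines like the ones in this example, not full YAML.

```javascript
// Hypothetical helper: reads flat "key: value" lines between two "---" lines
// at the very top of a text. Not a full YAML parser.
function parseFrontMatter(text) {
  const match = text.match(/^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)/);
  if (!match) return {};
  const meta = {};
  for (const line of match[1].split(/\r?\n/)) {
    const sep = line.indexOf(":"); // split on the first colon only
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    // Drop one pair of surrounding double quotes, if present.
    const value = line.slice(sep + 1).trim().replace(/^"(.*)"$/, "$1");
    if (key) meta[key] = value;
  }
  return meta;
}

const doc = [
  "---",
  'title: "Withholding weddings"',
  'source: "Coshocton Tribune"',
  "date: 2015-08-08",
  "---",
  "",
  "Article text goes here.",
].join("\n");

console.log(parseFrontMatter(doc).source); // Coshocton Tribune
```

Every value comes back as a string, so a date still needs converting before it can be used as one.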

Here is my rule.

The date seems to work okay, although there are actually two dates in the document.

But there’s nothing in the company field, which should come from the "source: " line of the Markdown doc. So the date is fine, but company is blank.

Can anyone give it a once over? Thank you!

How exactly should it come from there? You use source to set the “Author” field, but you do nothing (at least not that I can see) to set the company field. DT might provide some AI, but I doubt that it goes this far.

That’s fair, I had the wrong field selected. I fixed it, but company is still empty after running the rule. I’m curious whether I’m getting the concept right, because it’s not working and I’m not sure what I’m doing wrong, other than posting confusingly.



It seems that company is a list field. It should probably be a text field. See your definition of company in the custom metadata.

Thank you for looking, @chrillek. Unfortunately it’s single-line text. I think that’s just a string. I feel like I need to RTFM but I did. I feel like I need to R_a_different_FM. Anyone got a suggestion?

I suppose that ‘Scan Text’ does not consider the Markdown metadata. But @cgrunenberg would know about that.

That the ‘Date’ is set in your example is to be expected, because you’re using the document date, not the date in the metadata section.

That’s right unless the hidden preference IndexRawMarkdownSource was used to import/index the document.

Thanks for confirming. Here’s a small sample script that could be used, too:

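(() => {
  const app = Application("DEVONthink 3");
  app.includeStandardAdditions = true;
  // Process every record currently selected in DEVONthink.
  app.selectedRecords().forEach(r => {
    const p = r.plainText();
    // Match the whole "source: ..." line and drop optional surrounding quotes,
    // so multi-word values like "Coshocton Tribune" are captured intact.
    const companyRaw = p.match(/^source:[ \t]*"?(.*?)"?[ \t]*$/m);
    if (companyRaw && companyRaw[1]) {
      // Write the value to the custom metadata field "company".
      app.addCustomMetaData(companyRaw[1], {for: "company", to: r});
    }
  });
})();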

Do I just set that with
defaults write com.devon-technologies.think3 IndexRawMarkdownSource -string yes?

I did that and reindexed the files and it’s still not working for me.

Thanks for the time it took to do this. I got a syntax error on that first >.

Set the language in Script Editor to JavaScript.

Got it. Perfect for me to mess around with. Thank you so much.

It’s a boolean, not a string (see Help > Appendix > Hidden Preferences)

defaults write com.devon-technologies.think3 IndexRawMarkdownSource -bool TRUE

In addition, DEVONthink 3 must not be running while you set the preference. Afterwards, rebuild the database or reimport the documents.

I did that and confirmed it’s “1” when I run defaults read com.devon-technologies.think3 IndexRawMarkdownSource -bool.

It’s still reading the second date in the file as the document date. Is this expected behavior? While I’m here: I don’t need to be fancy about Markdown, so is there a better format for getting a batch of text files with metadata into DEVONthink?

The Document Date placeholder does not necessarily return the first date in the document.

I am following this thread with interest because (as some here will know) I’ve had fun trying to extract dates from documents to feed into custom metadata.

In the current case would Scan Text > Date: * followed by use of the Document String placeholder achieve what @grantfan needs?

Edit: Ah - but then some manipulation would be required (by script?) to turn that into a date format: bother!

Stephen
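
On the script route, the manipulation needed to turn a captured string such as 2015-08-08 into a real date is small. Below is a minimal sketch in plain JavaScript; `parseIsoDate` is a hypothetical helper, it assumes the string is always in YYYY-MM-DD form as in the sample metadata above, and it interprets the date in local time.

```javascript
// Hypothetical helper: converts "YYYY-MM-DD" into a Date at local midnight.
// Returns null for anything that is not a valid calendar date in that format.
function parseIsoDate(value) {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value.trim());
  if (!m) return null;
  const [year, month, day] = m.slice(1).map(Number);
  const date = new Date(year, month - 1, day); // months are zero-based
  // Reject roll-overs such as 2015-02-31, which Date would turn into March.
  if (
    date.getFullYear() !== year ||
    date.getMonth() !== month - 1 ||
    date.getDate() !== day
  ) {
    return null;
  }
  return date;
}

const parsed = parseIsoDate("2015-08-08");
console.log(parsed.getFullYear(), parsed.getMonth() + 1, parsed.getDate()); // 2015 8 8
```

Whether a Date produced this way can be written straight into a date-type custom metadata field from a script is something I have not verified, so treat that step as an assumption to test.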

The Document Date placeholder should work if the action looks like this:

Scan Text > Date > Date: *

Ah yes, of course: thanks!

Stephen