Converting web articles to Markdown suddenly producing corrupted documents

I’m not sure what has changed, but this started happening yesterday.

My current workflow involves clipping articles that I’m interested in to DEVONthink, and having a Smart Rule convert the web articles into Markdown. This was working fine for a long time, but now it has started to generate Markdown documents with lots of spurious data at the beginning and end, plus various typographic symbols such as quotes and dashes are getting mangled.

This is happening no matter where I clip from — Firefox, Safari and NetNewsWire, in my use cases.

Perhaps the Markdownifier system that DEVONthink uses is just feeling poorly? In any case, I thought it best to raise the matter here so that others are aware. I can send an example file if that would help.

Send a URL that is not working?

Any reason you don’t clip directly to Markdown?

That’s how I normally do it (although it does not work on all web sites due to how the web site is configured … probably deliberately so to make it hard for me to save as markdown). For those sites, even saving as PDF doesn’t work.

I’m seeing no problems saving direct to Markdown today.

Here’s one of the links that was producing mangled Markdown: Tabs - Chris Coyier

I normally clip to Web Archive because that seems to capture more information than Markdown most of the time. And yes, I’m only too aware that some sites make it nigh-on-impossible to do any clipping. :frowning:

I just tried the above link using both the Web Archive and Markdown options in the DEVONthink clipper. The latter produced clean text, but the former still has malformed text at the top of the article once converted.

I’ve noticed that the longer the article is, the more detritus shows up in the Markdown conversion. Hoping the screenshot above may shed more light for someone on what is going on. Right now I’m stumped (and turning off the conversion Smart Rule for the time being).

Well, sorry to report that I can’t reproduce it with that URL. I saved it as:

  • Web Archive Clutter free
  • Web Archive No Clutter Free
  • Markdown
  • Both Web Archives converted into Markdown files

All are of course slightly different in how they are presented (which is normal), but none show any “?” icons. I guess it’s down to something technical, with fonts or with how my Mac is set up vs. yours. I don’t know much more than that.

macOS 13.0.1 and DEVONthink Pro 3.8.7.

I could send you my five files, but I don’t really think that would be useful to you.

Hmm, that is really odd! The only change I made yesterday was updating my Mac to macOS 13.1, so maybe that changed something?

It’s not a deal-breaker for me, but definitely requires more work using BBEdit to clean up the text. Clipping to Markdown should be a sufficient work-around for now, though I’ll need to check to make sure it captures all the article text.

A pretty big clue, IMHO.

I can reproduce the problem here. DT Pro 3.8.7, macOS 13.1 (so there’s one difference from @rmschne’s setup and a similarity to @AlanRalph’s).

The “weird” stuff at the start of the MD file looks like a binary plist. In fact, if I have a look at the webarchive file (i.e. …database/files.noIndex/webarchive/xxx/yyy.webarchive), it seems to be a binary plist:

 webarchive/4 > file Tabs\ -\ Chris\ Coyier.webarchive
Tabs - Chris Coyier.webarchive: Apple binary property list

But that’s just as it should be: a webarchive is a plist according to Wikipedia. So, presumably Apple has changed something under the hood breaking conversion to Markdown, @cgrunenberg?
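For anyone who wants to poke at this outside DEVONthink, here is a minimal Python sketch of reading a webarchive as a plist. It assumes the usual Safari layout, where the page itself lives under the `WebMainResource` key with `WebResourceData` and `WebResourceTextEncodingName` entries. The function names are my own, and this is not how DEVONthink does its conversion.

```python
import plistlib


def is_binary_plist(data: bytes) -> bool:
    """Binary property lists start with the magic bytes 'bplist'."""
    return data.startswith(b"bplist")


def extract_main_html(webarchive_bytes: bytes) -> str:
    """Return the main HTML document stored in a webarchive.

    Assumes the usual layout: a top-level 'WebMainResource' dict
    holding the raw page bytes and (optionally) their encoding name.
    """
    # plistlib.loads auto-detects binary vs. XML property lists.
    archive = plistlib.loads(webarchive_bytes)
    main = archive["WebMainResource"]
    # Fall back to UTF-8 if the archive does not name an encoding.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return main["WebResourceData"].decode(encoding, errors="replace")
```

To try it on a real file, read the bytes with `open(path, "rb").read()` (the path is whatever your own webarchive is called) and pass them to `extract_main_html`. Running `is_binary_plist` on the first bytes of a broken Markdown file is a quick way to confirm that the leading junk really is plist data.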

Is Safari able to display the web archive as expected?

Yes, as is DT. It’s the conversion to MD that goes off the rails.

And how exactly did you create the Markdown document? Thanks!

Just opened the above web page in Safari on macOS 13.1, then opened the Sorter’s Clip to DEVONthink tab, selected Safari and clipped a Markdown document, a decluttered webarchive and a non-decluttered webarchive. No issue at all. However, converting the webarchives to Markdown does indeed fail on 13.1, but not on older releases. Looks like a new bug in Ventura.

I used “Convert to Markdown” in the context menu of the webarchive.

And fixed for the next release. It’s actually an issue of the initial conversion to rich text.

Thank you for the quick response! I thought for a while that DEVONthink was suffering database corruption when I first saw this happening yesterday. Phew!

Clipping directly to Markdown seems to be working for most sites I’ve visited today, though there’ll always be the odd one that, either through poor design or lack of care, makes that task a lot harder than it need be…

I thought for a while that DEVONthink was suffering database corruption when I first saw this happening yesterday. Phew!

To put your mind at ease, actual database corruption is a very uncommon thing. :slight_smile:

Indeed. The few times I’ve had any problems, they’ve turned out to be self-inflicted, usually through indexed folders and/or items being moved. :sweat_smile:
