The open-source project MinerU is highly recommended as a tool for converting PDF to Markdown

I use DEVONthink 4 to convert PDF to Markdown, but there are still many formatting issues, even when the document is just text without images or tables. I highly recommend the open-source tool MinerU, which works amazingly well. Hopefully DEVONthink can integrate it. https://mineru.net

How is that open source if they don’t provide the source code? Or did I just not see it on their website.

Isn’t it on GitHub at opendatalab/MinerU, “A high-quality tool for convert PDF to Markdown and JSON” (a one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON)?

However, it is under the AGPL-3.0 license, which is not compatible with integrating it into closed-source software.

I asked a software developer friend of mine whether I could write an AppleScript, combined with DEVONthink’s smart rules, to do this.

Thanks, found it. So, one can run the software locally. I haven’t tried that yet.

As for the downloadable client (which sends everything to a server presumably in the PRC): It has “arm-64” in its name, although it’s for Apple silicon. Opening the DMG gives this lovely image.

One of the “do not loc” buttons on the right does indeed install the program. I tried it with one publicly available PDF. The MD is put in a folder, where one has to retrieve it manually. It contains only level-one headings, which is quite unusual.

Ran another test with an account statement: same useless result as with ABBYY, as it extracts the text by columns (first all the dates, then all the details, then all the amounts). Not usable (but not worse than what ABBYY does).

To install the code locally, one needs conda, which is (again) a package manager. No idea why they can’t use Homebrew. I’m not going there.

They do provide an API, which again requires you to send your files to a server. The website is run by a PRC government domain. I would not feel comfortable sending my data there. And although they offer to register with GitHub, the button does nothing. Does not seem to be a very mature project.

If someone manages to install the thing locally, they could perhaps report here. Apparently, it has a command line interface, so it should be scriptable.
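
For anyone who does get a local install working, here is a minimal sketch of what scripting the command line interface from Python could look like. The executable name `mineru` and the `-p` (input) and `-o` (output folder) flags are assumptions on my part and may differ between versions, so check the tool's own `--help` output first. The helper names `build_command` and `convert_folder` are hypothetical, and the default dry-run mode only prints the commands instead of running them.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path, executable: str = "mineru") -> list[str]:
    """Build the argument list for converting one PDF.

    ASSUMPTION: the executable name and the -p/-o flags are not confirmed
    in this thread; verify them against the installed tool's --help.
    """
    return [executable, "-p", str(pdf), "-o", str(out_dir)]


def convert_folder(src: Path, out_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) one command per PDF found in src."""
    commands = [build_command(pdf, out_dir) for pdf in sorted(src.glob("*.pdf"))]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed; nothing is converted.
            print(shlex.join(cmd))
        else:
            # Raises CalledProcessError if the tool exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_folder(Path("~/Inbox").expanduser(), Path("~/Markdown").expanduser())` would print one command per PDF. A DEVONthink smart rule could then invoke such a script through AppleScript's `do shell script`, although I have not tried that.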

Weird. I’m using its web version, and I think it’s pretty good.

Perhaps. I will not use it because I have no interest in uploading my documents to a server under the control of the PRC government. YMMV, of course.

I just downloaded the ABBYY plugin in DEVONthink. Will this make the PDF-to-Markdown conversion better?

Since it’s AGPL, there’s no license you have to agree to. Unfortunately DMGs don’t seem to support that. Or at least, I wasn’t able to find a way to bypass the license screen.

The need for Conda and Python for a local install is a deal-killer for me.
