OpenDataLoader and DEVONthink?

I came across OpenDataLoader recently: https://opendataloader.org/

It’s an open-source tool that parses PDFs into structured formats like Markdown or JSON, while preserving layout, tables, and reading order. The focus is clearly on making PDFs usable for AI workflows, rather than just extracting plain text.

Could be relevant for DEVONthink users as a preprocessing step, especially when working with more complex documents.

Has anyone here tried it or something similar?

For complex PDF documents (e.g. ones containing lots of graphs and formulas, or with a garbled text layer), I frequently use the Vision document mode (available in the Options of the chat assistant and in Chat - Query smart actions), so that the AI works from an image of each page instead of its poor text layer.

Note that extracting tabular and especially image data in a way that LLMs can make use of is an unsolved and very challenging problem. Figuring out that “this text, this table, and this image all elaborate on the same idea” is a basic literacy skill for humans, but very difficult for machines.

There’s some in-depth discussion from my field in “Applications of natural language processing and large language models in materials discovery” (npj Computational Materials), but the same issues are going to pop up in any discipline.
