XML

More and more of the data I deal with is stored in XML files, and I guess I am not alone in this. While DT handles XML pretty well as just another file format, I would love to see more XML-specific support. Here are some first ideas.

  • Opening XML files: it would be nice if “open” would run some program appropriate for the XML data in a particular file. That would obviously require some configurability, perhaps through scripts.

  • Indexing XML data: at the moment, DT seems to index just the text values of nodes. This is fine in many cases, but it would also be nice to be able to search for specific tag/value combinations. Moreover, I would like to be able to have certain tags not indexed at all, for example tags that occur in huge numbers but refer to data that is not human-friendly.

  • Identifying XML formats: both of the features I mentioned would require some way to identify the type of XML file. DTDs or schemas might seem like the obvious choice, but many XML files don’t use any. For most purposes, the set of tags used, and/or the set of tags used most frequently, is enough to decide what to do with an XML file, and DT is pretty good at doing that kind of classification.
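To make the tag-set idea concrete, here is a minimal sketch in Python of classifying an XML file by the tags it uses. It is not how DT works internally; the sample documents, the type names, and the helper functions `tag_profile`, `jaccard`, and `classify` are all hypothetical, and Jaccard similarity is just one simple choice of measure.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_profile(xml_text):
    """Count how often each tag occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


def jaccard(a, b):
    """Similarity of two tag sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def classify(xml_text, known_types):
    """Return the name of the known type whose tag set is most similar."""
    profile = tag_profile(xml_text)
    return max(known_types, key=lambda name: jaccard(profile, known_types[name]))


# Hypothetical example files standing in for "a few selected XML files".
BIB = "<refs><entry><author>A</author><title>T</title><journal>J</journal></entry></refs>"
MOL = "<mol><atom><x>1.0</x><y>2.0</y><z>3.0</z></atom><author>B</author></mol>"
known_types = {"bibliography": tag_profile(BIB), "molecule": tag_profile(MOL)}

new_file = "<refs><entry><author>C</author><title>U</title></entry></refs>"
result = classify(new_file, known_types)
print(result)
```

Because `tag_profile` keeps counts and not just names, the same profile could also be used to weight the most frequent tags more heavily.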

What I would like to see is something like the following: I would select a few XML files and tell DT to make an XML data type definition from them. That type definition would include a list of tags to include in or exclude from indexing, an application to open the files with, and a template for displaying them inside DT. New XML files would then be assigned to those types automatically.

XML support would also help with another issue that has been mentioned here before: bibliography management with DT. If the citation data were stored in some XML format (e.g. the risx format used by refDB), then it could be managed/indexed in DT while all the processing and conversion could be left to external tools.

Interesting ideas, we’ll think about them. But this would propel DEVONthink into the realm of XML authoring systems, and it might be overkill for people who use it for completely different purposes. We have to balance all feature requests to make a “round” product.

Best,

Eric.

No authoring! I don’t want an all-in-one program suite either. No editing or transforming of XML, just intelligent handling and indexing.

In fact, DT already does a pretty good job at handling XML, and I will probably use it for bibliography management as it is. My comments are the result of playing with molecular XML files (for those in the biomolecule business: the XML version of PDB files). Those files contain mostly atomic coordinates, plus author, journal, and chemical information. The latter three are all I care about within DT. But since the whole file is indexed, “see also” will show me files with similar numeric values for atomic positions, rather than files that describe similar molecules or were submitted by the same authors.
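As a rough illustration of what I mean by leaving certain tags out of the index, here is a small Python sketch. The tag names (`entry`, `author`, `journal`, `atom`) are made up for the example and are not the real tags of the PDB XML format, and `indexable_text` is a hypothetical helper, not anything DT provides.

```python
import xml.etree.ElementTree as ET


def indexable_text(xml_text, excluded_tags):
    """Collect text values, skipping excluded elements and everything inside them."""
    words = []

    def walk(elem):
        if elem.tag in excluded_tags:
            return  # drop this element and its whole subtree
        if elem.text and elem.text.strip():
            words.append(elem.text.strip())
        for child in elem:
            walk(child)
            # Tail text belongs to the parent, so keep it even if the child is excluded.
            if child.tail and child.tail.strip():
                words.append(child.tail.strip())

    walk(ET.fromstring(xml_text))
    return words


# Hypothetical miniature of a molecular file: a little metadata, lots of numbers.
SAMPLE = """<entry>
  <author>Jane Doe</author>
  <journal>J. Mol. Biol.</journal>
  <atom><x>12.345</x><y>-3.210</y><z>7.001</z></atom>
</entry>"""

text = indexable_text(SAMPLE, excluded_tags={"atom"})
print(text)
```

With the coordinates excluded, only the author and journal strings would reach the index, so similarity would be driven by those.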

I like kh’s idea of XML integration very much, and I agree that this format will become overwhelmingly important in the near future. Some comments:

There are cheap apps like EditiX which would do that easily. It seems that you want DT to be aware of the DTD or similar documents in order to perform more sophisticated searches:

Perhaps this could be done without XML-awareness, just as a new option “exclude the following strings from search:”. Then one could enter all the tags to be ignored, using wildcards. This could be integrated into the search window, since excluding phrases might be a helpful feature in other cases as well.

Your suggestion is more sophisticated and uses the potential of XML, but maybe one can achieve the same goal with similar ease without turning DT into an XML validator.
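The exclusion-list idea could look roughly like the following Python sketch. It assumes shell-style wildcards (`*`, `?`, `[0-9]`), which is only one possible syntax; `filter_terms` and the sample terms are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_terms(terms, exclude_patterns):
    """Keep only the terms that match none of the wildcard patterns."""
    return [
        term
        for term in terms
        if not any(fnmatchcase(term, pattern) for pattern in exclude_patterns)
    ]


# Hypothetical word list as it might come out of an indexed file.
terms = ["author", "atom_x", "atom_y", "journal", "12.345"]

# Ignore everything starting with "atom_" and everything starting with a digit.
kept = filter_terms(terms, ["atom_*", "[0-9]*"])
print(kept)
```

This works on plain strings and knows nothing about XML structure, which is the point of the simpler approach.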

Just some spontaneous thoughts,
Maria

It seems that DT ignores tags completely at the moment (at least they do not show up in the word lists), so any filtering based on tags would require some additional work.

I know that there are good XML tools already, and I don’t expect DT to do anything other than indexing. However, I do expect DT to do whatever it does without requiring me to run every file manually through some other program. Just my laziness ;-)

I could imagine it being a tremendously powerful feature if DEVONthink could somehow allow any file in the database to be wrapped or accompanied by an XML file that would allow keywords, categories, notes, or other metadata to be added to any file in the database, and for this info to be ‘exposed’ or at least exportable as XML. Having XML be the native metadata format for DT would of course allow the easy import of the type of journal articles described above, and would allow the user to add further comments or notes as custom fields if they so desired. If DEVONthink were able to read the XML data into its data store and integrate the management of each file’s metadata, this would be a true two-way system. I can almost imagine a system where the metadata could be modified both inside and outside of the DT universe, for example by a Perl script, and DEVONthink could then see the modifications and update its internal stores appropriately. DT would not then be an XML file creator, but a better importer of XML because it would be a better exporter of XML.
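To show the kind of round trip I have in mind, here is a minimal Python sketch of a metadata "sidecar". The element names (`metadata`, `keywords`, `categories`, `item`, `notes`) are invented for illustration and do not correspond to any existing DT format; the "external edit" stands in for a script modifying the file outside the database.

```python
import xml.etree.ElementTree as ET


def write_sidecar(meta):
    """Serialize a metadata dict to a small XML document (as a string)."""
    root = ET.Element("metadata")
    for key in ("keywords", "categories"):
        group = ET.SubElement(root, key)
        for value in meta.get(key, []):
            ET.SubElement(group, "item").text = value
    ET.SubElement(root, "notes").text = meta.get("notes", "")
    return ET.tostring(root, encoding="unicode")


def read_sidecar(xml_text):
    """Parse the sidecar XML back into a metadata dict."""
    root = ET.fromstring(xml_text)
    return {
        "keywords": [item.text for item in root.find("keywords")],
        "categories": [item.text for item in root.find("categories")],
        "notes": root.findtext("notes") or "",
    }


# Export: the database writes out what it knows about a file.
meta = {"keywords": ["xml", "pdb"], "categories": ["structures"], "notes": "check coordinates"}
exported = write_sidecar(meta)

# External edit: some outside script adds a keyword to the exported XML.
tree = ET.fromstring(exported)
ET.SubElement(tree.find("keywords"), "item").text = "reviewed"
edited = ET.tostring(tree, encoding="unicode")

# Import: the database reads the file again and sees the change.
updated = read_sidecar(edited)
print(updated["keywords"])
```

A real two-way system would also have to notice when the file changed and resolve conflicts between edits made inside and outside the database, which this sketch ignores.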

Ok, well, I can dream, can’t I?

Eric