Searching for HTML data attributes?

winter · November 23, 2022, 10:42pm

Is it possible to search for HTML attribute names, and/or the values of such attributes, in DEVONthink?

I have a folder of HTML <ul> <li> outline files (generated by Jesse Grosjean’s Bike Outliner) which is indexed by DEVONthink.

This works well for searching the <p> text content of these files (as well as for viewing and printing them with custom CSS, as it happens) but I also need to search through these files for particular HTML user attributes and their values (essentially attributes with a data- prefix in their names).

Does that sound like something that DEVONthink is equipped to do?

cgrunenberg · November 24, 2022, 8:03am

No. The tags & attributes are not indexed, only the text (and some metadata like title etc.)

chrillek · November 24, 2022, 8:52am

You could perhaps search for these attributes using a script and then mark the records accordingly.

Or use css to make these elements stand out visually.

winter · November 24, 2022, 9:58am

Thanks, and yes – css signalling of the presence of particular attributes is working well.

I suppose, on reflection, that the challenge of HTML user attributes in this context is that they are tied to particular HTML elements (particular outline rows, in this context),
whereas DEVONthink record indexing (if I have understood it correctly) provides pointers to whole documents, rather than to particular lines.

But I’ll think about a script. This may really be a job for XML tooling and XPath expressions, etc.
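A minimal sketch of the XPath route, assuming the outline file is well-formed XML so that Python's standard-library ElementTree (which supports only a limited XPath subset) can parse it. The sample markup, the data-charlie attribute, and the function name rows_with_attribute are illustrative, and a real Bike file may be structured differently:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample outline; a real Bike file may differ in structure.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul>
      <li><p>prior</p></li>
      <li data-charlie="Seven"><p>first row</p></li>
      <li data-charlie="Eight"><p>second row</p></li>
    </ul>
  </body>
</html>"""


def rows_with_attribute(xml_text, attr_name):
    """Return (attribute value, row text) pairs for elements carrying attr_name."""
    root = ET.fromstring(xml_text)
    hits = []
    # The predicate [@name] selects descendant elements that have that attribute.
    for elem in root.findall(f".//*[@{attr_name}]"):
        # Collapse whitespace so nested markup yields one clean line of text.
        text = " ".join("".join(elem.itertext()).split())
        hits.append((elem.get(attr_name), text))
    return hits


print(rows_with_attribute(SAMPLE, "data-charlie"))
```

Unlike a document-level index, this returns one hit per outline row, so the same attribute with different values in different rows stays distinguishable.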

chrillek · November 24, 2022, 10:21am

If you’re interested in the hardcore way…

DT does index the content of documents, not the markup. Which makes sense, I think.

winter · November 24, 2022, 10:31am

Fair enough – (key, value) custom user data probably does fall somewhere in the twilight zone of that distinction, and the case of:

same attribute
different values in different rows

may not map well onto a document-level record.

It looks as if the CSS attr() function can be cajoled into redefining visible content (to include an attribute value string)

(tho that does leave the “content of documents, not the markup” distinction a bit unclear)

winter · November 24, 2022, 12:16pm

Here, for example, in a DEVONthink preview panel for an HTML (Bike file) record, we see (and print) the text:

“Attribute charlie: Seven”

but we can search for none of those words, though we can search for the word “prior” above them:

Searching for CSS-generated strings is clearly out of range, but I can imagine extracting and indexing user data- attribute values during DEVONthink’s parse of the HTML.

(Perhaps in the same way that DEVONthink can extract and list the hyperlinks in an HTML document)

FWIW the CSS might be something like:

li[data-charlie] {
    background-color: pink;
}

li[data-charlie]:after {
    content: "Attribute charlie: " attr(data-charlie);
}

chrillek · November 24, 2022, 1:16pm

I doubt that DT can find generated text (i.e. what :before and :after produce in a browser). This stuff is not “there” unless the HTML is rendered.

winter · November 24, 2022, 1:23pm

Well, given:

<p data-charlie="Eight">Seven</p>

it might prove hard to define a sense in which the user data Eight is less there than the Seven,

but I do agree with you, of course, about text which is only in a CSS file.

As an aside, I don’t think we would say that in the structurally identical piece of HTML below, the href link attribute value was less “there” than the label text:

<a href="https://discourse.devontechnologies.com/t/searching-for-html-data-attributes/73609/9">Searching for HTML data attributes?</a>

(And, of course DEVONthink indexes both element content and attribute content in that case, hence the Links panel in the Document tab of the DEVONthink inspector)

Attribute values and element contents are clearly both there,
and both equally indexable at HTML parse time.
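To illustrate the point (this is only a sketch using Python's standard-library html.parser, and says nothing about how DEVONthink's own parser works), a single pass over the markup can collect element text and data- attribute values side by side. The class name TextAndDataAttributes and the helper extract are made up for the example:

```python
from html.parser import HTMLParser


class TextAndDataAttributes(HTMLParser):
    """Collect element text and data-* attribute values in one parse."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.data_attributes = []  # (attribute name, value) pairs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name.startswith("data-") and value is not None:
                self.data_attributes.append((name, value))

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.text_chunks.append(data.strip())


def extract(html_text):
    """Return (text chunks, data-* attribute pairs) found in html_text."""
    parser = TextAndDataAttributes()
    parser.feed(html_text)
    parser.close()
    return parser.text_chunks, parser.data_attributes


print(extract('<p data-charlie="Eight">Seven</p>'))
```

Both the Seven and the Eight arrive through the same parser callbacks, which is the sense in which they are equally available at parse time.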

chrillek · November 24, 2022, 3:09pm

As I said: DT indexes the text of the document, not the HTML elements nor their attributes. Those are just markup. Treating the values of the attributes as if they had any meaning is pointless, in my opinion. If you start with data- attributes, why stop there? Class names could have meaning, too. As might have colors, perhaps.

Which leads to the question of whether one should perhaps also index CSS files …

If someone wants to convey meaning in HTML, they have abundant possibilities. Data attributes are not meant for that purpose (if only because of accessibility issues). To quote the MDN text on data attributes:

Do not store content that should be visible and accessible in data attributes, because assistive technology may not access them. In addition, search crawlers may not index data attributes’ values.

winter · November 24, 2022, 3:46pm

Well, it does, in fact, index the href attributes.

It is just a question of design – which attributes one chooses to index, and which to ignore.

But, for the moment, at least – the way forward is clearly to handle custom user data in Bike files by XPath and script.
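A rough sketch of what such a script might look like over a whole folder, again assuming the files parse as well-formed XML. The *.bike glob pattern, the function name find_attribute_rows, and the folder path in the usage example are assumptions for illustration, not anything taken from Bike or DEVONthink:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_attribute_rows(folder, attr_name, value=None, pattern="*.bike"):
    """Yield (file name, attribute value, row text) for matching elements.

    If value is given, only elements whose attribute equals it are reported.
    """
    for path in sorted(Path(folder).glob(pattern)):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # Limited XPath: every descendant element that has the attribute.
        for elem in root.findall(f".//*[@{attr_name}]"):
            found = elem.get(attr_name)
            if value is not None and found != value:
                continue
            text = " ".join("".join(elem.itertext()).split())
            yield path.name, found, text


if __name__ == "__main__":
    # Hypothetical folder of outline files.
    for hit in find_attribute_rows("outlines", "data-charlie", value="Eight"):
        print(hit)
```

Filtering on the value in Python rather than inside the XPath string avoids quoting problems when a value itself contains quote characters.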

If there were enough users indexing Bike files in DT, then, arguably, it might become worth reviewing the indexing of data- attributes (in addition to the existing indexing of href attributes) by DEVONthink too.

BLUEFROG · November 24, 2022, 4:54pm

First instance AFAIK.

winter · November 24, 2022, 5:59pm

Yes – early days, and Bike row attributes are, for the moment, only accessible through the Bike scripting interface.

My understanding is that that is likely to change in tandem with the planned introduction of stylesheets, which will make the attributes more directly visible – without add-on scripts – in the application itself.

Until then, just scouting around for a good app to use for indexing and searching them. DEVONthink seems a good fit for:

indexing of a cloud of outlines in a given folder, and
viewing and printing both <p> text and custom attributes with custom CSS.
